If you work anywhere remotely close to data technologies, you have probably heard about “Data Lakes”. A couple of years ago, when we first heard the term, we pictured petabytes of data in one place and then asked ourselves: how is a data lake different from a data warehouse? Or isn’t a data lake just data warehouse 2.0?
That’s where we started researching, and we found this definition from the person who coined the term ‘Data Lake’. James Dixon, the founder and CTO of Pentaho, describes a data lake this way: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
Analysts and consumers of the traditional data warehouse depended on developers and technologists to cleanse and transform the data, while the new generation of data consumers is more enthusiastic about rolling up its sleeves and diving deep into the lake to uncover hidden facts. In essence, these new-generation data hackers need a big playground where they can play with the data in its native format. With innovations such as readily available compute and storage via the cloud, and with more technically skilled consumers (for example, data scientists), applying the concept of a data lake helps organizations become nimbler and more data oriented.
So the big question is: can a data lake or a data warehouse alone meet the needs of an organization, or do you need both?
Our take is ‘both’, given the capabilities of these two platforms as they stand today. As things evolve, the idea, need and implementation of these platforms will change too. Today’s reality is that an appliance-based data warehouse and a data lake are optimized for different purposes, and the goal is to use each one for what it was designed to do.
Here are some things to consider when having these discussions in your organization:
- Types of data and consumer needs – “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.” In contrast, a data warehouse only stores data that has been modeled and structured. A data warehouse follows a schema-on-write approach, while a data lake follows a schema-on-read approach.
- Storage consideration – Data warehouses typically run on appliances, which are expensive, while a data lake is meant to be built on Hadoop, which is designed to be installed on low-cost commodity hardware.
- Quick and dirty vs. slow and exact – A data warehouse is a highly structured repository, by definition. It’s not technically hard to change the structure, but it can be very time-consuming given all the business processes that are tied to it. A data lake, on the other hand, lacks the structure of a data warehouse, which gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and apps on the fly.
- Security – Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of a data lake) are relatively new. Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake. It should be noted, however, that there’s a significant effort being placed on security right now in the big data industry. It’s not a question of if, but when.
- Cohorts of user base – For a long time, the rallying cry has been BI and analytics for everyone! We’ve built the data warehouse and invited “everyone” to come, but have they come? On average, 20-25% of them have. Is it the same cry for the data lake? Will we build the data lake and invite everyone to come? Not if you’re smart. A data lake, at this point in its maturity, is best suited for data scientists.
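The schema-on-write versus schema-on-read distinction in the first point above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the records, field names (`member_id`, `amount`, `channel`) and helper functions are hypothetical, and plain Python lists stand in for the warehouse table and the lake storage. It does not use any real warehouse or Hadoop API.

```python
import json

# Hypothetical raw events, as they might arrive from a source system.
RAW_EVENTS = [
    '{"member_id": 1, "amount": "120.50", "channel": "web"}',
    '{"member_id": 2, "amount": "80", "notes": "phone follow-up"}',
    '{"member_id": "n/a", "amount": "15"}',
]

# Schema on write: the structure is decided before any data is loaded.
# (Assumed two-column schema, for illustration only.)
WAREHOUSE_SCHEMA = {"member_id": int, "amount": float}


def load_schema_on_write(lines):
    """Shape and validate each record at load time; reject what does not fit."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            row = {col: cast(record[col]) for col, cast in WAREHOUSE_SCHEMA.items()}
        except (KeyError, ValueError, TypeError):
            rejected.append(line)
            continue
        table.append(row)  # fields outside the schema are dropped
    return table, rejected


def store_raw(lines):
    """Schema on read, step 1: keep every record exactly as it arrived."""
    return list(lines)


def read_with_schema(lake, fields):
    """Schema on read, step 2: impose a structure only when a question is asked."""
    for line in lake:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


warehouse_rows, rejected_rows = load_schema_on_write(RAW_EVENTS)
lake = store_raw(RAW_EVENTS)
channel_view = list(read_with_schema(lake, ["member_id", "channel"]))
```

In this sketch the warehouse path keeps only the records that fit the predefined columns and loses the `channel` and `notes` fields, while the lake path keeps everything and lets each consumer choose the fields it cares about at query time, at the cost of handling missing or messy values itself.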
According to many market research reports, data volumes are exploding: more data has been created in the past two years than in the entire previous history of the human race. This data has value, and we need storage and technology to persist years and years of it and to run analytics, visualization and predictions on it. To make this happen, most organizations are leaning towards a hybrid architecture, an evolutionary model with a smooth transition rather than a revolutionary model with disruptions. That will also be the subject of the next installment on this topic: how to connect a data warehouse and a data lake.
Shweta Sinha leads the Data Warehouse and DevOps efforts at Premera Blue Cross. She specializes in creating and scaling data science and data engineering platforms and infrastructure.