The term Data Lake probably rings a bell with storage and data analytics practitioners, and it is right up the alley of people working with the Big Data platform Hadoop. However, Data Lake is not a new concept but a storage mechanism that has gained currency over the last two years. Nor is it solely an offshoot of Hadoop-oriented storage, a repository that gathers data from enterprise applications.
But first, let’s define our term. In tech jargon, a Data Lake is an object-based repository that stores large amounts of raw data, whether structured, semi-structured or unstructured, in its native format. As opposed to a data warehouse that stores data in files or folders, a Data Lake is marked by its flat architecture, where each data element is assigned a unique identifier and tagged with extended metadata tags. Many tech bloggers have likened this tagging methodology to that of Twitter.
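To make the flat, tag-based layout concrete, here is a minimal in-memory sketch. The names `LakeObject`, `FlatLake`, `put`, `get` and `find` are hypothetical and do not belong to any real product; the sample payloads are invented for illustration. It only shows the idea of storing raw bytes under a unique identifier with free-form tags instead of a folder path.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw data element kept in its native format."""
    object_id: str
    payload: bytes
    tags: dict = field(default_factory=dict)


class FlatLake:
    """Hypothetical flat store: no folders, only identifiers and tags."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes, **tags) -> str:
        # Every element gets its own unique identifier.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = LakeObject(object_id, payload, dict(tags))
        return object_id

    def get(self, object_id: str) -> LakeObject:
        return self._objects[object_id]

    def find(self, **tags) -> list:
        # Return every object whose tags match all requested key/value pairs.
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in tags.items())
        ]


lake = FlatLake()
# Structured, semi-structured and unstructured data sit side by side.
csv_id = lake.put(b"id,amount\n1,250\n", source="billing", fmt="csv")
json_id = lake.put(b'{"user": "a", "event": "login"}', source="web", fmt="json")
img_id = lake.put(b"\x89PNG...", source="web", fmt="png")

web_objects = lake.find(source="web")
```

Here `lake.find(source="web")` returns the JSON and PNG objects regardless of their format, which is the kind of lookup a folder hierarchy makes awkward.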
What is the major differentiator between Data Lake and Enterprise Data Warehouse (EDW)?
The major differentiator between Data Lake and EDW is that a data lake is fed data in its raw, native form straight from the data sources, without any standardization, remodeling or alteration. This means the raw data can be stored and queried in a number of ways, and insights can be derived from all types of data. In an EDW, by contrast, data has to conform to a predefined schema. Because it has to conform to an enterprise data model, an EDW is capable of answering only a limited number of questions.
Simply put, Data Lake supports all data types, stores all data and provides faster insights. It has a horizontally scalable framework that can process every variety of data.
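The contrast between conforming data at load time and interpreting it at query time is often called schema on write versus schema on read. The sketch below illustrates it in plain Python. The sample events, the `WAREHOUSE_COLUMNS` tuple and both function names are assumptions made up for this example, not part of any EDW or data lake product.

```python
import json

# Hypothetical raw events as they might arrive from different sources.
RAW_EVENTS = [
    '{"customer": "c1", "amount": 120.0, "channel": "web"}',
    '{"customer": "c2", "amount": 80.5, "device": "mobile", "coupon": "X1"}',
    '{"customer": "c1", "amount": 40.0}',
]

# EDW style: a fixed set of columns is enforced when the data is loaded.
WAREHOUSE_COLUMNS = ("customer", "amount")


def load_into_warehouse(raw_lines):
    """Keep only the predefined columns; every other field is dropped."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append(tuple(record[column] for column in WAREHOUSE_COLUMNS))
    return rows


# Data lake style: keep the raw lines and pick fields only when querying.
def query_lake(raw_lines, field_name, default=None):
    """Read one field from every raw record, tolerating missing fields."""
    return [json.loads(line).get(field_name, default) for line in raw_lines]


warehouse_rows = load_into_warehouse(RAW_EVENTS)
coupons = query_lake(RAW_EVENTS, "coupon")
total_amount = sum(query_lake(RAW_EVENTS, "amount", 0.0))
```

In this sketch the warehouse rows can no longer answer a question about coupons, because that field was never part of the schema, while the raw lines still can.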
What led to the rise of Data Lake?
The term Data Lake had its first day in the sun when James Dixon, CTO of Pentaho, a Florida-based company known primarily for its open source business analytics suite, first mentioned it in a now-famous blog entry. For the interest of our readers, we have reproduced it here: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.”
The analogy hit the mark with most early Hadoop adopters. Dixon later found, in his interactions with industry veterans deploying Data Lake, that it enabled enterprises to make their data more accessible and useful. Most industry insiders believe that Data Lake helps tackle spiraling data volumes and helps gain new business insights by storing large amounts of data in the chosen format, which then makes it easy to process through big data analytics. The trend is also spurred on by cloud computing, which makes it possible for companies to build data lakes at a huge scale.
Why is Data Lake well received?
- Stores a variety of data
- Evolves data management architecture
- Augments the traditional EDW strategy instead of supplanting it
Hadoop and Data Lake are a marriage made in heaven
While Data Lake can certainly be aligned with relational database architectures, it has gained popularity with Hadoop primarily because Hadoop is an open source platform and, according to Hadoop adopters, provides a less expensive repository for analytics data. Another bonus is that a Hadoop Data Lake architecture can be used to complement a data warehouse rather than supplant it entirely. Moreover, information in a Hadoop Data Lake can be analyzed in its raw format, and data can be extracted and processed through MapReduce, Spark and other data processing frameworks.
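As a rough illustration of how raw data is processed in the MapReduce style, the sketch below runs a map phase and a reduce phase over a few raw log lines. It is plain Python and does not use the Hadoop or Spark APIs; the log format, the sample lines and the function names are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical raw log lines, stored as-is, including one that does not parse.
RAW_LOG_LINES = [
    "2016-01-01 web login",
    "2016-01-01 mobile login",
    "2016-01-02 web purchase",
    "malformed",
    "2016-01-02 web login",
]


def map_phase(line):
    """Emit a (channel, 1) pair per valid line; skip lines that do not parse."""
    parts = line.split()
    if len(parts) == 3:
        yield parts[1], 1


def reduce_phase(pairs):
    """Sum the emitted values for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


pairs = (pair for line in RAW_LOG_LINES for pair in map_phase(line))
events_per_channel = reduce_phase(pairs)
```

The structure is applied only while reading: the malformed line stays in the raw data and is simply skipped by this particular job. In a real cluster the map and reduce phases would run in parallel across many nodes.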
Data Lake as a Service in the Indian ecosystem
One of the biggest Indian companies to employ Data Lake as a Service is HCL. HCL’s data lake offering comprises processing huge volumes of raw data from disparate sources, storing data at low cost and reducing downtime with preventive techniques.
1) HCL’s Data Lake has been applied in the aviation industry for an American aircraft manufacturer, providing preventive maintenance and reliability by finding areas where huge data sets could be optimized and managing them as part of the Operations & Performance Analytics Group.
Another use case is for a leading air fleet manufacturer, where predictive analytics was applied to improve engine health. The challenge was to identify the factors affecting engine health that led to huge losses for the fleet manufacturer. With predictive maintenance analytics, engine health was enhanced by 25%.
2) Another sector where the Data Lake mechanism has been put to work is retail banking, bringing together data from varied sources. Persistent Systems, which has offices across the globe including a branch in Bangalore, provides Data Lake implementations and architecture to various sectors such as retail, banking and healthcare. The following use case is from retail banking, where the lack of agility of a traditional data warehouse and the problem of data silos were overcome by providing a common data model through Data Lake. The Data Lake helped in KYC by providing a complete overview of the customer profile and of spending and saving patterns, and helped in customer segmentation.
The end result:
- Helped optimize where the bank’s resources should be spent
- Built a solution to better understand customer issues which were not being funneled down the channel
Enterprises looking to invest in data lake architecture and implementation should adapt the architecture to their specific industry, and look for a Data Lake that can co-exist alongside the enterprise data warehouse. One of the key takeaways is to define Data Governance capabilities as well.