We are living in the age of digital transformation, when voluminous data is being created in diverse forms and shapes, and every business is trying to derive more value from the data available to it.
One focus area in data modernization is the addition of data lakes to the overall data architecture.
So, what is a Data Lake?
A data lake is a collection of data, not a platform for data. Data lakes are usually managed on Hadoop, less often on an RDBMS. A common myth states that data lakes require open-source Apache Hadoop or a vendor distribution of Hadoop. It is true that the majority of data lake implementations are on Hadoop, and these are called Hadoop-based data lakes. However, a few data lakes are deployed atop an RDBMS, and these are called relational data lakes.
Salient features of data lakes
- Handles large volumes of diverse (structured, semi-structured and unstructured) data
- Mostly detailed source data
- Raw material for discovering entities and facts
- Data prep on demand
- Data repurposed later, as needs arise
- Typically schema on read
- Persists data in its original raw state
- Integrates into multiple enterprise data ecosystems and architectures
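The "schema on read" point above can be sketched in a few lines: raw records are persisted exactly as they arrive, and each consumer applies only the structure it needs at read time. A minimal illustration in Python; the field names and sample records are hypothetical:

```python
import json

# Raw events are persisted as-is in the lake -- no upfront schema.
raw_events = [
    '{"user": "u1", "amount": "42.50", "extra": {"channel": "web"}}',
    '{"user": "u2", "amount": "7", "note": "promo"}',
]

def read_with_schema(lines, schema):
    """Schema on read: project and cast fields per `schema`
    ({field: cast_fn}) only when the data is consumed."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}

# Two consumers can read the same raw data with different schemas.
payments_schema = {"user": str, "amount": float}
rows = list(read_with_schema(raw_events, payments_schema))
# rows[0] == {"user": "u1", "amount": 42.5}
```

Note that fields a given consumer does not ask for (such as `extra` or `note`) stay untouched in the raw store and remain available for later repurposing.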
Although data lakes offer many potential benefits, my focus here is on the key barriers to adopting them and how to overcome them.
Barriers to Data Lakes and how to overcome them
a) Data lake design:
In most scenarios, data warehouse architects design the data lake and get carried away with traditional approaches and principles of data warehouse design. A data lake is not a data warehouse, and a zone is not heavily structured the way a subject area or a dimension is.
Expect a few zones, and within each zone, data that is still in raw format or only slightly standardized. Typical zones are data landing, data staging, data domains (for example, HR or customer data), departmental domains (for example, data used by marketers), analytics archives and analytics sandboxes. Once the zones are decided, design the data flow for moving data from one zone to another. Expect to revisit how data is organized in your data lake. When data needs restructuring, it should leave the lake and go to a more structured environment, such as a data warehouse or mart. One of the functions of the lake is to feed other databases.
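The zone design and data flow described above can be sketched as a directory layout plus a promotion step. This is a minimal illustration using Python's standard library; the zone names follow the ones listed above, and the file name is hypothetical:

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical zone layout, following the zones described in the text.
ZONES = ["landing", "staging", "domains/customer", "dept/marketing",
         "archive", "sandbox"]

def build_lake(root):
    """Create the zone directories under the lake root."""
    root = Path(root)
    for zone in ZONES:
        (root / zone).mkdir(parents=True, exist_ok=True)
    return root

def promote(root, filename, src_zone, dst_zone):
    """Move a data set one step along the designed zone-to-zone flow."""
    src = Path(root) / src_zone / filename
    dst = Path(root) / dst_zone / filename
    shutil.move(str(src), str(dst))
    return dst

lake = build_lake(tempfile.mkdtemp())
(lake / "landing" / "orders.csv").write_text("id,amount\n1,9.99\n")
promote(lake, "orders.csv", "landing", "staging")
```

In a real lake the "promotion" would be a governed pipeline rather than a file move, but the shape is the same: data enters in landing and flows zone by zone toward more curated forms.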
b) Data governance:
When a data lake is not managed and governed properly, it deteriorates into a data swamp: a disorganized data store that is nearly impossible to navigate, trust and leverage for organizational benefit.
This risk can be mitigated by bringing in proper collaborative data governance, curation and stewardship.
Data Governance: Data governance is usually enforced via people and processes. From a people perspective, data governance takes the form of a board or committee with a mix of data management professionals (who create enterprise standards for data) and business managers (who serve as data owners, stewards and curators with a focus on compliance). These people collaborate to establish and enforce policies that ensure data is compliant, secured, standardized and trusted.
Implementers (technical teams) of the data lake must work with the enterprise governance board so that the lake and its data comply with the established policies.
Data Stewardship: Data is an asset to the lake, and it should be curated by a data steward who is responsible for driving improvements in it. The best data stewards are business people (non-technical staff) because they can prioritize based on business need and keep data management work aligned with business goals. Priority should be given to metadata, data quality and data lineage.
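The stewardship priorities above (metadata, data quality, data lineage) can be made concrete with a simple catalog entry per data set. This is an illustrative sketch, not a standard schema; all field names and the sample checks are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Minimal metadata a steward might maintain for one data set."""
    name: str
    owner: str                                        # business data owner / steward
    source: str                                       # lineage: where the data came from
    derived_from: list = field(default_factory=list)  # upstream data sets (lineage)
    quality_checks: dict = field(default_factory=dict)

def run_quality_checks(entry, rows):
    """Apply each named check to the rows and record pass/fail."""
    return {name: check(rows) for name, check in entry.quality_checks.items()}

orders = CatalogEntry(
    name="staging/orders",
    owner="jane.doe@corp.example",
    source="erp_extract",
    derived_from=["landing/orders"],
    quality_checks={
        "non_empty": lambda rows: len(rows) > 0,
        "amounts_positive": lambda rows: all(r["amount"] > 0 for r in rows),
    },
)

results = run_quality_checks(orders, [{"amount": 9.99}, {"amount": 4.5}])
# results == {"non_empty": True, "amounts_positive": True}
```

The point is that metadata, lineage (`source`, `derived_from`) and quality live alongside the data set's name and owner, so the steward has one place to prioritize improvements.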
c) Data security:
Cyber attackers are now organized and well equipped with the tools and technology to rapidly extract high-value data assets from enterprises.
Such risks and liabilities can be alleviated by implementing multiple layers of security.
- The data lake needs standard protection in the form of authentication and authorization.
- It is useful to record an audit trail of access by users and tools. Operational metadata can enable such audits.
- Unlike the user-centric and application-centric security mentioned above, a data-centric security layer operates on or near the data to cleanse, block or de-identify sensitive (personally identifiable) or high-value data. That way, if the data is stolen, the thief has nothing to sell or commit a crime with.
d) Availability of technical resources:
There are very few data management professionals available who have prior experience with data lakes and Hadoop. The people who are available tend to command rather high salaries.
For these reasons, organizations should cross-train existing employees in these skills rather than hiring new staff. This strategy works well: it increases the value of employees, and cross-trained employees tend to be more engaged and committed.
As with any emerging technology, it will take time before data lakes reach their full potential. But those who start the journey now, strategically and with a long-term vision, stand to create a competitive edge that will be difficult to erode in the years to come.