Do Data Lakes Live Up to Their Promise?
The data lake concept arrived 10 years ago as the answer to common complaints about information silos and the ever-growing volume and variety of information. It was a way to bring all the data from multiple business applications and data systems together into one centralized place, in whatever form it arrived, without the need for processing or structuring. The data lake promised to realize the dream of fast-tracking structured and unstructured data, such as videos and documents, into a one-stop repository for business insights.
The data lake could hold vast amounts of raw data until needed for any use across the enterprise, even for use cases not yet identified. Unlike a data warehouse, which both stores and processes data, data lakes were designed to save time and provide flexibility: consumers could simply load the data and work with it in its raw form whenever the need arose.
Ten years later, it's time to ask: are data lakes delivering on the promise? Some data professionals are already calling data lakes "dead" or "bad." This article examines the current state of data lakes for the modern content manager and what to look for when deciding whether data lakes are right for you, or whether a newer concept like the data lakehouse is a better answer.
Enterprise Content in the Data Lake Architecture
A data lake isn't a technology product that you acquire; it's an architecture or approach to storing and organizing data. You can have one data lake or several, depending on the organization. It can be in the cloud, on-premises or a hybrid.
Someone asked recently if they should get a data warehouse or a data lake. The two ideas are complementary, not an either-or choice. The data lake holds data in its raw form, while data warehouses store processed, cleansed and structured data.
Content or unstructured data usually comes into the data lake without a model to structure it and without metadata to describe it. The data is not consistent, standardized or trustworthy. For search engines to find the content and for analytics applications to extract insights from it, the architecture may need to supplement the data lake with services that simplify ingestion and tag incoming content with metadata.
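As a rough illustration of what such an ingestion service might do, here is a minimal sketch in Python. The folder layout, the metadata fields and the ingest_document helper are all assumptions made for illustration; a production architecture would more likely rely on a managed ingestion or cataloging service.

```python
# Minimal sketch of metadata tagging at ingestion time. The layout and
# field names are illustrative assumptions, not a standard schema.
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

def ingest_document(source_path: str, lake_root: str) -> dict:
    """Copy a raw document into the lake and write a metadata sidecar."""
    doc_id = str(uuid.uuid4())
    metadata = {
        "doc_id": doc_id,
        "source": source_path,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "content_type": Path(source_path).suffix.lstrip("."),
        # Placeholder tags: in practice these might come from an
        # auto-classification service or manual curation.
        "classification": "unclassified",
        "retention_until": None,
    }
    target_dir = Path(lake_root) / "raw" / doc_id
    target_dir.mkdir(parents=True, exist_ok=True)
    (target_dir / Path(source_path).name).write_bytes(
        Path(source_path).read_bytes()
    )
    (target_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return metadata
```

Even a sidecar file this simple gives search and analytics tools something to query, which raw, untagged content does not.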
Related Article: Don't Be Afraid of the Dark: Bring Dark Data Into the Light
Everyone in the Pool: The ‘Let’s Keep Everything’ Approach to Lifecycle Management
Data lakes let data stream in fast, without requiring a model to profile what it is or how important it is. As a result, data lakes can encourage a “keep everything” approach to managing content throughout its lifecycle. At that point it stops being so much a lake and becomes more of a data dumping ground.
Organizations shouldn't and can’t govern all of their data. They should classify, label and then appropriately handle critical documents and de-clutter the non-critical content from the active space.
Even critical records should not be kept forever as a default retention strategy. If those records are not disposed of at the end of their retention period, the company is exposing itself to real legal risk. How will you or a workflow find the records for disposition if they aren’t tagged? How do you know if you’re exposing private information to users without authorized access if you don’t know what’s there?
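To make the disposition question concrete, here is a minimal sketch of a retention sweep, assuming each record carries a hypothetical retention_until tag written at ingestion time. Untagged records fall through silently, which is exactly the findability problem described above.

```python
# Minimal sketch of a disposition sweep. The "retention_until" field is
# an illustrative assumption, not a standard records-management schema.
from datetime import date

def find_records_due_for_disposition(records: list[dict]) -> list[dict]:
    """Return records whose retention period has expired."""
    due = []
    for record in records:
        retention_until = record.get("retention_until")
        if retention_until is None:
            # Untagged records are invisible to the workflow: they can
            # neither be disposed of nor defended as properly retained.
            continue
        if date.fromisoformat(retention_until) <= date.today():
            due.append(record)
    return due
```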
Don't treat data lakes as a vortex you throw everything into forever and hope for the best. There's so much value locked away in that data, and it's unlikely to be found again once it's a drop in the lake.
Examine the kind of data being ingested into the data lake. Is it something needed only for a limited period of time, e.g., point-in-time readings, or is it something that will tell stories to decision makers? Is the cost of sourcing data from all those diverse places, and of keeping it, worth it?
Related Article: When it Comes to Content Management, Master the Essentials First
Is There Hope for Data Lakes?
So is the data lake concept dead? The data lake market is predicted to grow from $7.9 billion in 2019 to $20.1 billion by 2024, with a shift toward cloud-based data platforms. Adoption is also expected to grow, especially in the banking industry, since pooling data centrally makes it easier to meet regulatory requirements.
To get a return on the investment and the organizational change that come with a leap into the data lake, you'll need to plan carefully:
- Which content should be ingested.
- How to extract its value for the end user (e.g., data requirements for analytics differ greatly from those for machine learning).
- How to profile it for appropriate handling, lineage tracking and findability (a catalog; see the sketch after this list).
- How to train users in the importance and proper application of metadata.
- How it can integrate with other technologies.
- What skill sets developers and architects need to use the data lake.
- Retention and lifecycle management.
- Governance over access to and safe handling of private data.
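As a rough sketch of the profiling and cataloging point above, here is what a minimal catalog entry might capture; the field names are illustrative assumptions rather than any particular catalog product's schema.

```python
# Illustrative sketch of a minimal catalog entry covering profiling,
# lineage and findability. Field names are assumptions, not a standard.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset_name: str        # findability: what consumers search for
    owner: str               # accountability for governance questions
    source_system: str       # lineage: where the data came from
    ingestion_schedule: str  # e.g., "daily", "streaming"
    sensitivity: str         # e.g., "public", "internal", "private"
    retention_policy: str    # ties into lifecycle management
    tags: list[str] = field(default_factory=list)

entry = CatalogEntry(
    dataset_name="customer_support_transcripts",
    owner="support-analytics-team",
    source_system="helpdesk_export",
    ingestion_schedule="daily",
    sensitivity="private",
    retention_policy="7 years after case closure",
    tags=["unstructured", "pii"],
)
```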
Related Article: Data Ingestion Best Practices
Is a Data Lake Right for Me?
Some data lake projects fail because organizations treat data lakes as the answer to all of their data problems. As with any new concept or buzzword, organizations must first analyze their use cases to see if data lakes are right for them. Data lakes were designed to store large volumes of structured and unstructured data. But if most of your data is structured, a database may be a better fit.
If data is used for pre-defined reports and queries, then a data warehouse, where data is packaged and ready for analytics, is the better option. For machine learning experiments, where results are not pre-defined, data lakes make better sense.
A data lakehouse is a newer data management concept that combines the capabilities of data lakes and data warehouses, allowing it to serve both analytics and machine learning use cases. By combining the two concepts, data management teams avoid duplication and reduce technology costs and security headaches.
No matter what you call it, valuable data assets need to be governed, packaged for the consumer experience and meticulously catalogued for your business to realize their true value.
About the Author
Andrea Malick is a Research Director in the Data and Analytics practice at Info-Tech, focused on building best practices knowledge in the Enterprise Information Management domain, with corporate and consulting leadership in content management (ECM) and governance.
Andrea has been launching and leading information management and governance practices for 15 years, in multinational organizations and medium-sized businesses.