A data lake can provide us with the technology to cope with data arriving in formats too diverse, in volumes too massive and at a pace too fast for a classic data pipeline feeding a data warehouse. Because a data warehouse is optimised for the analysis of structured data, an inflow of unstructured strings, entire documents, JSON with n levels of nesting, binaries, etc. is simply too much for it.
A data lake is an environment that manages any type of data from any type of source or process in a way that is transparent to the business. In tandem with a data catalogue, a lake provides data governance and facilitates, among other things, data wrangling, trusted analytical capabilities and self-service analytics.
If we zoom in on these capabilities, we can list the basic requirements for a minimum viable product:
- Automated discovery, cataloguing and classification of ingested data (see the sketch following this list);
- Collaborative options for evaluating the ingested data;
- Governance of quality, reliability, security and privacy aspects, as well as lifecycle management;
- Data preparation support for analytical projects as well as for unsupervised, spontaneous self-service analytics;
- An intuitive search and discovery platform for business end users;
- Archiving of data where and when necessary.
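To make these requirements tangible, here is a minimal sketch of what a single catalogue entry might hold once automated discovery and classification have run. It is product-agnostic Python: the CatalogEntry class, its field names and the toy rule engine are hypothetical illustrations, not the API of any particular data catalogue.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical, product-agnostic sketch of one catalogue record.
@dataclass
class CatalogEntry:
    dataset_id: str                         # stable identifier within the lake
    source_system: str                      # provenance: where the data came from
    data_format: str                        # e.g. "csv", "json", "parquet", "binary"
    business_definition: str                # meaning as agreed with the business
    technical_definition: str               # schema, encoding, constraints
    classification: str = "unclassified"    # filled in by discovery/classification
    quality_checked: bool = False           # governance: quality sign-off
    contains_pii: bool = False              # governance: privacy flag
    retention_until: Optional[datetime] = None     # lifecycle management
    tags: List[str] = field(default_factory=list)  # collaborative evaluation

def classify(entry: CatalogEntry) -> CatalogEntry:
    """Toy rule engine: derive a coarse classification from the declared format."""
    rules = {"csv": "tabular", "parquet": "tabular",
             "json": "semi-structured", "binary": "unstructured"}
    entry.classification = rules.get(entry.data_format, "unclassified")
    return entry

entry = classify(CatalogEntry(
    dataset_id="sales_orders_raw_2024",
    source_system="webshop",
    data_format="json",
    business_definition="Raw customer orders before enrichment",
    technical_definition="UTF-8 JSON, nested line items",
    contains_pii=True,
    retention_until=datetime(2031, 1, 1, tzinfo=timezone.utc),
))
print(entry.classification)  # -> semi-structured
```

In a real lake the classification rules would inspect the content rather than just the declared format, but the shape of the record is the point here: provenance, definitions, governance flags and tags all live alongside the data.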
Some vendors have launched the term “data marketplace” to stress the self-service aspects of a data lake. But whether that positioning works depends on the analytical maturity of the organisation. If it is introduced too early, it may provide further substantiation for the claim that “analytics is a process of ingesting, transforming and preparing data for publication and analysis, only to end up in Excel sheets used as ‘proof’ for a management hypothesis”.
What makes a data lake ready for use?
- Metadata: data describing the data in the lake: its provenance, the data format(s), the business and technical definitions, …;
- Governance: business and IT control over the meaning, application and quality of data, as well as over information security and data privacy regulation;
- Cataloguing: whether by machine learning or by precooked categories and rule engines, data is sorted and ordered into categories that are meaningful for the business;
- Structuring: data increases in meaning when its relationships with other concepts are modelled in hierarchies, taxonomies and ontologies;
- Tagging: both governed and ungoverned tags (i.e. user tags) dramatically improve the usability of the ingested data; if these tags are evaluated on practical use by the user community, they become part of a continuous quality improvement process (a sketch follows this list);
- Hierarchies: as with tagging, both governed and personal hierarchies may be in use;
- Taxonomies: systematic hierarchies, based on scientific methods;
- Ontologies: sets of concepts and categories in a subject area or data domain, showing their properties and the relations between them, modelling the way the organisation sees the world.
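To illustrate the tagging, taxonomy and ontology points above, here is a rough sketch of how these could be represented. Again, all names (GOVERNED_TAGS, promote_tag, TAXONOMY, lineage) are hypothetical, not any vendor's model.

```python
# Hypothetical sketch: governed vs. user tags, plus a toy taxonomy and ontology.
GOVERNED_TAGS = {"customer", "order", "gdpr-relevant"}   # curated by governance
user_tags = {                                            # free-form tags per dataset
    "sales_orders_raw_2024": {"q4-campaign", "customer"},
}

def promote_tag(tag, usage_count, threshold=25):
    """Promote a user tag into the governed set once the user community
    has demonstrated its practical value (continuous quality improvement)."""
    if usage_count >= threshold:
        GOVERNED_TAGS.add(tag)
        return True
    return False

# A taxonomy as child -> parent links: a systematic hierarchy of concepts.
TAXONOMY = {
    "sales_order": "transaction",
    "transaction": "business_event",
    "business_event": None,   # root concept
}

def lineage(concept):
    """Walk up the taxonomy to show where a concept sits in the hierarchy."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = TAXONOMY.get(concept)
    return path

# An ontology goes further than a taxonomy: it adds typed relations
# and properties between concepts, e.g. (subject, relation, object).
ONTOLOGY_RELATIONS = [("customer", "places", "sales_order")]

print(lineage("sales_order"))   # ['sales_order', 'transaction', 'business_event']
print(promote_tag("q4-campaign", usage_count=31))  # True: tag becomes governed
```

The child-to-parent dictionary is the simplest possible hierarchy; real taxonomies and ontologies would typically live in a graph store or an RDF/OWL model, but the principle of modelling how the organisation sees the world is the same.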