Wednesday, 9 June 2021

Managing a Data Lake Project Part I: A Data Lake and its Capabilities

A data lake can provide us with the technology to cope with data that arrives in many formats and in massive volumes, too fast and too diverse for a classic data pipeline feeding a data warehouse. Because the data warehouse is optimised for the analysis of structured data, an inflow of unstructured strings, entire documents, JSONs with n levels of nesting, binaries, etc. is simply too much for it.

A data lake is an environment that manages any type of data from any type of source or process in a way that is transparent to the business. In tandem with a data catalogue, a lake provides data governance and facilitates data wrangling, trusted analytics and self-service analytics, to name a few capabilities.

If we zoom in on these capabilities, we can list the basic requirements for a minimum viable product:

  • Automated discovery, cataloguing and classification of ingested data;
  • Collaborative options for evaluating the ingested data;
  • Governance of quality, reliability, security and privacy aspects, as well as lifecycle management;
  • Facilities for data preparation, both in analytical projects and for unsupervised, spontaneous self-service analytics;
  • An intuitive search and discovery platform for business end users;
  • Archiving of data where and when necessary.


Generic data processing map
Data arises from events that drive business processes, as well as from outside events that may become part of those processes

Some vendors use the term “data marketplace” to stress the self-service aspects of a data lake. Whether that positioning works depends on the analytical maturity of the organisation. If introduced too early, it may provide further substantiation for the claim that:

“Analytics is a process of ingesting, transforming and preparing data for publication and analysis, only to end up in Excel sheets used as “proof” for a management hypothesis”.

What makes a data lake ready for use?

Metadata: data describing the data in the lake: its provenance, the data format(s), the business and technical definitions,…;
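To make this concrete, a minimal catalogue metadata record could be sketched as a simple data class. All field names here are invented for illustration and not tied to any particular catalogue product:

```python
from dataclasses import dataclass

@dataclass
class MetadataRecord:
    """Describes one dataset ingested into the lake (hypothetical fields)."""
    dataset_id: str
    provenance: str            # source system or process the data came from
    formats: list              # e.g. ["csv", "json"]
    business_definition: str   # what the data means to the business
    technical_definition: str  # schema, encoding, constraints

record = MetadataRecord(
    dataset_id="sales-2021-06",
    provenance="ERP export",
    formats=["csv"],
    business_definition="Daily sales per store",
    technical_definition="UTF-8 CSV, one row per transaction",
)
```

In practice such records would live in the data catalogue and be filled partly by automated discovery, partly by data stewards.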

Governance: business and IT control over the meaning, application and quality of data, as well as information security and compliance with data privacy regulation;

Cataloguing: whether by machine learning or by predefined categories and rule engines, data is sorted and ordered into categories that are meaningful to the business;
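The rule-engine flavour of cataloguing can be as simple as a cascade of checks on the file name and a content sample. The rules and category names below are invented for illustration:

```python
def categorise(filename: str, content_sample: str) -> str:
    """Assign an ingested file to a category using simple, ordered rules."""
    # Rule 1: JSON payloads are treated as semi-structured data
    if filename.endswith(".json") or content_sample.lstrip().startswith("{"):
        return "semi-structured"
    # Rule 2: keyword match on the file name maps to a business domain
    if "invoice" in filename.lower():
        return "finance"
    # Rule 3: delimiter-separated files are tabular
    if filename.endswith((".csv", ".tsv")):
        return "tabular"
    # Fallback: flag for human review
    return "uncategorised"
```

For example, `categorise("Invoice_2021.csv", "id,amount")` lands in the "finance" category because the keyword rule fires before the tabular rule.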

Structuring: data gains meaning when relationships with other concepts are modelled in hierarchies, taxonomies and ontologies;

Tagging: both governed and ungoverned tags (i.e. user tags) dramatically improve the usability of the ingested data. If these tags are evaluated on their practical use by the user community, they become part of a continuous quality-improvement process;
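One way to evaluate ungoverned tags on practical use is to count how often the user community applies them and promote the popular ones to governed tags. The sample tags and the promotion threshold below are arbitrary illustrations:

```python
from collections import Counter

# Tags applied by end users across datasets (hypothetical sample)
user_tags = ["gdpr", "sales", "gdpr", "draft", "gdpr", "sales"]

usage = Counter(user_tags)

# Promote frequently used personal tags to governed tags
PROMOTION_THRESHOLD = 3
promoted = [tag for tag, count in usage.items() if count >= PROMOTION_THRESHOLD]
print(promoted)  # prints ['gdpr']
```

A real catalogue would feed the promoted list into a steward's review queue rather than promote tags automatically.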

Hierarchies: as with tagging, there may be both governed and personal hierarchies in use;

Taxonomies: systematic hierarchies, based on scientific methods;

Ontologies: a set of concepts and categories in a subject area or data domain, showing their properties and the relations between them, to model the way the organisation sees the world.
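A toy ontology fragment can be sketched as concepts with typed relations. The concepts and relation names below are invented; a production system would use a dedicated ontology language rather than plain dictionaries:

```python
# Concepts mapped to their typed relations (all names hypothetical)
ontology = {
    "Customer": {"is_a": "Party", "places": "Order"},
    "Supplier": {"is_a": "Party", "fulfils": "Order"},
    "Order":    {"is_a": "BusinessDocument", "contains": "OrderLine"},
}

def related(concept: str, relation: str):
    """Return the concept reached from `concept` via `relation`, or None."""
    return ontology.get(concept, {}).get(relation)
```

For example, `related("Customer", "is_a")` yields `"Party"`, expressing that the organisation sees customers and suppliers as two kinds of party.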

