In Part I, A Data Lake and its Capabilities, we already hinted at a business case, but in this blog we make it a little more explicit.
A recap from Part I: the data lake capabilities
The business case for a data lake has many aspects; some of them present sufficient rationale on their own, but that depends of course on the actual situation and context of your organisation. Therefore, I mention eleven rationales below, but feel free to add yours in the comments.
We are mixing on-premise data with Cloud-based systems, which creates new silos
Cloud providers deliver software that makes it easy to switch your on-premise applications and databases to Cloud versions. But in some cases this can't be done in one fell swoop:
- Some applications require refactoring before they can move to the Cloud;
- Some are under such strict information security constraints that even the best Cloud security can't be relied on. I know of a retailer who keeps his excellent logistics system in something close to a bunker!
- Sometimes the budget or the available skills are insufficient to support a 100% Cloud environment, and so on.
This already provides a very compelling business case for a governed data lake: a catalogue that manages lineage and meaning will make the transition smoother and safer.
Master data is a real pain in siloed data storage, as is governance...
A governed data lake can improve master data processes by involving end users in intuitively evaluating what's in the data store. By using predefined data quality rules and machine learning to detect anomalies and implicit relationships in the data, and by defining the golden record for objects like CUSTOMER, PRODUCT, REGION,… the data lake can unlock data even in technical and physical silos. The sketch below shows the golden-record idea in miniature.
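To make that concrete, here is a minimal sketch in plain Python. The field names, sources and the "most recent non-null value wins" survivorship rule are all hypothetical; real MDM tooling layers fuzzy matching, ML-based anomaly detection and steward review on top of this core.

```python
# Minimal golden-record sketch: consolidate conflicting CUSTOMER records
# from several silos with a simple survivorship rule. Field names, sources
# and the rule itself are illustrative assumptions, not a product's API.
from datetime import date

silo_records = [
    {"source": "crm",     "updated": date(2021, 3, 1),
     "name": "J. Smith",   "email": "j.smith@example.com", "phone": None},
    {"source": "webshop", "updated": date(2021, 6, 15),
     "name": "John Smith", "email": "j.smith@example.com", "phone": "+32 475 00 00 00"},
    {"source": "billing", "updated": date(2020, 11, 2),
     "name": "SMITH JOHN", "email": None,                  "phone": "+32 475 00 00 00"},
]

def golden_record(records):
    """Most-recent-non-null survivorship: per attribute, keep the latest
    non-empty value across all matched source records."""
    merged = {}
    for attr in ("name", "email", "phone"):
        candidates = [r for r in records if r.get(attr)]
        if candidates:
            best = max(candidates, key=lambda r: r["updated"])
            merged[attr] = best[attr]
    return merged

print(golden_record(silo_records))
# {'name': 'John Smith', 'email': 'j.smith@example.com', 'phone': '+32 475 00 00 00'}
```

The point is that once the silo records are matched, consolidation is a policy decision, and a governed catalogue makes that policy explicit and auditable.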
We now deal with new data processing and storage technologies beyond ETL and relational databases: NoSQL, Hadoop, Spark and Kafka, to name a few
NoSQL has many advantages for certain purposes, but from a governance point of view it is a nightmare: any data format, any level of nesting and any undocumented business process can be captured by a NoSQL database, as the example below illustrates.
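As a small illustration of the problem, consider two hypothetical documents in the same collection (plain Python, made-up fields): their shapes only partially overlap, so a catalogue crawler has to discover the schema by walking every path rather than reading it from a table definition.

```python
# Why schema-on-read is a governance headache: two documents in the same
# hypothetical collection, with different shapes and nesting. A catalogue
# crawler must discover the implicit schema by walking the field paths.
docs = [
    {"id": 1, "customer": "ACME", "orders": [{"sku": "A-1", "qty": 3}]},
    {"id": 2, "cust": {"name": "ACME", "region": "EU"}, "loyalty_pts": 120},
]

def field_paths(doc, prefix=""):
    """Recursively list dotted field paths: the raw material of a catalogue."""
    paths = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            paths |= field_paths(value, f"{prefix}{key}.")
    elif isinstance(doc, list):
        for item in doc:
            paths |= field_paths(item, prefix)
    else:
        paths.add(prefix.rstrip("."))
    return paths

for d in docs:
    print(sorted(field_paths(d)))
# ['customer', 'id', 'orders.qty', 'orders.sku']
# ['cust.name', 'cust.region', 'id', 'loyalty_pts']
```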
Streaming (unstructured) data is unfit for a classical ETL process, which supports structured data analysis, so we need to combine the flexibility of a data lake ingestion process with the governance capabilities of a data catalogue, or else we will end up with a data swamp. A minimal version of that combination is sketched below.
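Here is a hedged sketch of what "govern on ingest" can look like, with illustrative names rather than any specific product's API: each raw event is wrapped with catalogue metadata (source, ingestion time, a cheap schema fingerprint) on arrival, so even schemaless data stays findable and classifiable.

```python
# Sketch of governed streaming ingestion: wrap each raw event with
# catalogue metadata before it lands in the lake. All names here are
# illustrative assumptions, not a specific catalogue's interface.
import hashlib
import json
from datetime import datetime, timezone

def schema_fingerprint(event: dict) -> str:
    """Hash the sorted field names: a cheap way to group events by shape."""
    keys = ",".join(sorted(event))
    return hashlib.sha256(keys.encode()).hexdigest()[:12]

def catalogued(event: dict, source: str) -> dict:
    """Attach the metadata a catalogue needs to find and classify the event later."""
    return {
        "payload": event,
        "meta": {
            "source": source,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "schema_id": schema_fingerprint(event),
        },
    }

raw = {"user": "u42", "action": "checkout", "basket_value": 99.5}
print(json.dumps(catalogued(raw, source="webshop-clickstream"), indent=2))
```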
We don't have the time, nor the resources, to analyse up front what data are useful for analysis and what data are not
There is a shortage of experienced data scientists. Initiatives like applications to support data citizens may soften the pain here and there, but let's face it: most organisations lack the capabilities for continuous sandboxing to discover what data, in what form, can be made meaningful. It's easier to accept all data into the data lake indiscriminately and let the catalogue do some of the heavy lifting.
We need to scale horizontally to cope with massive, unpredictable bursts of data
Larger online retailers, event organisations, government e-services and other public-facing organisations can use the data lake as a buffer for ingesting massive amounts of data and sort out its value at a later stage.
We need to make a rapid and intuitive connection between business concepts and data that contribute, alter, define or challenge these concepts
This has been my mission for about three decades: to bridge the gap between business and IT. As far as “classical” architectures go, this craft was humanly possible, but in the world of NoSQL, Hadoop and graph databases it would be an immense task if not supported by a data catalogue.
Consequently, we need to facilitate self-service data wrangling, data integration and data analysis for the business users
A governed data lake ensures trust in the data, and trust in what the business can and can't do with it. This can speed up data literacy in the organisation by an order of magnitude.
We need to get better insight into the value and the impact of the data we create, collect and store
Reuse of well-catalogued data will enable this: end users will contribute to the evaluation of data, and automated meta-analysis of data in analytics will reinforce the use of the best data available in the lake. Data lifecycle management becomes possible in a diverse data environment.
We need to avoid fines like those stipulated in the EU's GDPR, which can amount to up to 4% of annual turnover!
Data privacy regulations require functionality that supports “security by design”, which a governed data lake delivers. Data pseudonymisation, obfuscation and anonymisation come in handy when these functions are linked to security roles and user rights; a bare-bones version is sketched below.
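To show how small the core mechanism can be, here is a sketch of role-aware pseudonymisation in Python. The role names and key handling are deliberately naive assumptions for illustration; in production the key would live in a secrets manager and the rules would come from the catalogue's security policies.

```python
# Role-aware pseudonymisation sketch, using a keyed HMAC: the same secret
# always maps a value to the same pseudonym (so analysts can still join
# and count), while only a privileged role sees the raw value.
# Role names and key handling are illustrative assumptions.
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-store-in-a-vault"  # naive: use a secrets manager

def pseudonymise(value: str) -> str:
    """Deterministic, keyed pseudonym: stable for joins, not reversible."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def read_field(value: str, role: str) -> str:
    """Privileged roles get the raw value; everyone else gets the pseudonym."""
    return value if role == "dpo" else pseudonymise(value)

email = "j.smith@example.com"
print(read_field(email, role="analyst"))  # stable pseudonym, still joinable
print(read_field(email, role="dpo"))      # raw value for the privileged role
```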
We need a clear lineage of the crucial data to comply with stringent laws for publicly listed companies
Sarbanes-Oxley and Basel III are examples of regulations that require accountability at all levels and in all business processes. Data lineage is compulsory in these legal contexts.
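At its core, lineage is a directed graph over datasets. A toy sketch with made-up dataset names shows how even a few recorded edges let you answer the auditor's question "where does this number come from?":

```python
# Toy lineage ledger: record which dataset was derived from which, then
# answer the audit question by walking upstream. Dataset names are made up.
lineage = {
    "quarterly_report.revenue": ["dwh.sales_agg"],
    "dwh.sales_agg": ["lake.pos_transactions", "lake.webshop_orders"],
}

def upstream(dataset: str, graph: dict) -> set:
    """All ancestors of a dataset, i.e. everything an auditor must inspect."""
    parents = graph.get(dataset, [])
    result = set(parents)
    for parent in parents:
        result |= upstream(parent, graph)
    return result

print(upstream("quarterly_report.revenue", lineage))
# {'dwh.sales_agg', 'lake.pos_transactions', 'lake.webshop_orders'}
# (set order may vary)
```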
But more than all of the above IT-based arguments, there is one compelling business case for C-level management: speeding up the decision cycle and increasing the organisation's agility in the market.
Whether this market is a profit-generating market or a non-profit market where the outcomes are beneficial to society, speeding up decisions by tightening the integration between concepts and data is the main benefit of a governed data lake.
Anyone who has followed the many COVID-19 kerfuffles, the poor response times and the quality of the responses to the pandemic, sees the compelling business case:
- Rapid meta-analysis of peer-reviewed research papers;
- Social media reporting on local outbreaks and incidents;
- Second-use opportunities from drug repurposing studies;
- Screening and analysing data from testing, vaccinations, diagnoses, death reports,…
I am sure
medical professionals can come up with more rationales for a data lake, but you
get the gist of it.
So, why is
there a need for a special project management approach to a data lake introduction?
That is the theme of Part III. But first, let me have your comments on this blog post.