Remember the old days when the data warehouse was the only source of facts and could answer almost any business question, provided the data were available in the source systems? Today, more and more data is generated by sources beyond our control, where “control” means precooked structures and well-documented, well-governed data objects. For such data, only the data lake can facilitate comprehensive analytics.
To make clear how the architecture of a data lake drives the project approach, we first need to review the three major data warehouse architectures and their project approaches before presenting the new methods needed in a data lake environment.
The Kimball architecture and its project approach
Ralph Kimball’s star schema approach is the most widely used and, as far as I am concerned, the most pragmatic low-threshold approach to data warehousing. Each dimension is constructed with an enterprise view and shared across the appropriate data marts, and each data mart represents a business process. For project managers, this means that an enterprise scan is needed to define the dimensions, followed by a study that scores each data mart on “information value times feasibility” to decide the order of execution.
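The “information value times feasibility” ranking can be illustrated with a small sketch. The 1-to-5 scoring scale, the class and function names, and the three candidate business processes are assumptions made for the example, not part of the Kimball method itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataMartCandidate:
    """One business process that could become a data mart (hypothetical structure)."""
    business_process: str
    information_value: int  # assumed scale: 1 (low) to 5 (high)
    feasibility: int        # assumed scale: 1 (hard) to 5 (easy)

    @property
    def priority(self) -> int:
        # The combination named in the text: information value times feasibility.
        return self.information_value * self.feasibility


def execution_order(candidates):
    """Return the candidates sorted from highest to lowest priority."""
    return sorted(candidates, key=lambda c: c.priority, reverse=True)


# Invented scores, for illustration only.
candidates = [
    DataMartCandidate("order fulfilment", information_value=5, feasibility=2),  # 10
    DataMartCandidate("sales", information_value=4, feasibility=4),             # 16
    DataMartCandidate("procurement", information_value=3, feasibility=5),       # 15
]

for candidate in execution_order(candidates):
    print(candidate.business_process, candidate.priority)
```

In practice the scores would come out of the enterprise scan and the feasibility study; the multiplication simply keeps a high-value but barely feasible data mart from jumping the queue.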
The Linstedt architecture and its project approach
The great advantage of a data vault is its flexibility to adapt to new situations, new data sources and other changes in the data landscape. Like the Kimball method, it focuses on business processes, but it models these in a highly normalised way, using hashes to “freeze” temporal links between objects and their attributes. What this means for the project approach is obvious: we postpone the materialisation of a queryable schema until we are sure about the data persistence. In many of the projects we managed, a seamless transition from a data vault to a star schema was made. For project managers, this means a heavy focus on the business process, a flexible way of representing all the processes, and delivering queryable data whenever the business expresses the need for it.
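To give an idea of how hashes tie objects to their attributes over time, here is a minimal sketch of a hub and a satellite. The choice of SHA-256, the `||` delimiter, the key normalisation and all names (`hub_customer`, `sat_customer`, `load_customer`) are assumptions for illustration; real data vault implementations make their own choices here.

```python
import hashlib

DELIMITER = "||"  # assumed separator between key parts and attributes


def hash_key(*business_key_parts: str) -> str:
    """Deterministic key for a business object, independent of case and padding."""
    normalised = DELIMITER.join(part.strip().upper() for part in business_key_parts)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def hash_diff(attributes: dict) -> str:
    """Fingerprint of the descriptive attributes, used to detect changes."""
    payload = DELIMITER.join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


hub_customer: dict = {}  # hash key -> business key
sat_customer: list = []  # attribute history, one row per detected change


def load_customer(customer_id: str, attributes: dict, load_date: str) -> None:
    hk = hash_key(customer_id)
    hub_customer.setdefault(hk, customer_id.strip().upper())
    hd = hash_diff(attributes)
    # Look up the most recent satellite row for this object, if any.
    latest = next((row for row in reversed(sat_customer) if row["hub_hash_key"] == hk), None)
    if latest is None or latest["hash_diff"] != hd:
        sat_customer.append(
            {"hub_hash_key": hk, "load_date": load_date, "hash_diff": hd, **attributes}
        )


load_customer("C-001", {"city": "Ghent", "segment": "retail"}, "2024-01-01")
load_customer("c-001 ", {"city": "Ghent", "segment": "retail"}, "2024-01-02")   # unchanged, no new row
load_customer("C-001", {"city": "Antwerp", "segment": "retail"}, "2024-01-03")  # changed, new row

print(len(hub_customer), len(sat_customer))  # prints "1 2"
```

Because the history is kept per object and per load date, a queryable star schema can be derived later, once the need for it is expressed.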
The Corporate Information Factory architecture from Inmon and its project approach
The Inmon approach is something completely different from the previous methods. Since the early 1990s, Inmon has made his case for a corporate information factory (CIF) that would take every data source in scope and build a target model in third normal form (3NF); only once this Herculean task was completed was it finally time to deliver. In his method, functional data marts provide extracts from the CIF: think of an HR data mart, a marketing data mart, a finance data mart, etc. Needless to say, this can only work in very stable environments where external factors don’t influence the approach to analytics too much. In all the projects my colleagues and I have been involved in, this was the Never-ending Project. Please don’t go there. And if, by any chance, there is a business case for this approach, allow for sufficient time and resources. You will need them.
A data lake project is a completely different story from the previous three: no more up-front analysis of concepts, objects, entities and the attributes that contribute to these concepts before building the data stores.
In a nutshell, a data lake project is about finding cheap and simple storage such as S3 on Amazon Web Services or ADLS on Azure, making sure the ingestion pipelines are in place to receive all sorts of data and, once these data are in place, making sure they are ready for exploitation. For project managers, this means a totally different project management flow. Contrary to the three previous architectures, there is no syncing between business and tech: after a high-level business analysis, the technical track provides the data storage, data access and data cataloguing that make the data exploitable for the business.
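The technical track described above (store first, catalogue, then exploit) might look roughly like the following sketch. A local temporary directory stands in for object storage such as S3 or ADLS, and the folder layout, the catalogue fields and the function name `ingest_raw` are all assumptions made for the example.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, filename: str,
               payload: bytes, catalogue: list) -> Path:
    """Store the payload untouched and record where it landed."""
    ingested_at = datetime.now(timezone.utc)
    # Assumed layout: raw/<source>/<year>/<month>/<day>/<filename>
    target_dir = lake_root / "raw" / source / ingested_at.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no modelling or transformation before storage
    catalogue.append({
        "source": source,
        "path": str(target.relative_to(lake_root)),
        "format": target.suffix.lstrip("."),
        "size_bytes": len(payload),
        "ingested_at": ingested_at.isoformat(),
    })
    return target


catalogue: list = []
lake_root = Path(tempfile.mkdtemp(prefix="lake_"))  # stand-in for a bucket or container

clicks = json.dumps([{"page": "/home"}]).encode("utf-8")
readings = b"sensor_id,value\n42,19.5\n"
ingest_raw(lake_root, "webshop", "clicks.json", clicks, catalogue)
ingest_raw(lake_root, "sensors", "readings.csv", readings, catalogue)

for entry in catalogue:
    print(entry["source"], entry["format"], entry["path"])
```

Note that nothing about the structure of the data is decided at ingestion time; the catalogue entry is what makes the files findable when the business wants to exploit them.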