Remember the old days when the data warehouse was the single source of facts and answered almost any business question, provided the data were available in the source systems? Today, more and more data is generated by sources beyond our control, where “control” means precooked structures and well-documented, well-governed data objects. And only the data lake can facilitate comprehensive analytics.
To make clear how the architecture of a data lake drives the project approach, we first need to review the three major data warehouse architectures and their project approaches before we present the new methods needed in a data lake environment.
The Kimball architecture and its project approach
Ralph Kimball’s star schema approach is the most used and, as far as I am concerned, the most pragmatic, low-threshold approach to data warehousing. Each dimension is constructed with an enterprise view and shared across the appropriate data marts, and each data mart represents a business process. For project managers, this means that an enterprise scan is needed to define the dimensions, followed by a study that scores each data mart on “information value times feasibility” to pick the order of execution.
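The “information value times feasibility” ranking can be illustrated with a few lines of Python. This is only a sketch: the 1-to-5 scales, the candidate names and the scores are all hypothetical, and a real study would agree the scoring with the business and the technical team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataMartCandidate:
    name: str
    information_value: int  # hypothetical scale: 1 (low) to 5 (high)
    feasibility: int        # hypothetical scale: 1 (hard) to 5 (easy)

    @property
    def priority(self) -> int:
        # The combination named in the text: value times feasibility.
        return self.information_value * self.feasibility


def execution_order(candidates):
    # Highest product first; sorted() is stable, so ties keep their input order.
    return sorted(candidates, key=lambda c: c.priority, reverse=True)


# Illustrative candidates only; names and scores are made up.
candidates = [
    DataMartCandidate("customer complaints", information_value=4, feasibility=2),
    DataMartCandidate("sales orders", information_value=5, feasibility=4),
    DataMartCandidate("inventory", information_value=3, feasibility=5),
]

for candidate in execution_order(candidates):
    print(f"{candidate.name}: {candidate.priority}")
```

Multiplying rather than adding the two scores means a data mart that is valuable but barely feasible, or easy but of little value, drops down the list.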
The Linstedt architecture and its project approach
The great advantage of Dan Linstedt’s data vault is its flexibility to adapt to new situations, new data sources and other changes in the data landscape. Like the Kimball method, it focuses on business processes, but it models these in a highly normalised way, using hashes to “freeze” temporal links between objects and their attributes. What this means for the project approach is obvious: we postpone the materialisation of a queryable schema until we are sure about the data persistence. In many of the projects we managed, a seamless transition from a data vault to a star schema was made. For project managers, this means a heavy focus on the business process, a flexible way of representing all the processes, and delivery of queryable data whenever the business expresses the need for it.
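To make the role of the hashes more concrete, here is a minimal sketch of one common data vault convention: a hub row keyed by a hash of the business key, and satellite rows that carry the same key plus a load date and a hash of the attribute values. The table layout, the column names (`customer_hk`, `hash_diff`), the normalisation rule and the choice of SHA-256 are assumptions for illustration, not a description of the projects mentioned above.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(*business_key_parts: str) -> str:
    """Deterministic key derived from one or more business key parts."""
    # Assumed normalisation: trim and upper-case so "c-001 " and "C-001" match.
    normalised = "||".join(part.strip().upper() for part in business_key_parts)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def hash_diff(attributes: dict) -> str:
    """Fingerprint of the attribute values, used to detect a changed record."""
    payload = "||".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def hub_row(customer_number: str, record_source: str, load_date: datetime) -> dict:
    return {
        "customer_hk": hash_key(customer_number),
        "customer_number": customer_number,
        "load_date": load_date,
        "record_source": record_source,
    }


def satellite_row(customer_number: str, attributes: dict,
                  record_source: str, load_date: datetime) -> dict:
    # Each satellite row is one dated version of the object's attributes.
    return {
        "customer_hk": hash_key(customer_number),
        "load_date": load_date,
        "record_source": record_source,
        "hash_diff": hash_diff(attributes),
        **attributes,
    }


first_load = datetime(2024, 1, 1, tzinfo=timezone.utc)
second_load = datetime(2024, 6, 1, tzinfo=timezone.utc)

hub = hub_row("C-001", "crm", first_load)
sat_v1 = satellite_row("C-001", {"city": "Ghent", "segment": "SME"}, "crm", first_load)
sat_v2 = satellite_row("c-001 ", {"city": "Antwerp", "segment": "SME"}, "crm", second_load)
```

Both satellite rows point at the same hub key, while their differing `hash_diff` values and load dates record that the attributes changed between the two loads. A queryable star schema can be derived from such rows later, which is what allows the materialisation to be postponed.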
The Corporate Information Factory architecture from Inmon and its project approach
The Inmon approach is something completely different from the previous methods. Since the early 1990s, Inmon has made his case for a corporate information factory (CIF) that would take every data source in scope and build a target model in third normal form (3NF); only once this Herculean task was completed was it finally time to deliver. In his method, functional data marts provide extracts from the CIF: think of an HR data mart, a marketing data mart, a finance data mart and so on. Needless to say, this can only work in very stable environments where external factors have little influence on the approach to analytics. In all the projects of this kind that my colleagues and I have been involved in, this was the Never-ending Project. Please don’t go there. And if, by any chance, there is a business case for this approach, allow for sufficient time and resources. You will need them.
A data lake project is a completely different story from the previous three: there is no more up-front analysis of the concepts, objects, entities and attributes that contribute to these concepts before the data stores are built.
In a nutshell, a data lake project is about finding cheap and simple storage such as S3 on Amazon Web Services or ADLS (Azure Data Lake Storage) on Azure, making sure the ingestion pipelines are in place to receive all sorts of data and, once these data have landed, making sure they are ready for exploitation. For project managers, this means a totally different project management flow. Contrary to the three previous architectures, there is no synchronising between business and tech: after a high-level business analysis, the technical track provides the data storage, data access and data cataloguing that make the data exploitable for the business.
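The technical track described above (storage, ingestion, cataloguing) can be sketched in a few lines. To keep the example runnable anywhere, a local directory stands in for object storage such as S3 or ADLS; the folder layout (`raw/<source>/<date>/`), the `catalog.jsonl` file and the function names `ingest` and `find` are hypothetical choices for illustration, not a reference design.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest(lake_root: Path, source: str, filename: str, payload: bytes) -> dict:
    """Land a file as-is in the raw zone and register it in the catalog."""
    now = datetime.now(timezone.utc)
    target_dir = lake_root / "raw" / source / now.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing or modelling at ingestion time

    entry = {
        "source": source,
        "path": target.relative_to(lake_root).as_posix(),
        "size_bytes": len(payload),
        "ingested_at": now.isoformat(),
    }
    # Assumed catalog format: one JSON document per line, append-only.
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as catalog:
        catalog.write(json.dumps(entry) + "\n")
    return entry


def find(lake_root: Path, source: str) -> list:
    """Look up what has been ingested from a given source."""
    catalog_path = lake_root / "catalog.jsonl"
    if not catalog_path.exists():
        return []
    with catalog_path.open(encoding="utf-8") as catalog:
        entries = [json.loads(line) for line in catalog]
    return [entry for entry in entries if entry["source"] == source]


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    ingest(lake, "web_logs", "clicks.json", b'{"page": "/home"}')
    ingest(lake, "crm", "contacts.csv", b"id,name\n1,Ada\n")
    for entry in find(lake, "crm"):
        print(entry["path"], entry["size_bytes"])
```

Note that nothing in `ingest` depends on the structure of the payload: the data are stored first and described in the catalog, and any modelling is left to whoever exploits them later.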