
Monday, 28 June 2021

Managing a Data Lake Project Part III: Architecture Drives the Project Method

Remember the old days when the data warehouse was the only source of the facts and answered almost any business question, provided the data were available in the source systems? Today, more and more data is generated by sources beyond our control, "control" in the sense of precooked structures and well-documented, well-governed data objects. And only the data lake can facilitate comprehensive analytics.

To make clear how the architecture of a data lake drives the project approach, we first review the three major data warehouse architectures and their project approaches before presenting the new methods a data lake environment needs.

 The Kimball architecture and its project approach


Ralph Kimball's star schema approach is the most widely used and, as far as I am concerned, the most pragmatic low-threshold approach to data warehousing. Each dimension is constructed with an enterprise view and shared across the appropriate data marts, and each data mart represents a business process. For project managers, this means an enterprise-wide scan is needed to define the dimensions, followed by a study of "information value times feasibility" to pick the order of execution.
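As a minimal sketch of the idea, using an in-memory SQLite database: two fact tables (two business processes) share one conformed date dimension, which is what allows the marts to be compared side by side. All table and column names are illustrative.

```python
import sqlite3

# Two data marts (sales and returns) sharing one conformed date dimension.
# All table and column names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT, month TEXT);
CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date, amount REAL);
CREATE TABLE fact_returns (date_key INTEGER REFERENCES dim_date, amount REAL);
INSERT INTO dim_date VALUES (20210601, '2021-06-01', '2021-06');
INSERT INTO fact_sales VALUES (20210601, 100.0), (20210601, 250.0);
INSERT INTO fact_returns VALUES (20210601, 40.0);
""")

# Because both marts share the same dimension, their measures line up
# on identical month labels ("drilling across" business processes).
row = con.execute("""
    SELECT d.month,
           (SELECT SUM(amount) FROM fact_sales s WHERE s.date_key = d.date_key),
           (SELECT SUM(amount) FROM fact_returns r WHERE r.date_key = d.date_key)
    FROM dim_date d
""").fetchone()
print(row)  # ('2021-06', 350.0, 40.0)
```

Sharing the dimension rather than copying it per mart is what makes the enterprise scan pay off: every new mart reuses the keys already in place.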

The Lindstedt architecture and its project approach


The great advantage of a data vault is its flexibility to adapt to new situations, new data sources and other changes in the data landscape. Like the Kimball method, it focuses on business processes, but models them in a highly normalised way, using hashes to "freeze" temporal links between objects and their attributes. What this means for the project approach is obvious: we postpone the materialisation of a queryable schema until we are sure about the data persistence. In many of the projects we managed, a seamless transition from data vault to star schema was made. For project managers, this means a heavy focus on the business process, a flexible way of representing all the processes, and delivering queryable data whenever the business expresses the need for it.
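The hashing idea can be sketched in a few lines; the key derivation and record layout below are illustrative, not a prescription of any particular data vault standard:

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic hub/link hash key from business key(s).
    Normalising first keeps the same key identical across loads."""
    normalised = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

# A hub row carries the hash of a business key; a link row hashes the
# combination of connected hubs, "freezing" the relationship.
hub_customer = hash_key("CUST-001")
hub_product = hash_key("PROD-42")
link_order = hash_key("CUST-001", "PROD-42")

# Satellite rows attach attributes with a load timestamp, so history
# accumulates without reworking the model. Illustrative record:
satellite = {
    "hash_key": hub_customer,
    "load_dts": datetime.now(timezone.utc).isoformat(),
    "name": "Acme NV",
}
```

Because keys are derived from the business keys themselves, new sources can be loaded in parallel without surrogate-key lookups, which is part of what makes the vault so adaptable.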


The Corporate Information Factory architecture from Inmon and its project approach



The Inmon approach is completely different from the previous methods. Since the early 1990s, Inmon has made his case for a corporate information factory (CIF) that would take every data source in scope and build a target model in third normal form (3NF); only once this Herculean task was completed would it finally be time to deliver. In his method, functional data marts provide extracts from the CIF: think of an HR data mart, a marketing data mart, a finance data mart, etc. Needless to say, this can only work in very stable environments where external factors don't influence the approach to analytics too much. In all the projects my colleagues and I have been involved in, this was the never-ending project. Please don't go there. And if, by any chance, there is a business case for this approach, allow for sufficient time and resources. You will need them.


The data lake architecture and its project approach

A data lake project is a completely different story from the previous three: no more up-front analysis of concepts, objects, entities and attributes that contribute to these concepts before building the data stores.

In a nutshell, a data lake project is about finding cheap and simple storage like S3 on Amazon Web Services or ADLS on Azure, making sure the ingestion data pipelines are in place to receive all sorts of data and, once these data have landed, making sure they are ready for exploitation. For project managers, this means a totally different project management flow. Contrary to the three previous architectures, there is no constant synching between business and tech: after a high-level business analysis, the technical track provides data storage, data access and data cataloguing to make the lake exploitable for the business.
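The ingestion side can be sketched as follows, with a local folder standing in for cheap object storage; the path layout and catalogue fields are illustrative, not any specific product's conventions:

```python
import json
import pathlib
import tempfile
from datetime import date

def ingest(base: pathlib.Path, source: str, payload: bytes, catalog: list) -> pathlib.Path:
    """Land a raw payload as-is in a partitioned raw zone and register it."""
    partition = base / "raw" / source / date.today().isoformat()
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / f"part-{len(catalog):05d}.json"
    target.write_bytes(payload)                       # schema-on-read: store as received
    catalog.append({"source": source, "path": str(target),
                    "bytes": len(payload)})           # minimal catalogue entry
    return target

catalog: list = []
base = pathlib.Path(tempfile.mkdtemp())               # stand-in for S3/ADLS
ingest(base, "clickstream", json.dumps({"page": "/home"}).encode(), catalog)
print(catalog[0]["source"])  # clickstream
```

The point is the order of operations: the payload is stored before anyone decides on its structure, and the catalogue entry is what keeps it findable for later exploitation.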


Wednesday, 16 June 2021

Managing a Data Lake Project Part II: A Compelling Business Case for a Governed Lake

 

In Part I, A Data Lake and its Capabilities, we already hinted at a business case; in this blog we make it a little more explicit.




A recap from Part I: the data lake capabilities

The business case for a data lake has many aspects; some of them present sufficient rationale on their own, but that depends of course on the actual situation and context of your organisation. Below I mention about eleven rationales, but feel free to add yours in the comments.

 We are mixing on-premise data with Cloud based systems which causes new silos

The Cloud providers deliver software for easily switching your on-premise applications and databases to Cloud versions. But in some cases this can't be done in one fell swoop:

  • Some applications require refactoring before moving them to the Cloud;
  • Some are under such strict information security constraints that even the best Cloud security can't be relied on. I know of a retailer who keeps his excellent logistics system in something close to a bunker!
  • Sometimes the budget or the available skills are insufficient to support a 100% Cloud environment, etc.

This already provides a very compelling business case for a governed data lake: a catalog that manages lineage and meaning will make the transition smoother and safer.

Master data is a real pain in siloed data storage, as is governance...

A governed data lake can improve master data processes by involving the end users in intuitively evaluating what's in the data store. By using both predefined data quality rules and machine learning to detect anomalies and implicit relationships in the data, as well as defining the golden record for objects like CUSTOMER, PRODUCT or REGION, the data lake can unlock data even in technical and physical silos.
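As a simple illustration, a golden-record survivorship rule can be as basic as "the most recent non-empty value per attribute wins"; the rule, field names and sample records below are illustrative:

```python
# Merge CUSTOMER rows from different silos into one golden record.
# Survivorship rule (illustrative): newest non-empty value per field wins.
def golden_record(rows: list) -> dict:
    golden = {}
    for row in sorted(rows, key=lambda r: r["updated"]):  # oldest first
        for field, value in row.items():
            if field != "updated" and value:              # newer non-empty overwrites
                golden[field] = value
    return golden

crm = {"id": "C1", "email": "a@example.com", "phone": "", "updated": "2021-01-10"}
webshop = {"id": "C1", "email": "", "phone": "+32 499 00 00 00", "updated": "2021-05-02"}
print(golden_record([crm, webshop]))  # email kept from CRM, phone from webshop
```

In practice the rule would differ per attribute (trusted source per field, data quality scores), but the shape of the process is the same.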

We now deal with new data processing and storage technologies other than ETL and relational databases: NoSQL, Hadoop, Spark or Kafka to name a few

NoSQL has many advantages for certain purposes but from a governance point of view it is a nightmare: any data format, any level of nesting and any undocumented business process can be captured by a NoSQL database.

Streaming (unstructured) data is unfit for a classical ETL process, which supports structured data analysis, so we need to combine the flexibility of a data lake ingestion process with the governance capabilities of a data catalogue, or else we will end up with a data swamp.

We don't have the time, nor the resources to analyse up front what data are useful for analysis and what data are not

There is a shortage of experienced data scientists. Initiatives like applications to support data citizens may soften the pain here and there, but let's face it: most organisations lack the capabilities for continuous sandboxing to discover what data, in what form, can be made meaningful. It's easier to accept all data indiscriminately into the data lake and let the catalogue do some of the heavy lifting.

We need to scale horizontally to cope with massive, unpredictable bursts of data

Larger online retailers, event organisations, government e-services and other public facing organisations can use the data lake as a buffer for ingesting massive amounts of data and sort out its value in a later stage.  

We need to make a rapid and intuitive connection between business concepts and data that contribute, alter, define or challenge these concepts

This has been my mission for about three decades: to bridge the gap between business and IT, and as far as "classical" architectures go, this craft was humanly possible. But in the world of NoSQL, Hadoop and graph databases, it would be an immense task if not supported by a data catalogue.

Consequently, we need to facilitate self-service data wrangling, data integration and data analysis for the business users

A governed data lake ensures trust in the data and in what the business can and can't do with it. This can speed up data literacy in the organisation by an order of magnitude.

We need to get better insight into the value and the impact of the data we create, collect and store

Reuse of well-catalogued data will enable this: end users will contribute to the evaluation of data and automated meta-analysis of data in analytics will reinforce the use of the best data available in the lake. Data lifecycle management becomes possible in a diverse data environment.

We need to avoid fines like those stipulated in the EU's GDPR, which can amount to up to 4% of annual turnover!

Data privacy regulations require functionality that supports "security by design", which a governed data lake delivers. Data pseudonymisation, obfuscation or anonymisation come in handy when these functions are linked to security roles and user rights.
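As an illustration of how pseudonymisation can be linked to roles, here is a minimal sketch; the role name, record fields and key handling are illustrative only (a real key would live in a vault, not in code):

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # illustrative key; keep real keys in a secrets vault

def pseudonymise(value: str) -> str:
    """Keyed hash: a stable token usable for joins, yet not reversible
    without the key (unlike a plain unsalted hash)."""
    return hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def masked_view(record: dict, role: str) -> dict:
    """Role-based masking: only a privileged role sees raw identifiers."""
    if role == "dpo":                         # illustrative privileged role
        return record
    return {k: (pseudonymise(v) if k in {"name", "email"} else v)
            for k, v in record.items()}

rec = {"name": "Jan Peeters", "email": "jan@example.be", "basket": 3}
print(masked_view(rec, "analyst")["basket"])  # 3 - measures stay usable
```

Because the token is deterministic, analysts can still count and join on pseudonymised customers without ever seeing who they are.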

We need a clear lineage of the crucial data to comply with stringent laws for publicly listed companies

Sarbanes-Oxley and Basel III are examples of legislation that require accountability at all levels and in all business processes. Data lineage is compulsory in these legal contexts.
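A minimal illustration of what lineage recording buys you: if every transformation registers its inputs and outputs, any reported figure can be traced back to its raw sources. The dataset names are made up:

```python
# Lineage registry: each derived dataset points to its direct inputs.
lineage = {}  # output name -> list of direct input names

def register(output: str, inputs: list) -> None:
    lineage[output] = inputs

def trace(dataset: str) -> set:
    """Walk upstream to every original raw source of a dataset."""
    inputs = lineage.get(dataset, [])
    if not inputs:
        return {dataset}                      # no inputs: a raw source
    return set().union(*(trace(i) for i in inputs))

register("finance_report", ["cleansed_ledger"])
register("cleansed_ledger", ["raw_ledger_erp", "raw_fx_rates"])
print(sorted(trace("finance_report")))  # ['raw_fx_rates', 'raw_ledger_erp']
```

A data catalogue does exactly this bookkeeping, only automatically and at scale, which is what makes audit questions answerable in minutes instead of weeks.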

But more than all of the above IT-based arguments, there is one compelling business case for C-level management: speeding up the decision cycle time and the organisation's agility in the market.

Whether this market is a profit generating market or a non-profit market where the outcomes are beneficial to society, speeding up decisions by tightening the integration between concepts and data is the main benefit of a governed data lake.

Anyone who has followed the many COVID-19 kerfuffles, the poor response times and the quality of the responses to the pandemic sees the compelling business case:

  • Rapid meta-analysis of peer reviewed research papers;
  • Social media reporting on local outbreaks and incidents;
  • Second use opportunities from drug repurposing studies;
  • Screening and analysing data from testing, vaccinations, diagnoses, death reports,…

I am sure medical professionals can come up with more rationales for a data lake, but you get the gist of it.

So, why is there a need for a special project management approach to a data lake introduction? That is the theme of Part III.  But first, let me have your comments on this blogpost.






Saturday, 29 May 2021

Managing a Data Lake Project

With the massive growth of online generated data and IoT data, unstructured and semi-structured data now constitute the bulk of the data that needs to be analysed. Whereas a 50-gigabyte data warehouse facilitating analysis of structured data was quite an achievement until recently, that number pales in comparison to the unstructured and semi-structured data avalanche.

Data Avalanche?


Yes, because compared to the steady stream of data from transaction processing systems, we now have to deal with irregular flows and massive bursts of incoming data that need to be processed adequately to give the data meaning.
New data sources emerge beyond social media and IoT, like smart machines and machine learning systems generating new data based on existing sources. Managing various data types and metadata in impressive volumes is just one of the technical aspects that can be solved by technology. The HR, legal and organisational aspects are a level more complex, but these aspects are not in scope of this series of blog posts.
We are adding extra process- and event-based decision support to our management capabilities, and that alone is worth the cost, the trouble and the change management efforts of introducing a data lake.

See you at the Webinar!

On Wednesday 9 June you can tune in to a short webinar hosted by the Great IT Professional. You can still register via this link. The webinar will be followed by a series of articles on how to manage the Data Lake project. Stay tuned!

Bert Brijs Webinar on Managing a Data Lake Project


Saturday, 29 December 2018

Roadmap to a successful data lake


A few years ago, a couple of eCommerce organisations asked my opinion on the viability of a data lake in their enterprise architecture for analytical purposes. After careful study, the result was 50/50: one organisation had no immediate advantage in investing in a data lake. It would become just another data silo, or even a data junk yard with hard-to-exploit data and no idea of the added value it would bring.
The other, a €1bn-plus company, had all the reasons in the world to start exploring the possibilities of a repository for semi-structured and unstructured data. But it would take them at least two years to set up a profitable infrastructure. Technology was not the problem: low-cost processing and storage, as well as the software, mainly open source, were readily available. They even had no problem attracting the right technical profiles, as their job offers topped everyone in the market. No, the real problem was integrating and exploiting the new data streams in a sensible and managed way. As I am about to embark on a new mission to rethink an analytical infrastructure with the data lake in scope, I can share a few lessons from the past and think ahead to what's coming.



Start from the data and work your way up to the business case

Analyse the Velocity, Variety and Volume of the data against the analytical requirements

Are these stable and predictable? Then that's probably an indication that your organisation is not yet ready for this investment. But if at least one of these three Vs shows a rapid growth rate, you had better get planning and designing your data lake.

Planning:

  • How much time do we need to close the skills gap and manage a Hadoop environment professionally?
  • What is a realistic timeframe to connect, understand and manage the new semi-structured and unstructured data sources?

Designing:

  • Do we put every piece of data in the lake and write off our investments in the classical BI infrastructure, or do we choose a hybrid approach where only new data types fill the lake?
    o In case of a hybrid approach, do we need to join across the two data stores?
    o In case of a total replacement of the data warehouse, do we have the proper front-end tools for business users to exploit the data, or do they have to rely on data scientists and data engineers, potentially creating a bottleneck in the process?
  • How will we process the data? Do we simply dump it and leave it all to the data scientists to make sense of it, or do we plan ahead on some form of modelling on the Hadoop platform, creating column families which are flexible enough to cope with new attributes and which will make broader access possible?
  • Do we have a metadata strategy that can handle the growth, especially from a user-oriented perspective?
  • Security and governance are far more complex in a data lake than in a data warehouse. What's our take on this issue?


Check the evolution of your business requirements

It's no use investing in a data lake when the business ambitions are at a basic level and things like a balanced scorecard are just popping up in the CEO's PowerPoints.
Some requirements are very clear on their data needs, but others aren’t. It may take a considerable amount of analysis to surface the data requirements for semi-structured and unstructured data.
And with legislation like the GDPR, some data may be valuable but also very hard to get as the consumer is more and more aware of his position in the data game. That’s why very fine-grained opt-ins are adding complexity to customer data management.

Develop a few winning use cases


"A leader is someone who has followers" is quite applicable in this situation. You are, after all, challenging the status quo, and if there's one thing I've learned in 30 years in analytics and ICT in general, it's that a craftsman is very loyal to his tools. Managing change in the technical department will not be a walk in the park. It may require adding an entirely new team to the department, or at least having some temporary professionals come in to do the dirtiest part of the job and hand over the Hadoop cluster in maintenance mode to the team.

To enable all this, you need a few winning use cases that appeal to the thought leaders in the organisation. Make sure you pick sponsors with clout and the budget to turn PowerPoints into working solutions.

There certainly will be use cases for marketing, finance and operations. Look for the maximum leverage and get funded. And by the way, don’t bother the HR department unless you are working for the armed forces. They always come last in commercial organisations…