Saturday 29 December 2018

Roadmap to a successful data lake

A few years ago, a couple of eCommerce organisations asked my opinion on the viability of a data lake in their enterprise architecture for analytical purposes. After careful study the result was 50-50: one organisation had no immediate advantage in investing in a data lake. It would become just another data silo, or even a data junk yard with hard-to-exploit data and no idea of the added value it would bring.
The other, a €1 bn-plus company, had all the reasons in the world to start exploring the possibilities of a repository for semi-structured and unstructured data. But it would take them at least two years to set up a profitable infrastructure. Technology was not the problem: low-cost processing and storage were available, as was the software, mainly open source. They even had no trouble attracting the right technical profiles, as their job offers topped everyone else's in the market. No, the real problem was integrating and exploiting the new data streams in a sensible and managed way. As I am about to embark on a new mission to rethink an analytical infrastructure with a data lake in scope, I can share a few lessons from the past and think ahead to what's coming.

Start from the data and work your way up to the business case

Analyse the Velocity, Variety and Volume of the data against the analytical requirements

Are these dimensions stable and predictable? Then that's probably an indication that your organisation is not yet ready for this investment. But if there is rapid growth in at least one of the three Vs, you had better start planning and designing your data lake.
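As a minimal sketch of that check, assuming you track the three Vs quarterly (the metric names and numbers below are illustrative, not from any real client):

```python
# Hypothetical quarterly measurements for the three Vs (illustrative numbers).
quarterly_vs = {
    "volume_tb":       [40, 44, 49, 55],      # storage volume in TB
    "velocity_evt_s":  [200, 210, 215, 220],  # ingest rate, events/second
    "variety_sources": [12, 12, 13, 21],      # distinct source systems/formats
}

def quarterly_growth(series):
    """Average quarter-over-quarter growth rate of a series."""
    rates = [(b - a) / a for a, b in zip(series, series[1:])]
    return sum(rates) / len(rates)

# Flag any V growing faster than, say, 10% per quarter.
THRESHOLD = 0.10
rapid = {name: round(quarterly_growth(vals), 3)
         for name, vals in quarterly_vs.items()
         if quarterly_growth(vals) > THRESHOLD}
print(rapid)
```

With these sample figures, volume and variety cross the threshold while velocity does not; a single rapidly growing V is already a signal to start planning.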


  •	How much time do we need to close the skills gap and run a Hadoop environment professionally?
  •	What is a realistic timeframe to connect, understand and manage the new semi-structured and unstructured data sources?
  •	Do we put every piece of data in the lake and write off our investments in the classical BI infrastructure, or do we choose a hybrid approach where only new data types fill the lake?
      o	In case of a hybrid approach, do we need to join across the two data stores?
      o	In case of a total replacement of the data warehouse, do we have the proper front-end tools for business users to exploit the data, or do they have to rely on data scientists and data engineers, potentially creating a bottleneck in the process?
  •	How will we process the data? Do we simply dump it and leave it to the data scientists to make sense of it, or do we plan for some form of modelling on the Hadoop platform, creating column families that are flexible enough to cope with new attributes and that will make broader access possible?
  •	Do we have a metadata strategy that can handle the growth, especially from a user-oriented perspective?
  •	Security and governance are far more complex in a data lake than in a data warehouse. What's our take on this issue?
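The hybrid-approach join question above boils down to relating curated warehouse dimensions to raw lake records. A minimal sketch in plain Python, with hypothetical customer and clickstream records (engines such as Spark or Presto apply the same hash-join shape at scale):

```python
# Hypothetical records: a customer dimension from the data warehouse and
# raw clickstream events from the lake, joined on customer_id.
dw_customers = [
    {"customer_id": 1, "segment": "gold"},
    {"customer_id": 2, "segment": "silver"},
]
lake_events = [
    {"customer_id": 1, "page": "/checkout"},
    {"customer_id": 1, "page": "/home"},
    {"customer_id": 3, "page": "/home"},  # unknown in the warehouse
]

# Index the (small) warehouse side, then stream the lake side past it.
by_id = {row["customer_id"]: row for row in dw_customers}
joined = [
    {**by_id[evt["customer_id"]], **evt}
    for evt in lake_events
    if evt["customer_id"] in by_id
]
print(joined)
```

Note the event for an unknown customer silently drops out: deciding what to do with lake records that have no warehouse counterpart is exactly the kind of integration question the hybrid approach forces you to answer.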
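The column-family idea from the processing question can be sketched as nested maps: each record groups its columns into families, so a new attribute lands without a schema migration. The family and attribute names here are purely illustrative:

```python
# Minimal sketch of "column family" style flexibility: each record is a
# nested dict of families, so new attributes land without a schema change.
orders = {}

def put(row_key, family, column, value):
    """Store one cell, creating the row and family on first use."""
    orders.setdefault(row_key, {}).setdefault(family, {})[column] = value

# Initial load with the attributes known today.
put("order:1001", "core", "amount", 129.95)
put("order:1001", "core", "currency", "EUR")

# Months later a new source adds attributes: no migration needed, they
# simply become new columns in an existing or new family.
put("order:1001", "clickstream", "referrer", "newsletter")
put("order:1001", "core", "discount_code", "XMAS18")

print(orders["order:1001"])
```

This is the flexibility wide-column stores such as HBase offer natively; the point is that planning the families up front is what makes the data broadly accessible rather than a dump only data scientists can navigate.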

Check the evolution of your business requirements

It’s no use investing in a data lake when the business ambitions are still at a basic level and concepts like the balanced scorecard are only just popping up in the CEO’s PowerPoints.
Some requirements are very clear on their data needs, but others aren’t. It may take a considerable amount of analysis to surface the data requirements for semi-structured and unstructured data.
And with legislation like the GDPR, some data may be valuable but also very hard to get, as consumers are more and more aware of their position in the data game. That’s why very fine-grained opt-ins are adding complexity to customer data management.

Develop a few winning use cases

“A leader is someone who has followers” is quite applicable in this situation. You are, after all, challenging the status quo, and if there’s one thing I’ve learned in 30 years in analytics and ICT in general, it is that a craftsman is very loyal to his tools. Managing change in the technical department will not be a walk in the park. It may require adding an entirely new team to the department, or at least having some temporary professionals come in to do the dirtiest part of the job and hand over the Hadoop cluster in maintenance mode to the team.

To enable all this, you need a few winning use cases that appeal to the thought leaders in the organisation. Make sure you pick sponsors with clout and the budget to turn PowerPoints into working solutions.

There will certainly be use cases for marketing, finance and operations. Look for the maximum leverage and get funded. And by the way, don’t bother the HR department unless you are working for the armed forces. They always come last in commercial organisations…