A few years ago, a couple of eCommerce
organisations asked my opinion on the viability of a data lake in their
enterprise architecture for analytical purposes. After careful study, the result
was 50-50: one organisation would gain no immediate advantage from investing in a data
lake. It would become just another data silo, or even a data junkyard with
hard-to-exploit data and no clear idea of the added value it would bring.
The other, a €1 bn-plus company, had all the
reasons in the world to start exploring the possibilities of a repository for
semi-structured and unstructured data. But it would take them at least two
years to set up a profitable infrastructure. Technology was not the obstacle:
low-cost processing and storage were readily available, as was the software,
most of it open source. They had no trouble attracting the right technical
profiles either, as their job offers topped everyone else's in the market. No, the real
problem was integrating and exploiting the new data streams in a sensible and
managed way. As I am about to embark on a new mission to rethink an analytical
infrastructure with the data lake in scope, I can share a few lessons from the
past and think ahead about what’s coming.
Start from the data and work your way up to the business case
Analyse the Velocity, Variability and
Volume of the data to meet the analytical requirements
Are all three stable and predictable? Then that is
probably an indication that your organisation is not yet ready for this
investment. But if at least one of these three
Vs is growing rapidly, you had better start planning and designing your data lake.
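To make "rapid growth rate" a bit more tangible, here is a minimal sketch of how you could screen the three Vs. It assumes you have one measurement per month for each V; the series names, the figures and the 10% threshold are made up for illustration and are not taken from either organisation.

```python
from dataclasses import dataclass


@dataclass
class Series:
    name: str
    values: list[float]  # one observation per period, oldest first


def compound_growth(values: list[float]) -> float:
    """Average per-period growth rate between the first and last observation."""
    if len(values) < 2 or values[0] <= 0:
        raise ValueError("need at least two observations and a positive start value")
    periods = len(values) - 1
    return (values[-1] / values[0]) ** (1 / periods) - 1


def fast_growing(series: list[Series], threshold: float = 0.10) -> list[str]:
    """Names of the series whose per-period growth exceeds the threshold."""
    return [s.name for s in series if compound_growth(s.values) > threshold]


# Made-up monthly figures, for illustration only.
observations = [
    Series("volume_gb", [100, 104, 109, 113]),
    Series("velocity_events_per_s", [50, 70, 98, 137]),
    Series("variability_source_formats", [12, 12, 13, 13]),
]
print(fast_growing(observations))  # ['velocity_events_per_s']
```

In this invented example only velocity grows faster than the threshold, which by the rule of thumb above would already be a reason to start planning.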
Planning:
- How much time do we need to close the skills gap and manage a Hadoop environment professionally?
- What is a realistic timeframe to connect, understand and manage the new semi-structured and unstructured data sources?
Designing:
- Do we put every piece of data in the lake and write off our investments in the classical BI infrastructure, or do we choose a hybrid approach where only new data types fill the lake?
  - In case of a hybrid approach, do we need to join between the two data sources?
  - In case of a total replacement of the data warehouse, do we have the proper front-end tools to let the business users exploit the data, or do they have to rely on data scientists and data engineers, potentially creating a bottleneck in the process?
- How will we process the data? Do we simply dump it and leave it to the data scientists to make sense of it, or do we plan ahead on some form of modelling on the Hadoop platform, creating column families which are flexible enough to cope with new attributes and which will make broader access possible?
- Do we have a metadata strategy that can handle the growth, especially from a user-oriented perspective?
- Security and governance are far more complex in a data lake than in a data warehouse. What’s our take on this issue?
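The point about flexible column families can be illustrated with a small sketch. This is plain Python that mimics the wide-column idea used by stores such as HBase; it is not a real client API, and the class, family and attribute names are hypothetical.

```python
from collections import defaultdict

# Column families are fixed at design time; qualifiers inside them are not.
COLUMN_FAMILIES = {"profile", "clickstream"}  # hypothetical names


class WideRowTable:
    """In-memory stand-in for a wide-column table (not a real HBase client)."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> family -> qualifier -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get_family(self, row_key, family):
        # Read without creating empty entries as a side effect.
        return dict(self.rows.get(row_key, {}).get(family, {}))


table = WideRowTable(COLUMN_FAMILIES)
table.put("customer#42", "profile", "country", "BE")
table.put("customer#42", "clickstream", "last_page", "/checkout")
# A new attribute shows up later: no schema change needed inside the family.
table.put("customer#42", "clickstream", "device", "mobile")
print(table.get_family("customer#42", "clickstream"))
```

The design choice to think about up front is the set of families: they are the stable part that broader audiences can rely on, while the attributes inside them can keep changing.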
Check the evolution of your business
requirements
It’s no use investing in a data lake when
the business ambitions are on a basic level and something like a balanced scorecard
is only just popping up in the CEO's PowerPoints.
Some requirements are very clear on their
data needs, but others aren’t. It may take a considerable amount of analysis to
surface the data requirements for semi-structured and unstructured data.
And with legislation like the GDPR, some
data may be valuable but also very hard to get, as consumers are more and more
aware of their position in the data game. As a result, very fine-grained opt-ins
are adding complexity to customer data management.
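To give an idea of what fine-grained opt-ins mean for the data model, here is a minimal sketch in which consent is stored per customer and per purpose. The purpose labels and customer IDs are invented, and this says nothing about what the legislation actually requires; it only shows why every analytical use of customer data now needs a filter in front of it.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    customer_id: str
    # Purposes the customer has explicitly opted in to (hypothetical labels).
    opt_ins: set[str] = field(default_factory=set)


def usable_for(customers, purpose):
    """IDs of customers whose data may be used for the given purpose."""
    return [c.customer_id for c in customers if purpose in c.opt_ins]


customers = [
    Customer("c1", {"email_marketing", "web_personalisation"}),
    Customer("c2", {"web_personalisation"}),
    Customer("c3"),  # no opt-in at all
]
print(usable_for(customers, "email_marketing"))      # ['c1']
print(usable_for(customers, "web_personalisation"))  # ['c1', 'c2']
```

Each additional purpose shrinks or reshapes the population you are allowed to analyse, which is exactly where the extra complexity comes from.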
Develop a few winning use cases
“A leader is someone who has followers” is
quite applicable in this situation. You are after all challenging the status
quo and if there’s one thing I’ve learned in 30 years in analytics and ICT in
general: a craftsman is very loyal to his tools. Managing change in the
technical department will not be a walk in the park. It may require adding an
entirely new team to the department, or at least having some temporary professionals
come in to do the dirtiest part of the job and then hand over the Hadoop cluster
in maintenance mode to the team.
To enable all this, you need a few winning
use cases that appeal to the thought leaders in the organisation. Make sure you
pick sponsors with clout and the budget to turn PowerPoints into working
solutions.
There certainly will be use cases for
marketing, finance and operations. Look for the maximum leverage and get
funded. And by the way, don’t bother the HR department unless you are working
for the armed forces. They always come last in commercial organisations…