Dear Ralph,
I know you’re a busy man, so I won’t take too much of your time with this post. I look forward to meeting you on June 10 at 't Spant in Bussum for an in-depth session on Big Data and your views on the phenomenon.
In one of your keynotes you will address your vision on how Big Data drives the Business and IT to adapt and evolve. Let me first of all congratulate you on the title of your keynote. It proves that a world-class BI and data warehouse veteran is still on top of things, which we can’t say for some other gurus of your generation, but let’s not dwell on that.
I have been studying the Big Data phenomenon from my narrow perspective, business analysis and BI architecture, and here are some of the questions I hope we can tackle during your keynote session:
1. Do you consider Big Data as something you can fit entirely in star schemas? I have known since The Data Webhouse Toolkit days that semi-structured data like web logs can find a place in a multidimensional model, but some of what Big Data produces is, to my knowledge, not fit for persistent storage. Yet I believe that a derived form of persistent storage may be a good idea. Let me give you an example. Imagine we can measure the consumer mood for a certain brand on a daily basis by scanning social media postings. Instead of creating a junk-like dimension, we could build a star schema with the following dimensions, to name the minimum: a mood dimension, a social media source dimension, time, location and brand, around a central fact table holding the mood score on a seven-point Likert scale. The real challenge will lie in correctly converting the text strings into the proper Likert score using advanced text analytics. Remember the wrong interpretation of the Osama Bin Laden tweets in early May 2011? The program interpreted “death” as a negative mood when the entire US was cheering the expedient demise of the terrorist.
Figure 1: An example of derived Big Data in a multidimensional schema
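To make the idea a little more concrete, here is a minimal sketch of such a schema in Python, using the standard sqlite3 module. Only the choice of dimensions and the seven-point score come from the description above. The table and column names, the three mood labels, the brand and the sample rows are all hypothetical and invented for illustration, and the text analytics step that would actually produce the scores is not shown.

```python
import sqlite3

# Hypothetical names throughout; only the five dimensions and the
# seven-point mood score are taken from the letter.
DDL = """
CREATE TABLE dim_mood     (mood_key     INTEGER PRIMARY KEY, mood_label    TEXT NOT NULL);
CREATE TABLE dim_source   (source_key   INTEGER PRIMARY KEY, source_name   TEXT NOT NULL);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, calendar_date TEXT NOT NULL);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, country       TEXT NOT NULL);
CREATE TABLE dim_brand    (brand_key    INTEGER PRIMARY KEY, brand_name    TEXT NOT NULL);

CREATE TABLE fact_brand_mood (
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    brand_key    INTEGER NOT NULL REFERENCES dim_brand(brand_key),
    source_key   INTEGER NOT NULL REFERENCES dim_source(source_key),
    location_key INTEGER NOT NULL REFERENCES dim_location(location_key),
    mood_key     INTEGER NOT NULL REFERENCES dim_mood(mood_key),
    -- Derived measure: the Likert score, constrained to 1..7.
    mood_score   INTEGER NOT NULL CHECK (mood_score BETWEEN 1 AND 7)
);
"""


def build_schema():
    """Create the star schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def load_sample(conn):
    """Load a handful of made-up rows, purely for illustration."""
    conn.executemany(
        "INSERT INTO dim_mood VALUES (?, ?)",
        [(1, "negative"), (2, "neutral"), (3, "positive")],
    )
    conn.execute("INSERT INTO dim_source VALUES (1, 'microblog')")
    conn.execute("INSERT INTO dim_date VALUES (20240101, '2024-01-01')")
    conn.execute("INSERT INTO dim_location VALUES (1, 'US')")
    conn.execute("INSERT INTO dim_brand VALUES (1, 'ExampleBrand')")
    # Column order: date, brand, source, location, mood, score.
    conn.executemany(
        "INSERT INTO fact_brand_mood VALUES (20240101, 1, 1, 1, ?, ?)",
        [(1, 2), (3, 6), (3, 7)],
    )
    conn.commit()


def average_mood(conn, brand_name):
    """Return the average Likert score recorded for one brand."""
    row = conn.execute(
        """
        SELECT AVG(f.mood_score)
        FROM fact_brand_mood AS f
        JOIN dim_brand AS b ON b.brand_key = f.brand_key
        WHERE b.brand_name = ?
        """,
        (brand_name,),
    ).fetchone()
    return row[0]


conn = build_schema()
load_sample(conn)
print(average_mood(conn, "ExampleBrand"))
```

The CHECK constraint keeps the fact table honest about the scale, but it does nothing about the harder problem raised above: a score that is in range and still wrong because the text was misread.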
2. How will you address the volatility issue? Big Data’s most distinctive feature is not volume, velocity or variety, which have always been relative to the state of the art. No, volatility is what really characterizes Big Data, and I refer to my article here, where I point out that Big Behavioural Data is the true challenge for analytics, as emotions and intentions can be very volatile and the Law of Large Numbers may not always apply.
3. Do you see a case for data provisioning strategies to address the above two issues? By data provisioning, I mean a transparent layer between the business questions and the BI architecture, where ETL provides answers to routine or planned business questions and virtual data marts produce answers to ad hoc and unplanned business questions. If so, what are the major pitfalls of virtualization for Big Data Analytics?
4. Do you see the need for new methodologies and new modeling methods or does the present toolbox suffice?
It’s been a while since we met and I really look forward to meeting you in Bussum, whether you answer these questions or not.
Kind regards,
Bert Brijs