zaterdag 1 juni 2013

Book Review: Taming the Big Data Tidal Wave

By Bill Franks
The Wiley and SAS Business Series 2012

The Big Data Definition misses a “V”

Whenever I see a sponsored book, the little bird on my shoulder called Paranoia whispers “Don’t waste your time on airport literature”. But this time, I was rewarded for my stamina. As soon as I got through the first pages stuffed with hype and “do or die” messages the author started to bring nuanced information about Big Data.

I appreciate expressions of caution and reserve towards Big Data: most Big Data doesn’t matter (p.17) and The complexity of the rules and the magnitude of the data being removed or kept at each stage will vary by data source and by business problem. The load processes and filters that are put on top of big data are absolutely critical (p. 21).

They prove Franks knows his onions. Peeling away further in the first chapter, his ideas on the need for some form of standardisation are spot on.

But I still miss a clear and concise definition of what really distinguishes Big Data as the Gartner definition Franks applies (Velocity, Volume and Variety) misses the “V” from “Volatility”. A statistician like Franks should have made some reflections on this aspect. Because “Variety” and “Volatility” are the true defining aspects of Big Data.

Moving on to chapter two where Franks positions Web data as the original Big Data.

It’s about qualitative contexts, not just lots of text strings

It is true that web analytics can provide leading indicators for transactions further down the sales pipeline but relying on just web logs without the context may deliver a lot of noise in your analysis. Here again, Franks is getting too excited to be credible, for two reasons: you are analysing the behaviour of a PC in case of non-registered customers and even when you know the PC user, you are missing loads of qualitative information to interpret the clicks. Studies with eye cameras analysing promotions and advertising have shown that you can optimise the layout and the graphics using the eye movements combined with qualitative interviews but there is no direct link between “eyeballs and sales”. Companies like Metrix Lab who work with carefully selected customer panels also provide clickstream and qualitative analytics but to my knowledge using these results as a leading indicator for sales still remains very tricky. Captions like Read your customers’ minds (p.37) are nice for Hello magazine but are a bit over the top.

I get Big Data analytical suggestions from a well-known on  line book store suggesting me to buy a Bert doll from Sesame Street because my first name… is… you guessed? Imagine the effort and money spent to come up with this nonsense.

The airline example (p. 38-39) Franks uses is a little more complicated than shown in the book: ex post analysis may be able to explain the trade-offs between price and value the customer has made but this ignores the auction mechanisms airlines use whenever somebody is looking and booking. Only by using control groups visiting the booking site with fixed prices and compare them to the dynamic pricing group may provide reliable information.

Simple tests of price, product, promotion etc. are ideal with this Big Data approach. But don’t expect explanations from web logs. The chapter finishes with some realistic promises in attrition and response management as well as segmentation and assessing advertising results. But it is the note at the end that explains a lot: The content of this chapter is based on a conference talk… (p. 51)

Chapter three suggests the added value of various Big Data sources. Telematics, text, time and location, smart grid, RFID, sensor, telemetry and social network data are all known examples but they are discussed in a moderate tone this time. The only surprise I got was the application of RFID data in casino chips. But then it has been a while since I visited COMDEX in Vegas.

Moving on to the second part about technologies, processes and methods. It starts with a high level didactic “for Dummies” kind of overview of data warehouses, massive parallel processing systems, SQL and UDF,PMML, cloud computing, grid computing, MapReduce.

In chapter 5, the analytic sandbox is positioned as  a major driver of analytic value and rightly so. Franks addresses some architectural issues with the question of external or internal sandboxes but he is a bit unclear about when to use one or the other as he simply states the advantages and disadvantages of both choices, adding the hybrid version as simply the sum of the external and internal sandbox(p. 125 – 130).

Why and when we choose one of the options isn’t mentioned. Think of fast exploration of small data sets in an external system versus testing, modifying a model with larger data sets in an internal system for example.

When discussing the use of enterprise reusable datasets, the author does tackle the “When?” question. It seems this section has somewhat of a SAS flavour. I have witnessed a few “puppy dog” approaches of the SAS sales teams to recognise a phrase like: There is no reason why business intelligence and reporting environments; as well as their users, cant leverage the EADS (Enterprise Analytic Data Set (author’s note)) structures as well (p145). This where the EADS becomes a substitute for the existing –or TO BE-  data warehouse environment and SAS takes over the entire BI landscape. Thanks but no thanks, I prefer a best of breed approach to ETL, database technology and publication of analytical results instead of the camel’s nose. A sandbox should be a project based environment, not a persistent BI infrastructure. You can’t have your cake and eat it.

The sixth chapter discusses the evolution of analytic tools and methods and here Franks is way out of line as far as I am concerned. Many of the commonly used analytical tools and modelling approaches have been in use for many years. Some, such as linear regression or decision trees, are effective and relevant, but relatively simplistic to implement? (p. 154) I am afraid I am lost here. Does Franks mean that only complex implementations produce value in big data? Or does he mean that the old formulas are no longer appropriate? Newsflash for all statisticians, nerds and number crunching geeks: better a simple model that is understood and used by the people who execute the strategy than a complex model –running the risk of overfitting and modelling white noise- that is not understood by the people who produce and consume strategic inputs and outputs… Double blind tests between classical regression techniques and fancy new algorithms have often showed only slightly or even negative added predictive value. Because models can only survive if the business user adds context, deep knowledge and wisdom to the model.

I remember a shootout in a proof of concept between the two major data mining tools (guess who was on that shortlist!) and the existing Excel 2007 forecasting implementation. I regret to say to the data mining tool vendors that Excel won. Luckily a few pages further the author himself admits: Sometimes “good enough” really is! (p. 157)

The third part, about the people and approaches starts off on the wrong foot: A reporting environment, as we will define it here, is also often called a business intelligence (BI) environment.

Maybe Franks keeps some reserve using “is also often called” but nevertheless is reversing a few things which I am glad to restore in their glory. Business Intelligence is a comprehensive discipline. It entails the architecture of the information delivery system, the data management, the delivery processes and its products like reporting, OLAP cubes, monitoring, statistics and analytics…

But he does make a point when het states that massive amounts of reports … amount to frustrated IT providers and frustrated report users. Frank’s plea for relevant reports (p. 182) is not addressing the root cause.

That root cause is –in my humble opinion- that many organisations still use an end to end approach in reporting: building point solutions from data source to target BI report. That way, duplicates and missed information opportunities are combined because these organisations lack an architectural vision.

On page 183, Bill Franks makes a somewhat academic comparison between reporting and analysis which raises questions (and eyebrows).

Here’s the table with just one of the many comments I can make per comparison:

Just one remark (as you are pressed for time)
Provides data
Provides answers
So there are no data in analyses?
Provides what is asked for
Provides what is needed
A report can answer both open and closed questions: deviations from the norm as  answers to these questions and trend comparisons of various KPI’s leading to new analysis.
Is typically standardised
Is typically customised
OK, but don’t underestimate the number of reports with ten or more prompts: reports or analytics? I don’t care.
Does not involve a person
Involves a person
True for automated scoring in OLTP applications but I prefer some added human intelligence as the ultimate goal of BI is: improved decision making.
Is fairly inflexible
Is extremely flexible.
OK Bill, this one’s on me. You’re absolutely right!

The book presents a reflection on what makes a great analytic professional and how to enable analytic innovation. What makes a great analytic professional and a team? In a nutshell it is very simple: the person who has the competence, commitment and creativity to produce new insights. He accepts imperfect base data, is business savvy and connects the analysis to the granularity of the decision. He also knows how to communicate analytic results. So far so good. As for the analytic team discussion, I see a few discussion points, especially the suggested dichotomy between IT and analytics (pp. 245 – 247) It appears that the IT teams want to control and domesticate the creativity of the analytics team but that is a bit biased. In my experience, analysts who can explain not only what they are doing and how they work but also what the value is for the organisation can create buy in from IT.

Finally, Franks discusses the analytics culture. And this is again a call to action for innovation and introduction of Big Data Analytics. The author sums up the  barriers for innovation which I assume should be known to his audience.


Although not completely detached from commercial interests (the book is sold for a song, which says something about the intentions of the writer and the sponsors) Bill Franks gives a good C-level explanation of what Big Data is all about. It provides food for thought for executives who want to position the various aspects of Big Data in their organisation. Sure, it follows the AIDA structure of a sales call, but Bill Franks does it with a clear pen, style and elegance.

This book has a reason of existence. Make sure you get a free copy from your SAS Institute or Teradata account manager.

2 opmerkingen:

  1. Hey Bert

    What are good books (based on scientific facts) about big data in your opinion? can you give any recommends?

    Thank you!

    1. Hi Nik, I am afraid I am not aware of any scientifically based evaluation of Big Data as far as their analytical power is concerned. It would seem a very hard thing to come up with reproducible knowledge in that area. There are, of course a few studies under way about the scalability and performance of unstructured data. I know of a paper that is starting the thinking process about how to benchmark Big Data performance: