Business Analysis for Business Intelligence Blog: BA4BIBlog: semi-structured data

Tuesday 20 August 2013

A Short Memo for Big Data Sceptics

In a New York Times article of 17 August by James Glanz, a few Big Data sceptics are quoted. Here is a literal quote: Robert J. Gordon, a professor of economics at Northwestern University, said comparing Big Data to oil was promotional nonsense. “Gasoline made from oil made possible a transportation revolution as cars replaced horses and as commercial air transportation replaced railroads,” he said. “If anybody thinks that personal data are comparable to real oil and real vehicles, they don’t appreciate the realities of the last century.” I respectfully disagree with the learned scholar: "the new oil" is a metaphor for how our lives have changed through the use of oil in transportation. Cars and planes have influenced our social lives immensely, but why shouldn't Big Data do so to an equal or even greater degree? Let me name just a few examples:

Big Data reducing traffic jams (to stick close to the real oil world),
Big Data improving the product-market match to the level of one to one, tailoring product specifications and promotions to individual preferences,
Big Data improving diagnostics and treatments in health care by combining the wisdom of millions of health care workers with logged events in diagnostics, epidemiological data, death certificates, etc.,
Big Data reducing energy consumption via the smart grid and the Internet of Things, automating the match between production and consumption,
Big Data in text mining to capture qualitative information on a quantitative scale, improving the positioning of qualitative discriminants in fashion, music, interior decorating, etc. and, of course, politics. Ask the campaign team of the 44th President of the United States and they will tell you how Big Data oiled their campaign.

As soon as better tools for structuring and analysing Big Data become available and as soon as visionary analysts are capable of integrating Big Data in regular BI architectures, the revolution will grow in breadth and depth. Some authors state that entirely new skills will be needed for this emerging market. If I were to promote training and education I'd say the same. But from where I stand today, I think the existing technological skills in database and file management may need a little tweaking, say a three- or five-day course, but there is no need for an MBD (Master in Big Data) education. On the business side of things there may be some need for explaining the workings of semi-structured and unstructured data and their V's, which already add up to seven. I believe it is going in the same direction as the marketing P's, where Kotler's initial four P's were upgraded to over thirty as one professor of marketing churned out this intellectual athletic performance. Let's sum them up and see if someone can top them:

Volume: a relative notion, as processing and storage capabilities increase over time.
Velocity: ibid.
Variety: also a relative notion, as EBCDIC, ASCII, UTF-8, etc. are now in the company of video and speech thanks to companies like Lernout & Hauspie, whatever the courts may have decided on their Language Development Companies.
Volatility: I added this one in an article you can find on the book site for "Business Analysis for Business Intelligence", because what is true today may not be true tomorrow. So it is not about the time horizon over which you need to store the data, as some authors claim, because that will be defined by the seasonality. The problem with these data is that there might not be any seasonality in them!
Veracity: how meaningful are the data for the problem or opportunity at hand?
Validity: Big Data can only be useful if validated by a domain expert who can identify its usefulness.
Value: what can we invest in recording, storing and analysing Big Data in return for what business value? This is one of the toughest questions today, as many innovative organisations follow the Nike principle: "Just do it".

And that, Professor Gordon, is just what all the pioneers did when they introduced the car and the aeroplane to their society, ignoring the anxious remarks from horse breeders and railroad companies. Remember how the first cars were slower than trains and horses? I rest my case.

Thursday 23 May 2013

What is Really “Big” about Big Data

Sometimes it is better to sit still and wait until the dust settles. Slowly but surely, the dust is settling around the Big Data buzz and what I am looking at does not impress me.

There are a lot of sloppy definitions of Big Data about. I have read incredible white papers, books and articles which all boil down to: “We don’t know what it is, but it is coming! Be prepared (to open your wallet).” In this article, I attempt to put certain aspects of Big Data in a Business Intelligence perspective and arrive at a definition that is more formal and truly distinguishes Big Data from just being a bigger database. I have a few challenges for you that will hopefully elicit some response and debate. These are the main challenges, intentionally served as bullet points as they will make bullet holes in some people’s assumptions.

Three V’s are not enough to describe the phenomenon
There is a word missing in the phenomenon’s naming
It has always been around: nihil novi sub sole
What’s really Big about Big Data

Amid all the vaguely defined concepts, vendor-pushed white papers and publications, three “Vs” are supposed to define Big Data completely.

Gartner, always ready to define a new market segment as a target for analysis and studies and thus as a new revenue source, popularised these three “Vs”. They go back to a 2001 research note by Doug Laney at META Group, a firm Gartner later acquired.

“Variety”, “Velocity” and “Volume” are supposed to describe the essence of Big Data. Let’s see if this holds any water. “Velocity”, the speed with which data should be processed, and “Volume” are relative and subject to technological advances. When I did analysis for a large telecom organisation in the mid-nineties, the main challenge was to respond fast and adequately to potential churn signals hidden in the massive amounts of call data records (CDRs). Those CDRs were the Big Data of the nineties.

Today, with the ever-increasing numbers in social, mobile and sensor data, the velocity and volumes have increased, but so has the speed of massively parallel processing as well as of storage and retrieval. Speed and volume are also an identical challenge for every player in the competition. If you need to be on par with your competitors, take out your credit card and book some space on AWS or Azure and Bob’s your uncle. The entire Big Data infrastructure is designed to function on cheap hardware.

Two “Vs” that matter: Variety and Volatility

"Variety"

One of the real differentiators is “Variety”: the increased variety of data types that really defines what Big Data is all about.

This “V” you won’t read about in Big Data hype articles, yet it is a serious challenge. From the early EBCDIC and ASCII files to UTF-8, XML, BLOBs, audio, video, sensor and actuator data, etc., making sense of all these data types can be a pain. But the variety within a data type may also add to the complexity. Think of the various separation marks in weblogs. And it gets worse when analysing unstructured data like natural language, where culture and context kick in: events, subcultures, trends, relationships, etc. Even a simple mood analysis in the US of tweets about the death of Bin Laden came out negative, because a simplistic count of the word “death” put those tweets on the minus side without taking the context into the equation.
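To make the pitfall of the simplistic word count concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the word lists, the sample tweet and the function names are invented, and the "context-aware" variant is only a slightly less crude word tally, not the method used in the analysis mentioned above.

```python
# Hypothetical word lists, invented for this illustration only.
NEGATIVE_WORDS = {"death", "dead", "killed"}
POSITIVE_WORDS = {"justice", "relief", "celebrate", "finally"}


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    return [word.strip(".,!?;:\"'").lower() for word in text.split()]


def naive_score(text):
    """Return -1 as soon as any negative keyword occurs, else 0."""
    return -1 if any(w in NEGATIVE_WORDS for w in tokenize(text)) else 0


def context_score(text):
    """Weigh negative keywords against positive ones in the same tweet."""
    words = tokenize(text)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    positives = sum(w in POSITIVE_WORDS for w in words)
    if positives > negatives:
        return 1
    if negatives > positives:
        return -1
    return 0


# An invented tweet: relieved in tone, yet it contains the word "death".
tweet = "Finally, justice! Relief at the news of his death."

print(naive_score(tweet))    # -1: booked on the minus side
print(context_score(tweet))  # 1: the surrounding words tip the balance
```

Even the second function is easily fooled by irony, slang or subculture references, which is exactly why variety within natural language is such a hard problem.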

"Volatility"

This is about the variations in sentiment, value and predictive power which are so typical of an important segment of Big Data. A few examples to make it clear.

If I give a “like” to Molson and Stella beers, will this still be valid next week, next month,…?

If I measure consumer behaviour in its broadest sense (information requests, complaints, e-mails to customer service, reviews on websites, purchases, returns and payment behaviour) and I try to link this to behaviour on social media, will the two correlate? Think of predicting election results on the basis of opinion polls: don’t we often say one thing while we act totally differently?
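One simple way to test whether what people say lines up with what they do is a correlation coefficient. The sketch below, in Python, computes a plain Pearson correlation between two weekly series. The figures and the names `weekly_likes` and `weekly_purchases` are invented for illustration and come from no real data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return covariance / (spread_x * spread_y)


# Invented figures: social media "likes" rise steadily,
# while actual purchases bounce around.
weekly_likes = [10, 20, 30, 40, 50]
weekly_purchases = [5, 3, 6, 2, 4]

print(round(pearson(weekly_likes, weekly_purchases), 2))  # -0.3
```

In this made-up case the coefficient is weakly negative: more likes do not go hand in hand with more purchases. With volatile data the coefficient itself may also shift from one period to the next, so a single measurement says little.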

This is the real challenge of Big Data. The rewards are great. The organisation that can exploit data about moods, attitudes and intentions from unstructured data will have better forecasts and respond faster to changes in the market.

This way, Big Data realises a dream from the 1960s: establishing a relationship between communication input and its impact on sales results. In other words, the DAGMAR [1] model from Russell Colley can materialise by analysing unstructured data from a broad range of sources.

[1] DAGMAR: the abbreviation of Defining Advertising Goals for Measuring Advertising Results.