

Saturday 24 May 2014

The Last Mile in the Belgian Elections (VII)

The Flemish Parliament’s Predictions

Scope management is important when you are on a tight budget and your sole objective is to prove that social media analytics is a journey into the future. That is why we concentrated on Flanders, the northern part of Belgium. (Yet the outcome of the elections for the Flemish Parliament will determine events at the Belgian level: if the N-VA wins significantly, they can impose some of their radical measures to get Belgium out of the economic slump, measures that are not much appreciated in the French-speaking south.) In commercial terms, this last week of analytics would have cost the client 5.7 man-days of work. Compare this to the cost of an opinion poll and you have a valid add-on to opinion polls, since the Twitter analytics can be done on a continuous basis. A poll is a photograph of the situation, while social media analytics show the movie.


From Share-of-Voice to Predictions


It’s been a busy week. Interpreting tweets is not a simple task, as we illustrated in the previous blog posts. And today the challenge gets even bigger. Predicting the election outcome in the northern, Dutch-speaking part of Belgium on the basis of topic-related sentiment analysis is like base jumping knowing that not one, but six guys have packed your parachute. These six guys are totally biased. Here are their names, in alphabetical order, in case you might think I am biased:


Dutch name | Name used in this blog post
CD&V (Christen Democratisch en Vlaams) | Christian democrats
Groen | Green (the ecologist party)
N-VA (Nieuw-Vlaamse Alliantie) | Flemish nationalists
Open VLD (Open Vlaamse Liberalen en Democraten) | Liberal democrats
SP-A (Socialistische Partij Anders) | Social democrats
VB (Vlaams Belang) | Nationalist & Anti-Islam party

Table 1. Translation of the original Dutch party names

From the opinion polls, the consensus is that the Flemish nationalists can obtain a result above 30 %, although the latest poll showed a downward break in that trend, and that the Nationalist & Anti-Islam party will lose further ground and become smaller than the Green party. In our analysis we did not include the extreme left-wing party PVDA, for the simple reason that they were almost non-existent on Twitter and the confusion with the Dutch social democrats (who share the PvdA name) created a tedious filtering job, which is fine if you get a budget for it. Since this was not the case, we skipped them, as well as any other exotic outsider. Together with the blank and invalid votes they may account for a noticeable percentage, which will show up at the end of the math exercise. But the objective of this blog post is to examine the possibilities of approximating the market shares with the share of voice on Twitter, detect the mechanics of possible anomalies and report on the user experience, as we explained at the outset of this Last Mile series of posts.

If we take the raw share-of-voice data on over 43,000 tweets, we see some remarkable deviations from the consensus.

Party | Share of voice on Twitter
Christian democrats | 21,3 %
Green (the ecologist party) | 8,8 %
Flemish nationalists | 27,9 %
Liberal democrats | 13,6 %
Social democrats | 12,8 %
Nationalist & Anti-Islam party | 11,3 %
Void, blank, mini parties | 4,3 %

Table 2. Percentage share of voice on Twitter per Flemish party
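
For readers who want to reproduce this kind of figure, the share-of-voice calculation itself is trivial once tweets have been attributed to a party. Below is a minimal Python sketch; the tweet counts are illustrative placeholders (not our actual dataset), chosen only to show the mechanics and to roughly reproduce the proportions in Table 2.

```python
# Minimal sketch: share of voice = a party's tweet count divided by the
# total number of party-attributed tweets. The counts below are
# illustrative placeholders, not the actual dataset behind Table 2.
tweet_counts = {
    "Flemish nationalists": 12000,
    "Christian democrats": 9200,
    "Liberal democrats": 5800,
    "Social democrats": 5500,
    "Nationalist & Anti-Islam party": 4900,
    "Green (the ecologist party)": 3800,
    "Void, blank, mini parties": 1800,
}

total = sum(tweet_counts.values())
for party, count in tweet_counts.items():
    print(f"{party}: {100 * count / total:.1f} %")
```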

It is common practice nowadays to combine the results of multiple models instead of relying on just one. This is not only good statistical practice; Nobel prize winner Kahneman has shown its value clearly in his work. In this case we combine the Twitter model with other, independent models to come to a final one.
Here we use the opinion polls to derive the covariance matrix.
Table 3. The covariance matrix with the shifts in market shares
This allows us to see, for example, if one party's share grows, at which party's expense it does so. In the case of the Flemish nationalists, growth comes at the cost of the Liberal democrats and the Nationalist & Anti-Islam party, while they win fewer voters from the Christian democrats and the social democrats. The behaviour of Green and the Nationalist & Anti-Islam party during the opinion polls was very volatile, which partly explains the spurious correlations with other parties.
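
As an illustration of how such a covariance matrix can be derived from the poll series, here is a small Python/numpy sketch. The poll numbers are invented for the example; the idea is simply to take the shifts between consecutive polls per party and compute the covariance of those shifts.

```python
import numpy as np

# Invented poll results for illustration (rows = successive polls,
# columns = parties, values = vote shares). Not the real poll data.
polls = np.array([
    [0.31, 0.19, 0.14, 0.13, 0.11, 0.09],
    [0.32, 0.18, 0.14, 0.13, 0.10, 0.09],
    [0.30, 0.18, 0.15, 0.14, 0.10, 0.09],
    [0.29, 0.19, 0.14, 0.13, 0.11, 0.10],
])

# Shifts in market share between consecutive polls
shifts = np.diff(polls, axis=0)

# Covariance of the shifts: a negative off-diagonal entry suggests that
# when one party gains, the other tends to lose (a possible voter transfer).
cov = np.cov(shifts, rowvar=False)
print(np.round(cov, 5))
```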


Graph 1. Overview of all opinion poll results: the evolution of the market shares in different opinion polls over time.

Comparing the different opinion polls, from different research organisations and on different samples, is simply not possible. But if you combine all the numbers in a mathematical model, you can smooth out a large part of these differences and create a central tendency.
To combine the different models, we use a derivation of the Black-Litterman model from finance. We are violating some assumptions, such as general market equilibrium, which we replace by a totally different concept: opinion polls. However, the elegance of this approach allows us to take into account opinions, the confidence in those opinions and complex interdependencies between the parties. The mathematical gain is worth the sacrifice of the theoretical underpinning.
This is based on a variant of the Black-Litterman formula μ = Π + τΣPᵀ(Ω + τPΣPᵀ)⁻¹(Q − PΠ), where Π is the prior vector (the central tendency of the polls), Σ the covariance matrix derived from the polls, P the pick matrix, Q the vector of views (here the Twitter-based estimates), Ω the uncertainty of those views and τ a scaling factor.
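
For the technically inclined, a compact numpy sketch of that posterior-mean formula is shown below. The vectors and matrices are toy values, purely to show how the prior (the poll central tendency), the views (the Twitter-based shares) and their uncertainties are combined; it is not the exact model we ran.

```python
import numpy as np

def black_litterman_mean(pi, sigma, P, Q, omega, tau=0.05):
    """Posterior mean: mu = pi + tau*Sigma*P' (Omega + tau*P*Sigma*P')^-1 (Q - P*pi)."""
    ts = tau * sigma
    middle = np.linalg.inv(omega + P @ ts @ P.T)
    return pi + ts @ P.T @ middle @ (Q - P @ pi)

# Toy example with three parties (all values are placeholders, not our data)
pi    = np.array([0.31, 0.18, 0.14])        # prior: central tendency of the polls
sigma = np.diag([0.02, 0.015, 0.012]) ** 2  # covariance of poll shifts
P     = np.eye(3)                           # one absolute view per party
Q     = np.array([0.279, 0.213, 0.136])     # views: Twitter share of voice
omega = np.diag([0.03, 0.03, 0.03]) ** 2    # uncertainty of each view

print(np.round(black_litterman_mean(pi, sigma, P, Q, omega), 3))
```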


And the Final Results Are…


Party | Central Tendency of all opinion polls | Data2Action's Prediction
Christian democrats | 18 % | 18,7 %
Green (the ecologist party) | 8,7 % | 8,8 %
Flemish nationalists | 31 % | 30,3 %
Liberal democrats | 14 % | 13,7 %
Social democrats | 13,3 % | 13,3 %
Nationalist & Anti-Islam party | 9,4 % | 9,8 %
Other (blank, mini parties, …) | 5,6 % | 5,4 %
Total | 100 % | 100 %

Table 4. Prediction of the results of the votes for the Flemish Parliament 

Now let’s cross our fingers and hope we produced some relevant results.

In the Epilogue, next week, we will evaluate the entire process. Stay tuned! Update: navigate to the evaluation.






Tuesday 20 May 2014

The Last Mile in the Belgian Elections (III)

Awesome Numbers... Big Data Volumes

Wow, the first results are awesome. Well, er, the first calculations at least are amazing.

  • 8,500 tweets measured per 15 seconds means roughly 1.5 billion tweets per month if you extrapolate in a very rudimentary way...
  • At 2 KB per tweet, that is about 2.8 terabytes of input data per month, following the same reasoning. Quite impressive for a small country like Belgium, where Twitter adoption is not on par with the northern countries.
  • If you use 55 kilobytes for a model vector of 1,000 features, you generate 77 terabytes of information per month.
  • And 55 KB is a small vector: a normal feature vector of one million features generates 72 petabytes of information per month (a rough back-of-envelope sketch of these extrapolations follows below).
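
The arithmetic behind these bullets is nothing more than multiplication; the sketch below reproduces it. The small differences with the figures above come down to rounding and to decimal versus binary units.

```python
# Back-of-envelope extrapolation of the volumes in the bullets above.
# Assumptions mirror the text: 8,500 tweets per 15 seconds, a 30-day
# month, 2 KB per tweet, 55 KB per 1,000-feature vector and roughly
# 1,000 times that for a one-million-feature vector.
tweets_per_second = 8_500 / 15
seconds_per_month = 30 * 24 * 3_600
tweets_per_month = tweets_per_second * seconds_per_month

print(f"{tweets_per_month / 1e9:.2f} billion tweets per month")                           # ~1.47 billion
print(f"{tweets_per_month * 2e3 / 1e12:.1f} TB raw input per month")                      # ~2.9 TB
print(f"{tweets_per_month * 55e3 / 1e12:.0f} TB of 1,000-feature vectors per month")      # ~81 TB
print(f"{tweets_per_month * 55e6 / 1e15:.0f} PB of one-million-feature vectors per month") # ~81 PB
```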

And wading through this sea of data you expect us to come up with results that matter?
Yes.
We did it.

Graph: Male versus female tweets in the Belgian elections, gender analysis (n = 4,977 tweets)

Today we checked the gender differences.

The Belgian male Twitter species is clearly more interested in politics than the female variant: only 22 % of the last 24 hours' tweets were of female signature; the remaining 78 % were of male origin.
This is not because Belgian women are less present on Twitter: overall, 48 % of tweets come from female sources against 52 % from male sources.
Analysing the first training results for irony and sarcasm also shows a male bias: the majority of the sarcastic tweets were male, 95 out of 115. Only 50 were detected by the data mining algorithms, so we still have some training to do.
More news tomorrow!

Tuesday 31 December 2013

A Small Presentation on Big Data

In eight minutes I make the connection between marketing, information management and Big Data, to position its real value and separate it from the hype.
Click here for the presentation.

Wishing you a great 2014, in which you will make decisions based on facts and data and be successful in all your endeavours.

Kind regards,

bert

Tuesday 20 August 2013

A Short Memo for Big Data Sceptics

In an article in the NY Times of 17 August by James Glanz, a few Big Data sceptics are quoted. Here is a literal quote: Robert J. Gordon, a professor of economics at Northwestern University, said comparing Big Data to oil was promotional nonsense. “Gasoline made from oil made possible a transportation revolution as cars replaced horses and as commercial air transportation replaced railroads,” he said. “If anybody thinks that personal data are comparable to real oil and real vehicles, they don’t appreciate the realities of the last century.” I respectfully disagree with the learned scholar: the new oil is a metaphor for how our lives have changed through the use of oil in transportation. Cars and planes have influenced our social lives immensely, but why shouldn't Big Data do so to an equal or even greater degree? Let me name just a few examples:
  • Big Data reducing traffic jams (to stick close to the real oil world),
  • Big Data improving the product-market match to the level of one-to-one, tailoring product specifications and promotions to individual preferences,
  • Big Data improving diagnostics and treatments in health care by combining the wisdom of millions of health care workers and logged events in diagnostics, epidemiologic data, death certificates etc.,
  • Big Data reducing energy consumption via the smart grid and the Internet of Things to automate the match between production and consumption,
  • Big Data in text mining, catching qualitative information on a quantitative scale and improving the positioning of qualitative discriminants in fashion, music, interior decorating etc., and of course... politics. Ask the campaign team of the 44th President of the United States and they will tell you how Big Data oiled their campaign.
As soon as better tools for structuring and analysing Big Data become available, and as soon as visionary analysts are capable of integrating Big Data into regular BI architectures, the revolution will grow in breadth and depth. Some authors state that entirely new skills will be needed for this emerging market. If I were to promote training and education I'd say the same. But from where I stand today, I think the existing technological skills in database and file management may need a little tweaking, say a three or five day course, but there is no need for an MBD (Master in Big Data) education.

On the business side of things there may be some need for explaining the workings of semi-structured and unstructured data and their V's, which already add up to seven. I believe this is going in the same direction as the marketing P's, where Kotler's initial four P's were upgraded to over thirty once one professor of marketing churned out this intellectual athletic performance. Let's sum them up and see if someone can top them:

  • Volume: a relative notion, as processing and storage capabilities increase over time.
  • Velocity: ibid.
  • Variety: also a relative notion, as EBCDIC, ASCII, UTF-8 etc. are now in the company of video and speech, thanks to companies like Lernout & Hauspie, whatever the courts may have decided on their Language Development Companies.
  • Volatility: I added this one in an article you can find on the book site of "Business Analysis for Business Intelligence", because what is true today may not be true tomorrow. So it is not about the time horizon you need to store the data, as some authors claim, because that would be defined by the seasonality. The problem with these data is that there might not be any seasonality in them!
  • Veracity: how meaningful are the data for the problem or opportunity at hand?
  • Validity: Big Data can only be useful if validated by a domain expert who can identify its usefulness.
  • Value: what can we invest in recording, storing and analysing Big Data, in return for what business value? This is one of the toughest questions today, as many innovative organisations follow the Nike principle: "Just do it".

And that, professor Gordon, is just what all the pioneers did when they introduced the car and the aeroplane to their society, ignoring the anxious remarks from horse breeders and railroad companies. Remember how the first cars were slower than trains and horses? I rest my case.

Saturday 1 June 2013

Book Review: Taming the Big Data Tidal Wave


By Bill Franks
The Wiley and SAS Business Series 2012

The Big Data Definition misses a “V”



Whenever I see a sponsored book, the little bird on my shoulder called Paranoia whispers “Don’t waste your time on airport literature”. But this time, I was rewarded for my stamina. As soon as I got through the first pages, stuffed with hype and “do or die” messages, the author started to bring nuanced information about Big Data.

I appreciate expressions of caution and reserve towards Big Data: “most Big Data doesn’t matter” (p. 17) and “The complexity of the rules and the magnitude of the data being removed or kept at each stage will vary by data source and by business problem. The load processes and filters that are put on top of big data are absolutely critical” (p. 21).

They prove Franks knows his onions. Peeling away further in the first chapter, his ideas on the need for some form of standardisation are spot on.

But I still miss a clear and concise definition of what really distinguishes Big Data, as the Gartner definition Franks applies (Velocity, Volume and Variety) misses the “V” of “Volatility”. A statistician like Franks should have reflected on this aspect, because “Variety” and “Volatility” are the true defining aspects of Big Data.

Moving on to chapter two where Franks positions Web data as the original Big Data.

It’s about qualitative contexts, not just lots of text strings


It is true that web analytics can provide leading indicators for transactions further down the sales pipeline, but relying on web logs alone, without the context, may deliver a lot of noise in your analysis. Here again, Franks is getting too excited to be credible, for two reasons: in the case of non-registered customers you are analysing the behaviour of a PC, and even when you know the PC user, you are missing loads of qualitative information to interpret the clicks. Studies with eye cameras analysing promotions and advertising have shown that you can optimise the layout and the graphics using eye movements combined with qualitative interviews, but there is no direct link between “eyeballs and sales”. Companies like Metrix Lab, which work with carefully selected customer panels, also provide clickstream and qualitative analytics, but to my knowledge using these results as a leading indicator for sales remains very tricky. Captions like “Read your customers’ minds” (p. 37) are nice for Hello magazine but are a bit over the top.

I get Big Data analytical suggestions from a well-known online book store, suggesting I buy a Bert doll from Sesame Street because my first name… is… you guessed it. Imagine the effort and money spent to come up with this nonsense.

The airline example (p. 38-39) Franks uses is a little more complicated than shown in the book: ex post analysis may be able to explain the trade-offs between price and value the customer has made, but this ignores the auction mechanisms airlines use whenever somebody is looking and booking. Only control groups visiting the booking site with fixed prices, compared to the dynamic pricing group, may provide reliable information.

Simple tests of price, product, promotion etc. are ideal with this Big Data approach. But don’t expect explanations from web logs. The chapter finishes with some realistic promises in attrition and response management as well as segmentation and assessing advertising results. But it is the note at the end that explains a lot: “The content of this chapter is based on a conference talk…” (p. 51)


Chapter three suggests the added value of various Big Data sources. Telematics, text, time and location, smart grid, RFID, sensor, telemetry and social network data are all known examples but they are discussed in a moderate tone this time. The only surprise I got was the application of RFID data in casino chips. But then it has been a while since I visited COMDEX in Vegas.

Moving on to the second part, about technologies, processes and methods. It starts with a high-level, didactic, “for Dummies” kind of overview of data warehouses, massively parallel processing systems, SQL and UDFs, PMML, cloud computing, grid computing and MapReduce.

In chapter 5, the analytic sandbox is positioned as a major driver of analytic value, and rightly so. Franks addresses some architectural issues with the question of external versus internal sandboxes, but he is a bit unclear about when to use one or the other, as he simply states the advantages and disadvantages of both choices and adds the hybrid version as simply the sum of the external and internal sandbox (p. 125-130).

Why and when we choose one of the options isn’t mentioned. Think of fast exploration of small data sets in an external system versus testing and modifying a model with larger data sets in an internal system, for example.

When discussing the use of enterprise reusable data sets, the author does tackle the “When?” question. It seems this section has somewhat of a SAS flavour. I have witnessed a few “puppy dog” approaches from SAS sales teams, enough to recognise a phrase like: “There is no reason why business intelligence and reporting environments, as well as their users, can’t leverage the EADS (Enterprise Analytic Data Set, author’s note) structures as well” (p. 145). This is where the EADS becomes a substitute for the existing (or to-be) data warehouse environment and SAS takes over the entire BI landscape. Thanks but no thanks: I prefer a best-of-breed approach to ETL, database technology and publication of analytical results instead of the camel’s nose. A sandbox should be a project-based environment, not a persistent BI infrastructure. You can’t have your cake and eat it.


The sixth chapter discusses the evolution of analytic tools and methods, and here Franks is way out of line as far as I am concerned. Many of the commonly used analytical tools and modelling approaches have been in use for many years. Some, such as linear regression or decision trees, are effective and relevant, but relatively simplistic to implement? (p. 154) I am afraid I am lost here. Does Franks mean that only complex implementations produce value in Big Data? Or does he mean that the old formulas are no longer appropriate? Newsflash for all statisticians, nerds and number-crunching geeks: better a simple model that is understood and used by the people who execute the strategy than a complex model, running the risk of overfitting and modelling white noise, that is not understood by the people who produce and consume strategic inputs and outputs… Double-blind tests between classical regression techniques and fancy new algorithms have often shown only slight or even negative added predictive value, because models can only survive if the business user adds context, deep knowledge and wisdom to the model.

I remember a shootout in a proof of concept between the two major data mining tools (guess who was on that shortlist!) and the existing Excel 2007 forecasting implementation. I regret to say to the data mining tool vendors that Excel won. Luckily, a few pages further, the author himself admits: “Sometimes ‘good enough’ really is!” (p. 157)

The third part, about the people and approaches, starts off on the wrong foot: “A reporting environment, as we will define it here, is also often called a business intelligence (BI) environment.”

Maybe Franks keeps some reserve by using “is also often called”, but he is nevertheless reversing a few things, which I am glad to restore to their former glory. Business Intelligence is a comprehensive discipline. It entails the architecture of the information delivery system, the data management, the delivery processes and their products like reporting, OLAP cubes, monitoring, statistics and analytics…

But he does make a point when he states that massive amounts of reports … amount to frustrated IT providers and frustrated report users. Franks’ plea for relevant reports (p. 182), however, does not address the root cause.

That root cause is, in my humble opinion, that many organisations still use an end-to-end approach to reporting: building point solutions from data source to target BI report. That way, duplicates and missed information opportunities pile up, because these organisations lack an architectural vision.


On page 183, Bill Franks makes a somewhat academic comparison between reporting and analysis which raises questions (and eyebrows).

Here’s the table with just one of the many comments I can make per comparison:

Reporting… | Analysis… | Just one remark (as you are pressed for time)
Provides data | Provides answers | So there are no data in analyses?
Provides what is asked for | Provides what is needed | A report can answer both open and closed questions: deviations from the norm as answers to these questions, and trend comparisons of various KPIs leading to new analysis.
Is typically standardised | Is typically customised | OK, but don’t underestimate the number of reports with ten or more prompts: reports or analytics? I don’t care.
Does not involve a person | Involves a person | True for automated scoring in OLTP applications, but I prefer some added human intelligence, as the ultimate goal of BI is improved decision making.
Is fairly inflexible | Is extremely flexible | OK Bill, this one’s on me. You’re absolutely right!


The book presents a reflection on what makes a great analytic professional and how to enable analytic innovation. What makes a great analytic professional and a great team? In a nutshell it is very simple: the person who has the competence, commitment and creativity to produce new insights. He accepts imperfect base data, is business savvy and connects the analysis to the granularity of the decision. He also knows how to communicate analytic results. So far so good. As for the analytic team discussion, I see a few discussion points, especially the suggested dichotomy between IT and analytics (pp. 245-247). It appears that the IT teams want to control and domesticate the creativity of the analytics team, but that is a bit biased. In my experience, analysts who can explain not only what they are doing and how they work, but also what the value is for the organisation, can create buy-in from IT.

Finally, Franks discusses the analytics culture. And this is again a call to action for innovation and the introduction of Big Data analytics. The author sums up the barriers to innovation, which I assume should be known to his audience.

Conclusion


Although not completely detached from commercial interests (the book is sold for a song, which says something about the intentions of the writer and the sponsors), Bill Franks gives a good C-level explanation of what Big Data is all about. It provides food for thought for executives who want to position the various aspects of Big Data in their organisation. Sure, it follows the AIDA structure of a sales call, but Bill Franks does it with a clear pen, style and elegance.

This book has a reason to exist. Make sure you get a free copy from your SAS Institute or Teradata account manager.

Thursday 23 May 2013

What is Really “Big” about Big Data


Sometimes it is better to sit still and wait until the dust settles. Slowly but surely, the dust is settling around the Big Data buzz and what I am looking at does not impress me.
There are a lot of sloppy definitions of Big Data around. I have read incredible white papers, books and articles which all boil down to: “We don’t know what it is, but it is coming! Be prepared (to open your wallet).” In this article, I attempt to put certain aspects of Big Data in a Business Intelligence perspective and come to a definition that is more formal and truly distinguishes Big Data from just being a bigger database. I have a few challenges for you that will hopefully elicit some response and debate. These are the main challenges, intentionally served as bullet points, as they will make bullet holes in some people’s assumptions.

  • Three V’s are not enough to describe the phenomenon
  • There is a word missing in the phenomenon’s naming
  • It has always been around: nihil novi sub sole
  • What’s really Big about Big Data


Amid all the vaguely defined concepts, vendor-pushed white papers and publications, three “V’s” are supposed to define Big Data completely.
Gartner, always ready to define a new market segment as a target for analyses and studies and as a new revenue source, introduced these three “V’s” around 2008.

“Variety”, “Velocity” and “Volume” are supposed to describe the essence of Big Data. Let’s see if this holds any water. “Velocity”, the speed with which data should be processed, and “Volume” are relative and subject to technological advances. When I did analysis for a large telecom organisation in the mid-nineties, the main challenge was to respond fast and adequately to potential churn signals hidden in the massive amounts of call data records (CDRs). Those CDRs were the Big Data of the nineties.
Today, with the ever increasing numbers in social, mobile and sensor data, velocity and volumes have increased, but so have the speed of massively parallel processing as well as storage and retrieval. Speed and volume are also an identical challenge for every player in the competition. If you need to be on par with your competitors, take out your credit card, book some space on AWS or Azure and Bob’s your uncle. The entire Big Data infrastructure is designed to function on cheap hardware.

Two “Vs” that matter: Variety and Volatility


"Variety"

One of the real differentiators is “Variety”: it is the increased variety of data types that really defines what Big Data is all about.
This is a “V” you won’t read about in Big Data hype articles, yet it is a serious challenge. From the early EBCDIC and ASCII files to UTF-8, XML, BLOBs, audio, video, sensor and actuator data, making sense of all these data types can be a pain. But the variety within a data type may also add to the complexity. Think of the various separation marks in web logs. And it gets worse when analysing unstructured data like natural language, where culture and context kick in: events, subcultures, trends, relationships, etc. Even a simple mood analysis in the US on tweets about the death of Bin Laden was interpreted as negative, because a simplistic count of the word “death” was used to score the tweets on the minus side without taking the context into the equation.
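
To make that pitfall concrete, here is a deliberately naive keyword-count mood scorer in Python. The word list and the tweets are made up; the point is simply to show how a context-free count of “death” flips the apparent sentiment.

```python
# Deliberately naive keyword-based mood scoring: -1 per negative keyword,
# with no notion of context. Word list and example tweets are made up.
NEGATIVE_WORDS = {"death", "dead", "killed"}

def naive_mood(tweet: str) -> int:
    words = (w.strip(".,!?'\"").lower() for w in tweet.split())
    return -sum(w in NEGATIVE_WORDS for w in words)

tweets = [
    "Bin Laden's death confirmed, crowds celebrating outside the White House",
    "Such sad news today, what a loss",
]

for t in tweets:
    print(naive_mood(t), "->", t)
# The first (celebratory) tweet scores negative, while the second
# (genuinely sad) tweet scores neutral: exactly the misreading described above.
```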

“Volatility”

This is about the variations in sentiment, value and predictive power which are so typical for an important segment of Big Data. A few examples to make it clear.
If I give a “like” to Molson and Stella beers, will this still be valid next week, next month,…?
If I measure consumer behaviour in its broadest sense (information requests, complaints, e-mails to customer service, reviews on websites, purchases, returns and payment behaviour) and I try to link this to behaviour on social media, will it correlate with the previous measures? Think of predicting election results on the basis of opinion polls: don’t we often say one thing while we act totally differently?
This is the real challenge of Big Data. The rewards are great. The organisation that can exploit data about moods, attitudes and intentions from unstructured data will have better forecasts and respond faster to changes in the market.
This way, Big Data realises a dream from the 1960’s: establishing a relationship between communication input and its impact on sales results. In other words, the DAGMAR [1] model from Russel Colley can materialise by analysing unstructured data from a broad range of sources.





[1] DAGMAR: the abbreviation of Defining Advertising Goals for Measuring Advertising Results.