Enabling Business Analysis for Big Data Analytics: the path is the goal
Thoughts on business intelligence and customer relationship management, as customer analytics needs process-based support for meaningful analysis.
Friday 22 August 2014
Monday 26 May 2014
Elections’ Epilogue: What Have We Learned?
First the good news: a MAD of 1.41 Gets the Bronze Medal of All Polls!
The results of the Flemish Parliament elections, with all votes counted, are:
Party | Results (source: Het Nieuwsblad) | SAM’s forecast
Christian democrats (CD&V) | 20.48 % | 18.70 %
Green (Groen) | 8.7 % | 8.75 %
Flemish nationalists (N-VA) | 31.88 % | 30.32 %
Liberal democrats (O-VLD) | 14.15 % | 13.70 %
Social democrats (SP-A) | 13.99 % | 13.27 %
Nationalist & Anti-Islam party (VB) | 5.92 % | 9.80 %
Table 1. Results Flemish Parliament compared to our forecast
Below is the comparative table of all polls against this result, together with the Mean Absolute Deviation (MAD): the average absolute difference, in percentage points, between a forecast and the actual result across the six parties. A MAD of zero means a perfect prediction. In this case, with the highest score at almost 32 % and the lowest at almost 6 % in only six observations, anything under 1.5 is quite alright.
Table 2. Comparison of all opinion polls for the Flemish Parliament and our prediction based on Twitter analytics by SAM.
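For readers who want to check the headline figure, here is a minimal sketch that recomputes the MAD of SAM's forecast from the numbers in Table 1. The party labels on rows where the name did not survive are inferred from the alphabetical order used throughout these posts, so treat them as an assumption.

```python
# Official results and SAM's forecast, in percent, taken from Table 1.
# Labels for rows without a surviving party name are inferred from the
# alphabetical order used elsewhere in these posts (an assumption).
results = {"CD&V": 20.48, "Groen": 8.70, "N-VA": 31.88,
           "O-VLD": 14.15, "SP-A": 13.99, "VB": 5.92}
forecast = {"CD&V": 18.70, "Groen": 8.75, "N-VA": 30.32,
            "O-VLD": 13.70, "SP-A": 13.27, "VB": 9.80}


def mean_absolute_deviation(actual, predicted):
    """Average absolute gap between prediction and outcome, in points."""
    gaps = [abs(actual[party] - predicted[party]) for party in actual]
    return sum(gaps) / len(gaps)


mad = mean_absolute_deviation(results, forecast)
# Party with the largest single miss.
worst = max(results, key=lambda party: abs(results[party] - forecast[party]))

print(f"MAD = {mad:.2f}")         # MAD = 1.41
print(f"largest miss: {worst}")   # largest miss: VB
```

The single largest miss is what an average like the MAD tends to hide, which is why it is printed separately.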
Compared to 16 other opinion polls published by various national media, our little SAM (Social Analytics and Monitoring) did quite well on a shoestring budget: in only 5.7 man-days we came up with a result that competes with the giants of market research.
The Mean Absolute Deviation covers up one serious flaw in our forecast: the giant shift of voters from VB (the nationalist anti-Islam party) to N-VA (the Flemish nationalist party). This led to an underestimation of the N-VA result and an overestimation of the VB result. Although the model estimated the direction of the shift correctly, it underestimated its size.
If we had used more data, we might have caught that shift and ended even higher!
Conclusion
Social media analytics goes a step further than the social media reporting that most tools offer today. With our little SAM, built on the Data2Action platform, we have shown that forecasting based on a correct judgment of sentiment, even from a single source like Twitter, can produce relevant results in marketing, sales, operations and finance. Compared to politics, these disciplines deliver far more predictable data because they can combine external sources like social media with customer, production, logistics and financial data. And social media actors and opinion leaders certainly produce less bias in these areas than is the case in political statements. All this can be done on a continuous basis, supporting day-to-day management in communication, supply chain, sales, etc.
If you want to know more
about Data2Action, the platform that made this possible, drop me a line: contact@linguafrancaconsulting.eu
Get ready for fact based decision making
on all levels of your organisation
Saturday 24 May 2014
The Last Mile in the Belgian Elections (VII)
The Flemish Parliament’s Predictions
Scope management is important if you are on a tight budget and your sole objective is to prove that social media analytics is a journey into the future. That is why we concentrated on Flanders, the northern part of Belgium. (Yet the outcome of the elections for the Flemish parliament will determine events at the Belgian level: if the N-VA wins significantly, they can impose some of their radical methods to get Belgium out of the economic slump, which is not much appreciated in the French-speaking south.) In commercial terms, this last week of analytics would have cost the client 5.7 man-days of work. Compare this to the cost of an opinion poll and you have a valid add-on to opinion polls, as Twitter analytics can be done on a continuous basis. A poll is a photograph of the situation while social media analytics show the movie.
From Share-of-Voice to Predictions
It’s been a busy week. Interpreting tweets is not a simple
task as we illustrated in the previous blog posts. And today, the challenge
gets even bigger. To predict the election outcome in the northern, Dutch
speaking part of Belgium on the basis of sentiment analysis related to topics
is like base-jumping knowing that not one, but six guys have packed your
parachute. These six guys are totally biased. Here are their names, in
alphabetical order, in case you might think I am biased:
Dutch name | Name used in this blog post
CD&V (Christen Democratisch en Vlaams) | Christian democrats
Groen | Green (the ecologist party)
N-VA (Nieuw-Vlaamse Alliantie) | Flemish nationalists
O-VLD (Open Vlaamse Liberalen en Democraten) | Liberal democrats
SP-A (Socialistische Partij Anders) | Social democrats
VB (Vlaams Belang) | Nationalist & Anti-Islam party
Table 1. Translation of the original Dutch party names
From the opinion polls, the consensus is that the Flemish nationalists can obtain a result over 30 %, although the latest poll showed a downward break in that trend, and that the Nationalist & Anti-Islam party will lose further and become smaller than the Green party. In our analysis we didn’t include the extreme left-wing party PVDA, for the simple reason that they were almost non-existent on Twitter and the confusion with the Dutch social democrats (also abbreviated PvdA) created a tedious filtering job, which is fine if you get a budget for it. Since this was not the case, we skipped them, as well as any other exotic outsider. Together with the blank and invalid votes they may account for an important percentage, which will show up at the end of the calculations. But the objective of this blog post is to examine the possibilities of approximating the market shares with the share of voice on Twitter, to detect the mechanics of possible anomalies and to report on the user experience, as we explained at the outset of this Last Mile series of posts.
If we take the raw share-of-voice data from over 43,000 tweets, we see some remarkable deviations from the consensus.
Party | Share of voice on Twitter
Christian democrats | 21.3 %
Green (the ecologist party) | 8.8 %
Flemish nationalists | 27.9 %
Liberal democrats | 13.6 %
Social democrats | 12.8 %
Nationalist & Anti-Islam party | 11.3 %
Void, blank, mini parties | 4.3 %
Table 2. Percentage share of voice on Twitter per Flemish party
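To make the "remarkable deviations" concrete, the sketch below subtracts the central tendency of the opinion polls (taken from the final results table further down in this post) from the Twitter share of voice. Two assumptions are mine: the unlabelled first row is the Christian democrats, and the "void, blank, mini parties" row corresponds to "Other" in the final table.

```python
# Share of voice on Twitter (Table 2) and the central tendency of the
# opinion polls (final results table of this post), both in percent.
# The label of the first row is inferred, not given in the table.
share_of_voice = {"Christian democrats": 21.3, "Green": 8.8,
                  "Flemish nationalists": 27.9, "Liberal democrats": 13.6,
                  "Social democrats": 12.8, "Nationalist & Anti-Islam": 11.3,
                  "Other": 4.3}
poll_consensus = {"Christian democrats": 18.0, "Green": 8.7,
                  "Flemish nationalists": 31.0, "Liberal democrats": 14.0,
                  "Social democrats": 13.3, "Nationalist & Anti-Islam": 9.4,
                  "Other": 5.6}

# Positive gap: the party is talked about more than the polls suggest.
gaps = {party: round(share_of_voice[party] - poll_consensus[party], 1)
        for party in share_of_voice}

for party, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{party:26s} {gap:+.1f}")
```

On these numbers the Flemish nationalists are the most under-represented on Twitter (about 3 points below the polls) and the Christian democrats the most over-represented.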
It is common practice nowadays to combine the results of multiple models instead of relying on just one. This holds not only in statistics: Nobel Prize winner Kahneman has shown it clearly in his work. Here, we combine the Twitter share-of-voice model with other independent models to arrive at a final one.
In this case we use the opinion polls to derive the covariance matrix. This allows us to see things such as: if one party’s share grows, at which party’s expense does it grow? In the case of the Flemish nationalists, growth comes at the cost of the Liberal democrats and the Nationalist & Anti-Islam party, but they win fewer followers from the Christian and the social democrats. The behaviour of Green and the Nationalist & Anti-Islam party across the opinion polls was very volatile, which partly explains the spurious correlations with other parties.
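A minimal sketch of how such a covariance matrix can be derived from a series of polls follows. The poll figures are invented placeholders, not the polls we used; only the mechanics matter. A negative covariance between two parties means that one tends to grow when the other shrinks.

```python
# Invented poll series (percent per party across four polls); these are
# NOT the polls used in the post, only placeholders to show the mechanics.
polls = {
    "N-VA":  [30.0, 31.0, 32.0, 33.0],
    "O-VLD": [15.0, 14.5, 14.0, 13.5],
    "VB":    [11.0, 10.0, 10.0, 9.0],
}


def covariance(xs, ys):
    """Sample covariance of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


parties = list(polls)
cov = {a: {b: covariance(polls[a], polls[b]) for b in parties}
       for a in parties}

for a in parties:
    print(a.ljust(6), "  ".join(f"{cov[a][b]:+.2f}" for b in parties))
```

In this toy data N-VA has a negative covariance with both other parties, which is the pattern described above: its gains come at their expense.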
Graph 1 Overview of all opinion poll results: the evolution of the market shares in different opinion polls over time.
Directly comparing opinion polls from different research organisations, based on different samples, is simply not possible. But if you combine all numbers in a mathematical model you can smooth out a large part of these differences and obtain a central tendency.
To combine the different models, we use a derivative of the Black-Litterman model used in finance. We are violating some assumptions, such as general market equilibrium, which we replace with a totally different concept: the opinion polls. However, the elegance of this approach allows us to take into account opinions, the confidence in each opinion and the complex interdependencies between the parties. The mathematical gain is worth the sacrifice of the theoretical underpinning.
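The combination step can be sketched in a few lines of dependency-free Python using the standard Black-Litterman posterior mean: prior plus a correction that weighs the views by their confidence. Everything numeric below is an assumption for illustration: I take the poll central tendency as the prior and the Twitter share of voice as the views for two parties, and the covariance `sigma`, the view uncertainty `omega` and the scaling factor `tau` are invented. This is not the calibration behind our forecast.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def inverse(m):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def black_litterman(prior, sigma, P, p, omega, tau):
    """mu = prior + tau*Sigma*P^t (Omega + tau*P*Sigma*P^t)^-1 (p - P*prior)."""
    ts_pt = [[tau * x for x in row] for row in matmul(sigma, transpose(P))]
    middle = [[o + m for o, m in zip(o_row, m_row)]
              for o_row, m_row in zip(omega, matmul(P, ts_pt))]
    surprise = [view - expected for view, expected in zip(p, matvec(P, prior))]
    return [pi + adj for pi, adj in
            zip(prior, matvec(ts_pt, matvec(inverse(middle), surprise)))]


prior = [31.0, 9.4]                  # polls: N-VA, VB (assumed prior)
views = [27.9, 11.3]                 # Twitter share of voice (assumed views)
sigma = [[1.0, -0.5], [-0.5, 1.0]]   # invented covariance
omega = [[1.0, 0.0], [0.0, 1.0]]     # invented view uncertainty
P = [[1.0, 0.0], [0.0, 1.0]]         # one absolute view per party
tau = 1.0                            # invented scaling factor

mu = black_litterman(prior, sigma, P, views, omega, tau)
print([f"{x:.1f}" for x in mu])      # ['29.3', '10.7']
```

With these toy inputs the combined estimate lands between the polls and the Twitter signal; a larger `omega` (less confidence in the views) pulls it back towards the polls.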
This is based on a variant of the Black-Litterman model:
$$\mu = \Pi + \tau \Sigma P^{t} \left(\Omega + \tau P \Sigma P^{t}\right)^{-1} (p - P\Pi)$$
And the Final Results Are…
Party | Central tendency of all opinion polls | Data2Action’s prediction
Christian democrats | 18 % | 18.7 %
Green (the ecologist party) | 8.7 % | 8.8 %
Flemish nationalists | 31 % | 30.3 %
Liberal democrats | 14 % | 13.7 %
Social democrats | 13.3 % | 13.3 %
Nationalist & Anti-Islam party | 9.4 % | 9.8 %
Other (blank, mini parties, …) | 5.6 % | 5.4 %
Total | 100 % | 100 %
Now let’s cross our fingers and hope we produced some
relevant results.
In the Epilogue, next week, we will evaluate the entire
process. Stay tuned! Update: navigate to the evaluation.
Labels: #elections2014, #vk2014, Belgium, Big Data, Black-Litterman, CD&V, elections België, Groen, N-VA, O-VLD, opinion polls, Social analytics, SP-A, verkiezingen, Vlaams Belang
Friday 23 May 2014
The Last Mile in the Belgian Elections (VI)
Are Twitter People Nice People?
The answer is: “Depends”. In this article I make a taxonomy of tweets in the last week of the Belgian elections. Based on over 35,000 tweets, we can be pretty sure that this is a representative sample. You can consider this article an introduction to tomorrow's headline: the last election poll, based on Twitter analytics.
A picture says more than a thousand tweets
The taxonomy of the Twitter community
So here it
is. The majority of tweets are negative.
When you encounter positive tweets, they are either from somebody who wants to
market something (in case of the elections him or herself or a candidate he or
she supports) or from somebody who is forwarding a link with a positive
comment.
There is a
correlation between the level of negativity about a subject and the political party
related to the subject. From a political point of view, the polarisation between the Walloon socialist party and the Flemish nationalist party is clearly visible on Twitter.
Even today, at the funeral of a well-respected politician of the older generation, the former Belgian prime minister Jean-Luc Dehaene, the majority of tweets were negative. Tweets linking him to the financial scandal of the Christian democrat trade union in Dexia outnumbered the pious "RIP JLD" variants six to one.
So how do
you derive popularity and even arrive at some predictive value from a bunch of
negative tweets? That, my dear blog
readers, will be examined tomorrow in the final article.
Thursday 22 May 2014
The Last Mile in the Belgian Elections (V)
Why Sentiment Measures Alone Are Not Enough
In the process of developing Social Analytics and Monitoring, we learnt something most interesting about sentiment analysis. Before we created
Data2Action as a platform for data
mining and developed SAM (Social Analytics and Monitoring) we studied many
approaches.
Many of these were just producing numbers to express
sentiment versus a brand, a person, a concept or a company, to name a few.
Isolated Sentiment Analysis is Meaningless
This can be too superficial to produce meaningful analytic
results so we recreated social constructs that match with concepts. Analysing
the sentiment of a construct element in context with a topic is not a trivial
task. But at least it approaches human judgement and it can be trained to
increase precision and relevance.
Today, I am not going to amaze you with Big Numbers but I’ll
show you some examples of how we approach sentiment analysis with SAM.
Let’s take a few tweets about the N-VA party and examine how
they are scored:
The ultimate horror for companies and a torpedo for our
welfare state: an anti N-VA coalition with the ecologist party
Another point where N-VA does not represent the Flemish people
From a one-dimensional point of view, both tweets are
negative for N-VA but the first is in fact meant as a positive, pro N-VA statement.
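One way to picture the difference is to score each tweet per target instead of giving it a single number. The sketch below is only an illustration of that data structure: the class name, the targets and the polarity values are hand-assigned by me and are not SAM output.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredTweet:
    text: str
    # Polarity per target, from -1 (hostile) to +1 (supportive).
    polarity: dict = field(default_factory=dict)


# Scores are hand-assigned for illustration; they are not SAM output.
tweets = [
    ScoredTweet(
        "The ultimate horror for companies and a torpedo for our welfare "
        "state: an anti N-VA coalition with the ecologist party",
        {"N-VA": +1.0, "Groen": -1.0},
    ),
    ScoredTweet(
        "Another point where N-VA does not represent the Flemish people",
        {"N-VA": -1.0},
    ),
]


def net_sentiment(scored, target):
    """Average polarity towards one target over the tweets that mention it."""
    values = [t.polarity[target] for t in scored if target in t.polarity]
    return sum(values) / len(values) if values else None


print(net_sentiment(tweets, "N-VA"))   # 0.0
print(net_sentiment(tweets, "Groen"))  # -1.0
```

A one-dimensional scorer would count two negative tweets for N-VA; scored per target, the two cancel out and the negativity of the first tweet lands on the coalition it attacks.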
Let us look at this, more complex tweet:
Vande Lanotte opens up the coalition for the Green Party,
wrong move as the voters already consider N-VA strong enough.
The first part of the sentence “Vande Lanotte opens up the
coalition for the Green Party” can be considered positive for Vande Lanotte and
his socialist party SP-A. But the second part is negative. This shows the
importance of parsing the sentence correctly and attributing scores as a
function of viewpoints.
Wednesday 21 May 2014
The Last Mile in the Belgian Elections (IV)
How Topic Detection Leads to Share-of-voice Analysis
It was a full day of events on Twitter. Time to make an
inventory of the principal topics and the buzz created on the social network in
the Dutch speaking community in the north of Belgium.
First, the figures: 10,605 tweets were analysed, of which 5,754 referred to an external link (i.e. a news site or another external web site such as a blog, a photo album, etc.).
As the Flemish nationalist party leader Mr. Dewever from
N-VA (the New Flemish Alliance in English) launched his appeal to the French speaking
community today, we focused on the tweets about, to and from this party.
A mere 282 tweets were deemed relevant for topic analysis.
And here’s the first striking number: of these 282 tweets only 16 contained a
reactive response.
Tweets that provoked a reactive response are almost nonexistent
About 49 topics grouped several media sources and publications of all sorts. We will discuss three of them to illustrate how the relationship between topic, retweets, klout score and added content makes some tweets more relevant than others. These are the three topics:
- Dewever addresses the French speaking community via Twitter
- Christian Democrat De Clerck falsely accuses N-VA of using fascist symbols in an advertisement
- YouTube movie from N-VA is ridiculed by the broad community
Dewever addresses the French speaking community via Twitter
This topic is divided into one moderately positive headline and two neutral ones. The positive one: Bart Dewever to the French Speaking Community: “Give N-VA a Chance”
This headline generates a total klout score of 188 where the
Flemish tv station VRT takes the biggest chunk with 158 klout score.
This neutral headline generates only 98 klout score: “Dewever
puts the struggle between N-VA and the French speaking socialist party at the
centre of the discussion”
The other neutral headline “N-VA President Bart Dewever
addresses the French speaking community directly” delivers a higher score: 140 klout
score partly because one of N-VA’s members of Parliament promoted the link to
the news medium.
All in all with 426 total klout score, this topic does not
cause great ripples, especially not if you compare this to a mere anecdote,
which is the second topic.
Christian Democrat De Clerck falsely accuses N-VA of using fascist symbols in an advertisement
On the left, the swastika hoax with the Christian democrat's comment; on the right, the original ad showing a labyrinth
Felix De Clerck, son of the former Christian democrat minister
of Justice Stefaan De Clerck, reacted to a hoax and was chastised for doing
this. With a klout score of 967 this has caused a bigger stir although the
political relevance is a lot smaller than Dewever’s speech… Emotions can play a
role even in simple and neutral retweets.
YouTube movie from N-VA is ridiculed by the broad community
Another of the day’s highs was reached by an amateurish YouTube movie that parodied a famous Flemish detective series to highlight the major issues of the campaign. This product from the candidates in West Flanders, including the Flemish minister of Interior Affairs, Geert Bourgeois, generated a total klout score of 778 from tweets and retweets with negative or sarcastic comments.
Yet an adjacent topic, about a cameraman from Bruges who is surprised by minister Bourgeois’ enthusiasm, generates a moderately positive klout score of 123.
Three topics out of 49 generate 20.6 % of total klout scores!
This illustrates perfectly how the Twitter community selects and reinforces topics that carry an emotional value: the YouTube movie and the hoax from De Clerck alone generated a share of voice of almost 17 % of the tweets.
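The arithmetic behind these percentages can be retraced from the klout scores quoted in this post. The sketch assumes that the 20.6 % refers to the three topic totals below; the overall klout total is back-calculated from that percentage and is not a figure I published.

```python
# Klout scores per headline as quoted in this post, grouped by topic.
topics = {
    "Dewever addresses the French speaking community": [188, 98, 140],
    "De Clerck hoax": [967],
    "N-VA YouTube movie": [778],
}

topic_totals = {name: sum(scores) for name, scores in topics.items()}
three_topics = sum(topic_totals.values())

# Assumption: the 20.6 % in the post refers to these three totals, so the
# overall klout score is back-calculated rather than quoted from the post.
implied_total = three_topics / 0.206

emotional = topic_totals["De Clerck hoax"] + topic_totals["N-VA YouTube movie"]

print(topic_totals["Dewever addresses the French speaking community"])  # 426
print(three_topics)                                                     # 2171
print(f"{emotional / implied_total:.1%}")                               # 16.6%
```

Under that assumption the hoax and the YouTube movie together come out at about 16.6 %, in line with the "almost 17 %" above.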
Forgive me for reducing the scope to Flanders, the political scope to just one party and the tweets to only three topics, because this blog does not intend to present the full enchilada. I hope we have demonstrated with today’s contribution that topics, and the way they are perceived and handled, can vary greatly in impact and cannot be entirely reduced to numbers. In other words, the human interpreter will deliver added value for quite a long time.
Tuesday 20 May 2014
The Last Mile in the Belgian Elections (III)
Awesome Numbers... Big Data Volumes
Wow, the first results are awesome. Well, er, the first calculations at least are amazing.
- 8,500 tweets per 15 seconds measured means roughly 1.5 billion tweets per month if you extrapolate this in a very rudimentary way.
- 2 KB per tweet = 2.8 terabytes of input data per month if you follow the same reasoning. Nevertheless, it is quite impressive for a small country like Belgium, where Twitter adoption is not on par with the northern countries.
- If you use 55 kilobytes for a model vector of 1,000 features, you generate 77 terabytes of information per month.
- 55 KB is a small vector. A normal feature vector of one million features generates 72 petabytes of information per month.
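The back-of-the-envelope extrapolation can be written out as follows. The sketch assumes a 30-day month, decimal units (1 KB = 1,000 bytes) and a vector size that grows linearly with the number of features. Its results land in the same order of magnitude as the figures above but not exactly on them, since those depend on rounding and unit conventions not spelled out here.

```python
TWEETS_PER_15_SECONDS = 8500
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumes a 30-day month

tweets_per_month = TWEETS_PER_15_SECONDS * SECONDS_PER_MONTH / 15


def monthly_volume(bytes_per_tweet, unit=1000**4):
    """Monthly data volume in units of `unit` bytes (default: terabytes)."""
    return tweets_per_month * bytes_per_tweet / unit


raw_tb = monthly_volume(2_000)             # 2 KB of input per tweet
small_vector_tb = monthly_volume(55_000)   # 55 KB vector, 1,000 features
# Assumes vector size grows linearly: 1,000,000 features -> 55 MB.
large_vector_pb = monthly_volume(55_000_000, unit=1000**5)

print(f"{tweets_per_month / 1e9:.2f} billion tweets per month")    # 1.47
print(f"{raw_tb:.1f} TB of raw input")                             # 2.9
print(f"{small_vector_tb:.1f} TB for 1,000-feature vectors")       # 80.8
print(f"{large_vector_pb:.1f} PB for 1,000,000-feature vectors")   # 80.8
```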
And wading through this sea of data you expect us to come up with results that matter?
Yes.
We did it.
Gender analysis of tweets in the Belgian elections (n = 4977 tweets)
Today we checked the gender differences
The Belgian male Twitter species is clearly more interested in politics than the female variant: only 22 % of the tweets in these 24 hours were written by women, the remaining 78 % by men. This is not because Belgian women are less present on Twitter: overall, 48 % of tweets are from female sources against 52 % from male sources.
Analysing the first training results for irony and sarcasm also shows a male bias. The majority of the sarcastic tweets were written by men: 95 out of 115. Only 50 were detected by the data mining algorithms, so we still have some training to do.
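Expressed as rates, and assuming that all 50 detected tweets belong to the 115 sarcastic ones (the post does not say so explicitly), the sarcasm figures look like this:

```python
sarcastic_total = 115   # sarcastic tweets in the training sample
sarcastic_male = 95     # of which written by men
detected = 50           # flagged by the algorithm (assumed: all correct)

male_share = sarcastic_male / sarcastic_total
recall = detected / sarcastic_total

print(f"male share of sarcastic tweets: {male_share:.1%}")  # 82.6%
print(f"recall of the detector: {recall:.1%}")              # 43.5%
```

A recall below 50 % means more than half of the sarcastic tweets still slip through, which is the training gap mentioned above.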
More news tomorrow!