Thursday 22 May 2014

The Last Mile in the Belgian Elections (V)

Why Sentiment Measures Alone Are Not Enough

While developing SAM (Social Analytics and Monitoring), we learnt something interesting about sentiment analysis. Before we created Data2Action as a platform for data mining and developed SAM, we studied many approaches.

Many of these simply produced a number to express sentiment towards a brand, a person, a concept or a company, to name a few.

Isolated Sentiment Analysis is Meaningless

Such a number is often too superficial to produce meaningful analytic results, so we recreated social constructs that match concepts. Analysing the sentiment of a construct element in the context of a topic is not a trivial task, but at least it approaches human judgement, and it can be trained to increase precision and relevance.

Today, I am not going to amaze you with Big Numbers but I’ll show you some examples of how we approach sentiment analysis with SAM.

Let’s take a few tweets about the N-VA party and examine how they are scored:

The ultimate horror for companies and a torpedo for our welfare state: an anti N-VA coalition with the ecologist party

Another point where N-VA does not represent the Flemish people

From a one-dimensional point of view, both tweets are negative for N-VA but the first is in fact meant as a positive, pro N-VA statement.

Let us look at this more complex tweet:

Vande Lanotte opens up the coalition for the Green Party, wrong move as the voters already consider N-VA strong enough.

The first part of the sentence, “Vande Lanotte opens up the coalition for the Green Party”, can be considered positive for Vande Lanotte and his socialist party sp.a. But the second part is negative. This shows the importance of parsing the sentence correctly and attributing scores as a function of viewpoints.
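To make the difference between a one-dimensional score and a score per viewpoint tangible, here is a minimal Python sketch based on the first two tweets. The scores are hand-assigned for illustration, and the function names `flat_score` and `score_for` are my own; none of this is SAM output or SAM's actual algorithm.

```python
# Hand-assigned, illustrative scores from -1.0 (negative) to +1.0 (positive).
# These are NOT produced by SAM; they only show the data structure.
tweets = [
    {
        "text": "The ultimate horror for companies and a torpedo for our welfare "
                "state: an anti N-VA coalition with the ecologist party",
        "scores": {"N-VA": 0.6, "anti N-VA coalition": -0.8},
    },
    {
        "text": "Another point where N-VA does not represent the Flemish people",
        "scores": {"N-VA": -0.6},
    },
]


def flat_score(tweet):
    """One-dimensional view: average all scores, ignoring who they are about."""
    values = tweet["scores"].values()
    return sum(values) / len(values)


def score_for(tweet, viewpoint):
    """Viewpoint-aware view: the score for one party or person, or None."""
    return tweet["scores"].get(viewpoint)


for tweet in tweets:
    print(f"flat={flat_score(tweet):+.2f}  N-VA={score_for(tweet, 'N-VA'):+.2f}")
```

With these assumed scores, both tweets look negative in the flat view, while the view from N-VA separates the supportive tweet from the critical one.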

Wednesday 21 May 2014

The Last Mile in the Belgian Elections (IV)

How Topic Detection Leads to Share-of-voice Analysis

It was a full day of events on Twitter. Time to make an inventory of the principal topics and the buzz created on the social network in the Dutch speaking community in the north of Belgium.

First, the figures: 10,605 tweets were analysed, of which 5,754 referred to an external link (i.e. a news site or another external web site such as a blog or a photo album).

As the Flemish nationalist party leader Mr. Dewever from N-VA (the New Flemish Alliance in English) launched his appeal to the French speaking community today, we focused on the tweets about, to and from this party.

A mere 282 tweets were deemed relevant for topic analysis. And here’s the first striking number: of these 282 tweets only 16 contained a reactive response.

Tweets that provoked a reactive response are almost nonexistent

About 49 topics grouped several media sources and publications of all sorts. We will discuss three to illustrate how the relationship between topic, retweets, Klout score and added content makes some tweets more relevant than others. These are the three topics:

Dewever addresses the French speaking community via Twitter
Christian Democrat De Clerck falsely accuses N-VA of using fascist symbols in an advertisement
YouTube movie from N-VA is ridiculed by the broad community

Dewever addresses the French speaking community via Twitter

This topic is divided into a moderately positive headline and two neutral ones. The positive: Bart Dewever to the French Speaking Community: “Give N-VA a Chance”

This headline generates a total Klout score of 188, of which the Flemish TV station VRT takes the biggest chunk with 158.

This neutral headline generates a Klout score of only 98: “Dewever puts the struggle between N-VA and the French speaking socialist party at the centre of the discussion”

The other neutral headline, “N-VA President Bart Dewever addresses the French speaking community directly”, delivers a higher Klout score of 140, partly because one of N-VA’s members of Parliament promoted the link to the news medium.

All in all, with a total Klout score of 426, this topic does not cause great ripples, especially compared with the second topic, which is a mere anecdote.

Christian Democrat De Clerck falsely accuses N-VA of using fascist symbols in an advertisement

On the left, the swastika hoax with the Christian Democrat's comment; on the right, the original ad showing a labyrinth

Felix De Clerck, son of the former Christian Democrat Minister of Justice Stefaan De Clerck, reacted to a hoax and was chastised for doing so. With a Klout score of 967, this caused a bigger stir, although its political relevance is a lot smaller than that of Dewever’s speech… Emotions can play a role even in simple and neutral retweets.

YouTube movie from N-VA is ridiculed by the broad community

Another of the day’s highs was reached by an amateurish YouTube movie, a parody of a famous Flemish detective series meant to highlight the major issues of the campaign. This product from the candidates in West Flanders, including the Flemish Minister of Interior Affairs, Geert Bourgeois, generated a total Klout score of 778 from tweets and retweets with negative or sarcastic comments.

Yet an adjacent topic about a cameraman from Bruges who is surprised by Minister Bourgeois’ enthusiasm generates a moderately positive Klout score of 123.

Three topics out of 49 generate 20.6 % of total klout scores!

This illustrates perfectly how the Twitter community selects and reinforces topics that carry an emotional value: the YouTube movie and the hoax from De Clerck together generated a share of voice of almost 17% of the tweets.
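For those who like to check the sums, here is a small Python sketch that adds up the Klout scores quoted above per topic and ranks the topics. The shortened topic labels and the function names are mine. The grand total over all 49 topics is not listed in this post, so `share` takes it as a parameter instead of assuming a value.

```python
# Klout scores per headline, copied from the figures quoted in this post.
topics = {
    "Dewever addresses the French speaking community": [188, 98, 140],
    "De Clerck and the swastika hoax": [967],
    "N-VA YouTube movie": [778],
}


def topic_totals(topics):
    """Sum the Klout scores of all headlines within each topic."""
    return {name: sum(scores) for name, scores in topics.items()}


def ranked(topics):
    """Topics sorted from highest to lowest total Klout score."""
    totals = topic_totals(topics)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def share(part, total):
    """Percentage share of `part` in `total` (total must be supplied by the caller)."""
    return 100.0 * part / total


for name, total in ranked(topics):
    print(f"{total:5d}  {name}")
```

The politically most relevant topic ends up last in this ranking, which is exactly the point of the post.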

Forgive me for reducing the scope to Flanders, the political scope to just one party and the tweets to only three topics; this blog does not intend to present the full enchilada. I hope today’s contribution has demonstrated that topics, and the way they are perceived and handled, can vary greatly in impact and cannot be entirely reduced to numbers. In other words, the human interpreter will deliver added value for quite a long time.

Tuesday 20 May 2014

The Last Mile in the Belgian Elections (III)

Awesome Numbers... Big Data Volumes

Wow, the first results are awesome. Well, er, the first calculations at least are amazing.

8500 tweets measured per 15 seconds means 1.5 billion tweets per month if you extrapolate this in a very rudimentary way...
2 KB per tweet = 2.8 terabytes of input data per month if you follow the same reasoning. This is quite impressive for a small country like Belgium, where Twitter adoption is not on par with the northern countries.
If you use 55 kilobytes for a model vector of 1000 features, you generate 77 terabytes of information per month.
55 KB is a small vector. A normal feature vector of one million features generates 72 petabytes of information per month.
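For readers who want to redo these back-of-the-envelope sums, here is a minimal Python sketch. It assumes a 30-day month, binary units (1 KB = 1024 bytes) and a vector size that grows linearly with the number of features; the function and variable names are mine, for illustration only. Depending on rounding and unit conventions, the results land close to, but not exactly on, the figures quoted above.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month

KB = 1024          # assumption: binary units
TB = 1024 ** 4
PB = 1024 ** 5


def tweets_per_month(tweets, window_seconds):
    """Naive linear extrapolation of a short measurement window to a month."""
    return tweets * SECONDS_PER_MONTH // window_seconds


def monthly_bytes(tweets_month, bytes_per_tweet):
    """Total volume if every tweet takes `bytes_per_tweet` bytes."""
    return tweets_month * bytes_per_tweet


monthly = tweets_per_month(8500, 15)
raw = monthly_bytes(monthly, 2 * KB)               # 2 KB of input data per tweet
small_vectors = monthly_bytes(monthly, 55 * KB)    # 1,000 features per vector
# Assumption: a vector with 1,000 times more features is 1,000 times larger.
large_vectors = monthly_bytes(monthly, 55 * KB * 1000)

print(f"{monthly:,} tweets per month")
print(f"{raw / TB:.1f} TB of input data")
print(f"{small_vectors / TB:.1f} TB of 1,000-feature vectors")
print(f"{large_vectors / PB:.1f} PB of 1,000,000-feature vectors")
```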

And wading through this sea of data you expect us to come up with results that matter?
Yes.
We did it.

Male versus female tweets in Belgian Elections

Gender analysis of tweets in the Belgian elections n = 4977 tweets

Today we checked the gender differences

The Belgian male Twitter species is clearly more interested in politics than the female variant: only 22 % of the tweets from the past 24 hours came from women; the remaining 78 % came from men.
This is not because Belgian women are less present on Twitter: overall, 48 % of tweets come from female sources against 52 % from male sources.
Analysing the first training results for irony and sarcasm also shows a male bias. The majority of the sarcastic tweets were male: 95 out of 115. Only 50 were detected by the data mining algorithms, so we still have some training to do.
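The sarcasm figures above can be turned into percentages with a few lines of Python. This sketch assumes that the 115 sarcastic tweets were identified by hand and that the 50 detected tweets are a subset of them, which would make 50 out of 115 the detection rate (recall); the variable names are mine.

```python
def percentage(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


sarcastic_total = 115  # sarcastic tweets in the sample (assumed hand-labelled)
sarcastic_male = 95    # of which written by men
detected = 50          # found by the data mining algorithms (assumed subset)

male_share = percentage(sarcastic_male, sarcastic_total)
recall = percentage(detected, sarcastic_total)

print(f"Male share of sarcastic tweets: {male_share:.1f} %")
print(f"Detected by the algorithms:     {recall:.1f} %")
```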
More news tomorrow!

Monday 19 May 2014

The Last Mile in the Belgian Elections (II)

Getting Started

I promised to report on my activities in social analytics. For this report, I will try to put myself in the shoes of a novice user and report on this emerging discipline without holding anything back. I explicitly use the word “emerging” because it shows all the signs: technology enthusiasts will have no problem overlooking the quirks that prevent an easy end-to-end “next-next-next” solution. Because there is no user-friendly wizard that can guide you from selecting the sources, setting up the target and creating the filters to optimising the analytics for cost, sample size, relevance and validity checks, I will have to go through the entire process in an iterative and sometimes trial-and-error way.

This is how massive amounts of data enter the file system

Over the weekend and today I have mostly been busy doing just that. Tweet intakes ranged from taking in 8,500 Belgian tweets in 15 seconds and filtering locally on our in-memory database, to pushing all filters to the source system and getting 115 tweets in an hour. But finally, we arrived at an optimum query result and the Belgian model can be trained. The first training we will set up is detecting sarcasm and irony. With properly developed and tested algorithms we hope for 70% accuracy in finding tweets that express exactly the opposite sentiment of what the words say. Tweets like “well done, a**hole” are easy to detect; it is the ones without the reference to that important part of the human digestive system that are a little harder.

The cleaned output is ready for the presentation layer

Conclusion of this weekend and today: don’t start social analytics like any data mining or statistical project, because taming social media data is an order of magnitude harder than crunching the numbers in stats.

Let’s all cross our fingers and hope we can come up with some relevant results tomorrow.

Wednesday 14 May 2014

The Last Mile in the Belgian Elections

Sentiment Analysis, a Predictor of the Outcome?

Data2Action is an agile data mining platform consisting of efficiently integrated components for rapid application development. One deliverable of Data2Action is SAM, for Social Analytics and Monitoring.

In the coming days, I will publish the daily results from sentiment analysis on Twitter with regard to the programmes, the major candidates and interest groups.

Stay tuned for the first report on Monday 19th May

We will address questions like:

Which media produce the most negative or positive tweets about which party, which major candidate?
Who are the major influencers on Twitter?
What are the tweets with the highest impact?

The major networks will stimulate lots of tweets this weekend so we will present the analysis next Monday.

Saturday 3 May 2014

What has Immanuel Kant got to do with it??

Making a Success of New BI Tool Introduction

In the previous post I indicated the five major reasons why BI consultants fail to introduce a new BI tool in the organisation. As promised, I have not just raised questions; I am also ready to provide you with some answers.

Some of my colleagues in Business Intelligence commented on the LinkedIn discussion forum. I will quote their comments and integrate them in this post.

It is all about embedding the tool in a larger setting, larger than the competences of one BI specialist.

Some people won’t like to read this. The reason is simple: positioning the BI tool in a very broad, organisation wide vision goes beyond the competences of a technical project lead. The approach requires teamwork and input of business analysts, strategic consultants and change managers. It requires more time and budget and both are scarce resources in an organisation.

But if you look at the wasted time and money in remedial efforts to get the new BI tool on the road, you can consider the extra effort and resources as an insurance premium. Because you can only make a first impression once.

These are the seven steps to successful introduction I will address in the article on my book site BA4BI:

* Get a deep insight into the organisation’s DNA

* Understand its strategy

* Understand its information needs

* Assess the information modelling acceptance in the organisation

* Translate the above into the tool’s requirements

* Introduce the tool

* Develop the decision making culture with the new tool

In the article on my book site, I elaborate on these seven steps. Click here for more information.

Friday 2 May 2014

Questions to Ask Ralph Kimball the 10th June in 't Spant in Bussum (Neth.)

Dear Ralph,

I know you’re a busy man so I won’t take too much of your time to read this post. I look forward to meeting you on June 10 in 't Spant in Bussum for an in-depth session on Big Data and your views on the phenomenon.

In one of your keynotes you will address your vision on how Big Data drives the Business and IT to adapt and evolve. Let me first of all congratulate you on the title of your keynote. It proves that a world class BI and data warehouse veteran is still on top of things, which we can’t say for some other gurus of your generation, but let’s not dwell on that.

I have been studying the Big Data Phenomenon from my narrow perspective: business analysis and BI architecture and here are some of the questions I hope we can tackle during your keynote session:

1. Do you consider Big Data as something you can fit entirely in star schemas? I know since The Data Webhouse Toolkit days that semi-structured data like web logs can find a place in a multidimensional model, but some Big Data output is, to my knowledge, not fit for persistent storage. Yet I believe that a derived form of persistent storage may be a good idea. Let me give you an example. Imagine we can measure the consumer mood for a certain brand on a daily basis by scanning social media postings. Instead of creating a junk-like dimension we could build a star schema with, at a minimum, the following dimensions: mood, social media source, time, location and brand, and a central fact table holding the mood score on a seven-point Likert scale. The real challenge will lie in correctly converting the text strings into the proper Likert score using advanced text analytics. Remember the wrong interpretation of the Osama Bin Laden tweets in early May 2011? The program interpreted “death” as a negative mood when the entire US was cheering the expedient demise of the terrorist.

Figure 1: An example of derived Big Data in a multidimensional schema

2. How will you address the volatility issue? Big Data’s most convincing feature is not volume, velocity or variety, which have always been relative to the state of the art. No, volatility is what really characterizes Big Data, and I refer to my article here where I point out that Big Behavioural Data is the true challenge for analytics, as emotions and intentions can be very volatile and the Law of Large Numbers may not always apply.

3. Do you see a case for data provisioning strategies to address the above two issues? By data provisioning, I mean a transparent layer between the business questions and the BI architecture where ETL provides answers to routine or planned business questions and virtual data marts produce answers to ad hoc and unplanned business questions. If so, what are the major pitfalls of virtualization for Big Data Analytics?

4. Do you see the need for new methodologies and new modeling methods or does the present toolbox suffice?
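
To make the star schema from question 1 a little more concrete, here is a minimal sketch in Python with SQLite. All table and column names and the three sample rows are hypothetical; it only illustrates the shape of the model (five dimensions around a fact table with a seven-point mood score), not a production design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical dimension tables: mood, source, time, location, brand.
CREATE TABLE dim_mood     (mood_key INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE dim_source   (source_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_brand    (brand_key INTEGER PRIMARY KEY, name TEXT);

-- Central fact table: one row per scored posting.
CREATE TABLE fact_mood (
    mood_key     INTEGER REFERENCES dim_mood(mood_key),
    source_key   INTEGER REFERENCES dim_source(source_key),
    time_key     INTEGER REFERENCES dim_time(time_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    brand_key    INTEGER REFERENCES dim_brand(brand_key),
    mood_score   INTEGER CHECK (mood_score BETWEEN 1 AND 7)  -- Likert scale
);

-- Invented sample rows, for illustration only.
INSERT INTO dim_mood     VALUES (1, 'negative'), (2, 'positive');
INSERT INTO dim_source   VALUES (1, 'Twitter');
INSERT INTO dim_time     VALUES (1, '2014-05-02');
INSERT INTO dim_location VALUES (1, 'Belgium');
INSERT INTO dim_brand    VALUES (1, 'Brand A');
INSERT INTO fact_mood    VALUES (1, 1, 1, 1, 1, 2),
                                (2, 1, 1, 1, 1, 6),
                                (2, 1, 1, 1, 1, 7);
""")

# Average daily mood per brand: the derived, persistent form of the raw postings.
rows = conn.execute("""
    SELECT b.name, t.day, AVG(f.mood_score)
    FROM fact_mood f
    JOIN dim_brand b ON b.brand_key = f.brand_key
    JOIN dim_time  t ON t.time_key  = f.time_key
    GROUP BY b.name, t.day
""").fetchall()
print(rows)
```

The hard part, turning raw text into the right Likert score, stays outside this sketch; the schema only stores the result.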

It’s been a while since we met and I really look forward to meeting you in Bussum, whether you answer these questions or not.

Kind regards,

Bert Brijs