
Monday 27 November 2023

Governing the Data Ingestion Process

“Data lakehousing” is all about good housekeeping of your data. There is, of course, room for ungoverned data in a quarantine area, but if you want to make use of the structured and especially the semi-structured and unstructured data, you had better govern the influx of data before your data lake becomes a swamp producing no value whatsoever.

Three data flavours need three different treatments

Structured data are relatively easy to manage: profile the data, look for referential integrity failures, outliers, free text that may need categorising, etc. In short: harmonise the data with the target model, which can be one or more unrelated tables or a set of data marts, to produce meaningful analytical data.
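
As an illustration, the sketch below shows what this kind of profiling can look like in Python with pandas; the table and column names (orders, customers, customer_id, amount, delivery_note) are purely hypothetical.

    # A minimal profiling sketch over two hypothetical extracts
    import pandas as pd

    orders = pd.read_csv("orders.csv")        # hypothetical fact extract
    customers = pd.read_csv("customers.csv")  # hypothetical reference table

    # Referential integrity: orders whose customer_id has no match in customers
    orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]

    # Outliers: flag amounts more than three standard deviations from the mean
    mean, std = orders["amount"].mean(), orders["amount"].std()
    outliers = orders[(orders["amount"] - mean).abs() > 3 * std]

    # Free text that may need categorising: inspect the most frequent raw values
    print(orders["delivery_note"].value_counts().head(20))
    print(f"{len(orphans)} referential integrity failures, {len(outliers)} outliers")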

Semi-structured data demand a data pipeline that can combine the structured aspects of clickstream or log file analysis with the less structured parts like search terms. It also takes care of matching IP addresses with geolocation data, since ISPs sometimes sell blocks of IP ranges to colleagues abroad.
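
A minimal sketch of such a pipeline step is shown below, assuming Apache-style access logs; the IP-to-country table is a stand-in for a real, regularly refreshed geolocation database.

    # Combine the structured fields of a log line with the less structured
    # search terms, and map the IP address to a country.
    import re
    from urllib.parse import urlparse, parse_qs

    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<url>\S+) HTTP/[\d.]+" (?P<status>\d+)'
    )

    # Hypothetical lookup; in practice this comes from a geolocation database
    # that is refreshed regularly, precisely because IP ranges change owners.
    IP_TO_COUNTRY = {"192.0.2.10": "BE", "198.51.100.7": "NL"}

    def parse_line(line: str) -> dict:
        match = LOG_PATTERN.match(line)
        if not match:
            return {}
        record = match.groupdict()
        query = parse_qs(urlparse(record["url"]).query)
        record["search_terms"] = query.get("q", [""])[0]   # less structured part
        record["country"] = IP_TO_COUNTRY.get(record["ip"], "unknown")
        return record

    line = '192.0.2.10 - - [27/Nov/2023:10:01:02 +0100] "GET /search?q=data+lakehouse HTTP/1.1" 200'
    print(parse_line(line))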

Unstructured data like text from social media, e-mails, blog posts, documents and the like need more complex treatment. It’s all about finding structure in these data. Preparing them for text mining involves a long series of disambiguation steps to get from text input to meaningful output (a small pipeline sketch follows the list below):

  • Tokenization of the input is the process of splitting a text object into smaller chunks known as tokens. These tokens can be single words or word combinations, characters, numbers, symbols, or n-grams.
  • Normalisation of the input: separating prefixes and/or suffixes from the morpheme to obtain the base form, e.g. unnatural -> nature
  • Lemmatisation: reduce certain word forms to their lemma, e.g. the infinitive of a conjugated verb
  • Tag parts of speech with their grammatical function: verb, adjective, etc.
  • Parse words as a function of their position and type
  • Check for modality and negations: “could”, “should”, “must”, “maybe”, etc… express modality
  • Disambiguate the sense of words: “very” can be both a positive and a negative term in combination with whatever follows
  • Semantic role labelling: determine the function of the words in a sentence: is the subject an agent or the subject of an action in “I have been treated for hepatitis B”? What is the goal or the result of the action in “I sold the house to a real estate company”?
  • Named entity recognition: categorising text into pre-defined categories like person names, organisation names, location names, time denominations, quantities, monetary values, titles, percentages,…
  • Co-reference resolution: when two or more expressions in a sentence refer to the same object: “Bert bought the book from Alice but she warned him, he would soon get bored of the author’s style as it was a tedious way of writing.” In this sentence, “him” and “he” refer to “Bert”, “she” refers to “Alice” while “it” refers to “the author’s style”.
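
The sketch below illustrates a few of these steps (tokenization, lemmatisation, part-of-speech tagging and named entity recognition), assuming spaCy and its small English model en_core_web_sm are installed; word sense disambiguation, semantic role labelling and co-reference resolution need additional components that are not shown here.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Bert bought the book from Alice, but she warned him about the author's style.")

    # Tokenization, lemmatisation and part-of-speech tagging
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.dep_)

    # Named entity recognition: person names, organisations, locations, ...
    for ent in doc.ents:
        print(ent.text, ent.label_)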

What architectural components support these treatments?

The first two data types can be handled with the classical Extract, Transform and Load or Extract, Load and Transform pipelines, in short: ETL or ELT. We refer to ample documentation about these processes in the footnote below[1].

But for processing unstructured data, you need to develop classifiers, thesauri and ontologies that represent your “knowledge inventory” as a reference model for the text analytics. This takes a lot of resources and careful analysis to make sure you end up with a complete, yet practical set of tools to support named entity recognition.
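
As a rough illustration, the sketch below models a minimal thesaurus structure in Python: concepts with a preferred label, alternative labels (synonyms) and broader/narrower relations. The class and term names are illustrative, not a prescribed metamodel.

    from dataclasses import dataclass, field

    @dataclass
    class Concept:
        pref_label: str
        alt_labels: set = field(default_factory=set)       # synonyms
        broader: list = field(default_factory=list)        # parent concepts
        narrower: list = field(default_factory=list)       # child concepts

        def add_narrower(self, child: "Concept") -> None:
            self.narrower.append(child)
            child.broader.append(self)

    # A tiny fragment of a knowledge inventory used as a reference model
    # for named entity recognition
    organisation = Concept("organisation")
    company = Concept("company", alt_labels={"firm", "enterprise"})
    organisation.add_narrower(company)

    print(company.broader[0].pref_label)  # -> organisation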

The conclusion is straightforward: the less structure is predefined in your data, the more effort in data governance is needed.

 

An example of a thesaurus metamodel

[1] Three reliable sources, each with their nuances and perspectives on ETL/ELT:

https://aws.amazon.com/what-is/etl/

https://www.ibm.com/topics/etl

https://www.snowflake.com/guides/what-etl

Friday 23 May 2014

The Last Mile in the Belgian Elections (VI)

Are Twitter People Nice People?


The answer is: “it depends”. In this article I build a taxonomy of tweets from the last week of the Belgian elections. Based on over 35,000 tweets, we can be fairly confident that this is a representative sample. You can consider this article an introduction to tomorrow’s headline: the last election poll, based on Twitter analytics.

A picture says more than a thousand tweets

The taxonomy of the Twitter community

So here it is. The majority of tweets are negative. When you encounter positive tweets, they are either from somebody who wants to market something (in the case of the elections, themselves or a candidate they support) or from somebody who is forwarding a link with a positive comment.
There is a correlation between the level of negativity about a subject and the political party related to the subject. From a political point of view, the polarisation between the Walloon socialist party and the Flemish nationalist party is clearly visible on Twitter.
Even today, at the funeral of the well-respected politician of the older generation, former Belgian prime minister Jean-Luc Dehaene, the majority of tweets were negative. Tweets linking him to the financial scandal of the Christian democrat trade union in Dexia were six times more numerous than the pious "RIP JLD" variants.
So how do you derive popularity and even arrive at some predictive value from a bunch of negative tweets?  That, my dear blog readers, will be examined tomorrow in the final article. 





Monday 19 May 2014

The Last Mile in the Belgian Elections (II)

Getting Started


I promised to report on my activities in social analytics. For this report, I will try to wear the shoes of a novice user and report, without holding anything back, on this emerging discipline. I explicitly use the word “emerging” because it has all the hallmarks of one: technology enthusiasts will have no problem overlooking the quirks that prevent an easy end-to-end “next-next-next” solution. Because there is no user-friendly wizard to guide you through selecting the sources, setting up the target, creating the filters and optimising the analytics for cost, sample size, relevance and validity checks, I will have to go through the entire process in an iterative and sometimes trial-and-error way.
This is how massive amounts of data enter the file system
Over the weekend and today I have been mostly busy doing just that. Tweet intakes ranged from taking in 8,500 Belgian tweets in 15 seconds and doing the filtering locally on our in-memory database, to pushing all filters to the source system and getting 115 tweets in an hour. But finally we arrived at an optimal query result, and the Belgian model can be trained. The first training we will set up is detecting sarcasm and irony. With properly developed and tested algorithms we hope for 70% accuracy in finding tweets that express exactly the opposite sentiment of what the words say. Tweets like “well done, a**hole” are easy to detect, but it’s the ones without an explicit reference to that important part of the human digestive system that are a little harder.
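
To make the trade-off concrete, here is a minimal sketch of the two extremes; the real source API and the sarcasm/irony classifier are not shown, and the keyword list and query syntax are purely illustrative.

    # Option 1: broad intake, filter locally (fast intake, heavy local work).
    # Option 2: push the filters to the source (far fewer tweets per hour).
    KEYWORDS = {"verkiezingen", "elections", "stemmen"}

    def local_filter(tweets):
        """Keep only tweets whose words overlap with the keyword list."""
        return [t for t in tweets if KEYWORDS & set(t["text"].lower().split())]

    def source_side_query():
        """Build a query string to hand to the source system instead."""
        return " OR ".join(sorted(KEYWORDS))

    # Illustrative use with fake data standing in for the streamed intake
    sample = [{"text": "Goed gedaan met de verkiezingen vandaag"},
              {"text": "lunch time"}]
    print(len(local_filter(sample)), "tweet(s) kept after local filtering")
    print("source-side query:", source_side_query())
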
The cleaned output is ready for the presentation layer
Conclusion of this weekend and today: don’t approach social analytics like any other data mining or statistical project, because taming social media data is an order of magnitude harder than crunching the numbers in classical statistics.

Let’s all cross our fingers and hope we can come up with some relevant results tomorrow.

Wednesday 14 May 2014

The Last Mile in the Belgian Elections

Sentiment Analysis, a Predictor of the Outcome?


Data2Action is an agile data mining platform consisting of efficiently integrated components for rapid application development. One deliverable of Data2Action is SAM, for Social Analytics and Monitoring.
In the coming days, I will publish the daily results of sentiment analysis on Twitter with regard to the programmes, the major candidates and the interest groups.

Data2Action and social analytics

Stay tuned for the first report on Monday 19 May.

We will look into questions like:

  • Which media produce the most negative or positive tweets about which party, which major candidate?
  • Who are the major influencers on Twitter?
  • What are the tweets with the highest impact?

The major networks will trigger lots of tweets this weekend, so we will present the analysis next Monday.