Business Analysis for Business Intelligence Blog: BA4BIBlog

Monday 5 January 2026

The 2025 GenAI SitRep

As 2026 announces itself as the year in which “the hombres will be separated from the niños”, I’d like to look back at the evolution of GenAI and present a few predictions for the next two to three years.

When ChatGPT was launched in November 2022, it immediately raised high hopes for a new wave in machine learning, making the interaction with neural networks more intuitive through natural language. As time progressed, the models were trained on ever-increasing data volumes, because the incumbents saw this as the only way to win the platform war and told their investors about first-mover advantage to raise funds at surreal valuations. Now they are finding out two things. One: size is not the distinguishing factor in the platform competition; the competition is heading towards purpose-built and well-curated models. Two: there is no platform competition, because the platforms were already there: Microsoft integrating ChatGPT in Office 365 and Google doing the same with Gemini in its product suite.

Other platforms like Meta, shopping platforms like Amazon and Alibaba, or banking applications either promote home-grown LLMs or use one or more of the hundreds of thousands of open-access models they can find on Hugging Face.

Nevertheless, one aspect of platform economics is emerging: the creation of exit barriers. Once you’ve chosen a platform to integrate into your business applications, the SDKs and the models used and improved by RAG and other means (which we are working on) will increase switching costs.

The first three years of GenAI were largely about overpromising and underdelivering. I have not checked whether Gartner already talks about the trough of disillusionment, but I have met a few clients where disillusion has set in. Organisations that reduced their after-sales service staff and replaced them with chatbots had to reverse their decision. We all know of lawyers being fined for producing phony legal precedents. No longer than a couple of weeks ago, a judge in California fined plaintiffs’ law firm Hagens Berman, one of its partners, and another lawyer a combined $13,000 for the “misuse” of artificial intelligence in several court filings in a lawsuit against the parent company of adult content social media site OnlyFans.

Many GenAI offers are “inside out”, like when a top official of ChatGPT bluntly stated that users will need to learn what they can do with the product. To use Lee James and Warren Schirtzinger’s concept of the marketing chasm, later popularised by Geoffrey Moore, we are very much pleasing the techies while the pragmatists are left out in the cold, waiting on the other side of the chasm for a useful application that caters to their needs. That doesn’t absolve the pragmatists from the obligation to examine the impact of GenAI on their organisation, their business processes and their talent management. But today, little effort is made to support them in that endeavour.

Forget about network effects, it’s all about trusted data

The real battle will be fought on the data terrain. And this is where our knowledge modelling methodology comes into play. Its output combines the strengths of “older” machine learning techniques with state-of-the-art LLM technologies and prepares domain knowledge for reliable use in a well-defined, professional context.

It is also about scope management. There is a simple law, empirically confirmed over and over, that illustrates the correlation between consistency and scope of information: the smaller the scope, the higher the consistency.

Let me give you two examples to illustrate this thesis.

In the accounting domain, the concept of cash flow is unequivocally defined as the money that flows in and out of a business, and it is measured from different perspectives: operations, investing and financing. Cash flow is an important metric for determining the value of a business. But a business valuation in itself is a fuzzier concept, composed of various, sometimes contradictory metrics, because not only quantitative metrics like free cash flow, debt-to-equity ratio, price/earnings ratio, etc. are under scrutiny. There are many qualitative or fuzzier metrics and evaluations, like risk, customer loyalty, market position, growth potential and brand preference, that are equally important to this concept.

In the supply chain domain, Solventure’s Bram De Smet pointed out in his book “Supply Chain Strategy and Financial Metrics” that crisply defined metrics like cash, cost and service level become a dynamic and fluid mix as a function of your business strategy. Cash, cost and service level may be opposing forces in realising an optimum outcome to support the strategic direction. The author leverages Treacy & Wiersema’s model of value disciplines, showing how product leadership, customer intimacy, or operational excellence each imply different preferred positions on the service–cost–cash triangle.

Crawford & Matthews’ five value drivers (price, access, service, product, experience) are used to further link market strategy choices to supply chain and financial design.

What we need in GenAI is a method, models and software that support both the low level, crisp and well defined knowledge building blocks while producing meaningful concepts that combine these building blocks in a reliable way. Imagine on top of that, we can implement de Bono’s lateral thinking as a proxy for creativity…

The S-curve, where are we in GenAI?

GenAI technology started out as an expensive platform war with ever-increasing investments in GPUs, data volumes and training efforts. But contrary to the classic S-curve evolution, it was adopted rapidly and widely, as the major players seeded the market via free, universally available chatbots in a browser or an app, with Microsoft making a move with Copilot and Google introducing Gemini in its Google Search engine. True to the S-curve’s pattern, improvement is slow while the fundamental concepts are being figured out. This is where we are today.

When the period of rapid innovation and massive adoption will follow is hard to predict, but there are signs it won’t take too long. I keep seeing more and more colleagues investigating real-life use cases instead of treating GenAI as a glorified search engine or an evolved autocomplete. I think that before 2028 we will see some real killer apps, like streaming music and video, e-commerce and social media were for the Internet era. Instead of a top-down movement, fuelled by massive and even “incestuous” investments comparable to Lernout and Hauspie bloating its revenue figures during the Internet bubble, we will see bottom-up initiatives on these platforms delivering measurable value.

We hope the application we are working on will one day be one of GenAI’s killer apps.

Friday 22 August 2025

Yogi Berra on GenAI

The inimitable Yogi Berra already knew it: “Predictions are very hard, especially about the future.” So please accept my apologies in advance, as my prediction will likely be inaccurate in terms of both timing and scope when it comes to the transition we are about to witness in the field of Generative AI and its associated market.

The launch of ChatGPT-5 has demonstrated that “parameter escalation” is subject to the law of diminishing returns. Greenfield startups such as OpenAI are burning cash at an astonishing rate, and at some point a shake-out will inevitably occur. The giants—OpenAI (backed and integrated by Microsoft) and Gemini (from Google)—are well positioned to survive. Their model performance is increasingly shaped not merely by parameter counts, but also by architecture, training data quality, deployment efficiency, and—crucially—the fact that their AI functions are embedded into widely used software ecosystems, creating high entry barriers for competitors. What will become of the many other models hosted on Hugging Face remains an open question.

LLM users are already aware that parameter escalation alone cannot eliminate the remaining 1.5–2% hallucination rate. They are also keenly aware that their interactions contribute intellectual property to the models, often without compensation. Furthermore, they know that open models are vulnerable to prompt injections, nonsensical outputs, and coordinated reputation attacks by adversaries.

There is, therefore, a market for closed models built on validated data curated by experts. Beyond pattern recognition, such systems will deliver genuine problem-solving capabilities. Subject matter experts will play a central role in improving data quality by uploading validated content, stress-testing outputs with thousands of questions, and leveraging Retrieval-Augmented Generation (RAG) to enhance reliability. The logical next step will be to integrate rule-based algorithms for decision support, bringing us full circle to the early expert systems of the 1970s, such as MYCIN, a pioneering rule-based system for diagnosing bacterial infections and recommending antibiotics.

Ultimately, these systems will evolve toward mimicking human reasoning and judgment through first-principles thinking.

Beneath the surface of press releases aimed at inflating incumbents’ P/E ratios or maximizing the IPO valuations of new entrants, something more substantial is brewing.

My bold prediction is this: in the long run, closed models will generate more value than today’s “gorillas,” who are largely providing the infrastructure for them.

And so, one day, another of Yogi Berra’s paradoxical dicta may well come true:

“Nobody goes there anymore. It’s too crowded.”

Tuesday 4 February 2025

Data Governance for GenAI

Introduction

In this article I will define what data governance (DG) for Generative Artificial Intelligence (GenAI) is and how it differs from DG as we have known it for decades in the world of transaction systems (OLTP) and analytical systems (OLAP and Data Mining).

In a second post, I will make the case for DG based on the use case at hand and illustrate a few GenAI DG use cases that are feasible and fitting the patterns and the framework.

Die “Umwertung aller Werte” (the revaluation of all values)

The German philosopher Friedrich Nietzsche postulated that all existing values should be discarded and replaced by values that -up to now- were considered unwanted.

This is what comes to mind when I examine some GenAI use cases and look at the widely accepted data governance policies, rules and practices.

Here are the old values that will be replaced:

• Establish data standards;

• The data model as a contract;

• Data glossary and data lineage, the universal truths;

• Data quality, data consistency and data security enforcement;

• Data stewardship based on a subject area.

Establish data standards

As the DAMA-DMBOK states: Data standards and guidelines include naming standards, requirement specification standards, data modelling standards, database design standards, architecture standards, and procedural standards for each data management function.

This approach couldn’t be further away from DG for GenAI. Data standards are mostly about “spelling” which has very low impact on semantics. The syntactical aspects of data standards are more in the realm of tagging where subject matter experts provide standardised meaning to various syntactical expressions. So we can have tagging standards for supervised learning, but even those can depend on “the eye of the beholder”, i.e. the use case.

OK, we can have discussions about which language model and vector database is the best fit for the use case at hand but it will be a continued trial and error process before we have optimised the infrastructure and it certainly won’t be a general recommendation for all use cases.

And as for the requirement specification standards, as long as they don’t kill the creativity needed to deal with GenAI, I’ll give them a pass, since this is not always a linear process to identify business needs for information and data. The greatest value in GenAI lies in discovering answers to questions you didn’t dream of asking.

The data model as a contract

fig. 1: Requirements constitute a contract for the data model. Governing this contract is relatively easy.

This principle works fine for transaction systems and classic Business Intelligence data architectures, where a star schema or a data vault models the world view of the stakeholders. In GenAI, the only contract is the aforementioned tagging and metadata specifications that make sure the data are exploitable.

Data glossary and data lineage, universal truths?

No longer. The use case context will determine the glossary and the lineage if there are intermediate steps involved before the data are accessible. Definitions may change as a function of context as well as data transformations to prepare them for the task at hand.

Data quality, data consistency and data security enforcement

In old-school data governance policies, data quality (DQ) is first about complying with specs, and only then does “fit for purpose” come in as the deciding criterion, as I described in Business Analysis for Business Intelligence(1):

Data quality for BI purpose is defined and gauged with reference to fitness for purpose as defined by the analytical use of the data and complying with three levels of data quality as defined by:

[Level 1] database administrators

[Level 2] data warehouse architects

[Level 3] business intelligence analysts

On level 1, data quality is narrowed down to data integrity or the degree to which the attributes of an instance describe the instance accurately and whether the attributes are valid, i.e. comply with defined ranges or definitions managed by the business users. This definition remains very close to the transaction view.

On level 2, data quality is expressed as the percentage completeness and correctness of the analytical perspectives. In other words, to what degree is each dimension, each fact table complete enough to produce significant information for analytical purpose? Issues like sparsity and spreads in the data values are harder to tackle. Timeliness and consistency need to be controlled and managed on the data warehouse level.

On level 3, data quality is the measure in which the available data are capable of adequately answering the business questions. Some use the criterion of accessibility with regards to the usability and clarity of the data. Although this seems a somewhat vague definition, it is most relevant to anyone with some analytical mileage on his odometer. I remember a vast data mining project in a mail order company producing the following astonishing result: 99.9% of all dresses sold were bought by women!
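The level 2 notion of “percentage completeness” can be made concrete with a few lines of Python. This is only a minimal sketch: the function name `completeness`, the toy `dim_customer` rows and the rule that `None` and the empty string count as missing are all assumptions made for the illustration, not part of the quoted definition.

```python
from typing import Iterable, Mapping


def completeness(rows: Iterable[Mapping], column: str) -> float:
    """Percentage (0-100) of rows where `column` is present and not empty.

    Assumption for this sketch: None and "" both count as missing.
    """
    rows = list(rows)
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return 100.0 * filled / len(rows)


# Hypothetical customer dimension with a few gaps.
dim_customer = [
    {"customer_id": 1, "segment": "retail", "postcode": "2000"},
    {"customer_id": 2, "segment": None, "postcode": "9000"},
    {"customer_id": 3, "segment": "wholesale", "postcode": ""},
    {"customer_id": 4, "segment": "retail", "postcode": "1000"},
]

for column in ("customer_id", "segment", "postcode"):
    print(column, completeness(dim_customer, column))
```

Whether 75% completeness on a dimension attribute is “complete enough” remains a judgement for the analytical use at hand, which is exactly the point of level 3.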

In GenAI, we can pay little attention to the aforementioned level 1 while emphasizing the higher-level aspects of data quality. And there, the true challenge lies in testing the validity of three interacting aspects of GenAI data: quality, quantity and density. As mentioned above: quality in the sense of “fit-for-use-case”, reducing bias and detecting trustworthy sources; quantity by guaranteeing sufficient data to include all patterns, expected and unexpected; and finally density, to make sure the language model can deliver meaningful proximity measures between the concepts in the data set.

Data stewardship based on a subject area

fig. 2: Like a football steward, a data steward must also control the crowd to prevent chaos

A business data steward, according to DAMA, is a knowledge worker and business leader recognized as a subject matter expert who is assigned accountability for the data specifications and data quality of specifically assigned business entities, subject areas or databases, who will: (…)

4. Ensure the validity and relevance of assigned data model subject areas

5. Define and maintain data quality requirements and business rules for assigned data attributes.

It is clear that this definition needs adjustments. Here is my concept of a data steward for GenAI data:

It is, of course, a knowledge worker who is familiar with the use case that the GenAI data set needs to satisfy. This may be a single subject matter expert (SME), but in the majority of cases he or she will be the coach and integrator of several SMEs to grasp the complexity of the data set under scrutiny. He or she will be more of a data quality gauge than a DQ prescriber and, together with the analysts and language model builders, will take measures to enhance the quality of the output rather than investing too much effort in the input. Let me explain this before any misconceptions arise. The SME asks the system questions to which he or she knows the answer, checks the output and uses RAG to improve the answer. If the SME detects substandard conciseness across the answers, he or she may work on the chunk size of the input, but without changing the input itself. Meanwhile, some developers are working on automated feedback learning loops that will improve the performance of the SME; as you can imagine, coming up with all sorts of questions and evaluating the answers is a time-consuming task.
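To make “working on the chunk size without changing the input” tangible, here is a minimal Python sketch. The function `chunk_words`, its word-based splitting and the `overlap` parameter are assumptions chosen for the illustration; real RAG pipelines often chunk by tokens, sentences or document structure instead.

```python
from typing import List


def chunk_words(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `size` words.

    Consecutive chunks share `overlap` words. The source text itself is
    never altered; only the way it is cut up changes.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Hypothetical input document, chunked in two different ways.
document = "a b c d e f g h i j"
print(chunk_words(document, size=4))             # coarse, no overlap
print(chunk_words(document, size=4, overlap=2))  # same size, overlapping
```

The SME can then rerun the same set of known-answer questions against each chunking variant and compare the output, leaving the input documents untouched.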

In conclusion

Today, GenAI is more about enablement than control. It prioritises the creative use of data while ensuring ethical and transparent use of it. Especially in Explainable Artificial Intelligence (XAI) this approach is enabled in full. I refer to a previous blog post on this subject.

Since unstructured data like documents and images are in scope, a more flexible and adaptive metadata management is key. AI is now itself being used to monitor and implement data policies. Tools like Alation, Ataccama and Alex Solutions have done groundbreaking work in this area and Microsoft’s Purview is -as always- catching up. New challenges emerge: ensuring quality and accuracy is not always feasible with images and integrating data from diverse sources and in diverse unstructured formats is also a challenge.

The more new use cases we develop for GenAI, the more we experience a universal law: data governance for GenAI is use case based, as we will show in the next blog post. This calls for a flexible and adaptive governance framework that monitors, governs and enforces rules for its use (but not too strictly, unless it’s about privacy). In other words, the same data set may be subject to various, clearly distinguishable governance frameworks, dictated by the use case.

________________________________________________________________________________

(1) Business Analysis for Business Intelligence, CRC Press 2012, Bert Brijs p. 272

Monday 2 December 2024

About Data Governance, Trends and Interpretations

The following graphs illustrate the opinions of 270 ICT professionals ranging from CEOs and CIOs to enterprise, solution and data architects, lead programmers, digital transformation managers and BI engineers, data scientists as well as DBAs and meta data managers. You name it, any job title in the survey is present. Of course, this is just a photo taken between February and October 2024: it was hard work reaching out to the relevant interviewee types but here are the results. Your remarks are welcome!

The graph below is music to my governance ears: a large majority supports the duopolistic governance model, which is crucial if you want to survive in a volatile market. Arguments like “efficiency” and “authority” still support the business and ICT monarchies, though, because to these governance models mutual adjustment is just a waste of time, at least in their perception…

If you take a “follow the money” approach, you get confirmation: a majority of IT professionals considers it a duopolistic issue:

Half of the respondents consider AI’s introduction as inevitable. Although much of its use is still “autocomplete on steroids” replacing a Google search and money is being burnt faster than ever in any ICT innovation, the first use cases are going to the market. (Stay tuned, we’re also working on one).

Artificial Intelligence: hype or reality?

Again, a strong majority of ICT professionals see data and applications moving to the Cloud. That is by no means to ignore the privacy issues with data hosted by the major US Cloud providers. But EU-based initiatives like Open Telekom Cloud, or OVHcloud in France, which has high-ticket customers like Auchan, Louis Vuitton and Société Générale, are beginning to show up on the radar. We notice that multi-cloud strategies are emerging to avoid calamities like the Office 365 outage on 25 November this year.

Although most of the interviewees had heard about robotic process automation (RPA), the combo “Process and Task Mining” was not as high on the agenda as expected. With an ageing workforce and a demographic collapse in the near future, the EU should invest every penny in automation.

With tools like Mendix, but also process and task mining tools like Celonis or analytical applications like KNIME Analytics Platform or Dataiku, one would think that more ICT professionals would appreciate the productivity gains of low-code tools. In this photo, they’re somewhat sitting on the fence. Is it because developers remain faithful to their tools and are unwilling to abandon their skill set to acquire a new one?

Mendix, KNIME Analytics Platform: low code tools

Amazing: most professionals embrace the Cloud, but they’re not prepared to accept the logical consequence of Cloud architectures. Although complex to implement, zero trust is well-suited for remote work, cloud-based networking, and hybrid environments.

I confess, this last question was sort of a lie detector. And judging from the answers, not too many respondents were making things up.

Of course, the Low Countries show a disproportionately large number of respondents but still, 63% are outside our home market:

About the survey

Between February and October 2024, we had to invest heavily in contacting the right ICT professionals to get meaningful answers. It took us quite a few mails, phone calls and even visits to obtain these results. Are these results representative? I am not sure. Are they significant? Maybe. Are they inspiring? Most certainly, as we use them in our ICT Literacy course to get the conversation going about business–ICT alignment. Give us your opinion in the comments.

The survey is still open if you wish to contribute.

Thursday 4 April 2024

Why XAI will be the Next Big Thing

Nothing new under the sun

Or should I use the expression “plus ça change, plus c’est la même chose”? Because what’s at stake in large language models (LLM) like ChatGPT4 and others is the trade off between the model’s fit, its accuracy on the one hand and transparency, interpretability, and explainability on the other. This dilemma is as old as classical statistics: a simple regression model may be inaccurate but it is easily readable for end users without a large background in statistics.

Anyone can read from a graphical representation that there is a correlation between the office surface, the location class and the office rent. But high dimensional analysis results from a neural network are less transparent and interpretable, let alone explainable.

The same goes for LLMs: there is no way a human domain expert can fathom the multitude of weight matrices used in determining the syntax relationship between words.

About hard to detect hallucinations

Anyone can see nonsense coming out of ChatGPT, like responses inconsistent with the prompt. But what about pure fiction presented in a factually consistent and convincing way? If the end user is not a domain expert, he will have trouble recognising such output as fiction.

The mitigation is called RAG (Retrieval Augmented Generation). It’s a technique that enables experts to add their own data to the prompt and ensure more precise generative AI output. But… then we’re missing the whole point of generative AI: to enable a broader audience than domain experts to do tasks for which they have had little or no training or education.

Domain expertise is needed in most cases

Generating marketing and advertising content may work for low-level copy like catalogue texts, but I doubt it will deliver the sort of ads you find on Ads of the World https://www.adsoftheworld.com/

I grant you the use case of enhancing the shopping experience as a “domainless” knowledge generator. But most use cases like drug discovery, health care, finance and stock market trading or urban design to name a few require domain knowledge to prevent accidents from happening.

ChatGPT has serious issues with accuracy

Only 7% of the citations were accurate!

Take health care: a 2023 study by Bhattacharyya et al.[1] identified an astonishing number of errors in references to medical research. Among these references, 47% were fabricated, 46% were authentic but inaccurate, and only 7% were authentic and accurate. A friend of mine, a medical practitioner, was already frustrated by people googling their symptoms and entering his practice with the diagnosis and the treatment; with this tool I fear his frustrations will only increase… Many more examples can be found in other domains[2].

Hallucinations galore in Generative AI

Another evolution in AI is about moving away from tagging by experts and replacing this process by using Self-supervised Learning (SSL, no, not the network encryption protocol). Today’s applications in medicine produce impressive results but again, this approach still requires medical expertise. In the context of generative AI, self-supervised learning can be particularly useful for pre-training models on large amounts of unlabelled data before fine-tuning them on specific tasks. By learning to predict certain properties or transformations of the data, such as predicting missing parts of an image (inpainting) or reconstructing corrupted text (denoising), the model can develop a rich understanding of the data distribution and capture meaningful features that can then be used for generating new content.

Enter XAI

The European Regulation on Artificial Intelligence (AI Act)[3], which is in the final implementation process, is a serious argument for avoiding sorcerer’s apprentices. Especially for high-risk AI applications, such as those used in healthcare, transportation, and law enforcement, the AI Act imposes strict requirements, including data quality, transparency, robustness, and human oversight. Additionally, the Act prohibits certain AI practices deemed unacceptable, such as social scoring and systems that manipulate human behaviour or exploit vulnerabilities.

This will foster the use of explainable AI, at least for domains where existing legislation already requires transparency, e.g. Sarbanes-Oxley, HIPAA and others. Professionals in banking and insurance, public servants deciding on subsidies and grants, and HR professionals evaluating CVs are just a few of the primary beneficiaries of XAI.

They will need models where humans can understand how the algorithm works and tweak it to test its sensitivity. By doing so, they will get a better understanding of how the model came up with a certain result.

In short, XAI models may be simpler but better governed and they will grow in usability as new increments are added to the existing knowledge base. As we speak, sector specific general models are being developed, ready for enhancing them with your specific domain knowledge.

[1] High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content.

Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE.

Cureus. 2023 May 19;15(5):e39238. doi: 10.7759/cureus.39238. eCollection 2023 May.

PMID: 37337480 Free PMC article.

[2] Athaluri SA, Manthena SV, Kesapragada VSRKM, Yarlagadda V, Dave T, Duddumpudi RTS. Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References. Cureus. 2023 Apr 11;15(4):e37432. doi: 10.7759/cureus.37432. PMID: 37182055; PMCID: PMC10173677.

[3] Source: https://ceridap.eu/the-impact-of-the-ai-act-on-public-authorities-and-on-administrative-procedures/?lng=en

Friday 2 February 2024

Geospatial Data Warehouse and Spatially Enabled Data Warehouse: Turf Wars or Symbiosis?

Over the course of more than 25 years, I have been involved in numerous discussions between the GIS buffs and my tribe, the data warehouse team over where geospatial information should reside.

A geospatial map of Narnia. What attributes should reside in the geospatial system?

Since almost all measures have a location aspect, the spatial data warehouse was promoted as the single source of truth, able to visualize data in an unparalleled way whereas the opponents stated that all you need is to define a good location dimension and the data warehouse could do without the expensive software and the scarce resources in the geospatial domain.

I will spare you the avalanche of technical arguments back and forth between the two, leading to tugs of war between the teams and I propose an approach from the business user’s point of view.

The essential question is: “What information am I looking for?” Is it about one or more measures that need to be put in context using a majority of dimensions outside the geospatial domain, even if these include DimLocation? Or is it exclusively related to questions such as “What happened or happens in this particular location, i.e. at this point, line or polygon?”, “What are the measures within a radius of point (x,y) on the map?” or “What is the intersection between location A and location B as far as measure Z is concerned?”

It is clear that in the first case, the performance and cost of a classical data warehouse with a location dimension will prove to be the better choice. But if location is the point of entry to a query, then the spatial data warehouse is the smartest tool in the shed.
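A question like “What are the measures within a radius of point (x,y)?” is where location becomes the point of entry. The Python sketch below is a rough illustration only: it assumes the facts carry latitude and longitude columns, uses the haversine great-circle distance on a spherical Earth, and scans every row, whereas a real geospatial system would rely on spatial indexes and proper geometries. The function names, the `facts` rows and the sales figures are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def measures_within_radius(facts, lat, lon, radius_km):
    """Return the fact rows whose location lies within radius_km of (lat, lon)."""
    return [f for f in facts
            if haversine_km(lat, lon, f["lat"], f["lon"]) <= radius_km]


# Hypothetical fact rows: one measure (sales) per store location.
facts = [
    {"store": "Brussels", "lat": 50.8503, "lon": 4.3517, "sales": 120},
    {"store": "Antwerp", "lat": 51.2194, "lon": 4.4025, "sales": 95},
    {"store": "Ghent", "lat": 51.0543, "lon": 3.7174, "sales": 60},
    {"store": "Amsterdam", "lat": 52.3676, "lon": 4.9041, "sales": 150},
]

nearby = measures_within_radius(facts, 50.8503, 4.3517, radius_km=45)
print([f["store"] for f in nearby], sum(f["sales"] for f in nearby))
```

In a classical data warehouse the same question would require precomputing distances or region memberships in the location dimension, which is why this type of query tends to favour the geospatial side.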

Symbiosis is the way forward

There are many reasons why having the two environments side by side makes sense. For executive and managerial information based on structured data, the data warehouse has proven to be the platform of choice and will continue to be so. For location-based analysis, the geospatial data warehouse outperforms it. At the same time, the geospatial data warehouse is much closer to operational analytics, and it can even be a part of operational applications like CRM, SCM or any other OLTP system.

To enable symbiosis, the location dimension needs some connection to the geospatial system. Some plead for a simple snapshot of a shapefile, some want a full duplication of all geospatial data and their timestamp. The latter may lead to an avalanche of data as any little correction of the shape on the GIS system will send new time stamped data. This can’t be a workable situation. Either the snapshot ignores updates but takes in the original GIS object ID to secure a trace or it overwrites any location data and keeps the last version as an active one. Because the only objective here is to provide a path to analysts who need a deeper geospatial analysis of one or more measurements registered in the data warehouse.

Let's open the debate

I am sure I have missed a few points here and there. Let me know your position on this issue via the comments or a personal message via the contact form on our website: https://www.linguafrancaconsulting.eu/

This is one of the topics of our course "ICT Focus Areas for Board Members & Management"

Monday 27 November 2023

Governing the Data Ingestion Process

“Data lakehousing” is all about good housekeeping of your data. There is, of course, room for ungoverned data in a quarantine area, but if you want to make use of the structured and especially the semi-structured and unstructured data, you’d better govern the influx of data before your data lake becomes a swamp producing no value whatsoever.

Three data flavours need three different treatments

Structured data are relatively easy to manage: profile the data, look for referential integrity failures, outliers, free text that may need categorising etc… In short: harmonise the data with the target model which can be one or more unrelated tables or a set of data marts to produce meaningful analytical data.
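Two of the profiling checks mentioned above, referential integrity failures and outliers, can be sketched in plain Python. The helper names `orphan_keys` and `iqr_outliers`, the sample rows and the 1.5 × IQR rule of thumb are assumptions for this illustration; a real profiling tool would offer many more checks.

```python
import statistics


def orphan_keys(fact_rows, dim_keys, fk):
    """Foreign-key values in the facts that have no match in the dimension."""
    return sorted({row[fk] for row in fact_rows if row[fk] not in dim_keys})


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], a common rule of thumb."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical order facts and customer dimension keys.
orders = [
    {"order_id": 1, "customer_id": 1, "amount": 10},
    {"order_id": 2, "customer_id": 2, "amount": 12},
    {"order_id": 3, "customer_id": 9, "amount": 11},   # unknown customer
    {"order_id": 4, "customer_id": 1, "amount": 13},
    {"order_id": 5, "customer_id": 2, "amount": 12},
    {"order_id": 6, "customer_id": 3, "amount": 11},
    {"order_id": 7, "customer_id": 3, "amount": 10},
    {"order_id": 8, "customer_id": 9, "amount": 500},  # suspicious amount
]
customer_keys = {1, 2, 3}

print(orphan_keys(orders, customer_keys, "customer_id"))
print(iqr_outliers([row["amount"] for row in orders]))
```

Whether an orphan key or an outlier is an error or a genuine business event is a decision for the data steward, not for the script.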

Semi structured data demand a data pipeline that can combine the structured aspects of clickstream or log files analysis with the less structured parts like search terms. It also takes care of matching IP addresses with geolocation data since ISPs sometimes sell blocks of IP ranges to colleagues abroad.
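The IP-to-geolocation matching step can be sketched with Python’s standard `ipaddress` module. Everything specific here is an assumption: the `GEO_RANGES` table uses reserved documentation ranges with invented country codes, and the “most specific block wins” rule is one possible way to handle a sub-block that was resold to a provider in another country.

```python
import ipaddress

# Hypothetical lookup table: (network, country code). The /25 entry models
# a sub-block of the first range that was resold abroad.
GEO_RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), "BE"),
    (ipaddress.ip_network("192.0.2.128/25"), "FR"),
    (ipaddress.ip_network("198.51.100.0/24"), "NL"),
]


def country_for(ip, ranges=GEO_RANGES):
    """Return the country of the most specific matching block, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [(net.prefixlen, country)
               for net, country in ranges if addr in net]
    # The longest prefix is the most specific block, so it wins.
    return max(matches)[1] if matches else None


# Hypothetical clickstream records enriched with a country.
clicks = [
    {"ip": "192.0.2.10", "search_term": "data vault"},
    {"ip": "192.0.2.200", "search_term": "star schema"},
    {"ip": "203.0.113.5", "search_term": "lakehouse"},
]
for click in clicks:
    click["country"] = country_for(click["ip"])
print(clicks)
```

Keeping the range table under governance (who updates it, how often, from which source) matters more than the lookup code itself, since stale ranges silently produce wrong locations.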

Unstructured data like text files from social media, e-mails, blog posts, documents and the like need more complex treatment. It’s all about finding structure in these data. Preparing these data for text mining means a lot of disambiguation steps to get from text input to meaning output:

• Tokenization of the input: the process of splitting a text object into smaller chunks known as tokens. These tokens can be single words or word combinations, characters, numbers, symbols, or n-grams.
• Normalisation of the input: stripping prefixes and/or suffixes from a word to arrive at its base form, e.g. unnatural -> nature
• Lemmatisation: reduce certain word forms to their lemma, e.g. the infinitive of a conjugated verb
• Part-of-speech tagging: tag words with their grammatical function: verb, adjective,…
• Parsing: parse words as a function of their position and type
• Check for modality and negations: “could”, “should”, “must”, “maybe”, etc. express modality
• Disambiguate the sense of words: “very” can be both a positive and a negative term in combination with whatever follows
• Semantic role labelling: determine the function of the words in a sentence: is the subject an agent or the subject of an action in “I have been treated for hepatitis B”? What is the goal or the result of the action in “I sold the house to a real estate company”?
• Named entity recognition: categorising text into pre-defined categories like person names, organisation names, location names, time denominations, quantities, monetary values, titles, percentages,…
• Co-reference resolution: when two or more expressions in a sentence refer to the same object: “Bert bought the book from Alice but she warned him, he would soon get bored of the author’s style as it was a tedious way of writing.” In this sentence, “him” and “he” refer to “Bert”, “she” refers to “Alice” while “it” refers to “the author’s style”.
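The first and the sixth step of the list above, tokenization (including n-grams) and the check for modality and negation, can be sketched with nothing but Python’s standard library. This is a deliberately naive illustration: the regular expression, the word lists `MODAL_TERMS` and `NEGATIONS` and the function names are assumptions, and the later steps (part-of-speech tagging, named entity recognition, co-reference resolution) need trained models or a dedicated NLP library that is not shown here.

```python
import re

# Hypothetical, far from complete word lists for the illustration.
MODAL_TERMS = {"could", "should", "must", "maybe", "might"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def flag_modality_and_negation(tokens):
    """List the modal and negation terms found, in order of appearance."""
    lowered = [t.lower() for t in tokens]
    return {
        "modal": [t for t in lowered if t in MODAL_TERMS],
        "negation": [t for t in lowered if t in NEGATIONS],
    }


sentence = "You should not sell the house, maybe."
tokens = tokenize(sentence)
print(tokens)
print(ngrams(tokens, 2)[:3])
print(flag_modality_and_negation(tokens))
```

Even this toy version shows why governance is needed: someone has to own and maintain the word lists, and decide how contractions, multi-word expressions and other languages are handled.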

What architectural components support these treatments?

The first two data types can be handled with the classical Extract, Transform and Load or Extract, Load and Transform pipelines, in short: ETL or ELT. We refer to ample documentation about these processes in the footnote below[1].

But for processing unstructured data, you need to develop classifiers, thesauri and ontologies to represent your “knowledge inventory” as reference model for the text analytics. This takes up a lot of resources and careful analysis to make sure you come up with a complete, yet practical set of tools to support named entity recognition.

The conclusion is straightforward: the less structure is predefined in your data, the more data governance effort is needed.

An example of a thesaurus metamodel

[1] Three reliable sources, each with their nuances and perspectives on ETL/ELT:

https://aws.amazon.com/what-is/etl/

https://www.ibm.com/topics/etl

https://www.snowflake.com/guides/what-etl