Thursday, April 4, 2024

Why XAI Will Be the Next Big Thing

ICT Literacy for Managers, a one-day course on October 23, 2024
This post is part of the material that will be discussed in “ICT Literacy for Managers” on October 23 in Antwerp.

Nothing new under the sun

Or should I use the expression “plus ça change, plus c’est la même chose”? Because what is at stake in large language models (LLMs) like ChatGPT-4 and others is the trade-off between the model’s fit and accuracy on the one hand, and transparency, interpretability, and explainability on the other. This dilemma is as old as classical statistics: a simple regression model may be less accurate, but it is easily readable for end users without a strong background in statistics.

Anyone can read from a graphical representation that there is a correlation between office surface, location class and office rent. But high-dimensional analysis results from a neural network are far less transparent and interpretable, let alone explainable.
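To make that contrast concrete, here is a minimal sketch of the interpretable end of the trade-off, using scikit-learn. The office-rent figures and the feature encoding are invented for illustration; the point is only that the fitted coefficients can be read directly.

```python
# A minimal sketch of the interpretability argument: a linear model whose
# coefficients can be read off directly. Data and encoding are hypothetical.
from sklearn.linear_model import LinearRegression

# Hypothetical offices: [surface in m2, location class (1 = prime, 3 = periphery)]
X = [[120, 1], [80, 2], [200, 1], [60, 3], [150, 2], [90, 3]]
y = [3600, 1900, 5800, 1100, 3400, 1500]  # monthly rent in EUR

model = LinearRegression().fit(X, y)

# Each coefficient answers a plain question: how much does the rent change
# per extra square metre, and per step down in location class?
print(f"EUR per m2: {model.coef_[0]:.0f}")
print(f"EUR per location-class step: {model.coef_[1]:.0f}")
```

No such direct reading exists for the millions of weights in a deep network trained on the same data.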

The same goes for LLMs: there is no way a human domain expert can fathom the multitude of weight matrices involved in determining the syntactic relationships between words.

About hard-to-detect hallucinations

Anyone can spot obvious nonsense coming out of ChatGPT, such as responses inconsistent with the prompt. But what about pure fiction presented in a factually consistent and convincing way? If the end users are not domain experts, they will have trouble recognising such output for what it is.

The mitigation is called RAG (Retrieval Augmented Generation). It is a technique that enables experts to add their own data to the prompt and ensure more precise generative AI output. But… then we are missing the whole point of generative AI: to enable an audience broader than domain experts to perform tasks for which they have had little or no training or education.
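For readers unfamiliar with the mechanics, here is a minimal sketch of the RAG pattern. The bag-of-words “embedding” and the example documents are placeholders; a real system would use a proper embedding model, a vector store and an actual LLM call at the end.

```python
# A minimal RAG sketch: retrieve the most relevant documents, then prepend
# them to the prompt. The toy embedding stands in for a real embedding model.
import numpy as np

documents = [
    "Policy A covers water damage up to 10,000 EUR.",
    "Policy B excludes flood damage entirely.",
    "Claims must be filed within 30 days of the incident.",
]

vocab = sorted({w for d in documents for w in d.lower().split()})

def embed(text: str) -> np.ndarray:
    """Toy bag-of-words vector; a placeholder for a real embedding model."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    scores = [q @ embed(d) / (np.linalg.norm(q) * np.linalg.norm(embed(d)) + 1e-9)
              for d in documents]
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

question = "Is flood damage covered?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this augmented prompt would then be sent to the LLM
```

Note that someone still has to curate the documents being retrieved, which is exactly where the domain expert re-enters the picture.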

Domain expertise is needed in most cases

Generating marketing and advertising content may work for low-level copy like catalogue texts, but I doubt it will deliver the sort of ads you find on Ads of the World (https://www.adsoftheworld.com/).

I grant you the use case of enhancing the shopping experience as a “domainless” knowledge generator. But most use cases, such as drug discovery, health care, finance and stock market trading, or urban design, to name a few, require domain knowledge to prevent accidents from happening.

ChatGPT has serious issues with accuracy
Only 7% of the citations were accurate!


Take health care: a 2023 study by Bhattacharyya et al.[1] identified an astonishing number of errors in references to medical research. Among these references, 47% were fabricated, 46% were authentic but inaccurate, and only 7% were authentic and accurate. A friend of mine, a medical practitioner, was already frustrated by people googling their symptoms and entering his practice with diagnosis and treatment in hand; with this tool I fear his frustrations will only increase… Many more examples can be found in other domains[2].

Hallucinations galore in generative AI

Another evolution in AI is the move away from tagging by experts, replacing that process with self-supervised learning (SSL; no, not the network encryption protocol). Today’s applications in medicine produce impressive results, but again, this approach still requires medical expertise. In the context of generative AI, self-supervised learning is particularly useful for pre-training models on large amounts of unlabelled data before fine-tuning them on specific tasks. By learning to predict certain properties or transformations of the data, such as predicting missing parts of an image (inpainting) or reconstructing corrupted text (denoising), the model develops a rich understanding of the data distribution and captures meaningful features that can then be used for generating new content.
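To illustrate why no expert tagging is needed, here is a minimal sketch of the corruption step of a denoising objective: the labels are created from the data itself. The sentence and mask rate are arbitrary, and the model that would learn to reconstruct the targets is out of scope.

```python
# A minimal sketch of a self-supervised denoising objective: corrupt the
# input by masking tokens, then train a model to reconstruct the originals.
import random

def mask_tokens(text: str, mask_rate: float = 0.15, seed: int = 42):
    """Return (corrupted tokens, reconstruction targets) for one sentence."""
    rng = random.Random(seed)
    tokens = text.split()
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append("[MASK]")
            targets[i] = tok  # the model must predict this from context
        else:
            corrupted.append(tok)
    return corrupted, targets

corrupted, targets = mask_tokens(
    "the patient was given a low dose of aspirin after the procedure")
print(corrupted)  # input to the model
print(targets)    # labels derived from the data itself, no expert tagging
```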

Enter XAI

The European Regulation on Artificial Intelligence (the AI Act)[3], which is in the final stages of implementation, is a serious argument for avoiding sorcerer’s apprentices. Especially for high-risk AI applications, such as those used in healthcare, transportation, and law enforcement, the AI Act imposes strict requirements, including data quality, transparency, robustness, and human oversight. Additionally, the Act prohibits certain AI practices deemed unacceptable, such as social scoring systems that manipulate human behaviour or exploit vulnerabilities.

This will foster the use of explainable AI, at least in domains where existing legislation already requires transparency, e.g. Sarbanes-Oxley, HIPAA and others. Professionals in banking and insurance, public servants deciding on subsidies and grants, and HR professionals evaluating CVs are just a few of the primary beneficiaries of XAI.

They will need models whose workings humans can understand, and which they can tweak to test their sensitivity. By doing so, they will get a better understanding of how the model came up with a certain result.
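A minimal sketch of that “tweak it and see” idea: nudge one input at a time and observe how the predictions move. The model, features and data below are hypothetical; in practice one would reach for established tooling such as scikit-learn’s permutation importance or SHAP.

```python
# A minimal sensitivity test: perturb one feature at a time and watch the
# average prediction shift. Model, features and data are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # three hypothetical input features
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)

baseline = model.predict(X).mean()
for j in range(X.shape[1]):
    X_tweaked = X.copy()
    X_tweaked[:, j] += 1.0      # nudge one feature at a time
    shift = model.predict(X_tweaked).mean() - baseline
    print(f"feature {j}: average prediction shift {shift:+.2f}")
```

A feature whose nudge barely moves the output is one the model is insensitive to; a large shift flags a driver the user should scrutinise.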

In short, XAI models may be simpler, but they are better governed, and they will grow in usability as new increments are added to the existing knowledge base. As we speak, sector-specific general models are being developed, ready to be enhanced with your specific domain knowledge.



[1] Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content. Cureus. 2023 May 19;15(5):e39238. doi: 10.7759/cureus.39238. PMID: 37337480.

[2] Athaluri SA, Manthena SV, Kesapragada VSRKM, Yarlagadda V, Dave T, Duddumpudi RTS. Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References. Cureus. 2023 Apr 11;15(4):e37432. doi: 10.7759/cureus.37432. PMID: 37182055; PMCID: PMC10173677.

Friday, February 2, 2024

Geospatial Data Warehouse and Spatially Enabled Data Warehouse: Turf Wars or Symbiosis?

Over the course of more than 25 years, I have been involved in numerous discussions between the GIS buffs and my tribe, the data warehouse team, over where geospatial information should reside.

A geospatial map of Narnia. What attributes should reside in the geospatial system?

Since almost all measures have a location aspect, the spatial data warehouse was promoted as the single source of truth, able to visualise data in an unparalleled way. The opponents countered that all you need is a well-defined location dimension, and the data warehouse could do without the expensive software and the scarce resources of the geospatial domain.

I will spare you the avalanche of technical arguments back and forth between the two camps, which led to tugs of war between the teams, and instead propose an approach from the business user’s point of view.

The essential question is: “What information am I looking for?” Is it about one or more measures that need to be put in context using mostly dimensions outside the geospatial domain, even if those include DimLocation? Or is it exclusively related to questions like “What happened or happens at this particular location, i.e. at this point, line or polygon?”, “What are the measures within a radius of point (x, y) on the map?” or “What is the intersection of location A and location B as far as measure Z is concerned?”

It is clear that in the first case, a classical data warehouse with a location dimension will prove the better choice in performance and cost. But if location is the point of entry to a query, then the spatial data warehouse is the smartest tool in the shed.
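To give a feel for the location-as-entry-point case, here is a minimal sketch of a radius query using the shapely library. The store locations, revenues and radius are invented; a real spatial data warehouse would run this as an indexed spatial query rather than a Python loop.

```python
# A minimal sketch of "which measures fall within a radius of point (x, y)?"
# All data is hypothetical; shapely handles the geometry.
from shapely.geometry import Point

# Hypothetical fact records: (store id, location, monthly revenue in EUR)
facts = [
    ("S1", Point(4.40, 51.22), 120_000),
    ("S2", Point(4.42, 51.20), 95_000),
    ("S3", Point(4.70, 51.05), 80_000),
]

centre, radius = Point(4.41, 51.21), 0.05  # radius in degrees, illustration only

within = [(sid, rev) for sid, loc, rev in facts
          if loc.distance(centre) <= radius]
print(within)  # the measures for stores inside the radius
```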

Symbiosis is the way forward

There are many reasons why the two environments both make sense. For executive and managerial information based on structured data, the data warehouse has proven to be the platform of choice and will continue to be. For location-based analysis, the geospatial data warehouse outperforms the classical data warehouse. At the same time, it is much closer to operational analytics, and it can even be part of operational applications like CRM, SCM or any other OLTP system.

To enable symbiosis, the location dimension needs some connection to the geospatial system. Some plead for a simple snapshot of a shapefile; others want a full duplication of all geospatial data with their timestamps. The latter may lead to an avalanche of data, as every little correction of a shape in the GIS system will send new time-stamped data; that is not a workable situation. Either the snapshot ignores updates but takes in the original GIS object ID to secure a trace, or it overwrites any location data and keeps the last version as the active one. After all, the only objective here is to provide a path for analysts who need a deeper geospatial analysis of one or more measures registered in the data warehouse.
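A minimal sketch of the second option, keeping only the last version while preserving the trace. The field names and record layout are hypothetical; the essential design choice is that the GIS object ID stays as a stable key back into the geospatial system.

```python
# A minimal sketch of "overwrite, keep the last version, keep the trace":
# each snapshot replaces the shape, but the original GIS object ID is
# preserved as the stable link back to the GIS. Field names are hypothetical.
dim_location = {}  # keyed by GIS object ID

def load_snapshot(gis_records: list[dict]) -> None:
    for rec in gis_records:
        dim_location[rec["gis_object_id"]] = {
            "gis_object_id": rec["gis_object_id"],  # trace to the GIS system
            "name": rec["name"],
            "shape": rec["shape"],  # latest version only, no history kept
        }

load_snapshot([{"gis_object_id": "G-001", "name": "District North",
                "shape": "POLYGON((...))"}])
# A later shape correction simply overwrites the active version:
load_snapshot([{"gis_object_id": "G-001", "name": "District North",
                "shape": "POLYGON((... corrected ...))"}])
print(dim_location["G-001"]["shape"])
```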

Let's open the debate

I am sure I have missed a few points here and there. Let me know your position on this issue via the comments, or send a personal message via the contact form on our website: https://www.linguafrancaconsulting.eu/

This is one of the topics of our course “ICT Focus Areas for Board Members & Management”.