Tuesday 4 February 2025

Data Governance for GenAI

Introduction

In this article I will define what data governance (DG) for Generative Artificial Intelligence (GenAI) is and how it differs from DG as we have known it for decades in the world of transaction systems (OLTP) and analytical systems (OLAP and Data Mining).

In a second post, I will make the case for DG based on the use case at hand and illustrate a few GenAI DG use cases that are feasible and that fit the patterns and the framework.

Die “Umwertung aller Werte”

The German philosopher Friedrich Nietzsche postulated that all existing values should be discarded and replaced by values that, up to now, were considered unwanted.

This is what comes to mind when I examine some GenAI use cases and look at the widely accepted data governance policies, rules and practices.

Here are the old values that will be replaced:

Establish data standards;

The data model as a contract;

Data glossary and data lineage, the universal truths;

Data quality, data consistency and data security enforcement;

Data stewardship based on a subject area.

Establish data standards

As the DAMA DMBOK states: “Data standards and guidelines include naming standards, requirement specification standards, data modelling standards, database design standards, architecture standards, and procedural standards for each data management function.”

This approach couldn’t be further from DG for GenAI. Data standards are mostly about “spelling”, which has very little impact on semantics. The syntactical aspects of data standards are more in the realm of tagging, where subject matter experts provide standardised meaning to various syntactical expressions. So we can have tagging standards for supervised learning, but even those can depend on “the eye of the beholder”, i.e. the use case.

OK, we can debate which language model and vector database best fit the use case at hand, but it will be a continued trial-and-error process before the infrastructure is optimised, and the outcome certainly won’t be a general recommendation for all use cases.

And as for requirement specification standards: as long as they don’t kill the creativity needed to deal with GenAI, I’ll give them a pass, since identifying business needs for information and data is not always a linear process. The greatest value in GenAI lies in discovering answers to questions you didn’t dream of asking.

The data model as a contract
Data as a contract, an example

fig. 1: Requirements constitute a contract for the data model. Governing this contract is relatively easy.

This principle works fine for transaction systems and classic Business Intelligence data architectures, where a star schema or a data vault models the world view of the stakeholders. For GenAI, by contrast, the only contract is the aforementioned tagging and metadata specifications that make sure the data are exploitable.

Data glossary and data lineage, universal truths?

No longer. The use case context determines the glossary, and the lineage if there are intermediate steps before the data are accessible. Definitions may change as a function of context, as may the data transformations that prepare the data for the task at hand.

Data quality, data consistency and data security enforcement

In old-school data governance policies, data quality (DQ) is first about complying with specs, and only then does “fit for purpose” come in as the deciding criterion, as I described in Business Analysis for Business Intelligence (1):

Data quality for BI purpose is defined and gauged with reference to fitness for purpose as defined by the analytical use of the data and complying with three levels of data quality as defined by:

[Level 1] database administrators

[Level 2] data warehouse architects

[Level 3] business intelligence analysts

On level 1, data quality is narrowed down to data integrity or the degree to which the attributes of an instance describe the instance accurately and whether the attributes are valid, i.e. comply with defined ranges or definitions managed by the business users. This definition remains very close to the transaction view.

On level 2, data quality is expressed as the percentage completeness and correctness of the analytical perspectives. In other words, to what degree is each dimension, each fact table complete enough to produce significant information for analytical purposes? Issues like sparsity and spreads in the data values are harder to tackle. Timeliness and consistency need to be controlled and managed at the data warehouse level.

On level 3, data quality is the measure in which the available data are capable of adequately answering the business questions. Some use the criterion of accessibility with regard to the usability and clarity of the data. Although this seems a somewhat vague definition, it is most relevant to anyone with some analytical mileage on their odometer. I remember a vast data mining project in a mail order company producing the following astonishing result: 99.9% of all dresses sold were bought by women!

In GenAI, we can pay little attention to the aforementioned level 1 while emphasising the higher-level aspects of data quality. And there, the true challenge lies in testing the validity of three interacting aspects of GenAI data: quality, quantity and density. As mentioned above: quality in the sense of “fit for use case”, reducing bias and detecting trustworthy sources; quantity by guaranteeing sufficient data to include all patterns, expected and unexpected; and finally density, making sure the language model can deliver meaningful proximity measures between the concepts in the data set.
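To make the density aspect a little more concrete, here is a minimal sketch of such a proximity check over a sample of concepts. It assumes the embeddings come from whatever language model you selected for the use case; the embed() function below is a random placeholder so the sketch runs end to end, not a real API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: in practice this is a call to the embedding model chosen
    # for the use case; a deterministic pseudo-embedding stands in here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Proximity measure between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A sample of concepts from the data set under scrutiny.
concepts = ["invoice", "credit note", "payment", "holiday request"]
vectors = {c: embed(c) for c in concepts}

# Density check: average proximity between the concepts in the sample.
scores = [cosine_similarity(vectors[a], vectors[b])
          for i, a in enumerate(concepts) for b in concepts[i + 1:]]
print(f"average proximity: {sum(scores) / len(scores):.2f}")
```

If the average proximity stays close to noise level, the data set is probably too sparse for the model to deliver meaningful answers for that use case.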

Data stewardship based on a subject area



fig. 2: Like a football steward, a data steward must also control the crowd to prevent chaos

A business data steward, according to DAMA, is “a knowledge worker and business leader recognized as a subject matter expert who is assigned accountability for the data specifications and data quality of specifically assigned business entities, subject areas or databases, who will: (…)

4. Ensure the validity and relevance of assigned data model subject areas

5. Define and maintain data quality requirements and business rules for assigned data attributes.

It is clear that this definition needs adjustments. Here is my concept of a data steward for GenAI data:

It is, of course, a knowledge worker who is familiar with the use case that the GenAI data set needs to satisfy. This may be a single subject matter expert (SME), but in the majority of cases he or she will be the coach and integrator of several SMEs in order to grasp the complexity of the data set under scrutiny. He or she will be more of a data quality gauge than a DQ prescriber and, together with the analysts and language model builders, will take measures to enhance the quality of the output rather than investing too much effort in the input. Let me explain this before any misconceptions occur. The SME asks the system questions to which he knows the answer, checks the output and uses RAG to improve the answer. If he detects a certain substandard conciseness in all the answers, he may work on the chunking size of the input, but without changing the input itself. Meanwhile, some developers are working on automated feedback learning loops that will improve the performance of the SME; as you can imagine, coming up with all sorts of questions and evaluating the answers is a time-consuming task.
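Here is a minimal sketch of that kind of feedback loop, assuming a hypothetical ask_model() call that wraps whatever RAG pipeline is in place; the scoring is deliberately crude and the 0.5 threshold is my own assumption.

```python
def ask_model(question: str) -> str:
    # Placeholder for the RAG pipeline (retriever + language model).
    # In practice this calls your vector store and LLM of choice.
    return "The notice period for senior staff is three months."

def overlap_score(expected: str, actual: str) -> float:
    """Crude token-overlap score; real evaluations would be richer."""
    exp, act = set(expected.lower().split()), set(actual.lower().split())
    return len(exp & act) / len(exp) if exp else 0.0

# Questions to which the subject matter expert already knows the answer.
golden_set = {
    "What is the notice period for senior staff?":
        "Senior staff have a notice period of three months.",
}

for question, known_answer in golden_set.items():
    answer = ask_model(question)
    score = overlap_score(known_answer, answer)
    flag = "OK" if score >= 0.5 else "REVIEW"  # threshold is an assumption
    print(f"[{flag}] {question} -> {score:.2f}")
```

Automating exactly this loop is what the feedback-learning work mentioned above is trying to achieve, so the SME only reviews the flagged answers.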

In conclusion

Today, data governance for GenAI is more about enablement than control. It prioritises the creative use of data while ensuring its ethical and transparent use. Explainable Artificial Intelligence (XAI) in particular enables this approach in full; I refer to a previous blog post on this subject.

Since unstructured data like documents and images are in scope, more flexible and adaptive metadata management is key. AI itself is now being used to monitor and implement data policies. Tools like Alation, Ataccama and Alex Solutions have done groundbreaking work in this area and Microsoft’s Purview is, as always, catching up. New challenges emerge: ensuring quality and accuracy is not always feasible with images, and integrating data from diverse sources and in diverse unstructured formats is equally hard.

The more new use cases we develop for GenAI, the more we experience a universal law: data governance for GenAI is use case based, as we will show in the next blog post. This calls for a flexible and adaptive governance framework that monitors, governs and enforces rules on its use (but not too strictly, unless it’s about privacy). In other words, the same data set may be subject to various, clearly distinguishable governance frameworks, dictated by the use case.

________________________________________________________________________________

(1) Bert Brijs, Business Analysis for Business Intelligence, CRC Press, 2012, p. 272








Monday 2 December 2024

About Data Governance, Trends and Interpretations

The following graphs illustrate the opinions of 270 ICT professionals, ranging from CEOs and CIOs to enterprise, solution and data architects, lead programmers, digital transformation managers, BI engineers, data scientists, DBAs and metadata managers. You name it, any job title is present in the survey. Of course, this is just a snapshot taken between February and October 2024: it was hard work reaching out to the relevant interviewee types, but here are the results. Your remarks are welcome!

The graph below is music to my governance ears: a large majority supports the duopolistic governance model, which is crucial if you want to survive in a volatile market. Still, arguments like “efficiency” and “authority” continue to support the business and ICT monarchies, because in their perception mutual adjustment is just a waste of time…

ICT governance's guiding principles

If you take a “follow the money” approach, you get confirmation: a majority of IT professionals considers funding a duopolistic issue:

Funding is a governance issue


Half of the respondents consider AI’s introduction inevitable. Although much of its use is still “autocomplete on steroids” replacing a Google search, and money is being burnt faster than in any previous ICT innovation, the first use cases are going to market. (Stay tuned, we’re also working on one.)

Artificial Intelligence: hype or reality?


Again, a strong majority of ICT professionals see data and applications move to the Cloud. That is by no means ignoring the privacy issues with data hosted on the major US Cloud providers. But EU-based initiatives like Open Telecom Cloud, or OVH Cloud in France with high-ticket customers like Auchan, Louis Vuitton and Société Générale, are beginning to show up on the radar. We notice multi-cloud strategies emerging to avoid calamities like the Office 365 outage of 25 November this year.


Applications and data move to the Cloud
OpenTelecom on Cloud sovereignty

Although most of the interviewees have heard about robotic process automation (RPA), the combo “Process and Task Mining” was not as high on the agenda as expected. With an ageing workforce and a demographic collapse in the near future, the EU should invest every penny in automation.


Process and Task Mining


With tools like Mendix, but also process and task mining tools like Celonis, or analytical applications like KNIME Analytics Platform or Dataiku, one would think that more ICT professionals would appreciate the productivity gains of low-code tools. In this snapshot, they’re somewhat sitting on the fence. Is it because developers remain faithful to their tools, unwilling to abandon their skill set to acquire a new one?

Mendix, KNIME Analytics Platform: low code tools


Amazing: most professionals embrace the Cloud, but they’re not prepared to accept the logical consequence of Cloud architectures. Although complex to implement, zero trust is well suited for remote work, cloud-based networking and hybrid environments.

ZTNA: zero trust network architectures


I confess, this last question was sort of a lie detector. And judging from the answers, not too many respondents were making things up.

Quantum computing adoption


Of course, the Low Countries show a disproportionately large number of respondents but still, 63% are outside our home market:

Countries of the respondents

About the survey

Between February and October 2024, we had to invest heavily in contacting the right ICT professionals to get meaningful answers. It took us quite a few mails, phone calls and even visits to obtain these results. Are these results representative? I am not sure. Are they significant? Maybe. Are they inspiring? Most certainly, as we use them in our ICT Literacy course to get the conversation going about business-ICT alignment. Give us your opinion in the comments.







Thursday 4 April 2024

Why XAI will be the Next Big Thing

Nothing new under the sun

Or should I use the expression “plus ça change, plus c’est la même chose”? Because what’s at stake in large language models (LLMs) like ChatGPT-4 and others is the trade-off between the model’s fit and accuracy on the one hand, and transparency, interpretability and explainability on the other. This dilemma is as old as classical statistics: a simple regression model may be inaccurate, but it is easily readable for end users without a large background in statistics.

Anyone can read from a graphical representation that there is a correlation between the office surface, the location class and the office rent. But the high-dimensional results of a neural network are less transparent and interpretable, let alone explainable.
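To illustrate the readable end of that trade-off, here is a minimal sketch that fits a plain linear regression on a tiny made-up sample of office surface, location class and rent; the coefficients are the whole explanation, which is exactly what a neural network cannot offer.

```python
from sklearn.linear_model import LinearRegression

# Made-up sample: [office surface in m2, location class 1-3], monthly rent in EUR.
X = [[100, 1], [150, 1], [120, 2], [200, 2], [180, 3], [250, 3]]
y = [2500, 3400, 2400, 3600, 2900, 3900]

model = LinearRegression().fit(X, y)

# Anyone can read these: rent per extra m2 and the premium per location class step.
print("EUR per m2:", round(model.coef_[0], 2))
print("EUR per location class step:", round(model.coef_[1], 2))
print("base rent:", round(model.intercept_, 2))
```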

The same goes for LLMs: there is no way a human domain expert can fathom the multitude of weight matrices used in determining the syntax relationship between words.

About hard to detect hallucinations

Anyone can see nonsense coming out of ChatGPT, like responses inconsistent with the prompt. But what about pure fiction presented in a factually consistent and convincing way? If the end user is not a domain expert, he will have trouble recognising the output for what it is.

The mitigation is called RAG (Retrieval Augmented Generation). It’s a technique that enables experts to add their own data to the prompt and ensure more precise generative AI output. But… then we’re missing the whole point of generative AI: enabling a broader audience than domain experts to do tasks for which they had little or no training or education.
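For readers who have not seen RAG from up close, here is a minimal sketch of the idea: retrieve the expert’s own snippets that are closest to the question and prepend them to the prompt. The embed() and complete() functions are placeholders for whichever embedding model and LLM you actually use, not real APIs.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding call; swap in your embedding model of choice.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

def complete(prompt: str) -> str:
    # Placeholder LLM call; swap in the generative model you actually use.
    return f"(answer generated from a prompt of {len(prompt)} characters)"

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    q = embed(question)
    def score(doc: str) -> float:
        v = embed(doc)
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(documents, key=score, reverse=True)[:k]

documents = ["Expert note on dosage thresholds.",
             "Internal pricing policy 2024.",
             "FAQ on returns."]
question = "What is the maximum dosage for adults?"

context = "\n".join(retrieve(question, documents))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(complete(prompt))
```

The point of the argument stands: someone still has to curate those expert documents, which is exactly the domain expertise RAG was supposed to make optional.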

Domain expertise is needed in most cases

Generating marketing and advertising content may work for low-level copy like catalogue texts, but I doubt it will deliver the sort of ads you find on Ads of the World: https://www.adsoftheworld.com/

I grant you the use case of enhancing the shopping experience as a “domainless” knowledge generator. But most use cases, like drug discovery, health care, finance and stock market trading or urban design to name a few, require domain knowledge to prevent accidents from happening.

ChatGPT has serious issues with accuracy
Only 7% of the citations were accurate!


Take health care: a study by Bhattacharyya et al. in 2023[1] identified an astonishing number of errors in references to medical research. Among these references, 47% were fabricated, 46% were authentic but inaccurate, and only 7% were authentic and accurate. My friend, a medical practitioner, was already frustrated by people googling their symptoms and entering his practice with the diagnosis and the treatment; with this tool I fear his frustrations will only increase… Many more examples can be found in other domains[2].

Hallucinations galore in Generative AI

Another evolution in AI is the move away from tagging by experts, replacing this process with Self-supervised Learning (SSL; no, not the network encryption protocol). Today’s applications in medicine produce impressive results but, again, this approach still requires medical expertise. In the context of generative AI, self-supervised learning can be particularly useful for pre-training models on large amounts of unlabelled data before fine-tuning them on specific tasks. By learning to predict certain properties or transformations of the data, such as predicting missing parts of an image (inpainting) or reconstructing corrupted text (denoising), the model can develop a rich understanding of the data distribution and capture meaningful features that can then be used for generating new content.
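A minimal sketch of how such self-supervised training pairs can be built from unlabelled text: corrupt the input by masking tokens and keep the original as the target. No expert tagging is involved; the sentence and the 15% mask rate are my own illustrative assumptions.

```python
import random

def make_denoising_pair(text: str, mask_rate: float = 0.15, seed: int = 1):
    """Return (corrupted_input, original_target) for denoising-style pre-training."""
    rng = random.Random(seed)
    tokens = text.split()
    corrupted = [("[MASK]" if rng.random() < mask_rate else t) for t in tokens]
    return " ".join(corrupted), text

corrupted, target = make_denoising_pair(
    "The patient was treated for hepatitis B and responded well to therapy."
)
print("input :", corrupted)
print("target:", target)
```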

Enter XAI

The European Regulation on Artificial Intelligence (AI Act)[3], which is in the final implementation process, is a serious argument for avoiding sorcerer’s apprentices. Especially in high-risk AI applications, such as those used in healthcare, transportation and law enforcement, the AI Act imposes strict requirements, including data quality, transparency, robustness and human oversight. Additionally, the Act prohibits certain AI practices deemed unacceptable, such as social scoring systems that manipulate human behaviour or exploit vulnerabilities.

This will foster the use of explainable AI, at least in domains where existing legislation already requires transparency, e.g. Sarbanes-Oxley, HIPAA and others. Professionals in banking and insurance, public servants deciding on subsidies and grants, and HR professionals evaluating CVs are just a few of the primary beneficiaries of XAI.

They will need models where humans can understand how the algorithm works and tweak it to test its sensitivity. By doing so, they will get a better understanding of how the model came up with a certain result.

In short, XAI models may be simpler, but they are better governed and they will grow in usability as new increments are added to the existing knowledge base. As we speak, sector-specific general models are being developed, ready to be enhanced with your specific domain knowledge.



[1] Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content. Cureus. 2023 May 19;15(5):e39238. doi: 10.7759/cureus.39238. PMID: 37337480.

[2] Athaluri SA, Manthena SV, Kesapragada VSRKM, Yarlagadda V, Dave T, Duddumpudi RTS. Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References. Cureus. 2023 Apr 11;15(4):e37432. doi: 10.7759/cureus.37432. PMID: 37182055; PMCID: PMC10173677.

Friday 2 February 2024

Geospatial Data Warehouse and Spatially Enabled Data Warehouse: Turf Wars or Symbiosis?

Over the course of more than 25 years, I have been involved in numerous discussions between the GIS buffs and my tribe, the data warehouse team, over where geospatial information should reside.

Narnia map
A geospatial map of Narnia. What attributes should reside in the geospatial system?

Since almost all measures have a location aspect, the spatial data warehouse was promoted as the single source of truth, able to visualize data in an unparalleled way, whereas the opponents stated that all you need is a well-defined location dimension and the data warehouse could do without the expensive software and the scarce resources in the geospatial domain.

I will spare you the avalanche of technical arguments back and forth between the two, leading to tugs of war between the teams, and instead propose an approach from the business user’s point of view.

The essential question is: “What information am I looking for?” Is it about one or more measures that need to be put in context using mostly dimensions outside the geospatial domain, even if DimLocation is one of them? Or is it exclusively related to questions like “What happened or happens in this particular location, i.e. at this point, line or polygon?”, “What are the measures within a radius of point (x,y) on the map?” or “What is the intersection between location A and location B as far as measure Z is concerned?”

It is clear that in the first case, the performance and cost of a classical data warehouse with a location dimension will prove to be the better choice. But if location is the point of entry to a query, then the spatial data warehouse is the smartest tool in the shed. 
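To make the distinction tangible, here is a minimal sketch, assuming shapely is available and a handful of made-up store measures: the spatial question starts from geometry, the classic question starts from a dimension value.

```python
from shapely.geometry import Point

# Made-up store measures with a location attribute.
stores = [
    {"store": "North", "location": Point(4.35, 50.85), "sales": 120_000},
    {"store": "South", "location": Point(4.40, 50.80), "sales": 95_000},
    {"store": "East",  "location": Point(4.70, 50.88), "sales": 70_000},
]

# Spatial question: which measures fall within a radius of point (x, y)?
centre, radius = Point(4.37, 50.83), 0.1  # radius in degrees, for illustration only
in_radius = [s for s in stores if s["location"].distance(centre) <= radius]
print("sales within radius:", sum(s["sales"] for s in in_radius))

# Classic question: slice the same measure by a non-spatial dimension value.
print("sales for store 'North':", sum(s["sales"] for s in stores if s["store"] == "North"))
```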

Symbiosis is the way forward

There are many reasons why the two environments make sense. For executive and managerial information based on structured data, the data warehouse has proven to be the platform of choice and will continue to be so. For location-based analysis, the geospatial data warehouse outperforms the classic data warehouse. At the same time, it is much closer to operational analytics and it can even be part of operational applications like CRM, SCM or any other OLTP system.

To enable symbiosis, the location dimension needs some connection to the geospatial system. Some plead for a simple snapshot of a shapefile, others want a full duplication of all geospatial data with their timestamps. The latter may lead to an avalanche of data, as any little correction of a shape in the GIS system will send new timestamped data; this can’t be a workable situation. Either the snapshot ignores updates but takes in the original GIS object ID to secure a trace, or it overwrites any location data and keeps the last version as the active one. After all, the only objective here is to provide a path to analysts who need a deeper geospatial analysis of one or more measures registered in the data warehouse.
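A minimal sketch of the overwrite option described above: the location dimension is keyed on the original GIS object ID as a trace, and only the latest shape version stays active. The column names and version numbers are assumptions for illustration.

```python
# Location dimension keyed on the original GIS object ID; incoming corrections
# from the GIS system simply overwrite the previous version (last one wins).
location_dim = {
    "GIS-0042": {"name": "Warehouse A", "shape_version": 3,
                 "geometry_wkt": "POINT (4.35 50.85)"},
}

def apply_gis_update(update: dict) -> None:
    current = location_dim.get(update["gis_object_id"])
    if current is None or update["shape_version"] > current["shape_version"]:
        location_dim[update["gis_object_id"]] = {
            "name": update["name"],
            "shape_version": update["shape_version"],
            "geometry_wkt": update["geometry_wkt"],
        }

apply_gis_update({"gis_object_id": "GIS-0042", "name": "Warehouse A",
                  "shape_version": 4, "geometry_wkt": "POINT (4.3501 50.8502)"})
print(location_dim["GIS-0042"]["shape_version"])  # -> 4, only the active version remains
```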

Let's open the debate

I am sure I have missed a few points here and there. Let me know your position on this issue via the comments or a personal message via the contact form on our website: https://www.linguafrancaconsulting.eu/ 

This is one of the topics of our course "ICT Focus Areas for Board Members & Management"







Monday 27 November 2023

Governing the Data Ingestion Process

“Data lakehousing” is all about good housekeeping for your data. There is, of course, room for ungoverned data in a quarantine area, but if you want to make use of the structured and especially the semi-structured and unstructured data, you’d better govern the influx of data before your data lake becomes a swamp producing no value whatsoever.

Three data flavours need three different treatments

Structured data are relatively easy to manage: profile the data, look for referential integrity failures, outliers, free text that may need categorising, etc. In short: harmonise the data with the target model, which can be one or more unrelated tables or a set of data marts producing meaningful analytical data.
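A minimal sketch of such a profiling pass, assuming pandas and two made-up tables: it flags missing values and referential integrity failures before the data are harmonised with the target model.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer_id": [10, 11, 99],
                       "amount": [250.0, None, 80.0]})
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Acme", "Globex"]})

# Profile: missing values per column.
print(orders.isna().sum())

# Referential integrity: orders pointing to a customer that does not exist.
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print("orphan orders:\n", orphans)
```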

Semi-structured data demand a data pipeline that can combine the structured aspects of clickstream or log file analysis with the less structured parts like search terms. It also takes care of matching IP addresses with geolocation data, since ISPs sometimes sell blocks of IP ranges to colleagues abroad.
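Here is a minimal sketch of that combination, assuming a simplified log format and a made-up lookup table mapping IP prefixes to countries; real pipelines would use a maintained IP-to-location feed.

```python
import re

# Simplified access-log line: IP, timestamp, free-text search term.
log_line = '203.0.113.7 [2024-05-01T10:22:31] q="garden furniture"'

match = re.match(r'(?P<ip>\S+) \[(?P<ts>[^\]]+)\] q="(?P<term>[^"]*)"', log_line)
record = match.groupdict()

# Made-up geolocation table; kept up to date precisely because
# ISPs resell blocks of IP ranges.
geo_by_prefix = {"203.0.113": "BE", "198.51.100": "FR"}
prefix = ".".join(record["ip"].split(".")[:3])
record["country"] = geo_by_prefix.get(prefix, "unknown")

print(record)  # structured parts (IP, timestamp, country) plus the search term
```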

Unstructured data like text files from social media, e-mails, blog posts, documents and the like need more complex treatment. It’s all about finding structure in these data. Preparing these data for text mining means a lot of disambiguation process steps to get from text input to meaningful output (a short sketch follows the list below):

  • Tokenization of the input is the process of splitting a text object into smaller chunks known as tokens. These tokens can be single words or word combinations, characters, numbers, symbols, or n-grams.
  • Normalisation of the input: separating prefixes and/or suffixes from the morpheme to become the base form, e.g. unnatural -> nature
  • Reduce certain word forms to their lemma, e.g. the infinitive of a conjugated verb
  • Tag parts of speech with their grammatical function: verb, adjective,..
  • Parse words as a function of their position and type
  • Check for modality and negations: “could”, “should”, “must”, “maybe”, etc… express modality
  • Disambiguate the sense of words: “very” can be both a positive and a negative term in combination with whatever follows
  • Semantic role labelling: determine the function of the words in a sentence: is the subject an agent or the subject of an action in “I have been treated for hepatitis B”? What is the goal or the result of the action in “I sold the house to a real estate company”?
  • Named entity recognition: categorising text into pre-defined categories like person names, organisation names, location names, time denominations, quantities, monetary values, titles, percentages,…
  • Co-reference resolution: when two or more expressions in a sentence refer to the same object: “Bert bought the book from Alice but she warned him, he would soon get bored of the author’s style as it was a tedious way of writing.” In this sentence, “him” and “he” refer to “Bert”, “she” refers to “Alice” while “it” refers to “the author’s style”.
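As announced above, here is a minimal sketch of the first few steps, assuming spaCy and its small English model are installed; the heavier steps (semantic role labelling, co-reference resolution) need more specialised tooling on top of this.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Bert bought the book from Alice in Brussels for 25 euros.")

# Tokenization, lemmatisation and part-of-speech tagging in one pass.
for token in doc:
    print(f"{token.text:12} lemma={token.lemma_:12} pos={token.pos_}")

# Named entity recognition: persons, locations, monetary values, ...
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```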

What architectural components support these treatments?

The first two data types can be handled with the classical Extract, Transform and Load or Extract, Load and Transform pipelines, in short: ETL or ELT. We refer to ample documentation about these processes in the footnote below[1].

But for processing unstructured data, you need to develop classifiers, thesauri and ontologies to represent your “knowledge inventory” as a reference model for the text analytics. This takes up a lot of resources and careful analysis to make sure you come up with a complete, yet practical set of tools to support named entity recognition.

The conclusion is straightforward: the less structure is predefined in your data, the more effort in data governance is needed.

 

An example of a thesaurus metamodel

[1] Three reliable sources, each with their nuances and perspectives on ETL/ELT:

https://aws.amazon.com/what-is/etl/

https://www.ibm.com/topics/etl

https://www.snowflake.com/guides/what-etl

Saturday 18 November 2023

Best Practices in Defining a Data Warehouse Architecture

This blogpost is part of a series of which the following posts have been published:

The opening statement

What is a data mesh?

Coherent business concepts keep the data relevant

In any data mesh architecture, the data warehouse is and will remain a critical component for many reasons. First and foremost: some analytics need industrialised solutions, automating the entire flow from raw data to finished reports. Structured data will always contribute to the analytical environment and will need a relational model to provide the foundation for analyses. In my experience, the most flexible and sustainable model is the process-based star schema architecture from Ralph Kimball. In one of my previous posts I have made the case for this approach.

And in the context of a data lake project, I positioned the Kimball approach as best in class.

The process diagram below tells the story of requirements gathering, ingesting all sorts of data into the lake and making the distinction between structured and unstructured data. Identifying the common dimensions and facts is crucial to make the concept work. Either you provide an increment to an existing data mart bus, or you introduce a new process-metrics fact table with foreign keys from existing and new dimensions.
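A minimal sketch of the second option, assuming pandas and made-up tables: a new process-metrics fact table re-uses the surrogate keys of an existing conformed dimension and adds a new one.

```python
import pandas as pd

# Existing conformed dimension, shared across the data mart bus.
dim_customer = pd.DataFrame({"customer_key": [1, 2], "customer_name": ["Acme", "Globex"]})

# New dimension introduced together with the new process.
dim_channel = pd.DataFrame({"channel_key": [1, 2], "channel": ["web", "store"]})

# New process-metrics fact table: foreign keys to existing and new dimensions.
fact_support_calls = pd.DataFrame({
    "customer_key": [1, 2, 1],
    "channel_key": [1, 2, 2],
    "calls": [3, 1, 2],
    "handling_minutes": [24, 7, 16],
})

report = (fact_support_calls
          .merge(dim_customer, on="customer_key")
          .merge(dim_channel, on="channel_key")
          .groupby(["customer_name", "channel"], as_index=False)[["calls", "handling_minutes"]]
          .sum())
print(report)
```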


Best practices in DWH
Managing structured and unstructured data in a data mesh environment

Making the case for the data warehouse as an endpoint of unstructured analysis

A lot of advanced analytics can be facilitated by the data lake. Think of text analytics, social media analytics and image processing. The outcomes of these analyses may find their way to the data warehouse. For example: polarity analysis of social media. Imagine a bank or a telecom provider capturing the social media comments on its performance. As we all know from customer feedback analysis, only the emotions two or three sigma away from the mean make it to social media: the client is either very satisfied or very dissatisfied and wants the world to know. Taking snapshots of the client’s mood and relating it to his financial or communication behaviour may yield interesting information. Already today, some banks are capturing their clients’ moods to determine the optimum conditions to present their services. Aggregating these data may even provide macro-economic data correlating with the business cycle.

Have a look at the diagram below and imagine the business questions it can answer for you.

A high level star schema integrating social messages and their polarity with sales metrics

Think of time series: is there some form of a leading indicator of sales in the polarity of this customer’s social messages? (A minimal sketch of this check follows the questions below.)

If one of our products is the subject of a social media post, has this any (positive or negative) effect on sales of that particular product?

What social media sources have the greatest impact on our brand equity?
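As announced, here is a minimal sketch for the leading-indicator question, assuming pandas and made-up daily series: shift the polarity series and see at which lag it correlates best with sales.

```python
import pandas as pd

days = pd.date_range("2024-01-01", periods=10, freq="D")
polarity = pd.Series([0.1, 0.4, 0.6, 0.2, -0.3, -0.5, 0.0, 0.3, 0.5, 0.2], index=days)
sales = pd.Series([100, 105, 110, 130, 125, 95, 80, 100, 115, 128], index=days)

# Correlate sales with polarity shifted by 0..3 days: a strong correlation at a
# positive lag suggests polarity leads sales.
for lag in range(4):
    corr = sales.corr(polarity.shift(lag))
    print(f"lag {lag} day(s): correlation = {corr:.2f}")
```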

I am sure you will add your dimensions and business questions to the model. And by doing so you are realising one of the main traits of a data mesh: delivering data as a product.

I hope I have made my point clear: even in the most sophisticated data lakehouse supporting a data mesh architecture, the data warehouse is not going away.

In the next blog article we will focus on governing the data ingestion process.

Stay tuned!


Saturday 11 November 2023

Start with Defining Coherent Business Concepts

Below is a diagram describing the governance process of defining and implementing business concepts in a data mesh environment. The business glossary domain is the user facing side of a data catalogue whereas the data management domain is the backend topology of the data catalogue. It describes how business concepts are implemented in databases, whether in virtual or persistent storage.

But first and foremost: it is the glue that holds any dispersed data landscape together. If you can govern the meaning of any data model, any implementation of concepts like PARTY, PARTY ROLE, PROJECT, ASSET and PRODUCT to name a few, the data can be anywhere, in any form, and the usability will still be guaranteed. Of course, data quality will be a local responsibility in case global concepts need specialisation to cater for local information needs.


Business perspective on defining and implementing a business concept for a data mesh

FAQ on this process model

Why does the process owner initiate the process?

The reason is simple: process owners have a transversal view on the enterprise and are aware the organisation needs shareable concepts.

Do we still need class definitions and class diagrams in data lakehouses?

Yes, since a great deal of data is still in a structured “schema on write” form, and even unstructured or “schema on read” data may benefit from a class diagram creating order in, and comprehension of, the underlying data. Even streaming analytics use some tabular form to make the data exploitable.

What is the role of the taxonomy editor?

He or she will make sure the published concept is in sync with the overall knowledge categorisation, providing “the right path” to the concept.

Is there always a need for a physical data model?

Sure: any conceptual data model can be physically implemented via a relational model, a NoSQL model in any of its flavours, or a native graph database. So yes, if you want complete governance from business concept to implementation, the physical model is also in scope.

Any questions you might have?

Drop me a line or reply in the comments.

The next blog article, Best Practices in Defining a Data Warehouse Architecture, will focus on the place of the data warehouse in a data mesh.