
Tuesday, 4 February 2025

Data Governance for GenAI

Introduction

In this article I will define what data governance (DG) for Generative Artificial Intelligence (GenAI) is and how it differs from DG as we have known it for decades in the world of transaction systems (OLTP) and analytical systems (OLAP and Data Mining).

In a second post, I will make the case for DG based on the use case at hand and illustrate a few GenAI DG use cases that are feasible and fitting the patterns and the framework.

Die “Umwertung aller Werte”

The German philosopher Friedrich Nietzsche postulated that all existing values should be discarded and replaced by values that -up to now- were considered unwanted. 

This is what comes to mind when I examine some GenAI use cases and look at the widely accepted data governance policies, rules and practices.

Here are the old values that will be replaced:

Establish data standards;

The data model as a contract;

Data glossary and data lineage, the universal truths;

Data quality, data consistency and data security enforcement;

Data stewardship based on a subject area.

Establish data standards

As the DAMA DM BOK states: Data standards and guidelines include naming standards, requirement specification standards, data modelling standards, database design standards, architecture standards, and procedural standards for each data management function. 

This approach couldn’t be further away from DG for GenAI. Data standards are mostly about “spelling”, which has very little impact on semantics. The syntactical aspects of data standards are more in the realm of tagging, where subject matter experts provide standardised meaning to various syntactical expressions. So we can have tagging standards for supervised learning, but even those can depend on “the eye of the beholder”, i.e. the use case. 

OK, we can have discussions about which language model and vector database is the best fit for the use case at hand but it will be a continued trial and error process before we have optimised the infrastructure and it certainly won’t be a general recommendation for all use cases.

And as for the requirement specification standards, as long as they don’t kill the creativity needed to deal with GenAI, I’ll give them a pass, since identifying business needs for information and data is not always a linear process. The greatest value in GenAI lies in discovering answers to questions you didn’t dream of asking. 

The data model as a contract

fig. 1: Requirements constitute a contract for the data model. Governing this contract is relatively easy.

This principle works fine for transaction systems and classic Business Intelligence data architectures, where a star schema or a data vault models the world view of the stakeholders. For GenAI, the only contract left is the aforementioned tagging and metadata specifications, to make sure the data are exploitable.

Data glossary and data lineage, universal truths?

No longer. The use case context will determine the glossary, and the lineage as well if intermediate steps are involved before the data become accessible. Definitions may change as a function of context, as may the data transformations that prepare the data for the task at hand. 

Data quality, data consistency and data security enforcement

In old school data governance policies, data quality (DQ) is first about complying with specs, and only then does “fit for purpose” come in as the deciding criterion, as I described in Business Analysis for Business Intelligence(1):

Data quality for BI purpose is defined and gauged with reference to fitness for purpose as defined by the analytical use of the data and complying with three levels of data quality as defined by:

[Level 1] database administrators

[Level 2] data warehouse architects

[Level 3] business intelligence analysts

On level 1, data quality is narrowed down to data integrity or the degree to which the attributes of an instance describe the instance accurately and whether the attributes are valid, i.e. comply with defined ranges or definitions managed by the business users. This definition remains very close to the transaction view.

On level 2, data quality is expressed as the percentage completeness and correctness of the analytical perspectives. In other words, to what degree is each dimension, each fact table complete enough to produce significant information for analytical purpose? Issues like sparsity and spreads in the data values are harder to tackle. Timeliness and consistency need to be controlled and managed on the data warehouse level. 

On level 3, data quality is the measure in which the available data are capable of adequately answering the business questions. Some use the criterion of accessibility with regard to the usability and clarity of the data. Although this seems a somewhat vague definition, it is most relevant to anyone with some analytical mileage on his odometer. I remember a vast data mining project in a mail order company producing the following astonishing result: 99.9% of all dresses sold were bought by women!

In GenAI, we can pay little attention to the aforementioned level 1 while emphasising the higher-level aspects of data quality. There, the true challenge lies in testing the validity of three interacting aspects of GenAI data: quality, quantity and density. As mentioned above: quality in the sense of “fit for use case”, reducing bias and detecting trustworthy sources; quantity, by guaranteeing sufficient data to include all -expected and unexpected- patterns; and finally density, to make sure the language model can deliver meaningful proximity measures between the concepts in the data set.
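As an illustration of the density aspect, here is a minimal sketch in Python (the sentence-transformers library and the model name are my own picks for the example, not part of any prescribed stack): compute pairwise proximity measures between a handful of concepts and check whether near-synonyms indeed score higher than unrelated pairs.

# Minimal sketch: gauging the "density" of a GenAI data set by checking
# whether related concepts yield meaningful proximity measures.
# Assumes the sentence-transformers package; the model name is illustrative.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

concepts = [
    "customer churn",
    "customer attrition",
    "invoice address",
    "delivery address",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(concepts)

# Near-synonyms should score clearly higher than unrelated pairs; if they
# do not, the data set may be too sparse for the language model to deliver
# meaningful proximity measures between its concepts.
similarities = cosine_similarity(vectors)
for i, a in enumerate(concepts):
    for j, b in enumerate(concepts):
        if i < j:
            print(f"{a!r} vs {b!r}: {similarities[i, j]:.2f}")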

Data stewardship based on a subject area



fig. 2: Like a football steward, a data steward must also control the crowd to prevent chaos

A business data steward, according to DAMA, is a knowledge worker and business leader recognized as a subject matter expert who is assigned accountability for the data specifications and data quality of specifically assigned business entities, subject areas or databases, who will: (…)

4. Ensure the validity and relevance of assigned data model subject areas

5. Define and maintain data quality requirements and business rules for assigned data attributes.

It is clear that this definition needs adjustments. Here is my concept of a data steward for GenAI data:

It is, of course, a knowledge worker who is familiar with the use case that the GenAI data set needs to satisfy. This may be a single subject matter expert (SME), but in the majority of cases he or she will be the coach and integrator of several SMEs to grasp the complexity of the data set under scrutiny. He or she will be more of a data quality gauge than a DQ prescriber and, together with the analysts and language model builders, will take measures to enhance the quality of the output rather than investing too much effort in the input. Let me explain this before any misconceptions occur. The SME asks the system questions to which he knows the answer, checks the output and uses RAG to improve the answer. If he detects a certain substandard conciseness in all the answers, he may work on the chunking size of the input, but without changing the input itself. Meanwhile, some developers are working on automated feedback learning loops that will improve the performance of the SME, since, as you can imagine, coming up with all sorts of questions and evaluating the answers is a time-consuming task.
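The working method sketched above boils down to a simple evaluation loop. The code below is a minimal illustration of that loop, not an implementation of any particular product: ask_rag() stands in for whatever RAG pipeline is in place, and the golden questions, the scoring rule and the 0.8 threshold are all invented for the example.

# Minimal sketch of the SME-driven output check described above.
def score(answer: str, reference: str) -> float:
    """Naive term-overlap score; in practice the SME (or an LLM judge) rates the answer."""
    answer_terms = set(answer.lower().split())
    reference_terms = set(reference.lower().split())
    return len(answer_terms & reference_terms) / max(len(reference_terms), 1)

def evaluate(ask_rag, golden_set: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return the questions whose answers the SME would flag for follow-up."""
    flagged = []
    for question, reference in golden_set.items():
        answer = ask_rag(question)           # the existing RAG pipeline, whatever it is
        if score(answer, reference) < threshold:
            flagged.append(question)
    return flagged

# If many answers come back flagged as too terse or incomplete, the next
# experiment is typically a different chunk size or overlap in the ingestion
# step, without touching the source documents themselves.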

In conclusion

Today, data governance for GenAI is more about enablement than control. It prioritises the creative use of data while ensuring its ethical and transparent use. This approach comes into its own especially in Explainable Artificial Intelligence (XAI); I refer to a previous blog post on this subject.

Since unstructured data like documents and images are in scope, a more flexible and adaptive metadata management is key. AI is now itself being used to monitor and implement data policies. Tools like Alation, Ataccama and Alex Solutions have done groundbreaking work in this area and Microsoft’s Purview is -as always- catching up. New challenges emerge: ensuring quality and accuracy is not always feasible with images, and integrating data from diverse sources and in diverse unstructured formats remains a challenge. 

The more we develop new use cases for GenAI, the more we experience a universal law: data governance for GenAI is use case based, as we will show in the next blog post. This calls for a flexible and adaptive governance framework that monitors, governs and enforces rules of use (not too strictly, unless privacy is at stake). In other words, the same data set may be subject to various, clearly distinguishable governance frameworks, dictated by the use case. 

________________________________________________________________________________

(1) Bert Brijs, Business Analysis for Business Intelligence, CRC Press, 2012, p. 272








Monday, 27 November 2023

Governing the Data Ingestion Process

“Data lakehousing” is all about good housekeeping of your data. There is, of course, room for ungoverned data in a quarantine area, but if you want to make use of the structured and especially the semi-structured and unstructured data, you’d better govern the influx of data before your data lake becomes a swamp producing no value whatsoever.

Three data flavours need three different treatments

Structured data are relatively easy to manage: profile the data, look for referential integrity failures, outliers, free text that may need categorising, etc. In short: harmonise the data with the target model, which can be one or more unrelated tables or a set of data marts that produce meaningful analytical data.
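To make this concrete, here is a minimal profiling sketch in pandas; the table names, columns and thresholds are invented for the example and not taken from any specific target model.

# Minimal sketch of profiling structured data before ingestion.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 99],   # 99 has no match in the customer table
    "amount": [120.0, 95.0, 15000.0],
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

# Referential integrity: orders pointing to unknown customers.
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print("Referential integrity failures:\n", orphans)

# Simple outlier check on a numeric column using the interquartile range.
q1, q3 = orders["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = orders[(orders["amount"] < q1 - 1.5 * iqr) | (orders["amount"] > q3 + 1.5 * iqr)]
print("Outliers:\n", outliers)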

Semi-structured data demand a data pipeline that can combine the structured aspects of clickstream or log file analysis with the less structured parts like search terms. The pipeline also takes care of matching IP addresses with geolocation data, since ISPs sometimes sell blocks of IP ranges to colleagues abroad.
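A sketch of what such a pipeline step could look like, assuming a simple web server log format; the log pattern and the IP-to-country lookup table are made up for the example (a real pipeline would consult a maintained geolocation database).

# Minimal sketch: combine the structured part of a log line (IP, timestamp)
# with the less structured part (the search term in the query string).
import re
from urllib.parse import parse_qs, urlparse

LOG_PATTERN = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<url>\S+)')

# In practice this comes from a maintained geolocation database, not a dict.
IP_PREFIX_TO_COUNTRY = {"81.164.": "BE", "145.53.": "NL"}

def geolocate(ip: str) -> str:
    return next((c for prefix, c in IP_PREFIX_TO_COUNTRY.items() if ip.startswith(prefix)), "unknown")

def parse_line(line: str) -> dict | None:
    match = LOG_PATTERN.match(line)
    if not match:
        return None
    url = urlparse(match["url"])
    return {
        "ip": match["ip"],
        "country": geolocate(match["ip"]),
        "timestamp": match["ts"],
        "search_terms": parse_qs(url.query).get("q", []),
    }

print(parse_line('81.164.10.5 - - [27/Nov/2023:10:15:32 +0100] "GET /search?q=ice+cream HTTP/1.1" 200 512'))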

Unstructured data like text files from social media, e-mails, blog posts, documents and the like need more complex treatment. It’s all about finding structure in these data. Preparing these data for text mining means a lot of disambiguation process steps to get from text input to meaningful output (a short sketch with an off-the-shelf NLP library follows the list):

  • Tokenization of the input is the process of splitting a text object into smaller chunks known as tokens. These tokens can be single words or word combinations, characters, numbers, symbols, or n-grams.
  • Normalisation of the input: separating prefixes and/or suffixes from the morpheme to become the base form, e.g. unnatural -> nature
  • Reduce certain word forms to their lemma, e.g. the infinitive of a conjugated verb
  • Tag parts of speech with their grammatical function: verb, adjective,..
  • Parse words as a function of their position and type
  • Check for modality and negations: “could”, “should”, “must”, “maybe”, etc… express modality
  • Disambiguate the sense of words: “very” can be both a positive and a negative term in combination with whatever follows
  • Semantic role labelling: determine the function of the words in a sentence: is the subject an agent or the subject of an action in “I have been treated for hepatitis B”? What is the goal or the result of the action in “I sold the house to a real estate company”?
  • Named entity recognition: categorising text into pre-defined categories like person names, organisation names, location names, time denominations, quantities, monetary values, titles, percentages,…
  • Co-reference resolution: when two or more expressions in a sentence refer to the same object: “Bert bought the book from Alice but she warned him, he would soon get bored of the author’s style as it was a tedious way of writing.” In this sentence, “him” and “he” refer to “Bert”, “she” refers to “Alice” while “it” refers to “the author’s style”.
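Most of these steps are available off the shelf. Below is a minimal sketch with spaCy covering tokenisation, lemmatisation, part-of-speech tagging, dependency parsing, modality/negation cues and named entity recognition; semantic role labelling and co-reference resolution typically need extra components and are left out here. The small English model is used purely as an example.

# Minimal sketch covering several of the steps above with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bert bought the book from Alice in London, but maybe he should not have.")

# Tokenisation, lemmatisation, part-of-speech tagging and dependency parsing.
for token in doc:
    print(f"{token.text:<8} lemma={token.lemma_:<8} pos={token.pos_:<6} dep={token.dep_}")

# Simple modality and negation cues based on the tags above.
cues = [t.text for t in doc if t.tag_ == "MD" or t.dep_ == "neg"]
print("Modality/negation cues:", cues)

# Named entity recognition: person and location names in this example.
print([(ent.text, ent.label_) for ent in doc.ents])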

What architectural components support these treatments?

The first two data types can be handled with the classical Extract, Transform and Load or Extract, Load and Transform pipelines, in short: ETL or ELT. We refer to ample documentation about these processes in the footnote below[1].

But for processing unstructured data, you need to develop classifiers, thesauri and ontologies to represent your “knowledge inventory” as a reference model for the text analytics. This takes a lot of resources and careful analysis to make sure you come up with a complete yet practical set of tools to support named entity recognition.
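As a sketch of what a small piece of that knowledge inventory could look like, the structure below loosely mimics a SKOS-style thesaurus entry that a rule-based entity tagger could consult; the concepts, labels and categories are made up for the illustration.

# Minimal sketch of a thesaurus entry used as a reference model for
# named entity recognition (preferred label, alternative labels,
# broader relation and the NER category the concept maps to).
from dataclasses import dataclass, field

@dataclass
class Concept:
    pref_label: str
    alt_labels: list[str] = field(default_factory=list)
    broader: str | None = None       # identifier of the parent concept
    category: str = "TOPIC"          # the NER category the concept maps to

thesaurus = {
    "sales_outlet": Concept("sales outlet", ["point of sale", "POS"], category="ASSET"),
    "vending_machine": Concept("vending machine", ["vending unit", "automat"],
                               broader="sales_outlet", category="ASSET"),
}

def tag(text: str) -> list[tuple[str, str]]:
    """Return (surface form, category) pairs found via the thesaurus labels."""
    hits = []
    lowered = text.lower()
    for concept in thesaurus.values():
        for label in [concept.pref_label, *concept.alt_labels]:
            if label.lower() in lowered:
                hits.append((label, concept.category))
    return hits

print(tag("Each point of sale reports its vending machine turnover weekly."))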

The conclusion is straightforward: the less structure predefined in your data, the more effort in data governance is needed.

 

An example of a thesaurus metamodel


[1] Three reliable sources, each with their nuances and perspectives on ETL/ELT:

https://aws.amazon.com/what-is/etl/

https://www.ibm.com/topics/etl

https://www.snowflake.com/guides/what-etl

Saturday, 11 November 2023

Start with Defining Coherent Business Concepts

Below is a diagram describing the governance process of defining and implementing business concepts in a data mesh environment. The business glossary domain is the user-facing side of a data catalogue, whereas the data management domain is the backend topology of the data catalogue: it describes how business concepts are implemented in databases, whether in virtual or persistent storage.

But first and foremost: it is the glue that holds any dispersed data landscape together. If you can govern the meaning of any data model and any implementation of concepts like PARTY, PARTY ROLE, PROJECT, ASSET and PRODUCT, to name a few, the data can be anywhere and in any form, but their usability will be guaranteed. Of course, data quality will be a local responsibility where global concepts need specialisation to cater for local information needs.
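To make this tangible, here is a minimal sketch of what a governed glossary entry could look like, gluing the business definition (the user-facing side) to its physical implementations (the backend topology). The concept, owner, platforms and stewards are invented for the example.

# Minimal sketch of a business glossary entry linking a shared concept
# to its local implementations across a dispersed data landscape.
from dataclasses import dataclass

@dataclass
class Implementation:
    platform: str        # e.g. "CRM", "lakehouse", "graph DB"
    location: str        # table, collection or node label
    steward: str         # local responsibility for data quality

@dataclass
class GlossaryEntry:
    concept: str
    definition: str
    owner: str                       # accountable process owner
    implementations: list[Implementation]

party = GlossaryEntry(
    concept="PARTY",
    definition="A person or organisation of interest to the enterprise.",
    owner="Customer Relationship process owner",
    implementations=[
        Implementation("CRM", "dbo.Party", "sales data steward"),
        Implementation("lakehouse", "silver.party", "analytics data steward"),
    ],
)
print(party.concept, "->", [impl.location for impl in party.implementations])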


Business perspective on defining and implementing a business concept for a data mesh

FAQ on this process model

Why does the process owner initiate the process?

The reason is simple: process owners have a transversal view on the enterprise and are aware the organisation needs shareable concepts.

Do we still need class definitions and class diagrams in data lakehouses?

Yes. A great deal of data is still in structured “schema on write” form, and even unstructured or “schema on read” data may benefit from a class diagram that creates order in, and comprehension of, the underlying data. Even streaming analytics uses some tabular form to make the data exploitable.

What is the role of the taxonomy editor?

He or she will make sure the published concept is in sync with the overall knowledge categorisation, providing “the right path” to the concept.

Is there always need for a physical data model?

Sure, any conceptual data model can be physically implemented via a relational model, a NoSQL model in any of the flavours or a native graph database. So yes, if you want complete governance from business concept to implementation, the physical model is also in scope. 

Any questions?

Drop me a line or reply in the comments.

The next blog article Best Practices in Defining a Data Warehouse Architecture will focus on the place of a data warehouse in a data mesh.



Tuesday, 31 October 2023

Defining a Data Mesh

Zhamak Dehghani coined the concept of a data mesh in 2019. The data mesh is characterised by four important aspects:

  • Data is organised by business domain;
  • Data is packaged as a product, ready for consumption;
  • Governance is federated;
  • A data mesh enables self-service data platforms.

Below is an example of a data mesh architecture. The HQ of a multinational food marketer is responsible for the global governance of customers (i.e. retailers and buying organisations), assets (but limited to the global manufacturing sites), products (i.e. the composition of global brands) and competences that are supposed to be present in all subsidiaries. 

The metamodels are governed at the HQ and data for the EMEA Branch are packaged with all the necessary metadata needed for EMEA Branch consumption. These data products are imported in the EMEA Data Mesh where they will be merged with EMEA level data on products (i.e. localised and local brands), local competences, local customer knowledge and local assets like vehicles, offices…
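As an illustration of this “data packaging”, here is a minimal sketch of a data product descriptor that could travel with the data set from the HQ domain to the EMEA mesh. The field names and values are invented for the example; real implementations often express this contract as a YAML or JSON file alongside the data.

# Minimal sketch of a data product descriptor as packaged by the producing
# domain and consumed by a branch; all names and values are illustrative.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    domain: str                      # producing business domain
    owner: str                       # federated governance: the domain stays accountable
    schema: dict[str, str]           # column name -> type
    quality_checks: list[str] = field(default_factory=list)
    access_policy: str = "internal"  # information security / privacy classification

global_customers = DataProduct(
    name="global_customers",
    domain="HQ Customer Domain",
    owner="hq-customer-stewards@example.com",
    schema={"customer_id": "string", "buying_org": "string", "country": "string"},
    quality_checks=["customer_id is unique", "country is ISO 3166-1 alpha-2"],
    access_policy="EMEA-branch-read",
)
print(global_customers.name, "->", list(global_customers.schema))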


Example of a data mesh architecture, repackaging data from the HQ Domains into an EMEA branch package

The data producer’s domain knowledge and place in the organisation enables the domain experts to set data governance policies focused on business definitions, documentation, data quality, and data access, i.e. information security and privacy. This “data packaging” enables self-service use across an organisation.

This federated approach allows for more flexibility compared to central, monolithic systems. But this does not mean traditional storage systems, like data lakes or data warehouses cannot be used in a mesh. It just means that their use has shifted from a single, centralized data platform to multiple decentralized data repositories, connected via a conceptual layer and preferably governed via a powerful data catalogue.

The data mesh concept is easy to compare to microservices, which helps business audiences understand its use. This distributed architecture is particularly helpful in scaling data needs across complex organisations like multinationals, government agencies and conglomerates, but it is by no means a useful solution for SMEs, or even for larger companies that sell a limited range of products to a limited range of customer types.

In the next blog Start with defining coherent business concepts we will illustrate a data governance process, typical for a data mesh architecture. 

Tuesday, 24 October 2023

Why Data Governance is here to stay

Even more than a fairly stable Google Trend Index, what proves that Data Governance issues won’t go away is the fact that “Johnny-come-lately-but-always-catches-up-in-the-end” Microsoft is seriously investing in its data governance software. After leaving the playing field to innovators like Ataccama, Alation, Alex Solutions and Collibra, Microsoft is ramping up the functionality of its data catalogue product, Purview.

 

Google Trend Index on "Data Governance"

The reason for this is twofold: the emerging multicloud architectures and the advent of the data mesh architecture, both driving new data ecosystems for complex data landscapes.

Without firm data governance processes and software supporting these processes, the return on information would produce negative figures.

In the next blog, Defining a Data Mesh, I will define what a data mesh is about, and in the following blog articles I will suggest a few measures needed to avoid data swamps. Stay tuned!

Monday, 27 June 2016

Why Master Data Management is Not Just a Nice-to-have…

Sometimes the ideas for a blog just land on your desk without any effort. This time, all the effort was made by one of the world’s largest fast moving consumer goods companies, with 355,000 employees worldwide.

But this is no guarantee of smart process and data management, as the next experience from yours truly will illustrate.

The Anamnesis

One rainy day, the tenth of May, I received a mail piece with a nice promotional offer: buy a coffee machine for one euro while you order your exquisite cups online. On rainy days you take more time to read junk mail and sometimes you even respond to it. So I surfed to their website and filled out the order form. After entering the invoice data (VAT number, invoice address, …) an interesting question popped up:

Is your delivery address different from your invoice address?


As a matter of fact it was: it was the holiday season and the office was closed for a week, but I was at a customer’s site and thought it would be a good idea to have the machine delivered there.
So I ticked the box and filled in the delivery address. That’s when the horror started.
Because, when I hit the order button, there was no feedback after saving, no chance to check the order and, wham, there came the order confirmation by e-mail.
Oops: the delivery address and the invoice address were switched. Was this my fault or a glitch in the web form? Who cares; best practice in e-commerce is to leave the option open to change the order details or even cancel the order, right? Wrong. There was no way of changing the order; all I could do was call the free customer service number and hope to get the switch undone.


10th May, Call to Consumer Service Desk #1 


IVR: “Choose 2 if this is your first order”

Me: “2”
Client service agent: “What is your member number?”
Me: “I don’t have member number since this is my first order. It’s about order nr NAW19092… “
Client service agent: “hmmm we can’t use the order number to find your data. What is your postcode and house number?”
Me: “This is tricky since I want to switch delivery address with the invoice address. You know what, I’ll give you both”.
Client service agent: “I can’t find your order.”
So I am completely out of the picture: not via the company name, the address or the order number, let alone a unique identifier like the VAT number.
Client service agent: “Please send a mail to our service e-mail address “yyy@zzz.com”.
Me: I send the e-mail. Result: no receipt confirmation, no answer from this e-mail address. Great customer experience, guys!

10th May Call to Consumer Service Desk #2

Client service agent: “Oh Sir, you are calling the consumer line, you should dial YYY/YYYYYY for the business customers”
Me:" But that’s the only phone number on your website and the order confirmation???!!!"

10th May 2 PM Call to Business customer service #3

Client service agent: “Let me check if I can find your order”… (2’ wait time) “Yes, it’s here how can I help you?”
Me: “I want to switch the invoice with the delivery address”
Client service agent: “OK Sir, done”

11th May: The delivery service provider sends a message that the delivery is due at the original address from the order.

No switch had been made…

Call to DPD? Too late.. these guys were too efficient...


The Diagnosis, What Else?


Marketing didn’t have a clue about the order flow and launched a promotion without an end-to-end view of the process, which resulted in a half-baked online order process: no review of the order possible, no feedback, and the wrong customer service number on the order confirmation.

Data elements describing CUSTOMER, ORDER and PRODUCT may or may not be conformed (hard to validate from the outside), but they are certainly locked in functional silos: consumers and companies.
Customer service has no direct connection to the delivery process.
The shipping company (DPD) provided the best possible service under the circumstances.
And this is a major global player! Can you imagine how lesser gods screw up their online experience?


Yes, it can get worse!


One of my clients called me in on a project that was under way and was seriously going south.

What happened? The organisation had developed a back office application to support a public agenda of events. As a customer of this organisation you could contact the front desk, who would then log some data in the back office application and wrap up the rest of the process via e-mail. Each co-worker used his own “data standards” in Outlook, so every event had to be handled by the initial co-worker if the organisation wanted to avoid mistakes. No wonder some event logging processes took quite a while when the initiator was on holiday or on sick leave…
A few months later -keep that in mind- the organisation decided to push the front desk work to the web and guess what? Half the process flow and half the data couldn’t be supported by the back office application because the business logic applied by the front desk worker wasn’t analysed when developing the back office app.
Siloed application development can lead you to funny (but unworkable) products.


So, please, all you folks out there, invest some money in an end-to-end analysis of the process and the master data. It’s a fraction of the building cost and it will save you tons of money and spare you bad will with customers, co-workers and suppliers.






Friday, 20 May 2016

Afterthoughts on Data Governance for BI

Why Business Intelligence needs a specific approach to data governance


During my talk at the Data Governance Conference, at least one member of my audience was paying attention and asked me a pertinent question: “Why should you need a separate approach for data governance in Business Intelligence?”

My first reaction was: “Oops, I’ve skipped a few stages in my introduction…” So here’s an opportunity to set things right.

Some theory, from the presentation


At the conference, I took some time to explain the matrix below.
Data portfolio management as presented at the 2016 data governance conference in London

If you analyse the nature of the data present in any organisation, you can discern four major types. Let’s take a walk through the matrix with an ice cream producer as an example.

  • Strategic Data: critical to future strategy development; both forming and executing strategy are supported by the data. Almost by definition, strategic data are not in your process data, or at best they are integrated data objects derived from process data and/or external data. A simple example: (internal) ice cream consumption per vending machine matched with (external) weather data and an (external) count of competing vending machines and other competing outlets creates a market penetration index, which in its turn has predictive value for future trends.
  • Turnaround Data: critical to future business success, as today’s operations are not supported and new operations will be needed to execute. E.g.: new insulation methods and materials make ice cream fit for e-commerce. The company needs to assess the potential of this new channel as well as the potential cannibalising effect of the substitute product. In case the company decides not to compete in this segment, what are the countermeasures to ward off the competition? Market research will produce the qualitative and quantitative data that need to be mapped on the existing customer base and the present outlets.
  • Factory Data: critical to existing business operations. Think of the classical reports, dashboards and scorecards. For example: sales per outlet type in value and volume, inventory turnover… all sorts of KPIs that marketing, operations and finance want on their desk every week.
  • Support Data: these data are valuable but not critical to success. For instance reference data for vending locations, ice cream types and packaging types for logistics, and any other attribute that may cause a nuisance if it is not well managed.

If you look at the process data as the object of study in data governance, they fall entirely in the last two quadrants.

They contribute to decision making in operational, tactical and strategic areas but they do not deliver the complete picture, as the examples clearly illustrate. There are a few other reasons why data governance in BI needs special attention. If you want to discuss this further, drop me a line via the Lingua Franca contact form.