maandag 27 november 2023

Governing the Data Ingestion Process

“Data lakehousing” is all about good housekeeping your data. There is, of course, room for ungoverned data which are in a quarantine area but if you want to make use of the structured and especially the semi structured and unstructured data you’d better govern the influx of data before your data lake becomes a swamp producing no value whatsoever.

Three data flavours need three different treatments

Structured data are relatively easy to manage: profile the data, look for referential integrity failures, outliers, free text that may need categorising etc… In short: harmonise the data with the target model which can be one or more unrelated tables or a set of data marts to produce meaningful analytical data.

Semi structured data demand a data pipeline that can combine the structured aspects of clickstream or log files analysis with the less structured parts like search terms. It also takes care of matching IP addresses with geolocation data since ISPs sometimes sell blocks of IP ranges to colleagues abroad.

Unstructured data like text files from social media, e-mails, blogposts, document and the likes need more complex treatment. It’s all about finding structure in these data. Preparing these data for text mining means a lot of disambiguation process steps to get from text input to meaning output:

  • Tokenization of the input is the process of splitting a text object into smaller chunks known as tokens. These tokens can be single words or word combinations, characters, numbers, symbols, or n-grams.
  • Normalisation of the input: separating prefixes and/or suffixes from the morpheme to become the base form, e.g. unnatural -> nature
  • Reduce certain word forms to their lemma, e.g. the infinitive of a conjugated verb
  • Tag parts of speech with their grammatical function: verb, adjective,..
  • Parse words as a function of their position and type
  • Check for modality and negations: “could”, “should”, “must”, “maybe”, etc… express modality
  • Disambiguate the sense of words: “very” can be both a positive and a negative term in combination with whatever follows
  • Semantic role labelling: determine the function of the words in a sentence: is the subject an agent or the subject of an action in “I have been treated for hepatitis B”? What is the goal or the result of the action in “I sold the house to a real estate company”?
  • Named entity recognition: categorising text into pre-defined categories like person names, organisation names, location names, time denominations, quantities, monetary values, titles, percentages,…
  • Co-reference resolution: when two or more expressions in a sentence refer to the same object: “Bert bought the book from Alice but she warned him, he would soon get bored of the author’s style as it was a tedious way of writing.” In this sentence, “him” and “he” refer to “Bert”, “she” refers to “Alice” while “it” refers to “the author’s style”.

What architectural components support these treatments?

The first two data types can be handled with the classical Extract, Transform and Load or Extract, Load and Transform pipelines, in short: ETL or ELT. We refer to ample documentation about these processes in the footnote below[1].

But for processing unstructured data, you need to develop classifiers, thesauri and ontologies to represent your “knowledge inventory” as reference model for the text analytics. This takes up a lot of resources and careful analysis to make sure you come up with a complete, yet practical set of tools to support named entity recognition.

The conclusion is straightforward: the less structure predefined in your data, the more efforts in data governance are needed.

 

An example of a thesaurus metamodel

  

 

 

[1] Three reliable sources, each with their nuances and perspectives on ETL/ELT:

https://aws.amazon.com/what-is/etl/

https://www.ibm.com/topics/etl

https://www.snowflake.com/guides/what-etl

zaterdag 18 november 2023

Best Practices in Defining a Data Warehouse Architecture

This blogpost is part of a series of which the following posts have been published:

The opening statement

What is a data mesh?

Coherent business concepts keep the data relevant

In any data mesh architecture, the data warehouse is and will be a critical component for many reasons. First and foremost: some analytics need industrialised solutions, automating the entire flow from raw data tot finished reports. Structured data will always contribute to the analytical environment and will need a relational model to provide the foundation for analyses. In my experience, the most flexible and sustainable model is the process based star schema architecture from Ralph Kimball. In  one of my previous posts I have made the case for this approach.

 And in the context of a data lake project I positioned the Kimball approach as the best in class

The process diagram below tells the story of requirements gathering, ingesting all sorts of data in the lake and making the distinction between structured and unstructured data. Identifying the common dimensions and facts is crucial to make the concept work. Either you provide an increment to an existing data mart bus or you introduce a new process metrics fact table with foreign keys from existing and new dimensions.


Best practices in DWH
Managing structured and unstructured data in a data mesh environment

Making the case for the data warehouse as an endpoint of unstructured analysis

A lot of advanced analytics can be facilitated by the data lake. Think of text analytics, social media analytics and image processing. The outcomes of these analyses may find their way to the data warehouse. For example: polarity analysis in social media. Imagine a bank or a telecom provider capturing the social media comments on its performance. As we all know from customer feedback analysis, only the emotions two or three sigma away from the mean make it to social media. The client is either very satisfied or very dissatisfied and wants the world to know. Taking snapshots of the client’s mood and relating it to his financial or communication behaviour may yield interesting information. Already today, some banks are capturing their client’s mood to determine the optimum conditions to present their services. Aggregating these data may even provide macro-economic data correlating with the business cycle.

Have a look at the diagram below and imagine the business questions it can answer for you.

A high level star schema integrating social messages and their polarity with sales metrics

Think of time series: is there a some form of a leading indicator of sales in the polarity of this customer’s social messages?

If one of our products is the subject of a social media post, has this any (positive or negative) effect on sales of that particular product?

What social media sources have the greatest impact on our brand equity?

I am sure you will add your dimensions and business questions to the model. And by doing so you are realising one of the main traits of a data mesh: delivering data as a product.

I hope I have made my point clear: even in the most sophisticated data lakehouse supporting a data mesh architecture, the data warehouse is not going away.

In the next blog article we will focus on governing the data ingestion process.

Stay tuned!


zaterdag 11 november 2023

Start with Defining Coherent Business Concepts

Below is a diagram describing the governance process of defining and implementing business concepts in a data mesh environment. The business glossary domain is the user facing side of a data catalogue whereas the data management domain is the backend topology of the data catalogue. It describes how business concepts are implemented in databases, whether in virtual or persistent storage.

But first and foremost: it is the glue that holds any dispersed data landscape together. If you can govern the meaning of any data model, any implementation of concepts like PARTY, PARTY ROLE, PROJECT, ASSET and PRODUCT to name a few, the data can be anywhere, in any form but the usability will be guaranteed. Of course, data quality will be a local responsibility in case global concepts need specialisation to cater for local information needs.


Business perspective on defining and implementing a business concept for a data mesh

FAQ on this process model

Why does the process owner initiate the process?

The reason is simple: process owners have a transversal view on the enterprise and are aware they organisation needs shareable concepts.

Do we still need class definitions and class diagrams in data lakehouses?

Yes, since a great deal of data is still in a structured ”schema on write” form and even unstructured or “schema on read” data may benefit from a class diagram creating order in and  comprehension from the underlying data. Even streaming analytics use some tabular form to make the data exploitable.

What is the role of the taxonomy editor?

He or she will make sure the published concept is in synch with the overall knowledge categorisation, providing “the right path” to the concept.

Is there always need for a physical data model?

Sure, any conceptual data model can be physically implemented via a relational model, a NoSQL model in any of the flavours or a native graph database. So yes, if you want complete governance from business concept to implementation, the physical model is also in scope. 

Any questions you might have?

Drop me line or reply in the comments.

The next blog article Best Practices in Defining a Data Warehouse Architecture will focus on the place of a data warehouse in a data mesh.



dinsdag 31 oktober 2023

Defining a Data Mesh

Zhamak Dehgani cornered the concept of a data mesh in 2019. The data mesh is characterised by four important aspects:

  • Data is organised by business domain;
  • Data is packaged as a product, ready for consumption;
  • Governance is federated
  • A data mesh enables self-service data platforms.

Below is an example of a data mesh architecture. The HQ of a multinational food marketer is responsible for the global governance of customers (i.e. retailers and buying organisations), assets (but limited to the global manufacturing sites), products (i.e. the composition of global brands) and competences that are supposed to be present in all subsidiaries. 

The metamodels are governed at the HQ and data for the EMEA Branch are packaged with all the necessary metadata needed for EMEA Branch consumption. These data products are imported in the EMEA Data Mesh where they will be merged with EMEA level data on products (i.e. localised and local brands), local competences, local customer knowledge and local assets like vehicles, offices…


Example of a data mesh architecture, repackaging data from the HQ Domains into an EMEA branch package

The data producer’s domain knowledge and place in the organisation enables the domain experts to set data governance policies focused on business definitions, documentation, data quality, and data access, i.e. information security and privacy. This “data packaging” enables self-service use across an organisation.

This federated approach allows for more flexibility compared to central, monolithic systems. But this does not mean traditional storage systems, like data lakes or data warehouses cannot be used in a mesh. It just means that their use has shifted from a single, centralized data platform to multiple decentralized data repositories, connected via a conceptual layer and preferably governed via a powerful data catalogue.

The data mesh concept is easy to compare to microservices helping business audiences understand its use. As this distributed architecture is particularly helpful in scaling data needs across complex organizations like multinationals, government agencies and conglomerates, it is by no means a useful solution for SME or even larger companies that sell a limited range of products to a limited type of customers.

In the next blog Start with defining coherent business concepts we will illustrate a data governance process, typical for a data mesh architecture. 

dinsdag 24 oktober 2023

Why Data Governance is here to stay

More than a fairly stable Google Trend Index, proving that Data Governance issues won’t go away is the fact that “Johnny-come-lately-but-always-catches-up-in-the-end” Microsoft is seriously investing in its data governance software. After letting the playing field for innovators like Ataccama, Alation,  Alex Solutions and Collibra, Microsoft is ramping the functionality of its data catalogue product, Purview.

 

Google Trend Index "Data Governance"
Google Trend Index on "Data Governance"

The reason for this is twofold: the emerging multicloud architectures as well as the advent of the data mesh architecture driving new data ecosystems for complex data landscapes.

Without firm data governance processes and software supporting these processes, the return on information would produce negative figures.

In the next blog Defining a Data Mesh  I will define what a data mesh is about and in the following blog articles I will suggest a few measures needed to avoid data swamps. Stay tuned!

dinsdag 21 februari 2023

How will ChatGPT affect the practice of business analysis and enterprise architecture?

 

ChatGPT (Chat Generative Pre-trained Transformer), is a language model-based chat bot developed by OpenAI enabling users to refine and steer a conversation towards a desired length, format, style, level of detail, and language.  Many of my colleagues are assessing the impact of Artificial Intelligence products on their practice and the jury is still out there: some of them consider it a threat that will wipe out their business model and others see it as an opportunity to improve productivity and effectiveness of their practice.



I have a somewhat different opinion. Language training models use gigantic amounts of data to train the models but I am afraid if you want to use the Internet data you certainly have a massive amount of data but of dubious and not always verifiable quality.

General Internet data is polluted with commercial content, hoaxes and ambiguous statements that need strong cultural background analysis to make sense of it.

The data that has better quality than general Internet data is almost always protected by a copyright; Therefore use without permission is not always gentlemanlike to say the least.

Another source of training data are the whitepapers and other information packages you get in exchange for your data: e-mail, function, company,… These documents often start with stating a problem in a correct and useful way but then direct you to the solution delivered by their product.

The best practices in business analysis and enterprise architecture are -I am afraid- not on the Internet. They’re like news articles behind a paywall. So if you ask CHAT GPT a question like “Where can I find information to do business analysis for analytics and business intelligence?” You get superficial answers that –at best- provide a starting point to study the topic.


A screenshot of the shallow and casual reply. It goes on with riveting advice like “Stay Informed”, “Training and Certification”, “Networking”, “Documentation”,…

And the question “What are Best Practices in business intelligence” leads to the same level of platitudes and triteness:

  • “Align with business goals” who would have thought that?
  • “User involvement and collaboration” Really?
  • “Data Quality and Governance” Sure, but how? And when and where?

In conclusion: a professional analyst or enterprise architect has nothing to fear from ChatGPT. 


At best, it provides a somewhat more verbose and redacted answer to a question saving you the time to plough through over a billion answers from Google.