
Monday, 5 December 2022

Data Architecture as a Consequence of Organisation Design

 

Lingua Franca was involved in the data architecture of an organisation whose name and type are of no interest to the case I am making: namely, that the way an organisation functions and is structured determines its data architecture. It is a textbook example of many organisations today.

The organisation was a merger of various business units which all used their own proprietary business processes, data standards and data definitions.

The CIO had a vision of well governed, standardised processes that would create a unified organisation that operated in a predictable and transparent manner.

Harmonised End to End Processes Are the Basis of Transparent Decision Making

Common dimensions and common facts

Shared facts and dimensions assure a scalable and manageable analytics architecture

The case for a Kimball approach in data warehousing was clear: if every department and every knowledge unit used the same processes, the shared-facts-and-common-dimensions architecture was a no-brainer.

As the diagram suggests: it takes effort to make sure everybody is on the same page about the metrics and the dimensions but once this is established, new iterations will go smoothly and build trust in the data.
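The pay-off of agreeing on shared facts and conformed dimensions can be sketched in a few lines of Python; the table contents and names below are invented for illustration. Two fact tables that reference the same conformed product dimension roll up to the same category labels with no mapping effort at all:

```python
from collections import defaultdict

# Hypothetical conformed product dimension shared by two fact tables.
# Because both facts reference the same product keys, their metrics
# aggregate to identical category labels without reconciliation work.
dim_product = {1: "sets", 2: "singles"}          # product_key -> category
fact_sales = [(1, 100), (1, 150), (2, 80)]       # (product_key, revenue)
fact_shipments = [(1, 5), (2, 3), (2, 4)]        # (product_key, units shipped)

def rollup(fact, dim):
    """Aggregate a fact table along a conformed dimension."""
    totals = defaultdict(int)
    for key, measure in fact:
        totals[dim[key]] += measure
    return dict(totals)

revenue_by_category = rollup(fact_sales, dim_product)
units_by_category = rollup(fact_shipments, dim_product)
```

Once the dimension is agreed on, every new fact table that references it becomes instantly comparable with the existing ones; that is the "smooth new iterations" effect.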

For more than four years, resistance to change wore down the CIO, the data warehouse team and finally the data architect, until the CIO left the organisation. The new CIO decided not to continue the fight for harmonised processes and saw in this a reduced need for a data warehouse: if every business unit used its own operational reporting, it would produce rapid results at a far lower cost than a data warehouse foundation delivering the reports. A new crew was onboarded: two ETL developers, two front-end developers and a data architect.

Satisfying Clients in Their Operational Silos Creates Technical Debt

A third normal form data model for operational reporting

Cutting corners for fast delivery creates technical debt that needs to be repaid

As this diagram suggests, the client defines his particular needs: he asks for a report not at SKU level, because he is only interested in product sets. The sets require special handling, so they are linked to specific shippers, each with their own delivery areas. Although this schema may cause no problems for the front-end developer producing a nice-looking report, consolidating the information at corporate level will take time and effort.
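A minimal sketch of such an operational schema, with invented names, shows both why the report is easy and where the debt hides: product sets link to shippers, which own delivery areas, but there is no SKU-level or corporate-wide key to consolidate on later.

```python
from dataclasses import dataclass

# Hypothetical 3NF-style tables behind the operational report:
# product sets (no SKU detail) reference shippers, which own delivery areas.
@dataclass(frozen=True)
class Shipper:
    shipper_id: int
    name: str
    delivery_area: str

@dataclass(frozen=True)
class ProductSet:
    set_id: int
    name: str
    shipper_id: int

shippers = {1: Shipper(1, "FastCargo", "North"),
            2: Shipper(2, "SlowBoat", "South")}
product_sets = [ProductSet(10, "Starter kit", 1),
                ProductSet(11, "Pro kit", 2)]

# The report the client asked for: product sets per delivery area.
report = [(s.name, shippers[s.shipper_id].delivery_area)
          for s in product_sets]
```

The front-end developer can render this in an afternoon; the debt is everything the model leaves out, which someone must refactor in before corporate aggregation becomes possible.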

Reality will prove otherwise, of course. If every business unit uses its own definitions, metrics and dimensions, there is no chance of having correct, aggregated information for strategic decision making. To remedy this shortcoming, the new data architect will have to go back to 2008, the publication year of Bill Inmon's DW 2.0. The idea is to create the operational report as fast as possible and, after delivering the product, refactor the underlying data to make them compatible with the data used in previous reports.

The result is a serious governance effort, lots of rework and an ever-growing DW 2.0 model in the third normal form that one day may contain sufficient enterprise-wide data to produce meaningful aggregates for strategic direction. The Corporate Information Factory (CIF) revisited, so to speak.

Why the CIF Never Realised Any Value

In Inmon’s world, it was recommended to build the entire data warehouse before extracting any data marts. These data marts are aggregates, based on user profiles or functions in the organisation and are groupings of detailed data that may change over time.

This led to many problems on the sites I have visited during my career as a business analyst and data architect.

First and foremost: by the time you have covered the entire scope of the CIF, the world has changed, and you have to refactor entire parts of the data model and reload quite a bit of data to stay in sync with new realities. Doing this on a 3NF data schema can be quite complex and time- and resource-consuming. And then there is the data mart management problem: if requirements for aggregations change over time, keeping track of historical changes in aggregations and trends is a real pain.


About DW 2.0: the Data Quagmire



To anyone who hasn’t read this book: it’s the last attempt of the “father of data warehousing” to defend his erroneous Corporate Information Factory (CIF), adding some text data to a structured data warehouse in the third normal form. The book is full of conceptual drawings but that is all they are; not one implementation direction follows up on the drawings. Compare this to the Kimball books where every architectural concept is translated into SQL scripts and clear instructions and you know where the real value is.

With DW 2.0 the organisation is trying to salvage some of the operational reports' value, but at a cost significantly higher than respecting the principle "Do IT right the first time". The only good thing about this new approach is that nobody will notice the cost overrun, because it is spread over numerous operational reports over time. Only when the functional data marts need rebuilding may some people notice the data quagmire the organisation has stepped into.

Conclusion, to paraphrase A.D. Chandler: Data structure follows strategy



Wednesday, 16 June 2021

Managing a Data Lake Project Part II: A Compelling Business Case for a Governed Lake

 

In Part I, A Data Lake and its Capabilities, we already hinted at a business case, but in this blog we make it a little more explicit.




A recap from Part I: the data lake capabilities

The business case for a data lake has many aspects, and some of them present sufficient rationale on their own; that depends, of course, on the actual situation and context of your organisation. Therefore I mention eleven rationales, but feel free to add yours in the comments.

We are mixing on-premise data with Cloud-based systems, which creates new silos

The Cloud providers deliver software for easily switching your on-premise applications and databases to Cloud versions. But there are cases where this isn't possible in one fell swoop:

  • Some applications require refactoring before moving them to the Cloud;
  • Some are under such strict information security constraints that even the best Cloud security can't be relied on. I know of a retailer who keeps his excellent logistics system in something close to a bunker!
  • Sometimes the budget or the available skills are insufficient to support a 100 % Cloud environment, etc.

This already provides a very compelling business case for a governed data lake: a catalogue that manages lineage and meaning will make the transition smoother and safer.

Master data is a real pain in siloed data storage, as is governance...

A governed data lake can improve master data processes by involving the end users in intuitively evaluating what's in the data store. By using both predefined data quality rules and machine learning to detect anomalies and implicit relationships in the data, as well as defining the golden record for objects like CUSTOMER, PRODUCT, REGION,… the data lake can unlock data even in technical and physical silos.
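The golden-record idea can be made concrete with a small sketch; the survivorship rule and all record contents below are hypothetical, chosen only to illustrate the mechanism. Two silos hold conflicting CUSTOMER records, and a field-by-field rule decides which value survives:

```python
from datetime import date

# Hypothetical CUSTOMER records for the same customer, from two silos.
records = [
    {"id": "C1", "email": "ann@example.com", "phone": "",
     "updated": date(2021, 3, 1), "source": "crm"},
    {"id": "C1", "email": "", "phone": "+32 2 555 01 23",
     "updated": date(2021, 6, 1), "source": "webshop"},
]

def golden_record(recs):
    """Field-by-field survivorship: the newest non-empty value wins."""
    merged = {}
    for rec in sorted(recs, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if value not in ("", None):
                merged[field] = value
    return merged

customer = golden_record(records)
```

Real master data tools layer data quality rules and machine-learned matching on top of this, but the core question is the same: per field, which silo do we trust?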

We now deal with new data processing and storage technologies other than ETL and relational databases: NoSQL, Hadoop, Spark or Kafka to name a few

NoSQL has many advantages for certain purposes but from a governance point of view it is a nightmare: any data format, any level of nesting and any undocumented business process can be captured by a NoSQL database.
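This is exactly where a catalogue has to work backwards: it must infer structure that the NoSQL store never enforced. A minimal sketch, with an invented order document, walks a nested record and lists every field path and its value type, which is the least a catalogue needs before it can attach meaning:

```python
# Infer a flat "schema" from a nested NoSQL-style document: record every
# field path and the Python type name of its value.
def infer_paths(doc, prefix=""):
    paths = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.update(infer_paths(value, path))   # recurse into nesting
        else:
            paths[path] = type(value).__name__
    return paths

# Hypothetical document as it might sit in a document store.
order = {"order_id": 42,
         "customer": {"name": "Ann", "address": {"city": "Ghent"}},
         "lines": [{"sku": "A1", "qty": 2}]}

schema = infer_paths(order)
```

Run over millions of documents, an inference like this also exposes the governance nightmare: each document can add new paths, so the "schema" is a moving target that only a catalogue can keep track of.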

Streaming (unstructured) data is unfit for a classical ETL process which supports structured data analysis so we need to combine the flexibility of a data lake ingestion process with the governance capabilities of a data catalogue or else we will end up with a data swamp.

We don't have the time, nor the resources to analyse up front what data are useful for analysis and what data are not

There is a shortage of experienced data scientists. Initiatives like applications to support data citizens may soften the pain here and there but let’s face it, most organisations lack the capabilities for continuous sandboxing to discover what data in what form can be made meaningful. It’s easier to accept indiscriminately all data to move into the data lake and let the catalogue do some of the heavy lifting.

We need to scale horizontally to cope with massive, unpredictable bursts of data

Larger online retailers, event organisations, government e-services and other public facing organisations can use the data lake as a buffer for ingesting massive amounts of data and sort out its value in a later stage.  

We need to make a rapid and intuitive connection between business concepts and data that contribute, alter, define or challenge these concepts

This has been my mission for about three decades: to bridge the gap between business and IT and as far as “classical” architectures go, this craft was humanly possible. But in the world of NoSQL, Hadoop and Graph databases this would be an immense task if not supported by a data catalogue.  

Consequently, we need to facilitate self-service data wrangling, data integration and data analysis for the business users

A governed data lake ensures trust in the data, trust in what business can and can't do. This can speed up data literacy in the organisation by an order of magnitude.

We need to get better insight in the value and the impact of data we create, collect and store.

Reuse of well-catalogued data will enable this: end users will contribute to the evaluation of data and automated meta-analysis of data in analytics will reinforce the use of the best data available in the lake. Data lifecycle management becomes possible in a diverse data environment.

We need to avoid fines like those stipulated in the EU's GDPR, which can amount to up to 4% of annual turnover!

Data privacy regulations require functionality that supports "security by design", which a governed data lake delivers. Data pseudonymisation, data obfuscation or anonymisation come in handy when these functions are linked to security roles and user rights.
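One common pseudonymisation technique, sketched here under assumptions (the key name and customer IDs are made up), is keyed hashing: analysts can still join and count on the token, but only the key holder can link it back to a person.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this lives in a vault and is rotated.
SECRET_KEY = b"rotate-me-and-store-in-a-vault"

def pseudonymise(customer_id: str) -> str:
    """Deterministic keyed hash (HMAC-SHA256): the same input always yields
    the same token, but the token cannot be reversed without the key."""
    return hmac.new(SECRET_KEY, customer_id.encode(), hashlib.sha256).hexdigest()

token_a = pseudonymise("BE-12345")   # same customer ...
token_b = pseudonymise("BE-12345")   # ... same token, so joins still work
token_c = pseudonymise("BE-67890")   # different customer, different token
```

Linking this function to security roles, as the governed lake does, means most users only ever see the tokens.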

We need a clear lineage of the crucial data to comply with stringent laws for publicly listed companies

Sarbanes-Oxley and Basel III are examples of legislation that require accountability at all levels and in all business processes. Data lineage is compulsory in these legal contexts.
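At its simplest, lineage answers the auditor's question "which source systems fed this figure?". A sketch with invented dataset names shows the traversal a catalogue performs; real catalogues add transformations, timestamps and owners to each edge:

```python
# Hypothetical lineage graph: dataset -> the datasets it was derived from.
lineage = {
    "finance.quarterly_report": ["finance.revenue", "finance.costs"],
    "finance.revenue": ["erp.invoices"],
    "finance.costs": ["erp.purchase_orders", "hr.payroll"],
}

def sources(dataset):
    """Recursively resolve a dataset back to its ultimate source systems."""
    parents = lineage.get(dataset)
    if not parents:                      # no entry: it is itself a source
        return {dataset}
    result = set()
    for parent in parents:
        result |= sources(parent)
    return result

audit_trail = sources("finance.quarterly_report")
```

Keeping this graph complete and current is what makes the governed lake audit-ready; without it, every compliance question becomes an archaeology project.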

But more than all of the above IT based arguments, there is one compelling business case for C-level management: speeding up the decision cycle time and the organisation’s agility in the market.

Whether this market is a profit generating market or a non-profit market where the outcomes are beneficial to society, speeding up decisions by tightening the integration between concepts and data is the main benefit of a governed data lake.

Anyone who has followed the many COVID-19 kerfuffles, the poor response times and the quality of the responses to the pandemic sees the compelling business case:

  • Rapid meta-analysis of peer reviewed research papers;
  • Social media reporting on local outbreaks and incidents;
  • Second use opportunities from drug repurposing studies;
  • Screening and analysing data from testing, vaccinations, diagnoses, death reports,…

I am sure medical professionals can come up with more rationales for a data lake, but you get the gist of it.

So, why is there a need for a special project management approach to a data lake introduction? That is the theme of Part III.  But first, let me have your comments on this blogpost.