Wednesday 9 June 2021

Managing a Data Lake Project Part I: A Data Lake and its Capabilities

A data lake can provide us with the technology to cope with the challenges of various data formats arriving in massive amounts, too fast and too diverse for a classic data pipeline resulting in a data warehouse. As the data warehouse is optimised for analysis of structured data, the inflow of unstructured data strings, entire documents, JSONs with n levels of nesting, binaries, etc. is simply too much for it.

A data lake is an environment that manages any type of data from any type of source or process in a transparent way for the business. In tandem with a data catalogue, a lake provides data governance and facilitates data wrangling, trusted analytical capabilities and self-service analytics, to name a few.

If we zoom in on these capabilities, we can list these as the basic requirements for a minimum viable product:

  • Automated discovery, cataloguing and classification of ingested data;
  • Collaborative options for evaluating the ingested data;
  • Governance of quality, reliability, security and privacy aspects as well as lifecycle management;
  • Facilitation of data preparation for analytical purposes in projects as well as for unsupervised and spontaneous self-service analytics;
  • An intuitive search and discovery platform for the business end users;
  • Archiving of data where and when necessary.

 

Generic data processing map
Data comes from events that lead to business processes as well as from outside events that may become part of the business processes

Some vendors launch the term “data marketplace” to stress the self-service aspects of a data lake. But this position depends on the analytical maturity of the organisation. If introduced too early it may provide further substantiation for the claim that:

“Analytics is a process of ingesting, transforming and preparing data for publication and analysis to end up in Excel sheets, used as ‘proof’ for a management hypothesis”.

What makes a data lake ready for use?

Metadata: data describing the data in the lake: its provenance, the data format(s), the business and technical definitions, …;

Governance: business and IT control over meaning, application and quality of data as well as information security and data privacy regulation;

Cataloguing: either by machine learning or by precooked categories and rule engines, data is sorted and ordered according to meaningful categories for the business;

Structuring: data increases in meaning if relationships with other concepts are modelled in hierarchies, taxonomies and ontologies;

Tagging: both governed and ungoverned tags (i.e. user tags) dramatically improve the usability of the ingested data. If these tags are evaluated on practical use by the user community they become part of a continuous quality improvement process;

Hierarchies: identical to tagging, there may be governed and personal hierarchies in use;

Taxonomies: systematic hierarchies, based on scientific methods;

Ontologies: a set of concepts and categories in a subject area or data domain that shows their properties and the relations between them to model the way the organisation sees the world.
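
As an illustration of how metadata and tagging can work together, here is a minimal Python sketch of a catalogue entry. The class name, the field names and the endorsement threshold are assumptions made for this example; they do not refer to any particular catalogue product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """Hypothetical metadata record for one data set in the lake."""
    name: str
    provenance: str                 # where the data came from
    data_format: str                # e.g. "json", "csv", "binary"
    business_definition: str = ""
    governed_tags: set = field(default_factory=set)   # maintained by data stewards
    user_tags: dict = field(default_factory=dict)     # tag -> number of endorsements

    def add_user_tag(self, tag: str) -> None:
        """Register a user tag, or count one more endorsement of it."""
        self.user_tags[tag] = self.user_tags.get(tag, 0) + 1

    def promotable_tags(self, min_endorsements: int = 3) -> list:
        """User tags endorsed often enough to be proposed for governance."""
        return sorted(
            tag for tag, votes in self.user_tags.items()
            if votes >= min_endorsements and tag not in self.governed_tags
        )


# Illustrative data only.
entry = CatalogueEntry(
    name="webshop_clickstream",
    provenance="web server logs",
    data_format="json",
    governed_tags={"customer", "online"},
)
for tag in ["basket", "basket", "basket", "online", "test"]:
    entry.add_user_tag(tag)

print(entry.promotable_tags())
```

Counting endorsements is one simple way to let the user community evaluate tags on practical use; a real lake would probably add ownership, timestamps and access rules.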


Saturday 29 May 2021

Managing a Data Lake Project

With the massive growth of online generated data and IoT data, unstructured and semi-structured data constitute the bulk of the data that needs to be analysed. Whereas a 50 Gigabyte data warehouse to facilitate analysis of structured data was quite an achievement up to now, this number is dwarfed by the unstructured and semi-structured data avalanche.

Data Avalanche?


Yes, because compared to the steady stream of data from transaction processing systems, we now have to deal with irregular flows and massive bursts of incoming data that needs to be adequately processed to provide meaning to the data.
New data sources emerge, other than social media and IoT data, like smart machines and machine learning systems generating new data based on existing sources. Managing various data types and metadata in impressive volumes is just one of the technical aspects that can be solved by technology. The HR, legal and organisational aspects are a level more complex, but these aspects are not in scope of this series of blog posts.
We are adding extra process and event based decision support to our management capabilities and that alone is worth the cost, the trouble and the change management efforts to introduce a data lake.

See you at the Webinar!

On Wednesday 9 June you can tune in to a short webinar hosted by the Great IT Professional. You can still register via this link. The webinar will be followed by a series of articles on how to manage the Data Lake project. Stay tuned!

Bert Brijs Webinar on Managing a Data Lake Project


Wednesday 30 December 2020

New Inroads for Analytics in the Post Corona Era

 

OK, 2021 will not get rid of the virus immediately, but the new consumer behaviour induced by the pandemic will have lasting effects that need to be taken care of by brand owners, distribution channels and, consequently, by the analytics approach and infrastructure.

So, what is exactly this new consumer behaviour?

You already guessed: more online shopping, and more local shops switching to web shops to compete with the incumbents. The local shop owners have finally understood the value of proximity combined with the convenience of online browsing and online ordering, or preordering and collecting the order at the local shop.

But there’s more. Not only have the predominant shopping logistics changed; the product range has also undergone the influence of the various lockdown periods. Consumers have a tendency to shop for more luxury products in the food section as a means of self-indulgence, and the dichotomy between convenience and fun shopping is getting clearer and larger. Some retail chains are already experimenting with automatic replenishment of convenience products using automated algorithms. But some supermarkets in the Benelux are combining convenience, fun and self-indulgence by offering prepared meals that can be consumed in the shop. Plus, Albert Heijn and Jumbo are experimenting with the concept. This can have an impact on local restaurants that have survived on their takeaway service during the pandemic.

Due to Covid-19 this section where you can have a meal at a Plus supermarket is closed...

And how does this emerging consumer behaviour affect the analytics profession?

The larger distribution chains will continue to develop their centralised analytical systems. The data flow from the outlets’ cash registers to the central data warehouse and deliver customer and product insights, as has been the case since AC Nielsen built the first embryo of a retail data warehouse somewhere in the seventies.

New opportunities for innovation in analytics for large retailers lie in edge computing. Think of directed dialogues with the customer, analysing conversion rates from looking at products, holding them, inspecting them and finally putting the product in the shopping trolley, and feeding this back into the pricing and communication in the aisles.
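
To make the conversion idea concrete, here is a small Python sketch that computes the conversion rate between consecutive steps, from looking at a product to putting it in the trolley. The stage names and the counts are invented for illustration; how an edge device would actually register these events is outside the scope of this sketch.

```python
# Hypothetical in-store funnel stages, in the order a shopper passes through them.
STAGES = ["looked", "held", "inspected", "in_trolley"]


def stage_conversion(counts):
    """Return the conversion rate between each pair of consecutive stages."""
    rates = {}
    for previous, following in zip(STAGES, STAGES[1:]):
        base = counts[previous]
        # Guard against division by zero when a stage has no observations.
        rates[f"{previous}->{following}"] = counts[following] / base if base else 0.0
    return rates


# Invented counts for one product on one day.
counts = {"looked": 200, "held": 80, "inspected": 40, "in_trolley": 10}

for step, rate in stage_conversion(counts).items():
    print(f"{step}: {rate:.0%}")
```

A weak step in such a funnel is the kind of signal that could be fed back into pricing or in-aisle communication.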

Now, as local shops discover the value of customer data, syndicators will emerge to provide economies of scale and of scope to aggregate data of the local shops and provide benchmarks and high level customer insights as a first deliverable. It will take some serious investments in persuading the local shops to share their data but it will happen in the next three to five years. My experience with a data warehouse project for an association of independent retailers tells me it’s doable if you mimic the architecture of epidemiological analytics. These systems have the highest levels of information security combined with state of the art analytical capabilities. And so another product of this pandemic may contribute to new analytical solutions.

But the major shift in the analytics landscape is happening with the brand owners. Up to now, most brand owners were OK with the idea that customer behaviour data resided in the systems of large retailers. Some of the clever ones developed a data sharing approach with the retailers accepting the possibility of a biased view on their final customers.

Now the need for massive customer data for brand owners is unavoidable. New ways of collecting unfiltered customer data will emerge. Smartphones, Fitbits and other devices will have new roles to play in this strategic movement.

 

 

 

 


Wednesday 30 October 2019

Enterprise Architectures for Artificial Intelligence (III)


Taxonomies of Artificial Intelligence

There are at least five ways to position AI in the enterprise landscape:
  1. By processing method: batch, micro batch and real time
  2. By algorithm type: pattern recognition, clustering, associations, scoring, predictive, classification, text, speech and image mining, …
  3. By data type: high vs low dimensionality, graph data, self-describing data vs structured schema data, machine vs human sourced data, mediated data registration vs direct data registration,…
  4. By data behaviour: volatile vs stable data values, long vs short data persistency, …
  5. By analytics goals and/or business process: churn prevention, prospect qualification, complex evaluations of loan applications, CVs, customer feedback, basket analysis, next best action proposals, fraud detection, …

The enterprise architect will choose the relevant combinations between these taxonomies to produce a coherent end-to-end vision on the architecture. A possible selection criterion is the governance model used in the organisation. In a business monopoly, analytics goals will be leading and combined with algorithm type. In an IT monopoly, processing methods combined with data behaviour is the most probable direction, and in a duopoly, well… that depends.


Let’s do an exercise and suppose this is the outcome of a duopoly governance model: combining the processing method with the algorithm type to indicate which processing method is most suited for the chosen algorithm type. Using this schema may help to manage expectations between the business and the IT people better.



Algorithm type | Batch | Micro batch | Real time
--- | --- | --- | ---
Pattern recognition | Ideal method for large data sets | Suited for simple patterns | Only as a binary in/out of pattern decision which implies a large (batch) training set
Clustering | Ideal method for large data sets | Suited for simple clustering criteria | Only as Y or N adherence to an existing cluster which implies a large (batch) training set
Associations | Ideal method | Hardly possible | Impossible
Scoring | Develop a base line | Adjust the base line | Score against the base line
Predictive | Develop a base line | Adjust the base line | Match with the trend
Classification | Train the dataset | Classify new data | Simple classification
Text mining | Train the dataset | Reveal polarity, topics, etc. | Deliver alerts
Speech mining | Train the dataset | Reveal polarity, topics, etc. | Deliver alerts
Image & video mining | Train the dataset | Classify images | Deliver alerts




From this crosstab, it becomes possible to position the concrete algorithms, the data sets and their life cycle management, the ingestion volumes, timing and the technology to deliver on the various promises made.
Other methods will give you paths to the same end result: a coherent and methodical inventory of the landscape, linking business processes to AI and data mining initiatives and routines as well as the data and the applications to deliver the goods. Based on a gap analysis, the enterprise architect can develop a roadmap that communicates with all parties concerned.
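
For readers who want to use the crosstab in an inventory tool, here is a minimal Python sketch that stores a few of its rows as a lookup structure. The variable and function names are my own invention for this example, and only three of the nine rows are copied in to keep it short.

```python
BATCH, MICRO_BATCH, REAL_TIME = "batch", "micro batch", "real time"

# A few rows of the crosstab: algorithm type -> processing method -> role.
CROSSTAB = {
    "associations": {
        BATCH: "Ideal method",
        MICRO_BATCH: "Hardly possible",
        REAL_TIME: "Impossible",
    },
    "scoring": {
        BATCH: "Develop a base line",
        MICRO_BATCH: "Adjust the base line",
        REAL_TIME: "Score against the base line",
    },
    "classification": {
        BATCH: "Train the dataset",
        MICRO_BATCH: "Classify new data",
        REAL_TIME: "Simple classification",
    },
}

# Cell values that mark a combination as not worth pursuing.
NOT_VIABLE = {"hardly possible", "impossible"}


def viable_methods(algorithm_type):
    """List the processing methods the crosstab considers workable."""
    row = CROSSTAB[algorithm_type]
    return [method for method, role in row.items() if role.lower() not in NOT_VIABLE]


for algorithm in CROSSTAB:
    print(algorithm, "->", viable_methods(algorithm))
```

Such a structure makes it easy to check a proposed initiative against the agreed expectations before any technology is selected.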

Monday 30 September 2019

Enterprise Architectures for Artificial Intelligence (II)


A generic model for primary processes



Every organisation is unique but most organisations share some basic principles in the way they operate. Business processes have some form (between 5 and 100%) of support by online transaction systems (OLTP). Business drivers like consumer demand, government regulations, special interest groups, technological evolutions, availability of raw materials and labour and many others influence the business processes intended to deliver a product or service that meets market demand within a set of constraints. These constraints can range from enforcing regulatory bodies to voluntary self-regulation and measures inspired by public relations objectives.
This is a high level approach of how AI can support business processes


AI and enterprise architecture
High level generic architecture

Business drivers are at the basis of business processes to realise certain business goals and deliver products for an internal or external customer. These processes are supported by applications, the so-called online transaction processing (OLTP) systems.
Business process owners formulate an a priori scoring model that is constantly adapted by both microscopic transaction data as well as historic trend data from the data warehouse (DWH). Both data sources can blend into decision support data, suited for sharply defined data requirements as well as vague assumptions about their value for decision making.  The decisions at hand can be either microscopic or macroscopic. 
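
One possible reading of such a constantly adapted a priori scoring model is sketched below in Python. The class, the weights and the update rules (a weighted blend for the DWH trend, a small step towards each transaction outcome) are assumptions chosen for illustration; the model described above does not prescribe any specific formula.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveScore:
    """Hypothetical a priori score that is nudged by new evidence."""
    prior: float                  # the process owner's a priori estimate
    trend_weight: float = 0.5     # trust placed in the historic DWH trend
    learning_rate: float = 0.1    # how fast one transaction moves the score

    def __post_init__(self):
        # The working score starts at the a priori value.
        self.score = self.prior

    def apply_trend(self, historic_trend: float) -> float:
        """Blend the current score with a historic trend value from the DWH."""
        self.score = (
            (1 - self.trend_weight) * self.score
            + self.trend_weight * historic_trend
        )
        return self.score

    def apply_transaction(self, observed: float) -> float:
        """Move the score a small step towards one observed transaction outcome."""
        self.score += self.learning_rate * (observed - self.score)
        return self.score


# Invented numbers: an a priori estimate, one trend value, one transaction.
model = AdaptiveScore(prior=0.2)
model.apply_trend(0.4)
model.apply_transaction(1.0)
print(round(model.score, 2))
```

The two update methods mirror the two data sources: macroscopic trend data move the score in larger, less frequent steps, while microscopic transactions adjust it continuously.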

Introducing AI in the business processes


As an architect, one of the first decisions to make is whether and when AI becomes relevant enough to become part of routine business processes. There are many AI initiatives in organisations, but the majority are still in R&D mode or, at best, in project mode. It takes special skills to determine when the transition to routine process management can provide some form of sustainable added value.
I am not sure if these skills are all determined and present in the body of knowledge of architects but here are some proposals for the ideal set of competences.
  • A special form of requirements management, which you can only master if the added value as well as the pitfalls of AI in business processes are thoroughly understood;
  • As a consequence, the ability to produce use cases for the technology;
  • Mastery of the various taxonomies to position AI correctly and obtain maximum value from the technology (more on this in a next post);
  • Clear insight into the lifecycle management of the various analytical solutions in terms of data persistency, tuning of the algorithm and translation into appropriate action(s).


In the next post, I will elaborate a bit more on the various taxonomies to position AI in the organisation. 



Thursday 19 September 2019

Enterprise Architectures for Artificial Intelligence (I)


In the past three decades, I have seen artificial intelligence (AI) coming and going a couple of times. From studying MYCIN via speech technology in Flanders Language Valley to today’s machine learning and heuristics as used by Textgain from Antwerp University, the technology is here to stay this time.
Why? Because the cost of using AI has fallen dramatically, not just in terms of hardware and software but also in terms of acquiring the necessary knowledge to master the discipline.
Yet most of the AI initiatives are still very much in the R&D phase or are used in limited scope. But here and there, e.g. in big (online) retail and telecommunications, AI is gaining traction at enterprise level. And through APIs, open data and other initiatives, AI will become available for smaller organisations in the near future.
To make sure this effort has a maximum chance of success, CIOs need to embed this technology in an enterprise architecture covering all aspects: motivations, objectives, requirements and constraints, business processes, applications and data.
Being fully aware that I am treading on uncharted territory, this article is, for now, my state of the art.

Introducing AI in the capability map

AI will enhance our capabilities in all areas of Treacy & Wiersema’s model, probably in a certain order. First comes operational excellence as processes and procedures are easier to describe, measure and monitor. Customer intimacy is the next frontier as the existing discipline of customer analytics lays the foundation for smarter interactions with customers and prospects.
The toughest challenge is in the realm of product leadership. This is an area where creativity is key to success. There is an approximation of creativity using what I call “property exploration”, where all possible properties of a product, a service, a marketing or production plan are mapped in a dimensional model and an automatic Cartesian product of all levels or degrees of each property with all the other properties is evaluated for cost and effectiveness. Sales pitch: if you want more information about this approach, contact us.
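
A toy version of property exploration can be sketched in a few lines of Python. The properties, their levels and the cost and effectiveness figures are invented, and the sketch assumes that cost and effectiveness simply add up across properties, which a real evaluation would rarely allow.

```python
from itertools import product

# Hypothetical properties of a product, each with its possible levels.
properties = {
    "packaging": ["carton", "glass"],
    "size": ["small", "large"],
    "channel": ["shop", "online"],
}

# Invented cost and effectiveness contribution of each level.
cost = {"carton": 1, "glass": 3, "small": 2, "large": 4, "shop": 2, "online": 1}
effect = {"carton": 2, "glass": 5, "small": 3, "large": 4, "shop": 3, "online": 4}


def explore(properties, cost, effect):
    """Evaluate every combination of levels and rank by effect per unit of cost."""
    names = list(properties)
    results = []
    # The Cartesian product yields one tuple of levels per possible variant.
    for levels in product(*(properties[name] for name in names)):
        variant = dict(zip(names, levels))
        total_cost = sum(cost[level] for level in levels)
        total_effect = sum(effect[level] for level in levels)
        results.append((total_effect / total_cost, variant))
    results.sort(key=lambda result: result[0], reverse=True)
    return results


ranked = explore(properties, cost, effect)
best_ratio, best_variant = ranked[0]
print(len(ranked), "variants evaluated; best:", best_variant, best_ratio)
```

The number of variants grows multiplicatively with every property added, so in practice some pruning or sampling would be needed.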

Capabilities and AI
Capabilities where state of the art AI can play a significant role
Examples of capabilities where AI can play a defining role. Some of these capabilities are already well supported, to name a few: inventory management (automatic replenishment and dynamic storage), cycle time management (optimising man-machine interactions), quality management (visual inspection systems), churn management (churn prediction and avoidance in CRM systems), yield management (price, customer loyalty, revenue and capacity optimisation) and talent management (mining competences from CVs).

Areas where AI is coming of age: loyalty management and competitive intelligence, R & D management and product development.

In the next post I will discuss a generic architecture for AI in support of primary processes. Stay tuned and… share your insights on this topic!