Knowledge Base - Witboost

Power up your metadata layer and automate your Data Mesh with this

Written by Witboost Team | 4/27/22 8:33 PM

Just like every informative content in the Data Mesh area should do, I’d like to start quoting the reference article on the topic, by Zhamak Dehghani:

[…] For data to be usable there is an associated set of metadata including data computational documentation, semantic and syntax declaration, quality metrics, etc; metadata that is intrinsic to the data e.g. its semantic definition, and metadata that communicates the traits used by computational governance to implement the expected behavior e.g. access control policies.

A “set of metadata” is hereby to be associated with our Data Products. I’ve read a lot of articles (I’m gonna report some references along the way in the article) about why it’s important and what theoretically can be achieved by leveraging them… they all make a lot of sense. But — as frequently happens when someone talks about Data Mesh related aspects — there’s often a lack of “SO WHAT”.

For this reason, I’d like to share, after a brief introduction, a practical approach we are proposing. It’s open-source, it’s absolutely not perfect for every use case, but it’s evolving along with our real-world experience of driving enterprise customers’ Data Mesh journeys. Most importantly, it’s currently used — in production, not just on articles — at some important customers of ours to enable the automation needed by their platform supporting their Data Mesh.

Why metadata are so important

I believe that automation is the only way to survive in a jungle of domains willing to create Data Products. Automation with respect to:

  1. easy and fast creation of Data Products leveraging pre-built reusable templates that guarantee out-of-the-box compliance to the Data-as-a-Product principle;
  2. implementation and execution of the Federated Governance policies (which therefore become “computational”);
  3. enablement of Self-Service Infrastructure, as a fundamental pillar to provide Decentralized Domains the autonomy and ownership they need.

The pillars together are all mandatory to reach the scale. They are complex to set up and spread, both as cultural and technical requirements, but they are foundational if we want to build a Data Mesh and harvest that so-widely promised value from your data.

But how can we get to this level of automation?
By leveraging a metadata model representing/associated with a Data Product.

Such a model must be general, technology agnostic, and imagined as the key enabler for the automation of a Data Mesh platform (to be intended as an ecosystem of components and tools taking actions based on specific sections of this metadata specification, which must be standardized to allow interoperability across the platform’s components and services that need to interact with each other).

State of the art

In order to provide a broader view on the topic, I think it’s important to report some references to understand what brought us to do what we did (OK, now I’m creating hype on purpose ).

  • Feed the Control Ports: Arne Roßmann in this article describes how control ports of Data Products should serve foundational APIs by leveraging metadata collected at different levels.
  • Active Metadata: Prukalpa Sankar in this article mentions a commercial product tackling the active metadata space as a way to autonomously collect precious pieces of information about several things from SQL plans to understand data lineage, to monitor Data Products update rate, and other very cool stuff.
  • Metadata types: Zeena in this post introduces their Data Catalog solution after reporting a very interesting differentiation of metadata into DESCRIPTIVE, STRUCTURAL, ADMINISTRATIVE.
  • Standards for Metadata in generalOpenMetadata is doing excellent work in creating standards and a framework for metadata management, including specific vocabulary and syntax (even if in some cases it becomes quite complex IMHO).
  • General Data Product definition: what gets closer to our vision is OpenDataProducts, a good work to factor out some important information around the general Data Product concept (e.g. Billing and SLA). However, generalizing too much is risky and, in this case, I believe they got a little too far from the way Data Mesh requires a Data Product to be described and built. For example, there’s no mention of Domains, Schema, Output Ports, or Observability.

According to these references, there’s still no clear metadata-based representation of a Data Product addressing — specifically — the Data Mesh and its automation requirements.

How we intend a data product specification

We believe in an Infrastructure-As-Code declarative, idempotent, and versioned approach. The goal is to have a standardized yet customizable specification addressing this holistic representation of the Data Mesh’s architectural quantum that is the Data Product.

Principles involved:

  • Data Product components (input ports, output ports, workloads, staging areas)
  • self-documenting Data Product (from schema to dependencies, to description and access patterns)
  • clear ownership (from Domain to Data Product team)
  • semantic linking with other Data Products (like polysemes to help improve interoperability and references to build a semantic layer or Knowledge Graph on top of the Data Mesh)
  • data observability
  • SLA/SLO and Data Contracts (expectations, agreements, objectives)
  • Access and usage control

The Data Product Specification aims to gather together crucial pieces of information related to all these aspects, under the strict ownership of the Data Product Owner.

What we did so far

In order to fill up this empty space, we tried to create a Data Product Specification by starting this open-source initiative:

https://github.com/agile-lab-dev/Data-Product-Specification

The repo contains a detailed documentation field-by-field; however, I’d like to point out here some features I believe to be important:

  • One point of extension and — at the same time — standardization is the expectation that some of the fields (like tags) are expressed in the OpenMetadata specification.
  • Depending on the context, several of those fields can be considered optional: their valorization is supposed to be checked by specific federated governance policies.
  • Although there’s an aim for standardization, the model is expandable for future add-ons: your Data Mesh (and platform) will probably keep evolving for a long time and the Data Product Specification is supposed to do the same, along the way. You can start simple, then add pieces by pieces (and related governance policies).
  • being in YAML format it’s easy to parse and validate, e.g. via a JSON parsing library like circe.

The Data Product Specification itself covers the main components of a Data Product:

  • general bounded context and domain metadata the Data Product belongs to (to support descriptive API on control ports)
  • clear ownership references
  • input ports, workloads, internal storage/staging areas, output ports, and logical dependencies between them
  • data observability
  • data documentation, semantic linking info, schema, usage guidelines (from serialization format and access patterns to billing models)

NOTE: what is here presented as a “conceptual” quantum can be (and it is, in some of our real-world implementations) split into its main componing parts that are then git-version controlled under their own repositories (belonging to a common group, which is the Data Product one).

Key benefits

The Data Product Specification is intended to be heavily exploited by a central Data Mesh-enabling platform (as a set of components and tools) for:

  • Data Products creation boost, since templates can be created out of this specification (for the single components and for the whole Data Products). They can then be evolved, also hiding some complexity from end-users.
  • Federated Governance policies execution (e.g. a policy statement could be “the DP description must be at least 300 chars long, the owner must be a valid ActiveDirectory user/group, there must be at least 1 endpoint for Data Observability, the region for the S3 bucket must be US-East-1”, etc…), thus making “approved templates” to be compliant by-design with the Data Mesh principles and policies.
  • Change Management improvement, since specifications are version controlled and well structured, thus allowing to detect breaking changes between different releases.
  • Self-service Infrastructure enablement, since everything a DP needs to be provisioned and deployed for is included in a specific and well-known version of the DP specification, thus allowing a platform to break it down and trigger the automated provisioning and deployment of the necessary storage/compute/other resources.

As I said, this specification is evolving and can be surely improved. Being open-source, every contribution is welcome.

Wrapping up

How can a Data Mesh journey reach the scale? With automation to guarantee out-of-the-box compliance of Data Products to the Data Mesh principles.