Data Governance Framework: Pioneering Governance Shift Left
Enhance Data Governance efficiency with the Governance Decision Record framework. Improve computational governance for structured data practices.
In this article I will present the open-source Governance Decision Record (GDR) framework and show how it can boost the efficiency and efficacy of a Data Governance framework.
At Agile Lab we care a lot about the word practice: it is a foundational piece of our Elite Data Engineering manifesto.
The Governance Decision Record fits naturally into the Data Mesh scenario as a federated computational practice serving Data Governance goals. Still, the idea of structuring an operating model around the decisions to be taken as part of enterprise Data Governance accountabilities can easily be adopted in many other data management contexts and architectures, such as the Data Lakehouse.
Let’s take a look!
First, let's briefly look at the modern view of federated computational data governance, starting from the status quo.
Yeah, late.
In one word, that's what Data Governance frameworks have been most of the time in the majority of the enterprises I have laid my eyes on (I'm a Staff Data Architect at a Data Engineering solutions firm). Centralized Data Governance teams usually take on a long list of duties, and that long list has always made the centralized practice inefficient and hard to scale.
Another word that describes the typical Data Governance initiative is manual.
Humans don’t scale out. We sleep, and we make mistakes. Data don’t sleep, so mistakes mean risks and losses.
High-quality analytical data must be provided with clear ownership and timeliness. They must be interoperable across domains, from both a semantic and a technical point of view. They need to be accessed easily yet securely, and they need to be discovered, understood, and, above all, used with trust and reliability.
OK, I've brutally summarized several concepts of the Data Mesh paradigm (excuse me, Zhamak, but I wanted to cut straight to the chase).
To get all the features above, we need to answer a series of questions about how ownership, interoperability, secure access, discoverability, and trust are actually enforced. And, on top of those, one last question:
How are effective data governance decisions cataloged, maintained, and shared with the whole organization?
The answer to this last question is the whole purpose of the Governance Decision Record (GDR). It also provides a framework/tool to answer the others above.
Every decision leads to a series of policies. Policies need clear ownership, but they can be debated under different organizational models. Eventually, decisions can change, evolve, and improve. Such decisions should also lead to automation. These are, in essence, architectural decisions.
Automation is what brings us from "late" to "on time", from "to be verified" to "compliant by design", from "arbitrary" to "structural".
The Governance Decision Record is basically an evolution of the Architectural Decision Record (ADR), a broadly adopted framework for consolidating architectural decisions, extended with two major improvements dedicated to computational governance.
A Governance Decision Record contains a set of well-defined sections, and the policies it captures can have different scopes (LOCAL or GLOBAL, as we'll see below).
A Markdown Governance Decision Record template is provided as a version-controlled document. I believe a Git repo is way more searchable and maintainable (in terms of releases, evolution, concurrent contributions, and as an enabler for automation).
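To make the structure concrete, here is a minimal sketch of what such a Markdown record might look like. The section names are illustrative, reconstructed from the parts described below; the official template lives in the GDR repo:

```markdown
# GDR-0001: <policy title>

## Status
Accepted

## Context
Where the policy applies, and why it is needed.

## Decision
The decision the policy aims to enforce.

## (Non-)Breaking changes
Which changes to the metadata are considered breaking, and which are not.

## Consequences
Accepted trade-offs, pros and cons, and any tech debt made explicit.

## Implementor
Who is responsible for making the policy computational.

## Computational checkpoints
- LOCAL: runtime checks in the execution environment of the data asset
- GLOBAL: deployment-time checks on the published metadata
```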
Let's dive into each of the composing sections.

First, the Status. This can be as simple as a label tracking the lifecycle state of a policy; common statuses include values such as proposed, accepted, amended, or superseded.

NOTE: in the case of amend* and supersede* statuses, the related policy should be linked.

In the GDR template file, some pre-compiled web-rendered labels are provided.
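Purely as an illustration (the labels shipped with the template may differ), such a web-rendered label can be a simple shields.io badge embedded in the Markdown:

```markdown
![status: accepted](https://img.shields.io/badge/status-accepted-brightgreen)
![status: superseded](https://img.shields.io/badge/status-superseded-orange)
```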
Next, the Context: this section describes the context to which the policy applies, and why.
Then, the Decision: the decision the policy aims to enforce. The example later in the article will clarify the scope of a possible governance decision.
The record also declares which changes to the metadata (or to anything else) are considered breaking and which are not. This is important to implement automation at the platform level and to create a robust change management process based on trust.
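As a purely illustrative example (each organization will classify changes in its own way), such a declaration might read:

```markdown
Breaking changes:
- Removing or renaming a field in the output port schema
- Changing the file format of the output port

Non-breaking changes:
- Adding an optional field to the schema
- Updating descriptive metadata (e.g., documentation links)
```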
Next come the Consequences: what we accept will happen once the policy is applied, including pros (improvements) and cons (impacts, rework, new accountabilities or requirements).
Since there is no universally optimal decision, the policy should also report the trade-offs the organization is going to accept, which in some scenarios means making the accumulated tech debt explicit.

A note on tech debt: it's usually hidden and hard to track. Making it explicit makes it easier to measure and keep track of the overall tech debt and of system quality, in terms of both architecture and behavior.
Next, the Implementor: the person responsible for taking care of the implementation. We talk about implementation because the policy is supposed to become as "computational" as possible, thus leading to the automation of the data management practice, probably with the help of a backing platform. This can also be the role held accountable for overseeing the application of the policy.
Finally, the computational checkpoints: the specific points in the architecture, the platform, the system, the context, etc., where the policy and its checks (if any) are implemented and become automation (thus "computational").

Checkpoints are split into LOCAL and GLOBAL policies: the former covers policies implemented, applied, and verified locally; the latter covers policies applied globally.

The LOCAL application is supposed to be applied and verified at runtime (e.g., in the execution environment of a data asset, be that a specific analytical workspace, a distributed processing or storage system, or a running job and its output data quality metrics), while the GLOBAL one addresses checks at deployment time (for example, targeting deployments of Data Products modeled with the Data Product Specification).
If a descriptive modeling language is used, a metadata validation policy-as-code file can be provided, which will probably be integrated into the platform (e.g., using CUE lang for YAML).
In the example provided in the repo, we assume that a policy becomes computational thanks to an enabling platform, driven by machine-readable metadata to model the policy content.
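To give a flavor of the approach, here is a hypothetical, heavily trimmed CUE sketch; the field names are illustrative and are not taken from the Data Product Specification or the repo:

```cue
// Constraints that the metadata of a FILES output port must satisfy.
// Unified against the YAML metadata file by `cue vet`.
kind: "outputport"  // the asset must declare itself as an output port
name: string & !="" // a non-empty name is mandatory
specific: {
	fileFormat: "CSV" | "PARQUET" | "JSON" // only approved file formats
	path:       =~"^s3://"                 // data must live on the object store
}
```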
In the repo, examples leveraging the Data Product Specification are presented but, again, the framework is agnostic to whether we are adopting the Data Mesh paradigm or not, whether we are decentralizing ownership towards domains or keeping a central Data Team that manages the whole data platform, and whether we are creating Data Products or "just" data assets.
A pretty exhaustive example policy, with its related metadata and policy-as-code validation files, is provided in the example folder of the repo. This specific GDR describes how an Output Port of type "FILES" should be defined, provisioned, configured, described, and validated. The folder contains 3 files: the GDR policy document itself, the example metadata, and the policy-as-code validation file.
The Governance Decision Record versioning (the 0001 prefix in the file names) assumes this is the first policy created to address this governance topic.
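For illustration only, a trimmed-down metadata file satisfying the CUE sketch above might look like this (the real example file in the repo is richer and follows the Data Product Specification):

```yaml
# Hypothetical metadata for an output port of type FILES
kind: outputport
name: customer-orders-files
specific:
  fileFormat: PARQUET
  path: s3://analytics/customer-orders/
```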
The policy metadata can be validated against the policy-as-code file using the CUE CLI (if installed):

cue vet example/data-mesh/data-product/output-port/files/0001-data-product-output-port-files-example.yaml example/data-mesh/data-product/output-port/files/0001-data-product-output-port-files.cue

If the metadata violates any constraint, cue vet reports the offending fields and exits with a non-zero code; when everything checks out, it stays silent.
Many more details can be found in the official GitHub repo.
This article won't dive into operating models for federated decision-making; however, the Governance Decision Record, with its status labels, is ready to be adopted in any Agile-like workflow. Please let me know if you are interested in a follow-up article on this topic.
How do you persist data governance decisions (both as policies and as code)? What specification does such a policy follow?
The Governance Decision Record is a handy open-source option for that, but alone it’s not enough:
Next step 1 - An organizational and operating model needs to be structured to take federated decisions.
Next step 2 - A platform is where policies become computational. Here, too, we might have a solution.
Automating data governance requires your governance policies to become computational. Discover the lifecycle of a computational policy in our free-to-download white paper.