Explore four data contract implementation patterns for complex environments, emphasizing proactive measures to prevent data issues and maximize ownership.
Data contracts deserve to be explored in greater depth because, beyond their textbook definition, few enterprises truly examine how to implement them in complex and diverse environments. More importantly, without resources dedicated to this exploration, organizations cannot ensure that data contracts function effectively and deliver the expected benefits.
In this article, we'll discuss various data contract implementation patterns, starting with the most basic and progressively increasing their complexity and sophistication.
We'll skip the data contract definition for this one (read this article for more on that), but let's highlight an often-overlooked socio-technical aspect:
A data contract serves as a mechanism to hold data product creators accountable for how their data is presented and behaves for consumers. Ultimately, it is about establishing true data ownership and formally regulating the discontinuities of said ownership along the data transformation chain.
Let's explore four implementation options for both batch and streaming data contracts! We'll look at each of them with a short description, its pros and cons, and the team's perspective on it.
The first three rely on a platform where the metadata of a data contract can be registered, and data quality checks can be performed via connectors.
Several data catalogs meet these requirements, so for simplicity, we'll refer to “the data catalog” when discussing the first three options. If your data catalog is different, that's okay.
The fourth option, however, introduces an alternative pattern that solves various challenges and is our recommended option. Let's dive in!
The first pattern (Option 1) defines data contracts within the data catalog and enables periodic metadata scans for compliance checks. However, this approach is entirely passive.
We'll skip the pros for this one because calling it a “data contract” is misleading. This is just traditional metadata and data quality management, offering no real added value.
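To make the contrast with later options concrete, here is a minimal Python sketch of what this passive pattern amounts to, assuming a hypothetical `CatalogClient` standing in for the catalog and its DQ connectors: a globally scheduled job runs the checks and merely reports violations.

```python
# Minimal sketch of Option 1: a periodic, fully passive contract scan.
# CatalogClient and its methods are hypothetical placeholders for whatever
# data catalog / DQ connector is actually in use.
from datetime import datetime, timezone


class CatalogClient:
    def list_contracts(self):
        """Return the data contracts registered in the catalog (stub)."""
        return ["orders_contract_v1"]

    def run_quality_checks(self, contract):
        """Run the contract's DQ checks via a connector (stub)."""
        return {"passed": True, "failures": []}


def periodic_scan(catalog: CatalogClient) -> None:
    # Runs on a global schedule (e.g. nightly), detached from any data refresh.
    for contract in catalog.list_contracts():
        result = catalog.run_quality_checks(contract)
        if not result["passed"]:
            # Violations are only reported; nothing stops consumers from
            # reading the non-compliant data in the meantime.
            ts = datetime.now(timezone.utc).isoformat()
            print(f"[{ts}] contract {contract} violated: {result['failures']}")


periodic_scan(CatalogClient())
```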
This approach is similar to Option 1 above, but applied in a streaming context.
Streaming is commonly used to regulate data exchange between operational and analytical systems. Mainframe offloading through change data capture is a typical use case. However, this pattern remains entirely passive and cannot stop the data flow in case of issues.
Positioning data contracts between operational and analytical landscapes is strategic and effective for improving data quality and ownership early in the value chain.
However, this pattern only enables monitoring and does not prevent bad data from propagating across functional areas.
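Here is a minimal sketch of this passive streaming monitoring, assuming an in-memory stand-in for the CDC stream and an illustrative contract rule: violations are only counted and reported, and every record, good or bad, is still forwarded downstream.

```python
# Sketch of the passive streaming variant: the monitor observes the CDC flow
# but cannot stop it; non-compliant records still reach the analytical side.
from typing import Callable, Iterable


def monitor_stream(source: Iterable[dict], forward: Callable[[dict], None]) -> None:
    violations = 0
    for record in source:
        if record.get("customer_id") is None:  # illustrative contract rule
            violations += 1                    # only counted, never blocked
        forward(record)                        # bad data flows through anyway
    print(f"Monitoring report: {violations} contract violation(s) observed")


# Usage sketch with an in-memory stand-in for the CDC stream:
cdc_events = [{"customer_id": "c-42"}, {"customer_id": None}]
monitor_stream(cdc_events, forward=print)
```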
This pattern (Option 2) is an improvement over Option 1, as it eliminates global scheduling for data contract inspections. Instead, it integrates with the local scheduling of the data product, triggering an individual scan immediately after the data refresh is completed.
✔ Better synchronization between data refresh and data contract assessment, reducing late detection and computational inefficiencies.
A step in the right direction, but still lacking proactive measures to prevent data issues and avoid the propagation of bad data.
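A minimal sketch of this trigger-after-refresh flow, assuming hypothetical `refresh_data_product` and `CatalogStub.trigger_contract_scan` helpers rather than any specific product's API:

```python
# Sketch of Option 2: the contract scan is triggered by the data product's own
# pipeline right after the refresh completes, instead of by a global schedule.
# refresh_data_product and CatalogStub.trigger_contract_scan are hypothetical.


class CatalogStub:
    def trigger_contract_scan(self, contract_id: str, snapshot_id: str) -> dict:
        return {"passed": True, "failures": []}


def refresh_data_product() -> str:
    """Run the data product's transformation and return the new snapshot id."""
    return "snapshot-2024-01-01"


def run_pipeline(catalog: CatalogStub) -> None:
    snapshot_id = refresh_data_product()
    # The scan runs immediately on the freshly produced data, so issues are
    # detected early -- but nothing prevents consumers from reading the
    # snapshot while (or after) the scan runs.
    report = catalog.trigger_contract_scan(
        contract_id="orders_contract_v1",  # illustrative identifier
        snapshot_id=snapshot_id,
    )
    if not report["passed"]:
        print("Contract violated:", report["failures"])


run_pipeline(CatalogStub())
```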
This pattern (Option 3) introduces the Write-Audit-Publish (WAP) mechanism, which blocks the delivery of non-compliant data, preventing consumers from ingesting bad data.
This is the key distinction between data quality/observability and the data contract pattern: in the latter, the data product creator is held accountable for delivering what was promised.
✔ Saves time, costs, and operational effort by enabling circuit breaking, which prevents bad data from spreading.
While this is the right approach for implementing a data contract in batch mode, relying on an external platform for all business processes is a poor architectural choice. At scale, it could become a nightmare.
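For concreteness, here is a hedged Python sketch of the Write-Audit-Publish flow: data is written to a staging area, audited against the contract, and published only if every check passes. The staging and publish steps, and the checks themselves, are illustrative stand-ins, not a specific engine's API.

```python
# Sketch of the Write-Audit-Publish (WAP) mechanism behind Option 3.
# The staging/publish steps are placeholders for whatever the storage layer
# offers (branch/commit semantics, a table swap, etc.).
from typing import Callable

CONTRACT_CHECKS: dict[str, Callable[[dict], bool]] = {
    "order_id is present": lambda row: row.get("order_id") is not None,
    "amount is non-negative": lambda row: row.get("amount", 0) >= 0,
}


def audit(rows: list[dict]) -> list[str]:
    """Return the names of the contract checks that failed."""
    return [
        name
        for name, check in CONTRACT_CHECKS.items()
        if not all(check(row) for row in rows)
    ]


def publish(rows: list[dict]) -> None:
    print(f"Published {len(rows)} rows to the consumer-facing location")


def write_audit_publish(rows: list[dict]) -> bool:
    staged = rows               # 1. WRITE to a staging area (placeholder)
    failures = audit(staged)    # 2. AUDIT against the contract
    if failures:
        # Circuit breaker: non-compliant data never reaches consumers.
        print("Publication blocked, failed checks:", failures)
        return False
    publish(staged)             # 3. PUBLISH only if every check passes
    return True


write_audit_publish([{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": -5.0}])
```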
Applying WAP principles in streaming is far more complex. In streaming, data is checked continuously (record by record), making it impractical to invoke the data catalog for every event.
Additionally, the stream processor must also dispatch data based on validation results, meaning it cannot simply act as a passive job triggered by the data catalog.
We don't see a viable way to implement this pattern effectively using an external platform, which is a major limitation.
This pattern (Option 4) achieves the same results as Option 3 but with some significant improvements (a minimal code sketch follows the list below):
✅ Enforcement is fully decentralized and self-contained within the data product lifecycle. There are no external dependencies and no single point of failure. Each data contract is auto-consistent and independent in terms of performance and reliability.
✅ Change management is streamlined into a single source of truth, which is DevOps-friendly and eliminates inconsistencies. The CI/CD pipeline registers the data contract definition in the data marketplace* and provisions the physical infrastructure.
✅ Dependencies are removed: the data marketplace becomes a read-only metadata discovery layer.
✅ Improved security: the monitoring process (the data contract guardian) runs with a dedicated service principal that can access only the data of its own data contract.
✅ Simplified scheduling compared to Option 3, removing multiple steps and polling.
✅ No limitations: this pattern can be applied to structured and unstructured data, as well as API and RAG interfaces. There is no dependency on data catalog connectors.
*In this option, we replaced “data catalog” with “data marketplace” to highlight that it is becoming a passive component for discovering and consuming information and is no longer an authoring tool. That being said, the marketplace can be a data catalog tool too by disabling edit permissions on data contracts and using DQ reporting capabilities without triggering DQ jobs.
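A minimal, hedged sketch of the contract-as-code idea behind this option: the contract lives in the data product's repository, CI/CD registers it in the marketplace as read-only metadata, and a guardian step inside the product's own pipeline enforces it. The `MarketplaceStub`, file layout, and check syntax are assumptions for illustration.

```python
# Sketch of Option 4: the contract is versioned with the data product's code,
# CI/CD publishes it to the marketplace as read-only metadata, and enforcement
# (the "guardian") runs inside the product's own pipeline. Names are illustrative.
from typing import Callable

CONTRACT = {  # in practice a YAML/JSON file living in the data product repo
    "name": "orders_contract",
    "version": "1.2.0",
    "checks": {
        "order_id is present": lambda row: row.get("order_id") is not None,
        "amount is non-negative": lambda row: row.get("amount", 0) >= 0,
    },
}


class MarketplaceStub:
    def register_contract(self, name: str, version: str) -> None:
        print(f"Registered {name} {version} as read-only metadata")


def ci_cd_deploy(marketplace: MarketplaceStub) -> None:
    # CI/CD is the single source of truth: it registers the contract and
    # provisions the infrastructure (storage, topics, the guardian's
    # scoped service principal, ...).
    marketplace.register_contract(CONTRACT["name"], CONTRACT["version"])


def guardian_gate(rows: list[dict]) -> bool:
    """Runs inside the data product pipeline, with access limited to its own data."""
    checks: dict[str, Callable[[dict], bool]] = CONTRACT["checks"]
    for name, check in checks.items():
        if not all(check(row) for row in rows):
            print("Blocked by the data contract guardian:", name)
            return False
    return True
```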
This pattern achieves the same results as Option 4 but in streaming, which wasn't possible in Option 3's streaming version.
At the same time, the data contract can both inspect and influence the data flow.
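A hedged sketch of this streaming variant: the contract guardian sits in the flow itself, validating each record and routing non-compliant ones to a quarantine destination so they never reach consumers. The source and sinks stand in for real broker topics (Kafka, for example), and the contract rule is illustrative.

```python
# Sketch of Option 4 in streaming: per-record enforcement inside the flow itself.
# source, valid_sink and quarantine_sink stand in for real broker topics/clients
# (e.g. Kafka); the contract rule below is illustrative.
from typing import Callable, Iterable


def contract_check(record: dict) -> bool:
    return record.get("sensor_id") is not None and record.get("value", -1.0) >= 0


def enforce_stream(
    source: Iterable[dict],
    valid_sink: Callable[[dict], None],
    quarantine_sink: Callable[[dict], None],
) -> None:
    for record in source:
        if contract_check(record):
            valid_sink(record)        # compliant data reaches consumers
        else:
            quarantine_sink(record)   # bad data is diverted, never propagated


# Usage sketch:
events = [{"sensor_id": "s1", "value": 3.2}, {"sensor_id": None, "value": -1.0}]
enforce_stream(events, valid_sink=print,
               quarantine_sink=lambda r: print("quarantined:", r))
```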
When a pattern solves a problem sub-optimally, it requires careful consideration. It may meet immediate needs, but what happens at scale?
We always prioritize long-term solutions for our customers and partners, so let's take a look at two real-world scenarios that highlight critical challenges some organizations will face.
Imagine a bank running a batch-oriented analytical process that integrates hundreds of data sources across multiple business domains. The process must be completed within a cut-off time and meet regulatory constraints.
If data quality or timeliness requirements are not met, the bank faces penalties.
The WAP pattern is essential. Errors must be blocked immediately, before they spread; otherwise, troubleshooting becomes a nightmare. With so many actors and processing steps involved, identifying and resolving issues quickly is challenging.
Data contracts fit like a glove here, but they must be implemented effectively.
IoT devices continuously stream data on physical network conditions. Reactive maintenance is time-sensitive, so when an alarm triggers, a repair team must be dispatched at once.
Bad data can lead to costly errors. Imagine sending a crew to investigate a false alarm: time, money, and resources are all wasted. In Operational Technology (OT), data errors are common (e.g., due to physical system outages).
If bad data can't be blocked, incorrect decisions are inevitable. And if stopping the algorithm is impossible, merely monitoring the data contract is pointless: the team will still act on bad data.
Implementing data contracts effectively is more than just monitoring. True accountability means enforcing proactive measures to prevent bad data from spreading.
While passive data quality and observability can be useful, they don't achieve the same level of control and reliability as a well-implemented data contract enforcement mechanism.
Data contracts are like a superpower! Find out more about how Witboost implements them by clicking the button below.