
4 Data Contract Implementation Patterns for Complex Data Environments

Explore four data contract implementation patterns for complex environments, emphasizing proactive measures to prevent data issues and maximize ownership.


Data contracts deserve to be explored in greater depth because, beyond their textbook definition, few enterprises truly examine how to implement them in complex and diverse environments. More importantly, the lack of resources devoted to this exploration means there is no guarantee that data contracts will function effectively and deliver the expected benefits.

In this article, we'll discuss various data contract implementation patterns, starting with the most basic and progressively increasing their complexity and sophistication.

We'll skip the data contract definition for this one (read this article for more on that), but let's highlight an often-overlooked socio-technical aspect:

A data contract serves as a mechanism to hold data product creators accountable for how their data is presented and behaves for consumers. Ultimately, it is about establishing true data ownership and formally regulating the discontinuities of said ownership along the data transformation chain.
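
To make that concrete, here is a minimal sketch of what such a contract might capture, written as a plain Python structure. The identifiers, fields, and thresholds below are illustrative assumptions, not any specific product's format.

```python
# Illustrative only: what a data product owner promises to consumers,
# and who is accountable when the promise is broken.
data_contract = {
    "id": "sales.orders.v1",                # hypothetical contract identifier
    "owner": "sales-data-product-team",     # the accountable party
    "schema": {                             # the promised shape of the data
        "order_id": "string",
        "amount_eur": "decimal(12,2)",
        "created_at": "timestamp",
    },
    "quality": [                            # checks consumers can rely on
        {"rule": "not_null", "column": "order_id"},
        {"rule": "min", "column": "amount_eur", "value": 0},
    ],
    "sla": {"freshness_hours": 24},         # behavioural expectations
}
```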

Let's explore four implementation options for both batch and streaming data contracts! We'll look at each of them with a short description, its pros and cons, and our view.

The first three rely on a platform where the metadata of a data contract can be registered, and data quality checks can be performed via connectors.

Several data catalogs meet these requirements, so for simplicity, we'll refer to “the data catalog” when discussing the first three options. If your data catalog is different, that's okay.

The fourth option, however, introduces an alternative pattern that solves various challenges and is our recommended option. Let's dive in!

Data Contract Implementation Patterns Infographic

 

Option 1: Passive Monitoring

Batch Data Contracts Version

This pattern defines data contracts within the data catalog and enables periodic metadata scans for compliance checks. However, this approach is entirely passive.

Option 1-Passive Monitoring-batch
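
As a rough sketch of the flow (hypothetical function names standing in for the catalog and its connectors, not a real API), the pattern boils down to a scheduled loop that compares the registered contract with the table's current metadata and records the outcome after the fact:

```python
import time

def fetch_contract(contract_id):
    # Stand-in for reading the contract registered in the data catalog.
    return {"schema": {"order_id": "string", "amount_eur": "decimal(12,2)"}}

def fetch_table_metadata(table):
    # Stand-in for the catalog connector scanning the physical table.
    return {"columns": {"order_id": "string"}}

def periodic_scan(contract_id, table, interval_seconds=3600, max_cycles=1):
    # Entirely passive: the scan runs on its own schedule, not when the data
    # is refreshed, so consumers may read bad data before anything is detected.
    for _ in range(max_cycles):
        contract = fetch_contract(contract_id)
        metadata = fetch_table_metadata(table)
        missing = [c for c in contract["schema"] if c not in metadata["columns"]]
        if missing:
            print(f"Contract {contract_id} breached after the fact: {missing}")
        time.sleep(interval_seconds)  # raising the frequency raises cloud cost

periodic_scan("sales.orders.v1", "analytics.orders", interval_seconds=1)
```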

Cons:

  • Late detection: Issues are detected only after a periodic scan that isn't synchronized with the data refresh rate. This means that consumers may have already used bad data before the problem was identified.

  • System overload: Increasing the scan frequency to improve breach detection can put excessive strain on infrastructure, leading to higher cloud costs and inefficient resource utilization.

  • Change management complexity: Managing the data contract requires updates in two separate places (in Git and CI/CD, and in the data catalog) at different times, increasing the risk of misalignment.

  • Privilege requirements: The data catalog must have the necessary permissions to scan all data contracts across the company.

  • No prevention mechanism: There is no way to prevent bad data from propagating to downstream consumers before detection.

Our view

We'll skip the pros for this one because calling it a “data contract” is misleading. This is just traditional metadata and data quality management, offering no real added value.

 

Streaming Data Contracts Version

This approach is similar to the one above but applied in a streaming context.

Streaming is commonly used to regulate data exchange between operational and analytical systems. Mainframe offloading through change data capture is a typical use case. However, this pattern remains entirely passive and cannot stop the data flow in case of issues.

Option 1-Passive Monitoring-streaming
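
A minimal sketch of the same idea in a streaming context (plain Python iterators stand in for the CDC topic and its consumer): the monitor observes each event and reports breaches, but every event keeps flowing downstream regardless.

```python
def passive_stream_monitor(events, contract):
    # Observes the stream and reports breaches, but never blocks delivery:
    # every event, compliant or not, still reaches the analytical side.
    for event in events:
        missing = [col for col in contract["schema"] if col not in event]
        if missing:
            print(f"Violation observed (not blocked): missing {missing}")
        yield event  # the event flows on regardless of the check result

contract = {"schema": {"order_id": "string", "amount_eur": "decimal(12,2)"}}
cdc_events = [{"order_id": "42", "amount_eur": "10.00"}, {"order_id": "43"}]
delivered = list(passive_stream_monitor(cdc_events, contract))
print(f"{len(delivered)} of {len(cdc_events)} events delivered, breaches included")
```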

 

Cons:

  • All the cons from the batch option still apply.

Our view

Positioning data contracts between operational and analytical landscapes is strategic and effective for improving data quality and ownership early in the value chain.

However, this pattern only enables monitoring and does not prevent bad data from propagating across functional areas.

Option 2: Synchronized Inspection

Batch Data Contracts Version

This is an improvement over Option 1, as it eliminates global scheduling for data contract inspections. Instead, it integrates with the local scheduling of the data product, triggering an individual scan immediately after the data refresh is completed.

Option 2-Synchronized Inspection-batch
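
In pipeline terms, the change is small but meaningful: the contract check is invoked by the data product's own orchestration, right after the refresh. The sketch below uses hypothetical function names and endpoints; it is not a specific catalog's API.

```python
def refresh_data_product():
    # Stand-in for the data product's own load/transform job.
    print("data refreshed and already visible to consumers")

def request_contract_check(catalog_url, contract_id):
    # Stand-in for the API call asking the catalog to evaluate one contract.
    # Hundreds of products finishing at the same time means hundreds of calls.
    print(f"POST {catalog_url}/contracts/{contract_id}/evaluate")
    return {"status": "passed"}

def pipeline(contract_id):
    refresh_data_product()
    # Triggered by the product's local schedule, immediately after the refresh,
    # so detection is synchronized with the data rather than with a global scan.
    result = request_contract_check("https://catalog.example.com", contract_id)
    print(f"contract {contract_id}: {result['status']}")

pipeline("sales.orders.v1")
```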

Pros:

✔ Better synchronization between data refresh and data contract assessment, reducing late detection and computational inefficiencies.

Cons:

  • Potential bottleneck: The data catalog could struggle to handle hundreds of simultaneous API calls for contract evaluations.

  • Persisting change management complexity: Updates are still required in two separate places, leading to a misalignment risk.

  • Privilege concerns: The data catalog must have permission to scan all data contracts across the company.

  • Bad data propagation: This pattern still doesn't solve the issue.

Our view

A step in the right direction, but still lacking proactive measures to prevent data issues and avoid the propagation of bad data.

Option 3: Enforced Compliance

Batch Data Contracts Version

This pattern introduces the Write-Audit-Publish (WAP) mechanism, which blocks the delivery of non-compliant data, preventing consumers from ingesting bad data.

This is the key distinction between data quality/observability and the data contract pattern: in the latter, the data product creator is held accountable for delivering what was promised.

Option 3-Enforced Compliance-batch
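
A minimal sketch of the Write-Audit-Publish gate, with hypothetical function names (in this option, the audit step is where the pipeline would call out to the data catalog): the refreshed data lands in a staging area, is audited against the contract, and is promoted to the consumer-facing location only if it complies.

```python
def write_to_staging(batch):
    # Write: land the refreshed data where consumers cannot see it yet.
    return {"location": "staging.orders", "rows": batch}

def audit(staged, contract):
    # Audit: evaluate the staged data against the contract
    # (in this option, via an API call to the data catalog).
    return all(
        all(col in row for col in contract["schema"]) for row in staged["rows"]
    )

def publish(staged):
    # Publish: promote the staged data to the consumer-facing table.
    print(f"published {len(staged['rows'])} rows from {staged['location']}")

def write_audit_publish(batch, contract):
    staged = write_to_staging(batch)
    if audit(staged, contract):
        publish(staged)
    else:
        # Circuit breaker: non-compliant data never reaches consumers.
        raise RuntimeError("contract breached, publication blocked, owner notified")

contract = {"schema": {"order_id": "string", "amount_eur": "decimal(12,2)"}}
write_audit_publish([{"order_id": "42", "amount_eur": "10.00"}], contract)
```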

Pros:

✔ Saves time, costs, and operational effort by enabling circuit breaking, which prevents bad data from spreading.

Cons:

  • Increased complexity: additional orchestration logic is required within each data product.

  • System dependencies: Frequent API calls to the data catalog make the system reliant on its responsiveness.

  • Mission-critical risk: If the data catalog goes down, it can impact all the business processes. This is a risk it was never originally designed to handle.

  • Change management complexity persists: Updates are still required in two separate places, leading to a risk of misalignment.

  • Privilege concerns: The data catalog must have permission to scan all data contracts across the company.

Our view

While this is the right approach for implementing a data contract in batch mode, relying on an external platform for all business processes is a poor architectural choice. At scale, it could become a nightmare.

 

Option 3: Enforced Compliance

Streaming Data Contracts Version

Applying WAP principles in streaming is far more complex. Data is checked continuously (record by record), making it impractical to invoke the data catalog for every event.

Additionally, the stream processor must dispatch data based on validation results, meaning it cannot simply act as a passive job triggered by the data catalog.

Option 3-Enforced Compliance-streaming

Our view

We don't see a viable way to implement this pattern effectively using an external platform, which is a major limitation.

Option 4: Embedded Enforcement

Batch Data Contracts Version

This pattern achieves the same results as Option 3 but with some significant improvements:

Pros:

✅ Enforcement is fully decentralized and self-contained within the data product lifecycle. There are no external dependencies and no single point of failure. Each data contract is self-consistent and independent in terms of performance and reliability.

✅ Change management is streamlined into a single source of truth, which is DevOps-friendly and eliminates inconsistencies. The CI/CD pipeline registers the data contract definition in the data marketplace* and provisions the physical infrastructure.

✅ Dependencies are removed: the data marketplace becomes a read-only metadata discovery layer.

✅ Improved security, as the monitoring process (the data contract guardian) runs with a dedicated service principal that can access only the data covered by its own data contract.

✅ Simplified scheduling compared to Option 3, removing multiple steps and polling.

✅ No limitations: this pattern can be applied to structured and unstructured data, as well as to API and RAG interfaces, since there is no dependency on data catalog connectors.

Option 4-Embedded Enforcement-batch

*In this option, we replaced “data catalog” with “data marketplace” to highlight that it becomes a passive component for discovering and consuming information and is no longer an authoring tool. That said, the marketplace can still be a data catalog tool, as long as edit permissions on data contracts are disabled and its DQ reporting capabilities are used without triggering DQ jobs.
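
To illustrate how the pieces fit, here is a minimal sketch of embedded enforcement under assumed names (the file layout, guardian function, and marketplace endpoint are all hypothetical): the contract is versioned in Git with the data product, the check runs inside the product's own pipeline under a narrowly scoped credential, and the marketplace only receives read-only results for discovery.

```python
import json
import pathlib

# Simulate the contract file that would live in the data product's Git repo.
pathlib.Path("orders_contract.json").write_text(json.dumps({
    "id": "sales.orders.v1",
    "schema": {"order_id": "string", "amount_eur": "decimal(12,2)"},
}))

def load_contract_from_repo(path):
    # Single source of truth: versioned with the product, deployed by CI/CD.
    return json.loads(pathlib.Path(path).read_text())

def guardian_check(staged_rows, contract):
    # The "guardian" runs inside the product's own pipeline with a service
    # principal scoped to this contract's data only, so there is no central
    # dependency and no single point of failure.
    return all(all(col in row for col in contract["schema"]) for row in staged_rows)

def publish_metadata(marketplace_url, contract, passed):
    # The marketplace only receives read-only results for discovery;
    # it never authors contracts and never triggers jobs.
    print(f"PUT {marketplace_url}/contracts/{contract['id']}/status passed={passed}")

def embedded_pipeline(contract_path, staged_rows):
    contract = load_contract_from_repo(contract_path)
    passed = guardian_check(staged_rows, contract)
    publish_metadata("https://marketplace.example.com", contract, passed)
    if not passed:
        raise RuntimeError("blocked inside the data product: contract breached")
    print("published to the consumer-facing location")

embedded_pipeline("orders_contract.json", [{"order_id": "1", "amount_eur": "9.99"}])
```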

 

Option 4: Embedded Enforcement

Streaming Data Contracts Version

This pattern achieves the same results as the batch version of Option 4, but in streaming, something that wasn't feasible with Option 3.

At the same time, the data contract can both inspect and influence the data flow.

Option 4-Embedded Enforcement-streaming
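
A minimal sketch of the embedded streaming check (hypothetical names; routing non-compliant events to a quarantine or dead-letter stream is one common choice, assumed here): validation runs inside the stream processor itself, so it both inspects the flow and decides where each event goes, with no per-event call to an external platform.

```python
def embedded_stream_contract(events, contract):
    # The contract check is embedded in the stream processor itself, so no
    # per-event call to an external platform is needed and the check can
    # steer the flow, not just observe it.
    valid, dead_letter = [], []
    for event in events:
        if all(col in event for col in contract["schema"]):
            valid.append(event)        # flows on to consumers
        else:
            dead_letter.append(event)  # quarantined, owner alerted
    return valid, dead_letter

contract = {"schema": {"order_id": "string", "amount_eur": "decimal(12,2)"}}
events = [{"order_id": "42", "amount_eur": "10.00"}, {"order_id": "43"}]
ok, dlq = embedded_stream_contract(events, contract)
print(f"{len(ok)} events published, {len(dlq)} blocked in the dead-letter stream")
```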

 

Pros:

  • Same as the batch option, with the major addition that enforcement now happens in stream processing.

 

Final Thoughts

When a pattern solves a problem sub-optimally, it requires careful consideration. It may meet immediate needs, but what happens at scale?

We always prioritize long-term solutions for our customers and partners, so let's take a look at two real-world scenarios that highlight critical challenges some organizations will face.

 

Scenario 1: Banking — Liquidity & Risk Management

Imagine a batch-oriented, analytical process integrating hundreds of data sources across multiple business domains. This process must be completed within a cut-off time and meet regulatory constraints. 

If data quality or timeliness fails, the bank faces penalties.

The WAP pattern is essential here. Errors must be blocked immediately, before they spread; otherwise, troubleshooting becomes a nightmare, because with so many actors and processing steps involved, identifying and resolving the issue quickly is challenging.

Data contracts fit like a glove here, but they must be implemented effectively.



Scenario 2: Energy — Real-Time Infrastructure Maintenance

IoT devices continuously stream data on physical network conditions. Reactive maintenance is time-sensitive, so when an alarm triggers, a repair team must be dispatched at once.

Bad data can lead to costly errors. Imagine sending a crew to investigate a false alarm: time, money, and resources are all wasted. In Operational Technology (OT), data errors are common (e.g., due to physical system outages).

If bad data can't be blocked, incorrect decisions are inevitable. And if stopping the algorithm is impossible, merely monitoring the data contract is pointless: the team will still act on bad data.

 

Implementing data contracts effectively takes more than monitoring: true accountability means enforcing proactive measures that prevent bad data from spreading.

While passive data quality and observability can be useful, they don't achieve the same level of control and reliability as a well-implemented data contract enforcement mechanism.


 

Data contracts are like a superpower! Find out more about how Witboost implements them by clicking the button below.

 
