Governance and the Streaming Datalake:
Do it well and do it once

Exploring the efficiency of applying governance principles once in streaming datalakes to simplify data management across organizations.

Tom Scott
Founder & CEO
18th January 2024

Introduction to Governance in Data-Driven Organizations

It’s no secret that effective data governance is one of the key attributes of successful data-driven (and data mesh) organizations. These same organizations provide a wealth of informative blogs and articles about how to do data governance well. This isn’t one of them. Instead of repeating the well-known and documented truths around governance (I personally rely on the federated governance principles in Zhamak Dehghani’s and Adam Bellemare’s Data Mesh books), here I want to discuss where and, more importantly, how many times governance principles are applied.

The Traditional Challenge of Data Governance

To set the scene, let’s take the most obvious of the common aspects of governance: security. It makes perfect sense that not everyone should have the same access to the same data. Most data systems have access control lists (ACLs), role-based access control (RBAC), or some other mechanism to control who can access what from where. Let’s start at the point where most data is generated. In a modern microservices architecture this is most likely an event source producing into an event streaming platform (thanks to its dominance, there’s also an 80% chance that this is Apache Kafka).
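As a rough illustration of the kind of access control described above, the sketch below models topic-level rules in plain Python. It is not Kafka’s API: the Acl class, the is_authorized function, and every principal and topic name are hypothetical. The only behaviour it borrows from Kafka-style authorization is that a matching deny rule overrides any allow rule, and that a request with no matching rule is refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    """One access rule. All names used with this class are hypothetical."""
    principal: str          # e.g. "User:orders-service"
    topic: str              # exact topic name, or a prefix when prefixed=True
    operation: str          # "READ" or "WRITE"
    allow: bool = True      # False models a deny rule
    prefixed: bool = False  # True matches every topic starting with `topic`

    def matches(self, principal: str, topic: str, operation: str) -> bool:
        if self.prefixed:
            topic_match = topic.startswith(self.topic)
        else:
            topic_match = topic == self.topic
        return (
            self.principal == principal
            and self.operation == operation
            and topic_match
        )


def is_authorized(acls, principal: str, topic: str, operation: str) -> bool:
    """A matching deny rule wins; no matching rule at all means no access."""
    matching = [a for a in acls if a.matches(principal, topic, operation)]
    if any(not a.allow for a in matching):
        return False
    return any(a.allow for a in matching)


# Example rule set: one producer, one consumer, and an analyst who can read
# every "orders." topic except the one holding personal data.
ACLS = [
    Acl("User:orders-service", "orders", "WRITE"),
    Acl("User:billing-service", "orders", "READ"),
    Acl("User:analyst", "orders.", "READ", prefixed=True),
    Acl("User:analyst", "orders.pii", "READ", allow=False),
]
```

With these rules, is_authorized(ACLS, "User:analyst", "orders.eu", "READ") is allowed by the prefixed rule, while the same call for "orders.pii" is refused by the deny rule.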

Operational vs. Analytical Governance

Kafka has a rich and extensible security ecosystem that ensures principals have an access scope in line with the tasks they must perform. Everything is great, but this is not the full picture. What we’ve described so far covers only the operational cases. What about analytical uses?

Like the operational systems mentioned earlier, there is a plethora of analytical tools that support access control, but first we have to get the data there. Traditionally this is accomplished via ETL/ELT jobs, and these must themselves have access to the data they are sourcing.

The Complications of Traditional ETL/ELT

Usually, some kind of service principal is used for this. Management of these principals is tricky. They exist for a purely technical reason (to transfer data from A to B), so by definition they are not business-aligned and are often cross-domain. Maintaining the access of these principals usually falls to data engineering teams and, whilst they are usually completely read-only in nature, they are often scoped way beyond their requirements (svc_org_data_read_only account, anyone?) to allow easy expansion of ETL/ELT activities.
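To make the over-scoping problem concrete, here is a minimal sketch of the kind of audit a data engineering team might run: compare what a service principal has been granted with what its jobs actually read. The job names, topic names, and both helper functions are assumptions made up for illustration. Only the svc_org_data_read_only account name comes from the text above.

```python
# Topics each ETL/ELT job actually reads (hypothetical jobs and topics).
JOBS = {
    "orders_to_warehouse": {"orders"},
    "payments_to_warehouse": {"payments"},
}

# Topics the shared service principal has been granted read access to.
GRANTED = {
    "svc_org_data_read_only": {"orders", "payments", "customers", "hr.salaries"},
}


def required_topics(jobs: dict) -> set:
    """Union of every topic that at least one job needs."""
    return set().union(*jobs.values())


def excess_grants(granted: set, jobs: dict) -> list:
    """Topics the principal can read but that no job uses, sorted by name."""
    return sorted(granted - required_topics(jobs))


if __name__ == "__main__":
    for principal, topics in GRANTED.items():
        print(principal, "has unused read access to:", excess_grants(topics, JOBS))
```

Anything this reports is access that exists only to make future ETL/ELT expansion easier, which is exactly the scope that is hard to justify to a domain owner.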

That’s still not the end of the story. Effective ETL/ELT lands a second copy of the source data in an analytical system, and this copy must itself restrict access. A further set of mechanisms and principals must be maintained for this too.

The end result is four sets of governance rules, spread across multiple systems and two copies, for a single dataset.

The Streaming Datalake Solution

The Streaming Datalake approach is different. Streaming lakes consider the event streaming system to be the single source of data for both operational and analytical purposes. No copies of the data are required in order to service analytical cases and no ETL/ELT is needed. This means that governance need only be applied once, in the origin system for the data.

This approach greatly simplifies governance compared with traditional ETL/ELT patterns.

Bridging Operational and Analytical Needs with Streambased

Obviously, we don’t duplicate data for fun, so how can the operational system satisfy the demands of analytical use cases? At Streambased we enrich the operational data set with metadata to drastically increase the performance of analytical queries (think indexing, statistics, and pre-aggregation). This metadata is tightly coupled to the data it describes and is not useful on its own, so it requires no significant additional governance. To complete the picture, Streambased also provides access to operational data via industry-standard analytical tooling and protocols (SQL over JDBC) so that, to the analyst, the tools and patterns don’t change between working with a system in the analytical realm and a system in the operational realm.
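To give a feel for why statistics-style metadata speeds up analytical queries, the sketch below keeps a minimum and maximum per block of offsets and uses them to skip blocks that cannot match a filter. This is a generic technique, not a description of how Streambased implements its metadata. BlockStats, build_stats, blocks_to_scan, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockStats:
    """Min/max summary for one contiguous block of offsets."""
    first_offset: int
    last_offset: int
    min_value: float
    max_value: float


def build_stats(values, block_size: int):
    """Summarise `values` (indexed by offset) in blocks of `block_size`."""
    stats = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        stats.append(
            BlockStats(start, start + len(block) - 1, min(block), max(block))
        )
    return stats


def blocks_to_scan(stats, lower: float):
    """Blocks that may hold a value >= lower; all others can be skipped."""
    return [s for s in stats if s.max_value >= lower]


# Order amounts by offset (made-up data), summarised three records at a time.
AMOUNTS = [5, 7, 3, 40, 42, 41, 8, 2, 9]
STATS = build_stats(AMOUNTS, block_size=3)
```

A query for amounts of at least 30 only needs to read the block covering offsets 3 to 5. Note that the statistics say nothing useful without the records they summarise, which is why this kind of metadata needs little governance of its own.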

Conclusion: Simplifying Governance with Streaming Datalakes

So far we’ve only talked about security, but the same principle holds for all aspects of governance. Why handle data privacy in two separate systems? Why create separate retention policies for multiple systems? Why develop separate risk management processes for operational and analytical realms? In effective data management, simplicity is key, and the Streaming Datalake offers a simple, single source of truth and single source of governance for the entire data estate. It doesn’t get much simpler than that!

Footnote on Governance Features in Event Streaming Systems

Historically, event streaming systems have lagged behind analytical systems in terms of governance features. For instance, data masking is a pretty common requirement for private data sets but is not available in the off-the-shelf Apache Kafka offering. Thankfully this is changing. Projects such as Conduktor’s Gateway and Kroxylicious provide proxies that enrich the functionality of Kafka, bringing it on par with analytical systems for governance features.
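For readers unfamiliar with data masking, the following sketch shows the idea in plain Python: selected fields of a record are obscured before the record reaches the consumer. It does not use or imitate the configuration of Conduktor’s Gateway or Kroxylicious. The mask_value and mask_record functions and the sample record are hypothetical.

```python
def mask_value(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def mask_record(record: dict, masked_fields: set) -> dict:
    """Return a copy of `record` with the named fields masked.

    Masked fields are converted to strings first. The input is not modified.
    """
    return {
        key: mask_value(str(value)) if key in masked_fields else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    event = {"order_id": 42, "card_number": "4111111111111111"}
    print(mask_record(event, {"card_number"}))
```

Applied in a proxy in front of the event streaming platform, a rule like this is defined once and covers operational and analytical readers alike.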
