Event sourcing is an increasingly popular pattern that revolves around storing data as a series of events rather than as a final outcome. This is ideally suited to Kafka’s append-only log but, as we will discover in this article, Kafka alone is missing some key features needed to create a useful and efficient event sourcing architecture.
Event sourcing - a quick primer
Event sourcing has been well documented many times. I particularly recommend Confluent’s videos on it here: https://www.youtube.com/watch?v=wPwD9CQAGsk and here: https://www.youtube.com/watch?v=EvIg6buGo9k. The principle is that a given use case can be broken down into a series of events that combine to arrive at the state we use in our applications. Take a banking application, for example: the current balance of any account is the sum of all deposits and withdrawals that have been made against it. Whilst you might traditionally store the current balance in a database, an event sourcing approach stores separate records for every deposit and withdrawal and, when the balance is requested, sums them all to produce it.
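To make the principle concrete, here is a minimal sketch in Python. The event shapes and amounts are purely illustrative and are not taken from any real system:

```python
# The balance is never stored directly; it is derived by folding over
# every deposit and withdrawal event recorded for the account.
events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdrawal", "amount": 30},
    {"type": "deposit", "amount": 50},
]

def balance(events):
    """Reduce the full event stream to the current account state."""
    total = 0
    for event in events:
        if event["type"] == "deposit":
            total += event["amount"]
        else:  # withdrawal
            total -= event["amount"]
    return total

print(balance(events))  # 120
```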
There are lots of advantages to approaching problems of this type in this way. One particularly exciting one is that you gain the ability to “time travel” back to the assembled state as it was after any previous event was written. In our banking app I can see not only my balance today but also what it was last week, or at any point back to when the account was opened.
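Time travel falls out of the same fold: apply it to a prefix of the stream instead of the whole thing. Again, a small sketch with made-up timestamps:

```python
from datetime import datetime

# The balance at any past point is the same fold applied only to the
# events recorded on or before that point. Timestamps are illustrative.
events = [
    {"ts": datetime(2024, 1, 1),  "type": "deposit",    "amount": 100},
    {"ts": datetime(2024, 1, 8),  "type": "withdrawal", "amount": 30},
    {"ts": datetime(2024, 1, 15), "type": "deposit",    "amount": 50},
]

def balance_as_of(events, as_of):
    total = 0
    for event in events:
        if event["ts"] > as_of:
            break  # events are appended in order, so we can stop early
        total += event["amount"] if event["type"] == "deposit" else -event["amount"]
    return total

print(balance_as_of(events, datetime(2024, 1, 10)))  # 70 -- the balance "last week"
```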
Kafka is an ideal candidate for storing data in this fashion. Any new event can be produced to a Kafka topic and stored resiliently and efficiently forever. Better yet, Kafka has a whole host of features designed to ensure you can seek to historical points in the log and be guaranteed to read every event forward from there, even in the face of application failures, power outages or any of the other disasters that can be encountered in typical applications.
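As a rough sketch of what that replay looks like in practice, here is a consumer that rewinds to the start of a partition and reads forward. It uses the confluent-kafka Python client; the broker address, topic name, single-partition assignment and JSON message format are all assumptions made for illustration:

```python
import json
from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "balance-rebuilder",
    "enable.auto.commit": False,
})

# Start from the very first offset rather than any committed position.
consumer.assign([TopicPartition("transactions", 0, OFFSET_BEGINNING)])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            break  # nothing new within the timeout; treat as caught up for this sketch
        if msg.error():
            raise RuntimeError(msg.error())
        event = json.loads(msg.value())
        # ... apply the event to the state being rebuilt ...
finally:
    consumer.close()
```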
The problem with event sourcing with Kafka alone
If the above sounds too good to be true, it’s because it is. Downstream applications very rarely ask questions of event sourced datasets that are relevant to every event in the dataset. Instead, a set of filtering criteria is usually applied. Kafka is excellent at storing and retrieving sequences of events but is terrible at filtering event data by a particular attribute or set of attributes. One potential approach to this issue is simply to keep a different dataset for every key we could filter on (if the application makes requests based on customer name, then store a different dataset for every customer name). The unit of logical separation in Kafka is a topic, but it is not usually practical to have a separate topic for every key due to the management overhead of high-cardinality keysets. This means that most topics represent the entire keyset (“customers” contains all customer names).
Because of Kafka’s limitations in filtering, most queries with filtering criteria in Kafka involve processing the entire topic from start to finish. As an example consider the “transactions” topic that stores all deposits and withdrawals across all accounts for a bank:
To find the balance for accountId: 1001001 we must sum 3 messages starting from the beginning of the topic, but for accountId: 1001003 we only need to sum the last message on the topic. This should mean that the processing requirement for accountId: 1001003 is lower than for 1001001, but unfortunately with Kafka this is not the case. The only way for a Kafka-based application to know which accountId is associated with which message is to process the message. This means that, regardless of how an accountId is distributed within the topic, an application that calculates the balance must start at the beginning of the topic and process every message to the end to be sure it has seen every entry for that accountId. This becomes terribly slow as the volume of events increases.
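A toy version of the scan makes the cost obvious. The layout below mirrors the distribution described above (three messages for 1001001, only the last message for 1001003); the amounts themselves are made up:

```python
# Offsets 0-4 of an illustrative "transactions" topic held in memory.
transactions = [
    {"accountId": "1001001", "amount": 100},
    {"accountId": "1001001", "amount": -25},
    {"accountId": "1001002", "amount": 40},
    {"accountId": "1001001", "amount": 10},
    {"accountId": "1001003", "amount": 75},
]

def balance_for(account_id):
    """Without any index, every query is a full scan from offset 0."""
    total, processed = 0, 0
    for event in transactions:
        processed += 1
        if event["accountId"] == account_id:
            total += event["amount"]
    return total, processed

print(balance_for("1001001"))  # (85, 5) -- 5 messages read for 3 relevant events
print(balance_for("1001003"))  # (75, 5) -- still 5 messages read for 1 relevant event
```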
For this reason most event sourcing applications today add an intermediate step: a data store, optimised for filtering, that is written ahead of the Kafka cluster. Whilst this solves the immediate problem, it introduces some new ones.
First and foremost, we must now decide which of these stores is the source of truth. What if my account balance differs between the filtering system and Kafka? Should we trust the Kafka result or the intermediate store? This is actually a very common occurrence: in these architectures the two systems are usually only eventually consistent with one another, so a difference between them is to be expected.
There are other concerns too: both systems store the same dataset, so cost and maintenance overhead are doubled. Furthermore, the ELT pipeline that connects them must be maintained and evolved as the hosted datasets change.
Enter Streambased
Streambased brings database techniques normally associated with the analytical world into the operational world of Kafka. By maintaining indexes, pre-aggregations, statistics and many other data analytics staples, Streambased can meet the fast-filtering requirements of the event sourcing architectures outlined above.
Streambased stores metadata that enriches the source Kafka data. Unlike other approaches, with Streambased the source data is left untouched and is never moved or modified, preserving the Kafka performance and experience users already rely on. Metadata is associated with a block of offsets within a topic and allows us to determine the nature of the block without processing every message in it. The simplest (and most relevant to event sourcing) question we can ask the metadata is “Does this block contain value X?”. Let’s have a look at our example topic with this new metadata available:
Now the processing application can query the metadata alone to learn that, for accountId: 1001003, it need not process the first 4 events as only the last one is relevant. This massively reduces the processing requirement for queries of this type and makes Kafka alone a viable option for this type of use case.
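To show the idea in code, here is a heavily simplified illustration of block metadata. It is not Streambased’s actual metadata format, just a sketch of the principle over the toy topic from the earlier example:

```python
# Offsets 0-4 of the toy "transactions" topic used earlier.
transactions = [
    {"accountId": "1001001", "amount": 100},
    {"accountId": "1001001", "amount": -25},
    {"accountId": "1001002", "amount": 40},
    {"accountId": "1001001", "amount": 10},
    {"accountId": "1001003", "amount": 75},
]

# Illustrative block metadata: each block of offsets records which accountIds
# appear within it, so "Does this block contain value X?" is a set lookup.
blocks = [
    {"offsets": (0, 3), "account_ids": {"1001001", "1001002"}},
    {"offsets": (4, 4), "account_ids": {"1001003"}},
]

def fetch_block(start, end):
    # Stand-in for fetching a range of offsets from Kafka.
    return transactions[start:end + 1]

def balance_with_metadata(account_id):
    total, scanned = 0, 0
    for block in blocks:
        if account_id not in block["account_ids"]:
            continue  # the metadata says the block is irrelevant: skip it entirely
        for event in fetch_block(*block["offsets"]):
            scanned += 1
            if event["accountId"] == account_id:
                total += event["amount"]
    return total, scanned

print(balance_with_metadata("1001003"))  # (75, 1) -- only one message ever read
```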
This is only one of many techniques Streambased uses to accelerate access to Kafka data, but all of them follow the same principle: enrich the source data with metadata to minimise the number of events that must be processed. Metadata is typically much smaller than the Kafka data itself (usually around 100x smaller) and adding it does not require any transformation of the source data. This means it does not suffer from the consistency and cost issues that come with the intermediate store solution from the previous section.
To Sum Up (pun intended)
Event sourcing as an architecture opens up a number of exciting opportunities for insight, recovery and auditing by adding a historical element to every dataset in your business. Kafka has many of the traits of a good storage system for event sourcing but is missing some key features needed for fast data retrieval and filtering.
Streambased adds these features to base Kafka to close the gap between analytically focused and operationally focused systems. The combination of Kafka’s append-only log storage and resilience with Streambased’s acceleration technology creates a complete toolset suitable for all the analytical and operational cases that arise from an event-sourced dataset.