Introduction
‍
Stream processing is hard! There, I said it. Effectively utilizing the data flowing in and out of your Apache Kafka cluster, and between the microservices it supplies, requires a deep understanding of some of the most complex problems in computing. You need to think about out-of-order messages, duplicate consumption, late-arriving data, and transactional processing. Whilst many attempts have been made to lower the bar to this powerful technology, it still remains the realm of experienced distributed systems and data engineers, and newcomers to the field can expect to spend a great deal of time reading before taking those tentative first steps.
‍
A Different Approach
‍
But does it need to be so? In this article, I’d like to take a different approach. The tools and techniques described above are hard to master, but is there anything special about the underlying data they work with? At the end of the day, Kafka data, like the data in most modern data systems, boils down to a bunch of files stored on disk in partitioned directories on servers. If you look at the underlying structure behind a Kafka topic and the structure of a Hive table, you see a lot of similarities. In fact, an early version of Streambased treated the files on disk exactly like this, orchestrating an Apache Spark cluster to read data directly from segment files and present it in a table-like structure.
‍
Simplifying Kafka Data
‍
Let’s say we decide to view Kafka data in this way. Can we remove a lot of the complexities involved in stream processing and begin to look at Kafka data as more of a simple data store? Does this make life easier, is it safe, and what are the consequences?
‍
Changing the Terminology
‍
Firstly, we can no longer use the streaming nomenclature. Terms like “topic,” “message,” “consumer,” and “producer” are very streaming-specific and no longer relevant to our new “Kafka as a simple store” world. Let’s translate them (a short code sketch after the list makes the mapping concrete):
- Topic: This is the easiest to translate. As logical groupings of similar data, topics and tables are pretty much interchangeable.
- Message: Another easy one. A message in Kafka can easily be viewed as a row within a big data table.
- Message field: If we’re viewing a message as a row, then any attributes of that message are surely columns. A collection of them makes up a row, and we can filter, aggregate, and perform a number of operations on them to achieve our goals.
- Schema Registry: One of the things that has made Kafka so flexible is that it considers messages to be just collections of bytes. This means an external system must impose structure on these messages so that we can determine field names and types. In the Kafka world, this is typically handled by a component named Schema Registry. An equivalent in the big data world would be the Hive Metastore.
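To make the translation concrete, here is a minimal Python sketch of the mapping, assuming a hypothetical topic named orders carrying JSON messages on a broker at localhost:9092 (both made up for illustration), consumed with the kafka-python client:

```python
import json
from kafka import KafkaConsumer  # kafka-python client

# Hypothetical topic name and broker address, purely for illustration.
consumer = KafkaConsumer(
    "orders",                                  # topic          -> table
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,                # stop iterating if no more messages arrive
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    row = message.value                        # message        -> row
    columns = list(row.keys())                 # message fields -> columns
    print(columns, row)
    break                                      # one "row" is enough for this sketch
```

In a real deployment the field names and types would come from Schema Registry rather than being inferred from a JSON payload, but the shape of the mapping is the same.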
‍
Using Kafka as a Data Store
‍
With the above translation complete, we can now use our new data source in all of the ways traditional relational databases have been used for decades. The lingua franca here is SQL, and “SELECT someField FROM someTopic JOIN someOtherTopic” works just as well as “SELECT someColumn FROM someTable JOIN someOtherTable”.
For many use cases, this is more than enough to get the job done, and it makes the data really easy to reason about. Particularly for exploratory work, viewing Kafka as just a simple data store means analysts, data scientists, and the like can interact with it using the languages and tools they are used to. Why not bring your Power BI, Superset, or Jupyter notebook and work directly with Kafka data (seriously, why not?). This approach is especially powerful when you consider that Kafka is usually the point at which data is ingested into your business. In a fully event-driven organization, it holds the rawest, most varied data available, exactly the kind of data you want to put in front of your data scientists!
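As a rough sketch of what that exploratory workflow could look like from a Jupyter notebook, again assuming the hypothetical orders topic with flat JSON payloads (the country and amount fields are invented for the example):

```python
import json
import pandas as pd
from kafka import KafkaConsumer

# Hypothetical topic/broker; payloads are assumed to be flat JSON objects.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5_000,          # stop iterating once we are caught up
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Each message becomes a row; its fields become the columns.
df = pd.DataFrame([message.value for message in consumer])

# From here on it is ordinary table work: filter, group, aggregate.
print(df.groupby("country")["amount"].sum())
```

It is the same data a stream processor would see, just framed as a table the analyst already knows how to work with.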
‍
Performance Considerations
‍
There is a catch, unfortunately. Kafka is not optimized for data access in this way. A typical analytical query focuses on only a small subset of the data, restricting it to a particular column value, grouping similar rows together, and so on. In Kafka, queries of this type usually end up reading everything from the very earliest data available all the way to the latest, a very slow process. The reason is that many of the technologies that make database queries fast in the analytical world are missing from Kafka. Technologies like indexing, pre-aggregation, and topic statistics can greatly improve query performance and are not available out of the box.
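To see why, consider what a query like “SELECT * FROM orders WHERE customer_id = 42” costs when all you have is the consumer API (the topic and field names are again hypothetical). There is no index to jump to the matching messages, so the client has to read the topic from the beginning and filter everything itself:

```python
import json
from kafka import KafkaConsumer

# Answering a selective query with nothing but the consumer API means
# scanning the whole topic from the earliest offset and filtering client-side.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",       # start from the very first message
    consumer_timeout_ms=5_000,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

matches = [m.value for m in consumer if m.value.get("customer_id") == 42]
print(f"read the entire topic to find {len(matches)} matching rows")
```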
‍
Streambased Solution
‍
At Streambased, we adopted these technologies and learned how to apply them to Kafka. We did it without moving or transforming data, using only the public Kafka APIs (meaning Streambased is compatible with all flavors of Kafka and Kafka-compatible systems like Redpanda and WarpStream). The result is a high-performance view of your Kafka data as if it were just another database.
With these enhancements the stream-table duality is complete, but just because you can doesn’t necessarily mean you should. I’ll end this post with some pointers on the attributes of applications that lend themselves to streams and those that lend themselves to tables. Find these in your own use cases, and the correct approach will jump out at you.
‍