Streambased Indexing

Tom Scott

“We say Streambased is an analytics company but really it’s an indexing company” has been a sentence that’s come out of my mouth about a thousand times over the last few weeks. Indexing is the core technology that enables Streambased to create fast logical projections over Kafka data.

After all, what use is an Iceberg projection over Kafka, such as I.S.K., if you can’t use Iceberg’s partitioning capabilities to prune away unneeded reads? What use is a fully featured SQL engine in A.S.K. if it can’t run queries at interactive speed? Indexing is the technology that closes this gap.

Before we begin, let’s talk about accessing data in Kafka. Streambased uses the consumer API to fetch records from Kafka according to an offset (a monotonically increasing id assigned to a message) and a partition (a logical storage unit). The consumer API allows you to specify these coordinates and be sure to receive the desired record in the results.

Unfortunately, the API does not guarantee to return only the desired record; instead it returns a batch of records of unpredictable size. This matters a lot to indexing: any index that refers to a single record is likely to be very inefficient, so we should assume batching in our index design. For instance, a traditional forward index can map a search parameter (e.g. customer name: Tom) to a list of matching Kafka offsets. With batching, however, this becomes inefficient, as the list can contain multiple offsets from the same batch, causing that batch to be read multiple times.
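To make the batching constraint concrete, here is a minimal sketch (illustrative names and an assumed fixed batch granularity, not Streambased’s actual implementation) of a batch-aware forward index that maps a value to the offset *ranges* that contain it, rather than to individual offsets:

```python
BATCH_SIZE = 1000  # assumed batch granularity for this sketch

def batch_of(offset, batch_size=BATCH_SIZE):
    """Return the (start, end) offset range of the batch containing offset."""
    start = (offset // batch_size) * batch_size
    return (start, start + batch_size - 1)

def build_index(records, field, batch_size=BATCH_SIZE):
    """records: iterable of (offset, payload dict).
    Returns a map of field value -> set of batch ranges to read."""
    index = {}
    for offset, payload in records:
        index.setdefault(payload[field], set()).add(batch_of(offset, batch_size))
    return index

records = [
    (5,    {"customer": "Tom"}),
    (900,  {"customer": "Tom"}),    # same batch as offset 5
    (1500, {"customer": "Alice"}),
    (2100, {"customer": "Tom"}),
]
index = build_index(records, "customer")
# "Tom" appears at three offsets but only two batches need to be read
assert index["Tom"] == {(0, 999), (2000, 2999)}
```

Because “Tom” at offsets 5 and 900 falls into the same batch range, that batch is fetched only once instead of twice.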

This is just one example of the problems we face when bending Kafka to our analytical whims. However, in this post I’m just going to scratch the surface of our indexing tech and explain two ways in which we can index and accelerate simple but common analytical queries.

Filtering

The most common analytical query is one that selects a subset of data from a larger set. A set of criteria is supplied to the query (e.g. via the SQL WHERE clause) and records that do not match these criteria are excluded from the results.

Most Kafka approaches to this problem follow the “read and drop” pattern, where all records are read and those that don’t match the criteria are dropped from the results. This is extremely inefficient. Can we do better?

There is an indexing structure ideally suited to this problem: the bloom filter. Bloom filters test whether a particular value is present in a set. They provide a fast and space efficient way to answer the following questions:

  1. Does the set definitely not contain the probed value? (answered with 100% certainty)
  2. Could the set contain the value? (allowing for some false positives)

Imagine a bloom filter over a selection of Kafka offsets (say offsets 0-1000 for partition 0). We could test it to see whether it contains our search values and, if the test says it does not, skip reading that range entirely.

Applying this practice to a simple predicate query yields massively reduced read requirements. In the extreme case where only one Kafka message matches our criteria, the bloom filters reduce the read requirement to only the messages covered by the single filter that matches.
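As a rough illustration of the idea (a toy bloom filter for clarity, not Streambased’s implementation), the sketch below builds one filter per block of offsets and uses it to decide which blocks to read:

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: no false negatives, small false-positive rate."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # integer used as a bitset

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits & (1 << pos) for pos in self._positions(value))

# One filter per block of offsets; only read blocks whose filter
# says the value might be present.
blocks = {(0, 999): BloomFilter(), (1000, 1999): BloomFilter()}
blocks[(0, 999)].add("Alice")
blocks[(1000, 1999)].add("Tom")

to_read = [rng for rng, bf in blocks.items() if bf.might_contain("Tom")]
# (1000, 1999) is guaranteed to be included (no false negatives);
# (0, 999) is skipped unless a rare false positive occurs.
assert (1000, 1999) in to_read
```

The key property is the asymmetry: a “no” from the filter is definitive, so any block the filter rules out can be safely skipped without reading a single record from it.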

Pre-aggregation

The next most common query submitted over a data set is an aggregate query. These typically answer questions similar to “what is the count of my customers from each country I sell to?”. They contain a set of grouping criteria (country) and an aggregate function (count) and produce a result set with a row for each group/aggregate pair.

In a lot of data platforms, hints are included along with the data in order to speed up queries of this type but, as with filtering, Kafka has none of these. Luckily, we can apply a similar principle as with the filtering above to close the gap.

Streambased allows you to specify a set of grouping fields and a set of aggregate fields that will be used in common queries of this type. Streambased indexing will then pre-compute sum, count, min and max aggregates for these combinations over blocks of offsets, just as with filtering. The result is a pre-computed aggregate for offsets 0-1000, 1001-2000, and so on.

When a query of this type is run, Streambased can use the precomputed values rather than computing the values by reading records, removing both the expensive read and compute stages of the query.
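A minimal sketch of the principle (function names and the block layout are illustrative, not Streambased’s API): pre-compute sum/count/min/max per grouping key for each block of offsets, then answer an aggregate query by merging block summaries instead of re-reading records.

```python
def summarize_block(records, group_field, agg_field):
    """Pre-aggregate one block of records: key -> [sum, count, min, max]."""
    summary = {}
    for payload in records:
        key, v = payload[group_field], payload[agg_field]
        if key not in summary:
            summary[key] = [v, 1, v, v]
        else:
            s = summary[key]
            s[0] += v; s[1] += 1
            s[2] = min(s[2], v); s[3] = max(s[3], v)
    return summary

def merge(summaries):
    """Combine per-block summaries into the final aggregate result."""
    out = {}
    for summary in summaries:
        for key, (total, count, lo, hi) in summary.items():
            if key not in out:
                out[key] = [total, count, lo, hi]
            else:
                o = out[key]
                o[0] += total; o[1] += count
                o[2] = min(o[2], lo); o[3] = max(o[3], hi)
    return out

# Summaries for two blocks of offsets (e.g. 0-999 and 1000-1999),
# computed once at indexing time rather than at query time.
block_a = summarize_block([{"country": "UK", "amount": 10},
                           {"country": "UK", "amount": 30}], "country", "amount")
block_b = summarize_block([{"country": "UK", "amount": 20},
                           {"country": "FR", "amount": 5}], "country", "amount")

result = merge([block_a, block_b])
assert result["UK"] == [60, 3, 10, 30]   # sum, count, min, max
assert result["FR"] == [5, 1, 5, 5]
```

Note that sum, count, min and max are all decomposable: merging block summaries gives exactly the same answer as aggregating the raw records, which is what makes this pre-computation safe.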

The unindexed head

Streambased prides itself on using all of the very latest data in its queries; however, the techniques above can lag behind the latest data written to Kafka. Imagine a situation where 10010 messages have been written to Kafka but only 10000 have been indexed: it’s no good skipping those last 10 offsets in our query.

Streambased addresses this by reading the unindexed head (the last 10 messages) and combining them with the indexed data pre-computed above. By combining in this way you can be certain that your query includes everything available to it. One nice side effect of this is that, should something go wrong and index information not be available, Streambased query performance will gracefully degrade according to how much index information is available but the result will always be correct. A query with only half its data indexed will be faster than a query with no data indexed and slower than a query with all data indexed. They will all, however, produce the same result.
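The combination step can be sketched as follows (a hedged illustration with made-up names, assuming a simple count query):

```python
def count_with_head(indexed_counts, head_records, predicate):
    """Combine pre-computed per-block match counts with a live scan of
    the unindexed head (the records past the last indexed offset)."""
    head_count = sum(1 for r in head_records if predicate(r))
    return sum(indexed_counts) + head_count

# Blocks covering offsets 0-9999 are indexed; offsets 10000-10009 are not.
indexed_counts = [12, 7, 3]                          # pre-computed matches per block
head = [{"customer": "Tom"}, {"customer": "Alice"}]  # unindexed head, read live

total = count_with_head(indexed_counts, head, lambda r: r["customer"] == "Tom")
assert total == 23   # 22 matches from the index + 1 from the head
```

If some blocks lose their index entries, their records simply move into the scanned portion: the query gets slower, but the combined count stays correct, which is the graceful degradation described above.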

Conclusions

Streambased, while positioned as an analytics company, is fundamentally built on indexing technology that enables fast, efficient queries over Kafka data. By designing batch-aware indexes, it overcomes the inefficiencies of Kafka’s consumer API, allowing for advanced features like Iceberg projections and interactive SQL. Streambased uses techniques like bloom filters for filtering and pre-aggregations to reduce the need for costly reads and computations, significantly speeding up common analytical queries. If you're building analytics on Kafka and want truly interactive performance, it's time to explore what Streambased indexing can do for you.