How Streambased works

A Unified Logical Layer For Data with no ETL

Streambased makes real-time and historical data behave like a single Iceberg table or a single Kafka topic, eliminating ETL, preserving performance, and unifying streaming + batch workloads on one logical layer.

The Streambased platform consists of two services:

Surfacing Kafka data as Iceberg
Keep dashboards and reports aligned with live data. Every Kafka topic is instantly available in Iceberg, so teams can query fresh events without waiting for pipelines to finish.

Surfacing Iceberg data as Kafka
Give streaming applications access to history. Every Iceberg table is available as a Kafka topic, so consumers can replay historical data with the same clients and protocol they use for live events.

I.S.K. - One Table Across All Time

Streambased I.S.K. presents a set of Iceberg tables composed of a section of real-time data from Kafka (the hotset) and a section of physical Iceberg data (the coldset).

Tables in I.S.K. combine these two sections in a way that is completely transparent to any client interacting with them: each one just looks like a regular Iceberg table.

The I.S.K. architecture consists of the following components:

A Storage Gateway
Iceberg expects files, so I.S.K. must provide a file-based interface to engines. I.S.K. presents an Amazon S3-compatible API that can serve both metadata and data files, with the data sourced from Kafka.
An Iceberg Catalog
I.S.K. presents a simple, read-only catalog for Kafka data; this is the entrypoint for Iceberg engines.
A Cache
To reduce impact on the Kafka cluster and improve Iceberg performance, I.S.K. caches files served by the storage gateway. These files represent sections of the immutable Kafka log and so can be cached and invalidated at will.
An indexing engine
Most Iceberg queries will not touch the entire dataset, but the Kafka API does not offer access patterns that easily target subsets of data. To bridge this gap, I.S.K. maintains indexes that map Iceberg partitions to Kafka offsets, enabling Iceberg engines to prune away the Kafka data they do not need.
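As a toy illustration (all names and values here are invented for the example, not the Streambased API), such an index lets a query fetch only the offset ranges its partition filter touches:

```python
# Hypothetical index built ahead of time:
# Iceberg partition value -> (first Kafka offset, last Kafka offset).
partition_index = {
    "2024-06-01": (0, 4_999),
    "2024-06-02": (5_000, 11_499),
    "2024-06-03": (11_500, 17_203),
}

def offsets_for_partitions(index, wanted):
    """Return only the Kafka offset ranges a query actually needs."""
    return {p: index[p] for p in wanted if p in index}

# A query filtered to a single day reads one offset range, not the whole log.
ranges = offsets_for_partitions(partition_index, {"2024-06-02"})
```

Everything outside the returned ranges is pruned before any Kafka fetch happens.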

K.S.I. - One Stream Across All Time

Streambased K.S.I. presents Kafka topics composed of a “hotset” section of data served directly from Kafka and a “coldset” section served from Iceberg.

Kafka’s partition and offset concepts are mapped from columns in the Iceberg data, allowing Kafka clients to interact with these tables as if they were Kafka topics.

The K.S.I. architecture consists of:

An Iceberg Engine
Required to fetch table-formatted data from the underlying Iceberg infrastructure.
A Row Processor
This component reformats the column oriented Iceberg data into the key/value based messages Kafka clients expect. Governance steps like Schema Registry integration are applied here too.
A Proxy (we use the open-source Kroxylicious)
Most requests/responses will be passed through to the underlying Kafka cluster but fetch requests that reference cold stored Iceberg data will be served by K.S.I. and not the underlying cluster.


Streambased composes a dataset made up of real-time data (from Apache Kafka) and historical data (from Apache Iceberg) with the design goals of:

A single source of truth – consistent data across streaming and analytical use cases.
Simplified architecture – fewer hops, fewer systems, reduced maintenance burden.
Zero-latency data access – using analytical tooling to query up-to-date “real-time” data without waiting for batch ingestion.
Elimination of ETL and data duplication – composition eliminates “ahead of time” data movement pipelines, increasing availability and reducing operational overhead.
Cost-efficient storage – leveraging Apache Iceberg for scalable, low-cost long-term retention for Kafka use cases.
Consistent governance and schema management – enforcing access control and data structure across underlying storage platforms.

Operational and analytical architecture unified - As Iceberg

Streambased serves Iceberg engines via I.S.K. (Iceberg Service for Kafka). Iceberg tables served by this component seamlessly combine Kafka and Iceberg data into a single logical view. From the Iceberg engine perspective, they behave exactly like standard tables, with no additional complexity or integration overhead.

The I.S.K. architecture consists of the following:

An Iceberg catalog: 

I.S.K. presents a simple, read-only catalog compliant with the Iceberg REST specification. This is the entrypoint for Iceberg engines and presents three namespaces:

1. Hotset – these tables represent only Kafka data.
2. Coldset – these tables represent only Iceberg data.
3. Mergedset – these tables represent a union of Kafka and Iceberg data, deduplicated by Kafka partition/offset.
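A minimal sketch of the mergedset rule (illustrative Python, not Streambased internals): union the two sets and keep exactly one row per (partition, offset), preferring the live Kafka copy on collision:

```python
def merge_dedup(hotset, coldset):
    """Union hot (Kafka) and cold (Iceberg) rows, deduplicated by
    (kafka_partition, kafka_offset). Hot rows win on collision."""
    merged = {}
    for row in coldset + hotset:  # hotset last, so it overwrites duplicates
        merged[(row["partition"], row["offset"])] = row
    return sorted(merged.values(), key=lambda r: (r["partition"], r["offset"]))

# Offset 2 exists in both sets (already materialised to Iceberg, still in Kafka);
# the merged table shows it exactly once.
cold = [{"partition": 0, "offset": 1, "v": "a"},
        {"partition": 0, "offset": 2, "v": "b"}]
hot  = [{"partition": 0, "offset": 2, "v": "b"},
        {"partition": 0, "offset": 3, "v": "c"}]
rows = merge_dedup(hot, cold)
```

The (partition, offset) pair acts as a natural primary key because Kafka never reuses an offset within a partition.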

A storage gateway

I.S.K. presents an Amazon S3-compatible API for serving Iceberg metadata and table data. When required, the storage gateway performs any necessary transformation (for instance, from the Kafka data format to Iceberg-native Parquet).

A hotset cache

To reduce impact on the Kafka cluster and improve Iceberg performance, I.S.K. caches files served by the storage gateway. These files represent sections of the immutable Kafka log and so can be cached and invalidated as required.
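The cacheability argument can be sketched in a few lines: because a closed (topic, partition, offset-range) slice of the log never changes, a file built from it can be reused until it is dropped. The class below is a toy illustration, not the Streambased cache:

```python
class HotsetCache:
    """Toy cache for files derived from immutable Kafka log ranges.
    A (topic, partition, start, end) slice never changes once written,
    so entries can be cached and dropped freely."""

    def __init__(self):
        self._files = {}

    def get_or_build(self, topic, partition, start, end, build):
        key = (topic, partition, start, end)
        if key not in self._files:
            self._files[key] = build()  # hit Kafka only on a cache miss
        return self._files[key]

    def invalidate(self, topic, partition):
        self._files = {k: v for k, v in self._files.items()
                       if (k[0], k[1]) != (topic, partition)}

cache = HotsetCache()
builds = []
make = lambda: builds.append(1) or b"parquet-bytes"  # records each real build
f1 = cache.get_or_build("orders", 0, 0, 999, make)
f2 = cache.get_or_build("orders", 0, 0, 999, make)  # served from cache
```

The second request never touches Kafka; the build function runs once per distinct offset range.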

Operational and analytical architecture unified - As Kafka

Streambased serves Kafka clients via K.S.I. (Kafka Service for Iceberg). Kafka topics served by this component represent the same core Streambased seamless logical view of Kafka and Iceberg, this time served via the Kafka protocol. Kafka’s partition and offset concepts are mapped from columns in the Iceberg data, allowing Kafka clients to interact with these topics as if they were native Kafka topics.

The K.S.I. architecture consists of:

A proxy 
Streambased serves Kafka requests via a proxy component that sits between the underlying Kafka/Iceberg infrastructure and clients. This proxy (based on the open-source Kroxylicious project) intercepts requests that can be served from the Streambased logical view and satisfies them via a separate code path. All requests that are not relevant to the logical view are simply passed through.
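A toy sketch of that routing rule (illustrative only; the real proxy operates at the Kafka wire-protocol level): fetches that fall inside the cold, Iceberg-backed offset range are answered locally, and everything else passes through to the cluster:

```python
def route_fetch(topic, offset, coldset_end, serve_from_iceberg, pass_through):
    """Toy routing rule: offsets at or below the coldset boundary are
    served from Iceberg; later offsets pass through to Kafka unchanged."""
    if offset <= coldset_end:
        return serve_from_iceberg(topic, offset)
    return pass_through(topic, offset)

# Offsets up to 1000 have aged out of Kafka into Iceberg in this example.
res = route_fetch("orders", 42, coldset_end=1_000,
                  serve_from_iceberg=lambda t, o: ("iceberg", t, o),
                  pass_through=lambda t, o: ("kafka", t, o))
```

From the client's perspective both paths return ordinary fetch responses; only the data source differs.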
An Iceberg engine
In order to fetch Iceberg table records (that will later be transformed and served as Kafka), K.S.I. must employ an Iceberg engine to query the underlying infrastructure. Streambased supports external engines (such as Trino or Spark) for this purpose, or it can use a small embedded service. 
A row processor
This component reformats the column-oriented Iceberg data into the key/value-based messages that Kafka clients expect. Governance steps like Schema Registry integration are applied here too.
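As a rough illustration (plain Python, assuming the table carries a field usable as the message key), the pivot from column-oriented data to Kafka-style records looks like this:

```python
def columns_to_messages(columns, key_field):
    """Pivot column-oriented data (a dict of equal-length lists, as an
    Iceberg scan might yield) into the key/value records Kafka clients expect."""
    n = len(next(iter(columns.values())))
    messages = []
    for i in range(n):
        row = {name: values[i] for name, values in columns.items()}
        messages.append({"key": row[key_field], "value": row})
    return messages

cols = {"user": ["alice", "bob"], "amount": [10, 25]}
msgs = columns_to_messages(cols, key_field="user")
```

In the real component this is also where governance steps such as Schema Registry serialisation would be applied.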

Unified governance across Iceberg and Kafka

Streambased enforces a single layer of schema management, access control and data policies that span both Kafka and Iceberg. This ensures that data remains consistent and compliant as it is ingested by Kafka, ages, and then transitions from Kafka into Iceberg. By aligning governance across both domains, organisations gain a single source of truth with predictable data contracts, simplified auditing and reduced operational complexity.

Streambased integrates with external data structure providers such as Confluent’s Schema Registry and leverages schema-evolution mechanisms in both Kafka and Iceberg to automatically ensure consistency. 

Pain-free transfer from Kafka to Iceberg

Traditional Kafka-to-Iceberg pipelines (e.g. Kafka Connect) involve uncomfortable compromises between latency, file layout and differing data-structure concepts. The Streambased composed view gives all applications access to Kafka and Iceberg data without requiring data transfer, sidestepping these compromises and delivering the full promise of a combined real-time and analytical view.

Streambased addresses the common Iceberg transfer pains of small files and snapshots:

Small files

Streaming pipelines write data to Iceberg as it arrives, generating many small, inefficient files. Streambased avoids this by exposing Kafka data as logical Iceberg tables, eliminating the need for immediate physical writes.

Snapshots

When new data is written to Iceberg, associated metadata is created alongside it. Snapshots are metadata recording which data was inserted at which times, enabling Iceberg’s time-travel feature. Like the small-files problem, a large number of snapshots degrades query performance and is costly to clean up. Streambased treats snapshots as a logical construct, allowing them to be created, merged or removed with minimal overhead.

High-performance data access via secondary indexing

Secondary indexing means creating additional structures that allow you to efficiently look up or filter data based on different attributes. In Kafka, data is naturally organised by offset, but a secondary index might allow fast access by another field (such as user, timestamp, or status) without requiring queries to scan everything. The same applies to Iceberg data via its partitioning feature.
Streambased creates secondary index structures as Iceberg tables that can be used to greatly accelerate queries that don’t naturally match Kafka’s produce/consume access pattern.
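A minimal sketch of the idea (the real indexes are stored as Iceberg tables; this toy version uses an in-memory dict): map each field value to the offsets that contain it, so a lookup touches only those offsets:

```python
from collections import defaultdict

def build_secondary_index(records, field):
    """Map each value of `field` to the Kafka offsets holding it, so a
    lookup by that field avoids scanning the whole log."""
    index = defaultdict(list)
    for rec in records:
        index[rec[field]].append(rec["offset"])
    return dict(index)

log = [{"offset": 0, "status": "ok"},
       {"offset": 1, "status": "error"},
       {"offset": 2, "status": "ok"}]
idx = build_secondary_index(log, "status")
# A query for errors now reads one offset instead of three.
```

Because the index itself is just a table, an Iceberg engine can join against it like any other data.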

Zero-latency CDC

CDC (Change Data Capture) involves continuously streaming inserts, updates, and deletes from source systems to create low latency materialised copies of the source data. It is a very common pattern with Apache Kafka.

Streambased takes a fundamentally different approach to CDC by avoiding the need to fully materialise streaming data into Iceberg ahead of time. Instead of writing every change through a sink connector, Streambased composable views combine pre-materialised data in Iceberg with live data directly from Kafka.

The result is immediate data freshness, ensuring events are queryable as soon as they arrive in Kafka. By deferring materialisation, Streambased also significantly reduces the usual streaming-to-Iceberg pains.

Overall, this approach simplifies the architecture, reduces infrastructure and maintenance overhead, and delivers truly real-time analytics without the trade-offs of traditional CDC pipelines.

Let’s find the right solution for your data

We’re here to help you unlock the full potential of your streaming data. Tell us about your challenges or ideas — and let’s explore how Streambased can support your business.
