At the bank, a series of lending decisions gets cleared over the weekend. On Monday morning, the Fraud team can see the live signals clearly enough – unusual login activity, a new device. But what they can’t see is whether this pattern has appeared before, across months of account history that long ago aged out of Kafka and now lives in Iceberg. The approvals have already gone through. The complete picture was never available in one place.
It’s an all-too-familiar scenario. The bank runs Kafka for real-time data and stores its analytical history in Apache Iceberg. It has all the data it needs, and both systems are working. The problem is that the Risk and Fraud teams never see the same version of their data at the same time – live signals in Kafka, historical context in Iceberg, and no single view that spans both. And so the conversation degenerates into a heated exchange about which system to trust…
The same challenge plays out across many sectors. At a regional energy supplier, for example, the Analytics team is trying to explain to Operations why their models didn’t catch the signals that preceded the weekend’s outage. The live telemetry was visible in Kafka for 36 hours before the fault cascaded. But without the historical baseline data sitting in Iceberg – the long-term performance records that would have confirmed the pattern as anomalous rather than routine – the models had no way to interpret what they were seeing.
Over at the head office of a national retailer, meanwhile, the Data Science team is being asked why their pricing models missed a demand spike that was building in real-time clickstream data. The answer is the same: the live signal was there, but the historical elasticity and seasonal context needed to act on it was in a separate system, refreshed on a batch cycle, and so never current enough to be useful at the moment the decision was made.
Is it a pipeline problem?
None of these organisations has been caught out by a lack of investment in data infrastructure. In fact, the bank had spent months building out a new pipeline specifically to improve the freshness of its risk models. The energy supplier had recently migrated to a modern data lake architecture that its vendors had promised would close exactly this kind of gap. And the retailer had three data engineers whose primary task was maintaining the ETL jobs that kept the analytical store current.
Stream processing frameworks can make real-time data faster to reach. What they cannot do is extend that reach back through the full historical archive. Kafka retention windows are finite by design. The data that aged out last month, last quarter or last year is in Iceberg – and no streaming pipeline can bring those two time horizons together into a single queryable view.
So the pipelines are running, and the data is moving. But still flawed decisions get made before the data arrives. The issue is the underlying model. ETL was built to move data from one place to another. It was never built to make both places tell the same story at the same time.
Understanding the ETL illusion
For a generation of data teams, ETL represented real progress: moving data from Kafka into a warehouse made it queryable and reportable. But it solves one problem only to compound another.
ETL solves movement but introduces latency. When a pipeline runs on a batch cycle, it creates what looks like a unified view but is actually two separate versions of reality frozen at different points in time. The moment the pipeline completes, operational truth – what’s happening right now in Kafka – drifts apart from analytical truth – what the warehouse believes based on its last refresh.
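The drift is easy to picture. Here is a toy sketch (illustrative only, not Streambased code) of how a batch-refreshed copy tells the same story as the live log for exactly one instant, then falls behind:

```python
# Toy illustration: a batch refresh freezes the analytical copy, then drifts.
kafka_log = ["evt-1", "evt-2", "evt-3"]   # operational truth, always live

warehouse = list(kafka_log)               # the ETL batch completes: in sync
assert warehouse == kafka_log             # for one instant, one story

kafka_log.append("evt-4")                 # ...then reality moves on
kafka_log.append("evt-5")

drift = len(kafka_log) - len(warehouse)
print(drift)  # 2 – events of operational truth the warehouse cannot see yet
```

However often the batch runs, the gap reopens the moment it finishes; shortening the cycle shrinks the drift but never removes it.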
The deeper problem is not just latency. Even a perfectly fresh pipeline only moves a window of recent data. The complete picture – the behavioural baseline, the historical anomaly record, the archived performance data – lives in Iceberg because Kafka was never designed for long-term retention. ETL moves a slice of data from one place to another. It does not create a continuous view that spans from right now all the way back to the beginning of the archive.
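The windowing problem can be sketched the same way. In this hypothetical fraud example (the event names and retention period are invented for illustration), an ETL job can only ever copy what Kafka still retains – so the historical precedent that would flag the pattern is invisible to it:

```python
from datetime import datetime, timedelta

NOW = datetime(2024, 6, 1)
RETENTION = timedelta(days=7)  # assumed Kafka topic retention window

# Illustrative events: one live login anomaly, and the same pattern months ago
events = [
    {"account": "A-100", "type": "new_device_login", "ts": NOW - timedelta(hours=2)},
    {"account": "A-100", "type": "new_device_login", "ts": NOW - timedelta(days=90)},
]

# Kafka only retains events inside the window; older ones have aged out
kafka = [e for e in events if NOW - e["ts"] <= RETENTION]
# The archive (standing in for Iceberg) holds everything that aged out
iceberg = [e for e in events if NOW - e["ts"] > RETENTION]

# An ETL batch can only copy what Kafka still has: a slice, not the history
etl_copy = list(kafka)
assert len(etl_copy) == 1  # the 90-day-old precedent is invisible to it

# Only a view spanning both horizons shows the pattern has appeared before
full_history = kafka + iceberg
repeats = [e for e in full_history if e["type"] == "new_device_login"]
print(len(repeats))  # 2 – the live signal plus its historical precedent
```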
Every decision made in that gap is a decision made without the full picture. The fraud analyst sees the current transaction but not the complete behavioural history. The network engineer sees the live telemetry but not the baseline that would confirm whether it is anomalous. The pricing model sees the live demand signal but not the years of seasonal context that would make it meaningful. No amount of pipeline investment changes that, because the gap is structural, not operational.
The reconciliation trap
Organisations that recognise this issue often build reconciliation processes to bring operational and analytical data back into alignment. It’s a logical response, but it also creates a permanent overhead that treats the symptom rather than the cause.
Every reconciliation process is an admission that copying data has created two sources of truth rather than one. Every engineering hour spent reconciling is an hour spent managing the consequences of a boundary that should not exist. And every decision made while reconciliation is still running is a decision made on data that has not yet been verified.
Most organisations have come to accept this as the cost of doing business with data. But it is only the cost of a specific architectural choice – the decision to copy data from where it is created into a separate system for analysis. That choice made sense when there was no ETL alternative. There is now.
No more need to move data
ETL moved data, but Streambased removes that need by treating Kafka and Iceberg not as two systems to be connected but as two time horizons of the same dataset. Rather than copying data from one to the other, it creates a unified logical view at query time, projecting Kafka topics as Iceberg-compatible tables and merging them with historical data already in Iceberg. No pipelines, no duplication, no batch cycle, no drift.
To any standard analytical tool, the result looks like a single, continuously current dataset stretching from the latest event in Kafka back through years of history in Iceberg. The results are transformative.
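The shape of such a query-time union can be sketched in a few lines. This is a simplified mental model, not Streambased's implementation: assume a single watermark marking the point up to which events have already landed in Iceberg, with everything newer read from Kafka. Splitting on one boundary means no row is read twice and there is no copy to drift:

```python
def unified_scan(iceberg_rows, kafka_rows, watermark):
    """One continuous view: history from the archive, the live tail from the log.

    Hypothetical sketch. Rows with ts <= watermark are served from Iceberg;
    rows newer than the watermark are served from Kafka, so an event that
    exists in both stores is yielded exactly once.
    """
    for row in iceberg_rows:
        if row["ts"] <= watermark:
            yield row
    for row in kafka_rows:
        if row["ts"] > watermark:
            yield row

# Illustrative data: Kafka still holds an event that has also been archived
iceberg = [{"ts": 1, "v": "a"}, {"ts": 2, "v": "b"}]
kafka   = [{"ts": 2, "v": "b"}, {"ts": 3, "v": "c"}]

view = list(unified_scan(iceberg, kafka, watermark=2))
print([r["v"] for r in view])  # ['a', 'b', 'c'] – overlap resolved, no gap
```

To the querying tool, the seam between the two stores is invisible; it simply sees one table that runs from the latest event back through the full archive.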
Let's go back to our original scenarios. The Risk and Fraud teams are now looking at the same system. The live signals are visible alongside the complete account history – months of behavioural data that would previously have aged out of Kafka – in the same query, at the same moment. The disagreement about which system to trust does not happen, because there is only one view. The conversation is now about risk, not about data.
The energy supplier's operations and analytics teams stop arguing about why the models missed the signals and start acting on them together, with live telemetry and long-term performance baselines available in the same query. The retailer’s commercial and data science teams stop reconstructing last week’s decisions and start making next week’s, with live demand signals and years of seasonal context visible simultaneously.
When Kafka and Iceberg are unified into a single logical view, the question is no longer: how do we move data quickly enough to be useful? It becomes: what do we want to know, and how do we ask it? The gap between the live signal and its historical context closes to zero – because there is no longer a copy to drift.