It is 2pm on a Tuesday and a fraud analyst at a major retail bank is watching a £47,000 wire transfer clear in real time. The behavioural model flagged it as low risk. What the model didn’t know – what it couldn’t know, because the data hadn’t arrived yet – is that this customer has made three unusual login attempts in the past hour, from a device they’d never used before. That data was sitting in a system that only updates overnight.
Three thousand miles of fibre away, a network engineer at a telecoms operator receives an alarm: Cell Tower #4732 is degrading. Is this an isolated fault or one in a pattern of failures emerging across the region over several weeks? The historical performance data that would answer that question is in a warehouse that runs on a four-hour batch cycle. By the time the pattern is visible, two more towers have gone down.
Meanwhile, in a pharmaceutical distribution centre, a cold-chain shipment triggers a temperature alert. The operations team needs to decide immediately whether to quarantine the batch or release it. To inform that decision, they can access the product’s historical excursion tolerance data, the carrier’s track record on the route, and the regulatory thresholds for the medicinal compound… but all that data lives in a system that was last refreshed this morning. So the decision gets made on partial intel, gut instinct and a phone call to a colleague.
None of these scenarios are edge cases. They are illustrations of a flaw in the modern data stack that, until recently, businesses have assumed they have to live with – that real-time data and historical data live in separate systems and cannot be queried as a single dataset at the same moment.
Between speed and context, there always had to be a trade-off. And the bill for that disconnect gets paid in all sorts of painfully concrete ways: lost revenue, missed marketing opportunities, customer churn, avoidable repairs, loss of competitive advantage.
Understanding the speed vs context trade-off
Over the past decade, two systems have emerged as the de facto standards for enterprise data: Kafka for real-time event streaming and Apache Iceberg for large-scale analytical storage. Both are excellent at what they do. Kafka gives you data at the speed of now. Iceberg gives you data at the depth of years. The problem is that they were built for different purposes and don’t know how to speak to each other.
Connecting the two means ETL: scheduled pipelines that copy data from Kafka into Iceberg on a batch cycle. This works, after a fashion, but in the process it introduces lag, duplicates storage and creates fragile, sprawling infrastructure that can break when something upstream changes.
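To make the lag concrete, here is a deliberately simplified sketch of the batch-cycle problem. Everything in it is illustrative – in-memory lists stand in for a Kafka topic and an Iceberg table, and the class and method names are invented for this example, not real APIs from either project. The point it demonstrates is only the timing: events that arrive after the last scheduled copy are invisible to any query against the warehouse until the next run.

```python
from datetime import datetime, timedelta

BATCH_INTERVAL = timedelta(hours=4)  # the warehouse refresh cadence

class BatchPipeline:
    """Toy model of a scheduled ETL copy: events arrive continuously,
    but the analytical copy only updates when the batch runs."""

    def __init__(self, start):
        self.stream = []       # operational truth: every event, as it arrives
        self.warehouse = []    # analytical truth: frozen at the last batch run
        self.last_run = start

    def ingest(self, event_time, payload):
        self.stream.append((event_time, payload))

    def run_batch_if_due(self, now):
        # The copy happens on the schedule, not when the data arrives.
        if now - self.last_run >= BATCH_INTERVAL:
            self.warehouse = list(self.stream)
            self.last_run = now

    def invisible_events(self, now):
        """Events that exist operationally but cannot be seen by any
        query against the warehouse at this moment."""
        self.run_batch_if_due(now)
        return len(self.stream) - len(self.warehouse)


t0 = datetime(2024, 1, 1, 8, 0)
pipe = BatchPipeline(t0)

# Three unusual logins in the hour after the morning batch already ran.
for minutes in (10, 25, 40):
    pipe.ingest(t0 + timedelta(minutes=minutes), "unusual_login")

# A fraud model querying the warehouse at 09:00 sees none of them.
print(pipe.invisible_events(t0 + timedelta(hours=1)))   # 3
```

In the bank scenario above, those three invisible events are exactly the login attempts the behavioural model never saw.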
The implications of this are profound. ETL moves data, but it does not unify decisions. Operational truth and analytical truth drift apart the moment the pipeline runs, so that every decision made in the gap between them is a decision made on partial information. Fraud models score transactions without live behavioural signals. Churn prevention systems fire without sight of recent service quality events. Predictive maintenance tools correlate sensor readings against historical baselines that are already hours old. Speed or context? You can’t have both.
Unaware of any alternative, the industry has normalised this trade-off. Real-time has come to mean fast but incomplete; historical has come to mean thorough but stale. And the teams caught between these two poles – the fraud analysts, network engineers, logistics operators, business analysts and risk managers – have been forced to work around an architectural constraint that actively undermines their efforts. Tools that should be making them sharper are blunted by the data they can’t reach. But so ingrained is the trade-off that it can be hard to recognise it as a solvable problem rather than an inevitable fact of life. Until you see an alternative.
Back to Tuesday: What the full data picture unlocks
Streambased is that alternative. Rather than copying data between systems, it exposes Kafka and Iceberg as a single logical dataset at query time, bringing queries to where the data already lives rather than moving data to where the queries are.
There are no pipelines to maintain, no duplication of storage and no ingestion lag. To any standard analytical tool – Tableau, Power BI, Snowflake, Databricks, Trino and the rest – the result looks like a single, continuously updated dataset stretching from the latest millisecond in Kafka back through the full historical record stored in Iceberg.
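The contrast with the batch approach can be sketched in the same toy terms: instead of copying the live buffer into the historical store on a schedule, a query scans both and merges the results at the moment it runs. Again, everything here is illustrative – the class and method names are invented for this sketch, not Streambased, Kafka or Iceberg APIs – but it shows the essential property: the newest event is visible to a query immediately, alongside the full history, with nothing copied.

```python
# Toy sketch of query-time federation over two stores.
class UnifiedView:
    """Presents a historical store and a live buffer as one logical
    dataset, merged at query time rather than by a scheduled copy."""

    def __init__(self, historical, live):
        self.historical = historical   # stand-in for years of Iceberg records
        self.live = live               # stand-in for the newest Kafka events

    def query(self, predicate):
        # No pipeline, no duplication, no ingestion lag: both stores
        # are scanned in place and filtered with the same predicate.
        return [row for row in self.historical + self.live if predicate(row)]


historical = [{"user": "a", "event": "login", "device": "laptop"}] * 3
live = [{"user": "a", "event": "login_failed", "device": "unknown"}] * 3

view = UnifiedView(historical, live)
suspicious = view.query(lambda r: r["device"] == "unknown")
print(len(suspicious))   # 3 – live signals visible alongside history
```

A real analytical tool would express that predicate as SQL, of course; the filter function here just stands in for the WHERE clause.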
To see the radical difference this makes, let’s return to our Tuesday-afternoon scenarios.
Over at the bank, the fraud analyst’s model now queries the customer’s complete behavioural history and the live session data simultaneously, during the authorisation window itself. The unusual device, the three failed logins, the transfer amount relative to a two-year spending pattern – all of it is visible in the moment it matters. The transfer is held for review, the fraud pre-empted.
Meanwhile, the network engineer pulls up Tower #4732’s alarm and instantly sees it overlaid against six weeks of performance history across the region. The pattern is unmistakable. Maintenance is dispatched to four more towers before any of them fail. The proactive fix costs a fraction of what reactive repair would have.
At the distribution centre, the cold-chain team queries the shipment alert against the product’s complete excursion history, the carrier’s route performance over three years and the applicable regulatory thresholds – all in a matter of moments. The batch is within tolerance. It ships.
Streambased: The benefits of real time
It’s the same data that was always there, and the same teams making the same decisions. The difference is the huge advantage derived from a unified view of live signals and historical depth. The fraud gets caught instead of cleared, the towers get fixed before customers notice and the shipment decision takes minutes instead of hours. Multiply those differences across every data-backed decision an organisation makes, and the business case becomes difficult to ignore.
A retailer can adjust pricing dynamically as a flash sale unfolds, with live basket data and conversion rates informing a model that already knows years of seasonal elasticity. A logistics operator can reroute a time-sensitive delivery in response to a live traffic disruption, cross-referenced instantly against the historical performance of every alternative carrier on that lane. Marketing and customer experience teams can adapt in-session, responding to how customer behaviour is changing in the moment rather than what it looked like yesterday.
And AI and machine learning models, so often starved of fresh, reliable features by the latency of traditional pipelines, can draw on a continuous stream that spans milliseconds of live data all the way back to years of archived history, without a pipeline project standing between a new data source and the model that needs it.
So speed and context were never truly in conflict – they simply lived in separate systems and were never exposed as the same dataset. The trade-off was an artefact of architecture rather than an inevitable fact of data life. But with Streambased this constraint, which has shaped how organisations think about real-time data for years, disappears. The only question now is how you will use your new data advantage.
