Why is it so difficult for analysts
to get at a newly created Kafka topic?
5th January 2024
Introduction: Unlocking Widget Data Potential
Picture the scene: you’re a data scientist tasked with working out how to improve the reliability of the widgets your company makes. On your way home, looking at the app that controls your own widget (all widgets have Bluetooth 😉), you spot a new value in the UI: uptime. This is game changing. The average uptime of a widget is a key factor in determining its reliability, and you can’t wait to get back to work tomorrow and enjoy all the insight this new data set brings.
Challenges: Between Operational and Analytical Domains
When you get into work the next day, however, the reality is much less exciting. You track the uptime data down to your company’s Kafka infrastructure: widgets report their uptime, it’s processed by the various necessary services and reported in the app, in the microservices architecture we have come to know and love. The problem is that it’s not available to you. Kafka data lives in the operational realm, and you, as a data scientist, live in the analytical realm.
The ELT/ETL Hurdle: Data Accessibility Issues
We know what happens next: analysts submit requests to data engineers to create a new ELT/ETL pipeline to move the new data from operational systems to analytical ones. This is a multi-iteration process and a commitment to maintaining and enhancing the pipeline forever. It’s time consuming and expensive, and it quickly turns the excitement at the opportunities offered by the new data into a daunting prospect, where it is easier to see the concrete barriers than the theoretical advantages.
Streambased's Solution: Eliminating Data Pipelines
In our previous lives, the Streambased founders pushed up against these barriers many times and conquered them with varying levels of success. The realisation quickly came that the end-game solution to this problem was not to make pipeline creation tooling slicker and more self-service, but to do away with the pipeline altogether. By bringing analytical workloads to the data at its point of creation, we make access a simple case of updating ACLs rather than a complex engineering project.
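As a sketch of what “updating ACLs” means in practice, granting a new analyst read access with Kafka’s stock tooling might look something like the following. The topic, group and principal names here (widget-uptime, analyst-tools, User:analyst) are invented for illustration:

```shell
# Allow the analyst principal to read the uptime topic.
kafka-acls.sh --bootstrap-server broker:9092 \
  --add --allow-principal User:analyst \
  --operation Read --topic widget-uptime

# Also allow reads for the consumer group their tooling will use.
kafka-acls.sh --bootstrap-server broker:9092 \
  --add --allow-principal User:analyst \
  --operation Read --group analyst-tools
```

That is the whole “pipeline”: a permission change, reviewable and reversible, rather than a new piece of infrastructure to build and babysit.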
Practical Benefits: Direct Access with Streambased
Obviously it’s not quite as simple as that. One key reason the ELT approach has been so successful is that operational data is not in the right format to be consumed efficiently by analytical tools. In short, analysts prefer SQL and event streams do not! At Streambased we present Kafka data via industry-standard protocols (JDBC), so that analysts can bring their favourite tooling and techniques and apply them directly to the data at its origin. This alone is not enough, though: to make these analytical queries perform, we need to borrow some of the tricks on which analytical systems have been built for the last 20 years or more. Streambased offers indexing, statistics, pre-aggregation and more besides, all without moving or changing the original data set stored in Kafka.
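To make this concrete, here is the kind of query our data scientist might run once the topic is exposed over SQL. We can’t stand up Streambased inside a blog post, so this sketch uses Python’s built-in sqlite3 as a stand-in SQL engine over a handful of invented uptime events; the table and column names (widget_uptime, widget_id, uptime_hours) are hypothetical:

```python
import sqlite3

# Stand-in for the Kafka topic: a few invented uptime events.
events = [
    ("w1", 120.0), ("w1", 100.0),
    ("w2", 300.0), ("w2", 260.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widget_uptime (widget_id TEXT, uptime_hours REAL)")
conn.executemany("INSERT INTO widget_uptime VALUES (?, ?)", events)

# The analytical question: average uptime per widget, a proxy for reliability.
rows = conn.execute(
    "SELECT widget_id, AVG(uptime_hours) AS avg_uptime "
    "FROM widget_uptime GROUP BY widget_id ORDER BY widget_id"
).fetchall()

for widget_id, avg_uptime in rows:
    print(widget_id, avg_uptime)
```

The point is that this is plain SQL over JDBC, so the same GROUP BY runs unchanged from DBT, Superset, Qlik or Tableau; the indexing and pre-aggregation happen behind the interface, not in the analyst’s query.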
Case Study: A Data Scientist's Empowered Experience
Now let’s run the scenario again, but with Streambased. Our data scientist discovers this wonderful new operational data source and arrives at work the next day raring to go. They fire up their favourite JDBC tool (DBT, Superset, Qlik, Tableau…) and connect to Streambased. Given their role in the company, they likely already see the new dataset; if not, a simple and quick access request makes it available. Our scientist can go right ahead and perform whatever analytical tasks they need, with an expectation of performance in line with the other data stores they use day to day. With the extra insight they can make a real impact on the strategic path of the company, without letting purely technical issues stop them.
Data Lifecycle: Streambased's Maintenance Advantage
One final thought: once our scientist has finished their task, they will move on to the next one, and the uptime data that was so critical may now be left to rot, irrelevant to the next step in the journey. Spare a thought for the poor data engineer in the ELT approach: once a pipeline has been established it is rarely removed (when was the last time you heard a data analyst say “I’ll never need that data again, you can turn it off”?), and the engineer must maintain it through any changes that occur upstream or downstream, potentially forever. With the Streambased approach no pipeline was created, so no maintenance (ok, negligible maintenance) is required. Data engineers can go back to doing what data engineers should be doing, rather than grinding through pipeline grunt work.
Conclusion: Inviting Analysts and Engineers to Innovate
If you’re an analyst who wants new datasets available instantly, or a data engineer who wants to do something other than ELT, reach out to us now and see how we can help drive the change to a simpler, more open future.