What is data streaming?
There is a lot of confusion between the different systems out there, so let us begin by saying what it is not. It is not a media streaming platform like Netflix or Prime Video, and it is not a video streaming platform. Confluent is the dark knight powering such massive digital systems from a back-end perspective. Videos do not get streamed through Confluent. But every time we watch a series or a movie on Netflix, it generates an event (a customer interacting with Netflix) that needs to be recorded in the back-end system. Customer interactions, the different payment providers integrating with these platforms, and so on all generate data. In business terms, each of these is called a single event.
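To make this concrete, one way to picture such an event is as a small structured record that gets serialized and shipped to the back end. This is a minimal, hypothetical sketch; the class and field names are illustrative and not Confluent's or Netflix's actual schema:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class ViewEvent:
    """One customer interaction, e.g. pressing play on a series (illustrative)."""
    user_id: str
    title: str
    action: str
    timestamp: float

# The event itself is tiny; the value comes from capturing millions of them.
event = ViewEvent(user_id="u-42", title="some-series", action="play",
                  timestamp=time.time())
payload = json.dumps(asdict(event))  # serialized form that travels to the back end
print(payload)
```

In a real deployment this payload would be published to a streaming platform rather than printed, but the shape of the data is the same.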
Grab, a digital organization, uses Confluent. Grab provides different lines of business, including consumer applications and services. A key part of the business is capturing data from customer interactions and from the third-party vendors and partners that are part of the ecosystem. Traditionally, each of these events is captured in a database tied to a specific application. As more and more services and applications are built around the Grab ecosystem, it becomes difficult to consolidate all of that data into a single source of truth and use it both for new features and for analytics such as fraud detection. So Grab streams all collected data through the event streaming platform, where it is available to their analytics and machine learning teams, for instance to predict fraud.
Traditionally we built applications that interacted with the end-user. The end-user would take some action, which was recorded as a transaction in a database. When some data needed to be retrieved and returned through the application to the end-user, you would issue a query, which fetched the data from the database and plugged it back into the application. Traditional databases were designed for simple, static queries. The issue is that with a large number of users and large volumes of data, it becomes difficult to serve all of that with a single database.
The other downside of database technology was that you had to offload a lot of this data into a data warehouse to run large-scale data processing. Typically these workflows ran as batch jobs, triggered perhaps once a day or once a week, where you offload most of the data into a data warehouse and then run the analytics or reporting. Over time, organizations built various systems for different application teams, each with its own technology stack. This required a lot of integration between these systems at different levels: point to point, applications talking to applications, databases talking to databases, applications talking to multiple databases from other applications, and so on. It becomes complex as the business scales, and integrations can get messy when stitching all these systems together.
How do we approach this slightly differently?
That’s where the whole idea of data streaming or event streaming comes in. We look at everything that happens from a customer or end-user perspective as an event. For example, if a customer interacts with your business and purchases something, that sales order is an event that needs to be captured by the back-end system. If the order is processed and a shipment needs to be dispatched, the shipment is another event that occurs as part of your business process and is captured by your stream processing system. You need to take action based on all these different events. We are looking at business processes as events that occur over time. We want to react to those events in real time, process them, and make decisions for the business and the customer accordingly. The idea behind data streaming or event streaming is to capture data as and when it occurs at the source systems. You propagate it through your entire business process, and any department that needs the data has it in a central source of truth, from which it can be sourced at any point.
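The capture-and-propagate idea above can be sketched with a toy in-memory event log. This is a deliberately simplified stand-in for a streaming platform, not Confluent's API: events are appended to a durable history (the source of truth) and pushed immediately to every subscribed downstream consumer:

```python
class EventLog:
    """Toy stand-in for a streaming platform: an append-only log with subscribers."""

    def __init__(self):
        self.events = []       # durable history: the central source of truth
        self.subscribers = []  # downstream departments reacting in real time

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        self.events.append(event)      # captured as and when it occurs
        for handler in self.subscribers:
            handler(event)             # propagated immediately to consumers

log = EventLog()
shipments = []
# The shipping department reacts only to order events.
log.subscribe(lambda e: shipments.append(e) if e["type"] == "order" else None)
log.publish({"type": "order", "id": 1})
log.publish({"type": "page_view", "id": 2})
```

Every department sees the same log, but each reacts only to the events relevant to its own business process.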
What is the difference between data streaming and event streaming?
I have been using these two terms interchangeably, but data streaming is the concept of data flowing in from a source system, through Confluent, and then being acted upon. Event streaming is a way of looking at each of those data points as an event that occurs within the business process and is propagated into the downstream systems.

Data platform requirements have been changing over time, primarily because of the way businesses scale. Traditionally we built systems for historical data. Systems and apps that people use daily are now built around capturing things in real time and serving the customer in real time, so the shift is from designing systems for historical data to designing systems for the real-time events generated as part of the business process. The second requirement is scalability for your transactional data: data volumes have increased, both because of the number of users interacting with the system and because of the way systems are built with integrations to external and internal systems, so the back-end systems need to scale to that volume of data as well. The third is that traditionally, systems were built to be transient, something to be maintained and changed over time, whereas what we require now is one single system that can handle both historical and real-time data coming in from your source systems. Finally, the idea of a modern data system is to have all kinds of data, structured or unstructured, in a single central point, so that different lines of business can use it when they need to build new features or applications. Event streaming systems cater to all four of these major requirements of the data processing systems that today’s business applications require.
They are built for real-time events, so as and when events occur, they are captured and propagated. These systems can scale to large volumes of data and store both historical data and real-time data as it enters the system. You can transform and enrich the data. All of these together constitute what’s called a streaming system.
Why is there a need for event streaming and data streaming systems?
It is primarily the need for speed. Today’s applications are designed to support real-time interaction with customers. For example, consider telco systems. There are different business systems in the background, and operational systems in the back-end that integrate with each other so that the end-user, the customer subscribing to a specific network, is served adequately. Traditionally, in the case of a network failure or issue, we wait for the customer to complain, then go back, try to identify the root cause, and fix it for that customer. With real-time event streaming systems, we can instead integrate data from all these different operational and business systems. For example, in the network activity stream we might identify a couple of events, captured in real time, that indicate a potential failure in some network device. From the operational systems where the customer interacts through different applications, we can see that the application is not being served the normal way. These are events that can be captured in real time. You can trigger an alert to an end-user or business analyst, who understands that the customer will potentially have an issue interacting with the network. So before they complain, you could reach out proactively and let them know what you have observed, and that any issue they might see later is something you are already looking into. This is where proactive customer experience comes in, and real-time systems help capture these events and use them in business decision-making. Data streaming is being able to capture data from source systems or applications in real time.
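The proactive-alerting idea can be illustrated with a small sliding-window check over a stream of network events. This is a toy sketch under assumed field names (`device`, `status`), not a real telco monitoring pipeline: when repeated failures appear within a short window, an alert is raised so support can reach out before the customer complains.

```python
from collections import deque

def alert_on_failures(events, window=3, threshold=2):
    """Scan a stream of network events and flag devices whose recent
    window contains repeated failures (field names are illustrative)."""
    recent = deque(maxlen=window)
    alerts = []
    for event in events:
        recent.append(event)
        failures = [e for e in recent if e["status"] == "fail"]
        if len(failures) >= threshold:
            alerts.append({"device": event["device"], "reason": "repeated failures"})
            recent.clear()  # avoid duplicate alerts for the same burst
    return alerts

stream = [
    {"device": "d1", "status": "ok"},
    {"device": "d1", "status": "fail"},
    {"device": "d1", "status": "fail"},
]
print(alert_on_failures(stream))
```

A production system would run this continuously over the live network activity stream rather than over a finished list, but the windowed-detection logic is the same shape.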
Several people have coined different definitions of the data mesh. We will focus on what data mesh means in the context of designing modern real-time systems. The principles of data mesh start with domain-driven decentralization, meaning each line of business owns its data and shares it across the organization through a central platform. Anyone who needs data from a couple of different lines of business first needs to design a data product: you know the requirements and where the data needs to be sourced from, you capture it from there, and then you build the data product on top of the data platform. You identify how the integrations must be done across the various lines of business, bring that together through a single-source-of-truth data platform, and then you are able to share data across domains as required.
Regarding data sharing, a key thing to take into account is federated governance. You need to control who has access to what kind of data, and that needs to be built into your data platform as well.
Principles of designing data mesh at Confluent
Principle 1: Domain-driven decentralization
Traditionally, analytics or data systems were built within organizations such that different applications and databases loaded all of their data into a data warehouse, where you ran your analytics. We are breaking that up into decentralized data ownership. Each team or line of business can have its own technology stack to work with. They share the data, as and when it occurs, in real time with the central streaming platform. All of the data is available on the central streaming platform, and whoever needs it can request access and consume it.
Principle 2: Data as a first-class product
This principle brings in the whole concept of sharing data and building new data products across the organization.
At Confluent we have various patterns for sharing data. If you decide you need a database to store some of the data, you can use connectors from Confluent to sink that information into a database and then build your reports on top of it. If you want to consume that data directly from Confluent into your application, you can do that too. There are different patterns through which you can construct data-sharing models within your organization.
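The two patterns above (sink into a database for reporting versus consume the stream directly) can be contrasted with a toy example. This is an in-memory stand-in, not the Kafka Connect or Confluent client API; the topic and field names are invented for illustration:

```python
# One shared stream of order events (a stand-in for a topic).
orders_topic = [
    {"order_id": 1, "amount": 30},
    {"order_id": 2, "amount": 55},
]

# Pattern 1: a sink-connector-style copy of the stream into a "database",
# keyed by order_id, so reporting tools can query it.
reporting_db = {e["order_id"]: e for e in orders_topic}

# Pattern 2: an application consumes the same stream directly and
# computes what it needs on the fly.
total_revenue = sum(e["amount"] for e in orders_topic)
print(total_revenue)
```

The point is that both consumers read from the same source of truth; neither pattern requires the other team to change anything.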
Principle 3: Self-serve Data Platform
The whole idea is to have all of this through a single data platform, and that is where Confluent sits in the middle of all the different systems being built. It is like a central nervous system: data from the multiple source systems flows through Confluent, and anyone who needs to consume it can do so as long as they have the right access.
Principle 4: Federated Governance
We have capabilities like cataloging, where you can get a view of all the data present within your Confluent platform, and you can give specific teams access to specific datasets.
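A minimal sketch of how catalog-plus-access-control can work is a mapping from datasets to sensitivity tags, with teams granted the tags they are allowed to read. The catalog entries, team names, and tag scheme here are all hypothetical, not Confluent's actual catalog model:

```python
# Hypothetical catalog: each dataset carries tags describing its sensitivity.
catalog = {
    "payments.transactions": {"owner": "payments", "tags": {"pii"}},
    "rides.completed": {"owner": "rides", "tags": set()},
}

# Which sensitive tags each team has been granted.
grants = {"fraud-team": {"pii"}}

def can_read(team, dataset):
    """A team may read a dataset only if it holds every tag the dataset requires."""
    required = catalog[dataset]["tags"]
    return required <= grants.get(team, set())  # subset check

print(can_read("fraud-team", "payments.transactions"))
```

Untagged datasets are readable by everyone in this sketch, while anything tagged as sensitive requires an explicit grant, which is the essence of federated governance on a shared platform.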
Confluent sits as a central hub for events, and event streaming is a good fit for the whole data mesh concept. Data mesh is all about bringing various data systems together, sharing data across the organization, and giving the right people access to build new services for your customers. Event streaming is a perfect fit because business systems are designed for real-time decision-making, and real-time decision-making requires capturing data in real time from all these different data systems and delivering it in real time to the end consumer.