This article covers data streaming and the data mesh, and how the two fit together.
To understand why streaming matters, we must first understand how traditional architectures worked and what we mean by data at rest. Consider a database. You can run two kinds of jobs against it. The first is a simple query: every time you want an updated answer, you must run that query again. The second is a batch job, run at every x interval, which extracts the data, transforms it, and loads it into another system. The database sits at the heart of the architecture, and the result is a spaghetti architecture with many point-to-point connections. Middleware architectures have been devised to tame this, but there are two problems with that:
- The middleware layer can become a bottleneck when multiple teams ask it to change things at their end, because it may not have the capacity to do so.
- The data is still at rest: consumers only see changes when the next query or batch run happens.
We want to make this data available without any bottleneck. The second, equally important requirement is that the data should not be at rest.
If you look at architectures today, every action is an event. Events come in from point-of-sale terminals or e-commerce orders. Everything in today's world is an event, and we want to ensure that we process these events in real time. This is where data in motion, or event streams, come into the picture, and where Confluent can help you.
We capture all the events coming in from different sources. We then want to transform these events and create value out of them, so that we can provide rich customer experiences. This is where the whole value of the platform lies.
Event streaming platform
An event streaming platform should satisfy four properties.
- It should be available in real-time.
- It should be scalable.
- It should be persistent and durable, so that messages do not simply disappear once they are consumed.
- The platform should be capable of enrichment.
An event streaming platform satisfies all these requirements; Confluent suggests Kafka in this particular case. It becomes the central nervous system through which all the data in your enterprise flows in and out, so all your applications can connect to it and tap into the flowing events. There is an entire ecosystem around it: connectors that bring data from different systems into Kafka, a stream processing layer such as ksqlDB (SQL over Kafka), and Kafka Streams for writing more advanced Java code.
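Kafka itself is the real implementation; purely as a mental model (the names below are illustrative, not the real client API), the central nervous system behaves like a shared append-only log that many consumer groups tap independently:

```python
from collections import defaultdict

class MiniLog:
    """Toy append-only log sketching Kafka's core abstraction.

    Records are retained after being read; each consumer group keeps
    its own offset, so many applications can tap the same stream.
    """

    def __init__(self):
        self.records = []                 # nothing is deleted on read
        self.offsets = defaultdict(int)   # read position per consumer group

    def produce(self, event):
        self.records.append(event)

    def consume(self, group):
        """Return the next unread event for this group, or None."""
        pos = self.offsets[group]
        if pos >= len(self.records):
            return None
        self.offsets[group] = pos + 1
        return self.records[pos]

log = MiniLog()
log.produce({"source": "pos-terminal", "type": "sale"})
log.produce({"source": "web", "type": "order"})
```

Note how each group's offset advances independently: an analytics application and a billing application each receive the full stream without interfering with one another, which is what lets every team "tap into" the same flowing events.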
It is not just a single physical Kafka cluster; it could be multiple such clusters talking to each other at any given time.
Consider a scenario: multiple devices and multiple connectors bring events from all these different sources and applications into Confluent, with a ksqlDB layer running on top. Whenever any event comes through, you get contextual information. The SQL statements in this case run against data in motion. You could have order data coming in from multiple places and login details coming in, and you want to aggregate all that information into one consolidated view of what is happening in your system.
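In ksqlDB this consolidation would be a streaming SQL aggregation; purely to illustrate the idea (the event fields here are invented for the example), here is a sketch that folds mixed order and login events into one per-customer view:

```python
from collections import Counter

def consolidated_view(events):
    """Fold a stream of order/login events into a per-customer summary."""
    orders, logins = Counter(), Counter()
    for e in events:
        if e["type"] == "order":
            orders[e["customer"]] += 1
        elif e["type"] == "login":
            logins[e["customer"]] += 1
    # One consolidated row per customer seen in either stream.
    return {c: {"orders": orders[c], "logins": logins[c]}
            for c in orders.keys() | logins.keys()}

events = [
    {"type": "order", "customer": "alice"},
    {"type": "login", "customer": "alice"},
    {"type": "order", "customer": "alice"},
    {"type": "login", "customer": "bob"},
]
view = consolidated_view(events)
```

The difference with data in motion is that such an aggregation runs continuously, updating the view as each event arrives rather than on a batch schedule.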
Consider a credit-card fraud detection scenario. We need to detect fraud in real time; we cannot do it post facto. That is why real-time data analysis is important.
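As a toy illustration of why this must run on the stream (the rule, window, and threshold below are invented for the example, not a real fraud model), a simple real-time check might flag a card that makes several transactions within a short window:

```python
from datetime import datetime, timedelta

def flag_suspicious(transactions, window=timedelta(minutes=5), threshold=3):
    """Flag any card with `threshold` or more transactions inside `window`.

    `transactions` is an iterable of (timestamp, card_id) pairs, processed
    in time order, the way a stream would deliver them.
    """
    recent = {}      # card_id -> timestamps still inside the window
    flagged = set()
    for ts, card in sorted(transactions):
        times = [t for t in recent.get(card, []) if ts - t <= window]
        times.append(ts)
        recent[card] = times
        if len(times) >= threshold:
            flagged.add(card)
    return flagged

t0 = datetime(2023, 1, 1, 12, 0)
txns = [
    (t0, "card-A"),
    (t0 + timedelta(minutes=1), "card-A"),
    (t0 + timedelta(minutes=2), "card-A"),   # 3 hits in 2 minutes
    (t0, "card-B"),
    (t0 + timedelta(hours=1), "card-B"),     # spread out over an hour
]
```

Run as a batch job hours later, the same rule would only tell you about fraud after the money is gone; applied per event on the stream, the third transaction can be blocked as it happens.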
Consider a typical enterprise. There would be multiple domains, such as orders and inventory. Whether the architecture is microservices-based or a monolith, all these domains send their data somewhere, usually a data lake: you send data into Kafka and offload it from there into the lake. Another team can then pick up that data, clean it, transform it, and move it into a format that other applications can consume, after which it is put into a data warehouse or another OLAP system. A potential problem is that the team working on the data is neither the one that produced it nor the one that will consume it, so they may do things with the data that were never intended. The first consequence is that data quality may suffer. The second is that, because you now rely on a centralized team to make all these changes on your behalf, that team itself may become the bottleneck. What ultimately suffers is agility and speed to market. That is where the important concept of the data mesh comes in.
Principles of Data Mesh
- Domain-driven decentralization
- Data as a first-class product
- Self-serve data platform
- Federated governance
Anti-pattern – Some applications send data, and all of it goes into a data warehouse. Responsibility for that domain data sits with the data warehouse team, which may not understand it. This is a problem, because you need someone who understands the data and what it denotes. So we want to evolve this into a pattern wherein data ownership sits with the domain itself.
Let us assume we have a data mesh. We have different domains, such as inventory, shipments, and orders, with teams working on each. Each domain publishes data for other domains to consume, and the owners of that data are the domains that create it.
Let us assume that the Billing domain creates some audit data that the inventory domain consumes. If someone in the inventory domain finds an issue with the data, they should inform the billing domain rather than change the data on their own. That way the fix also reaches every other domain using the data.
Data as a first-class product
It is not enough just to produce data. You need to make that data discoverable; it should be addressable, trustworthy, and secure. It should be shareable with other teams, because what good is data if you cannot share it with anybody else? The ultimate goal of this entire approach is for each domain to expose the data it owns and for the others to be able to consume it. You want to ensure that all this data is available to other domains in real time.
You could share all these data products via Kafka. Kafka provides a unique solution called Cluster Linking, wherein you run a command and it securely transfers data from a topic in one Kafka cluster into another topic in a destination cluster. From the destination cluster, multiple teams can subscribe to that data, make use of it, and publish it further. You can keep data in Kafka indefinitely, and because Kafka is pub/sub rather than point-to-point, you can replay data.
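Replay is worth making concrete. Because the log retains records regardless of who has read them, a subscriber can re-read from any earlier offset; a minimal sketch of that property (not the real consumer API):

```python
class Topic:
    """Toy retained topic: any subscriber can (re)read from any offset."""

    def __init__(self):
        self._records = []

    def publish(self, record):
        self._records.append(record)

    def read_from(self, offset):
        """Return every record from `offset` onward; reading never deletes."""
        return list(self._records[offset:])

topic = Topic()
for i in range(3):
    topic.publish(f"event-{i}")

first_pass = topic.read_from(0)   # a consumer processes everything
replayed = topic.read_from(0)     # later, it can replay from the start
```

In a point-to-point queue the first read would have consumed the messages; here a new team, or the same team after fixing a bug, simply reads the topic again from offset zero.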
Self-serve data platform
Assume I am in a domain. I want to get data from different data sources. I want to transform that data and publish it so others can consume it. I should not be dependent on some other team to get that data.
The first piece is getting the data from different data sources into Confluent. This is done via connectors. Connectors are Confluent-maintained applications, so you do not need to write any code; you just configure the settings. It is pretty much self-service. Once the data is in the platform, you can transform it. You could use ksqlDB, which is SQL over Kafka. Suppose you want some more complex logic; in that case, you could use Kafka Streams, which provides a domain-specific language for filtering, joining, and aggregating data. The final step is publishing the data, for example over a Kafka topic. While publishing data, it is important that the schemas and versioning are correct. This is where the Confluent Schema Registry will help you: it ensures that the data conforms to a certain schema.
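The schema step can be pictured like this. The sketch below uses a hypothetical schema and a hand-rolled check (not the Schema Registry API) to show the kind of produce-time validation the registry gives you:

```python
# Hypothetical "registered" schema for an orders topic: field name -> type.
ORDER_SCHEMA = {"order_id": int, "customer_id": str, "amount": float}

def conforms(event, schema):
    """True iff the event has exactly the schema's fields with the right types."""
    if set(event) != set(schema):
        return False
    return all(isinstance(event[field], ftype)
               for field, ftype in schema.items())

good = {"order_id": 42, "customer_id": "c-7", "amount": 19.99}
bad = {"order_id": "42", "customer_id": "c-7"}   # wrong type, missing field
```

The real registry goes further (Avro/JSON/Protobuf schemas, versioning, compatibility rules for evolving a schema), but the core idea is the same: malformed events are rejected before they ever reach consumers in other domains.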
We want to ensure that even though we are talking about decentralization, there is centralized governance on top of it. For example, a customer ID needs to mean the same thing across domains so that data can be correlated. That is why there needs to be some centralized governance. It is more of an organizational concern than a technological one.
Be pragmatic. Do not expect governance systems to be perfect; it is a process. Beware of centralized data models, which can become slow to change. Where they must exist, use processes and tooling such as GitHub to collaborate on changes.