Let’s start with a little bit about Cassandra, Stargate and what this technology is, beginning with Cassandra. It’s an Open Source Apache project, and even if you haven’t heard of it, you’ve definitely had data stored in it. It was Open Sourced by Facebook around 2008 and over those 12 years has grown to power a lot of what we think of as the modern internet. For example, the last time Apple spoke about using Cassandra, they had over a quarter of a million servers. Netflix very publicly scaled up on Cassandra, and Spotify, Uber and companies like them built their ability to scale over the last 10 years on Cassandra. It uses a peer-to-peer, active replication model and was one of the first NoSQL databases. In fact, I think we almost coined the term. That doesn’t mean it has no SQL, it just means we don’t have a lot of the relational things. We’ll look at CQL, which is very much like SQL, as we go along, and compare it to the APIs that we have. Cassandra is highly available, so it can handle node failures, and it’s very scalable: you add more nodes and you get more capacity and more throughput. The data model also works well for a lot of different use cases. This means there are loads of companies around the world (90% of the Fortune 100) that depend on Cassandra or DataStax Enterprise to store their data and manage things at scale.
Downloading Cassandra
We built on Open Source, and it’s been a challenging year and a half as we’ve ground out version 4.0, which was released earlier this month and is one of the most tested and stable versions of Cassandra ever. Some of the large users of Cassandra insisted that they be able to put it into production on day one of the GA release, and they did that. There’s a great new version out there and a lot of pent-up features that will soon be worked on for the releases that follow 4.0. So we’ve got this 12-year-old Open Source database that is scalable, able to handle all sorts of workloads and able to maintain availability when you lose nodes.
Let’s now think about how that ecosystem looks. We also have a project called K8ssandra, which DataStax has been working on with the community over the last year to move Cassandra into the Kubernetes world. We are reaching the point where we have a stable view of what Open Source Kubernetes operators should look like. The K8ssandra project is an Open Source project that gives you a production ecosystem out of the box. From the DataStax perspective, we have DataStax Enterprise, which is our long-running enterprise version of Apache Cassandra, and then Astra, which is a fully managed database as a service with consumption-based pricing and a serverless architecture. That’s what we’re going to do our demos on today, because it has our Stargate API sitting in front of it. This API gives us different ways of talking to that same Cassandra back end.
Traditionally, when you wanted to use Cassandra, you used a language called CQL, the Cassandra Query Language, which is a subset of ANSI SQL. Actually, when Cassandra first started, we used Thrift, but we decided that wasn’t good, so we put CQL in front. CQL is great for some use cases, but it is limiting in terms of the developers who want to go and learn that language and the frameworks that language will plug into.
Choose any API
We’ve built several sets of APIs that sit in front of Apache Cassandra, along with a framework for adding more, and these are the APIs that sit in front of DataStax Astra. Everything we do today, you can also do in an Open Source world. Stargate is an Open Source project, and you can put it in front of K8ssandra, which wraps Cassandra up in Kubernetes.
If we look at this slide, we are moving left to right from more structured to less structured. Cassandra Query Language is a SQL subset, very good for structured and key-value data, with strong typing and SQL compliance. We have support for GraphQL, and our first example of that is how to do GraphQL over your existing Cassandra data models. This is also great for structured and key-value data. The typing is reasonably strong, though not as strong as CQL, and the hierarchy in there is great for being able to join from table to table. Next, we’ve added REST to give the largest set of developers access to Cassandra. Over the years, people have written microservices that basically say: “Take a REST endpoint, turn that into Cassandra Query Language, get data out, turn it into JSON and hand it back”. We wanted to get rid of that work, as it is too complicated. Lastly, we’ve added a document API. We took JSON documents, which are semi-structured and weakly typed compared to CQL, and added indexing onto them. By treating it like a proper JSON document API, where we index every key field, we are able to provide queries across the whole document, without you having to do any of the database-y things that you’d expect to do with CQL.
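To make “we index every key field” a little more concrete, here is a toy Python sketch of the general idea: flatten a JSON document into path and value pairs, then build a lookup from each pair back to the documents that contain it. This is only an illustration of the concept and not how Stargate actually stores or indexes documents. The function names, the path notation and the sample documents are all made up for the example.

```python
from typing import Any, Dict, Iterator, List, Tuple


def flatten(doc: Any, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, leaf_value) for every leaf in a JSON-like value."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, prefix + (str(key),))
    elif isinstance(doc, list):
        for position, value in enumerate(doc):
            yield from flatten(value, prefix + (f"[{position}]",))
    else:
        # Leaf: a string, number, boolean or None.
        yield ".".join(prefix), doc


def build_index(docs: Dict[str, Any]) -> Dict[Tuple[str, Any], List[str]]:
    """Map every (path, value) pair to the ids of the documents containing it."""
    index: Dict[Tuple[str, Any], List[str]] = {}
    for doc_id, doc in docs.items():
        for path, value in flatten(doc):
            index.setdefault((path, value), []).append(doc_id)
    return index


# Hypothetical documents with different shapes; no schema is declared anywhere.
docs = {
    "user-1": {"name": "Ada", "address": {"city": "London"}, "tags": ["admin", "dev"]},
    "user-2": {"name": "Grace", "address": {"city": "London"}},
}
index = build_index(docs)

# Any field can be queried, because every leaf was indexed.
print(index[("address.city", "London")])
print(index[("tags.[1]", "dev")])
```

The two documents have different structures, yet a query on a nested field such as `address.city` works for both without any schema being created first.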
You can compare these four API styles doing roughly the same thing by writing the same data to tables through CQL, GraphQL, REST and the schemaless document API. They are different ways to read and write, starting with the very structured way of CQL, all the way up to the document API, where we get full JSON document support. Every field is indexed, I don’t have to create my schema and I can have different document structures, all sitting in front of the power and scalability of Apache Cassandra on the back end.
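As a rough sketch of that comparison, the Python below expresses one and the same write in each of the four styles. It only builds the statement or request description and sends nothing over the network. The host, keyspace, table and collection names are invented, and the URL paths and the GraphQL mutation name are assumptions based on the general pattern of the Stargate APIs, so check them against the Stargate documentation for your deployment.

```python
import json

# All of these names are hypothetical placeholders for illustration.
BASE_URL = "https://stargate.example.com"
KEYSPACE = "library"
TABLE = "books"
COLLECTION = "books_docs"

book = {"title": "Moby Dick", "author": "Herman Melville"}


def as_cql(row):
    """Most structured: a parameterised CQL INSERT plus its bound values."""
    columns = ", ".join(row)
    placeholders = ", ".join("?" for _ in row)
    statement = f"INSERT INTO {KEYSPACE}.{TABLE} ({columns}) VALUES ({placeholders})"
    return statement, list(row.values())


def as_graphql(row):
    """GraphQL mutation over the same table (mutation name is an assumption)."""
    fields = ", ".join(f"{key}: {json.dumps(value)}" for key, value in row.items())
    query = f"mutation {{ insert{TABLE}(value: {{ {fields} }}) {{ value {{ title }} }} }}"
    return {"method": "POST", "url": f"{BASE_URL}/api/graphql/{KEYSPACE}",
            "body": {"query": query}}


def as_rest(row):
    """REST: the row is simply the JSON body (path is an assumption)."""
    return {"method": "POST",
            "url": f"{BASE_URL}/api/rest/v2/keyspaces/{KEYSPACE}/{TABLE}",
            "body": row}


def as_document(doc):
    """Document API: any JSON shape, no table schema (path is an assumption)."""
    return {"method": "POST",
            "url": f"{BASE_URL}/api/rest/v2/namespaces/{KEYSPACE}/collections/{COLLECTION}",
            "body": doc}


print(as_cql(book))
for build in (as_graphql, as_rest, as_document):
    print(json.dumps(build(book), indent=2))
```

Moving down the list, less and less of the schema shows up in the call: CQL names the columns explicitly, while the document request just carries whatever JSON you hand it.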
Sample Cassandra Use Cases
So what types of use cases do people have for Cassandra? Pretty much anything you would use a database for. We don’t have triggers and we don’t have strong relational referential integrity, but I think the last 11 years have shown that those features are not as important for most applications. Some of the use cases are persistent session stores that need high throughput and low latency. I’ve seen this as the back end for massive online games, for your shop or what’s in your inventory, things like that. Other use cases are frequently accessed user data for apps and websites on a massive scale. Spotify has spoken a lot about their Cassandra usage, and Netflix uses it for your wish list and your play history. Bear in mind that the Cassandra data model works really well with time series data and allows you to have a sliding window of things. If you want to track user activity over a period and have that window move along, it works really well. We’ve also seen a bunch of use cases in AI and ML, where we need to pull out a bunch of data to enrich an event, or something that’s happened, so we can put it through the model, with random access to the data at millisecond latency over hundreds of terabytes. It is similar with business intelligence style workloads. For APIs, being able to connect microservices and front ends to something like Cassandra without having to jump through all the hoops that a driver may present is a great advance, and something I hope will make it a lot easier for people to get on to running Apache Cassandra.
Q&A Section
Q: How is security handled in these Stargate APIs?
A: It’s a token-based API. Behind the scenes, Cassandra has role-based access control, similar to any type of database. Stargate puts token-based control in front of that, so that it’s easier to pass credentials through the APIs: you just pass a token, and that token maps through to a set of permissions.
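A minimal Python sketch of that flow, under stated assumptions: the client attaches a token to each request, and the token resolves to a role that carries a set of permissions. The header name `X-Cassandra-Token` is my understanding of what Stargate expects, so verify it against the Stargate documentation. The tokens, roles and the lookup tables are invented for illustration and are not Stargate’s or Cassandra’s internal implementation.

```python
# Hypothetical role and token tables, standing in for role-based access
# control that the real database manages behind the scenes.
ROLE_PERMISSIONS = {
    "reader": {"SELECT"},
    "writer": {"SELECT", "MODIFY"},
}
TOKEN_ROLES = {
    "token-abc": "reader",
    "token-xyz": "writer",
}


def auth_headers(token):
    """Build the HTTP headers a client would send with every API call."""
    if not token:
        raise ValueError("an auth token is required")
    return {
        "X-Cassandra-Token": token,  # assumed header name, check your deployment
        "Content-Type": "application/json",
    }


def is_allowed(token, permission):
    """Resolve token -> role -> permissions and check the requested one."""
    role = TOKEN_ROLES.get(token)
    if role is None:
        return False  # unknown token: nothing is permitted
    return permission in ROLE_PERMISSIONS.get(role, set())


print(auth_headers("token-abc"))
print(is_allowed("token-abc", "SELECT"))
print(is_allowed("token-abc", "MODIFY"))
```

The caller only ever handles the token. Which role it maps to, and what that role may do, stays on the server side.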
Q: Does this only work with Apache Cassandra for now, or with other sorts of databases at the back end as well?
A: Stargate at the moment is coded to run against Apache Cassandra or DataStax Enterprise, our enterprise version of it. It will be interesting to see what we come to expect from the places we store data in the future, whether that is a database, a data service or a data platform. One of the reasons we’re doing this in Open Source is to get more opinions on what other back ends people should have access to. When we talk to enterprises around the world, we see that they are building enterprise data platforms with common API front ends that talk to various types of database back end. If we standardise APIs, we won’t need custom APIs for each type of database. If we’ve got a standard portal that everyone goes through, it’s easier to have a standard approach to security, to audit controls and to how we manage access.