This talk is about Deletion Framework, which is how Facebook (or Meta, as it is now called) upholds its commitments around user data deletion. The team behind it is responsible for building the infrastructure required to run deletions; in other words, Deletion Framework is the framework responsible for graph deletions. Imagine you want to delete your Facebook account: we need to identify the logical subgraph that represents your data, which is spread across many machines and many data stores, and ensure that there is no trace of you after a given amount of time. In terms of scale, Deletion Framework handles 500 billion new graph deletions per day, which amounts to approximately a trillion individual objects deleted per day, across a dozen interconnected datastores that hold the entirety of the Facebook graph. Finally, this talk is a bird's eye view of many systems rather than a deep dive.
Design overview
Without further ado, let's get started with the design overview of Deletion Framework. First of all, I'd like to talk about how we rely on a Data Definition Language. This was a very early learning in the lifetime of this team: we cannot require developers to care about privacy or about regulation. At first, we asked developers to write their own deletion logic. This led to many issues, the first one being that people wrote bugs in the deletion logic, which quickly becomes a problem because you want that logic to be extremely reliable so that it runs without any issue. The deletion logic and the data definition would also just drift apart: product teams would update their products, and the data they collect and hold, but they wouldn't go and update the deletion logic. So we transitioned to code generation. There is a Data Definition Language at Facebook that we basically piggybacked on, adding deletion annotations. The first annotation is the storage configuration, which tells us where the data is stored; this one already existed before us. We added the Deletion Edge Annotations, which tell us, for a given edge in the graph (shallow, deep, refcounted), how to traverse it, and the Deletion Constraints, which are a way for developers to protect their data.
If we look at the example here, we have three schemas, user, post, and comment, and we can see that they have edges between them. The edge from user to post is what we call a deep edge: once we see this edge, we want to traverse it and recursively delete whatever is on the other side. Conversely, the edge from post to user is a shallow edge, because when we delete the post, we don't want to delete the author. Those edge annotations are what tell us how to iterate over the entire graph. We can also see the other annotations on the schemas, like the storage, which in this case is TAO, our biggest graph database at Facebook, and the constraints. For example, the user constraint here is "never deep deleted", which means that a user will never be deleted as part of another object's deletion. Users will always be the top-level object; you never want to delete the user when you delete a post. This safeguards against misconfiguration. The post schema says "deep deleted by any other entity", which means it accepts deep edges from any type but requires at least one. Comments carry another type of constraint, "deleted by post", which shows that developers are able to restrict which object types can delete their data.
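To make the annotations concrete, here is a minimal sketch of how such schemas might be expressed, assuming an invented Python representation; the real Data Definition Language, its annotation names, and the constraint strings are internal and not shown in the talk.

```python
# Illustrative only -- EdgeBehavior, Schema, and the constraint strings are assumptions.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class EdgeBehavior(Enum):
    DEEP = "deep"        # traverse the edge and recursively delete the target
    SHALLOW = "shallow"  # delete the association only, never the target


@dataclass
class Edge:
    target: str
    behavior: EdgeBehavior


@dataclass
class Schema:
    name: str
    storage: str        # e.g. "TAO"
    constraint: str     # e.g. "never_deep_deleted"
    edges: List[Edge] = field(default_factory=list)


USER = Schema("User", storage="TAO", constraint="never_deep_deleted",
              edges=[Edge("Post", EdgeBehavior.DEEP)])
POST = Schema("Post", storage="TAO", constraint="deep_deleted_by_any",
              edges=[Edge("User", EdgeBehavior.SHALLOW),
                     Edge("Comment", EdgeBehavior.DEEP)])
COMMENT = Schema("Comment", storage="TAO", constraint="deleted_by:Post")
```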
We use all of these annotations to generate code. The UserDeleter and the PostDeleter, for instance, are completely generated. Deleters mostly delegate work to other parts of the infrastructure; in this case, they end up delegating to what we call primitive deleters, our interaction points between Deletion Framework and the different storages that we integrate with. Those primitive deleters are written by the Deletion Framework team as part of their effort to expand the coverage of Deletion Framework. Basically, the generated deleters iterate over the graph, and once we reach a leaf, we execute a primitive deleter that issues the delete to the datastore.
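As an illustration, here is a rough sketch of what a generated deleter delegating to a primitive deleter could look like; every class and function name here is a hypothetical stand-in, not the real generated code.

```python
class TaoPrimitiveDeleter:
    """Interaction point between the framework and one datastore (here, TAO)."""

    def delete_object(self, object_id: int) -> None:
        print(f"DELETE object {object_id}")   # stand-in for the real storage RPC


def fetch_comment_ids(post_id: int) -> list:
    """Stand-in for a graph read listing the deep-edge targets of a post."""
    return []


class CommentDeleter:
    def __init__(self, primitive: TaoPrimitiveDeleter):
        self.primitive = primitive

    def delete(self, comment_id: int) -> None:
        self.primitive.delete_object(comment_id)


class PostDeleter:
    """What a generated deleter might do: recurse over deep edges, then self-delete."""

    def __init__(self, primitive: TaoPrimitiveDeleter):
        self.primitive = primitive

    def delete(self, post_id: int) -> None:
        for comment_id in fetch_comment_ids(post_id):    # deep edge: recurse
            CommentDeleter(self.primitive).delete(comment_id)
        # The shallow edge to the author is dropped with the post; the user object stays.
        self.primitive.delete_object(post_id)


PostDeleter(TaoPrimitiveDeleter()).delete(42)
```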
Walking the graph
Let's now look at how we actually walk the graph. We use a DFS because it is a simple algorithm to implement and it is quite cheap to run. We need to be reentrant because we delete very large graphs that we are not going to be able to delete in a single run. And we need at-least-once semantics: every object needs to be deleted at least once.
If we take this example graph, where we have a deletion request on a post, the framework first reads the data we are trying to delete, in this case the post. We then checkpoint on a persistent stack, the same stack you would see in any DFS, the only difference being that it lives in blob storage somewhere; this is what allows us to be reentrant. If the job gets killed halfway through, we can restart the deletion, read the stack, and know exactly where in the graph we were. The data we need to iterate over the graph is also preserved in this persistent stack. Then we log restoration logs before actually doing any deletes, to allow us to restore data. Then comes the self deletion, followed by the graph deletion. We read the schema definition and list all the subtypes we need to fetch. Here, we don't traverse the association pointing to the author, as it is a shallow edge. The comments, on the other hand, are behind deep edges, so we first traverse them, read what's on the other side, put it on the persistent stack, log the restoration logs, delete the object, and continue the graph deletion recursively from there. In this case, there is just an author edge, which is shallow, so we only drop the association. Once a deleter is done, we pop it from the stack, go back up, log the restoration logs, and finally delete the path that led us to those objects. After all that, we continue iterating: deleting objects on the way down, putting everything on the stack, popping objects on the way up, and once the top-level object pops from the stack, that is the sign that everything has been deleted.

As you have seen, there are a lot of round trips involved, which is why we introduced the concept of batching. The actions are ordered like this: the framework first reads a lot of data and won't stop until either we have aggregated enough data to reach a payload size that makes sense, or until we reach a deleter that cares about the ordering of the deletes (the vast majority of deleters don't care about ordering, but some do). Then we checkpoint everything on the stack, which allows us to retry the deletion if any of the deletes fail. After that, we log the restoration logs in one much bigger record, delete everything at once (some deletes might fail, in which case the deletion bails out and restarts), then pop everything and keep going. That's how we can go much, much faster than we would otherwise.
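Below is a minimal, runnable sketch of the reentrant DFS idea, assuming a toy in-memory graph and an in-process dict standing in for the blob-storage checkpoint; none of these names come from the real framework, and batching is omitted for brevity.

```python
GRAPH = {"post:1": ["comment:1", "comment:2"], "comment:1": [], "comment:2": []}
PERSISTED_STACKS = {}   # stand-in for the blob-storage checkpoint of each deletion
DELETED = set()


def run_deletion(deletion_id, root):
    # Resume from the checkpointed stack if a previous run was killed halfway through.
    stack = PERSISTED_STACKS.get(deletion_id) or [root]
    while stack:
        obj = stack[-1]
        pending = [child for child in GRAPH.get(obj, []) if child not in DELETED]
        if pending:                               # deep edges: go down first (DFS)
            stack.extend(pending)
        else:
            print(f"restoration log for {obj}")   # write-ahead log before the delete
            DELETED.add(obj)                      # at-least-once: repeating this is safe
            stack.pop()
        PERSISTED_STACKS[deletion_id] = list(stack)   # checkpoint after every step


run_deletion("d1", "post:1")   # deletes comment:1 and comment:2, then post:1
```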
Scheduling deletions in the future
Now, let's quickly look at how we schedule deletions in the future. This is essential to enable ephemerality at Facebook: it is what powers Stories, or the account deletion grace period (when you try to delete your account, you have 30 days to log back in and cancel the deletion). We support scheduling years into the future, which is very useful for financial records, for instance. We also support custom TTL logic, which allows us, for example, to schedule the deletion of a post 9 days after its last comment. This requires us to continuously re-evaluate what is due, and in terms of scale, we process 160 billion events per day. This has been growing much, much faster than the rest of deletion.
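As a sketch of what custom TTL logic could look like, here is the "9 days after the last comment" example expressed in hypothetical Python; the function names and the scheduler interface are assumptions.

```python
from datetime import datetime, timedelta


def post_expiry(last_comment_time):
    """Hypothetical custom TTL: the post expires 9 days after its most recent comment."""
    return last_comment_time + timedelta(days=9)


def maybe_schedule_deletion(post_id, last_comment_time, now, schedule):
    # Re-evaluated continuously: a new comment pushes the expiry further out.
    if post_expiry(last_comment_time) <= now:
        schedule(post_id)   # hand off to the asynchronous deletion pipeline


maybe_schedule_deletion(
    post_id=42,
    last_comment_time=datetime(2022, 1, 1),
    now=datetime(2022, 1, 12),
    schedule=lambda post_id: print(f"scheduling deletion of post {post_id}"),
)
```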
This is a quick overview of the infrastructure. It is really a tremendously big data pipeline, but to recap: the aggregation pipeline gathers all of the newly created schedules, the status pipeline looks at everything that is in flight (things that were scheduled, things that failed, things that still need to be scheduled), and the retry and just-in-time pipelines take items from the status pipeline and schedule them to be executed later.
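Conceptually, the status pipeline boils down to bucketing every in-flight deletion for the downstream pipelines. A tiny illustrative sketch, with invented statuses and record shape, might look like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeletionRecord:
    deletion_id: str
    scheduled_at: Optional[float]   # epoch seconds, None if never scheduled
    failed: bool = False


def status(record: DeletionRecord) -> str:
    """Bucket each in-flight deletion so the retry and just-in-time
    pipelines know what to pick up."""
    if record.failed:
        return "needs_retry"
    if record.scheduled_at is None:
        return "needs_scheduling"
    return "scheduled"


print(status(DeletionRecord("d1", scheduled_at=None)))   # -> needs_scheduling
```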
Guarantees
The first guarantee I will talk about is scheduling. A deletion runs in two phases: a sync part, which runs in the web request, and an async part, which iterates over the graph. We split it this way because it is not possible for us to iterate over the graph during the web request, and in most cases the sync part is only responsible for deleting the top-level object. To do this, you just need to write the metadata, log the deletion request, delete the top-level object, and schedule the async job to delete the rest later. That is how we guarantee that we accept the deletion request while spending as little time as possible on it, so that we don't interrupt users.

The sync part is also responsible for hiding the subgraph: whenever we delete a post, we want all of the comments and replies to be hidden, so that even if you have a direct link to the content, something prevents you from reading it. We achieve this using privacy policies. They are a feature of the privacy layer of Facebook, which is offered by the Data Definition Language. That privacy layer is what enables Facebook to say that photos are hidden if you are not friends with that person; it is the layer that executes all of the privacy checks that define whether things are visible or not, and it is what we use to do real-time validation on content while it waits to be deleted. There is one issue: the privacy layer is distinct from the deletion graph. We are making efforts to merge them so that, given the definition of the deep edges and their presence (or absence), we can decide whether an item should be visible on the platform. However, it is a very complicated problem and it is not mature enough yet. Therefore, to make it work today, whenever we run a deletion we do sampled checks on the data: we try to load the data using the viewer context of the person who scheduled the deletion, and if we can load any of it, it means the privacy policies are not correct, because we shouldn't be able to load it. We then escalate this to people in the company to resolve. That is how we guarantee that the data is hidden properly.
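Here is a sketch of the synchronous phase of a deletion request; every helper below is a hypothetical stand-in for internal infrastructure, not a real API.

```python
def write_deletion_metadata(object_id):
    print(f"metadata written for {object_id}")


def log_deletion_request(object_id):
    print(f"deletion request logged for {object_id}")


def delete_top_level_object(object_id):
    print(f"top-level object {object_id} deleted")


def schedule_async_graph_deletion(object_id):
    print(f"async graph deletion scheduled for {object_id}")


def handle_delete_request(object_id):
    """Sync phase: do the minimum inside the web request, defer the graph walk."""
    write_deletion_metadata(object_id)
    log_deletion_request(object_id)
    delete_top_level_object(object_id)        # the subgraph is hidden by privacy policies
    schedule_async_graph_deletion(object_id)  # the DFS over the graph runs later


handle_delete_request(42)
```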
Eventual completion
This is a big one, because deletions can run into many issues, and even though we have some reliability guarantees, our users expect us to be 100% reliable. To get there, we need to build safety nets around all of the issues we have. Whether it is infrastructure issues, bugs in the deletion code, deletions that get dropped by dependencies, or deletions that fail halfway through scheduling and are left in a weird inconsistent state, none of that matters: every deletion started must eventually complete. So we log every event from the lifecycle of a deletion into a huge table that we call the deletion history, and then we have a pipeline that computes a daily status for every deletion that is in flight. We extract two datasets from there. The first one is idle deletions, which are deletions that haven't been executed in some amount of time. All we have to do for those is reschedule them; that is the first layer of defence, trying to mitigate the issue automatically. We also have monitoring there that identifies the longest not-running deletion in the entire framework. Those alerts exist because even when you have safety nets, all of the remediation you have might fail; that was a key learning for us. The second dataset is stuck deletions, which aggregates all of the deletions that run into persistent infrastructure issues. We try to group them into units of work in order to predict who in the company is the best person to tackle the problem and assign them a task with the debugging information they need to investigate that deletion. We also have a whole programme on top of this with SLAs, so that we can make sure infrastructure issues preventing deletions from moving forward are resolved within an SLA, which in turn allows us to provide an SLA around the completion time of deletions.
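A minimal sketch of that triage over the deletion history, assuming an invented threshold and record shape purely for illustration:

```python
from dataclasses import dataclass
from typing import List, Tuple

IDLE_THRESHOLD_DAYS = 7   # illustrative threshold; the real value is internal


@dataclass
class DeletionStatus:
    deletion_id: str
    days_since_last_progress: float
    persistent_infra_failure: bool = False


def triage(statuses: List[DeletionStatus]) -> Tuple[List[str], List[str]]:
    """Split in-flight deletions into idle ones (just reschedule them) and stuck
    ones (aggregate into units of work and assign to an owner)."""
    idle, stuck = [], []
    for s in statuses:
        if s.persistent_infra_failure:
            stuck.append(s.deletion_id)
        elif s.days_since_last_progress > IDLE_THRESHOLD_DAYS:
            idle.append(s.deletion_id)
    return idle, stuck
```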
Eventual completeness
This is the second half of the guarantee. Eventual completion is about making sure that jobs finish; eventual completeness is about making sure that we deleted all the data, regardless of whether the job has finished or not. Many issues can lead to orphaned data, such as bugs in the deletion logic, race conditions between different systems, or deletion configuration mistakes, and we need to clean that up retroactively. The solution we have is object re-deletion. We know of every object we have deleted since the beginning of time, because we have an auto-increment column that serves as the identifier of those objects. For all those objects, we are able to fetch the associations pointing to or from them and then reapply the deletion logic: if the association is deep, we walk the graph and re-delete everything on the other side; if it is shallow, we just delete the association. We also run migrations on every object type that Facebook holds every other week; they iterate over all of the data that we hold to try to catch orphaned data proactively. It might seem like a massive mess, but not much more than you would expect given the systems that we have; it is just a side effect of having so many different systems that interoperate and that are hard to synchronise.
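Here is an illustrative re-deletion pass; the edge representation and the helpers are hypothetical stand-ins for the real graph APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Association:
    other_id: int
    behavior: str   # "deep" or "shallow"


def fetch_edges_touching(object_id: int) -> List[Association]:
    return []   # stand-in for a graph read of edges to/from an already-deleted object


def run_graph_deletion(object_id: int) -> None:
    print(f"re-deleting the subgraph under {object_id}")


def delete_association(object_id: int, edge: Association) -> None:
    print(f"dropping association {object_id} -> {edge.other_id}")


def redelete(object_id: int) -> None:
    """Reapply the deletion logic to an already-deleted object, cleaning up any
    orphaned data left behind by bugs or race conditions."""
    for edge in fetch_edges_touching(object_id):
        if edge.behavior == "deep":
            run_graph_deletion(edge.other_id)     # re-delete what's on the other side
        else:
            delete_association(object_id, edge)   # shallow: drop the association only
```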
Restorations
If you operate systems that do deletion at a large scale, you are going to have inevitable data losses. You might have a product bug, for instance, where a UI element accidentally schedules a deletion when it shouldn't, or you might have engineers who misconfigure their deletion configuration. To alleviate those losses, we log restoration records before every delete that we issue, and those are a bit different from your usual datastore backup because they are indexed by graph: a graph in those logs is one entity that you can restore, and it spans multiple data stores, so you can restore a graph that has been deleted from several different data stores at the same time. The way it works is that we have a write-ahead log from our deletion execution flow, and a tailer responsible for aggregating the entries into bigger payloads so that writing them to the store makes sense economically; we just cannot write them directly when they are far too small. This leads to an interesting issue: we now hold backups of deleted data, but we want to be very strict about how long we retain user data. If we hold backups in a data store, that data store might have backups of its own, and then you have a weird lineage problem to solve. The way we solve it is with an encryption service that provides us with a daily rotating encryption key, which only we have access to, and we encrypt all of our restoration data using the key of the day. That key has a TTL, so that once the key expires, deletion by encryption happens: whoever might have copied the data, or logged it in a backup table or somewhere else, would not be able to read it. That is how we enforce that user data is deleted after some amount of time.
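A conceptual sketch of deletion by encryption for restoration logs, using the cryptography library's Fernet as a stand-in for the internal key service; the per-day key rotation shown here is an assumption about shape, not the real implementation.

```python
from datetime import date
from cryptography.fernet import Fernet   # third-party stand-in for the internal key service

DAILY_KEYS = {}   # stand-in for the encryption key service


def key_of_the_day(day: date) -> bytes:
    # The real service mints one key per day with a TTL; once that key expires
    # and is destroyed, every payload encrypted with it becomes unreadable.
    return DAILY_KEYS.setdefault(day, Fernet.generate_key())


def encrypt_restoration_payload(payload: bytes, day: date) -> bytes:
    return Fernet(key_of_the_day(day)).encrypt(payload)


blob = encrypt_restoration_payload(b"serialized deleted subgraph", date.today())
```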
We also have ways of preventing data loss in the first place. None of them is foolproof, but we do static analysis on the deletion graph, we do dynamic checks at runtime based on the deletion constraints, and we run predictions on edges' deletion behaviours to try to find misconfigurations that might otherwise have gone unnoticed.
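A minimal sketch of what a runtime constraint check could look like; the constraint strings and function signature are illustrative assumptions, reusing the constraints from the earlier schema example.

```python
def check_constraint(schema_constraint: str, is_deep_deletion: bool,
                     deleted_by_type: str) -> None:
    """Refuse a delete that would violate the schema's deletion constraint."""
    if schema_constraint == "never_deep_deleted" and is_deep_deletion:
        raise RuntimeError("this object type may only be a top-level deletion root")
    if schema_constraint.startswith("deleted_by:"):
        allowed = schema_constraint.split(":", 1)[1]
        if is_deep_deletion and deleted_by_type != allowed:
            raise RuntimeError(f"only {allowed} may deep-delete this object type")


# A Comment reached through a deep edge from a Post is allowed:
check_constraint("deleted_by:Post", is_deep_deletion=True, deleted_by_type="Post")
```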
Monitoring guarantees
We need to be able to prove to our users and to assessors that we are actually meeting the bar. Below are all the metrics we monitor:
We have a happy path and we need to measure its reliability, but that is not enough: we also need to measure how much falls through the gaps, in order to build safety nets that automatically remediate it. This in turn means we need to measure the safety nets' reliability and, if possible, have more safety nets behind them. Bonus points if your way of measuring success is completely orthogonal to your system, and if you have multiple layers of safety nets. In conclusion, deletion is a really hard problem, the happy path is not enough, and we need to make it easier for developers to do the right thing. That is a learning that also exists in security, where you can train your engineers as much as you want not to introduce security flaws, but it is still going to happen. We need to embed privacy at the infrastructure layer, the same way we embed security at the infrastructure layer, so that when engineers use that infrastructure, they create, by default, products that are privacy-aware or security-aware.
Q&A Section
Q: Are we talking about hard deletions or are we talking about deletion for a product from a production system?
A: This is hard deletion, and it covers both user-generated traffic (people pressing Delete on the platform) and other production systems that might want to deprecate entire data types and delete the graph associated with them.
Q: What granularity do you have on deletion? Some regulations obligate companies to keep some of the data: when a user asks you to delete their data, you have to delete some, but you may also have to keep some, depending on the country or the user. How do you handle those differences?
A: We have an internal policy, which is a document written jointly by policy, legal, and engineering, and it defines the standard that the company is going to meet. The lawyers are responsible for making sure that standard satisfies every regulation. That way, engineers don't need to understand the regulations; they just need to meet one standard.