Dr. Denis Bauer is the head of computational bioinformatics at Australia’s government research agency, CSIRO. Denis is an internationally recognized expert in artificial intelligence and is passionate about improving health and understanding the secrets of our genome using cloud computing. Denis is an AWS Data Hero, and she is determined to bridge the gap between academia and industry. She is part of the eHealth Research Centre, which is unique in that it spans an entire value chain, from basic science all the way to bringing health technologies and services into clinical practice. This article by Dr. Denis Bauer is about cloud-based bioinformatics, or how APIs enable global collaborations and accelerate health and medical research.
Let me tell you three stories, and each one of those stories has its own API element to it.
The first one is around understanding the genome and using genome data. The second story is about how we can manipulate the genome – taking it from understanding to doing something with actionable insights from the healthcare system, therefore powering new therapeutics applications. And the last one is around how we need to modernize the way we work together and how APIs specifically help build things that are larger than the sum of their parts.
Everyone has mutations in their genome that should inform clinical care. The genome is the blueprint that defines how our body is shaped and what future disease risks we carry. It also encodes how we react to certain drugs, which means that, in principle, we should not have to experience adverse drug reactions ourselves to learn about them: reading the genome and interpreting it the right way should guide which drugs we should be using.
Similarly, with cancer therapy, how a cancer has evolved and what it is susceptible to is encoded in the genome. With this wealth of information, it is not surprising that the genome is used more and more in clinical practice, creating so-called mega-biobanks, where the genomes of many individuals are brought together into one massive data resource.
With 3 billion letters in each of our genomes, this resource is astounding. The data has become too large to be moved around, so the analytics needs to go to the data. This is an entirely new concept, and I would argue it is underpinned, or powered, by the clever use of APIs.
In this case, we have massive data that cannot move, and the analysis needs to be brokered by APIs. These APIs not only broker access to the raw data; they can also add clever capabilities on top, such as managing access privileges. For example, they can partition the data and serve out smaller parts, so that these massive chunks of data can be analyzed on the fly.
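To make the idea of "serving out smaller parts" concrete, here is a minimal sketch of how an API layer can hand an analysis just the slice of a large genomic file it needs right now, using an HTTP range read against object storage. The bucket name, key, and offsets are illustrative placeholders, not real resources.

```python
# Instead of downloading a multi-terabyte object, ask S3 for only the byte
# range the analysis needs at this moment (bucket/key/offsets are made up).
import boto3

s3 = boto3.client("s3")

def fetch_chunk(bucket: str, key: str, start: int, length: int) -> bytes:
    """Return `length` bytes of the object starting at byte `start`."""
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={start}-{start + length - 1}",  # HTTP Range semantics
    )
    return resp["Body"].read()

# Example: pull a 1 MB window from a hypothetical cohort file.
chunk = fetch_chunk("example-genomics-bucket", "cohort/chr1.vcf.gz", 0, 1_048_576)
print(len(chunk), "bytes fetched")
```

The same pattern generalizes: the broker in front of the data can enforce access privileges and decide which partitions a given analysis is allowed to see.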
We have developed VariantSpark, a machine learning library for analyzing the human genome, which runs on an Elastic MapReduce (EMR) cluster on AWS infrastructure. That cluster is the core, but it interacts with the data, which sits in an S3 bucket, through an API. It serves out the small chunks of the genome that the machine learning method needs to read when required, rather than holding all of this information in memory. The architecture is fronted by Jupyter Notebooks, so that researchers can go in, do visual analytics, and take more data-driven approaches.

We are using an EMR cluster, which is a Spark cluster, because traditional high-performance compute systems cannot handle these massive amounts of data. Compare this with the Bureau of Meteorology doing weather forecasts: they break the atmosphere into weather cells, each cell can be computed in its own bucket on a traditional high-performance compute cluster, and no information needs to pass between nodes. In genomics, where the data is so massive and every location in the genome matters for making risk predictions, it is vital that the data can flow freely between nodes, essentially dissolving the boundaries between nodes and using all the CPUs available in the cluster. This is where Spark comes in with its distributed computing paradigm: it is a data-driven approach rather than a compute-intensive one. We developed VariantSpark around this notion of distributed computing. It can process today’s genomic data 3.6 times faster than typical solutions in this space, and we have shown that it can scale to tomorrow’s data, on the order of 1 trillion genomic data points, in 15 hours rather than 100,000 years.
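As an illustrative stand-in for this pattern, the sketch below shows a Spark job (such as one running on EMR) reading genotype features directly from S3 and fitting a distributed random forest to rank variants by importance. The S3 path and column names are hypothetical, and VariantSpark itself uses its own wide random-forest implementation rather than Spark MLlib; this only demonstrates the distributed, data-driven shape of the workload.

```python
# Sketch: distributed variant-importance ranking with Spark, reading from S3.
# Paths, column names, and the MLlib random forest are illustrative stand-ins.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("variant-importance-sketch").getOrCreate()

# Each row: one sample; columns: phenotype label + genotype calls (0/1/2) per variant.
df = spark.read.parquet("s3://example-genomics-bucket/cohort/genotypes.parquet")

variant_cols = [c for c in df.columns if c.startswith("var_")]
assembled = VectorAssembler(inputCols=variant_cols, outputCol="features").transform(df)

rf = RandomForestClassifier(labelCol="phenotype", featuresCol="features", numTrees=200)
model = rf.fit(assembled)

# Rank variants by how much they contribute to the case/control split.
importances = sorted(
    zip(variant_cols, model.featureImportances.toArray()),
    key=lambda kv: kv[1],
    reverse=True,
)
print(importances[:10])
```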
Bringing this incredible computing power to the biological space, we looked at cardiovascular disease, the number one killer in Australia and worldwide. We looked at case-control studies: people who do not have cardiovascular disease and those suffering from it. The question was whether we can find something specific in the genomes of the people who suffer from it that we can then pinpoint and use for biomarkers and future risk prediction. Here we have a 50,000-person case-control cohort, and we used VariantSpark to identify individual disease risks. With this approach, we know there is a genomic component to cardiovascular disease.
The surprising thing was how different parts of the genome interact with each other. For example, there are elements that modulate the risk and make it personal, so that someone with a specific misspelling, a specific mutation in the genome, is more at risk. With this kind of personalized risk prediction, the clinical care we can give to individuals going forward will be much better. This is powered by understanding the genome, reading it, and building these complex machine learning models.
We are also applying it outside the human genomics space, particularly to infectious disease and COVID. Here again, the virus is mutating. It changes its blueprint, particularly once it jumps to a different host. We know it originated from bats, that there was some intermediary, and that it then jumped to humans. As it moves from human to human and adapts to its new host, it picks up mutations that make it a better fit for the human environment. In doing so, it can pick up changes that potentially make it more infectious, as we saw with the Delta variant, or mutations that make it more pathogenic, causing more severe disease. Understanding which of those mutations can cause such changes in clinical outcome is therefore critical. We analyzed genomic data of the virus from all around the world: around 5,000 samples in total, including 2,000 from people with severe disease and 2,000 from people with mild disease or no symptoms at all. We used VariantSpark to identify which of those changes are clinically relevant. This is important for understanding what kinds of mutations we need to look out for in the future, say at border screenings or in vaccine development.
The message is that out of this sheer volume of data, only 5,000 individuals in the data set were annotated in a way we could use. Why is that? When COVID started, right at the beginning, the largest database worldwide collecting this kind of data had a field called “patient status”, meant to record how the patient carrying that particular virus strain was doing. Back then, hardly anyone filled in that field. In October it was made mandatory, so people had to enter something, but it was free text, and people typically entered “unknown” because they genuinely did not know. This has not changed since. So there is a massive amount of information being put in there, but it is not very useful, because it is free text and people can put whatever they want into it.
We can argue that having an API that brokers between these data silos, translating between free text and clinical annotation, is critical. That is precisely what we have done. We partnered with GISAID and proposed a way of capturing this data in a more structured form. This uses FHIR, a standard for exchanging clinical information, to capture the specific information relevant to COVID: for example, vaccine status, background, or what kind of symptoms people had. People can still type in free text, which is then translated on the fly into clinical terminology, where each term has a parent term and is linked within a whole medical dictionary, making it far more useful. Hopefully, from now on, out of the 3 million samples collected from around the world, we can salvage far more than just 5,000 to understand how the genome influences the outcome of the disease. Ontoserver is a one-stop shop for advanced terminology services: it is FHIR-based, enables the syndication of clinical terminologies, supports advanced use of SNOMED CT, and underpins the UK’s and Australia’s national health systems.
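As a minimal sketch of this "translate free text on the fly" idea, the snippet below matches a free-text patient-status string against SNOMED CT using the standard FHIR terminology operation ValueSet/$expand with a text filter. The server URL is a placeholder; any FHIR terminology server hosting SNOMED CT (Ontoserver is one such server) could answer this kind of query.

```python
# Map a free-text entry (e.g. "asymptomatic") to candidate SNOMED CT codes via
# the standard FHIR ValueSet/$expand operation. The base URL is a placeholder.
import requests

FHIR_BASE = "https://example-terminology-server/fhir"  # placeholder endpoint

def free_text_to_codes(text: str, max_hits: int = 5):
    resp = requests.get(
        f"{FHIR_BASE}/ValueSet/$expand",
        params={
            "url": "http://snomed.info/sct?fhir_vs",  # implicit "all of SNOMED CT" value set
            "filter": text,
            "count": max_hits,
        },
        timeout=30,
    )
    resp.raise_for_status()
    contains = resp.json().get("expansion", {}).get("contains", [])
    return [(c["code"], c["display"]) for c in contains]

print(free_text_to_codes("asymptomatic"))
```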
Frost & Sullivan predicted that by 2030, 50% of the world’s population will have been sequenced. This is expected to create more data in genomics than the traditional Big Data disciplines, YouTube, Twitter, and astronomy, combined. To handle this kind of workload, you cannot simply wait longer in the medical space, because the patient needs to be treated promptly; the only thing you can do is throw more computing at it. And that is what we are trying to enable using serverless, or more cloud-native, solutions.
We are all familiar with desktop computing. The focus is on having complete control over what you are doing and what programs you install, and the flexibility to do whatever you want with the system. It is also cost-effective. But you only have that one system. It is like owning a car: you are responsible for it, you need to bring it in for service and look after it. If you want to be more flexible or more scalable, with multiple cars or different kinds of cars, and you do not want to look after them yourself, you could hire a chauffeur who brings the car in for service or swaps it for a bigger car when you need one. The cloud equivalent is using auto-scaling groups. But like a chauffeur, auto-scaling groups come with a price tag; they are not a cheap option. And once you scale up to massive resources, they do not simply go away when you no longer need them; it takes time to scale down. Similarly, you cannot just snap your fingers and have a larger infrastructure.
Serverless fills this gap, where the focus is on agility. You get the control, flexibility, and scalability without the overhead, and it does not come with the same price tag, because the systems come up and go away spontaneously. It is like a ride-sharing app, where you can request a car right here and now, and it goes away when you no longer need it.
With this kind of flexibility, we went into the genomic data sharing space and tried to reinvent that paradigm. We developed a mechanism where clinicians can query the large amount of data available around the world and ask, for a specific location in the genome, whether a specific genotype, a specific letter, has been seen there. The system comes back with a yes or no, or a frequency.
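The sketch below shows what such a clinician-side query can look like, modeled on the GA4GH Beacon v1 style of parameters ("has any dataset seen this letter at this position?"). The endpoint URL and the coordinates are placeholders, not a live service.

```python
# Sketch of a Beacon-style yes/no query: does any shared dataset contain this
# specific letter at this genomic position? Endpoint and position are made up.
import requests

BEACON_URL = "https://example-beacon/query"  # placeholder

params = {
    "assemblyId": "GRCh38",
    "referenceName": "1",      # chromosome
    "start": 55516888,         # illustrative 0-based position
    "referenceBases": "G",
    "alternateBases": "A",     # the specific letter we are asking about
}

resp = requests.get(BEACON_URL, params=params, timeout=30)
resp.raise_for_status()
print("Variant seen:", resp.json().get("exists"))   # yes/no style answer
```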
This is important for rare disease research, where a condition might have multiple potential mutations, misspellings in the genome, that could cause the disease, and each of them would have a different treatment outcome. The clinician needs to confirm, for this particular patient, which one is the culprit causing the disease. The clinician does that by checking whether a given mutation is present in the other data sets around the world: if it is, chances are that this mutation is not severe enough to cause such a serious genetic disease, so it can be ruled out. By doing that, the clinician can narrow down to the actual culprit and then develop the treatment strategy that works best for the patient. To do that, we are using serverless technology. We have multiple Lambda functions that work together, which is the core of serverless, or functional compute; they pull data from an S3 bucket and serve it out to the individual clinician through an API, namely the API Gateway.
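To make the serverless side concrete, here is a minimal sketch of a single Lambda handler sitting behind API Gateway: it reads a small, pre-computed variant index from S3 and answers whether the queried variant has been observed. The bucket, key, and index layout are assumptions for illustration; the production serverless Beacon splits the work across several Lambda functions.

```python
# Sketch of one Lambda behind API Gateway: load a pre-computed variant index
# from S3 and answer yes/no. Bucket, key, and index format are illustrative.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-beacon-bucket"           # placeholder
INDEX_KEY = "index/variants_chr1.json"     # placeholder, e.g. {"55516888:G>A": 12, ...}

def handler(event, context):
    q = event.get("queryStringParameters") or {}
    variant_id = f'{q.get("start")}:{q.get("referenceBases")}>{q.get("alternateBases")}'

    body = s3.get_object(Bucket=BUCKET, Key=INDEX_KEY)["Body"].read()
    counts = json.loads(body)

    return {
        "statusCode": 200,
        "body": json.dumps({
            "exists": variant_id in counts,
            "observationCount": counts.get(variant_id, 0),
        }),
    }
```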
We were able to reduce the cost of sharing genomic data roughly 300-fold: rather than paying $4,000 a month, we brought that down to $15. For less than a cup of coffee per day, researchers or consortia around the world can contribute their valuable information to the clinicians, and ultimately the patients, who need it most, to help narrow down what disease they have. I hope that with this approach, more organizations worldwide are inspired to share their data through this privacy-preserving method that is cheap and easy to set up.
The core message here is that an API empowers this brokering from one analytics system to another: the API of the serverless system speaks to an API on the clinician’s side to match that information with the patient outcome. We have the ambitious goal of making this work for population-scale data sets. Take the population of the US, roughly 350 million people, each contributing a genome of 3 billion letters: 350 million times 3 billion is about 10^18, a data source of one quintillion data points. We are confident that our Beacon protocol, with the APIs being developed, can handle this in real time.
This brings me to the second story, about manipulating the genome. Genome writing is typically done with genome engineering processes like CRISPR. Jennifer Doudna, who together with Emmanuelle Charpentier received the Nobel Prize last year for their work on CRISPR, said that the world around us is being revolutionized by CRISPR, whether we are ready for it or not. I think this is a very telling quote because, in a very short period, this ability to edit the genome of a living cell has taken over, or enabled, a lot of disciplines –
- Designing new model organisms in medical research, understanding the disease course, and the interplay between the genome and the clinical outcome.
- Doing large-scale screens to identify the functional consequences of individual mutations in particular organisms.
- Biosecurity approaches where it’s helping to keep invasive species at bay or prevent malaria from spreading.
- Gene therapy, where genetic diseases might be cured one day. Cancer could be treated in a more tailored, more efficient, and less destructive way to the rest of the organism.
So CRISPR is like a mini postman whose core aim is to go to a specific location in the genome, like a specific address, and deliver some mechanism there: cutting the genome, inserting something into it, changing it, or repairing it. Like a postman, CRISPR needs a specific, unique address. It is a huge task to come up with an address that occurs at only one location, rather than having similar addresses elsewhere in the genome that cause the machinery to accidentally go to other places and destroy genes rather than repair them. So there need to be new methodologies to make this process safe and to deliver securely to that particular location.
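As a toy illustration of why uniqueness is hard, the snippet below counts how many places in a genome string a 20-letter target sequence matches exactly or with a few mismatches. Real guide-design tools (GT-scan among them) use much richer scoring across the full 3-billion-letter genome; this only shows that "is this address unique enough?" is fundamentally a search problem.

```python
# Toy "unique address" check: count exact and near matches of a 20-letter
# CRISPR target in a genome string. Sequences below are made up.
def count_near_matches(genome: str, target: str, max_mismatches: int = 2) -> int:
    k = len(target)
    hits = 0
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        mismatches = sum(1 for a, b in zip(window, target) if a != b)
        if mismatches <= max_mismatches:
            hits += 1
    return hits

# A real genome has ~3 billion letters, which is why this search needs
# indexed, distributed compute rather than a simple scan.
genome = "ACGTACGTTTGACCGTAGCGGATCCGTACGTTTGACCGTAGCGAATCC"
target = "ACGTTTGACCGTAGCGGATC"
print(count_near_matches(genome, target, max_mismatches=1))
```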
Together with the Children’s Medical Research Institute in Sydney, we are working on exactly that. The task is to cure a specific genetic liver disease that affects children: first to bring the genome editing machinery specifically to the liver, and then, within the liver cells, to the particular location where the misspelling needs to be corrected. This is like landing a rocket somewhere in the galaxy. Think about it: our genome has 3 billion letters and, stretched out, is about two meters long; there are trillions of cells in our body; taken together, that space is larger than our current galaxy. For this molecular machinery, going to a specific location in a specific tissue is therefore a massive and compute-intensive task.
Getting to a point where gene therapy is part of the clinical routine is still a long way off. We need to know which location and which tissue is affected. We need to package the machinery up so that it goes only to that particular tissue, and then only to a specific location in the genome of those cells. It also needs to be safe, so that different patients with different genomic profiles can use the same drug: there are about 2 million differences between one person and the next on average, and each of those differences can change how the CRISPR mechanism interacts with the genome. And once all of this is worked out in theory, it needs to go into clinical trials to show that the theory works in practice. We focus on multiple areas to make this approach safer, better, and more efficient, and we are doing it by using APIs.
Specifically, our system is designed to be serverless, which makes it modular. Each Lambda function can be exchanged for a new one that is specific to a tissue or to new information as it emerges. We can easily swap these modules through the APIs we developed, so the whole system does not have to be reinvented every time new information comes out. With this concept, we developed GT-scan, which we think of as a search engine for the genome: researchers type in the gene they want to edit and how they want to edit it, and it comes back with the best, safest, and most efficient strategy for doing so. GT-scan was among the first serverless applications in this space, back in 2016, and it showed that this cloud-native, serverless concept can be used in complex workflows enabling genomic research and even clinical practice.
The API of GT-scan underpins this flexibility: it can be accessed from the outside to answer questions even more complex than “can you find the perfect editing spot?”. For example, you can hit the API with the question, “Can you find target sites for the editing machinery that are specific to the heart but do not affect the liver or the rest of the organism?” With GT-scan, you can write a Jupyter notebook that hits the API with precisely that question and then does follow-on analysis. We have also extended this to COVID, and to infectious disease at large, where you can ask it to help develop a diagnostic platform. Rather than editing or changing something, CRISPR can also be used simply to bind; if the location it binds to is unique to a particular organism, that binding becomes a diagnostic that can differentiate, for example, one strain of COVID from another. With GT-scan, we can do exactly that: differentiate between the “harmless” background strains and the potentially more pathogenic or more infectious ones.
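The sketch below shows the general shape of driving such a target-search API from a Jupyter notebook. The endpoint, parameter names, response fields, and gene symbol are hypothetical stand-ins (the real GT-scan API has its own schema); the point is that once the search engine sits behind an API, follow-on analysis such as tissue-specific filtering becomes just a few lines of code.

```python
# Hedged sketch: query a (hypothetical) target-search API from a notebook and
# post-filter the candidates. Endpoint, parameters, and fields are assumptions.
import requests

API_URL = "https://example-gtscan-api/targets"   # placeholder endpoint

def candidate_targets(gene: str, genome: str = "GRCh38"):
    """Ask the hypothetical API for candidate CRISPR target sites in a gene."""
    resp = requests.get(API_URL, params={"gene": gene, "genome": genome}, timeout=60)
    resp.raise_for_status()
    return resp.json()["targets"]   # assumed shape: list of {"sequence", "score", ...}

# Follow-on analysis: keep only high-scoring sites before checking them
# against, say, expression data for the tissue of interest.
sites = candidate_targets("PAH")   # illustrative gene symbol
good_sites = [s for s in sites if s.get("score", 0) > 0.8]
print(f"{len(good_sites)} high-confidence candidate sites")
```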
This brings me to the last story, about how to modernize collaborations. I would argue that the cloud has made talent and solutions much more accessible.
Individual solutions can be brought in to boost the pipeline, workflow, or architecture that a company already has. With that concept as inspiration, we brought VariantSpark, our genomic analysis toolkit, to the AWS Marketplace. It was the first digital health product from a public sector organization on the Marketplace, in 2019. This enables not only the distribution of an academic solution into the commercial world; for us, it also enables the holy grail of reproducible research.

In the medical space, your findings are only as good as someone else’s ability to replicate them: whatever disease gene you find, someone else needs to replicate the result before it becomes accepted knowledge. Replicating it is typically quite complex, because the other person has to install the software and run it in a similar way, with the same workflow, the same architecture, and the same libraries, which is typically a nightmare. That variability is completely removed with the Marketplace, because you can spin the solution up on the same hardware, under the same conditions, and in the way the developer intended. You can therefore replicate, which means running the same analysis on the original data set, and you can also reproduce, which means running it on a slightly different data set to see whether the findings identified in one data set still hold on a verification data set. Bringing our academic research to the Marketplace thus enables reproducible research. Everything is packaged up in one file, the actual algorithm, the workflow, the security setup, and when you subscribe to it, that exact architecture spins up automatically, in the same way and in a security-hardened fashion, through the cloud provider. I would argue that this kind of marketplace is itself a sort of API, because it enables one computer to talk to another computer.
The three things to remember are –
- Bringing genomic data into clinical practice requires APIs; it is built on APIs. Access to the data needs to be brokered with APIs, as does bringing the analytics to the data. To that end, we developed VariantSpark, a machine learning method for finding new disease genes, and our serverless Beacon protocol, which allows genomic data to be shared worldwide using the API Gateway, or APIs in general.
- Exploring new treatment avenues – We developed GT-scan, which allows us to explore new treatment avenues through gene therapy. It makes genomic editing safer and faster.
- Enabling collaboration – Collaboration between academia and industry, and between individual research, academic, and industry groups, has grown. There is demand for APIs that talk to each other seamlessly, because we need to build solutions that are larger than the sum of their parts.