Santiago Ramón y Cajal, Glial cells of the mouse spinal cord, 1899. Ink and pencil on paper, 5 7/8 x 7 1/8 in. Cajal Institute (CSIC), Madrid
A VC recently told me that everyone’s looking for an “OpenAI for Bio.” At first, I found this obvious: of course they are! However, the more I thought about the problem the more interesting I found it. The operative question: could an OpenAI for biology even exist? And if so, what will it look like?
I implicitly argued in my last post that an OpenAI for bio (let’s call it BioAI) can’t exist because of the domain-specificity of biology. I said that collaboration between domain-expert biologists and computationalists is key to the best new innovations we’ve seen (I still agree with this), and that this innovation cycle cannot be replaced by large AI models (I might now disagree with this!). The relevant comment from that post:
(As a side note, this also makes it harder for non-domain-specific AI models to eat biologists because there’s an irreproducible element of human ingenuity and spark to this biological innovation, aided by years of biological training. While LLMs can “read” every paper ever, biologists can tell you which papers are BS, and more importantly create new technologies and ideas undergirded by that intuition.)
But this remark from an experienced investor I admire made me reconsider things! I questioned my base assumption that domain specificity is impossible to overcome, and actually asked: what would it take to overcome domain specificity? In this post (which I might extend into a series), I’ll try to answer that — starting with what I think BioAI would actually look like.
The biggest challenge in biology: domain specificity
I propose that BioAI will be an (almost certainly AI-driven) company making tools/products that generalize across many different types of biological work. If it's a tool, it'll automate a lot of work that biologists/computationalists do. If it's a more general product, drug, or piece of research, it'll be applicable to many diseases/biological problems.
First, a quick question we need to answer: what is OpenAI? Or rather, what makes it special? One huge and obvious differentiator of OpenAI is its generalized impact: whereas a lot of prior AI applications were very domain-specific, OpenAI’s tools are useful across many domains (and historically, the fact that our strongest neural net architectures came from the ImageNet competition, a generalization test, speaks to this).
OpenAI has created tools that convert simple prompts to image and language. There’s a lot of (relatively) high quality image and language training data on the internet across a variety of fields, and so diffusion models and LLMs can be trained reasonably well. "Image" and "language" are the most general forms of expression you could imagine, applicable to every type of work and play, and so diffusion models and LLMs are useful for a variety of tasks.
Let’s envision a similarly generally-useful BioAI. It's hard to think of types of biological work that are as general as “image” or “language”. Obviously we have a lot of image and language data from different biological fields, but their analysis and interpretation are very different depending on your field; they're domain-specific. And experimental biological data, which we can consider a production unit of biological work, comes in formats much more complicated than just "image" or "language." The barrier to generating new biological data is also much higher than for generating other types of data, and even then it might not be high-quality data (see: the replication crisis).
As a result, it’s unclear how to straightforwardly train a BioAI over lots of different kinds of biological data, and how to apply BioAI to many different kinds of biological work. I’ll propose three potential paths to BioAI below.
BioAI is a generalizable set of tools
To me, it's most natural to think of BioAI as a suite of tools that generalizes across a lot of different fields (though this might be the natural epistemic consequence of working in a technology development lab, lol). In biology, these tools that everyone can use across a variety of different fields tend to be very powerful, and help us solve big, complex, heterogeneous biological problems like aging.
One of my favorite writings about this is Laura Deming's "Sequencing is the new microscope," which discusses the importance of toolmaking to science. Deming says the best scientists made "new experimental apparatuses to answer the questions that came to mind." When these apparatuses are very useful across many different fields, that's extremely powerful.
So, whence come the tools? Three ideas:
Multimodal assays, accompanying predictive (and generative) software. I talked about these quite a bit in my last post, but that's because I think they're so promising! I've previously discussed companies building predictive software to characterize cell state (insitro, Spring Discovery, and probably like 10 more that have sprung up since then). The biological component is also very important: creative multimodal biological assays will give us more perfect views of the central dogma and allow us to determine how and when disease sets in. I think new computational-biological tech that is able to seamlessly integrate different modes of data (imaging, transcriptomics, genomics, proteomics, timescale) could be our BioAI — giving us insights across many different diseases and biological states.
Examples of software building the groundwork for this: DNA Diffusion (generative model for gene sequence, linking genomics + epigenomics + transcriptomics + phenotype), ProGen (generative protein model for raw amino acid sequence, could be supercharged if paired with e.g. epigenomic data), MegaMap (predictive drug discovery tool, imaging + proteomics)
Examples of biological assays building the groundwork for this: in-situ genome sequencing (imaging + genomics), SHARE-seq (ATAC + RNA), CITE-seq (transcriptomics + proteomics), Perturb-seq (genetic screening + transcriptomics)
A lot of these papers were published just a few years ago (in fact, single-cell multiomics methods were Nature Methods' 2019 Method of the Year), so I think that as these techniques get refined, produce more data/research output, and motivate more computational tool creation, we will see more and more biotech/techbio companies leverage them for drug discovery/software creation.
In silico simulations. If you had asked me in the early 2010s what BioAI would be, I probably would have said something about in silico molecular/atomic simulations, like those developed by the Blue Brain project and D.E. Shaw Research. I think these are even more interesting now because we finally have the computational power and techniques to actually create and leverage these simulations. Unlike some other simulation projects, D.E. Shaw Research has never been "overly ambitious" because they have always focused a lot of energy on creating custom supercomputers that can perform these precise simulations. Now, many others will be able to make a big dent in this space. The BioAI potential is high because these simulations will ideally generalize across many different types of biological experiments (and probably be easier to automate than real-life labs).
Software aiding biologists and computationalists. BioAI could be a superpowered ELN/lab manager. Companies like Latch, Sphinx, and of course Benchling are doing interesting work here and could work in tandem with innovative lab automation hardware in transformative ways (obvious, classic end scenario: automated experimentation, logging, and analysis; basically, all biological work). I imagine that a BioAI in this space would function most similarly to existing consumer AI tools (ChatGPT, G-Suite/MS Office, Copilot) by automating away entire pipelines of work: you provide the cells and an experiment prompt, and get the result in 2-3 hours rather than 2-3 business days. Of course, the idea of “automating all biological work” is very lofty. But I believe the most likely path there starts with AI-powering the infrastructures already undergirding our biological work.
Agree? Disagree? Find this all painfully obvious? I'd love to get your thoughts — as usual, feel free to comment here or email me at amu.garimella@gmail.com. I look forward to all the conversations and developments ahead!