Imagine a tool that can read and write the genetic code of any living organism. Say hi to Evo 2, which scientists are calling the biggest AI model yet built for biology. Evo 2 is a generative AI model, similar to ChatGPT but designed for DNA.
What is Evo 2?
Evo 2 is an advanced AI model trained on the DNA sequences of 128,000 genomes, which allows it to recognize patterns in genetic sequences. It was developed by researchers from Stanford University, UC Berkeley, UC San Francisco, Arc Institute, and NVIDIA.
You can say it is like the ChatGPT of genetic data. Unlike traditional language models that process words, Evo 2 is trained on DNA sequences from various organisms.
For more context, you provide ChatGPT with a text prompt, and the model autocompletes the sentence based on learned patterns. Evo 2 applies this concept to DNA. To design a new gene, you start with a sequence of base pairs, and Evo 2 autocompletes the gene.
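To make the autocomplete idea concrete, here is a minimal sketch in Python. It is not Evo 2 and does not use its code or API: it swaps in a tiny k-mer counting model as a stand-in for the real neural network, and the training sequences, function names, and parameters are all made up for illustration. It only shows the general loop of extending a DNA prompt one base at a time based on learned patterns.

```python
import random
from collections import Counter, defaultdict

def train_kmer_model(sequences, k=3):
    """Count which base follows each k-base context in the training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

def autocomplete(model, prompt, n_bases, k=3, seed=0):
    """Extend `prompt` one base at a time by sampling from the learned counts."""
    rng = random.Random(seed)
    seq = prompt
    for _ in range(n_bases):
        followers = model.get(seq[-k:])
        if followers:
            # Sample the next base in proportion to how often it followed this context.
            bases, weights = zip(*sorted(followers.items()))
            seq += rng.choices(bases, weights=weights)[0]
        else:
            # Unseen context: fall back to a uniform choice over the four bases.
            seq += rng.choice("ACGT")
    return seq

# Toy "training set" (made-up sequences, not real genes).
training = ["ATGGCCATTGTAATGGGCCGC", "ATGGCCGCTGAAAGGGTGCCC"]
model = train_kmer_model(training)

# Start with a short prompt and let the toy model fill in 12 more bases.
completed = autocomplete(model, "ATGGCC", n_bases=12)
print(completed)
```

The real model replaces the counting table with a network that has billions of parameters and looks at a far longer stretch of preceding DNA, but the prompt-then-extend loop is the same basic idea.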
With all this training, it can now predict how changes in DNA might affect an organism and generate complete chromosomes.
It was trained on a dataset of 9.3 trillion DNA base pairs, using NVIDIA’s DGX Cloud platform. The dataset includes bacteria, animals, plants, and even extinct species. For comparison, Evo 2 was trained on 30 times more data than Evo 1.
“We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution.”
The team of researchers has developed two versions, one with 7 billion parameters and the other with 40 billion parameters.
You can learn more about the “ChatGPT for DNA” from Stanford’s scientists here:
The model uses a new architecture called StripedHyena 2, which enabled training that was “nearly three times faster than optimized transformer models,” according to Dave Burke, PhD, CTO at Arc Institute.
Potential Use Cases of Evo 2
One of its key abilities is predicting how mutations in DNA might affect living organisms. For example, it was able to accurately identify harmful mutations in the BRCA1 gene, which is linked to breast cancer, without requiring additional training. Evo 2 achieved more than 90% accuracy in predicting which mutations are benign and which are disease-causing.
Other potential applications include:
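One general way a sequence model can flag a mutation without extra training is to ask whether the mutated sequence looks less likely to the model than the original. The sketch below illustrates that idea with a toy counting model in Python. It is an assumption for illustration only: the article does not say exactly how Evo 2 scores mutations, and the sequences, function names, and smoothing choice here are hypothetical.

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"

def train(sequences, k=2):
    """Count which base follows each k-base context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

def log_likelihood(counts, seq, k=2):
    """Sum of log P(base | previous k bases), with add-one smoothing."""
    total = 0.0
    for i in range(k, len(seq)):
        followers = counts.get(seq[i - k:i], Counter())
        p = (followers[seq[i]] + 1) / (sum(followers.values()) + len(BASES))
        total += math.log(p)
    return total

def mutation_score(counts, reference, position, new_base, k=2):
    """Log-likelihood of the mutated sequence minus that of the reference.

    A strongly negative score means the toy model finds the mutation unusual.
    """
    mutated = reference[:position] + new_base + reference[position + 1:]
    return log_likelihood(counts, mutated, k) - log_likelihood(counts, reference, k)

# Made-up training data and reference sequence (not real BRCA1 DNA).
training = ["ATGGCCATTGTAATGGGCCGC"] * 3
counts = train(training)
reference = "ATGGCCATTG"

# Swap the base at index 4 (C) for a T and compare likelihoods.
score = mutation_score(counts, reference, position=4, new_base="T")
print(f"{score:.3f}")
```

In this toy setup, a substitution that breaks patterns seen in training gets a negative score, while substituting a base for itself scores exactly zero. Turning such scores into a benign or disease-causing call would need a threshold or further analysis, which this sketch leaves out.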
- Healthcare: With the ability to predict genetic mutations, it can assist in identifying disease-causing genes.
- Agriculture: It can help in designing disease-resistant crops for better yields.
- Synthetic Biology: It can be used to create entirely new organisms or biological systems with desired properties.
Tests show that it can independently recognize various biological traits and generate complete mitochondrial genomes, prokaryotic genomes, and eukaryotic chromosomes that match the length and complexity of those found in nature.
Takeaways
Not only is Evo 2 the biggest AI model for biology, its researchers have also made it publicly available. It is open-source, which means that scientists from anywhere in the world can download the data, parameters, and software code to use for their research.
Overall, this new model can pave the way for significant breakthroughs in medicine and agriculture. The team has also taken steps to ensure it cannot be used to design harmful viruses, and has tested it to check that it treats genetic information from all human populations fairly.
A lot is happening in AI for biology. We have already talked about the Open Medical-LLM Leaderboard, which ranks medical LLMs based on their knowledge and question-answering capabilities.