A new artificial intelligence (AI) model can write new proteins from scratch that do not occur in nature. Like ChatGPT, it generates new text by predicting the word most likely to follow in a sequence.
The researchers used the new ESM3 model to create a new fluorescent protein that shares only 58% of its sequence with naturally occurring fluorescent proteins, they reported in a study published July 2 on the preprint database bioRxiv. Representatives from EvolutionaryScale, a company founded by former Meta researchers, also outlined the details in a statement on June 25.
The research team released a small version of the model under a non-commercial license and will make a large version of the model available to commercial researchers. According to EvolutionaryScale, the technology could be useful in areas ranging from drug discovery to designing new chemicals to degrade plastics.
ESM3 is a large language model (LLM) similar to OpenAI's GPT-4, which powers the ChatGPT chatbot, and the researchers trained their largest version on 2.78 billion proteins. For each protein, they extracted information about sequence (the order of the amino acid building blocks that make up the protein), structure (the three-dimensional folded shape of the protein), and function (what the protein does). They then randomly masked pieces of this information and asked ESM3 to predict the missing pieces.
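The masking step described above can be illustrated with a minimal Python sketch. It is not EvolutionaryScale's training code: the function name `mask_sequence`, the `_` mask symbol, the 25% mask fraction and the example sequence are all assumptions chosen for illustration, and the sketch covers only the sequence, not structure or function.

```python
import random

MASK = "_"  # hypothetical mask symbol; not a real amino acid letter


def mask_sequence(sequence, mask_fraction, rng):
    """Hide a random subset of residues.

    Returns the masked sequence and a dict mapping each hidden
    position to the residue a model would be asked to predict.
    """
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # remember the true residue
        masked[pos] = MASK          # hide it from the model
    return "".join(masked), targets


# Illustrative 20-residue sequence (an assumption, not taken from the study)
original = "MSKGEELFTGVVPILVELDG"
rng = random.Random(0)  # fixed seed so the example is repeatable
masked, targets = mask_sequence(original, 0.25, rng)
print(masked)
print(targets)
```

During training, a model would see only the masked sequence and be scored on how well it recovers the hidden residues.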
They built this model on research the same team had been doing at Meta. In 2022 they announced ESMFold, a precursor to ESM3 that predicted the structures of previously unknown microbial proteins. That same year, Alphabet’s DeepMind also predicted the structures of 200 million proteins.
Scientists subsequently pointed out that the predictions of these AI models have limitations and that the predicted structures need to be validated. Even so, the methods can significantly speed up the search for protein structures, as the alternative is to use X-rays to map protein structures one by one, which is slow and expensive.
ESM3, though, goes beyond predicting the structures of existing proteins. Using information gleaned from 771 billion unique pieces of structure, function and sequence information, the model can generate new proteins with specific functions. One EvolutionaryScale supporter described it as a “ChatGPT moment for biology.”
In the new study, the researchers prompted the model to create a new fluorescent protein, a type of protein that absorbs light and re-emits it at a longer wavelength, so that it glows green. These proteins are important to biological researchers, who attach them to the molecules they are interested in studying in order to track and image them; their discovery and development won the Nobel Prize in Chemistry in 2008.
The model produced 96 proteins with sequences and structures likely to produce fluorescence. The researchers then selected the one whose sequence had the least in common with naturally occurring fluorescent proteins. This protein was 50 times less bright than natural green fluorescent proteins, but a further round of generation with ESM3 produced new sequences with increased brightness. The result was a green fluorescent protein not found in nature, called “esmGFP”. The EvolutionaryScale team estimated that it would take 500 million years of evolution to achieve what the AI did over these iterations in a few moments.
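To make the idea of a protein "sharing" a percentage of its sequence with known proteins concrete, here is a rough Python sketch of a percent-identity calculation and of picking the candidate least similar to a reference set. It is not the study's method: the function names and toy sequences are hypothetical, and it assumes the sequences are already aligned to the same length, whereas real comparisons first run a sequence alignment.

```python
def percent_identity(a, b):
    """Percentage of positions at which two pre-aligned sequences match."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and pre-aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def most_novel(candidates, references):
    """Return the candidate whose closest reference match is the weakest."""
    def closest_match(candidate):
        return max(percent_identity(candidate, ref) for ref in references)
    return min(candidates, key=closest_match)


# Toy 10-residue sequences (assumptions for illustration only)
references = ["ACDEFGHIKL"]
candidates = ["ACDEFGHIKV", "ACDEFWWWWW", "ACDAAGHIKL"]

for c in candidates:
    print(c, percent_identity(c, references[0]))

novel = most_novel(candidates, references)
print("least similar candidate:", novel)
```

In this toy case the second candidate matches the reference at only half of its positions, so it is the one selected.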
“Right now, we still lack a fundamental understanding of how proteins, especially those that are ‘new’ to science, behave when introduced into a living system, but this is a great new step that allows us to approach synthetic biology in a new way. AI modeling like ESM3 will enable the discovery of new proteins that the constraints of natural selection would never allow, and will create innovations in protein engineering that evolution cannot. However, the claim to simulate 500 million years of evolution does not account for the many stages of natural selection that created the diversity of life as we know it today. AI-driven protein engineering is interesting, but I can’t help but feel that we might be too confident that we can outwit the complex processes honed by millions of years of natural selection.”