AlphaFold – A Game Changer in Machine Learning Research

3 May 2022 By Paul Taylor

Introduction

Proteins are a class of macromolecules which carry out a wide range of functions. For example, proteins are essential for photosynthesis, producing energy from food, contracting muscle fibres and DNA replication, and form the building blocks of bone, skin and the immune system.

Proteins are formed of chains of amino acids. There are twenty different amino acids which are typically found in proteins. The sequence of the amino acids in a protein is determined by the sequence of the DNA in the gene for the protein, and is encoded in the genetic code. DNA is first transcribed into RNA, and the RNA serves as a template for protein synthesis through translation.

Knowing the DNA sequence of a gene for a protein therefore allows you to determine the sequence of the amino acids in the protein. The amino acid sequence of a protein in turn determines the structure of the protein, which ultimately determines the protein’s function. Knowing the structure of the protein therefore helps biologists understand its function.
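
The step from DNA sequence to amino acid sequence can be sketched in a few lines of Python. This is a minimal illustration rather than a production tool: it assumes the input is the coding strand of an uninterrupted open reading frame, it uses the standard genetic code, and the function names are chosen here for illustration only.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA corresponding to a coding-strand DNA sequence."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate mRNA into one-letter amino acid codes, stopping at a stop codon."""
    protein = []
    for start in range(0, len(rna) - 2, 3):
        codon = rna[start:start + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate(transcribe("ATGGCCTGGTAA")))
```

Going from the amino acid sequence to the three-dimensional structure is the hard part, and is the subject of the rest of this article.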

Many more gene sequences are known than protein structures: over 200 million protein sequences have been deposited in the UniProt database of protein sequences, whereas the Protein Data Bank contains fewer than 200,000 solved protein structures. There is, therefore, a gap of roughly three orders of magnitude between the number of known protein sequences and the number of known structures. Determining the structure of even a part of a single protein experimentally can take many years and can be very costly. Biologists have therefore turned to in silico protein structure prediction techniques to try to bridge this gap.

Protein structure prediction

Inferring the structure of a protein from its amino acid sequence is notoriously difficult. An unfolded chain of amino acids can adopt a huge number of potential conformations: a chain of 100 amino acids has been estimated to have approximately 3¹⁹⁸ different conformations, and it would take longer than the age of the universe for a protein to fold by randomly sampling all of them. This observation is known as Levinthal's paradox.
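
The arithmetic behind this estimate can be checked directly. The sketch below assumes the usual form of the argument: a chain of 100 amino acids has 99 peptide bonds, each with two backbone angles (φ and ψ), and each angle can take one of three stable values. The sampling rate of 10¹³ conformations per second is a hypothetical figure chosen purely for illustration.

```python
# Back-of-the-envelope estimate for a chain of 100 amino acids.
N_RESIDUES = 100
ANGLES_PER_BOND = 2                # phi and psi
STATES_PER_ANGLE = 3               # assumed number of stable values per angle
SAMPLES_PER_SECOND = 10**13        # hypothetical sampling rate (assumption)
AGE_OF_UNIVERSE_SECONDS = 4.35e17  # roughly 13.8 billion years

n_angles = ANGLES_PER_BOND * (N_RESIDUES - 1)   # 99 peptide bonds -> 198 angles
n_conformations = STATES_PER_ANGLE ** n_angles  # 3**198
seconds_needed = n_conformations / SAMPLES_PER_SECOND
universe_ages = seconds_needed / AGE_OF_UNIVERSE_SECONDS

print(f"backbone angles:         {n_angles}")
print(f"conformations:           {n_conformations:.2e}")
print(f"time to sample them all: {universe_ages:.2e} ages of the universe")
```

Even with a far more generous sampling rate, the exhaustive search remains hopeless, which is why real proteins cannot be folding by random search.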

A number of different protein structure prediction techniques have been developed, including homology modelling (predicting the structure of a protein based on a known structure of a closely-related protein), threading (modelling a protein structure based on a known structure of a protein with the same predicted fold) and de novo protein structure prediction, which makes use of general principles to predict the structure of a protein, rather than using any known protein structures. The main benefit of de novo techniques is that structures can be predicted for proteins for which there are no known structural homologues. The downside is that de novo techniques can require huge amounts of computer time to determine the structure of a protein.

Critical Assessment of Protein Structure Prediction (CASP)

Since 1994, CASP, a “world championship” of protein structure prediction techniques, has been held every two years. Participants are required to predict the structure of proteins for which the structures have recently been solved experimentally, but not yet published. The predictions are then compared with the experimental results.

DeepMind, an artificial intelligence company owned by Google's parent Alphabet and best known for developing the AlphaGo and AlphaZero computer programmes, which have beaten the world’s best human Go players and the strongest chess engines respectively, entered the CASP13 and CASP14 events held in 2018 and 2020 with their AlphaFold and AlphaFold 2 programmes.

At the most recent competition held in December 2020, AlphaFold 2 achieved scores far superior to the other participants. A number of the best predicted structures were indistinguishable from the structures obtained experimentally, and across all of the targets the average error achieved by AlphaFold 2 was approximately 1.6 Ångstroms. This has led some commentators to say that the problem of accurately predicting protein structures has in some sense been solved, and others to suggest that this could result in a decline in the field of structural biology itself.
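
An error figure of this kind is typically a root-mean-square deviation (RMSD) between the positions of corresponding atoms in the predicted and experimental structures. The sketch below shows the calculation under a simplifying assumption: the two sets of coordinates have already been superimposed, so no alignment step is performed. The function name and the toy coordinates are illustrative only, and CASP's own assessment relies on other scores as well.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)
    coordinates, in whatever unit the coordinates use (here, Angstroms).

    Assumes the two structures have already been superimposed.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_error = sum(
        (p - e) ** 2
        for atom_p, atom_e in zip(predicted, experimental)
        for p, e in zip(atom_p, atom_e)
    )
    return math.sqrt(squared_error / len(predicted))


if __name__ == "__main__":
    # Toy example: every atom of the prediction is displaced by 1.6 A along z.
    experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    predicted = [(x, y, z + 1.6) for x, y, z in experimental]
    print(f"{rmsd(predicted, experimental):.2f}")
```

For scale, the distance between neighbouring alpha carbons along a protein backbone is about 3.8 Ångstroms, so an error of 1.6 Ångstroms is small compared with the spacing of the residues themselves.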

The AlphaFold 2 algorithm

AlphaFold 2 applies deep learning to structural and sequence information to predict the structure of a protein. The first stage of the method involves identifying amino acids which are likely to be situated in proximity in the protein structure. The programme carries out a multiple sequence alignment on the input amino acid sequence to identify related amino acid sequences, and from this attempts to identify pairs of amino acids which are highly correlated with one another (an “MSA representation”). The programme also attempts to identify known protein structures which may have at least some elements with a similar structure to the input, and from this generates a model of the structure of the input to identify amino acids which may be in proximity to one another (a “pair representation”).
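
The idea that correlated positions in an alignment hint at spatial proximity can be illustrated with a classical covariation measure. The sketch below computes the mutual information between two columns of a toy alignment. It is an illustration of the underlying signal only: AlphaFold 2 learns such relationships within a neural network rather than computing an explicit statistic like this one, and the function name and toy sequences are invented for the example.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `msa` is a list of equal-length aligned sequences. High values mean that
    knowing the residue at position i tells you a lot about position j.
    """
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)
    total = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        total += p_ab * math.log2(p_ab / (p_a * p_b))
    return total


if __name__ == "__main__":
    # Toy alignment: positions 1 and 3 always change together (K with E, R with D),
    # as might happen if the two residues are in contact and must stay compatible.
    toy_msa = ["AKLE", "AKLE", "ARLD", "ARLD"]
    print(mutual_information(toy_msa, 1, 3))  # co-varying pair
    print(mutual_information(toy_msa, 0, 1))  # position 0 never varies
```

A pair of columns that always change together scores highly, whereas a fully conserved column carries no covariation signal at all, however important the residue may be.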

The multiple alignment and the initial model are then used to generate a model of which amino acids are likely to be in proximity in the final structure. One of the main improvements in the AlphaFold 2 programme is that the MSA representation and pair representation are carried out iteratively; the MSA representation informs the generation of the pair representation, and the pair representation informs the MSA representation. Together, these two elements generate a consensus pair representation of which amino acids in a protein are likely to be located in proximity in the protein structure.
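
The alternating pattern described above can be shown with a deliberately tiny toy. This is not DeepMind's implementation: it uses a single number per position instead of learned multi-channel features, it has no trainable weights, and all function names are hypothetical. It only illustrates the flow of information, with the pair representation biasing how each row of the MSA representation is updated, and the updated MSA representation feeding back into the pair representation through an averaged outer product.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def update_msa_from_pair(msa, pair):
    """Pair -> MSA: each position attends to the others, weighted by the pair values."""
    updated = []
    for row in msa:
        new_row = []
        for i in range(len(row)):
            weights = softmax(pair[i])
            new_row.append(sum(w * x for w, x in zip(weights, row)))
        updated.append(new_row)
    return updated


def update_pair_from_msa(msa, pair):
    """MSA -> pair: add the outer product of each row, averaged over sequences."""
    n_seq, n_res = len(msa), len(msa[0])
    return [
        [pair[i][j] + sum(row[i] * row[j] for row in msa) / n_seq for j in range(n_res)]
        for i in range(n_res)
    ]


def alternate_updates(msa, n_rounds=3):
    """Run a few rounds of mutual refinement, starting from an empty pair representation."""
    n_res = len(msa[0])
    pair = [[0.0] * n_res for _ in range(n_res)]
    for _ in range(n_rounds):
        msa = update_msa_from_pair(msa, pair)
        pair = update_pair_from_msa(msa, pair)
    return msa, pair


if __name__ == "__main__":
    toy_msa = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]  # 2 sequences x 3 positions
    refined_msa, refined_pair = alternate_updates(toy_msa)
    for row in refined_pair:
        print([round(value, 3) for value in row])
```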

The pair representation and the original input sequence are then combined in the structure module. The structure of the protein backbone is generated initially, by modelling each amino acid as a triangle, and predicting the relative orientation and position of each of these triangles. Following this step, the orientation of the amino acid side-chains is determined, and the φ and ψ angles of each of the amino acids are calculated to yield the overall structure of the protein.
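
The φ and ψ angles mentioned above are dihedral (torsion) angles, each defined by four consecutive backbone atoms: φ by C(i−1), N(i), Cα(i) and C(i), and ψ by N(i), Cα(i), C(i) and N(i+1). The sketch below shows the standard geometric calculation of a dihedral angle from four points. It illustrates the geometry only, not how AlphaFold 2 derives the angles internally, and the coordinates used are toy values rather than real atom positions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points, in the range -180 to 180.

    The angle is measured about the central bond p1 -> p2.
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / length for x in b1)  # unit vector along the central bond
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Toy geometry: the central bond lies along the z axis.
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))  # outer bonds eclipsed
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # quarter turn
```

For a real protein, the same function would be applied to the backbone atom coordinates of each residue in turn to obtain its pair of φ and ψ values.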

What next?

The ability to accurately predict protein structures has numerous implications in the fields of biology, chemistry and biotechnology.

Currently, structure-based drug discovery and rational drug design generally require a structure of a potential protein target. The ability to predict the structure of a protein for which there is no experimentally determined structural information could allow the development of new generations of drugs to these new protein targets.

Efforts to engineer proteins with new or improved activities can also benefit from the improvement in protein structure prediction. The structures of native enzyme active sites could be predicted, which could identify amino acids which might be suitable candidates for mutagenesis. It may also be possible to identify whether substituting a particular amino acid might disrupt the fold of a protein, or induce any structural changes to an enzyme active site that could affect activity.

DeepMind themselves have even reported that the models generated by AlphaFold 2 have allowed structural biologists to solve protein structures that could not previously be solved. For example, in order to determine the structure of a protein by X-ray crystallography, it is necessary to solve the so-called ‘phase problem’. This can require some knowledge of the structure of the protein, or at least of a closely-related protein. An accurate model of a protein structure generated using tools such as AlphaFold 2 could now be used in this process, which could simplify the experimental determination of protein structures.

Limitations of AlphaFold 2

AlphaFold 2 uses sequences from the main UniProt and Protein Data Bank databases to form the MSA and pair references. Unsurprisingly, the quality of the output from AlphaFold 2 is only as good as the reference materials available to it. DeepMind has suggested that the accuracy of the simulated structure may drop where the mean alignment depth is less than 30 sequences. The accuracy and quality of the protein structures that are used to generate the pair references may also have an impact on the overall accuracy of the simulated structure.
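
A simple pre-check along these lines is sketched below: it estimates the depth of an alignment and flags inputs that fall under the figure of 30 sequences quoted above. The definition of depth used here (the number of non-gap residues in each column, averaged over the columns) is an assumption made for illustration and may differ from the measure DeepMind uses. The function names are also hypothetical.

```python
DEPTH_THRESHOLD = 30  # figure quoted in the text above
GAP = "-"


def mean_alignment_depth(msa):
    """Average, over alignment columns, of the number of sequences with a residue
    (rather than a gap) in that column. `msa` is a list of equal-length strings."""
    if not msa:
        return 0.0
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    depths = [sum(1 for seq in msa if seq[col] != GAP) for col in range(length)]
    return sum(depths) / length


def is_shallow(msa, threshold=DEPTH_THRESHOLD):
    """True if the alignment is shallow enough that accuracy may suffer."""
    return mean_alignment_depth(msa) < threshold


if __name__ == "__main__":
    toy_msa = ["ACD", "A-D", "AC-"]  # far too few sequences for a real prediction
    print(mean_alignment_depth(toy_msa), is_shallow(toy_msa))
```

A shallow alignment does not mean the prediction will be wrong, only that it should be treated with more caution.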

Whilst the achievements of AlphaFold 2 in predicting the structures of proteins are hugely impressive, this work also does little to help understand how proteins actually fold. The accuracy of AlphaFold 2 is also somewhat variable, with some predictions not conforming particularly well to the experimentally derived structures. AlphaFold 2 also struggles to model protein-protein interactions, and thus how proteins may form multi-unit complexes.

With this in mind, it seems that claims that the core problem of structural biology has been ‘solved’ could be somewhat premature. Rather, it seems that software such as AlphaFold 2 could simply become another tool in the structural biologist’s toolkit, and could help lead to significant academic and industrial breakthroughs in the future.