Is General-Purpose Sequence Modeling Unlocked?

Comparing the syntax and semantics of natural languages with the sequence-function relationship of proteins has fundamentally altered how the language of life is studied. Although this comparison has inherent value as a historical milestone that helped carry NLP methods (such as language models) into the protein domain, results from NLP do not entirely translate to protein language. Model scaling is a case in point: the gains that justify ever-larger NLP models do not automatically justify ever-larger protein language models.

The observation that language models with enormous parameter counts, trained for enormous numbers of steps, still exhibit noticeable learning gradients, and are therefore perceived as under-fitted, falsely encourages the assumption that model size is proportional to the richness of the learned representations. As a result, the pursuit of more accurate or relevant protein representations has gradually turned into the pursuit of bigger models, which demand more computing power and are therefore less accessible. Notably, PLM sizes have recently grown from the order of 10⁶ to 10⁹ parameters. The authors base their size-performance benchmark on ProtTrans’s ProtT5-XL-U50, an encoder-decoder transformer pre-trained on the UniRef50 database with 3B parameters for training and 1.5B for inference, a model that historically defined the protein language model state of the art (SOTA).

The RITA family of language models was a first step toward scaling principles for protein sequence modeling, showing how a model’s performance changes with its size. RITA comprises four models whose performance increases proportionally with size, from 85M to 300M to 680M to 1.2B parameters. A similar pattern was later confirmed by ProGen2, a collection of protein language models trained on various sequence datasets and reaching 6.4B parameters. Finally, as of the time this study was published, the most recent addition encouraging model up-scaling is ESM-2, a suite of general-purpose protein language models that likewise shows performance rising proportionally with size from 650M to 3B to 15B parameters.
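To put the quoted sizes side by side, here is a minimal stdlib-only sketch; the parameter counts are taken from the text above, while the dictionary labels are my own shorthand rather than official model names:

```python
import math

M, B = 10**6, 10**9  # millions and billions of parameters

# Parameter counts quoted above for RITA, ProGen2, and ESM-2.
model_sizes = {
    "RITA (smallest)": 85 * M,
    "RITA (second)": 300 * M,
    "RITA (third)": 680 * M,
    "RITA (largest)": int(1.2 * B),
    "ProGen2 (largest)": int(6.4 * B),
    "ESM-2 (smallest)": 650 * M,
    "ESM-2 (middle)": 3 * B,
    "ESM-2 (largest)": 15 * B,
}

def size_span_in_orders(sizes: dict) -> float:
    """Powers of ten separating the smallest and largest model."""
    lo, hi = min(sizes.values()), max(sizes.values())
    return math.log10(hi / lo)

# From RITA's 85M up to ESM-2's 15B is a roughly 176x jump,
# i.e. a bit over two orders of magnitude within these families alone.
span = size_span_in_orders(model_sizes)
```

The point of the arithmetic is that each successive "SOTA" family has grown by multiples, not increments, which is exactly the trend the authors push back against.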

Equating larger with ostensibly better PLMs ignores several factors, including computing costs and the design and deployment of task-agnostic models; it raises the barrier to entry for innovative research and limits its capacity to scale. Although model size undoubtedly influences the goals above, it is not the only factor. Scaling the pre-training dataset is likewise conditional, i.e., larger datasets are not always preferable to smaller datasets of higher quality. The authors argue that scaling up language models is conditional in the same way, i.e., bigger models are not necessarily better than smaller models optimized with protein knowledge.

The primary goal of this study is to incorporate knowledge-guided optimization into an iterative empirical framework that widens access to research innovation through practical resources. Because their model “unlocks” the language of life by learning better representations of its “letters,” the amino acids, the authors named the project “Ankh” (a reference to the Ancient Egyptian sign for the key to life). Ankh’s generality and optimization are then assessed through two lines of evidence.

The first is a generation study for protein engineering on High-N (family-based) and One-N (single-sequence) applications, where N is the number of input sequences, with Ankh outperforming the SOTA across a wide range of structure and function benchmarks. The second is reaching this performance through a survey of optimal attributes covering not only the model architecture but also the software and hardware used for the model’s creation, training, and deployment. Depending on the application’s needs, they provide two pre-trained models, Ankh big and Ankh base, each offering two modes of computation. For convenience, they refer to their flagship model, Ankh big, simply as Ankh. The pre-trained models, along with instructions for running the codebase, are available on their GitHub page.
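As an illustration of how such a pre-trained model might be consumed downstream, the sketch below turns a protein sequence into a fixed-size embedding by mean-pooling per-residue vectors. The model id "ElnaggarLab/ankh-base", the Hugging Face `T5EncoderModel` loading path, and the character-level tokenization are assumptions about the public release, not details stated in this text:

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors (a list of equal-length lists of
    floats) into one per-protein vector, a common way to turn
    token-level PLM output into a fixed-size representation."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n
            for d in range(dim)]

def embed(sequence: str):
    """Hedged sketch of embedding one sequence with Ankh base.

    The imports are kept inside the function so that `mean_pool`
    above stays usable without torch/transformers installed; the
    model id and tokenization style are assumptions.
    """
    import torch
    from transformers import AutoTokenizer, T5EncoderModel  # assumed API

    tokenizer = AutoTokenizer.from_pretrained("ElnaggarLab/ankh-base")
    model = T5EncoderModel.from_pretrained("ElnaggarLab/ankh-base")
    model.eval()

    # Proteins are tokenized per amino acid ("letter"), hence list(sequence).
    inputs = tokenizer(list(sequence), is_split_into_words=True,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    return mean_pool(out.tolist())
```

Mean-pooling is only one of several readouts (per-residue embeddings can also feed token-level tasks such as secondary-structure prediction); it is shown here because it yields a single vector per protein.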