A Long-Range Genomic Foundation Model with a Context Length of up to 1 Million Tokens at Single-Nucleotide Resolution


Over the past few years, rapid advances in artificial intelligence (AI) have shown the potential to transform industries and push the boundaries of what is possible. One area that has drawn significant attention is the development of more robust and efficient models for natural language tasks. Researchers are continually working to build models that can handle longer context lengths, since the number of tokens a model can attend to determines how much text it can process and understand at once; a longer context lets the model draw on broader surrounding information when processing extensive sequences of data. Most of this long-context work, however, has focused on natural language, while a field that inherently deals with long sequences has been largely overlooked: genomics, the study of an organism's genetic material, including its structure, function, and evolution. As in natural language modeling, researchers have proposed genomic foundation models (FMs) that learn generalizable features from unstructured genome data and can then be fine-tuned for downstream tasks such as gene localization and regulatory element identification.

However, existing genomic models based on the Transformer architecture face unique challenges when dealing with DNA sequences. One limitation is the quadratic scaling of attention, which restricts the modeling of long-range interactions within DNA. Moreover, prevalent approaches rely on fixed k-mers and tokenizers to aggregate DNA into meaningful units, which discards information at the level of individual nucleotides. Unlike in natural language, this loss matters greatly, as even subtle genetic variations can profoundly alter protein function. Hyena, a recently introduced attention-free architecture based on implicit convolutions, has emerged as a promising alternative to attention-based models: it has demonstrated quality comparable to attention while processing much longer context lengths at subquadratic time complexity. Inspired by these findings, a team of Stanford and Harvard University researchers set out to investigate whether Hyena's capabilities could be leveraged to capture both the long-range dependencies and the single-nucleotide detail needed for analyzing genomic sequences.
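
To make the contrast with quadratic attention concrete, here is a minimal, hedged sketch of the core idea behind Hyena-style implicit long convolutions: a small network generates the convolution filter as a function of position, and the convolution is applied with the FFT so cost grows roughly as O(L log L) rather than O(L^2). This is an illustrative simplification (the class name `ImplicitLongConv` and the filter network are assumptions), not the authors' implementation, which also combines such convolutions with element-wise gating.

```python
# Sketch of an implicit long convolution: the filter is produced by a small MLP
# over positions rather than stored explicitly, and applied via FFT so the cost
# scales as O(L log L) instead of the O(L^2) of dense attention.
import torch
import torch.nn as nn


class ImplicitLongConv(nn.Module):
    def __init__(self, d_model: int, hidden: int = 32):
        super().__init__()
        # Maps a scalar position t in [0, 1] to one filter value per channel.
        self.filter_mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, L, D = x.shape
        t = torch.linspace(0, 1, L, device=x.device).unsqueeze(-1)  # (L, 1)
        k = self.filter_mlp(t)                                      # (L, D) implicit filter
        # Causal linear convolution via FFT; pad to 2L to avoid circular wrap-around.
        x_f = torch.fft.rfft(x, n=2 * L, dim=1)
        k_f = torch.fft.rfft(k, n=2 * L, dim=0)
        y = torch.fft.irfft(x_f * k_f.unsqueeze(0), n=2 * L, dim=1)[:, :L]
        return y


y = ImplicitLongConv(d_model=128)(torch.randn(2, 1024, 128))  # scales to very long L
```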

This led to the development of HyenaDNA, a genomic FM with an unprecedented ability to process context lengths of up to 1 million tokens at the single-nucleotide level, a 500x increase over previous attention-based genomic models. Harnessing Hyena's long-range capabilities, HyenaDNA also scales remarkably well, training up to 160x faster than Transformers equipped with FlashAttention at the longest sequence lengths. The model uses a stack of Hyena operators as its foundation to model DNA and its intricate interactions. It is pretrained with an unsupervised next-nucleotide prediction objective, learning the distribution of DNA sequences, how genes are encoded, and how non-coding regions perform regulatory functions in gene expression. HyenaDNA performs exceptionally well on several challenging genomic tasks, such as long-range species classification, and it achieves state-of-the-art results on 12 out of 17 datasets compared to the Nucleotide Transformer while using significantly fewer parameters and far less pre-training data.
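
Single-nucleotide resolution in practice means a character-level tokenizer: each base becomes its own token, so no information is aggregated away into k-mers. The sketch below is a hedged illustration; the exact vocabulary and special tokens HyenaDNA uses may differ.

```python
# Illustrative character-level DNA tokenizer: one token per nucleotide, so single-base
# variation is preserved. Vocabulary and special tokens are assumptions for this sketch.
VOCAB = {ch: i for i, ch in enumerate(["A", "C", "G", "T", "N"])}


def tokenize(seq: str) -> list[int]:
    """Map a DNA string to one integer id per nucleotide ('N' covers unknown bases)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def detokenize(ids: list[int]) -> str:
    inv = {i: ch for ch, i in VOCAB.items()}
    return "".join(inv[i] for i in ids)


print(tokenize("ACGTN"))         # [0, 1, 2, 3, 4]
print(detokenize([0, 1, 2, 3]))  # "ACGT"
```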

As mentioned previously, HyenaDNA is pretrained with context lengths of up to 1 million tokens, enabling it to capture long-range dependencies within genomic sequences. Its capability is further enhanced by single-nucleotide tokenization and by having global context available at each layer. To address training instability and speed up training on long sequences, the researchers also introduce a sequence length warm-up scheduler, which gradually increases the sequence length over the course of training and cuts training time by 40% on a species classification task. Another significant advantage of HyenaDNA is its parameter efficiency: the researchers observe that, with longer sequences and a small vocabulary, HyenaDNA delivers superior performance despite being significantly smaller than previous genomic FMs.
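
The warm-up idea can be illustrated with a simple schedule that starts training on short sequences and doubles the length at fixed step intervals until the target context is reached. The stage lengths and step counts below are invented for illustration; the paper's actual schedule may differ.

```python
# Hedged sketch of a sequence-length warm-up schedule: sequences start short and
# double in length every `steps_per_stage` steps until they hit the target context.
# The specific numbers here are illustrative assumptions, not the paper's settings.
def seq_len_at_step(step: int,
                    start_len: int = 64,
                    max_len: int = 1_048_576,
                    steps_per_stage: int = 1000) -> int:
    stage = step // steps_per_stage
    return min(start_len * (2 ** stage), max_len)


for s in (0, 1000, 5000, 14000):
    print(s, seq_len_at_step(s))
# 0 -> 64, 1000 -> 128, 5000 -> 2048, 14000 -> 1048576
```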

The researchers evaluated the performance of HyenaDNA on several downstream tasks. On the GenomicBenchmarks suite, the pretrained models achieved new state-of-the-art (SOTA) performance on all eight datasets, significantly surpassing previous approaches. On the benchmarks from the Nucleotide Transformer, HyenaDNA achieved SOTA results on 12 out of 17 datasets with considerably fewer parameters and less pre-training data. To explore the potential of in-context learning (ICL) in genomics, the researchers also conducted a series of experiments. They introduced soft prompt tokens, which let the input itself guide the output of a frozen pre-trained HyenaDNA model without updating model weights or attaching a decoder head; increasing the number of soft prompt tokens markedly improved accuracy on the GenomicBenchmarks datasets. The model also performed strongly on ultralong-range tasks: HyenaDNA competed effectively against BigBird, a SOTA sparse Transformer, on a challenging chromatin profile prediction task, and in an ultralong-range species classification task it remained effective when the context length was increased to 450k and 1M tokens.
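
The soft-prompt setup can be sketched as a thin wrapper that prepends a handful of learnable embeddings to the embedded input of a frozen backbone, so only those prompt vectors are trained. The class, names, and shapes below are assumptions for illustration, not HyenaDNA's actual API.

```python
# Minimal sketch of soft prompting with a frozen backbone: learnable "soft token"
# embeddings are prepended to the embedded input, and only they receive gradients.
import torch
import torch.nn as nn


class SoftPromptWrapper(nn.Module):
    def __init__(self, backbone: nn.Module, embed: nn.Embedding,
                 num_soft_tokens: int, d_model: int):
        super().__init__()
        self.backbone = backbone          # pretrained model, kept frozen
        self.embed = embed                # pretrained token embedding, kept frozen
        for p in list(backbone.parameters()) + list(embed.parameters()):
            p.requires_grad = False
        # Only these soft prompt embeddings are updated during tuning.
        self.soft_prompt = nn.Parameter(torch.randn(num_soft_tokens, d_model) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len) nucleotide token ids
        x = self.embed(input_ids)                                   # (B, L, D)
        prompt = self.soft_prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.backbone(torch.cat([prompt, x], dim=1))         # (B, P + L, D)


# Toy usage with a stand-in backbone (identity); in practice this would be the
# frozen pretrained HyenaDNA stack.
wrapper = SoftPromptWrapper(nn.Identity(), nn.Embedding(5, 16),
                            num_soft_tokens=4, d_model=16)
out = wrapper(torch.randint(0, 5, (2, 100)))   # shape (2, 104, 16)
```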

These results highlight HyenaDNA's remarkable capability on complex genomic tasks and its potential for addressing long-range dependencies and species differentiation. The researchers anticipate that this progress will be crucial in driving AI-assisted drug discovery and therapeutic innovation, and that it could eventually enable genomic foundation models to learn from and analyze complete patient genomes in a personalized manner, further advancing the understanding and application of genomics.