Gene42: Long-Range Genomic Foundation Model With Dense Attention

We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to handle context lengths of up to 192,000 base pairs (bp) at single-nucleotide resolution. Gene42 models use a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks demonstrate state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available.
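As a point of reference for the perplexity figures mentioned in the abstract, the following is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities. It assumes single-nucleotide tokens and natural-log probabilities; the function name `perplexity` and the toy input are illustrative and are not taken from the Gene42 evaluation code.

```python
import math

def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood.

    token_log_probs: natural-log probabilities that a model assigned
    to each observed token (here, hypothetically, one per nucleotide).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy baseline: a model that guesses uniformly over A, C, G, T
# assigns probability 0.25 to every observed base.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

With a four-letter nucleotide vocabulary, a uniform guesser has a perplexity of about 4, so values well below that indicate the model has learned structure in the sequence.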
@article{vishniakov2025_2503.16565,
  title={Gene42: Long-Range Genomic Foundation Model With Dense Attention},
  author={Kirill Vishniakov and Boulbaba Ben Amor and Engin Tekin and Nancy A. ElNaker and Karthik Viswanathan and Aleksandr Medvedev and Aahan Singh and Maryam Nadeem and Mohammad Amaan Sayeed and Praveenkumar Kanithi and Tiago Magalhaes and Natalia Vassilieva and Dwarikanath Mahapatra and Marco Pimentel and Shadab Khan},
  journal={arXiv preprint arXiv:2503.16565},
  year={2025}
}