Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions

Main:23 Pages
6 Figures
Bibliography:1 Pages
14 Tables
Appendix:17 Pages
Abstract

The transformer architecture has revolutionized bioinformatics and driven progress in understanding and predicting the properties of biomolecules. To date, most biosequence transformers have been trained on a single omic, either proteins or nucleic acids, and have seen considerable success in downstream tasks in each domain, with particularly noteworthy breakthroughs in protein structural modeling. However, single-omic pre-training limits the ability of these models to capture cross-modal interactions. Here we present OmniBioTE, the largest open-source multi-omic model, trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that, despite being trained only on unlabelled sequence data, OmniBioTE learns joint representations consistent with the central dogma of molecular biology. We further demonstrate that OmniBioTE achieves state-of-the-art results in predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, compared to single-omic controls trained with identical compute, OmniBioTE demonstrates superior performance-per-FLOP and absolute accuracy across both multi-omic and single-omic benchmarks, highlighting the power of a unified modeling approach for biological sequences.
