GENERanno: A Genomic Foundation Model for Metagenomic Annotation

This paper is a preprint and has not been certified by peer review.

Authors

Li, Q.; Wu, W.; Zhu, Y.; Feng, F.; Ye, J.; Wang, Z.

Abstract

The rapid growth of genomic and metagenomic data has underscored the pressing need for advanced computational tools capable of deciphering complex biological sequences. In this study, we introduce GENERanno, a compact yet powerful genomic foundation model specifically optimized for metagenomic annotation. Trained on an extensive dataset comprising 715 billion base pairs (bp) of prokaryotic DNA, GENERanno employs a transformer encoder architecture with 500 million parameters, enabling bidirectional attention over sequences of up to 8192 nucleotides at single-nucleotide resolution. This design addresses key limitations of existing methods, including the inability of traditional Hidden Markov Models (HMMs) to handle fragmented DNA sequences and the suboptimal tokenization schemes of current genomic foundation models, which compromise fine-grained analysis. To evaluate model performance, we curated the Prokaryotic Gener Tasks, a biologically meaningful benchmark encompassing gene fitness prediction, antibiotic resistance prediction, gene classification, and taxonomic classification. Across these tasks, GENERanno consistently outperforms its counterparts, establishing itself as a leading genomic foundation model in the prokaryotic domain. For metagenomic annotation, GENERanno achieves superior accuracy compared to traditional HMM-based methods (e.g., GLIMMER3, GeneMarkS2, Prodigal) and recent LLM-based approaches (e.g., GeneLM), while demonstrating exceptional generalization ability on archaeal genomes. Notably, GENERanno pioneers the prediction of pseudogenes based solely on sequence data, leveraging its contextual understanding to differentiate non-functional sequences from active coding regions. Overall, GENERanno represents a significant advancement in genomic foundation modeling, bridging the gap between large-scale sequence analysis and fine-grained biological insights. By providing a versatile tool for metagenomic annotation and broader genomic exploration, this work lays the groundwork for future research in functional genomics and related fields. Implementation details and supplementary resources are available at https://github.com/GenerTeam/GENERanno.
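
To make "single-nucleotide resolution" and the 8192-nucleotide context concrete, the following is a minimal sketch of how a DNA sequence could be turned into one token per base and split into windows that fit the stated context length. The vocabulary, the token ids, and the function names `tokenize` and `windows` are assumptions made for illustration; they are not taken from the GENERanno repository, whose actual tokenizer may differ.

```python
from typing import Dict, List

# Context length stated in the abstract (nucleotides per input sequence).
MAX_LEN = 8192

# Hypothetical vocabulary: one token per nucleotide, plus N for anything else.
# The real model's token ids are not given in the abstract.
VOCAB: Dict[str, int] = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenize(seq: str) -> List[int]:
    """Map each nucleotide to exactly one token id (unknown symbols become N)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def windows(ids: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Split token ids into consecutive, non-overlapping windows of at most max_len."""
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]


if __name__ == "__main__":
    contig = "ATGACCGGTNNA" * 2000  # toy 24,000-nt contig
    chunks = windows(tokenize(contig))
    print(len(chunks), [len(c) for c in chunks])
```

Because every base maps to its own token, a per-token prediction from an encoder lines up one-to-one with genomic positions, which is what lets annotation boundaries be placed at individual nucleotides. Non-overlapping windows are the simplest choice here; an overlapping stride is an equally plausible design for features that span a window boundary.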
