LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language
He, Y.; Fang, P.; Shan, Y.; Pan, Y.; Wei, Y.; Chen, Y.; Chen, Y.; Liu, Y.; Zeng, Z.; Zhou, Z.; Zhu, F.; Holmes, E. C.; Ye, J.; Li, J.; Shu, Y.; Shi, M.; Li, Z.
Abstract
In recent years, pre-trained foundational models have driven significant advances in natural language processing (NLP), paving the way for similar AI technologies to interpret the language of biology. In this research, we introduce "LucaOne", a novel pre-trained foundational model designed to learn jointly from the genetic and proteomic languages, drawing on DNA, RNA, and protein data from 169,861 species. This work illuminates the potential for creating a biological language model aimed at universal bioinformatics applications. Remarkably, through few-shot learning, this model efficiently learns the central dogma of molecular biology and demonstrably outperforms competing models. Furthermore, in tasks requiring inputs of DNA, RNA, proteins, or a combination thereof, LucaOne exceeds state-of-the-art performance using a streamlined downstream architecture, thereby providing empirical evidence and innovative perspectives on the potential of foundational models to comprehend complex biological systems.