CD-GPT: A Biological Foundation Model Bridging the Gap between Molecular Sequences Through Central Dogma

This paper is a preprint and has not been certified by peer review.

Authors

Zhu, X.; Qin, C.; Wang, F.; Yang, F.; He, B.; Zhao, Y.; Yao, J.

Abstract

The central dogma serves as a fundamental framework for understanding the flow and expression of genetic information within living organisms, connecting diverse biological sequences across molecule types. In this study, we present CD-GPT (Central Dogma Generative Pretrained Transformer), a generative biological foundation model with 1 billion parameters that aims to capture the intricate system-wide molecular interactions in biological systems. We introduce the concept of a unified representational space and employ a shared, multi-molecule vocabulary to represent biological sequences and narrow their distance in the embedding space. Through extensive pretraining on comprehensive data covering all molecular levels, CD-GPT exhibits exceptional performance in a wide range of predictive and generative downstream tasks, encompassing both mono-molecular and multi-molecular analyses. Notably, CD-GPT excels in predictive tasks such as genomic element detection, protein property prediction, and RNA-protein interaction identification, as well as generative tasks such as de novo protein generation and reverse translation. The versatility of CD-GPT opens up promising avenues for advanced multi-omics analysis.
