Scaling Large Language Models for Next-Generation Single-Cell Analysis
Rizvi, S. A.; Levine, D.; Patel, A.; Zhang, S.; Wang, E.; He, S.; Zhang, D.; Tang, C.; Lyu, Z.; Darji, R.; Li, M.; Sun, E.; Jeong, D.; Zhao, L.; Kwan, J.; Braun, D.; Hafler, B.; Ishizuka, J.; Dhodapkar, R.; Chung, H.; Azizi, S.; Perozzi, B.; van Dijk, D.
Abstract
Single-cell RNA sequencing has transformed our understanding of cellular diversity, yet current single-cell foundation models (scFMs) remain limited in their scalability, flexibility across diverse tasks, and ability to natively integrate textual information. In this work, we build upon the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as textual "cell sentences," to train Large Language Models (LLMs) on a corpus comprising over one billion tokens of transcriptomic data, biological text, and metadata. By scaling model size to 27 billion parameters, we observe consistent improvements in predictive and generative capabilities, as well as the capacity for advanced downstream tasks requiring synthesis of information across multicellular contexts. Through targeted fine-tuning supported by modern reinforcement learning techniques, our approach excels in tasks such as perturbation response prediction, natural language interpretation, and complex biological reasoning. By unifying transcriptomic and textual data at unprecedented scales, this approach not only surpasses both specialized single-cell models and general-purpose LLMs, but also establishes a powerful platform for next-generation single-cell analysis, paving the way for the development of "virtual cells."
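To make the abstract's central idea concrete, the sketch below illustrates one way a "cell sentence" in the Cell2Sentence style could be constructed: gene symbols are listed in order of decreasing expression so that a transcriptomic profile becomes plain text an LLM can consume. The function name, gene symbols, and count values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cell_to_sentence(expression, gene_names, top_k=None):
    """Convert one cell's expression vector into a space-separated
    sequence of gene symbols ranked by descending expression.

    A minimal sketch of the cell-sentence idea described in the abstract;
    details (normalization, truncation length) are assumptions here.
    """
    order = np.argsort(expression)[::-1]  # highest-expressed genes first
    if top_k is not None:
        order = order[:top_k]
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    return " ".join(ranked)

# Toy example: four genes measured in one cell (values are placeholders).
genes = ["CD3D", "GNLY", "MS4A1", "NKG7"]
counts = np.array([5.0, 42.0, 0.0, 17.0])
print(cell_to_sentence(counts, genes))  # -> "GNLY NKG7 CD3D"
```

The resulting text can then be interleaved with biological text and metadata in a shared token stream, which is what allows a single language model to be trained jointly on transcriptomic and textual data.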