Scaling Large Language Models for Next-Generation Single-Cell Analysis

This paper is a preprint and has not been certified by peer review.


Authors

Rizvi, S. A.; Levine, D.; Patel, A.; Zhang, S.; Wang, E.; He, S.; Zhang, D.; Tang, C.; Lyu, Z.; Darji, R.; Li, M.; Sun, E.; Jeong, D.; Zhao, L.; Kwan, J.; Braun, D.; Hafler, B.; Ishizuka, J.; Dhodapkar, R.; Chung, H.; Azizi, S.; Perozzi, B.; van Dijk, D.

Abstract

Single-cell RNA sequencing has transformed our understanding of cellular diversity, yet current single-cell foundation models (scFMs) remain limited in their scalability, flexibility across diverse tasks, and ability to natively integrate textual information. In this work, we build upon the Cell2Sentence (C2S) framework, which represents scRNA-seq profiles as textual "cell sentences," to train Large Language Models (LLMs) on a corpus comprising over one billion tokens of transcriptomic data, biological text, and metadata. By scaling model size to 27 billion parameters, we observe consistent improvements in predictive and generative capabilities, as well as the capacity for advanced downstream tasks requiring synthesis of information across multicellular contexts. Through targeted fine-tuning supported by modern reinforcement learning techniques, our approach excels in tasks such as perturbation response prediction, natural language interpretation, and complex biological reasoning. By unifying transcriptomic and textual data at unprecedented scales, this approach not only surpasses both specialized single-cell models and general-purpose LLMs, but also establishes a powerful platform for next-generation single-cell analysis, paving the way for the development of "virtual cells."
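For readers unfamiliar with the Cell2Sentence representation mentioned above, the sketch below illustrates the basic idea of turning an expression profile into a rank-ordered sequence of gene names that an LLM can consume as text. It is a minimal illustration only: the gene names, counts, and `top_k` cutoff are hypothetical, and this is not the authors' preprocessing code.

```python
import numpy as np

def cell_to_sentence(expression: np.ndarray, gene_names: list[str], top_k: int = 100) -> str:
    """Rank genes by expression (descending) and join their names into a 'cell sentence'.

    Illustrative sketch of the rank-ordering idea behind Cell2Sentence, not the
    authors' exact pipeline; gene names and the top_k cutoff are hypothetical.
    """
    # Keep only genes that are actually expressed in this cell.
    expressed = np.flatnonzero(expression > 0)
    # Sort expressed genes from highest to lowest count and truncate to top_k.
    order = expressed[np.argsort(-expression[expressed])][:top_k]
    return " ".join(gene_names[i] for i in order)

# Toy example with hypothetical genes and counts.
genes = ["MALAT1", "CD3D", "GNLY", "NKG7", "ACTB"]
counts = np.array([120.0, 0.0, 35.0, 48.0, 77.0])
print(cell_to_sentence(counts, genes))  # -> "MALAT1 ACTB NKG7 GNLY"
```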
