GeneLLM: A Large cfRNA Language Model for Cancer Screening from Raw Reads
Deng, S.; Sha, L.; Jin, Y.; Zhou, T.; Wang, C.; Liu, Q.; Guo, H.; Xiong, C.; Xue, Y.; Li, X.; Li, Y.; Gao, Y.; Hong, M.; Xu, J.; Chen, S.; Wang, P.
Abstract
Plasma cell-free RNA (cfRNA) has recently emerged as a promising biomarker for non-invasive early cancer detection and treatment monitoring. Here, we introduce GeneLLM, a large language model designed to interpret cfRNA sequences directly from raw reads, bypassing the need for genome annotations. GeneLLM improves detection accuracy across multiple cancer types. Our study demonstrates that this method achieves higher accuracy than traditional biomarkers and effectively handles large datasets from different centres, even at low sequencing depth. Because it does not rely on bioinformatics tools to count known genes, GeneLLM also identified cfRNAs from previously unknown genes, referred to as 'dark matter' in the genome, as cancer detection 'pseudo-biomarkers'. Our results showcase the potential of GeneLLM to revolutionise cancer detection, making it more accessible and cost-effective. This annotation-free approach also opens new avenues for biomarker discovery and enhances our understanding of intercellular communication through novel RNA molecules.