GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling

This paper is a preprint and has not been certified by peer review.

Authors

Fan, Y.; Li, Y.; Ding, J.; Li, Y.

Abstract

Single-cell ATAC-seq (scATAC-seq) has emerged as a powerful technique for investigating open chromatin landscapes at the single-cell level and has therefore been applied to a wide range of biomedical applications. However, scATAC-seq cell representation learning and its downstream tasks remain challenging because the data are inherently high-dimensional, sparse, and noisy. Furthermore, transfer learning and generalization between scATAC-seq data from different species and tissues are extremely challenging because different experimental batches and peak-calling algorithms produce misaligned features. Recent advances in Genome Foundation Models (GFMs) have demonstrated their superiority in a range of downstream applications that involve text-like genomic sequence information, such as regulatory element prediction. However, their potential in other areas, especially single-cell biology research, has not been well explored. Features in scATAC-seq data correspond to specific genome regions with high chromatin accessibility, and these regions are represented by DNA sequences. Incorporating nucleotide sequence embeddings derived from GFMs, which are pre-trained on the human reference genome, therefore has the potential to enhance the quality of feature embeddings and improve the modeling of scATAC-seq data. In this paper, we present the Genome Foundation Embedded Topic Model (GFETM), an interpretable and transferable deep neural network framework that integrates a GFM with an Embedded Topic Model (ETM) to perform scATAC-seq data analysis. We show that by probing and integrating the DNA sequence embeddings extracted by GFMs from open chromatin regions, GFETM not only achieves state-of-the-art performance on benchmarking datasets of various scales but also demonstrates generalizability and transferability across different batches, tissues, species, and omics. Code is available at https://github.com/fym0503/GFETM.
