(F1066) ATACOMPASS: CROSS-SPECIES DECIPHERING OF CELL IDENTITIES FOR SCATAC-SEQ USING LARGE LANGUAGE MODELS INDEPENDENT OF GENE ANNOTATIONS

Friday, June 13, 2025

5:00 PM - 6:00 PM HK Time

Presenting Author(s)

Yali Ding, PhD

Student
INSTITUTE OF ZOOLOGY, CHINESE ACADEMY OF SCIENCES, Beijing, China (People's Republic)

Abstract: Single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) profiles chromatin accessibility and epigenetic heterogeneity at single-cell resolution, providing critical insight into the regulatory mechanisms that drive cellular heterogeneity and gene expression. However, scATAC-seq data are sparse and noisy, which makes it difficult to identify features that accurately define cell states and complicates reliable cell type annotation, one of the fundamental challenges in scATAC-seq analysis. Traditional methods based on gene activity scores depend on accurate peak-to-gene mapping, a substantial barrier for non-model organisms with incomplete or imperfect gene annotations. Furthermore, to our knowledge, no existing software supports cross-species cell type annotation for scATAC-seq data. In this study, we developed ATACompass, an end-to-end generative model that leverages large language models to annotate cell types based on single-cell enriched peak sequences, without relying on species gene annotations. By introducing systematic shifts and mutations into scATAC-seq peak sequences, we expanded the training dataset, which improved the model's annotation efficiency and enabled robust cross-species annotation. Finally, using datasets from five species, we established a comprehensive strategy for annotating scATAC-seq data across the majority of mammalian lineages, a significant advance over previous models. Overall, by leveraging peak sequence information, ATACompass achieves superior accuracy in cell type annotation across diverse scenarios and, for the first time, enables cross-species cell type annotation of scATAC-seq datasets.
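
The shift-and-mutation augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the ATACompass implementation, which the abstract does not describe in detail. The function names, the window width, the maximum shift, the per-base substitution rate, and the number of copies are all hypothetical choices made for illustration.

```python
import random

BASES = "ACGT"


def shift_window(sequence, start, width, offset):
    """Return a `width`-long window starting at `start + offset`.

    The start is clamped so the window stays inside `sequence`
    (assumes len(sequence) >= width).
    """
    new_start = max(0, min(len(sequence) - width, start + offset))
    return sequence[new_start:new_start + width]


def mutate(sequence, rate, rng):
    """Replace each A/C/G/T base by a different base with probability `rate`."""
    out = []
    for base in sequence:
        if base in BASES and rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)  # unchanged base, or a non-ACGT symbol such as N
    return "".join(out)


def augment_peak(sequence, start, width, max_shift=10, rate=0.01, copies=4, seed=0):
    """Generate `copies` shifted and mutated variants of one peak window.

    All default values are illustrative assumptions, not values from the abstract.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(copies):
        offset = rng.randint(-max_shift, max_shift)
        window = shift_window(sequence, start, width, offset)
        variants.append(mutate(window, rate, rng))
    return variants


if __name__ == "__main__":
    # Hypothetical 200 bp region with a 100 bp peak window starting at position 50.
    region = "ACGT" * 50
    for variant in augment_peak(region, start=50, width=100):
        print(variant)
```

Each variant keeps the label of the peak it came from, so a single annotated peak yields several slightly different training sequences.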
