Student INSTITUTE OF ZOOLOGY, CHINESE ACADEMY OF SCIENCES, Beijing, China (People's Republic)
Abstract: The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) profiles chromatin accessibility and epigenetic heterogeneity at single-cell resolution, providing critical insight into the regulatory mechanisms that drive cellular heterogeneity and gene expression. However, scATAC-seq data are sparse and noisy, which makes it difficult to identify features that accurately define cell states and complicates reliable cell type annotation, one of the fundamental challenges in scATAC-seq analysis. Traditional methods based on gene activity scores depend on accurate peak-to-gene mapping, a substantial barrier for non-model organisms with incomplete or imperfect gene annotations. Furthermore, to our knowledge, no existing software supports cross-species cell type annotation for scATAC-seq data. In this study, we developed ATACompass, an end-to-end generative model that leverages large language models to annotate cell types from single-cell enriched peak sequences, without relying on species-specific gene annotation. By introducing systematic shifts and mutations into scATAC-seq peak sequences, we expanded the training dataset, which improved the model's annotation efficiency and enabled robust cross-species annotation. Finally, using datasets from five species, we established a comprehensive strategy for annotating scATAC-seq data across the majority of mammalian lineages, a significant advance over previous models. Overall, by leveraging peak sequence information, ATACompass achieves superior accuracy in cell type annotation across diverse scenarios and, for the first time, enables cross-species cell type annotation for scATAC-seq datasets.
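The abstract does not specify how the shifts and mutations are applied to peak sequences. As a rough illustration only, the general idea of sequence-level augmentation might be sketched as follows. This is not the ATACompass implementation; the function names (`shift_window`, `mutate`, `augment_peak`), the offsets, and the mutation rate are all hypothetical choices made for the example.

```python
import random

BASES = "ACGT"

def shift_window(genome_seq, start, end, offset):
    """Return the peak window moved by `offset` bases, clamped to the sequence bounds.

    Hypothetical helper: assumes the peak is given as 0-based [start, end) coordinates.
    """
    length = end - start
    new_start = max(0, min(len(genome_seq) - length, start + offset))
    return genome_seq[new_start:new_start + length]

def mutate(seq, rate, rng):
    """Substitute each A/C/G/T base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if base in BASES and rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def augment_peak(genome_seq, start, end, offsets=(-2, 0, 2), rate=0.05, seed=0):
    """Produce one shifted-and-mutated copy of the peak per offset (illustrative defaults)."""
    rng = random.Random(seed)
    return [mutate(shift_window(genome_seq, start, end, off), rate, rng)
            for off in offsets]

if __name__ == "__main__":
    toy_genome = "ACGTACGTACGTACGTACGT"  # toy sequence, not real data
    for variant in augment_peak(toy_genome, 4, 12):
        print(variant)
```

Each original peak yields several fixed-length variants that carry the same cell type label, which is one plausible way an augmentation of this kind could enlarge a training set.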