Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (24): 83-89. DOI: 10.3778/j.issn.1002-8331.2204-0340

• Theory, Research and Development •

Clustering of Single-Cell RNA-Seq Data Based on Heterogeneous Parallel Computing

XIE Linjuan, LI Lixuan, ZHANG Shaoqiang   

  1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
  • Online: 2022-12-15 Published: 2022-12-15

Abstract: With the development of single-cell RNA sequencing (scRNA-seq) technology, mainstream scRNA-seq throughput has grown from thousands of cells to tens of thousands of cells. Cell typing based on scRNA-seq data is an important problem in cell research and is mainly addressed with unsupervised clustering. Existing clustering methods for large-scale single-cell sequencing data reduce time complexity by simplifying the cell-cell network, which lowers cell-typing accuracy, whereas the common high-accuracy cell-typing methods cannot handle large-scale data. To address this, the study combines k-nearest neighbors (KNN) with a cell-cell similarity threshold to construct a new single-cell network, uses CPU+GPU heterogeneous parallel computing to speed up the computation, and performs cell clustering with an improved Markov clustering algorithm. Experiments on seven relatively large single-cell datasets show that the algorithm achieves better clustering accuracy than the main existing algorithms, making it suitable for cell typing of scRNA-seq data produced by mainstream technologies.

Key words: single-cell RNA-seq, unsupervised clustering, parallel computing, cell typing
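
The pipeline outlined in the abstract (a cell network built from KNN plus a similarity threshold, followed by Markov clustering) can be illustrated with a small CPU-only sketch. The abstract does not state the similarity measure, how the two criteria are combined, or what the improved Markov clustering changes, so everything below is an assumption: cosine similarity, an edge kept only when it is both a top-k neighbour link and above the threshold, and the standard (unmodified) Markov clustering loop. The names `build_network` and `markov_cluster` are hypothetical. The expansion step is a dense matrix product, which is the kind of operation a GPU would typically accelerate in a heterogeneous setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two expression vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_network(cells, k, threshold):
    """Keep edge i-j if j is one of i's k most similar cells AND the
    similarity reaches the threshold (an assumed way to combine the two)."""
    n = len(cells)
    sim = [[cosine(cells[i], cells[j]) for j in range(n)] for i in range(n)]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        for j in nearest:
            if sim[i][j] >= threshold:
                adj[i][j] = adj[j][i] = sim[i][j]
    return adj


def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def markov_cluster(adj, inflation=2.0, max_iter=50, tol=1e-9):
    """Standard Markov clustering: alternate expansion and inflation."""
    n = len(adj)
    # Add self-loops, then turn the weights into transition probabilities.
    m = normalize_columns(
        [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        # Expansion: matrix product m * m (the GPU-friendly step).
        expanded = [[sum(m[i][t] * m[t][j] for t in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: element-wise power, then renormalize the columns.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each non-empty row is an attractor; its non-zero columns form a cluster.
    clusters = []
    for i in range(n):
        members = sorted(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return sorted(clusters)


# Toy data: six cells, four genes, two expression patterns.
cells = [[5, 4, 0, 0], [4, 5, 0, 0], [5, 5, 1, 0],
         [0, 0, 5, 4], [0, 1, 4, 5], [0, 0, 5, 5]]
network = build_network(cells, k=2, threshold=0.5)
clusters = markov_cluster(network)
print(clusters)
```

On real data with tens of thousands of cells, the dense nested-list matrices used here would be replaced by sparse or GPU-resident arrays; the toy version only shows the order of the steps.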