Computer Science (计算机科学), 2023, Vol. 50, Issue (12): 104-112. doi: 10.11896/jsjkx.221000167

• Database & Big Data & Data Science •

Self-optimized Single Cell Clustering Using ZINB Model and Graph Attention Autoencoder

KONG Fengling, WU Hao, DONG Qingqing

  1. School of Information Science and Engineering, Yunnan University, Kunming 650500, China
  • Received: 2022-10-21 Revised: 2023-03-14 Online: 2023-12-15 Published: 2023-12-11
  • Corresponding author: WU Hao (haowu1982@vip.163.com)
  • About author (contact: 2378015890@qq.com): KONG Fengling, born in 1997, postgraduate, is a student member of China Computer Federation. Her main research interests include biological information technology and image processing.
    WU Hao, born in 1982, Ph.D, lecturer, is a senior member of China Computer Federation. His main research interests include image processing, computer vision and bioinformatics analysis.
  • Supported by:
    National Natural Science Foundation of China (62061049) and Yunnan Fundamental Research Projects (2018FB100).

Abstract: Clustering individual cells into subpopulations is one of the most important steps in single-cell data analysis. However, owing to the limitations of sequencing principles and sequencing platforms, single-cell datasets generally suffer from high-dimensional sparsity, high-variance noise and missing gene data, which poses many challenges for cluster analysis and its applications. Existing single-cell clustering methods mainly model the relationship between cells and gene expression, but they neglect to fully mine the latent feature relationships among cells and to remove noise; the resulting clusters are unsatisfactory and hinder later analysis of the data. To address these problems, a self-optimized single-cell clustering algorithm (Self-optimized Single Cell Clustering Using ZINB Model and Graph Attention Autoencoder, scZDGAC) is proposed, which combines the zero-inflated negative binomial (ZINB) model with a graph attention autoencoder. The algorithm first uses the ZINB model together with the scalable DCA denoising algorithm: the ZINB distribution fits the feature distribution of the data better, which improves the denoising performance of the autoencoder and reduces the impact of noise and data loss on the output of the KNN algorithm. A graph attention autoencoder then propagates information between cells with different weights, so that the latent features among cells are better captured for clustering. Finally, scZDGAC adopts a self-optimization method so that the clustering module and the feature module, which were originally independent, benefit from each other, and the cluster centers are updated iteratively to further improve clustering performance. The clustering results are evaluated with two commonly used metrics, the adjusted Rand index (ARI) and normalized mutual information (NMI). Comparative experiments against other algorithms on six single-cell datasets of different scales show that the proposed algorithm substantially improves clustering performance over the other methods and demonstrate its robustness.
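
As an illustration of the ZINB model named in the abstract, the sketch below evaluates a zero-inflated negative binomial probability and the corresponding negative log-likelihood for a list of counts. It is not the authors' implementation: the mean/dispersion parameterization and the names `mu`, `theta`, `pi`, `nb_log_pmf`, `zinb_pmf` and `zinb_nll` are assumptions chosen for this example, and a single shared parameter set is used instead of per-gene, per-cell parameters predicted by an autoencoder.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial distribution
    with mean mu > 0 and dispersion theta > 0 (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x, mu, theta, pi):
    """ZINB probability: with probability pi the count is a structural zero,
    otherwise it is drawn from the negative binomial component."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    point_mass = 1.0 if x == 0 else 0.0
    return pi * point_mass + (1.0 - pi) * nb


def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of a list of counts under one shared ZINB."""
    return -sum(math.log(zinb_pmf(x, mu, theta, pi)) for x in counts)


if __name__ == "__main__":
    # Hypothetical sparse expression counts for a single gene.
    counts = [0, 0, 0, 3, 1, 0, 7, 0]
    print(zinb_nll(counts, mu=2.0, theta=1.5, pi=0.3))
```

Setting `pi` to 0 reduces the model to a plain negative binomial, which is one way to see how the zero-inflation term absorbs the excess zeros typical of sparse single-cell count data.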

Key words: Deep clustering, scRNA-seq, ZINB model, Self-optimization, DCA, Graph attention autoencoder
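
The two evaluation metrics named in the abstract, ARI and NMI, can be sketched in plain Python as follows. This is a minimal illustration, not the evaluation code used in the paper: the function names are hypothetical, degenerate inputs are mapped to 1.0 by assumption, and NMI is normalized by the arithmetic mean of the two entropies, which is one of several common conventions and may differ from the one the authors used.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # assumption: trivial input counts as perfect agreement
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form one cluster
    return (sum_ij - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(true_labels, pred_labels):
    """Mutual information divided by the arithmetic mean of the entropies
    (assumed normalization)."""
    n = len(true_labels)
    count_true = Counter(true_labels)
    count_pred = Counter(pred_labels)
    mi = 0.0
    for (u, v), c in Counter(zip(true_labels, pred_labels)).items():
        mi += (c / n) * log(c * n / (count_true[u] * count_pred[v]))
    mean_entropy = (entropy(true_labels) + entropy(pred_labels)) / 2
    return mi / mean_entropy if mean_entropy > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical cell-type annotations and predicted cluster ids.
    truth = [0, 0, 1, 1, 2, 2]
    clusters = [1, 1, 0, 0, 2, 2]
    print(adjusted_rand_index(truth, clusters), normalized_mutual_info(truth, clusters))
```

Both metrics are invariant to a relabeling of the clusters, so predicted cluster ids do not need to match the ids used in the reference annotation.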

CLC Number:

  • Q811.4