Linear Discriminant Analysis of High-dimensional Data Using Random Matrix Theory

Abstract

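Abstract (translated from the Chinese): Linear discriminant analysis (LDA) is a commonly used model-based classification method in machine learning and data mining. Although it performs well in many practical applications, its performance on high-dimensional data is poor. The reason is that when the number of variables p is close to or larger than the number of samples n, the sample covariance matrix is no longer a good estimate of the true covariance matrix, which introduces a large bias into the values of the linear discriminant function. This paper proposes a regularization method for high-dimensional classifiers based on random matrix theory. First, random matrix theory is used to obtain a consistent estimate of the high-dimensional covariance matrix, by the rotationally invariant estimator (when p ≤ n) or by eigenvalue clipping (when p > n). Then the estimated covariance matrix is used to compute the discriminant function values. Classification experiments on simulated data sets and on three microarray data sets show that the proposed method not only applies more widely to high-dimensional data but also achieves high classification accuracy.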

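Keywords (translated from the Chinese): Classification, High-dimensional data, Random matrix theory, Linear discriminant analysis, Covariance matrix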
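The claim that the sample covariance matrix degrades as p approaches or exceeds n can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes Gaussian data whose true covariance is the identity (so every true eigenvalue equals 1), and the function name `sample_cov_eigs` and the chosen sizes are illustrative.

```python
import numpy as np

def sample_cov_eigs(n, p, seed=0):
    """Eigenvalues of the sample covariance of n Gaussian samples in p dimensions.

    Assumption for illustration: the true covariance is the identity,
    so an ideal estimator would return p eigenvalues all equal to 1.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))      # rows are samples, columns are variables
    S = np.cov(X, rowvar=False)          # p x p sample covariance matrix
    return np.linalg.eigvalsh(S)         # eigenvalues in ascending order

for n, p in [(2000, 20), (200, 150), (100, 150)]:
    eigs = sample_cov_eigs(n, p)
    n_zero = int(np.sum(np.abs(eigs) < 1e-8))
    print(f"n={n:5d} p={p:4d} p/n={p / n:.2f} "
          f"min={eigs.min():.3f} max={eigs.max():.3f} zero eigenvalues={n_zero}")
```

When p/n is small the sample eigenvalues stay close to 1. As p/n grows they spread over roughly the interval [(1 - sqrt(p/n))^2, (1 + sqrt(p/n))^2], and for p > n at least p - n + 1 of them are zero, so the sample covariance matrix cannot be inverted as the classical discriminant function requires.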

Abstract: Linear discriminant analysis (LDA) is an important theoretical and analytical tool for many machine learning and data mining tasks. As a parametric classification method, it performs well in many applications. However, LDA performs poorly on the high-dimensional data sets that are now routinely generated. A primary reason is that the sample covariance matrix is no longer a good estimator of the population covariance matrix when the dimension p of the feature vector is close to, or even larger than, the sample size n. This paper therefore proposes a regularization method for high-dimensional classifiers based on random matrix theory. First, the high-dimensional covariance matrix is estimated consistently, using a rotationally invariant estimator when p ≤ n or eigenvalue clipping when p > n. Second, the estimated covariance matrix is used to compute the discriminant function. Numerical experiments on simulated data sets and on real-world microarray data sets show that the proposed method applies to a wider range of high-dimensional settings and yields higher accuracy than existing competitors.

Key words: Classification, Covariance matrix, High-dimensional data, Linear discriminant analysis, Random matrix theory
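The two-step procedure in the abstract (regularize the covariance estimate, then evaluate the discriminant function) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses only a generic eigenvalue-clipping recipe from the random matrix literature, applied for any p and n, and omits the rotationally invariant estimator. The threshold (the Marchenko-Pastur upper edge (1 + sqrt(p/n))^2 for a correlation matrix), the trace-preserving replacement value, and the names `clip_covariance`, `fit_lda`, and `predict_lda` are choices made here.

```python
import numpy as np

def clip_covariance(X):
    """Eigenvalue-clipped covariance estimate from an (n, p) data matrix.

    Assumptions (not from the paper): clipping is done on the sample
    correlation matrix, eigenvalues at or below the Marchenko-Pastur upper
    edge are treated as noise, and no variable has zero variance.
    """
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    std = np.sqrt(np.diag(S))
    C = S / np.outer(std, std)                   # sample correlation matrix
    vals, vecs = np.linalg.eigh(C)
    lam_plus = (1.0 + np.sqrt(p / n)) ** 2       # Marchenko-Pastur upper edge
    noise = vals <= lam_plus
    cleaned = vals.copy()
    if noise.any():
        # Replace the noise bulk by its mean, which keeps trace(C) equal to p
        cleaned[noise] = vals[noise].mean()
    C_clean = (vecs * cleaned) @ vecs.T          # rebuild with original eigenvectors
    return C_clean * np.outer(std, std)          # back to covariance scale

def fit_lda(X, y):
    """Fit LDA with a clipped pooled covariance; returns (classes, W, b)."""
    classes = np.unique(y)
    means = np.stack([X[y == k].mean(axis=0) for k in classes])
    centered = X - means[np.searchsorted(classes, y)]   # remove class means
    Sigma = clip_covariance(centered)
    priors = np.array([(y == k).mean() for k in classes])
    W = np.linalg.solve(Sigma, means.T)          # column k is Sigma^{-1} mu_k
    b = -0.5 * np.sum(means.T * W, axis=0) + np.log(priors)
    return classes, W, b

def predict_lda(model, X):
    """Assign each row of X to the class with the largest discriminant score."""
    classes, W, b = model
    # Score for class k: x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log prior_k
    return classes[np.argmax(X @ W + b, axis=1)]
```

A typical call would be `model = fit_lda(X_train, y_train)` followed by `predict_lda(model, X_test)`. Because the clipped eigenvalues are all strictly positive, the linear solve remains well defined even when p > n, where the unregularized sample covariance is singular.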

Chinese Library Classification (CLC) number:

TP181

LIU Peng (刘鹏), YE Bin (叶宾). Linear Discriminant Analysis of High-dimensional Data Using Random Matrix Theory[J]. Computer Science (计算机科学), 2019, 46(6A): 423-426. https://doi.org/

LIU Peng, YE Bin. Linear Discriminant Analysis of High-dimensional Data Using Random Matrix Theory[J]. Computer Science, 2019, 46(6A): 423-426. https://doi.org/


