Computer Science ›› 2015, Vol. 42 ›› Issue (Z6): 89-93.

• Intelligent Computing •

Multiple Hypothesis Testing and its Application in Feature Dimension Reduction of Big Data

PAN Shu, QI Yun-song

  1. College of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003, China
  • Online: 2018-11-14 Published: 2018-11-14
  • Funding:
    Supported by the National Natural Science Foundation of China (61471182) and the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (13KlB520003)


Abstract: Existing feature dimension reduction methods fall roughly into two classes: feature extraction and feature selection. In feature extraction, the original features in the measurement space are transformed into a new, lower-dimensional space via some specified mapping. Although the variables obtained in the new space are derived from the original variables, their physical interpretation in terms of the original variables may be lost; that is, feature extraction changes the representation of the original data. Feature selection, by contrast, seeks an optimal or suboptimal subset of the original features that preserves the main information carried by the complete data; features that are irrelevant or redundant with respect to the analysis task are discarded, which may cause information loss. Consequently, almost none of the existing dimensionality reduction methods are high-fidelity: the reduced data are suitable only for specific subsequent analysis tasks, so such reduction amounts merely to preprocessing for a particular task. This paper analyzes, from the perspective of multiple hypothesis testing, how high-fidelity dimension reduction of high-dimensional data can be achieved and identifies the key research issues. The resulting procedure retains all useful information while eliminating irrelevant features from the original data.
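The multiple-testing route to feature selection described above typically tests each candidate feature marginally and then controls the false discovery rate (FDR) over the whole family of tests. A minimal sketch of the classic Benjamini-Hochberg step-up procedure, assuming per-feature p-values have already been computed (the function name and the example p-values are illustrative, not taken from the paper):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the
    Benjamini-Hochberg step-up procedure, which controls the false
    discovery rate at level alpha under independence."""
    m = len(p_values)
    # Sort p-values ascending, remembering each feature's original index.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: find the largest rank k with p_(k) <= (k/m) * alpha.
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject (i.e., keep as relevant features) the k_max smallest p-values.
    return sorted(order[:k_max])

# Example: p-values for 8 features; only the clearly significant ones survive.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
kept = benjamini_hochberg(pvals, alpha=0.05)  # indices of retained features
```

With these illustrative p-values the procedure retains features 0 and 1: although 0.039 is below the raw level 0.05, it exceeds its step-up threshold 3/8 x 0.05 = 0.01875, which is exactly the multiplicity correction that distinguishes FDR control from testing each hypothesis marginally.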

Key words: Feature selection, Dimension reduction, Multiple hypothesis testing

