Computer Science (计算机科学), 2017, Vol. 44, Issue (12): 58-63. doi: 10.11896/j.issn.1002-137X.2017.12.011
ZHENG Qi-bin (郑奇斌), DIAO Xing-chun (刁兴春), CAO Jian-jun (曹建军)
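Abstract: Data completeness is an important dimension of data usability. Because of problems in data collection and related processes, real-world data are often incomplete. Existing clustering algorithms generally handle incomplete data by either ignoring or imputing the missing values, but when the data are missing not at random, these strategies cause a severe drop in clustering accuracy. Under non-random missingness, the missing pattern is correlated with the values of the missing attributes. A measure of missing-pattern similarity is therefore added to the similarity measure between incomplete objects, and two PCM (Possibilistic c-means) fuzzy clustering algorithms that incorporate missing patterns are proposed: PatDistPCM, which minimizes the sum of missing-pattern distances, and PatCluPCM, which is based on clustering the missing patterns. Experiments on two public datasets show that PatDistPCM and PatCluPCM, by taking missing patterns into account, effectively improve the accuracy of clustering results on data with non-random missingness.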
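The idea of adding a missing-pattern term to the similarity between incomplete objects can be illustrated with a minimal Python sketch. It is not the formulation used by PatDistPCM or PatCluPCM, which the abstract does not give. The function names, the `weight` parameter, the partial-distance rescaling, and the normalized Hamming distance between missingness indicators are all assumptions chosen for illustration, with missing values encoded as NaN.

```python
import numpy as np


def partial_distance(x, y):
    """Squared Euclidean distance over attributes observed in both objects,
    rescaled by (number of attributes / number of shared attributes).
    Illustrative assumption, not the paper's definition."""
    shared = ~np.isnan(x) & ~np.isnan(y)
    if not shared.any():
        return float("nan")  # no attribute observed in both objects
    diff = x[shared] - y[shared]
    return float(len(x) / shared.sum() * np.dot(diff, diff))


def pattern_distance(x, y):
    """Fraction of attributes whose missing/observed status differs
    (normalized Hamming distance between missingness indicators)."""
    return float(np.mean(np.isnan(x) != np.isnan(y)))


def combined_distance(x, y, weight=1.0):
    """Value distance plus a weighted missing-pattern distance.
    `weight` is a hypothetical trade-off parameter."""
    return partial_distance(x, y) + weight * pattern_distance(x, y)


if __name__ == "__main__":
    a = np.array([1.0, np.nan, 3.0])
    b = np.array([2.0, np.nan, 5.0])  # same missing pattern as a
    c = np.array([2.0, 4.0, np.nan])  # different missing pattern
    print(combined_distance(a, b), combined_distance(a, c))
```

In this sketch, two objects with the same missing pattern are compared on their observed values only, while objects with different patterns incur an extra penalty scaled by `weight`. How the two terms should be balanced is a design choice this example leaves open.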