Computer Science ›› 2017, Vol. 44 ›› Issue (12): 58-63. doi: 10.11896/j.issn.1002-137X.2017.12.011

• 4th CCF Academic Conference on Big Data •

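Fuzzy Clustering Algorithm for Incomplete Data Considering Missing Pattern

ZHENG Qi-bin, DIAO Xing-chun and CAO Jian-jun

  1. College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007; Nanjing Telecommunication Technology Institute, Nanjing 210007; Nanjing Telecommunication Technology Institute, Nanjing 210007
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61371196).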

Abstract: Data completeness is an important dimension of data usability. Because of problems in data acquisition and related processes, real-world data are often incomplete. Existing clustering algorithms generally handle incomplete data by ignoring or imputing the missing values, but when the data are missing not at random (MNAR), these strategies severely degrade clustering accuracy. Under MNAR, the missing pattern is correlated with the values of the missing attributes. A measure of missing-pattern similarity is therefore added to the similarity measure for incomplete objects, and two PCM (possibilistic c-means) fuzzy clustering algorithms that take the missing pattern into account are proposed: PatDistPCM, which minimizes the sum of missing-pattern distances, and PatCluPCM, which is based on clustering the missing patterns. Experiments on two public datasets show that PatDistPCM and PatCluPCM effectively improve the accuracy (precision and recall) of clustering results on data that are missing not at random.

Key words: Data completeness, Fuzzy clustering, Missing not at random (MNAR), Missing pattern, Possibilistic c-means
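
The abstract does not give the objective functions of PatDistPCM or PatCluPCM, so the following Python sketch only illustrates the general idea of adding a missing-pattern term to the distance between an incomplete object and a cluster prototype. It is not the authors' formulation. The function names, the pattern prototype q, and the weight lam are hypothetical; the partial distance is the common rescaled observed-attribute distance, and the typicality formula is the standard PCM one.

```python
import numpy as np

def missing_pattern(x):
    """Binary missing pattern: 1.0 where a value is missing (NaN), 0.0 where observed."""
    return np.isnan(x).astype(float)

def partial_sq_distance(x, v):
    """Squared distance over the observed attributes of x, rescaled to the full dimension."""
    observed = ~np.isnan(x)
    n_obs = int(observed.sum())
    if n_obs == 0:
        return float("nan")  # nothing observed: distance is undefined
    diff = x[observed] - v[observed]
    return x.size / n_obs * float(np.dot(diff, diff))

def combined_sq_distance(x, v, q, lam=1.0):
    """Attribute distance plus a weighted missing-pattern distance.

    q is a hypothetical pattern prototype for the cluster and lam is an
    assumed trade-off weight; neither comes from the paper.
    """
    p_diff = missing_pattern(x) - q
    return partial_sq_distance(x, v) + lam * float(np.dot(p_diff, p_diff))

def pcm_typicality(d2, eta, m=2.0):
    """Standard PCM typicality: u = 1 / (1 + (d^2 / eta)^(1 / (m - 1)))."""
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))

# One incomplete object (second attribute missing) and one attribute prototype.
x = np.array([1.0, np.nan, 3.0])
v = np.array([1.0, 5.0, 2.0])

# Two candidate clusters that differ only in their typical missing pattern.
q_same = np.array([0.0, 1.0, 0.0])   # second attribute usually missing
q_other = np.array([0.0, 0.0, 0.0])  # usually complete

d_same = combined_sq_distance(x, v, q_same)    # 1.5: patterns match
d_other = combined_sq_distance(x, v, q_other)  # 2.5: pattern mismatch adds 1.0

print(pcm_typicality(d_same, eta=1.5), pcm_typicality(d_other, eta=1.5))
```

With identical attribute prototypes, the object is more typical of the cluster whose missing pattern matches its own. This is the effect one would want under MNAR, where the pattern itself carries information about the missing values.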

