Computer Science ›› 2017, Vol. 44 ›› Issue (7): 210-214.doi: 10.11896/j.issn.1002-137X.2017.07.037

Previous Articles     Next Articles

K-Nearest Neighbor Algorithm Based on Hash Technology and MapRecuce

ZHAI Jun-hai, ZHANG Ming-yang, WANG Ting-ting and HAO Pu   

  • Online:2018-11-13 Published:2018-11-13

Abstract: K-nearest neighbor(K-NN) is a famous classification algorithm.Because the idea of K-NN is simple and it is easy to implement,K-NN has been widely applied to many fields,such as face recognition,gene classification and decision making,etc.However,in the big data environment,the efficiency of K-NN is very low,even it is not workable.In order to deal with this problem,based on hash technology and MapRecuce,this paper proposed an improved K-nearest neighbor algorithm.In order to verify the effectiveness of the proposed algorithm,some experiments were conducted on 4 big data sets.The experimental results show that the proposed algorithm is effective and efficient.

Key words: K-nearest neighbor,Hash technology,Classification algorithms,Big data sets

[1] WU X D,KUMAR V,QUINLAN J R,et al.Top 10 algorithms in data mining [J].Knowledge & Information Systems,2007,14(1):1-37.
[2] COVER T,HART P.Nearest neighbor pattern classification[J].IEEE Transactions on Information Theory,1967,13(1):21-27.
[3] ZHANG X G.Pattern Recognition(Third Edition)[M].Bei-jing:Tsinghua University Press,2010.(in Chinese) 张学工.模式识别(第3版) [M].北京:清华大学出版社,2010.
[4] 王晓东.计算机算法设计与分析(第4版)[M].北京:电子工业出版社,2012.
[5] ALSUWAIYEL M H.Algorithms Design Techniques and Ana-lysis(English Edition,Second Edition)[M].Beijing:Publishing House of Electronics Industry,2013.(in Chinese) ALSUWAIYEL M H.算法设计与分析(英文版,第2版)[M].北京:电子工业出版社,2013.
[6] CORMEN T H,LEISERON C E,RIVEST R L,et al.Introduction to Algorithms(English Edition,Second Edition)[M].Beijing:Higher Education Press,2001.(in Chinese) CORMEN T H,LEISERON C E,RIVEST R L,等.算法导论(英文版,第2版)[M].北京:高等教育出版社,2001.
[7] HART P E.The condensed nearest neighbor rule[J].IEEE Transaction on Information Theory,1968,14(5):515-516.
[8] WILSON D R,MARTINEZ T R.Reduction techniques for instance-based learning algorithms[J].Machine Learning,2000,38(3):257-286.
[9] BRIGHTON C,MELLISH C.Advances in instance selection forinstance-based learning algorithm[J].Data Mining and Know-ledge Discovery,2002,6(2):153-172.
[10] OLVERA-LPEZ J A,CARRASCO-OCHOA J A,MART-NEZ-TRINIDAD J F,et al.A review of instance selection me-thods[J].Artificial Intelligence Review,2010,34(2):133-143.
[11] BENTLEY J L.Multidimensional Binary Search Trees Used for Associative Searching[J].Communications of the Acm,1975,18(9):509-517.
[12] OMOHUNDRO S M.Efficient algorithms with neural network behavior[J].Journal of Complex Systems,1987,1(2):273-347.
[13] UHLMANN J K.Satisfying general proximity/similarity que-ries with metric trees[J].Information Processing Letters,1991,40(4):175-179.
[14] KNUTH D.Art of Computer Programming,Volume 3:Sorting and searching[M].New Jersey:Addison-Wesley Professional,1997.
[15] GIONIS A,INDYK P,MOTWANI R.Similarity search in high dimensions via hashing [C]∥Proc.of 25th International Con-ference on Very Large Data Bases.1999:518-529.
[16] SALAKHUTDINOV R,HINTON G.Semantic hashing [J].International Journal of Approximate Reasoning,2009,50(7):969-978.
[17] JUN W,SANJIV K,SHIH F C.Semi-supervised hashing forlarge-scale search [J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2012,34(12):2393-2406.
[18] LI W J,ZHOU Z H.Learning to hash for big data:Current status and future trends [J].Chinese Science Bulletin,2015,60(5/6):485-490.(in Chinese) 李武军,周志华.大数据哈希学习:现状与趋势[J].科学通报,2015,60(5/6):485-490.
[19] WANG J,LIU W,KUMAR S,et al.Learning to Hash for Indexing Big Data-A Survey [J].Proceedings of the IEEE,2016,104(1):34-57.
[20] MANKU G S,JAIN A,SARMA A D.Detecting near-duplicates for web crawling [C]∥International Conference on World Wide Web.ACM,2007:141-150.
[21] DEAN J,GHEMAWAT S.MapReduce:Simplified data proces- sing on large clusters [J].Communications of the ACM,2008,51(1):107-113.
[22] 黄宜华,苗凯翔.深入理解大数据-大数据处理与编程实践[M].北京:机械工业出版社,2014.

No related articles found!
Full text



[1] LEI Li-hui and WANG Jing. Parallelization of LTL Model Checking Based on Possibility Measure[J]. Computer Science, 2018, 45(4): 71 -75, 88 .
[2] XIA Qing-xun and ZHUANG Yi. Remote Attestation Mechanism Based on Locality Principle[J]. Computer Science, 2018, 45(4): 148 -151, 162 .
[3] LI Bai-shen, LI Ling-zhi, SUN Yong and ZHU Yan-qin. Intranet Defense Algorithm Based on Pseudo Boosting Decision Tree[J]. Computer Science, 2018, 45(4): 157 -162 .
[4] WANG Huan, ZHANG Yun-feng and ZHANG Yan. Rapid Decision Method for Repairing Sequence Based on CFDs[J]. Computer Science, 2018, 45(3): 311 -316 .
[5] SUN Qi, JIN Yan, HE Kun and XU Ling-xuan. Hybrid Evolutionary Algorithm for Solving Mixed Capacitated General Routing Problem[J]. Computer Science, 2018, 45(4): 76 -82 .
[6] ZHANG Jia-nan and XIAO Ming-yu. Approximation Algorithm for Weighted Mixed Domination Problem[J]. Computer Science, 2018, 45(4): 83 -88 .
[7] WU Jian-hui, HUANG Zhong-xiang, LI Wu, WU Jian-hui, PENG Xin and ZHANG Sheng. Robustness Optimization of Sequence Decision in Urban Road Construction[J]. Computer Science, 2018, 45(4): 89 -93 .
[8] LIU Qin. Study on Data Quality Based on Constraint in Computer Forensics[J]. Computer Science, 2018, 45(4): 169 -172 .
[9] ZHONG Fei and YANG Bin. License Plate Detection Based on Principal Component Analysis Network[J]. Computer Science, 2018, 45(3): 268 -273 .
[10] SHI Wen-jun, WU Ji-gang and LUO Yu-chun. Fast and Efficient Scheduling Algorithms for Mobile Cloud Offloading[J]. Computer Science, 2018, 45(4): 94 -99, 116 .