Computer Science ›› 2017, Vol. 44 ›› Issue (1): 219-225.doi: 10.11896/j.issn.1002-137X.2017.01.042

Previous Articles     Next Articles

Efficiency Optimization of Jaccard’s Similarity Coefficient Based on Two Dimensional Partition

LIAO Bin, ZHANG Tao, YU Jiong, GUO Bing-lei and LIU Ji   

  • Online:2018-11-13 Published:2018-11-13

Abstract: With the exponential growth of Internet users and content,the efficiency of the Jaccard’s similarity coefficient algorithm under big data scenario is more important than ever before.In order to improve the efficiency of Jaccard’ssimilarity computing process,the implementation that the algorithm was analyzed under MapReduce architecture.Combining the characteristics of the Spark is more suitable for the iterative and interactive tasks,we transformed the algorithm from the MapReduce platform to Spark based on two dimensional partition algorithm.And we improved the efficiency of the algorithm by parameter adjustment,memory optimization and other.methods.With two data sets running on 3 clusters with different number of datanodes,the experimental results show that,compared with MapReduce,the algorithm execution efficiency under Spark platform improves more than 4 times,and energy efficiency improves more than 3 times.

Key words: Green computing,MapReduce,Task scheduling,Temperature-aware

[1] DEAN J,GHEMAWAT S.MapReduce:simplified data proces-sing on large clusters[J].Communications of the ACM,2008,51(1):107-113.
[2] DEAN J,GHEMAWAT S.MapReduce:a flexible data proces-sing tool[J].Communications of the ACM,2010,53(1):72-77.
[3] WANG P,MENG D,ZHAN J F,et al.Review of Programming models for data-Intensive computing[J].Journal of Computer Research and Development,2010:47(11),1993-2002.(in Chinese) 王鹏,孟丹,詹剑锋,等.数据密集型计算编程模型研究进展[J].计算机研究与发展,2010,47(11):1993-2002.
[4] CHEN S,SCHLOSSER S W.Map-Reduce Meets Wider Varieties of Applications[R].Intel Research Pittsburgh,Technical Report IRPTR-08-05,2008.
[5] WHITE B,YEN T,LIN J,et al.Web-Scale Computer Visionusing MapReduce for Multimedia Data Mining[C]∥Proceedings of the International Workshop on Multimedia Data Mining.2010:1-10.
[6] X-RIME.[2015-10-15].
[7] SHI J,XUE W,WANG W,et al.Scalable Community Detection in Massive Social Networks using MapReduce[J].IBM Journal of Research and Development,2013,57(3/4):1-12.
[8] MATSUNAGE A,TSUGAWA M,FORTES J.CloudBLAST:Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications[C]∥Proceedings of the IEEE International Conference on e-Science.IEEE,2008:222-229.
[9] WILEY K,CONNOLLY A,Gardner J,et al.Astronomy in the Cloud:Using Mapreduce for Image Co-addition[J].Astronomy,2011,123(901):366-380.
[10] LU W,HUANG J,HONG L.Massive Data MapReduce Fingerprint Discriminant Algorithm Based on Hadoop[J].Applied Mechanics and Materials,2012,263:2655-2660.
[11] TIMES N Y.Power,Pollution and the Internet [EB/OL].[2016-1-2]. data-ceneters-waste-vast-amounts-of-energy-belying-industry-image.html.
[12] BARROSO L A,HLZLE U.The datacenter as a computer:An introduction to the design of warehouse-scale machines [R].Morgan:Synthesis Lectures on Computer Architecture,Morgan &Claypool Publishers,2009
[13] HUANG S,WANG B T,WANG G R,et al.A Survey on MapReduce Optimization Technologies[J].Journal of Frontiers of Computer Science and Technology,2013,7(10):865-885.(in Chinese)黄山,王波涛,王国仁,等.MapReduce优化技术综述[J].计算机科学与探索,2013,7(10):865-885.
[14] LIU Y,JING N,CHEN L,et al.Algorithm for processing k-nearest join based on R-tree in MapReduce[J].Journal of Software,2013,24(8):1836-1851.(in Chinese) 刘义,景宁,陈荦,等.MapReduce框架下基于R-树的k-近邻连接算法[J].软件学报,2013,24(8):1836-1851.
[15] GOIRI ,LE K,NGUYEN T D,et al.GreenHadoop:leveraging green energy in data-processing frameworks[C]∥Proceedings of the 7th ACM European Conference on Computer Systems.ACM,2012:57-70.
[16] CARDOS M,SINGH A,PUCHA H,et al.Exploiting spatio-temporal tradeoffs for energy efficient MapReduce in the cloud[R].Department of Computer Science and Engineering,University of Minnesota,2010.
[17] CHE Y,GANAPATHI A,KATZ R H.To compress or not to compress-compute tradeoffs for mapreduce energy efficiency[C]∥Proceedings of the First ACM SIGCOMM Workshop on Green Networking.ACM,2010:23-28
[18] SONG J,LI T T,ZHU Z L,et al.Benchmarking and analyzing the energy consumption of cloud data management system [J].Chinese Journal of Computers,2013,36(7):1485-1499.(in Chinese) 宋杰,李甜甜,朱志良,等.云数据管理系统能耗基准测试与分析[J].计算机学报,2013,6(7):1485-1499.
[19] LIAO B,YU J,SUN H,et al.Energy-efficient algorithms for distributed storage system based on data storage structure reconfiguration[J].Journal of Computer Research and Development,2013:50(1):3-18.(in Chinese) 廖彬,于炯,孙华,等.基于存储结构重配置的分布式存储系统节能算法[J].计算机研究与发展,2013,50(1):3-18.
[20] LIAO B,YU J,ZHANG T,et al.Energy-efficient algorithms for distributed file system HDFS[J].Chinese Journal of Compu-ters,2013,36(5):1047-1064.(in Chinese) 廖彬,于炯,张陶,等.基于分布式文件系统HDFS的节能算法[J].计算机学报,2013,6(5):1047-1064.
[21] LIN J C,LEU F Y,CHEN Y.Impact of MapReduce Policies on Job Completion Reliability and Job Energy Consumption[J].IEEE Transactions on Parallel and Distributed Systems,2015,26(5):1364-1378.
[22] LIAO B,YU J,ZHANG Tao,et al.Energy-Efficient Algorithms for Distributed Storage System Based on Block Storage Structure Reconfiguration [J].Journal of Network and Computer Applications,2015,48(2):71-86.
[23] WANG H X,WU B,LIU Y.Parallel graph data analysis system based on Spark[J].Journal of Frontiers of Computer Science and Technology,2015,9(9):1066-1074.(in Chinese) 王虹旭,吴斌,刘旸.基于Spark的并行图数据分析系统[J].计算机科学与探索,2015,9(9):1066-1074.
[24] QIU R C.Parallel design and application of CURE algorithm based on Spark platform[D].South China University of Technology,2014.(in Chinese) 邱荣财.基于Spark平台的CURE算法并行化设计与应用[D].广州:华南理工大学,2014.
[25] WANG Z Y,WANG H J,XING H L,et al.Ant colony optimization algorithm based on Spark[J].Journal of Computer Applications,2015,35(10):2777-2780.(in Chinese) 王诏远,王宏杰,邢焕来,等.基于Spark的蚁群优化算法[J].计算机应用,2015,35(10):2777-2780.
[26] ZHENG F F,HUANG W P,JIA M Z.Matrix factorization re-commendation algorithm based on Spark[J].Journal of Compu-ter Applications,2015,35(10):2781-2783.(in Chinese) 郑凤飞,黄文培,贾明正.基于Spark的矩阵分解推荐算法[J].计算机应用,2015,35(10):2781-2783.
[27] YAN Y L,DONG Y H,HE X M,et al.FSMBUS:A frequent subgraph mining algorithm in single large-scale graph using spark[J].Journal of Computer Research and Development,2015(8):1768-1783.(in Chinese) 严玉良,董一鸿,何贤芒,等.FSMBUS:一种基于Spark的大规模频繁子图挖掘算法[J].计算机研究与发展,2015(8):1768-1783.
[28] SAMANTHULA B K,JIANG W.Secure Multiset Intersection Cardinality and its Application to Jaccard Coefficient[J].IEEE Transactions on Dependable & Secure Computing,2016,13(5):1.

No related articles found!
Full text



[1] LEI Li-hui and WANG Jing. Parallelization of LTL Model Checking Based on Possibility Measure[J]. Computer Science, 2018, 45(4): 71 -75, 88 .
[2] XIA Qing-xun and ZHUANG Yi. Remote Attestation Mechanism Based on Locality Principle[J]. Computer Science, 2018, 45(4): 148 -151, 162 .
[3] LI Bai-shen, LI Ling-zhi, SUN Yong and ZHU Yan-qin. Intranet Defense Algorithm Based on Pseudo Boosting Decision Tree[J]. Computer Science, 2018, 45(4): 157 -162 .
[4] WANG Huan, ZHANG Yun-feng and ZHANG Yan. Rapid Decision Method for Repairing Sequence Based on CFDs[J]. Computer Science, 2018, 45(3): 311 -316 .
[5] SUN Qi, JIN Yan, HE Kun and XU Ling-xuan. Hybrid Evolutionary Algorithm for Solving Mixed Capacitated General Routing Problem[J]. Computer Science, 2018, 45(4): 76 -82 .
[6] ZHANG Jia-nan and XIAO Ming-yu. Approximation Algorithm for Weighted Mixed Domination Problem[J]. Computer Science, 2018, 45(4): 83 -88 .
[7] WU Jian-hui, HUANG Zhong-xiang, LI Wu, WU Jian-hui, PENG Xin and ZHANG Sheng. Robustness Optimization of Sequence Decision in Urban Road Construction[J]. Computer Science, 2018, 45(4): 89 -93 .
[8] LIU Qin. Study on Data Quality Based on Constraint in Computer Forensics[J]. Computer Science, 2018, 45(4): 169 -172 .
[9] ZHONG Fei and YANG Bin. License Plate Detection Based on Principal Component Analysis Network[J]. Computer Science, 2018, 45(3): 268 -273 .
[10] SHI Wen-jun, WU Ji-gang and LUO Yu-chun. Fast and Efficient Scheduling Algorithms for Mobile Cloud Offloading[J]. Computer Science, 2018, 45(4): 94 -99, 116 .