Computer Science ›› 2019, Vol. 46 ›› Issue (11A): 220-223.

• Data Science •

Design of Distributed News Clustering System Based on Big Data Computing Framework

LU Xian-hua¹, WANG Hong-jun²

  1. Beijing Information Science and Technology University, Beijing 100101, China
  2. Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
  • Online: 2019-11-10  Published: 2019-11-20

Abstract: Rapid clustering of massive volumes of Internet news to generate hot topics is an important research direction. Addressing several key problems in large-scale text clustering, namely similarity calculation, distributed clustering, and summary generation for clustering results, this paper designs and implements a Spark-based distributed news clustering system. First, a GPU-accelerated deep similarity algorithm computes the similarity relationships among news texts. Then, a graph clustering algorithm clusters the news. Finally, a short title is generated for each cluster as its description. Experiments show that the proposed system has high performance and good scalability, and can effectively handle hot-topic clustering tasks over large-scale news.

Key words: Big data, Deep similarity calculation, Distributed graph clustering, GPU acceleration, Title compression

CLC Number: TP3
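
The three stages named in the abstract (similarity calculation, graph clustering, and a short title per cluster) can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the paper uses a GPU-accelerated deep similarity algorithm and distributed graph clustering on Spark, whereas the sketch below assumes document vectors are already available, uses plain cosine similarity, treats connected components of a thresholded similarity graph as clusters, and picks the shortest member title as a stand-in for title compression. The function names, the toy vectors, and the threshold of 0.8 are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_by_similarity(vectors, threshold):
    """Group item indices into connected components of the similarity graph.

    An edge joins two items when their cosine similarity is >= threshold.
    Components are found with a simple union-find (assumed stand-in for
    the distributed graph clustering described in the abstract).
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(vectors)):
        groups[find(i)].append(i)
    return sorted(groups.values())


def label_cluster(titles, members):
    """Hypothetical placeholder for title generation: the shortest member title."""
    return min((titles[i] for i in members), key=len)


# Toy data: four headlines with made-up 3-dimensional vectors.
titles = [
    "Team wins cup final in extra time",
    "Team wins cup",
    "Central bank raises rates",
    "Bank raises rates again this year",
]
vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]

clusters = cluster_by_similarity(vectors, threshold=0.8)
labels = [label_cluster(titles, members) for members in clusters]
```

The pairwise loop is quadratic in the number of documents, which is the cost that a large-scale system would need to avoid; the sketch only shows how the stages connect.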