Computer Science (计算机科学), 2019, Vol. 46, Issue 11A: 220-223.

• Data Science •

Design of Distributed News Clustering System Based on Big Data Computing Framework

LU Xian-hua1, WANG Hong-jun2

  1. Beijing Information Science and Technology University, Beijing 100101, China
  2. Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
  • Online: 2019-11-10 Published: 2019-11-20
  • About the author: LU Xian-hua (born 1964), female, master's degree. Her main research interests include big data, software engineering, and project management.

Abstract: Fast hot-topic clustering of massive volumes of Internet news is an important research direction. To address several key problems in large-scale text clustering (similarity calculation, distributed clustering, and cluster summary generation), this paper designs and implements a distributed news clustering system based on the Spark computing framework. The system first uses a GPU-accelerated deep similarity algorithm to compute the similarity between news texts and obtain the similarity relationships among news items. It then applies a graph clustering algorithm to cluster the news. Finally, it uses title compression to produce a short hot-topic description for each cluster, yielding the final clustering result. Experimental results show that the proposed system achieves high execution efficiency and good scalability, and can effectively handle hot-topic clustering of large-scale news.

Key words: Big data, Deep similarity calculation, Distributed graph clustering, GPU acceleration, Title compression

CLC Number: TP3
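
The three stages named in the abstract (similarity calculation, graph clustering, and cluster description) can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: it assumes bag-of-words cosine similarity in place of the GPU-accelerated deep similarity algorithm, union-find connected components in place of the distributed graph clustering on Spark, and shortest-title selection in place of title compression. The function names `cosine` and `cluster_titles` and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_titles(titles, threshold=0.5):
    """Group titles by the connected components of their similarity graph."""
    vectors = [Counter(t.lower().split()) for t in titles]
    parent = list(range(len(titles)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Stage 1: similarity edges. Stage 2: merge the endpoints of each edge.
    for i, j in combinations(range(len(titles)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, title in enumerate(titles):
        groups.setdefault(find(i), []).append(title)

    # Stage 3 (assumed stand-in for title compression):
    # label each cluster with its shortest member title.
    return {min(members, key=len): members for members in groups.values()}


if __name__ == "__main__":
    news = [
        "stock market falls sharply",
        "stock market falls sharply after rate hike",
        "local team wins championship final",
        "team wins championship",
    ]
    for label, members in cluster_titles(news).items():
        print(label, "<-", members)
```

The all-pairs loop is quadratic in the number of titles, so the sketch only shows the data flow from similarity edges to clusters to labels. Making that step fast at scale is the problem the paper's GPU-accelerated similarity search and Spark-based clustering are meant to solve.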