Computer Science (计算机科学), 2019, Vol. 46, Issue 11A: 220-223.
卢献华1, 王洪俊2
LU Xian-hua1, WANG Hong-jun2
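Abstract: Fast hot-topic clustering of massive volumes of Internet news is an important research direction. To address several key problems in large-scale text clustering (similarity computation, distributed clustering, and cluster summary generation), this paper designs and implements a distributed news clustering system based on the Spark computing framework. The system uses a GPU-accelerated deep similarity algorithm to compute the similarity between news texts and obtain the similarity relations among news items, applies a graph clustering algorithm to cluster the news, and finally uses headline compression to form hot-topic descriptions and produce the final clustering results. Experimental results show that the proposed system achieves high execution efficiency and good scalability, and can effectively handle hot-topic clustering of large-scale news.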
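The three stages named in the abstract (similarity computation, graph clustering, hot-topic description) can be illustrated with a minimal single-machine sketch. It is not the authors' implementation, and every concrete choice in it is an assumption: plain cosine similarity over toy vectors stands in for the GPU-accelerated deep similarity algorithm, connected components stands in for the unspecified graph clustering algorithm, and picking the shortest title stands in for headline compression. The function names, the threshold value, and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_edges(vectors, threshold):
    """Stage 1 (assumed form): link every pair of items whose similarity
    reaches the threshold. This brute-force loop is O(n^2) and only
    suitable for a toy example."""
    n = len(vectors)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


def connected_components(n, edges):
    """Stage 2 (assumed algorithm): group items with union-find so that
    each connected component of the similarity graph is one cluster."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


def describe(cluster, titles):
    """Stage 3 (stand-in for headline compression): use the shortest
    title in the cluster as its description."""
    return min((titles[i] for i in cluster), key=len)


# Hypothetical toy data: four news items with 2-dimensional vectors.
titles = [
    "Central bank raises interest rates again",
    "Rates raised",
    "Local team wins the championship final",
    "Team wins final",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

edges = similarity_edges(vectors, threshold=0.8)
clusters = connected_components(len(vectors), edges)
descriptions = [describe(c, titles) for c in clusters]
```

In a distributed setting the same shape would map onto Spark, for example by building the edge list as a distributed dataset and running a graph library's component search over it, but the abstract gives no detail on how the authors do this.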