计算机科学 ›› 2019, Vol. 46 ›› Issue (4): 50-56.doi: 10.11896/j.issn.1002-137X.2019.04.008
黄建一, 李建江, 王铮, 方明哲
HUANG Jian-yi, LI Jian-jiang, WANG Zheng, FANG Ming-zhe
摘要: 在线社交网络已经成为人们信息交流的重要渠道和载体,形成了与现实世界交互影响的虚拟社会。众多的网络事件通过社交网络进行快速传播,可以在短时间内成为舆论热点,而负面事件会对国家安全和社会稳定造成冲击,从而引发一系列的社会问题。因此,挖掘社交网络中蕴含的热点信息,无论是从舆论监督方面还是舆情预警方面都具有重要的意义。文本聚类是挖掘热点信息的一种重要方法,然而,使用传统长文本聚类算法处理海量短文本时准确率将变低,复杂度急剧增长,从而导致耗时过长;现有的短文本聚类算法的准确率偏低、耗时过长。文中基于文本关键词,提出了结合上下文和相似度矩阵的关联模型,从而判断当前文本与上一文本的关联性。此外,根据该关联模型对文本关键词权重进行调整,以进一步降低噪声。最后,在Hadoop平台上实现了分布式的短文本聚类算法。与K-MEANS,SP-NN,SP-WC算法的比较实验验证了所提算法在话题挖掘速度、准确率和召回率等方面都具有更好的效果。
中图分类号:
[1]NGUYEN H L,WOON Y K,NG W K.A survey on data stream clustering and classification[J].Knowledge and Information Systems,2015,45(3):535-569. [2]HUANG J,PENG M,WANG H,et al.A probabilistic method for emerging topic tracking in microblog stream[J].World Wide Web,2017,20(2):325-350. [3]XIE W,ZHU F,JIANG J,et al.TopicSketch:real-time bursty topic detection from Twitter[C]∥International Conference on Data Mining.2013:837-846. [4]LI X H,HE T N,RAN H Y,et al.A novel graph partitioning criterion based short text clustering method[C]∥International Conference on Intelligent Computing.Springer,Cham,2016:338-348. [5]BEIL F,ESTER M,XU X.Frequent term-based text clustering[C]∥Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2002:436-442. [6]SALLOUM S A,AL-EMRAN M,MONEM A A,et al.A survey of text mining in social media:facebook and twitter perspectives[J].Adv.Sci.Technol.Eng.Syst.J,2017,2(1):127-133. [7]ALI A,QADIR J,RASOOL R U,et al.Big data for development:applications and techniques[J].arXiv:Computers and Society,2016,1(2):1-24. [8]HUANG J,PENG M,WANG H,et al.A probabilistic method for emerging topic tracking in Microblog stream[J].World Wide Web,2017,20(2):325-350. [9]ALLAN J,CARBONELL J,DODDINGTON G,et al.Topic detection and tracking pilot study:final report[C]∥Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop.1998:194-218. [10]WAYNE C L.Topic detection &tracking (TDT)[C].Workshop held at the University of Maryland on.1997. [11]CAPO M,PEREZ A,LOZANO J A,et al.An efficient approximation to the K-means clustering for massive data[J].Know-ledge Based Systems,2017,117:56-69. [12]ARORA P,VARSHNEY S.Analysis of K-Means and K-Medoids algorithm for big data[J].Procedia Computer Science,2016,78:507-512. [13]NG R T,HAN J.CLARANS:A method for clustering objects for spatial data mining[J].IEEE Transactions on Knowledge and Data Engineering,2002,14(5):1003-1016. [14]ABUALIGAH L M,KHADER A T.Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering[J].The Journal of Supercomputing,2017,73(11):4773-4795. [15]PINTO,DAVID,et al.A Self-Enriching Methodology for Clustering Narrow Domain Short Texts[J].The Computer Journal,2011,54(7):1148-1165. [16]PINTO D,BENEDÍ J M,ROSSO P.Clustering narrow-domain short texts by using the Kullback-Leiblerdistance[M].Computational Linguistics and Intelligent Text Processing.Springer Berlin Heidelberg,2007:611-622. [17]HU X,SUN N,ZHANG C,et al.Exploiting internal and external semantics for the clustering of short texts using world knowledge[C]∥Proceedings of the 18th ACM Conference on Information and Knowledge Management.ACM,2009:919-928. [18]THOMAS R E,KHAN S S.Co-Clustering with side information for text mining[C]∥International Conference on Data Mining.2016:105-108. [19]BHANUSE S S,KAMBLE S D,KAKDE S,et al.text mining using metadata for generation of side information[J].Procedia Computer Science,2016,78:807-814. [20]HAHSLER M,BOLAOS M.Clustering data streams based on shared density between micro-clusters[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(6):1449-1461. [21]STEINBACH M,KARYPIS G,KUMAR V.A comparison of document clustering techniques[C]∥KDD Workshop on Text Mining.2000:525-526. [22]KARYPIS G,HAN E H,KUMAR V.Chameleon:Hierarchical clustering using dynamic modeling[J].Computer,1999,32(8):68-75. [23]SCHUBERT E,SANDER J,ESTER M,et al.DBSCAN revisited:why and how you should (still) use DBSCAN[J].ACM Transactions on Database Systems (TODS),2017,42(3):19. [24]GAO T,LI A,MENG F,et al.Research on data stream clustering based on FCM algorithm[J].Procedia Computer Science,2017,122:595-602. [25]REHIOUI H,IDRISSI A,ABOUREZQ M,et al.DENCLUE- IM:a new approach for big data clustering[J].Procedia Compu-ter Science,2016,83:560-567. [26]SPARCK J K.A statistical interpretation of term specificity and its application in retrieval[J].Journal of Documentation,1972,28(1):11-21. [27]CONG Y,CHAN Y,RAGAN M A.A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF[J].Scientific Reports,2016,6(1):30308. [28]GUO L,VARGO C J,PAN Z,et al.Big social data analytics in journalism and mass communication:Comparing dictionary-based text analysis and unsupervised topic modeling[J].Journa-lism & Mass Communication Quarterly,2016,93(2):332-359. [29]ALLAHYARI M,POURIYEH S,ASSEFI M,et al.A brief survey of text mining:classification,clustering and extraction techniques[J].arXiv preprint arXiv:1707.02919,2017. [30]XU J,XU B,WANG P,et al.Self-taught convolutional neural networks for short text clustering[J].Neural Networks,2017,88:22-31. [31]LI X H,HE T N,RAN H Y,et al.A novel graph partitioning criterion based short text clustering method[C]∥International Conference on Intelligent Computing.Springer,Cham,2016:338-348. [32]SHEN D,YANG Q,SUN J T,et al.Thread detection in dyna- mic text message streams[C]∥Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.ACM,2006:35-42. [33]KENTER T,DE RIJKE M.Short text similarity with word embeddings[C]∥Proceedings of the 24th ACM International on Conference on Information and Knowledge Management.ACM,2015:1411-1420. [34]AKIMUSHKIN C,AMANCIO D R,OLIVEIRA J O N.Text authorship identified using the dynamics of word co-occurrence networks[J].PloS one,2017,12(1):e0170527. [35]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]∥Advances in Neural Information Processing Systems.2013:3111-3119. [36]BOJANOWSKI P,GRAVE E,JOULIN A,et al.Enriching word vectors with subword information[J].Transactions of the Association for Computational Linguistics,2017,5(1):135-146. [37]LI C,WANG H,ZHANG Z,et al.Topic Modeling for Short Texts with Auxiliary Word Embeddings[C]∥International Acm sigir Conference on Research and Development in Information Retrieval.2016:165-174. |
[1] | 袁得嵛, 陈世聪, 高见, 王小娟. 基于斯塔克尔伯格博弈的在线社交网络扭曲信息干预算法 Intervention Algorithm for Distorted Information in Online Social Networks Based on Stackelberg Game 计算机科学, 2021, 48(3): 313-319. https://doi.org/10.11896/jsjkx.200400079 |
[2] | 张浩洋, 周良. 改进的GHSOM算法在民航航空法规知识地图构建中的应用 Application of Improved GHSOM Algorithm in Civil Aviation Regulation Knowledge Map Construction 计算机科学, 2020, 47(6A): 429-435. https://doi.org/10.11896/JsJkx.190700161 |
[3] | 康雁,崔国荣,李浩,杨其越,李晋源,王沛尧. 融合自注意力机制和多路金字塔卷积的软件需求聚类算法 Software Requirements Clustering Algorithm Based on Self-attention Mechanism and Multi- channel Pyramid Convolution 计算机科学, 2020, 47(3): 48-53. https://doi.org/10.11896/jsjkx.190700146 |
[4] | 孙永樾, 李红燕, 张金波. RAISE:一种高效的社交网络影响成本最小化算法 RAISE:Efficient Influence Cost Minimizing Algorithm in Social Network 计算机科学, 2019, 46(9): 59-65. https://doi.org/10.11896/j.issn.1002-137X.2019.09.007 |
[5] | 袁得嵛, 高见, 叶萌熙, 王小娟. 基于拓扑扩展的在线社交网络恶意信息源定位算法 Malicious Information Source Locating Algorithm Based on Topological Extension in Online Social Network 计算机科学, 2019, 46(5): 129-134. https://doi.org/10.11896/j.issn.1002-137X.2019.05.020 |
[6] | 孙昭颖,刘功申. 面向短文本的神经网络聚类算法研究 Research on Neural Network Clustering Algorithm for Short Text 计算机科学, 2018, 45(6A): 392-395. |
[7] | 张晓阳,秦贵和,邹密,孙铭会,高庆洋. 基于LDA模型的餐厅推荐方法研究 Research on Recommendation Method of Restaurant Based on LDA Model 计算机科学, 2017, 44(7): 180-184. https://doi.org/10.11896/j.issn.1002-137X.2017.07.032 |
[8] | 张群,王红军,王伦文. 一种结合上下文语义的短文本聚类算法 Short Text Clustering Algorithm Combined with Context Semantic Information 计算机科学, 2016, 43(Z11): 443-446. https://doi.org/10.11896/j.issn.1002-137X.2016.11A.099 |
[9] | 王有华,陈笑蓉. 基于Kolmogorov复杂性的文本聚类算法改进 Improved Text Clustering Algorithm Based on Kolmogorov Complexity 计算机科学, 2016, 43(5): 243-246. https://doi.org/10.11896/j.issn.1002-137X.2016.05.045 |
[10] | 李钊,李晓,王春梅,李诚,杨春. 一种基于MapReduce的文本聚类方法研究 Text Clustering Method Study Based on MapReduce 计算机科学, 2016, 43(1): 246-250. https://doi.org/10.11896/j.issn.1002-137X.2016.01.053 |
[11] | 朱烨行,李艳玲,崔梦天,杨献文. 一种改进K-means算法的聚类算法CARDBK Clustering Algorithm CARDBK Improved from K-means Algorithm 计算机科学, 2015, 42(3): 201-205. https://doi.org/10.11896/j.issn.1002-137X.2015.03.041 |
[12] | 贺超波,杨镇雄,洪少文,汤庸,陈国华,郑凯. 应用随机游走的社交网络用户分类方法 User Classification Method in Online Social Network Using Random Walks 计算机科学, 2015, 42(2): 198-203. https://doi.org/10.11896/j.issn.1002-137X.2015.02.042 |
[13] | 张星,於志文,梁韵基,郭斌. 基于交互相似度的细粒度社群发掘方法 Community Development Method Based on Interactive Similarity 计算机科学, 2014, 41(4): 215-218. |
[14] | 刘一松,杨玉成. 基于文本聚类和概念相似度的语义Web服务发现 Semantic Web Service Discovery Based on Text Clustering and Similarity of Concepts 计算机科学, 2013, 40(11): 211-214. |
[15] | 王刚,钟国祥. 一种基于本体相似度计算的文本聚类算法研究 Study on Text Clustering Algorithm Based on Similarity Measurement of Ontology 计算机科学, 2010, 37(9): 222-224. |
|