计算机科学 ›› 2019, Vol. 46 ›› Issue (4): 50-56.doi: 10.11896/j.issn.1002-137X.2019.04.008

• 大数据与数据科学 • 上一篇    下一篇

基于上下文相似度矩阵的Single-Pass短文本聚类

黄建一, 李建江, 王铮, 方明哲   

  1. 北京科技大学计算机与通信工程学院 北京100083
  • 收稿日期:2018-08-27 出版日期:2019-04-15 发布日期:2019-04-23
  • 通讯作者: 李建江(1971-),男,博士,副教授,主要研究方向为并行计算、云计算、大数据,E-mail:lijianjiang@ustb.edu.cn(通信作者)
  • 作者简介:黄建一(1989-),男,博士生,主要研究方向为社交网络分析与人工智能,E-mail:huangjianyi_ustb@163.com;王 铮(1986-),男,讲师,主要研究方向为社交网络分析与人工智能;方明哲(1989-),男,博士生,主要研究方向为在线社交网络信息溯源分析。
  • 基金资助:
    本文受国家重点研发计划资助项目(2017YFB0803302),中央基本业务费(06116104)资助。

Single-Pass Short Text Clustering Based on Context Similarity Matrix

HUANG Jian-yi, LI Jian-jiang, WANG Zheng, FANG Ming-zhe   

  1. School of Computer and Communication Engineering,University of Science and Technology Beijing,Beijing 100083,China
  • Received:2018-08-27 Online:2019-04-15 Published:2019-04-23

摘要: 在线社交网络已经成为人们信息交流的重要渠道和载体,形成了与现实世界交互影响的虚拟社会。众多的网络事件通过社交网络进行快速传播,可以在短时间内成为舆论热点,而负面事件会对国家安全和社会稳定造成冲击,从而引发一系列的社会问题。因此,挖掘社交网络中蕴含的热点信息,无论是从舆论监督方面还是舆情预警方面都具有重要的意义。文本聚类是挖掘热点信息的一种重要方法,然而,使用传统长文本聚类算法处理海量短文本时准确率将变低,复杂度急剧增长,从而导致耗时过长;现有的短文本聚类算法的准确率偏低、耗时过长。文中基于文本关键词,提出了结合上下文和相似度矩阵的关联模型,从而判断当前文本与上一文本的关联性。此外,根据该关联模型对文本关键词权重进行调整,以进一步降低噪声。最后,在Hadoop平台上实现了分布式的短文本聚类算法。与K-MEANS,SP-NN,SP-WC算法的比较实验验证了所提算法在话题挖掘速度、准确率和召回率等方面都具有更好的效果。

关键词: 短文本序列, 分布式处理, 文本聚类, 在线社交网络

Abstract: Online social network has become an important channel and carrier,and it has formed a virtual society interacting with the real world.Numerous network events rapidly spread through social networks,and they can become hot spots in a short period of time.However,the negative events vibrate national security and social stability,and may cause a series of social problems.Therefore,mining hotspot information contained in social networks is of great significance both in public opinion supervision and public opinion early warning.Text clustering is an important method for mining hotspot information.However,when the traditional long text clustering algorithms process massive short texts,their accuracy rate will become lower and the complexity will increase sharply,which will lead to long time-consuming.The exis-ting short text clustering algorithms also have low accuracy and takes too much time.Based on the keywords of text,this paper presented an association model combining context and similarity matrix to determine the relevance between the current text and the previous text.In addition,the text keyword weights were modified according to the association model to further reduce the noise.Finally,a distributed short text clustering algorithm on Hadoop platform was implemented.Through the experiments,it is verified that the proposed algorithm has better results and performance compared with K-MEANS,SP-NN and SP-WC algorithms in terms of the speed of mining topics,the accuracy and the recall rate.

Key words: Distributed processing, Online social network, Short text sequence, Text clustering

中图分类号: 

  • TP391
[1]NGUYEN H L,WOON Y K,NG W K.A survey on data stream clustering and classification[J].Knowledge and Information Systems,2015,45(3):535-569.
[2]HUANG J,PENG M,WANG H,et al.A probabilistic method for emerging topic tracking in microblog stream[J].World Wide Web,2017,20(2):325-350.
[3]XIE W,ZHU F,JIANG J,et al.TopicSketch:real-time bursty topic detection from Twitter[C]∥International Conference on Data Mining.2013:837-846.
[4]LI X H,HE T N,RAN H Y,et al.A novel graph partitioning criterion based short text clustering method[C]∥International Conference on Intelligent Computing.Springer,Cham,2016:338-348.
[5]BEIL F,ESTER M,XU X.Frequent term-based text clustering[C]∥Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2002:436-442.
[6]SALLOUM S A,AL-EMRAN M,MONEM A A,et al.A survey of text mining in social media:facebook and twitter perspectives[J].Adv.Sci.Technol.Eng.Syst.J,2017,2(1):127-133.
[7]ALI A,QADIR J,RASOOL R U,et al.Big data for development:applications and techniques[J].arXiv:Computers and Society,2016,1(2):1-24.
[8]HUANG J,PENG M,WANG H,et al.A probabilistic method for emerging topic tracking in Microblog stream[J].World Wide Web,2017,20(2):325-350.
[9]ALLAN J,CARBONELL J,DODDINGTON G,et al.Topic detection and tracking pilot study:final report[C]∥Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop.1998:194-218.
[10]WAYNE C L.Topic detection &tracking (TDT)[C].Workshop held at the University of Maryland on.1997.
[11]CAPO M,PEREZ A,LOZANO J A,et al.An efficient approximation to the K-means clustering for massive data[J].Know-ledge Based Systems,2017,117:56-69.
[12]ARORA P,VARSHNEY S.Analysis of K-Means and K-Medoids algorithm for big data[J].Procedia Computer Science,2016,78:507-512.
[13]NG R T,HAN J.CLARANS:A method for clustering objects for spatial data mining[J].IEEE Transactions on Knowledge and Data Engineering,2002,14(5):1003-1016.
[14]ABUALIGAH L M,KHADER A T.Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering[J].The Journal of Supercomputing,2017,73(11):4773-4795.
[15]PINTO,DAVID,et al.A Self-Enriching Methodology for Clustering Narrow Domain Short Texts[J].The Computer Journal,2011,54(7):1148-1165.
[16]PINTO D,BENEDÍ J M,ROSSO P.Clustering narrow-domain short texts by using the Kullback-Leiblerdistance[M].Computational Linguistics and Intelligent Text Processing.Springer Berlin Heidelberg,2007:611-622.
[17]HU X,SUN N,ZHANG C,et al.Exploiting internal and external semantics for the clustering of short texts using world knowledge[C]∥Proceedings of the 18th ACM Conference on Information and Knowledge Management.ACM,2009:919-928.
[18]THOMAS R E,KHAN S S.Co-Clustering with side information for text mining[C]∥International Conference on Data Mining.2016:105-108.
[19]BHANUSE S S,KAMBLE S D,KAKDE S,et al.text mining using metadata for generation of side information[J].Procedia Computer Science,2016,78:807-814.
[20]HAHSLER M,BOLAOS M.Clustering data streams based on shared density between micro-clusters[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(6):1449-1461.
[21]STEINBACH M,KARYPIS G,KUMAR V.A comparison of document clustering techniques[C]∥KDD Workshop on Text Mining.2000:525-526.
[22]KARYPIS G,HAN E H,KUMAR V.Chameleon:Hierarchical clustering using dynamic modeling[J].Computer,1999,32(8):68-75.
[23]SCHUBERT E,SANDER J,ESTER M,et al.DBSCAN revisited:why and how you should (still) use DBSCAN[J].ACM Transactions on Database Systems (TODS),2017,42(3):19.
[24]GAO T,LI A,MENG F,et al.Research on data stream clustering based on FCM algorithm[J].Procedia Computer Science,2017,122:595-602.
[25]REHIOUI H,IDRISSI A,ABOUREZQ M,et al.DENCLUE- IM:a new approach for big data clustering[J].Procedia Compu-ter Science,2016,83:560-567.
[26]SPARCK J K.A statistical interpretation of term specificity and its application in retrieval[J].Journal of Documentation,1972,28(1):11-21.
[27]CONG Y,CHAN Y,RAGAN M A.A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF[J].Scientific Reports,2016,6(1):30308.
[28]GUO L,VARGO C J,PAN Z,et al.Big social data analytics in journalism and mass communication:Comparing dictionary-based text analysis and unsupervised topic modeling[J].Journa-lism & Mass Communication Quarterly,2016,93(2):332-359.
[29]ALLAHYARI M,POURIYEH S,ASSEFI M,et al.A brief survey of text mining:classification,clustering and extraction techniques[J].arXiv preprint arXiv:1707.02919,2017.
[30]XU J,XU B,WANG P,et al.Self-taught convolutional neural networks for short text clustering[J].Neural Networks,2017,88:22-31.
[31]LI X H,HE T N,RAN H Y,et al.A novel graph partitioning criterion based short text clustering method[C]∥International Conference on Intelligent Computing.Springer,Cham,2016:338-348.
[32]SHEN D,YANG Q,SUN J T,et al.Thread detection in dyna- mic text message streams[C]∥Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.ACM,2006:35-42.
[33]KENTER T,DE RIJKE M.Short text similarity with word embeddings[C]∥Proceedings of the 24th ACM International on Conference on Information and Knowledge Management.ACM,2015:1411-1420.
[34]AKIMUSHKIN C,AMANCIO D R,OLIVEIRA J O N.Text authorship identified using the dynamics of word co-occurrence networks[J].PloS one,2017,12(1):e0170527.
[35]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]∥Advances in Neural Information Processing Systems.2013:3111-3119.
[36]BOJANOWSKI P,GRAVE E,JOULIN A,et al.Enriching word vectors with subword information[J].Transactions of the Association for Computational Linguistics,2017,5(1):135-146.
[37]LI C,WANG H,ZHANG Z,et al.Topic Modeling for Short Texts with Auxiliary Word Embeddings[C]∥International Acm sigir Conference on Research and Development in Information Retrieval.2016:165-174.
[1] 袁得嵛, 陈世聪, 高见, 王小娟.
基于斯塔克尔伯格博弈的在线社交网络扭曲信息干预算法
Intervention Algorithm for Distorted Information in Online Social Networks Based on Stackelberg Game
计算机科学, 2021, 48(3): 313-319. https://doi.org/10.11896/jsjkx.200400079
[2] 张浩洋, 周良.
改进的GHSOM算法在民航航空法规知识地图构建中的应用
Application of Improved GHSOM Algorithm in Civil Aviation Regulation Knowledge Map Construction
计算机科学, 2020, 47(6A): 429-435. https://doi.org/10.11896/JsJkx.190700161
[3] 康雁,崔国荣,李浩,杨其越,李晋源,王沛尧.
融合自注意力机制和多路金字塔卷积的软件需求聚类算法
Software Requirements Clustering Algorithm Based on Self-attention Mechanism and Multi- channel Pyramid Convolution
计算机科学, 2020, 47(3): 48-53. https://doi.org/10.11896/jsjkx.190700146
[4] 孙永樾, 李红燕, 张金波.
RAISE:一种高效的社交网络影响成本最小化算法
RAISE:Efficient Influence Cost Minimizing Algorithm in Social Network
计算机科学, 2019, 46(9): 59-65. https://doi.org/10.11896/j.issn.1002-137X.2019.09.007
[5] 袁得嵛, 高见, 叶萌熙, 王小娟.
基于拓扑扩展的在线社交网络恶意信息源定位算法
Malicious Information Source Locating Algorithm Based on Topological Extension in Online Social Network
计算机科学, 2019, 46(5): 129-134. https://doi.org/10.11896/j.issn.1002-137X.2019.05.020
[6] 孙昭颖,刘功申.
面向短文本的神经网络聚类算法研究
Research on Neural Network Clustering Algorithm for Short Text
计算机科学, 2018, 45(6A): 392-395.
[7] 张晓阳,秦贵和,邹密,孙铭会,高庆洋.
基于LDA模型的餐厅推荐方法研究
Research on Recommendation Method of Restaurant Based on LDA Model
计算机科学, 2017, 44(7): 180-184. https://doi.org/10.11896/j.issn.1002-137X.2017.07.032
[8] 张群,王红军,王伦文.
一种结合上下文语义的短文本聚类算法
Short Text Clustering Algorithm Combined with Context Semantic Information
计算机科学, 2016, 43(Z11): 443-446. https://doi.org/10.11896/j.issn.1002-137X.2016.11A.099
[9] 王有华,陈笑蓉.
基于Kolmogorov复杂性的文本聚类算法改进
Improved Text Clustering Algorithm Based on Kolmogorov Complexity
计算机科学, 2016, 43(5): 243-246. https://doi.org/10.11896/j.issn.1002-137X.2016.05.045
[10] 李钊,李晓,王春梅,李诚,杨春.
一种基于MapReduce的文本聚类方法研究
Text Clustering Method Study Based on MapReduce
计算机科学, 2016, 43(1): 246-250. https://doi.org/10.11896/j.issn.1002-137X.2016.01.053
[11] 朱烨行,李艳玲,崔梦天,杨献文.
一种改进K-means算法的聚类算法CARDBK
Clustering Algorithm CARDBK Improved from K-means Algorithm
计算机科学, 2015, 42(3): 201-205. https://doi.org/10.11896/j.issn.1002-137X.2015.03.041
[12] 贺超波,杨镇雄,洪少文,汤庸,陈国华,郑凯.
应用随机游走的社交网络用户分类方法
User Classification Method in Online Social Network Using Random Walks
计算机科学, 2015, 42(2): 198-203. https://doi.org/10.11896/j.issn.1002-137X.2015.02.042
[13] 张星,於志文,梁韵基,郭斌.
基于交互相似度的细粒度社群发掘方法
Community Development Method Based on Interactive Similarity
计算机科学, 2014, 41(4): 215-218.
[14] 刘一松,杨玉成.
基于文本聚类和概念相似度的语义Web服务发现
Semantic Web Service Discovery Based on Text Clustering and Similarity of Concepts
计算机科学, 2013, 40(11): 211-214.
[15] 王刚,钟国祥.
一种基于本体相似度计算的文本聚类算法研究
Study on Text Clustering Algorithm Based on Similarity Measurement of Ontology
计算机科学, 2010, 37(9): 222-224.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!