Computer Science ›› 2019, Vol. 46 ›› Issue (4): 50-56. doi: 10.11896/j.issn.1002-137X.2019.04.008

• Big Data & Data Science •

Single-Pass Short Text Clustering Based on Context Similarity Matrix

HUANG Jian-yi, LI Jian-jiang, WANG Zheng, FANG Ming-zhe   

  1. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
  • Received: 2018-08-27 Online: 2019-04-15 Published: 2019-04-23

Abstract: Online social networks have become an important channel and carrier of information, forming a virtual society that interacts with the real world. Numerous events spread rapidly through social networks and can become hot spots within a short period of time. Negative events, however, can threaten national security and social stability and may cause a series of social problems. Mining the hotspot information contained in social networks is therefore of great significance for both public opinion supervision and public opinion early warning. Text clustering is an important method for mining hotspot information. However, when traditional long-text clustering algorithms process massive numbers of short texts, their accuracy drops and their complexity increases sharply, which leads to long running times. Existing short text clustering algorithms also suffer from low accuracy and long running times. Based on text keywords, this paper presents an association model that combines context and a similarity matrix to determine the relevance between the current text and the previous text. In addition, the keyword weights are modified according to the association model to further reduce noise. Finally, a distributed short text clustering algorithm is implemented on the Hadoop platform. Experiments verify that the proposed algorithm outperforms the K-MEANS, SP-NN and SP-WC algorithms in terms of topic mining speed, accuracy and recall.

Key words: Online social network, Short text sequence, Text clustering, Distributed processing

CLC Number: TP391
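
As a rough illustration of the single-pass clustering idea named in the title, the sketch below reads each text once, in arrival order. It assigns the text to the most similar existing cluster, or opens a new cluster when no similarity reaches a threshold. This is a generic single-pass scheme, not the authors' method: the whitespace tokenization, raw keyword counts, cosine similarity, the `threshold` value, and the names `cosine` and `single_pass_cluster` are all assumptions made for illustration. The context similarity matrix, the keyword weight adjustment, and the Hadoop distribution described in the abstract are not modeled.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse keyword-weight mappings."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(texts, threshold=0.3):
    """Return one cluster index per text, visiting each text exactly once.

    Assumption: whitespace-separated lowercase tokens stand in for keywords,
    and raw counts stand in for keyword weights.
    """
    centroids = []  # one accumulated keyword Counter per cluster
    labels = []     # cluster index assigned to each input text
    for text in texts:
        vec = Counter(text.lower().split())
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            centroids[best].update(vec)  # fold the text into the cluster
            labels.append(best)
        else:
            centroids.append(vec)        # open a new cluster for this text
            labels.append(len(centroids) - 1)
    return labels
```

For example, `single_pass_cluster(["earthquake hits city", "city earthquake damage", "football final tonight", "football final score"])` groups the first two texts into one cluster and the last two into another. Because each text is compared only against cluster centroids seen so far, the result depends on arrival order and on the chosen threshold.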