Computer Science ›› 2018, Vol. 45 ›› Issue (6A): 392-395.

• Big Data & Data Mining

Research on Neural Network Clustering Algorithm for Short Text

SUN Zhao-ying,LIU Gong-shen   

  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Online: 2018-06-20 Published: 2018-08-03

Abstract: Short texts contain few words and carry little descriptive information, so their vector representations are high-dimensional, sparse, and noisy. Existing clustering algorithms are neither accurate nor efficient on large-scale short-text collections. To address this problem, a short-text clustering algorithm based on a deep-learning convolutional neural network is proposed. The algorithm uses the word2vec model, trained on a large-scale corpus, to learn the latent semantic associations between words and to represent each word as a multidimensional vector; each short text is then likewise expressed in multidimensional raw-vector form. A convolutional neural network extracts features from this sparse, high-dimensional raw vector and produces a low-dimensional text vector with more effective features. Finally, a traditional clustering algorithm is used to cluster the short texts. The proposed method is feasible and effective for reducing the dimensionality of text vectors, and achieves good short-text clustering results, with an F-measure above 75%.
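
The pipeline described in the abstract (word vectors, convolutional feature extraction, then a traditional clustering step) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: `TOY_VECTORS` is a hand-made table standing in for a trained word2vec model, the convolution filters are random and untrained, k-means is assumed as the traditional clustering algorithm, and all function names are hypothetical.

```python
import random

# Hypothetical toy embedding table standing in for a trained word2vec model.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0], "dog": [0.8, 0.2, 0.1], "pet": [0.85, 0.15, 0.05],
    "stock": [0.1, 0.9, 0.2], "market": [0.0, 0.8, 0.3], "price": [0.1, 0.85, 0.25],
}
DIM = 3  # dimensionality of each toy word vector


def text_to_matrix(text):
    """Represent a short text as a list of word vectors (unknown words are skipped)."""
    return [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]


def conv_max_pool(matrix, filters, width=2):
    """Slide each filter over windows of `width` words, apply ReLU, then max-pool over time."""
    if len(matrix) < width:  # zero-pad texts shorter than one window
        matrix = matrix + [[0.0] * DIM] * (width - len(matrix))
    features = []
    for f in filters:
        best = 0.0  # starting at 0 acts as the ReLU floor
        for i in range(len(matrix) - width + 1):
            window = [x for row in matrix[i:i + width] for x in row]
            best = max(best, sum(a * b for a, b in zip(window, f)))
        features.append(best)
    return features  # one value per filter: the low-dimensional text vector


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (assumed here as the 'traditional' clustering step)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_short_texts(texts, k=2, n_filters=4, width=2, seed=0):
    """Embed each text, reduce it with untrained random filters, then cluster."""
    rng = random.Random(seed)
    filters = [[rng.uniform(-1, 1) for _ in range(width * DIM)] for _ in range(n_filters)]
    feats = [conv_max_pool(text_to_matrix(t), filters, width) for t in texts]
    return feats, kmeans(feats, k, seed=seed)
```

Calling `cluster_short_texts(["cat dog pet", "stock market price", "cat dog pet"], k=2)` returns one feature vector of length `n_filters` per text together with a cluster label per text. In a realistic setting the filters would be learned rather than random, and the embeddings would come from a model trained on a large corpus, as the abstract describes.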

Key words: Convolutional neural network, Deep learning, Document clustering, Short text, Word2vec

CLC Number: 

  • TP183