Computer Science ›› 2018, Vol. 45 ›› Issue (6A): 392-395.

• Big Data and Data Mining •

Research on Neural Network Clustering Algorithm for Short Text

SUN Zhao-ying, LIU Gong-shen

  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Online: 2018-06-20 Published: 2018-08-03
  • About the authors: SUN Zhao-ying (1993-), female, master's student; main research interests: machine learning and deep learning; E-mail: sunzy93@163.com (corresponding author). LIU Gong-shen, male, associate professor; main research interests: content security and social networks; E-mail: lgshen@sjtu.edu.cn.
  • Supported by:
    National Natural Science Foundation of China (61472248, 61431008)

Abstract: Short texts contain few words and carry little descriptive information, so their vector representations are high-dimensional, sparse, and noisy. Many existing clustering algorithms are therefore both inaccurate and inefficient on large-scale short-text collections. To address this problem, a short text clustering algorithm based on a deep convolutional neural network is proposed. Trained on a large-scale corpus, the word2vec model learns the latent semantic associations between words in short texts and represents each word as a multidimensional vector, so that each short text can in turn be represented as a multidimensional raw vector. A convolutional neural network then extracts features from these sparse, high-dimensional raw vectors, yielding low-dimensional text vectors whose features are more concentrated and effective. Finally, a traditional clustering algorithm is applied to the short texts. Experimental results show that the proposed method is feasible and effective for reducing the dimensionality of text vectors, and that it achieves short text clustering with an F-measure above 75%.
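The three stages named in the abstract (word vectors, convolutional feature extraction, traditional clustering) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the word vectors and convolution filters below are random and untrained stand-ins for word2vec and a trained network, k-means is assumed as the "traditional clustering algorithm", and every name (`embed`, `conv_max_pool`, `kmeans`, the toy vocabulary) is hypothetical. It only shows how data flows through the pipeline.

```python
import random

def embed(tokens, vectors, dim):
    """Stack per-word vectors into a len(tokens) x dim matrix (unknown words -> zeros)."""
    return [vectors.get(t, [0.0] * dim) for t in tokens]

def conv_max_pool(matrix, filters, width, dim):
    """Slide each filter over windows of `width` words, apply ReLU, then max-pool over positions."""
    # Pad short texts with zero rows so that at least one window exists.
    rows = matrix + [[0.0] * dim] * max(0, width - len(matrix))
    features = []
    for f in filters:  # each filter is a flat list of width * dim weights
        best = 0.0     # starting at 0 acts as the ReLU floor
        for start in range(len(rows) - width + 1):
            window = [x for row in rows[start:start + width] for x in row]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        features.append(best)
    return features    # one low-dimensional value per filter

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (assumed here as the traditional clustering step)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: random vectors stand in for word2vec, random filters for a trained CNN.
rng = random.Random(42)
dim, width, n_filters = 8, 2, 4
vocab = ["cheap", "flights", "hotel", "deal", "goal", "match", "team", "score"]
vectors = {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}
filters = [[rng.uniform(-1, 1) for _ in range(width * dim)] for _ in range(n_filters)]
texts = [["cheap", "flights", "deal"], ["hotel", "deal"],
         ["team", "score", "goal"], ["match", "score"]]

features = [conv_max_pool(embed(t, vectors, dim), filters, width, dim) for t in texts]
labels = kmeans(features, k=2)
print(labels)
```

Each text, whatever its length, ends up as a vector with one entry per filter, which is the dimensionality reduction the abstract refers to. With untrained filters the resulting clusters carry no semantic meaning.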

Key words: Convolution neural network, Deep learning, Document clustering, Short text, Word2vec
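The abstract reports an F-measure above 75% but does not say how it is computed. A common definition for clustering, assumed here and not confirmed by the source, matches each true class to its best-scoring cluster and weights the result by class size. The function name `clustering_f_measure` is hypothetical.

```python
from collections import Counter

def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-measure (an assumed, commonly used definition)."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij:
                precision = n_ij / n_j
                recall = n_ij / n_i
                best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

# A clustering that reproduces the classes exactly scores 1.0.
perfect = clustering_f_measure(["a", "a", "b", "b"], [1, 1, 0, 0])
# Putting everything in one cluster gives full recall but only 50% precision.
merged = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 0])
print(perfect, merged)
```

Under this definition the score does not depend on how cluster IDs are numbered, which matters because clustering output has no inherent label order.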

CLC Number:

  • TP183