Computer Science (计算机科学), 2018, Vol. 45, Issue 6A: 392-395.
SUN Zhao-ying (孙昭颖), LIU Gong-shen (刘功申)
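Abstract: Short texts contain few words and carry little descriptive information, so their representations are high-dimensional, feature-sparse, and noisy. Many existing clustering algorithms suffer from low accuracy and poor efficiency when clustering large-scale short texts. To address this problem, this paper proposes a short-text clustering algorithm based on deep-learning convolutional neural networks. Using a large-scale corpus, the algorithm applies the word2vec model to learn the latent semantic associations between words in short texts and represents each word as a multi-dimensional vector; each short text is then represented as a multi-dimensional raw vector. A convolutional neural network then extracts features from the sparse, high-dimensional raw vectors to obtain low-dimensional text vectors whose features are more concentrated and effective. Finally, a traditional clustering algorithm is used to cluster the short texts. Experimental results show that the proposed dimensionality reduction of text vectors is feasible and effective, and that the method achieves text clustering with an F-measure above 75%.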
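The three stages named in the abstract (word vectors, convolutional feature extraction, traditional clustering) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the tiny hand-written `toy_vectors` table stands in for a trained word2vec model, the two fixed `filters` stand in for learned convolution kernels, max-over-time pooling is one common way to reduce a convolution output to a fixed-length vector, and k-means is used as an example of a traditional clustering algorithm because the abstract does not name one.

```python
import random

def embed(tokens, vectors, dim):
    """Stack word vectors into a len(tokens) x dim matrix; unknown words become zeros."""
    return [vectors.get(t, [0.0] * dim) for t in tokens]

def conv_max_pool(matrix, filters, width):
    """Slide each filter over windows of `width` consecutive word vectors,
    apply ReLU, and keep the largest response per filter (max-over-time pooling)."""
    features = []
    for f in filters:  # each filter is a flat list of width * dim weights
        best = 0.0  # starting at 0.0 acts as the ReLU floor
        for start in range(len(matrix) - width + 1):
            window = [x for row in matrix[start:start + width] for x in row]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        features.append(best)
    return features

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 2-dimensional "word vectors" (a real word2vec model would
# typically use a much higher dimension and be trained on a large corpus).
toy_vectors = {
    "football": [1.0, 0.0], "match": [0.9, 0.1], "goal": [0.8, 0.0],
    "noodles": [0.0, 1.0], "soup": [0.1, 0.9], "recipe": [0.0, 0.8],
}
# Hypothetical fixed filters of width 2; in a trained CNN these are learned.
filters = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

texts = [
    ["football", "match", "goal"],
    ["goal", "match"],
    ["noodles", "soup", "recipe"],
    ["recipe", "soup"],
]
text_vectors = [conv_max_pool(embed(t, toy_vectors, 2), filters, 2) for t in texts]
labels = kmeans(text_vectors, k=2)
```

Each text ends up as one fixed-length vector with one entry per filter, regardless of how many words it contains, which is what makes the subsequent clustering step straightforward. On this toy data the two sports texts fall into one cluster and the two food texts into the other; which cluster receives which numeric label depends on the random initialization.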