Computer Science ›› 2019, Vol. 46 ›› Issue (12): 69-73. doi: 10.11896/jsjkx.190400107

• Big Data & Data Science •

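Short Text Feature Expansion and Classification Based on Non-negative Matrix Factorization

HUANG Meng-ting, ZHANG Ling, JIANG Wen-chao

  1. School of Computers, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2019-04-18 Online: 2019-12-15 Published: 2019-12-17
  • Corresponding author: JIANG Wen-chao (born 1977), male, Ph.D., lecturer. His main research interests include cloud computing, high-performance computing, and distributed systems. E-mail: june4567@21cn.com.
  • About the authors: HUANG Meng-ting (born 1994), female, master's student. Her main research interests include data mining and analysis. ZHANG Ling (born 1968), female, Ph.D., professor. Her main research interests include intelligent information processing, automation equipment, artificial intelligence, and computer vision.
  • Supported by:
    This work was supported by the Natural Science Foundation of Guangdong Province (2018A030313061), the Science and Technology Planning Project of Guangdong Province (2017B030305003, 2017B010124001), and the Industry-University-Research Cooperation Project of Guangdong Province (2017B090901005).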

Abstract: To address the feature sparsity of short texts, this paper proposes a feature extension method based on non-negative matrix factorization (NMFFE). The method considers only the data itself and does not rely on external resources to extend short-text features. First, the internal relationships among texts and among words are incorporated into the factorization of the text-word relationship matrix, and a word cluster indicator matrix is obtained with the graph dual regularization non-negative matrix tri-factorization (DNMTF) method. Then, the dimensionality of the word cluster indicator matrix is reduced to obtain the feature space. Finally, according to the degree of correlation between words, features are selected from the feature space and added to the short texts, which alleviates feature sparsity and improves classification accuracy. Experimental results show that, compared with the better-performing of the BOW and Char-CNN algorithms, NMFFE-based short text classification improves accuracy by 25.77%, 10.89% and 1.79% on the Web snippets, Twitter sports and AGnews datasets, respectively. These results demonstrate that NMFFE outperforms both BOW and Char-CNN in classification accuracy and robustness.

Key words: Correlation, Feature extension, Feature space, Non-negative matrix factorization, Short text classification

CLC Number: TP391
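
The three steps summarized in the abstract (factorize the text-word matrix, derive a low-dimensional word representation, add correlated words to each short text) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: it uses plain non-negative matrix tri-factorization with standard multiplicative updates and omits the graph regularization terms of DNMTF; it uses the word factor `G` directly as the feature space instead of a separate dimensionality-reduction step; and it measures word correlation by cosine similarity. The function names `nmtf` and `expand_features` and the parameters `k_docs`, `k_words` and `top_k` are hypothetical.

```python
import numpy as np


def nmtf(X, k_docs, k_words, n_iter=200, eps=1e-9, seed=0):
    """Plain non-negative tri-factorization X ~ F @ S @ G.T.

    X is a non-negative (n_docs, n_words) text-word matrix.
    Returns F (n_docs, k_docs), S (k_docs, k_words), G (n_words, k_words).
    Assumption: no graph regularization, unlike the DNMTF named in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    F = rng.random((n_docs, k_docs)) + eps
    S = rng.random((k_docs, k_words)) + eps
    G = rng.random((n_words, k_words)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for min ||X - F S G^T||_F^2 with F, S, G >= 0.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G


def expand_features(X, G, top_k=2):
    """Add to each text the top_k absent words most related to its own words."""
    # Assumption: cosine similarity between rows of G stands in for word correlation.
    G_unit = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-12)
    sim = G_unit @ G_unit.T
    X_exp = X.astype(float).copy()
    for d in range(X.shape[0]):
        present = X[d] > 0
        k = min(top_k, int((~present).sum()))
        if not present.any() or k == 0:
            continue
        # Score every word by its mean similarity to the words already in the text.
        score = sim[present].mean(axis=0)
        score[present] = -np.inf  # never re-add a word the text already contains
        new_words = np.argsort(score)[::-1][:k]
        X_exp[d, new_words] = 1.0
    return X_exp
```

Calling `nmtf` on a small text-word count matrix and passing the resulting `G` to `expand_features` returns a matrix of the same shape in which each non-empty row has gained up to `top_k` new non-zero entries; the expanded matrix can then be fed to any classifier.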