Computer Science, 2019, Vol. 46, Issue (12): 69-73. doi: 10.11896/jsjkx.190400107
黄梦婷, 张灵, 姜文超
HUANG Meng-ting, ZHANG Ling, JIANG Wen-chao
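Abstract: To address the feature sparsity of short texts, this paper proposes a feature expansion method based on non-negative matrix factorization (NMFFE). The method uses only the data itself and does not rely on external resources to expand short-text features. First, the internal relationships among texts and among words are incorporated into the factorization of the text-word relation matrix, and a word-cluster indicator matrix is obtained by dual-regularized non-negative matrix tri-factorization (DNMTF). Next, the word-cluster indicator matrix is reduced in dimension to obtain the feature space. Finally, features are selected from the feature space according to the degree of correlation between words and added to the short texts, which alleviates the feature sparsity of short texts and improves classification accuracy. Experimental results show that, compared with the better performer of the BOW and Char-CNN algorithms, short-text classification based on NMFFE improves accuracy by 25.77%, 10.89%, and 1.79% on the Web snippets, Twitter sports, and AGnews datasets, respectively, indicating that NMFFE outperforms BOW and Char-CNN in both classification accuracy and robustness.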
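The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain non-negative matrix tri-factorization with standard multiplicative updates, leaves out the dual regularization terms and the dimension-reduction step (the abstract does not give their details), and ranks candidate words by cosine similarity between rows of the word-cluster matrix as an assumed measure of word correlation. The function names `nmtf` and `expand` and all parameters are hypothetical.

```python
import numpy as np


def nmtf(X, k_docs, k_words, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative docs-by-words matrix X as F @ S @ G.T.

    F (docs x k_docs) plays the role of a document-cluster matrix and
    G (words x k_words) that of a word-cluster indicator matrix.
    Assumption: no graph regularization, unlike the DNMTF named in the abstract.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k_docs)) + 0.1
    S = rng.random((k_docs, k_words)) + 0.1
    G = rng.random((m, k_words)) + 0.1
    for _ in range(n_iter):
        # Multiplicative updates for the squared Frobenius reconstruction error.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G


def expand(doc_word_ids, G, top_n=2):
    """Return ids of the top_n words most related to the words in one short text.

    Relatedness is the summed cosine similarity between rows of G (an assumption);
    words already present in the text are excluded.
    """
    Gn = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-12)
    sim = Gn @ Gn.T                      # word-by-word cosine similarity
    score = sim[doc_word_ids].sum(axis=0)
    score[doc_word_ids] = -np.inf        # never re-add an existing word
    return [int(i) for i in np.argsort(-score)[:top_n]]


# Toy corpus: 6 short texts over 6 words, forming two rough topics.
X = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

F, S, G = nmtf(X, k_docs=2, k_words=2)
extra_words = expand([0, 1], G, top_n=2)   # candidate features for the first text
```

In this sketch each short text is given as a list of word ids, and the ids returned by `expand` would be appended to it before classification. How many words to add (`top_n`) and the number of clusters are free choices here, not values taken from the paper.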