Computer Science, 2019, Vol. 46, Issue (12): 69-73. doi: 10.11896/jsjkx.190400107

• Big Data & Data Science •

Short Text Feature Expansion and Classification Based on Non-negative Matrix Factorization

HUANG Meng-ting, ZHANG Ling, JIANG Wen-chao   

  1. School of Computers, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2019-04-18 Online: 2019-12-15 Published: 2019-12-17

Abstract: This paper proposes a feature extension method based on non-negative matrix factorization (NMFFE) to overcome the feature sparsity of short texts. The method considers only the data itself and does not rely on external resources for feature extension. First, the internal relationships of texts and of words are taken into account when factorizing the text-word relationship matrix, and a word-cluster indicator matrix is obtained by the graph dual regularization non-negative matrix tri-factorization (DNMTF) method. Then, the dimensionality of the word-cluster indicator matrix is reduced to obtain the feature space. Finally, according to the degree of correlation between words, features from the feature space are added to each short text, which alleviates the feature sparsity of short texts and improves the accuracy of text classification. Experimental results show that, compared with the better-performing of the BOW and Char-CNN algorithms, the short text classification accuracy of the NMFFE algorithm increases by 25.77%, 10.89% and 1.79% on the Web snippets, Twitter sports and AGnews datasets, respectively. These results indicate that NMFFE outperforms BOW and Char-CNN in terms of classification accuracy and robustness.
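
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain non-negative matrix tri-factorization with standard multiplicative updates (the graph dual regularization terms of DNMTF are omitted), and it assumes that word correlation is measured by cosine similarity between rows of the word-cluster indicator matrix. The function names `nmtf` and `expand_features` and the parameters `top_k` and `weight` are hypothetical and chosen only for illustration.

```python
import numpy as np

def nmtf(X, k_docs, k_words, n_iter=200, eps=1e-9, seed=0):
    """Plain tri-factorization X ~ F @ S @ G.T with F, S, G >= 0.

    Simplification: no graph regularization terms, unlike DNMTF.
    F: document-cluster matrix, G: word-cluster indicator matrix.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k_docs)) + eps
    S = rng.random((k_docs, k_words)) + eps
    G = rng.random((m, k_words)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G

def expand_features(X, G, top_k=1, weight=0.5, eps=1e-9):
    """Add to each document the top_k absent words most related to it.

    Assumption: word relatedness is the cosine similarity between
    rows of G; the paper's exact correlation measure may differ.
    """
    Gn = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)
    sim = Gn @ Gn.T                 # word-word similarity
    np.fill_diagonal(sim, 0.0)      # ignore self-similarity
    scores = X @ sim                # relatedness of each word to each doc
    scores[X > 0] = 0.0             # only words the document lacks
    X_new = X.astype(float).copy()
    for i in range(X.shape[0]):
        top = np.argsort(scores[i])[::-1][:top_k]
        # Scaled so that an added feature never exceeds `weight`.
        X_new[i, top] += weight * scores[i, top] / (scores[i].max() + eps)
    return X_new

# Toy document-word count matrix: 5 short texts, 4 vocabulary words.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1],
              [1, 0, 0, 0]], dtype=float)
F, S, G = nmtf(X, k_docs=2, k_words=2)
X_expanded = expand_features(X, G)
print(np.round(X_expanded, 2))
```

In this toy setting the last text contains a single word, and the expansion step gives it a small weight on the absent word that is most similar under `G`. The expanded matrix would then be passed to any standard classifier in place of the original bag-of-words features.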

Key words: Correlation, Feature extension, Feature space, Non-negative matrix factorization, Short text classification

CLC Number: TP391