Computer Science, 2017, Vol. 44, Issue (11): 289-296. doi: 10.11896/j.issn.1002-137X.2017.11.044

Similarity Measure for Text Classification Based on Feature Subjection Degree

CHI Yun-xian, ZHAO Shu-liang, LUO Yan, ZHAO Jun-peng, GAO Lin and LI Chao   

  • Online: 2018-12-01 Published: 2018-12-01

Abstract: Similarity-based text classification is a widely used approach. The algorithm SMTCFSD (similarity measure for text classification based on feature subjection degree) measures the similarity of documents through the subjection relationship between feature words and documents. For a pair of documents, SMTCFSD divides the words into a total subjection word set, a partial subjection word set and a none subjection word set according to this relationship, and defines a subjection function over the three sets. Words in the total subjection word set belong to both documents, and their subjection degree decreases as the difference between the two documents on that word increases. Words that belong to only one of the two documents fall into the partial subjection word set, where the subjection degree is a fixed value. Words in the none subjection word set belong to neither document, so their subjection degree is zero. For similarity measurement, the total subjection relationship is more important than the partial subjection relationship. Because the word sets of documents in the same category are similar to each other, while those of documents in different categories differ greatly, assigning feature words different values according to their subjection degree noticeably improves classification accuracy. Experimental results on data sets from Reuters-21578 and 20-Newsgroups show that SMTCFSD outperforms widely used similarity measures.
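The three-way split described in the abstract can be illustrated with a minimal sketch. The abstract does not give the subjection function itself, so everything concrete below is an assumption made for illustration only: the function name `subjection_similarity`, the exponential decay used for total subjection words, the default partial degree of 0.5, and the normalization by the number of words that occur in at least one of the two documents. It is not the formula from the paper.

```python
import math


def subjection_similarity(doc_a, doc_b, vocabulary, partial_degree=0.5):
    """Illustrative similarity of two documents given as {word: weight} dicts.

    All concrete choices here are assumptions, not the paper's definition:
    - total subjection (word in both documents): the degree starts at 1.0 and
      decays toward `partial_degree` as the weights differ more, so it always
      stays above the partial degree;
    - partial subjection (word in exactly one document): fixed `partial_degree`;
    - none subjection (word in neither document): degree 0, and the word is
      left out of the normalizer.
    """
    score = 0.0
    counted = 0
    for word in vocabulary:
        a = doc_a.get(word, 0.0)
        b = doc_b.get(word, 0.0)
        if a > 0 and b > 0:
            # Total subjection word: degree shrinks as |a - b| grows.
            degree = partial_degree + (1.0 - partial_degree) * math.exp(-abs(a - b))
        elif a > 0 or b > 0:
            # Partial subjection word: constant degree.
            degree = partial_degree
        else:
            # None subjection word: contributes nothing.
            continue
        score += degree
        counted += 1
    return score / counted if counted else 0.0


if __name__ == "__main__":
    vocab = ["oil", "price", "game", "team"]
    d1 = {"oil": 2.0, "price": 1.0}
    d2 = {"oil": 1.0, "price": 1.0, "game": 1.0}
    print(subjection_similarity(d1, d2, vocab))
```

Under these assumptions, two identical documents score 1.0 and two documents with no words in common score exactly `partial_degree`, which reflects the abstract's statement that the total subjection relationship counts for more than the partial one.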

Key words: Data mining, Text classification, Similarity measure, Subjection degree

