Computer Science ›› 2017, Vol. 44 ›› Issue (11): 289-296. doi: 10.11896/j.issn.1002-137X.2017.11.044

• Artificial Intelligence •

Similarity Measure for Text Classification Based on Feature Subjection Degree

CHI Yun-xian, ZHAO Shu-liang, LUO Yan, ZHAO Jun-peng, GAO Lin and LI Chao

  1. College of Resources and Environmental Science, Hebei Normal University, Shijiazhuang 050024; College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang 050024
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (71271067), the Major Program of the National Social Science Foundation of China (13&ZD091), the Science and Technology Research Project of Higher Education Institutions of Hebei Province (QN2014196) and the Master's Fund of Hebei Normal University (xj2015003).

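Abstract (Chinese version, translated): Similarity-based text classification is a currently popular approach to text processing. The similarity measure for text classification based on feature subjection degree uses the subjection relationship between features and documents to measure document similarity and thereby classify text. Based on this relationship, the method partitions features into total subjection, partial subjection and none subjection word sets, and defines a subjection degree function over the three sets. A total subjection word set belongs to both documents, and its subjection degree decreases as the weight difference grows. A partial subjection word set belongs to only one of the two documents, and its subjection degree is a fixed value. A none subjection word set has no subjection relationship with either document, and its subjection degree is zero. When measuring similarity, the partial subjection relationship ranks higher than the total subjection relationship. Because documents of the same class have similar word sets while documents of different classes have clearly different word sets, measuring similarity by the subjection degree between features and documents clearly delimits the relationship between word sets and classes and improves classification accuracy. Finally, classification effectiveness is verified on the 20-Newsgroups and Reuters-21578 data sets. The results show that the similarity measure based on feature subjection degree outperforms currently popular similarity measures.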

Abstract: Similarity-based text classification is a popular approach. The algorithm SMTCFSD (similarity measure for text classification based on feature subjection degree) measures the similarity of documents through the subjection relationship between feature words and documents. SMTCFSD divides words into total subjection, partial subjection and none subjection word sets according to this relationship, and defines a subjection function over the three sets. Words in a total subjection word set belong to both documents, and their subjection degree decreases as the difference between their weights in the two documents increases. Words that belong to only one of the two documents form the partial subjection word set, whose subjection degree is a fixed value. The subjection degree of a none subjection word set is zero, because its words belong to neither document. For the similarity measure, the total subjection relationship is more important than the partial subjection relationship. Because the word sets of documents in the same category are similar, while those of documents in different categories differ markedly, assigning feature words different values according to their subjection degree improves classification accuracy. Experiments on data sets from Reuters-21578 and 20-Newsgroups show that SMTCFSD outperforms widely used similarity measures.

Key words: Data mining, Text classification, Similarity measure, Subjection degree
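
The three-way partition described in the abstract can be illustrated with a minimal Python sketch. This is not the SMTCFSD formula, which the abstract does not give. The dictionary representation of a document, the exponential decay used for total subjection words, the `partial_value` constant and the averaging in `similarity` are all assumptions chosen only to satisfy the stated properties: the degree decreases with the weight difference for total subjection words, is a fixed value for partial subjection words, and is zero for none subjection words.

```python
import math
from typing import Dict, Set, Tuple

# Hypothetical document representation: feature word -> non-negative weight.
Weights = Dict[str, float]


def partition_features(
    d1: Weights, d2: Weights, vocabulary: Set[str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split the vocabulary into total, partial and none subjection word sets."""
    in1 = {w for w in vocabulary if d1.get(w, 0.0) > 0.0}
    in2 = {w for w in vocabulary if d2.get(w, 0.0) > 0.0}
    total = in1 & in2                # words present in both documents
    partial = in1 ^ in2              # words present in exactly one document
    none = vocabulary - (in1 | in2)  # words present in neither document
    return total, partial, none


def subjection_degree(w1: float, w2: float, partial_value: float = 0.5) -> float:
    """Assumed subjection function; the decay and the constant are placeholders."""
    if w1 > 0.0 and w2 > 0.0:
        # Total subjection: decreases as the weight difference grows.
        return math.exp(-abs(w1 - w2))
    if w1 > 0.0 or w2 > 0.0:
        # Partial subjection: a fixed value.
        return partial_value
    # None subjection: zero.
    return 0.0


def similarity(
    d1: Weights, d2: Weights, vocabulary: Set[str], partial_value: float = 0.5
) -> float:
    """Assumed aggregation: mean degree over words found in at least one document."""
    total, partial, _ = partition_features(d1, d2, vocabulary)
    active = total | partial
    if not active:
        return 0.0
    score = sum(
        subjection_degree(d1.get(w, 0.0), d2.get(w, 0.0), partial_value)
        for w in active
    )
    return score / len(active)
```

For two toy documents `{"a": 1.0, "b": 2.0}` and `{"a": 1.0, "c": 3.0}` over the vocabulary `{"a", "b", "c", "d"}`, `partition_features` places `a` in the total set, `b` and `c` in the partial set, and `d` in the none set. How the partial value should relate to the total subjection degree is left as a parameter here, since the sketch does not attempt to reproduce the paper's weighting.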

