Computer Science ›› 2016, Vol. 43 ›› Issue (9): 82-86. doi: 10.11896/j.issn.1002-137X.2016.09.015

• 3rd CCF Big Data Conference, 2015 •

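Research on Effect of Adding Internal Semantic Relationship into Text Categorization

ZHU Jian-lin, YANG Xiao-ping and PENG Jing-qiao

  1. School of Finance, Renmin University of China, Beijing 100083; School of Information, Renmin University of China, Beijing 100083; School of Information, Renmin University of China, Beijing 100083
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (71271209), the Beijing Natural Science Foundation (4132067), the Humanities and Social Sciences Youth Foundation of the Ministry of Education (11YJC630268), and the Natural Science Foundation of Hebei Province (A2013410011).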

Abstract: To improve text categorization under the vector space model without adding external semantic knowledge, this paper mines the word-word and text-text relationships contained in the corpus itself and merges them into the original word-text matrix in different ways. Two common algorithms, SVM and KNN, are then used in comparative text categorization experiments on a strongly domain-specific legal corpus and a broad-domain news corpus. The experimental results show that adding word-word and text-text relationships usually improves categorization, but the effect differs across classification methods and domain characteristics, so the way of adding them should be chosen accordingly in practice.

Key words: Vector space model, Text categorization, Semantic mining, Feature matrix
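
The abstract does not state how the relationships are mined or how they are merged into the word-text matrix, so the following is only a minimal sketch under stated assumptions: cosine similarity between rows stands in for the word-word relationship, cosine similarity between columns stands in for the text-text relationship, and each is merged by matrix multiplication. The toy data and all function names are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(vectors):
    """Pairwise cosine similarities between the given vectors."""
    return [[cosine(r, s) for s in vectors] for r in vectors]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def column(m, j):
    return [row[j] for row in m]

def enrich(W, vec):
    """Spread a text vector's weights onto related words using W."""
    return [sum(w * x for w, x in zip(row, vec)) for row in W]

# Toy word-text matrix (assumed data): rows are words, columns are texts.
# Texts 0 and 1 are "legal", texts 2 and 3 are "sports".
X = [
    [2, 1, 0, 0],  # court
    [1, 0, 0, 0],  # judge
    [0, 0, 1, 0],  # match
    [0, 0, 2, 1],  # goal
]

# Assumption: word-word relationship = cosine similarity of word rows.
W = similarity_matrix(X)
# Assumption: text-text relationship = cosine similarity of text columns.
D = similarity_matrix(transpose(X))

X_w = matmul(W, X)  # word-text matrix with word-word relationships merged in
X_d = matmul(X, D)  # word-text matrix with text-text relationships merged in

# A new text that contains only the word "judge".
query = [0, 1, 0, 0]

# In the original space the query shares no word with text 1 ("court" only).
sim_original = cosine(query, column(X, 1))
# After merging word-word relationships, "judge" and "court" are linked.
sim_enriched = cosine(enrich(W, query), column(X_w, 1))

print("similarity before:", sim_original)
print("similarity after: ", sim_enriched)
```

In this sketch the query and text 1 have no word in common, so a KNN classifier on the raw matrix could not relate them, whereas the enriched vectors overlap because "court" and "judge" co-occur in text 0. The same enriched matrices could be passed to an SVM or KNN implementation in place of the raw matrix.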

