Research on Effect of Adding Internal Semantic Relationship into Text Categorization

doi:10.11896/j.issn.1002-137X.2016.09.015

Abstract

To improve text categorization without adding external knowledge, this paper presents a feature-matrix-based categorization framework. First, knowledge internal to the corpus is mined and added to the original word-text matrix in different ways. Two common algorithms, SVM and KNN, are then used in comparative text categorization experiments on a highly domain-specific legal corpus and a broad-domain news corpus. Experimental results show that adding the semantic relationships extracted from the corpus to the original matrix generally helps, but the method of adding them should be chosen according to the classification method and the characteristics of the domain.

Key words: Vector space model, Text categorization, Semantic mining, Feature matrix

ZHU Jian-lin, YANG Xiao-ping and PENG Jing-qiao. Research on Effect of Adding Internal Semantic Relationship into Text Categorization[J]. Computer Science, 2016, 43(9): 82-86.

