Computer Science, 2016, Vol. 43, Issue (2): 254-258. doi: 10.11896/j.issn.1002-137X.2016.02.053

Semantic-based Feature Extraction Method for Document

JIANG Fang, LI Guo-he and YUE Xiang   

  • Online: 2018-12-01 Published: 2018-12-01

Abstract: Feature extraction from Chinese documents is an important part of document processing and strongly influences document classification. Existing document feature extraction methods have several shortcomings: they produce high-dimensional feature vectors, depend on training sets, and ignore low-frequency keywords. In this paper, the semantic distance between words is calculated from a synonym dictionary, theme-related words for each class are then selected by density clustering, and finally the feature words are selected from the theme-related words using the information gain algorithm. To validate the proposed method, one validation experiment and one comparison experiment were designed, and evaluation indexes including the macro-F value and the micro-F value were calculated. The experimental results show that the proposed document feature extraction method performs better than the traditional algorithms.

Key words: Feature word, Semantic distance, Information gain, Text classification
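
The abstract names information gain as the final selection step and the macro-F and micro-F values as evaluation indexes, but gives no formulas. The sketch below illustrates the standard textbook definitions only; it is not the authors' implementation, and it omits the synonym-dictionary distance and density clustering steps, whose details the abstract does not state. The function names (`information_gain`, `macro_micro_f1`), the set-of-words document representation, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Textbook IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t).

    `docs` is a list of word sets (an assumed representation),
    `labels` the class of each document.
    """
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(labels)
            - len(with_term) / n * entropy(with_term)
            - len(without_term) / n * entropy(without_term))


def macro_micro_f1(y_true, y_pred):
    """Return (macro-F1, micro-F1) for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        return 0.0, 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy corpus (hypothetical data, not from the paper).
docs = [{"ball", "team"}, {"ball", "score"},
        {"stock", "market"}, {"stock", "bank"}]
labels = ["sports", "sports", "finance", "finance"]

ig_ball = information_gain(docs, labels, "ball")
ig_team = information_gain(docs, labels, "team")
macro, micro = macro_micro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In this toy corpus, "ball" separates the two classes perfectly and so receives a higher information gain than "team", which appears in only one document. Macro-F weights every class equally, while micro-F weights every document equally, which is why the two values can differ when classes are imbalanced.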
