Computer Science, 2020, Vol. 47, Issue (3): 110-115. doi: 10.11896/jsjkx.190700041

• Database & Big Data & Data Science •

Keywords Extraction Method Based on Semantic Feature Fusion

GAO Nan,LI Li-juan,Wei-william LEE,ZHU Jian-ming   

  1. (School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China)
  • Received: 2019-06-04 Online: 2020-03-15 Published: 2020-03-30
  • About author: GAO Nan, born in 1983, Ph.D, is a member of China Computer Federation. Her main research interests include data mining, machine learning and intelligent transportation systems.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61702456) and Zhejiang Public Welfare Technology Research Program (2017C33108).

Abstract: Keyword extraction is widely used in text mining and is a prerequisite for automatic text summarization, classification and clustering, so extracting high-quality keywords is important. Most current keyword extraction methods consider only statistical features and ignore the implicit semantic features of words, which lowers extraction accuracy and leaves the extracted keywords lacking semantic information. To address this problem, this paper designs a method for quantifying the relationship between words and text topics. First, word vectors are used to mine the contextual semantic relations of words. Then, the main semantic features of the text are extracted by clustering. Finally, the distance between each word and the topic is computed with a similarity-based distance measure and taken as the semantic feature of the word. By combining this semantic feature with word frequency, length, location, language and other descriptive features of words, a keyword extraction method for short text, named SFKE, is proposed. The method assesses the importance of words from both statistical and semantic aspects, and can therefore extract the most relevant keyword set by integrating multiple factors. Experimental results show that the multi-feature method significantly outperforms the TFIDF, TextRank, Yake, KEA and AE methods; its F-Score is 9.3% higher than that of AE. In addition, information gain is used to evaluate the importance of the features, and the results show that the F-Score of the model increases by 7.2% after the semantic feature is added.
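The three steps for the semantic feature (word vectors, clustering, word-to-topic distance) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional vectors, the use of k-means with cosine similarity, the choice of k = 2, and the function names `kmeans` and `semantic_feature` are all assumptions made for illustration. In practice the vectors would come from a trained word-embedding model.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means that assigns each vector to its most cosine-similar centroid.

    Assumption: the abstract only says "clustering"; k-means is one possible choice.
    """
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            best = max(range(k), key=lambda i: cosine(v, centroids[i]))
            clusters[best].append(v)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


def semantic_feature(word_vec, centroids):
    """Similarity of a word to its closest topic centroid (hypothetical definition)."""
    return max(cosine(word_vec, c) for c in centroids)


# Toy word vectors (made up for illustration, not taken from the paper).
toy_vectors = {
    "keyword": [0.9, 0.1],
    "extraction": [0.8, 0.2],
    "semantic": [0.85, 0.15],
    "weather": [0.1, 0.9],
    "rain": [0.2, 0.8],
}
centroids = kmeans(list(toy_vectors.values()), k=2)
features = {w: semantic_feature(v, centroids) for w, v in toy_vectors.items()}
```

The resulting score per word would then be one input among the statistical features (frequency, length, location and so on) passed to a classifier.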

Key words: Text mining, Statistical features, Semantic features, Support vector machine, Classification model

CLC Number: 

  • TP391