Computer Science ›› 2019, Vol. 46 ›› Issue (6): 29-34. doi: 10.11896/j.issn.1002-137X.2019.06.003

• Big Data & Data Science •

Newly-emerging Domain Word Detection Method Based on Syntactic Analysis and Term Vector

ZHAO Zhi-bin1, SHI Yu-xin1, LI Bin-yang2

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  2. School of Information Science and Technology, University of International Relations, Beijing 100091, China
  • Received: 2018-08-18 Published: 2019-06-24
  • Corresponding author: ZHAO Zhi-bin (born 1975), male, Ph.D., associate professor, CCF member. His research interests include distributed computing, Web data mining, and big data management. E-mail: zhaozb@mail.neu.edu.cn
  • About the authors: SHI Yu-xin (born 1994), male, master's student. His research interest is Web data mining. LI Bin-yang (born 1982), male, Ph.D., associate professor, CCF member. His research interests include natural language processing and social computing.
  • Supported by:
    the National Key R&D Program of China (2018YFB1004700), the National Natural Science Foundation of China (61472070), and the University Cooperation Project for New Technology Research of the Aerospace Professional Department (SKX182010023).

Abstract: Many existing words and phrases come to be used in domain texts in which they have never appeared before. Such words and phrases are called newly-emerging domain words. Detecting them gives researchers insight into the latest developments and public opinion in a domain, so the task is of considerable practical value. This paper proposes a method for detecting newly-emerging domain words based on dependency syntactic analysis and term vectors (word vectors). First, the concept of a syntactic dictionary is introduced, and a method for constructing a domain syntactic dictionary is proposed that combines dependency parsing of sentences with TF-IDF values computed on the training corpus. Next, the domain syntactic dictionary is combined with term vectors to detect newly-emerging domain words. Finally, the method is evaluated on real comment data from a skin-care products forum. The experimental results show that the constructed syntactic dictionary is of high quality and that the proposed method performs well in newly-emerging domain word detection.

Key words: Newly-emerging domain words, Syntactic analysis, Syntactic dictionary, Term vector

CLC Number: TP391
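The abstract names two generic building blocks, TF-IDF weighting and term-vector similarity, without giving their details on this page. The following is only a minimal sketch of those two ingredients in plain Python, not the authors' algorithm: the dependency-parsing step is omitted entirely, and the function names, the seed-word idea, and the similarity threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_new_words(vectors, seed_words, threshold):
    """Hypothetical filter: words outside seed_words whose vector lies
    close to at least one seed word that has a vector."""
    found = []
    for word, vec in vectors.items():
        if word in seed_words:
            continue
        best = max(
            (cosine(vec, vectors[s]) for s in seed_words if s in vectors),
            default=0.0,
        )
        if best >= threshold:
            found.append(word)
    return sorted(found)
```

In this sketch, a term that occurs in every document receives a TF-IDF score of zero, so only terms that are distinctive for some documents stand out. A real pipeline would obtain the tokens and dependency relations from a parser and the vectors from a trained embedding model. Both are assumed as inputs here.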