Computer Science (计算机科学), 2019, Vol. 46, Issue 6: 29-34. doi: 10.11896/j.issn.1002-137X.2019.06.003
ZHAO Zhi-bin (赵志滨), SHI Yu-xin (石玉鑫), LI Bin-yang (李斌阳)
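Abstract: Many existing words and phrases may come to be used in the texts of a domain where they have never been used before; such words and phrases are called domain new words. Discovering domain new words gives researchers in that domain up-to-date information on its development and helps them analyze the latest public opinion within it, so the task is of considerable significance. To address the problem of domain new word discovery, this paper proposes a method based on dependency parsing and word embeddings. First, the concept of a syntactic dictionary is introduced, and a method for constructing a domain syntactic dictionary is proposed, based on dependency parsing combined with the calculation of TF-IDF values. Then, the domain syntactic dictionary is used together with word embedding techniques to design the domain new word discovery method. Finally, the proposed method is validated on a real text dataset collected from a skincare product forum. The experimental results show that the constructed syntactic dictionary is of high quality and that the proposed method performs well in domain new word discovery.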
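The abstract names two generic building blocks, TF-IDF values and word embeddings, without giving formulas. The following is a minimal Python sketch of those two building blocks only, not the authors' method: the function names `tf_idf` and `cosine`, the unsmoothed `tf * log(N / df)` weighting, and the use of cosine similarity to compare word vectors are all assumptions made for illustration. The sketch expects text that has already been segmented into tokens.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed weighting (not taken from the paper):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under this weighting, a term that occurs in every document scores zero, so only terms concentrated in a subset of the documents stand out. How the paper combines such scores with dependency relations to build the syntactic dictionary is not stated in the abstract and is not reproduced here.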