Computer Science ›› 2020, Vol. 47 ›› Issue (8): 255-260. doi: 10.11896/jsjkx.191000163

• Artificial Intelligence •


Word Embedding Optimization for Low-frequency Words with Applications in Short-text Classification

CHENG Jing1,2, LIU Na-na1,2, MIN Ke-rui3, KANG Yu4, WANG Xin1,2, ZHOU Yang-fan1,2

    1 School of Computer Science, Fudan University, Shanghai 201203, China
    2 Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 201203, China
    3 META SOTA, Shanghai 200135, China
    4 Microsoft Research Asia, Beijing 100080, China
  • Online: 2020-08-15  Published: 2020-08-10
  • Corresponding author: KANG Yu (kangyu159@gmail.com)
  • About author: CHENG Jing, born in 1993, postgraduate (jcheng17@fudan.edu.cn). Her main research interests include text classification.
    KANG Yu, born in 1988, Ph.D, is a member of China Computer Federation. His main research interests include data-driven service intelligence and improving cloud computing service performance based on data analysis methods.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61702107) and CERNET Innovation Project (NGII20180611).

Abstract: Many Natural Language Processing (NLP) tasks benefit from word embeddings pre-trained on large-scale corpora. Since pre-trained embeddings carry only the general semantic features of the large corpus, they usually need to be fine-tuned when applied to a specific downstream task so that they better fit the target task. However, low-frequency words in the target corpus lack training samples and therefore receive no stable gradient information during fine-tuning, so their embeddings cannot be updated effectively. Yet in short-text classification these low-frequency words can be important indicators of the class label, which makes it necessary to obtain better low-frequency word representations for the specific task. To address this problem, this paper proposes a low-frequency word embedding update algorithm that is agnostic to the downstream task model. Through a K-nearest-neighbor-based embedding offset computation, the algorithm uses the task-specific information acquired during fine-tuning by high-frequency words that are similar to a low-frequency word in the general embedding space to guide the update of that low-frequency word, yielding more accurate representations suited to the current task context. With TextCNN as the baseline model and two general pre-trained embeddings obtained from word2vec and GloVe, the algorithm is evaluated on three public short-text datasets. Experimental results show that after the low-frequency word representations are updated by the proposed algorithm, classification accuracy reaches 84.3%-94%, an improvement of 0.4%-1.4% over the un-updated baselines. This demonstrates the effectiveness of the optimization algorithm, further confirms the influence of low-frequency words on short-text classification, and provides a reference for future work on short-text classification.

Key words: Word embedding, Low-frequency word, Short text classification, Fine-tuning
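
The abstract only outlines the update procedure; below is a minimal sketch of the K-nearest-neighbor offset idea, assuming NumPy arrays for the embedding matrices before and after fine-tuning. The names and design details here (update_low_freq_embeddings, cosine similarity, plain averaging of neighbor offsets) are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def update_low_freq_embeddings(pretrained, fine_tuned, low_freq_ids, high_freq_ids, k=10):
    """Sketch: borrow task-specific offsets from similar high-frequency words.

    pretrained   : (V, d) general embeddings before fine-tuning
    fine_tuned   : (V, d) the same embeddings after task fine-tuning
    low_freq_ids : indices of words too rare to fine-tune reliably
    high_freq_ids: indices of words assumed to be fine-tuned well
    """
    # Task-specific shift each high-frequency word acquired during fine-tuning.
    offsets = fine_tuned[high_freq_ids] - pretrained[high_freq_ids]

    # Normalize the *general* embeddings of the high-frequency words so that
    # a dot product computes cosine similarity.
    hf = pretrained[high_freq_ids]
    hf_norm = hf / np.linalg.norm(hf, axis=1, keepdims=True)

    updated = fine_tuned.copy()
    for i in low_freq_ids:
        v = pretrained[i] / np.linalg.norm(pretrained[i])
        sims = hf_norm @ v                 # cosine similarity to every high-freq word
        nearest = np.argsort(-sims)[:k]    # K nearest neighbors in the general space
        # Move the rare word by the averaged task offset of its neighbors.
        updated[i] = pretrained[i] + offsets[nearest].mean(axis=0)
    return updated

In this sketch every neighbor contributes equally; a similarity-weighted average would be a natural variant, and K (as well as the frequency threshold separating the two word groups) would presumably be tuned per dataset.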

CLC Number: TP391