Computer Science ›› 2020, Vol. 47 ›› Issue (8): 255-260. doi: 10.11896/jsjkx.191000163


Word Embedding Optimization for Low-frequency Words with Applications in Short-text Classification

CHENG Jing1,2, LIU Na-na1,2, MIN Ke-rui3, KANG Yu4, WANG Xin1,2, ZHOU Yang-fan1,2

  1 School of Computer Science, Fudan University, Shanghai 201203, China
    2 Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 201203, China
    3 META SOTA, Shanghai 200135, China
    4 Microsoft Research, Beijing 100080, China
  • Online: 2020-08-15  Published: 2020-08-10
  • About author: CHENG Jing, born in 1993, postgraduate. Her main research interests include text classification.
    KANG Yu, born in 1988, Ph.D, is a member of China Computer Federation. His main research interests include data-driven service intelligence and improving cloud computing service performance based on data analysis methods.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61702107) and CERNET Innovation Project (NGII20180611).

Abstract: Many natural language processing (NLP) tasks have benefited from the public availability of general-purpose word vector representations trained on large-scale datasets. Because pre-trained word embeddings capture only the general semantic features of a large corpus, they usually need to be fine-tuned to suit a particular downstream task. During fine-tuning, however, words with low occurrence frequencies receive little stable gradient information, even though low-frequency terms often carry important class-specific information in short-text classification. It is therefore necessary to obtain better task-specific embeddings for these words. To address this problem, this paper proposes a model-agnostic algorithm that optimizes the vector representations of low-frequency words for the target task: the update information observed on common words is leveraged to guide the embedding updates of rare words, yielding more effective low-frequency embeddings. Evaluation on three public short-text classification tasks shows that the proposed algorithm produces better task-specific embeddings for rarely occurring words and improves model performance by 0.4% to 1.4%, confirming the positive influence of low-frequency words on short-text classification.
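To make the idea concrete, below is a minimal Python sketch of one way such a transfer of update information could work. This is an illustrative assumption, not the authors' published algorithm: after fine-tuning, each low-frequency word's pre-trained vector is shifted by the average fine-tuning delta of its k most similar high-frequency words, so rare words borrow the task-specific update signal of their frequent neighbors. The function and parameter names (refine_rare_embeddings, freq_threshold, k) are hypothetical.

    import numpy as np

    def refine_rare_embeddings(pretrained, finetuned, counts,
                               freq_threshold=10, k=5):
        # pretrained / finetuned: dict mapping word -> 1-D numpy vector (same vocabulary)
        # counts: dict mapping word -> frequency in the target-task corpus
        frequent = [w for w, c in counts.items() if c >= freq_threshold]
        rare = [w for w, c in counts.items() if c < freq_threshold]

        F = np.stack([pretrained[w] for w in frequent])        # (n_frequent, dim)
        F_norm = F / np.linalg.norm(F, axis=1, keepdims=True)
        # Task-specific update information observed on frequent words during fine-tuning.
        deltas = np.stack([finetuned[w] - pretrained[w] for w in frequent])

        refined = dict(finetuned)
        for w in rare:
            v = pretrained[w]
            sims = F_norm @ (v / (np.linalg.norm(v) + 1e-8))   # cosine similarities
            neighbors = np.argsort(-sims)[:k]                  # k nearest frequent words
            # Guide the rare word's embedding with its neighbors' average update.
            refined[w] = v + deltas[neighbors].mean(axis=0)
        return refined

Because such a procedure only post-processes embedding vectors, it is model-agnostic in the sense the abstract describes: it can sit on top of any classifier that fine-tunes pre-trained embeddings.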

Key words: Fine-tuning, Low-frequency word, Short text classification, Word embedding

CLC Number: TP391
[1] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]∥Advances in Neural Information Processing Systems. New York: MIT Press, 2013: 3111-3119.
[2] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[J]. arXiv:1301.3781, 2013.
[3] PENNINGTON J, SOCHER R, MANNING C. GloVe: Global vectors for word representation[C]∥Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543.
[4] ZHANG Y, WALLACE B. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification[J]. arXiv:1510.03820, 2015.
[5] CAMACHO-COLLADOS J, PILEHVAR M T, NAVIGLI R. NASARI: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities[J]. Artificial Intelligence, 2016, 240: 36-64.
[6] CALISKAN A, BRYSON J J, NARAYANAN A. Semantics derived automatically from language corpora contain human-like biases[J]. Science, 2017, 356(6334): 183-186.
[7] WANG Y, HUANG M, ZHAO L. Attention-based LSTM for aspect-level sentiment classification[C]∥Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2016: 606-615.
[8] LIU Y, LIU B, SHAN L, et al. Modelling context with neural networks for recommending idioms in essay writing[J]. Neurocomputing, 2018, 275: 2287-2293.
[9] REZAEINIA S M, GHODSI A, RAHMANI R. Improving the accuracy of pre-trained word embeddings for sentiment analysis[J]. arXiv:1711.08609, 2017.
[10] KIM Y. Convolutional neural networks for sentence classification[J]. arXiv:1408.5882, 2014.
[11] LIU Y, LIU Z, CHUA T S, et al. Topical word embeddings[C]∥Twenty-Ninth AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2015.
[12] ZENG J, LI J, SONG Y, et al. Topic memory networks for short text classification[J]. arXiv:1809.03664, 2018.
[13] HUANG H, WANG Y, FENG C, et al. Leveraging conceptualization for short-text embedding[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 30(7): 1282-1295.
[14] WANG J, WANG Z, ZHANG D, et al. Combining knowledge with deep convolutional neural networks for short text classification[C]∥International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann, 2017: 2915-2921.
[15] HUA W, WANG Z, WANG H, et al. Short text understanding through lexical-semantic analysis[C]∥2015 IEEE 31st International Conference on Data Engineering. Piscataway: IEEE, 2015: 495-506.
[16] MESNIL G, HE X, DENG L, et al. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding[C]∥14th Annual Conference of the International Speech Communication Association. Lous Tourils: ISCA, 2013: 3771-3775.
[17] YANG X, MAO K. Supervised fine tuning for word embedding with integrated knowledge[J]. arXiv:1505.07931, 2015.
[18] ZHANG Y, WALLACE B. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification[J]. arXiv:1510.03820, 2015.
[19] HEAP B, BAIN M, WOBCKE W, et al. Word vector enrichment of low frequency words in the bag-of-words model for short text multi-class classification problems[J]. arXiv:1709.05778, 2017.
[20] PETERSON L E. K-nearest neighbor[J]. Scholarpedia, 2009, 4(2): 1883.
[21] LI X, ROTH D. Learning question classifiers[C]∥Proceedings of the 19th International Conference on Computational Linguistics (Volume 1). Stroudsburg: Association for Computational Linguistics, 2002: 1-7.
[22] VITALE D, FERRAGINA P, SCAIELLA U. Classification of short texts by deploying topical annotations[C]∥European Conference on Information Retrieval. Heidelberg: Springer, 2012: 376-387.