Computer Science ›› 2021, Vol. 48 ›› Issue (6): 222-226. doi: 10.11896/jsjkx.200900140

• Artificial Intelligence •


Distributed Representation Learning and Improvement of Chinese Words Based on Co-occurrence

CAO Xue-fei1, NIU Qian1, WANG Rui-bo2, WANG Yu2, LI Ji-hong2   

  1 School of Automation and Software Engineering, Shanxi University, Taiyuan 030006, China
    2 School of Modern Educational Technology, Shanxi University, Taiyuan 030006, China
  • Received: 2020-09-18  Revised: 2021-01-03  Online: 2021-06-15  Published: 2021-06-03
  • Corresponding author: CAO Xue-fei (caoxuefei@sxu.edu.cn)
  • About author: CAO Xue-fei, born in 1981, Ph.D. His main research interests include natural language processing.
  • Supported by:
National Natural Science Foundation of China (62076156, 61806115, 61603228) and Shanxi Applied Basic Research Program (201901D111034).

Abstract: The co-occurrence matrix of words and their contexts is the key to learning distributed representations of words. Different methods can be used to measure the association between a word and its context when constructing the co-occurrence matrix. In this paper, we first introduce three association measures between words and their contexts, construct the corresponding co-occurrence matrices, and learn distributed representations of words under a single optimization framework. Evaluations on Chinese word analogy and semantic similarity tasks show that the GloVe method performs best. We then improve the GloVe method by introducing a hyperparameter that calibrates the co-occurrence counts of words and their contexts so that the calibrated counts approximately follow a Zipf distribution, and we give a method for estimating this hyperparameter. With the distributed representations learned by the improved method, accuracy on the word analogy task increases by 0.67%, a gain that is significant under the McNemar test, and the correlation coefficient on the word similarity task increases by 5.6%. In addition, when these representations are used as the initial word-feature vectors in a semantic role identification task, the F1 value increases by 0.15% over that obtained with the unimproved representations, a gain that is also significant under a Bayes test based on 3×2 cross-validation.

Key words: Co-occurrence, Distributed representation, Word analogy, Word similarity, Zipf distribution
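The pipeline described in the abstract — counting word-context co-occurrences, scoring them with an association measure, and calibrating the raw counts with a power-law hyperparameter so their distribution moves toward a Zipf-like shape — can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the window size, the choice of PMI as the association measure, and the fixed exponent `alpha` are assumptions.

```python
# Illustrative sketch (not the paper's implementation): word-context
# co-occurrence counting, a PMI association score, and a power-law
# calibration of the raw counts. Window size and `alpha` are assumed values.
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs within a symmetric window."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            left = sent[max(0, i - window):i]
            right = sent[i + 1:i + 1 + window]
            for c in left + right:
                counts[(w, c)] += 1
    return counts

def pmi(counts):
    """Pointwise mutual information, one common association measure."""
    total = sum(counts.values())
    w_tot, c_tot = Counter(), Counter()
    for (w, c), n in counts.items():
        w_tot[w] += n
        c_tot[c] += n
    return {(w, c): math.log(n * total / (w_tot[w] * c_tot[c]))
            for (w, c), n in counts.items()}

def calibrate(counts, alpha=0.75):
    """Raise each count to the power alpha; alpha < 1 compresses frequent
    pairs, pushing the count distribution toward a Zipf-like shape."""
    return {pair: n ** alpha for pair, n in counts.items()}
```

In the paper the calibration hyperparameter is estimated from the data so that the calibrated counts approximately follow a Zipf distribution; the value 0.75 above is only a conventional placeholder for that estimate.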

CLC number: TP391
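The single optimization framework mentioned in the abstract is, in the GloVe case, a weighted least-squares fit of word and context vectors (plus biases) to the log co-occurrence counts. A minimal dense-matrix sketch of that objective and its gradients follows; the weighting cutoff `x_max` and the 0.75 exponent are the conventional GloVe defaults assumed here, and practical implementations use sparse matrices with SGD or AdaGrad rather than full-batch gradients.

```python
# Hedged sketch of the GloVe weighted least-squares objective on a dense
# co-occurrence matrix X. x_max and the 0.75 weighting exponent are the
# conventional defaults, assumed rather than taken from the paper.
import numpy as np

def glove_loss_and_grads(W, C, bw, bc, X, x_max=100.0):
    """Loss and gradients of sum_ij f(X_ij)(w_i·c_j + bw_i + bc_j - log X_ij)^2."""
    f = np.minimum(1.0, (X / x_max) ** 0.75)            # GloVe weighting function
    logX = np.where(X > 0, np.log(np.maximum(X, 1e-12)), 0.0)
    diff = (W @ C.T + bw[:, None] + bc[None, :] - logX) * (X > 0)
    loss = np.sum(f * diff ** 2)
    fd = f * diff                                        # shared gradient factor
    return loss, (2 * fd @ C,            # gradient w.r.t. word vectors W
                  2 * fd.T @ W,          # gradient w.r.t. context vectors C
                  2 * fd.sum(axis=1),    # gradient w.r.t. word biases bw
                  2 * fd.sum(axis=0))    # gradient w.r.t. context biases bc
```

A plain gradient-descent step on these gradients reduces the loss; swapping in a different association measure only changes what is stored in `X` before the fit, which is how one framework covers all three matrices compared in the paper.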