Computer Science, 2021, Vol. 48, Issue (6): 222-226. doi: 10.11896/jsjkx.200900140
曹学飞1, 牛倩1, 王瑞波2, 王钰2, 李济洪2
CAO Xue-fei1, NIU Qian1, WANG Rui-bo2, WANG Yu2, LI Ji-hong2
Abstract: The co-occurrence matrix of words and their contexts is the key to learning distributional word representations. When constructing the co-occurrence matrix, different measures can be used to quantify the association between a word and its context. This paper first introduces three such association measures, constructs the corresponding co-occurrence matrices, and learns word representations under a single optimization framework. Evaluation on Chinese word-analogy and semantic-similarity tasks shows that the GloVe method performs best. The GloVe method is then improved by introducing a hyperparameter that corrects the co-occurrence counts of words and their contexts so that the corrected counts approximately follow a Zipf distribution, and a method for estimating this hyperparameter is given. Word representations learned with the improved method raise accuracy on the word-analogy task by 0.67%, a gain that is significant under McNemar's test, and improve performance on the word-similarity task by 5.6%. In addition, when the representations learned with the improved method are used as initial word-feature vectors in a semantic role labeling task, the F1 score improves by 0.15% over the representations learned without the improvement, and this gain is also fairly significant under the 3×2 cross-validated Bayes test.
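The abstract does not give the exact form of the count correction or of the hyperparameter estimator, but the idea of rescaling co-occurrence counts toward a Zipf distribution can be sketched. The snippet below assumes, purely for illustration, a power-law correction `counts**alpha` and a target Zipf exponent of -1 (the classic `freq ∝ 1/rank` law); under those assumptions `alpha` has a closed-form estimate from the log-log slope of the raw counts. Both choices are hypothetical, not the paper's method.

```python
import numpy as np

def estimate_alpha(counts):
    """Estimate an exponent alpha so that counts**alpha roughly follow
    the classic Zipf law freq ∝ 1/rank.

    Hypothetical sketch: the paper's actual correction and estimator are
    not specified in the abstract. Here, if log-count vs. log-rank has
    slope b for the raw counts, then counts**alpha has slope alpha*b,
    so alpha = -1/b drives the corrected slope to -1.
    """
    c = np.sort(np.asarray(counts, dtype=float))[::-1]  # counts in rank order
    ranks = np.arange(1, len(c) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(c), 1)  # fit log-log line
    return -1.0 / slope

def correct_counts(counts, alpha):
    """Apply the (assumed) power-law correction to raw co-occurrence counts."""
    return np.asarray(counts, dtype=float) ** alpha
```

The corrected counts would then replace the raw counts in the GloVe objective; any estimator that measures deviation from the Zipf line in log-log space (e.g. least squares over a grid of alpha values) would serve the same purpose.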
[1] 杨波, 李远彪. Complex Network Analysis on Curriculum System of Data Science and Big Data Technology. Computer Science, 2022, 49(6A): 680-685. https://doi.org/10.11896/jsjkx.210800123
[2] 李浩翔, 李浩君. MACTEN: Novel Large Scale Cloth Texture Classification Architecture. Computer Science, 2020, 47(11A): 258-265. https://doi.org/10.11896/jsjkx.191200115
[3] 张小川, 余林峰, 张宜浩. Multi-feature Fusion for Short Text Similarity Calculation Based on LDA. Computer Science, 2018, 45(9): 266-270. https://doi.org/10.11896/j.issn.1002-137X.2018.09.044
[4] 安亚巍, 操晓春, 罗顺. Construction Method of Domain Subject Thesaurus Based on Corpus. Computer Science, 2018, 45(6A): 396-397.
[5] 王旭阳, 尉醒醒. Query Expansion Method Based on Ontology and Local Co-occurrence. Computer Science, 2017, 44(1): 214-218. https://doi.org/10.11896/j.issn.1002-137X.2017.01.041
[6] 张书波, 张引, 张斌, 孙达明. Combined Query Expansion Method Based on Copulas Framework. Computer Science, 2016, 43(Z6): 485-488. https://doi.org/10.11896/j.issn.1002-137X.2016.6A.114
[7] 马春来, 单洪, 马涛, 顾正海. Random Forests Based Method for Inferring Social Ties of LBS Users. Computer Science, 2016, 43(12): 218-222. https://doi.org/10.11896/j.issn.1002-137X.2016.12.040
[8] 钟敏娟, 万常选, 刘德喜, 廖述梅, 焦贤沛. XML Query Expansion Based on High Quality Expansion Source and Local Word Co-occurrence Model. Computer Science, 2014, 41(4): 200-204.