Computer Science ›› 2021, Vol. 48 ›› Issue (6): 222-226.doi: 10.11896/jsjkx.200900140

• Artificial Intelligence •

Distributed Representation Learning and Improvement of Chinese Words Based on Co-occurrence

CAO Xue-fei1, NIU Qian1, WANG Rui-bo2, WANG Yu2, LI Ji-hong2   

  1 School of Automation and Software Engineering,Shanxi University,Taiyuan 030006,China
    2 School of Modern Educational Technology,Shanxi University,Taiyuan 030006,China
  • Received:2020-09-18 Revised:2021-01-03 Online:2021-06-15 Published:2021-06-03
  • About author:CAO Xue-fei,born in 1981,Ph.D. His main research interests include natural language processing and related topics.
  • Supported by:
    National Natural Science Foundation of China(62076156,61806115,61603228) and Shanxi Applied Basic Research Program(20191D111034).

Abstract: The co-occurrence matrix of words and their contexts is the key to learning the distributed representations of words. Different association measures between words and their contexts can be used when constructing a co-occurrence matrix. In this paper, we first introduce three association measures of words and their contexts, construct the corresponding co-occurrence matrices, and learn the distributed representations of words under a unified optimization framework. Results on semantic similarity and word analogy tasks show that the GloVe method performs best. We then introduce a hyperparameter to calibrate the co-occurrence counts of words and their contexts based on the Zipf distribution, and present a method for estimating its value. With the representations learned by the improved method, accuracy on the word analogy task increases by 0.67%, an improvement that is significant under the McNemar test, and the correlation coefficient on the word similarity task increases by 5.6%. In addition, when the improved representations are used as initial word-feature vectors in a semantic role identification task, the F1 value increases by 0.15%.
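The pipeline the abstract describes can be made concrete with a short sketch. The Python code below is a minimal illustration, not the authors' implementation: it builds a word-context co-occurrence matrix from windowed counts, damps the raw counts with a Zipf-style calibration exponent alpha, and weights the result with PMI, one of the standard association measures of the kind the paper compares. The function names, the window size, and the default alpha=0.75 are all assumptions for illustration; the paper's estimation procedure for the hyperparameter is not reproduced here.

```python
# Illustrative sketch only -- not the authors' implementation.
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count symmetric word-context co-occurrences within a fixed window."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sent[j])] += 1
    return counts

def calibrate(counts, alpha=0.75):
    """Zipf-style calibration: damp each raw count with an exponent alpha.
    alpha stands in for the hyperparameter the paper estimates; 0.75 is
    only a common smoothing default, not the paper's value."""
    return {pair: n ** alpha for pair, n in counts.items()}

def pmi(weights):
    """Weight (possibly calibrated) co-occurrences by pointwise mutual
    information: PMI(w,c) = log( p(w,c) / (p(w) p(c)) )."""
    total = sum(weights.values())
    word_marginal, ctx_marginal = Counter(), Counter()
    for (w, c), n in weights.items():
        word_marginal[w] += n
        ctx_marginal[c] += n
    return {
        (w, c): math.log(n * total / (word_marginal[w] * ctx_marginal[c]))
        for (w, c), n in weights.items()
    }

if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    raw = cooccurrence_counts(corpus, window=2)
    weighted = pmi(calibrate(raw, alpha=0.75))
    for pair, score in sorted(weighted.items())[:5]:
        print(pair, round(score, 3))
```

A matrix weighted this way would then be factorized (e.g., under a GloVe-style objective) to obtain the word vectors; that optimization step is omitted above.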
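The significance claim on the analogy task rests on the McNemar test over paired per-question outcomes of two models. Below is a minimal sketch of that test in its standard continuity-corrected chi-square form, run on synthetic correct/incorrect flags; it is not the paper's evaluation code.

```python
# Illustrative McNemar test over paired model outcomes.
from scipy.stats import chi2

def mcnemar(baseline_correct, improved_correct):
    """McNemar test on paired correct(1)/incorrect(0) flags of two models.
    b and c count the discordant pairs (one model right, the other wrong)."""
    b = sum(1 for x, y in zip(baseline_correct, improved_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, improved_correct) if y and not x)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected statistic
    return stat, chi2.sf(stat, df=1)        # p-value, 1 degree of freedom

if __name__ == "__main__":
    # Synthetic per-question flags, for illustration only.
    base = [1, 0, 1, 1, 0, 0, 1, 0] * 50
    imp  = [1, 1, 1, 1, 0, 1, 1, 0] * 50
    stat, p = mcnemar(base, imp)
    print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```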

Key words: Co-occurrence, Distributed representation, Word analogy, Word similarity, Zipf

CLC Number: TP391
[1]HAN N,QIAO S J,HUANG P,et al.Multi-Language Text Clustering Model for Internet Public Opinion Based on Swarm Intelligence[J].Journal of Chongqing Institute of Technology,2019,33(9):99-108.
[2]CAO Y Y,ZHOU Y H,SHEN F H,et al.Research on Named Entity Recognition of Chinese Electronic Medical Record Based on CNN-CRF[J].Journal of Chongqing University of Posts and Telecommunications,2019,31(6):869-875.
[3]XU W,ZHOU J.End-to-end learning of semantic role labeling using recurrent neural networks[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.2015:1127-1137.
[4]HARRIS Z S.Distributional structure[J].Word,1954,10(2-3):146-162.
[5]MIKOLOV T,CHEN K,CORRADO G,et al.Efficient estimation of word representations in vector space[J].arXiv:1301.3781,2013.
[6]PENNINGTON J,SOCHER R,MANNING C D.GloVe:Global vectors for word representation[C]//Proceedings of the Empi-rical Methods in Natural Language Processing.2014:1532-1543.
[7]STRATOS K,COLLINS M,HSU D.Model-based Word Embeddings from Decompositions of Count Matrices[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.2015:1282-1291.
[8]CHURCH K,HANKS P.Word association norms,mutual information,and lexicography[J].Computational Linguistics,1990,16(1):22-29.
[9]LEVY O,GOLDBERG Y.Neural word embedding as implicit matrix factorization[C]//Advances in Neural Information Processing Systems.2014:2177-2185.
[10]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]//NIPS.2013:3111-3119.
[11]CORRAL A,MORENO-SANCHEZ I,FONT-CLOS F.Large-scale analysis of Zipf’s law in English [J].PLoS ONE,2016,11:1-19.
[12]PIANTADOSI S T.Zipf’s word frequency law in natural language:A critical review and future directions[J].Psychonomic Bulletin & Review,2014,21(5):1112-1130.
[13]PETERS M,NEUMANN M,IYYER M,et al.Deep contextualized word representations[C]//NAACL.2018:2227-2237.
[14]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics.2019:4171-4186.
[15]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]//NIPS.2013:3111-3119.
[16]CHEN X X,XU L,LIU Z Y,et al.Joint learning of characterand word embeddings[C]//International Joint Conference on Artificial Intelligence.2015:1236-1242.
[17]GOLDBERG Y,LEVY O.Neural word embedding as implicit matrix factorization[C]//Proceedings of the Annual Conference on Neural Information Processing Systems.2014:2177-2185.
[18]ZHOU B,CROSS J,XIANG B.Good,Better,Best:Choosing Word Embedding Context[J].Computer Science,2015,31(3):624-634.
[19]EGGHE L.On the law of Zipf-Mandelbrot for multi-word phrases[J].Journal of the American Society for Information Science,1999,50(3):233-241.
[20]DEVINE K,SMITH F J.Storing and retrieving word phrases[J].Information Processing and Management,1985,21:215-224.
[21]DIETTERICH T G.Approximate statistical tests for comparing supervised classification learning algorithms[J].Neural Computation,1998,10:1895-1923.
[22]CARRERAS X,MARQUEZ L.Introduction to the CoNLL-2004 shared task:Semantic role labeling[C]//Proceedings of CoNLL.2004:89-97.
[23]YANG J,LIANG S,ZHANG Y.Design Challenges and Misconceptions in Neural Sequence Labeling[C]//Proceedings of the 27th International Conference on Computational Linguistics.2018:3879-3889.
[24]LIU K Y.Research on Chinese FrameNet Construction and Application technologies[J].Journal of Chinese Information Processing,2011,25(6):46-52.
[25]WANG R B,LI J H.Bayes Test of Precision,Recall,and F1 Measure for Comparison of Two Natural Language Processing Models[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:4135-4145.