Computer Science, 2024, Vol. 51, Issue (6A): 230300201-8. doi: 10.11896/jsjkx.230300201
李帅, 于娟, 巫邵诚
LI Shuai, YU Juan, WU Shaocheng
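Abstract: Cross-lingual topic discovery is an important research direction in cross-lingual text mining and has considerable practical value for analyzing and organizing text data across languages. This paper improves the LDA topic model with Bagging and cross-lingual word embeddings and proposes BCL-LDA (Bagging, Cross-lingual word embedding with LDA), a cross-lingual topic discovery method for mining key information from multilingual text. The method first combines the Bagging ensemble learning idea with the LDA topic model to generate a set of mixed-language sub-topics. It then clusters the mixed sub-topics into groups using cross-lingual word embeddings and the K-means algorithm. Finally, it filters and ranks the topic words with the TF-IDF algorithm. Chinese-German and Chinese-French topic discovery experiments show that the method performs well on both topic coherence and topic diversity, extracting bilingual topics that are more semantically related, more coherent, and more diverse.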
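The three stages named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the toy corpus, the hand-made 2-D "cross-lingual embeddings", all hyperparameters, the function names, the choice to represent a sub-topic by the mean of its word vectors, and the smoothed TF-IDF variant are assumptions made purely for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus: monolingual Chinese and German documents, two themes.
DOCS = [
    ["足球", "比赛", "球队", "比赛"], ["球队", "足球", "比赛"],
    ["fussball", "spiel", "mannschaft"], ["mannschaft", "fussball", "spiel", "spiel"],
    ["银行", "股票", "市场"], ["市场", "银行", "股票", "股票"],
    ["bank", "aktie", "markt"], ["markt", "bank", "aktie", "aktie"],
]
# Hypothetical 2-D embeddings: sports words near (1, 0), finance words near (0, 1).
EMB = {
    "足球": (1.0, 0.1), "比赛": (0.9, 0.0), "球队": (1.0, 0.2),
    "fussball": (1.0, 0.0), "spiel": (0.9, 0.1), "mannschaft": (0.8, 0.1),
    "银行": (0.0, 1.0), "股票": (0.1, 0.9), "市场": (0.2, 1.0),
    "bank": (0.1, 1.0), "aktie": (0.0, 0.9), "markt": (0.1, 0.8),
}

def bootstrap(docs, rng):
    """Bagging step: draw len(docs) documents with replacement."""
    return [rng.choice(docs) for _ in docs]

def lda_gibbs(docs, k, rng, iters=50, alpha=0.1, beta=0.01, top_n=4):
    """Tiny collapsed Gibbs sampler for LDA; returns the top words of each topic."""
    vocab = sorted({w for d in docs for w in d})
    v, idx = len(vocab), {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * k for _ in docs]       # document-topic counts
    nkw = [[0] * v for _ in range(k)]   # topic-word counts
    nk = [0] * k                        # tokens assigned to each topic
    z = [[rng.randrange(k) for _ in d] for d in docs]  # random initial assignment
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            ndk[d][t] += 1
            nkw[t][idx[w]] += 1
            nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = z[d][i], idx[w]
                ndk[d][t] -= 1          # remove the token, then resample its topic
                nkw[t][wi] -= 1
                nk[t] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][wi] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[d][i] = t
                ndk[d][t] += 1
                nkw[t][wi] += 1
                nk[t] += 1
    topics = []
    for t in range(k):
        top = sorted(range(v), key=lambda wi: -nkw[t][wi])[:top_n]
        topics.append([vocab[wi] for wi in top if nkw[t][wi] > 0])
    return [t for t in topics if t]     # drop topics that received no tokens

def centroid(vectors):
    return [sum(c) / len(vectors) for c in zip(*vectors)]

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                 # keep the old center if a cluster empties
                centers[c] = centroid(members)
    return labels

def bcl_lda_sketch(docs, emb, n_bags=5, k_sub=4, k_final=2, top_n=4, seed=0):
    rng = random.Random(seed)
    # 1) Bagging + LDA: one LDA run per bootstrap sample yields sub-topics.
    sub_topics = []
    for _ in range(n_bags):
        sub_topics += lda_gibbs(bootstrap(docs, rng), k_sub, rng)
    sub_topics = [t for t in sub_topics if any(w in emb for w in t)]
    # 2) Represent each sub-topic by its mean word vector, then cluster with K-means.
    points = [centroid([emb[w] for w in t if w in emb]) for t in sub_topics]
    labels = kmeans(points, min(k_final, len(points)), rng)
    # 3) TF-IDF: pool words per cluster, score them, keep the top_n per final topic.
    pooled = [Counter() for _ in range(k_final)]
    for t, lab in zip(sub_topics, labels):
        pooled[lab].update(t)
    pooled = [c for c in pooled if c]
    n = len(pooled)
    topics = []
    for c in pooled:
        score = {w: tf * math.log(1 + n / sum(w in o for o in pooled))
                 for w, tf in c.items()}
        topics.append(sorted(score, key=lambda w: (-score[w], w))[:top_n])
    return topics

for topic in bcl_lda_sketch(DOCS, EMB):
    print(topic)
```

Because the toy documents are monolingual, a single LDA run tends to produce sub-topics dominated by one language. In this sketch it is the embedding-based clustering step that can bring Chinese and German sub-topics about the same theme into one final topic.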