Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230300201-8. doi: 10.11896/jsjkx.230300201

• Artificial Intelligence •

Cross-lingual Text Topic Discovery Based on Ensemble Learning

LI Shuai, YU Juan, WU Shaocheng

  1. School of Economics and Management, Fuzhou University, Fuzhou 350108, China
  • Published: 2024-06-06
  • Corresponding author: WU Shaocheng (2558861318@qq.com)
  • About author (lish1223@163.com): LI Shuai, born in 1997, postgraduate. His main research interests include data mining and decision support systems.
    WU Shaocheng, born in 1997, doctoral candidate. His main research interests include text mining and data science.
  • Supported by:
    National Natural Science Foundation of China (71771054, 72171090).

Abstract: Cross-lingual text topic discovery is an important research direction in cross-lingual text mining, with high application value for analyzing cross-lingual text and organizing diverse text data. This paper improves the LDA topic model with Bagging and cross-lingual word embeddings and proposes BCL-LDA (Bagging, Cross-lingual word embedding with LDA), a cross-lingual text topic discovery method for mining key information from multilingual text. The method first combines the Bagging ensemble learning idea with the LDA topic model to generate a set of mixed-language subtopics. It then uses cross-lingual word embeddings and the K-means algorithm to cluster the mixed subtopics into groups. Finally, it applies the TF-IDF algorithm to filter and rank the topic words. Chinese-German and Chinese-French topic discovery experiments show that the method performs well in both topic coherence and topic diversity, extracting bilingual topics that are more semantically related, more coherent, and more diverse.

Key words: Topic discovery, Cross-lingual, LDA, Topic clustering, German, French

CLC Number: 

  • TP391.1
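
The three stages named in the abstract (Bagging with LDA, clustering subtopics with cross-lingual embeddings and K-means, TF-IDF ranking of topic words) can be illustrated with a minimal sketch. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the function names, the toy 2-D vectors standing in for real cross-lingual word embeddings, the small collapsed-Gibbs sampler used for LDA inference, the choice to represent a subtopic by the mean of its word vectors, the TF-IDF variant that treats each cluster as one document, and all parameter values.

```python
import math
import random
from collections import Counter


def gibbs_lda(docs, k, rng, iters=30, alpha=0.1, beta=0.01):
    """Minimal collapsed-Gibbs LDA; returns one word-count Counter per topic."""
    vocab_size = len({w for doc in docs for w in doc})
    z = [[rng.randrange(k) for _ in doc] for doc in docs]  # topic of each token
    doc_topic = [[0] * k for _ in docs]
    topic_word = [Counter() for _ in range(k)]
    topic_total = [0] * k

    def update(d, w, t, delta):
        doc_topic[d][t] += delta
        topic_word[t][w] += delta
        topic_total[t] += delta

    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            update(d, w, z[d][i], +1)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)  # remove the token, then resample its topic
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z[d][i] = rng.choices(range(k), weights)[0]
                update(d, w, z[d][i], +1)
    return topic_word


def bagged_subtopics(docs, rng, n_bags, k_sub, top_n):
    """Stage 1: fit LDA on bootstrap resamples and pool the top words of each topic."""
    subtopics = []
    for _ in range(n_bags):
        sample = [rng.choice(docs) for _ in docs]  # sampling with replacement
        for topic in gibbs_lda(sample, k_sub, rng):
            words = [w for w, c in topic.most_common(top_n) if c > 0]
            if words:
                subtopics.append(words)
    return subtopics


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(points, k, rng, iters=20):
    """Stage 2 helper: plain K-means; returns k lists of point indices."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            groups[min(range(k), key=lambda c: math.dist(p, centers[c]))].append(idx)
        centers = [centroid([points[i] for i in g]) if g else centers[j]
                   for j, g in enumerate(groups)]
    return groups


def rank_topic_words(clusters, top_n):
    """Stage 3: rank words per cluster by TF-IDF, treating each cluster as a document."""
    n = len(clusters)
    df = Counter(w for words in clusters for w in set(words))
    ranked = []
    for words in clusters:
        tf = Counter(words)
        score = {w: (c / len(words)) * math.log(1 + n / df[w]) for w, c in tf.items()}
        ranked.append(sorted(score, key=lambda w: (-score[w], w))[:top_n])
    return ranked


def bcl_lda_sketch(docs, emb, n_bags=5, k_sub=2, k_final=2, top_n=4, seed=0):
    rng = random.Random(seed)
    subtopics = bagged_subtopics(docs, rng, n_bags, k_sub, top_n)
    points = [centroid([emb[w] for w in words]) for words in subtopics]
    groups = kmeans(points, k_final, rng)
    clusters = [[w for i in g for w in subtopics[i]] for g in groups]
    return rank_topic_words(clusters, top_n)


# Hypothetical 2-D "cross-lingual" vectors: sports words near (1, 0), finance near (0, 1).
EMB = {
    "足球": (1.0, 0.1), "比赛": (0.9, 0.0), "Fußball": (1.0, 0.0), "Spiel": (0.9, 0.1),
    "银行": (0.0, 1.0), "股票": (0.1, 0.9), "Bank": (0.1, 1.0), "Aktie": (0.0, 0.9),
}
DOCS = [
    ["足球", "比赛", "足球"], ["Fußball", "Spiel", "Spiel"],
    ["银行", "股票", "银行"], ["Bank", "Aktie", "Aktie"],
] * 3

if __name__ == "__main__":
    for topic in bcl_lda_sketch(DOCS, EMB):
        print(topic)
```

In this toy setup each document is monolingual, so a single LDA run cannot link Chinese and German words through co-occurrence. The link comes from the shared embedding space in stage 2, where subtopics from different languages and different bootstrap runs fall into the same K-means cluster if their mean vectors are close. A realistic version would replace `EMB` with pretrained multilingual embeddings and the toy sampler with a tested LDA library.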