Computer Science, 2024, Vol. 51, Issue (6A): 230300201-8. doi: 10.11896/jsjkx.230300201

• Artificial Intelligence

Cross-lingual Text Topic Discovery Based on Ensemble Learning

LI Shuai, YU Juan, WU Shaocheng   

  1. School of Economics and Management, Fuzhou University, Fuzhou 350108, China
  • Published: 2024-06-06
  • About author: LI Shuai, born in 1997, postgraduate. His main research interests include data mining and decision support systems.
    WU Shaocheng, born in 1997, doctoral candidate. His main research interests include text mining and data science.
  • Supported by:
    National Natural Science Foundation of China (71771054, 72171090).

Abstract: Cross-lingual text topic discovery is an important research direction in cross-lingual text mining and is valuable for analyzing and organizing multilingual text data. This paper proposes BCL-LDA (Bagging, cross-lingual word embedding with LDA), a cross-lingual text topic discovery method that improves the LDA topic model with Bagging and cross-lingual word embeddings in order to mine key information from multilingual text. The method first combines the Bagging ensemble learning idea with the LDA topic model to generate a set of mixed-language subtopics. It then uses cross-lingual word embeddings and the K-means algorithm to cluster and group the mixed subtopics. Finally, it uses the TF-IDF algorithm to filter and rank the topic words. Chinese-German and Chinese-French topic discovery experiments show that the method performs well in terms of topic coherence and diversity, and can extract bilingual topics that are more semantically related, more coherent, and more diverse.
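
The three stages named in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the LDA fitting step is left out (the subtopics that LDA would produce on each bootstrap sample are given here as toy Chinese-German word lists), each subtopic is embedded as the mean of its word vectors, the two-dimensional table `EMB` is hand-made rather than a trained cross-lingual embedding, K-means uses a deterministic farthest-first initialisation, and every function name is hypothetical.

```python
import math
import random
from collections import Counter


def bootstrap_samples(docs, n_samples, rng):
    """Bagging step: resample the corpus with replacement n_samples times."""
    return [[rng.choice(docs) for _ in docs] for _ in range(n_samples)]


def topic_vector(words, emb):
    """Assumption: a subtopic is embedded as the mean of its word vectors."""
    vecs = [emb[w] for w in words if w in emb]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=20):
    """Lloyd's algorithm with farthest-first initialisation (an assumption)."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        # Next center: the vector farthest from all centers chosen so far.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def rank_words(groups, top_n):
    """Treat each merged topic as one document and rank its words by TF-IDF."""
    df = Counter(w for g in groups for w in set(g))
    ranked = []
    for g in groups:
        tf = Counter(g)
        score = {w: tf[w] / len(g) * math.log(len(groups) / df[w]) for w in tf}
        # Highest score first; ties are broken by code point for determinism.
        ranked.append(sorted(score, key=lambda w: (-score[w], w))[:top_n])
    return ranked


def group_and_rank(subtopics, emb, k, top_n):
    """Cluster subtopics by their embeddings, merge each cluster, rank words."""
    labels = kmeans([topic_vector(t, emb) for t in subtopics], k)
    groups = [[w for t, lab in zip(subtopics, labels) if lab == c for w in t]
              for c in range(k)]
    groups = [g for g in groups if g]  # drop clusters that ended up empty
    return labels, rank_words(groups, top_n)


# Toy subtopics standing in for LDA output on different bootstrap samples.
SUBTOPICS = [
    ["经济", "Markt", "Bank"],
    ["Markt", "银行", "经济"],
    ["足球", "Spiel", "Tor"],
    ["Spiel", "比赛", "足球"],
]
# Hand-made 2-D "cross-lingual" vectors (hypothetical, for illustration only).
EMB = {
    "经济": (1.0, 0.0), "Markt": (0.9, 0.1), "Bank": (0.8, 0.0), "银行": (1.0, 0.2),
    "足球": (0.0, 1.0), "Spiel": (0.1, 0.9), "Tor": (0.0, 0.7), "比赛": (0.3, 1.0),
}

labels, topics = group_and_rank(SUBTOPICS, EMB, k=2, top_n=2)
print(labels)  # [0, 0, 1, 1]
print(topics)  # [['Markt', '经济'], ['Spiel', '足球']]
```

In a fuller pipeline, `bootstrap_samples` would feed each resampled corpus to an LDA implementation, and the top words of every fitted topic would take the place of `SUBTOPICS`. Words that occur in every merged topic receive a TF-IDF score of zero here, which is how the ranking step filters out non-discriminative terms.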

Key words: Topic discovery, Cross-lingual, LDA, Topic clustering, German, French

CLC Number: 

  • TP391.1