Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241100087-9. doi: 10.11896/jsjkx.241100087

• Artificial Intelligence •

Japanese Text Clustering Based on Multi-attribute Word Embedding

YU Juan1, LI Weiting1, ZENG Xinyi1, ZHAO Huiyun2   

  1 School of Economics and Management, Fuzhou University, Fuzhou 350108, China
    2 China Mobile Group Fujian Co., Ltd. Fuzhou Branch, Fuzhou 350108, China
  • Online: 2025-11-15    Published: 2025-11-10
  • Corresponding author: LI Weiting (weitingli1160@163.com)
  • About the author: yujuan@fzu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (72171090, 71771054) and Natural Science Foundation of Fujian Province (2023J01393).

Abstract: To address the information loss of traditional word representations for Japanese text and the difficulty of processing high-dimensional sparse vectors, this paper studies word extraction and clustering methods for Japanese text. First, words are extracted with an improved atomic-word-step method based on the linguistic characteristics of Japanese, and a Multi-attribute Fusion Weight (MFW) is computed for each word by fusing its statistical, positional, word-length and semantic features; filtering words by MFW yields a set of feature words that preserves the information of a text while reducing the feature dimensionality. Then, each Japanese text is represented by fusing the BERT embeddings of its feature words with their MFW values, and this representation is fed into the deep embedding model framework improved with the K-means++ algorithm to cluster the texts. Experiments on two Japanese text datasets with different topics show that, compared with existing methods, the proposed approach improves both NMI and purity by more than 5%, demonstrating good clustering performance.
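
The abstract does not give the MFW formula, the word-extraction procedure or the architecture of the deep embedding model, so the sketch below is only a rough illustration of the pipeline under stated assumptions: the MFW values are taken as given, the document vector is assumed to be an MFW-weighted average of the feature words' BERT embeddings, a plain K-means with k-means++ initialization stands in for the full deep embedding model, and the Japanese BERT checkpoint name is an assumption. It only shows how such a representation could be built, clustered, and then scored with NMI and purity.

```python
# Rough sketch (not the paper's implementation): cluster Japanese texts from
# MFW-weighted BERT embeddings of their feature words, then score the result
# with NMI and purity. The checkpoint name, the given-MFW assumption and the
# use of plain K-means in place of the deep embedding model are assumptions.
from collections import Counter

import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "cl-tohoku/bert-base-japanese"  # assumed Japanese BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME).eval()


def word_embedding(word: str) -> np.ndarray:
    """Mean of the BERT sub-token vectors of a single word ([CLS]/[SEP] removed)."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0, 1:-1]
    return hidden.mean(dim=0).numpy()


def doc_vector(feature_words: list[str], mfw: dict[str, float]) -> np.ndarray:
    """MFW-weighted average of the feature words' BERT embeddings (assumed fusion)."""
    vecs = np.stack([word_embedding(w) for w in feature_words])
    weights = np.array([mfw[w] for w in feature_words])
    return (vecs * weights[:, None]).sum(axis=0) / weights.sum()


def purity(labels_true: list[int], labels_pred: list[int]) -> float:
    """Fraction of documents falling in the majority true class of their cluster."""
    correct = 0
    for cluster_id in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster_id]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels_true)


def cluster_and_score(docs_feature_words, docs_mfw, labels_true, n_clusters):
    """Build document vectors, cluster them (k-means++ init), return (NMI, purity)."""
    X = np.stack([doc_vector(fw, w) for fw, w in zip(docs_feature_words, docs_mfw)])
    labels_pred = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                         random_state=0).fit_predict(X)
    return (normalized_mutual_info_score(labels_true, labels_pred),
            purity(labels_true, labels_pred))
```

In the paper's setting, the K-means++ step improves the deep embedding model framework itself rather than replacing it; the stand-in above only makes the representation-then-cluster-then-evaluate flow concrete.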

Key words: Japanese text mining, Word extraction, Text representation, Text clustering, Deep clustering

CLC number: TP181