Japanese Text Clustering Based on Multi-attribute Word Embedding

doi:10.11896/jsjkx.241100087

Abstract

To address information loss in traditional Japanese text representations and the difficulty of processing high-dimensional sparse vectors, we study word extraction and clustering methods for Japanese text. First, words are extracted with an improved atomic-word-step method based on the linguistic characteristics of Japanese. The Multi-attribute Fusion Weight (MFW) of each word is then calculated by combining its statistical features, position, word length, and semantic features, yielding a set of text feature words that retains text information while reducing feature dimensionality. Next, each Japanese text is represented by the BERT-weighted MFWs of its feature words, and this representation is fed into a deep embedding model framework improved with the K-means++ algorithm to cluster the texts. Experimental results on two Japanese text datasets covering different topics show that the proposed approach improves both NMI and Purity by more than 5% compared with existing methods, demonstrating good clustering performance.
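As a rough illustration of two ideas named in the abstract, the sketch below pools word vectors into one text vector using per-word weights, and computes the two reported evaluation measures, Purity and NMI. It is only a sketch under stated assumptions: the abstract does not give the MFW fusion formula or the exact pooling rule, so `weighted_text_vector` and its `weights` argument are hypothetical stand-ins, and the NMI here is normalized by the arithmetic mean of the two entropies, which may differ from the normalization the authors used.

```python
import math
from collections import Counter


def weighted_text_vector(word_vectors, weights):
    """Weighted average of word vectors.

    Assumption: `weights` stands in for per-word MFW scores; the paper's
    actual fusion and pooling rules are not given in the abstract.
    """
    total = sum(weights[word] for word in word_vectors)
    dim = len(next(iter(word_vectors.values())))
    pooled = [0.0] * dim
    for word, vector in word_vectors.items():
        for i, value in enumerate(vector):
            pooled[i] += weights[word] * value / total
    return pooled


def purity(labels_true, labels_pred):
    """Fraction of texts that fall in the majority class of their cluster."""
    clusters = {}
    for truth, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[truth] += 1
    majority = sum(max(counts.values()) for counts in clusters.values())
    return majority / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    mutual_info = sum(
        (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    mean_entropy = (entropy(labels_true) + entropy(labels_pred)) / 2
    # Both labelings constant: treat them as trivially identical.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0
```

Both measures depend only on how texts are grouped, not on the cluster ids themselves, so a clustering that matches the topic labels up to renaming scores 1.0 on each.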

Key words: Japanese text mining, Word extraction, Text representation, Text clustering, Deep clustering

CLC Number: TP181

YU Juan, LI Weiting, ZENG Xinyi, ZHAO Huiyun. Japanese Text Clustering Based on Multi-attribute Word Embedding [J]. Computer Science, 2025, 52(11A): 241100087-9.
