Computer Science, 2025, Vol. 52, Issue (11A): 241100087-9. doi: 10.11896/jsjkx.241100087
YU Juan (于娟)1, LI Weiting (李维婷)1, ZENG Xinyi (曾心怡)1, ZHAO Huiyun (赵慧云)2
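Abstract: To address the information loss in traditional word representations of Japanese text and the difficulty of processing high-dimensional sparse vectors, this paper studies word extraction and clustering methods for Japanese text. First, words are extracted according to the linguistic characteristics of Japanese using an improved atomic-word step-length method, and a Multi-attribute Fusion Weight (MFW) is computed for each word by combining its statistical, positional, word-length, and semantic features. Filtering by MFW yields a set of feature words for each text, which retains the text's information while reducing feature dimensionality. Then, each text is represented by the BERT embeddings of its feature words weighted by MFW, and this representation is integrated into a deep embedding model framework improved with the K-means++ algorithm to cluster Japanese texts. Experiments on two Japanese text datasets of different genres show that the proposed method improves both NMI and Purity by more than 5% over existing methods, demonstrating good clustering performance.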
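The results above are reported in NMI and Purity, which this page does not define. As a minimal sketch, assuming the standard definitions (majority-class purity, and NMI normalized by the arithmetic mean of the two label entropies; the paper may use a different normalization), the two metrics could be computed as follows. The function names are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of samples that belong to the majority true class of their cluster."""
    per_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        per_cluster.setdefault(p, Counter())[t] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Normalized mutual information.

    Assumption: normalized by the arithmetic mean of the two entropies.
    """
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    mi = sum(
        (n_tp / n) * math.log(n_tp * n / (count_t[t] * count_p[p]))
        for (t, p), n_tp in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # If both assignments are a single group, treat them as identical.
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    y_true = [0, 0, 0, 1, 1, 1]
    y_pred = [1, 1, 0, 0, 0, 0]
    print("Purity:", purity(y_true, y_pred))
    print("NMI:", nmi(y_true, y_pred))
```

Both metrics are invariant to how cluster IDs are numbered, which is why they suit unsupervised clustering where predicted labels carry no inherent meaning.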