Study on Long Text Topic Clustering Based on Doc2Vec Enhanced Features

Computer Science, 2023, Vol. 50, Issue (6A): 220800192-6. doi: 10.11896/jsjkx.220800192

CHEN Jie

School of Data Science and Information Technology, China Women's University, Beijing 100101, China

Online: 2023-06-10 Published: 2023-06-12
Corresponding author: CHEN Jie (chenjie@cwu.edu.cn)
About author: CHEN Jie, born in 1969, postgraduate, associate professor. Her main research interests include information processing and text mining.
Supported by: Research Fund of China Women's University (ZKY200020228).

Abstract

To address the difficulty of semantically representing long news texts, an enhanced document feature representation is constructed from Doc2Vec document embeddings and weighted word vectors. Two methods, DV-sim and DV-tfidf, extract enhanced features from words of specific parts of speech in the head and tail of each document, and each combines these features with the Doc2Vec document vector to form a new global representation. DV-sim takes a semantic view and weights each feature word by its similarity to the Doc2Vec document vector. DV-tfidf takes a word-frequency view and weights each feature word by term frequency-inverse document frequency (TF-IDF). The HDBSCAN algorithm is then applied for topic clustering on the THUCNews and Sogou datasets. Compared with using the Doc2Vec vector directly, DV-sim reduces the number of noise points on the two datasets by 60.82% and 60.63%, improves accuracy by 12.14% and 20.58%, and raises the F1-score by 15.61% and 11.58%, respectively. DV-tfidf reduces the number of noise points by 15.20% and 59.55%, improves accuracy by 10.85% and 17.93%, and raises the F1-score by 15.60% and 9.21%, respectively. The experiments show that both DV-sim and DV-tfidf improve topic clustering performance, and that the semantics-based enhanced features outperform the frequency-based ones. DV-sim has also been applied effectively to topic clustering of news reports on outstanding women.

Key words: Topic clustering, Text representation, Doc2Vec, Word embedding, HDBSCAN
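The two weighting schemes described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not state how weights are normalized or how the enhanced feature is combined with the Doc2Vec vector, so the clipping of negative similarities, the TF-IDF variant, the use of concatenation, and all function names and toy vectors below are assumptions made for illustration. The feature words are taken as already selected (the part-of-speech filtering of the document head and tail is not shown).

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_average(vectors, weights):
    """Weighted mean of word vectors; a zero vector if all weights are 0."""
    dim = len(vectors[0])
    total = sum(weights)
    if total == 0:
        return [0.0] * dim
    return [sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
            for i in range(dim)]


def dv_sim(doc_vec, word_vecs):
    """Weight each feature word by its similarity to the document vector.

    Assumptions: negative similarities are clipped to 0, and the enhanced
    feature is concatenated onto the document vector.
    """
    weights = [max(cosine(doc_vec, wv), 0.0) for wv in word_vecs]
    return list(doc_vec) + weighted_average(word_vecs, weights)


def dv_tfidf(doc_vec, words, word_vec_lookup, corpus):
    """Weight each feature word by TF-IDF over `corpus` (a list of token lists).

    Assumptions: tf = count / len(words), idf = log(N / df), and the same
    concatenation as in dv_sim.
    """
    counts = Counter(words)
    vocab = sorted(counts)
    weights = []
    for w in vocab:
        df = sum(1 for doc in corpus if w in doc)
        idf = math.log(len(corpus) / df) if df else 0.0
        weights.append(counts[w] / len(words) * idf)
    vectors = [word_vec_lookup[w] for w in vocab]
    return list(doc_vec) + weighted_average(vectors, weights)


# Toy 2-dimensional vectors (hypothetical values, not trained embeddings).
doc_vec = [1.0, 0.0]
lookup = {"match": [1.0, 0.0], "score": [0.0, 1.0]}
corpus = [["match", "score"], ["match"], ["match", "market"]]

sim_feature = dv_sim(doc_vec, [lookup["match"], lookup["score"]])
tfidf_feature = dv_tfidf(doc_vec, ["match", "score"], lookup, corpus)
```

In this toy case DV-sim favors "match", the word aligned with the document vector, while DV-tfidf favors "score", the word that is rare in the corpus, which mirrors the semantic versus frequency-based distinction the abstract draws. The resulting global vectors would then be passed to a density-based clusterer such as HDBSCAN, a step not shown here.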

CLC number: TP391

CHEN Jie. Study on Long Text Topic Clustering Based on Doc2Vec Enhanced Features[J]. Computer Science, 2023, 50(6A): 220800192-6. https://doi.org/10.11896/jsjkx.220800192
