Computer Science ›› 2023, Vol. 50 ›› Issue (6A): 220800192-6. doi: 10.11896/jsjkx.220800192

• Artificial Intelligence •

Study on Long Text Topic Clustering Based on Doc2Vec Enhanced Features

CHEN Jie   

  1. School of Data Science and Information Technology, China Women’s University, Beijing 100101, China
  • Online: 2023-06-10  Published: 2023-06-12
  • About author: CHEN Jie, born in 1969, postgraduate, associate professor. Her main research interests include information processing and text mining.
  • Supported by:
    Research Fund of China Women’s College (ZKY200020228).

Abstract: To address the difficulty of semantically representing long news texts, an enhanced document feature representation is constructed based on Doc2Vec embeddings and weighted word vectors. Enhanced features are extracted from the content of specific parts of speech at the head and tail of each document using DV-sim or DV-tfidf, and are then combined with the Doc2Vec vector to form a new global representation. DV-sim obtains word weights from a semantic point of view, using the similarity between each feature word and the Doc2Vec vector, while DV-tfidf obtains word weights from a frequency-statistics point of view, using term frequency-inverse document frequency. The HDBSCAN algorithm is then applied for topic clustering on the THUCNews and Sogou datasets. Compared with the plain Doc2Vec vector, DV-sim reduces the number of noise points on the two datasets by 60.82% and 60.63%, improves accuracy by 12.14% and 20.58%, and increases the F1-score by 15.61% and 11.58%, respectively; DV-tfidf reduces the number of noise points by 15.20% and 59.55%, improves accuracy by 10.85% and 17.93%, and increases the F1-score by 15.60% and 9.21%, respectively. Experiments show that both DV-sim and DV-tfidf improve topic clustering performance, and that the semantics-based enhanced features outperform the frequency-based ones. DV-sim has also been applied effectively to topic clustering of reports on outstanding women.
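The pipeline described in the abstract can be illustrated with a short sketch. The code below is a minimal reconstruction, not the authors' released implementation: it trains a gensim Doc2Vec model, builds the enhanced feature from part-of-speech-filtered words in the head and tail of each document, weights those words either by their similarity to the document vector (DV-sim) or by TF-IDF (DV-tfidf), concatenates the result with the Doc2Vec vector, and clusters the combined vectors with HDBSCAN. The head/tail window sizes, the retained POS tags, the concatenation step, and all hyperparameters are illustrative assumptions.

# Minimal sketch of Doc2Vec + enhanced-feature topic clustering (assumptions noted above).
import numpy as np
import jieba.posseg as pseg
import hdbscan
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import TfidfVectorizer

HEAD, TAIL = 100, 100                              # assumed head/tail character windows
KEEP_POS = {"n", "nr", "ns", "nz", "v", "vn"}      # assumed parts of speech to keep


def head_tail_tokens(text):
    """POS-tag the head and tail of a document and keep content words."""
    snippet = text[:HEAD] + text[-TAIL:]
    return [w.word for w in pseg.cut(snippet) if w.flag in KEEP_POS]


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def enhanced_vectors(texts, mode="sim"):
    # Train Doc2Vec on the full tokenised documents.
    tokenised = [[w.word for w in pseg.cut(t)] for t in texts]
    corpus = [TaggedDocument(toks, [i]) for i, toks in enumerate(tokenised)]
    d2v = Doc2Vec(vector_size=100, min_count=2, epochs=40)
    d2v.build_vocab(corpus)
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)

    feats = [head_tail_tokens(t) for t in texts]

    # DV-tfidf: word weights from TF-IDF computed over the feature words.
    if mode == "tfidf":
        tfidf = TfidfVectorizer(analyzer=lambda toks: toks)
        m = tfidf.fit_transform(feats)
        vocab = tfidf.vocabulary_

    combined = []
    for i, toks in enumerate(feats):
        doc_vec = d2v.infer_vector(tokenised[i])
        vecs, weights = [], []
        for w in toks:
            if w not in d2v.wv:
                continue
            if mode == "sim":        # DV-sim: semantic weight via similarity to the document vector
                weights.append(cosine(d2v.wv[w], doc_vec))
            else:                    # DV-tfidf: frequency-statistics weight
                weights.append(m[i, vocab[w]] if w in vocab else 0.0)
            vecs.append(d2v.wv[w])
        if vecs and sum(weights) > 0:
            enh = np.average(vecs, axis=0, weights=weights)
        else:
            enh = np.zeros(d2v.vector_size)
        # Assumed combination: concatenate the Doc2Vec vector with the enhanced feature.
        combined.append(np.concatenate([doc_vec, enh]))
    return np.vstack(combined)


def cluster(texts, mode="sim", min_cluster_size=10):
    X = enhanced_vectors(texts, mode=mode)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                             metric="euclidean").fit_predict(X)
    return labels                    # -1 marks HDBSCAN noise points

Given a list of raw document strings, cluster(texts, mode="sim") returns one label per document; the points labelled -1 are the HDBSCAN noise points counted in the noise-reduction figures quoted in the abstract.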

Key words: Topic clustering, Text representation, Doc2Vec, Word embedding, HDBSCAN

CLC Number: 

  • TP391