Computer Science, 2023, Vol. 50, Issue (6A): 220800192-6. doi: 10.11896/jsjkx.220800192
陈洁
CHEN Jie
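Abstract: To address the difficulty of semantically representing long news texts, an enhanced feature representation is constructed from Doc2Vec document embeddings and weighted word vectors. The DV-sim and DV-tfidf methods extract enhanced features from words of specific parts of speech in the opening and closing parts of a document, and each is then combined with the Doc2Vec document vector to form a new global representation. DV-sim takes a semantic approach and derives word weights from the similarity between each feature word and the Doc2Vec vector; DV-tfidf takes a word-frequency approach and derives word weights from term frequency-inverse document frequency (TF-IDF). The HDBSCAN algorithm is then used for topic clustering on the THUCNews and Sogou datasets. Compared with applying the Doc2Vec vectors directly, DV-sim reduces the number of noise points on the two datasets by 60.82% and 60.63%, raises accuracy by 12.14% and 20.58%, and raises the F1-Score by 15.61% and 11.58%, respectively; DV-tfidf reduces the number of noise points by 15.20% and 59.55%, raises accuracy by 10.85% and 17.93%, and raises the F1-Score by 15.60% and 9.21%, respectively. The experimental results show that both DV-sim and DV-tfidf improve topic clustering performance, and that the semantics-based enhanced features outperform the frequency-based ones. DV-sim has also been applied effectively to topic clustering of news reports on outstanding women.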
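The two weighting schemes described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the toy vectors, the clipping of negative similarities to zero, the smoothed IDF formula, and the use of concatenation as the "combination" step are all assumptions made for illustration. In practice the document and word vectors would come from a trained Doc2Vec model, and the combined vectors would then be passed to HDBSCAN; both steps are omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_mean(vectors, weights):
    """Weighted average of word vectors; zero vector if all weights are 0."""
    dim = len(vectors[0])
    total = sum(weights)
    if total == 0:
        return [0.0] * dim
    return [sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
            for i in range(dim)]

def dv_sim(doc_vec, word_vecs):
    """DV-sim sketch: weight each feature word by its similarity to the
    document vector (negative similarities clipped to 0, an assumption),
    then concatenate the weighted mean with the document vector
    (concatenation is also an assumption)."""
    weights = [max(cosine(doc_vec, wv), 0.0) for wv in word_vecs]
    return list(doc_vec) + weighted_mean(word_vecs, weights)

def tfidf_weights(doc_tokens, corpus):
    """TF-IDF weight per token, using a smoothed IDF (an assumption)."""
    n_docs = len(corpus)
    weights = []
    for tok in doc_tokens:
        tf = doc_tokens.count(tok) / len(doc_tokens)
        df = sum(1 for doc in corpus if tok in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights.append(tf * idf)
    return weights

def dv_tfidf(doc_vec, doc_tokens, word_vec_lookup, corpus):
    """DV-tfidf sketch: same combination step, with TF-IDF weights."""
    weights = tfidf_weights(doc_tokens, corpus)
    vecs = [word_vec_lookup[tok] for tok in doc_tokens]
    return list(doc_vec) + weighted_mean(vecs, weights)

# Toy data (hypothetical): a 2-dimensional document vector and two feature words.
doc_vec = [1.0, 0.0]
lookup = {"economy": [1.0, 0.0], "sports": [0.0, 1.0]}
tokens = ["economy", "sports"]
corpus = [{"economy", "market"}, {"economy", "sports"}, {"economy", "market"}]

sim_repr = dv_sim(doc_vec, [lookup[t] for t in tokens])
tfidf_repr = dv_tfidf(doc_vec, tokens, lookup, corpus)
tfidf_w = tfidf_weights(tokens, corpus)
```

In this toy case DV-sim gives all the weight to "economy", which points in the same direction as the document vector, whereas the TF-IDF weights favour "sports", the rarer token in the corpus. This mirrors the semantic versus frequency-based distinction drawn in the abstract.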