Computer Science, 2025, Vol. 52, Issue (6A): 240400206-8. doi: 10.11896/jsjkx.240400206
范星1, 周晓航2,3, 张宁1
FAN Xing1, ZHOU Xiaohang2,3, ZHANG Ning1
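Abstract: Short text similarity measurement is a key task in natural language processing. As user activity on social media platforms continues to rise, short text data has become a core carrier of information dissemination on the Internet. For enterprises working with big data, such data is of significant practical value for understanding consumer sentiment in depth and building accurate user profiles. This paper first systematically reviews short text similarity measurement methods, grouping them into three categories (string-based methods, word-vector-based methods, and deep-learning-based methods), and discusses the strengths and limitations of each. It then focuses on the practical use of short text similarity in enterprise business analytics, showing how short text similarity measurement helps enterprises gain insight into consumer opinions and attitudes and optimize their marketing strategies. Finally, it summarizes the challenges facing short text similarity measurement on social media platforms and looks ahead to future developments, aiming to provide a useful reference for researchers in the field.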