Computer Science, 2023, Vol. 50, Issue (1): 76-86. doi: 10.11896/jsjkx.220100078

• Database & Big Data & Data Science •

Text Material Recommendation Method Combining Label Classification and Semantic Query Expansion

MENG Yiyue, PENG Rong, LYU Qibiao

  1. School of Computer Science, Wuhan University, Wuhan 430072, China
  • Received: 2022-01-09 Revised: 2022-07-02 Online: 2023-01-15 Published: 2023-01-09
  • Corresponding author: PENG Rong (rongpeng@whu.edu.cn)
  • About author: MENG Yiyue (mengyiyue@whu.edu.cn), born in 1998, postgraduate. His main research interests include requirements engineering, software engineering and so on.
    PENG Rong, born in 1975, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. Her main research interests include requirements engineering, software engineering, service computing, etc.
  • Supported by:
    Joint Funds of China Mobile of the Ministry of Education of China (MCM2020J01).

Abstract: When preparing planning documents and research reports, authors often need to collect and read a large amount of text material according to a proposed table of contents or set of headings, then classify and sort it before selecting what to use. The workload is heavy and the quality cannot be guaranteed. To address this, a text material recommendation method combining label classification and semantic query expansion is proposed for the domain of digital government planning documents. From the perspective of information retrieval, the headings at every level of the table of contents are treated as queries and the reference text materials as target documents, so that text materials can be retrieved and recommended. Based on the differential evolution algorithm, the method combines three recommendation methods, based respectively on word vector averaging, semantic query expansion, and label classification. This compensates for the shortcomings of traditional text material recommendation methods and makes it possible to retrieve text materials at paragraph granularity using the headings of the table of contents. Experiments on 10 datasets show that the proposed method significantly improves performance. It can greatly reduce the workload of manual material selection and material classification, and lower the difficulty of document preparation.

Key words: Text material recommendation, Information retrieval, Digital government, Query expansion, Differential evolution algorithm
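
The abstract describes combining three recommendation methods with the differential evolution algorithm. As a rough illustration only, the sketch below fuses three per-paragraph scores with a weighted sum and searches for the weights with a basic differential evolution loop. The toy scores, the weighted-sum fusion, the mean reciprocal rank objective, the [0, 1] weight bounds, and the settings `pop_size`, `f`, and `cr` are all assumptions made for this example and are not taken from the paper.

```python
import random

# Hypothetical toy data: for each query (a heading), a list of candidate
# paragraphs, each with three scores (word-vector average, query expansion,
# label classification), plus the index of the relevant paragraph.
TOY_QUERIES = [
    ([(0.9, 0.2, 0.0), (0.4, 0.8, 1.0), (0.1, 0.1, 0.0)], 1),
    ([(0.3, 0.7, 1.0), (0.6, 0.1, 0.0), (0.2, 0.2, 0.0)], 0),
    ([(0.5, 0.5, 0.0), (0.2, 0.3, 0.0), (0.7, 0.6, 1.0)], 2),
]


def fuse(weights, scores):
    """Weighted sum of the per-method scores for one paragraph (assumed fusion rule)."""
    return sum(w * s for w, s in zip(weights, scores))


def mean_reciprocal_rank(weights, queries):
    """Average of 1/rank of the relevant paragraph under the fused score."""
    total = 0.0
    for candidates, relevant in queries:
        fused = [fuse(weights, c) for c in candidates]
        # Rank is 1 plus the number of paragraphs scoring strictly higher.
        rank = 1 + sum(1 for value in fused if value > fused[relevant])
        total += 1.0 / rank
    return total / len(queries)


def differential_evolution(objective, dim, pop_size=20, generations=50,
                           f=0.5, cr=0.9, seed=0):
    """Maximise `objective` over [0, 1]^dim with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    fitness = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    value = min(1.0, max(0.0, value))  # clip to the bounds
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_fitness = objective(trial)
            if trial_fitness >= fitness[i]:  # greedy selection keeps the better one
                pop[i], fitness[i] = trial, trial_fitness
    best = max(range(pop_size), key=lambda k: fitness[k])
    return pop[best], fitness[best]


if __name__ == "__main__":
    weights, score = differential_evolution(
        lambda w: mean_reciprocal_rank(w, TOY_QUERIES), dim=3)
    print("weights:", [round(w, 3) for w in weights], "MRR:", round(score, 3))
```

In practice the three score columns would come from the individual recommendation methods, and the objective would be evaluated on held-out annotated headings rather than on toy values.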

CLC Number: 

  • TP311.5