Computer Science ›› 2021, Vol. 48 ›› Issue (11): 300-306. doi: 10.11896/jsjkx.210300266

• Artificial Intelligence •

Study on Text Retrieval Based on Pre-training and Deep Hash

ZOU Ao, HAO Wen-ning, JIN Da-wei, CHEN Gang, TIAN Yuan   

  1. Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210000, China
  • Received: 2021-03-26  Revised: 2021-05-23  Online: 2021-11-15  Published: 2021-11-10
  • Corresponding author: HAO Wen-ning (hwnbox@163.com)
  • About author: ZOU Ao, born in 1997, postgraduate (3231954713@qq.com). His main research interests include machine learning and natural language processing.
    HAO Wen-ning, born in 1971, Ph.D, professor, Ph.D supervisor. His main research interests include big data and machine learning.
  • Supported by:
    National Natural Science Foundation of China (61806221).

Abstract: To address the low efficiency and accuracy of text retrieval, a retrieval model based on a pre-trained language model and deep hashing is proposed. The model first introduces the textual prior knowledge contained in a pre-trained language model through transfer learning, then performs feature extraction to transform each input into a high-dimensional vector representation. A hash learning layer is appended to the back end of the model, and the model's parameters are fine-tuned against a purpose-designed optimization objective, so that the hash function and a unique hash code for each input are learned dynamically during training. Experiments show that the method improves retrieval accuracy over the baseline models by at least 21.70% on the top-5 metric and 21.38% on the top-10 metric, and that the introduction of hash codes speeds up retrieval 40-fold at the cost of only a 4.78% drop in accuracy. The method therefore significantly improves both retrieval accuracy and efficiency, and has promising applications in the field of text retrieval.

Key words: Deep hash, Deep learning, Pre-trained language model, Similarity retrieval

CLC number: TP391.1
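
To make the pipeline described in the abstract concrete, the sketch below shows one way such a model could be assembled: a pre-trained encoder followed by a hash learning layer, an illustrative training objective, and a brute-force Hamming-distance ranker. This is a minimal sketch under stated assumptions, not the paper's implementation: the bert-base-uncased checkpoint, the 128-bit code length, [CLS] pooling, and the hash_loss objective are all illustrative choices, since the paper's exact optimization objective is not reproduced on this page.

```python
# Minimal sketch (NOT the paper's implementation) of the pipeline the
# abstract outlines: pre-trained encoder -> hash layer -> binary codes
# -> Hamming-distance search. Model name, code length, pooling strategy,
# and loss are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoModel


class DeepHashRetriever(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_bits=128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # Hash layer: tanh keeps activations in (-1, 1) so that a simple
        # sign/threshold step yields a binary code after training.
        self.hash_layer = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, num_bits),
            nn.Tanh(),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]    # [CLS] pooling (assumed)
        return self.hash_layer(cls_vec)          # relaxed codes in (-1, 1)

    @torch.no_grad()
    def encode_binary(self, input_ids, attention_mask):
        """Discrete boolean codes used at retrieval time."""
        return self.forward(input_ids, attention_mask) > 0


def hash_loss(codes, labels, alpha=0.1):
    """Illustrative objective only: pull codes of same-label pairs together,
    push different-label pairs apart, and penalize distance from {-1, +1}
    (a quantization term), so codes stay nearly binary while training."""
    sim = (labels[:, None] == labels[None, :]).float()  # pairwise similarity
    dot = codes @ codes.t() / codes.shape[1]            # normalized, in [-1, 1]
    pair_loss = ((dot - (2 * sim - 1)) ** 2).mean()
    quant_loss = ((codes.abs() - 1) ** 2).mean()
    return pair_loss + alpha * quant_loss


def hamming_topk(query_code: np.ndarray, db_codes: np.ndarray, k: int = 10):
    """Rank boolean database codes by Hamming distance to the query code."""
    dists = (db_codes != query_code).sum(axis=1)  # differing-bit count
    return np.argsort(dists)[:k]
```

In a deployed system the boolean codes would additionally be packed into machine words so that each Hamming distance reduces to a handful of XOR-plus-popcount operations per document; that bitwise comparison, together with the short code length, is what makes a speed-up on the order of the reported 40x over full-precision vector similarity plausible.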