Computer Science ›› 2021, Vol. 48 ›› Issue (11): 300-306.doi: 10.11896/jsjkx.210300266

• Artificial Intelligence •

Study on Text Retrieval Based on Pre-training and Deep Hash

ZOU Ao, HAO Wen-ning, JIN Da-wei, CHEN Gang, TIAN Yuan   

  1. Command & Control Engineering College,Army Engineering University of PLA,Nanjing 210000,China
  • Received:2021-03-26 Revised:2021-05-23 Online:2021-11-15 Published:2021-11-10
  • About author:ZOU Ao,born in 1997,postgraduate.His main research interests include machine learning and natural language processing.
    HAO Wen-ning,born in 1971,Ph.D,professor,Ph.D supervisor.His main research interests include big data and machine learning.
  • Supported by:
    National Natural Science Foundation of China(61806221).

Abstract: To address the low efficiency and accuracy of text retrieval, a retrieval model combining a pre-trained language model with deep hashing is proposed. First, prior knowledge of text encoded in the pre-trained language model is introduced via transfer learning, and each input is transformed into a high-dimensional vector representation through feature extraction. A hash learning layer is then appended to the back end of the model; by optimizing a task-specific objective, the model parameters are fine-tuned so that the hash function, and a distinct hash code for each input, are learned dynamically during training. Experimental results show that the retrieval accuracy of the proposed method exceeds that of the benchmark models by at least 21.70% at top-5 and 21.38% at top-10. Introducing hash codes speeds up retrieval by a factor of 40 at the cost of only a 4.78% loss in accuracy. The method therefore improves both retrieval accuracy and efficiency significantly, and has promising applications in the field of text retrieval.
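The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pre-trained encoder is replaced by random vectors standing in for its output embeddings, the hash layer is a single random linear projection (in the paper its weights would be fine-tuned with the task-specific objective), and retrieval ranks documents by Hamming distance over the learned binary codes, which is what yields the reported speedup over dense-vector comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained encoder outputs: in the paper a pre-trained
# language model (e.g. BERT) maps each text to a high-dimensional vector.
embed_dim, n_bits = 768, 64
doc_embeddings = rng.standard_normal((1000, embed_dim))

# Hash learning layer: a linear projection whose weights would be fine-tuned
# during training (with tanh for differentiability); sign() binarizes at
# inference to produce a compact code per input.
W = rng.standard_normal((embed_dim, n_bits)) / np.sqrt(embed_dim)

def hash_codes(x):
    # Binarize the projected embeddings into 0/1 hash codes.
    return (x @ W > 0).astype(np.uint8)

doc_codes = hash_codes(doc_embeddings)  # shape (1000, 64)

def hamming_topk(query_code, codes, k=5):
    # Hamming distance = number of differing bits; with packed bits this
    # reduces to XOR + popcount, far cheaper than float dot products.
    dists = np.count_nonzero(codes != query_code, axis=1)
    return np.argsort(dists)[:k]

# A slightly perturbed copy of document 42 should hash to a nearby code
# and be retrieved among the top-5 results.
query = doc_embeddings[42] + 0.01 * rng.standard_normal(embed_dim)
top5 = hamming_topk(hash_codes(query[None])[0], doc_codes, k=5)
```

Because similar inputs fall on the same side of most projection hyperplanes, their codes differ in few bits, while unrelated documents sit at an expected distance of half the code length; this gap is what makes the coarse Hamming ranking reliable despite the heavy quantization.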

Key words: Deep hash, Deep learning, Pre-trained language model, Similarity retrieval

CLC Number: TP391.1