Computer Science ›› 2024, Vol. 51 ›› Issue (7): 40-48. doi: 10.11896/jsjkx.231000143

• Database & Big Data & Data Science •

  • Corresponding author: ZHAO Yijing (yijing@iscas.ac.cn)
  • About author: (yumeng@iscas.ac.cn)

Advances in SQL Intelligent Synthesis Technology

LIU Yumeng1,2, ZHAO Yijing1,2, WANG Bicong1, WANG Chao1, ZHANG Baomin1   

  1 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  2 University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2023-10-19 Revised: 2024-03-29 Online: 2024-07-15 Published: 2024-07-10
  • About author: LIU Yumeng, born in 1989, Ph.D. His main research interests include database technology, time series data analysis and data mining.
    ZHAO Yijing, born in 1994, Ph.D. candidate. Her main research interests include database technology and data mining.


Abstract: In recent years, with the rapid development of technologies such as big data and cloud computing, the generation of large-scale data has deepened the dependence of applications on database technology. However, traditional databases are typically operated through the formal query language SQL, whose complex syntax is difficult for users without programming or database experience and therefore limits the accessibility of databases across many fields. The rapid advancement of artificial intelligence technologies such as machine learning and deep neural networks, and especially the surge of interest in large language models sparked by the emergence of ChatGPT, has driven a deep integration of databases with artificial intelligence and a corresponding technological transformation. Intelligent methods automatically synthesize SQL from user input, meeting the operational needs of database users with varying levels of expertise and improving the intelligence, environmental adaptability, and user-friendliness of databases. To survey the latest research progress in intelligent SQL synthesis, this paper organizes the field by three types of user input (example-based, text-based, and voice-based) and describes in detail the research trajectory, representative works, and latest advances of the corresponding intelligent synthesis models. It also summarizes and compares the technical frameworks of these methods. Finally, it provides an overall summary and discusses future development directions in light of the problems and challenges facing existing methods.

Key words: Database technology, Intelligent SQL synthesis, Semantic parsing, SQL syntax, Large language models
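
To make the example-based input type mentioned in the abstract concrete, the following is a minimal sketch of query synthesis from an input-output example. It is not the method of any system surveyed in the paper. It is a toy brute-force enumerator, and the function names (`build_demo_db`, `synthesize_query`), the `employees` table, and the restricted query shape (`SELECT ... FROM ... WHERE column op constant`) are all assumptions made for illustration.

```python
import itertools
import sqlite3


def build_demo_db():
    """Create a small in-memory table to synthesize against (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [("Ann", "db", 120), ("Bob", "ml", 90), ("Cid", "db", 80), ("Dee", "ml", 150)],
    )
    return conn


def synthesize_query(conn, table, expected_rows):
    """Enumerate 'SELECT cols FROM table WHERE col op ?' candidates and return
    (sql, constant) for the first one whose result matches expected_rows,
    ignoring row order. Returns None if no candidate matches.

    Assumption: the table name is trusted, since identifiers are interpolated
    directly into the SQL text in this sketch.
    """
    if not expected_rows:
        return None
    cur = conn.cursor()
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in cur.execute(f"PRAGMA table_info({table})").fetchall()]
    width = len(expected_rows[0])
    target = sorted(expected_rows)
    for projection in itertools.permutations(columns, width):
        select = ", ".join(projection)
        for col in columns:
            # Candidate constants are the values that actually occur in the column.
            constants = [
                r[0] for r in cur.execute(f"SELECT DISTINCT {col} FROM {table}").fetchall()
            ]
            for op, const in itertools.product(("=", ">", "<"), constants):
                sql = f"SELECT {select} FROM {table} WHERE {col} {op} ?"
                rows = cur.execute(sql, (const,)).fetchall()
                if sorted(rows) == target:
                    return sql, const
    return None


if __name__ == "__main__":
    demo = build_demo_db()
    # The user supplies only the desired output rows, not the query.
    print(synthesize_query(demo, "employees", [("Ann",), ("Cid",)]))
```

Here the user provides only the rows they want back, and the search finds a query that reproduces them. Exhaustive enumeration of this kind grows quickly with schema size and query complexity, so it should be read only as an illustration of the problem setting.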

CLC Number: TP315