Computer Science ›› 2020, Vol. 47 ›› Issue (9): 60-66. doi: 10.11896/jsjkx.190800138

• Database & Big Data & Data Science •

Natural Language Interface for Databases with Content-based Table Column Embeddings

TIAN Ye1, SHOU Li-dan1,2, CHEN Ke1,2, LUO Xin-yuan1,2, CHEN Gang1,2

  1 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
  2 Key Laboratory of Big Data Intelligent Computing of Zhejiang Province, Hangzhou 310027, China
  • Received: 2019-08-28 Published: 2020-09-10
  • Corresponding author: CHEN Ke (chenk@zju.edu.cn)
  • About author: TIAN Ye (tianye_zju@zju.edu.cn), born in 1996, postgraduate. His main research interests include knowledge graphs and natural language processing.
    CHEN Ke, born in 1977, Ph.D., associate professor. Her main research interests include spatio-temporal data management, Web data mining, and data privacy protection.
  • Supported by:
    National Key R&D Program of China (2017YFB1201001), National Natural Science Foundation of China (61672455), and Natural Science Foundation of Zhejiang Province, China (LY18F020005).

Abstract: Converting natural language into query statements that a database can execute is a core problem in intelligent interaction and human-computer dialogue systems. It is also a difficulty in connecting the big data application support platform for new power-supply trains to its application platforms, and in building personalized operation and maintenance systems for urban rail trains. Existing neural network-based methods either do not use the semantically rich content of data tables or use it only partially, which limits query accuracy. This paper studies how to improve the query accuracy of natural language query interfaces when table content is included in the input. It proposes a table column embedding method based on table content, which embeds each column using the content stored in that column, and builds a new embedding layer structure on this method. It also proposes a data augmentation method based on table content, which generates new training samples by replacing the attribute values in a query with other records from the same column of the table. Comparative experiments on the WikiSQL dataset evaluate both methods. The results show that, relative to the current best-performing model, each method used alone improves query accuracy by 0.6%~0.8%, and the two used together improve it by nearly 1%, indicating that the proposed column embedding and data augmentation methods improve query accuracy.

Key words: Database query, Natural language processing, SQL, Word embedding
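The data augmentation idea described in the abstract (swapping an attribute value in a query for other records from the same table column) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `augment_sample`, the sample format (a question string paired with a SQL string), and the plain string replacement are all assumptions made for illustration only.

```python
import random
from typing import Dict, List, Tuple


def augment_sample(
    question: str,
    sql: str,
    column: str,
    value: str,
    table: Dict[str, List[str]],
    n: int,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Return up to n new (question, sql) pairs in which `value` is replaced
    by other values stored in the same column of the table.

    Hypothetical helper: the sample format and the string-level replacement
    are simplifying assumptions, not the method from the paper.
    """
    # Only augment when the attribute value appears in both strings.
    if value not in question or value not in sql:
        return []
    # Other distinct records from the same column (sorted for determinism).
    candidates = sorted({v for v in table[column] if v != value})
    chosen = rng.sample(candidates, min(n, len(candidates)))
    # Naive replacement: swaps every occurrence of the value.
    return [(question.replace(value, v), sql.replace(value, v)) for v in chosen]


# Toy table, stored as column name -> list of cell values.
table = {"player": ["Tom Brady", "Joe Montana", "Dan Marino"]}
pairs = augment_sample(
    "Which team did Tom Brady play for?",
    "SELECT team FROM t WHERE player = 'Tom Brady'",
    "player",
    "Tom Brady",
    table,
    n=2,
    rng=random.Random(0),
)
for new_question, new_sql in pairs:
    print(new_question, "|", new_sql)
```

Plain string replacement is the simplest choice here; a real pipeline would likely need token-level matching so that a value that is a substring of another word is not replaced by accident.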

CLC Number: 

  • TP391.1