Computer Science, 2018, Vol. 45, Issue (9): 104-112. doi: 10.11896/j.issn.1002-137X.2018.09.016

• The 16th National Conference on Software and Applications •

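Design and Implementation of Distributed Full-text Search Framework Based on Spark SQL

CUI Guang-fan(1,2), XU Li-jie(2), LIU Jie(2), YE Dan(2), ZHONG Hua(2)

  1. University of Chinese Academy of Sciences, Beijing 100049, China
  2. Institute of Software, Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2017-10-11 Online: 2018-09-20 Published: 2018-10-10
  • Corresponding author: XU Li-jie (born 1987), male, Ph.D., assistant researcher, CCF member. His main research interest is big data systems. E-mail: xulijie@iscas.ac.cn
  • About the authors: CUI Guang-fan (born 1991), male, master's student; his main research interest is distributed computing. LIU Jie (born 1982), male, Ph.D., associate researcher, CCF member; his main research interest is big data mining and analysis. YE Dan (born 1971), female, Ph.D., researcher, CCF senior member; her main research interest is big data mining and analysis. ZHONG Hua (born 1971), male, Ph.D., researcher, CCF member; his main research interests are distributed computing and software engineering.
  • Funding:
    This work was supported by the Beijing Municipal Science and Technology Major Project (D171100003417002).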

Abstract: As informatization deepens, big data has generated great value in many fields, and storing and rapidly analyzing massive data sets has become a new challenge. Traditional relational databases struggle to meet big-data storage and analysis needs because of their limited performance and scalability and their high cost. Spark SQL is a data analysis tool built on Spark, a big-data processing framework. It currently supports the TPC-DS benchmark and has become an alternative to the traditional data warehouse in big-data settings. Full-text search, an effective method of text search, can be combined with ordinary query operations to provide richer query and analysis capabilities. At present, Spark SQL supports only simple query operations and does not support full-text search. To meet the needs of both migrated legacy workloads and existing workloads, this paper proposes a distributed full-text search framework made up of four modules (SQL grammar, an SQL translation framework, full-text search parallelization, and search optimization) and implements it on Spark SQL. Experimental results show that, under the two search optimization strategies, the framework reduces index construction time to 0.6%/0.5% and query time to 1%/10% of those of a traditional database, and reduces index storage to 55.0% of that of the traditional database.

Key words: Full-text search, Search optimization, Search parallelism, Spark SQL, Translation framework

CLC Number:

  • TP391.3