Computer Science (计算机科学), 2018, Vol. 45, Issue (9): 104-112. doi: 10.11896/j.issn.1002-137X.2018.09.016
崔光范1,2, 许利杰2, 刘杰2, 叶丹2, 钟华2
CUI Guang-fan1,2, XU Li-jie2, LIU Jie2, YE Dan2, ZHONG Hua2
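Abstract: As informatization deepens, big data has generated enormous value in every field, and storing and rapidly analyzing massive data has become a new challenge. Traditional relational databases struggle to meet the storage and analysis needs of big data because of their limited performance and scalability and their high cost. Spark SQL is a data analysis tool built on the big data processing framework Spark; it now supports the TPC-DS benchmark and has become an alternative to traditional data warehouses in the big data setting. Full-text search, an effective means of text search, can be combined with ordinary query operations to provide richer query and analysis capabilities. At present, however, Spark SQL supports only simple query operations and does not support full-text search. To meet the needs of both migrated legacy workloads and existing workloads, this paper proposes a distributed full-text search framework comprising four modules (an SQL grammar, an SQL translation and conversion framework, full-text search parallelization, and search optimization) and implements it on Spark SQL. Experimental results show that, compared with a traditional database and under the two search optimization strategies, the framework reduces index construction time to 0.6%/0.5% and query time to 1%/10% of the traditional database's, and reduces index storage to 55.0% of the traditional database's.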