Computer Science ›› 2018, Vol. 45 ›› Issue (9): 104-112. doi: 10.11896/j.issn.1002-137X.2018.09.016

• NASAC 2017 •

Design and Implementation of Distributed Full-text Search Framework Based on Spark SQL

CUI Guang-fan1,2, XU Li-jie2, LIU Jie2, YE Dan2, ZHONG Hua2   

  1. University of Chinese Academy of Sciences, Beijing 100049, China
  2. Institute of Software, Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2017-10-11 Online: 2018-09-20 Published: 2018-10-10

Abstract: With the development of information technology, big data has generated great value in various fields, and storing huge volumes of data and analyzing them rapidly have become new challenges. Traditional relational databases struggle to meet the needs of big data storage and analysis because of their limited performance and scalability and their high cost. Spark SQL is a data analysis tool built on Spark, a big data processing framework. Spark SQL currently supports the TPC-DS benchmark and has become an alternative to the traditional data warehouse in big data settings. Full-text search, an effective method of text retrieval, can be combined with general query operations to provide richer query and analysis capabilities. However, Spark SQL does not currently support full-text search. To meet the needs of both migrated traditional workloads and existing workloads, this paper proposes a distributed full-text search framework for Spark SQL, covering the design and implementation of four modules: SQL grammar, SQL translation framework, full-text search parallelism, and search optimization. Experimental results show that, under the two search optimization strategies, the index construction time and query time of this framework are reduced to 0.6%/0.5% and 1%/10%, respectively, of those of a traditional database, and the index storage volume is reduced to 55.0%.

Key words: Full-text search, Search optimization, Search parallelism, Spark SQL, Translation framework

CLC Number: 

  • TP391.3