Computer Science, 2018, Vol. 45, Issue (9): 104-112. doi: 10.11896/j.issn.1002-137X.2018.09.016

• NASAC 2017 •

Design and Implementation of Distributed Full-text Search Framework Based on Spark SQL

CUI Guang-fan1,2, XU Li-jie2, LIU Jie2, YE Dan2, ZHONG Hua2   

  1. University of Chinese Academy of Sciences, Beijing 100049, China
  2. Institute of Software, Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2017-10-11 Online: 2018-09-20 Published: 2018-10-10

Abstract: With the development of information technology, big data has generated great value in many fields, and storing and rapidly analyzing massive data have become new challenges. Traditional relational databases struggle to meet the needs of big data storage and analysis because of their limited performance and scalability and their high cost. Spark SQL is a data analysis tool built on Spark, a big data processing framework. Spark SQL currently supports the TPC-DS benchmark and has become an alternative to the traditional data warehouse in big data settings. Full-text search, an effective method of text retrieval, can be combined with general query operations to provide richer query and analysis capabilities. However, Spark SQL does not currently support full-text search. To support both the migration of traditional business and the needs of existing business, this paper proposes a distributed full-text search framework for Spark SQL, covering the design and implementation of four modules: SQL grammar, the SQL translation framework, full-text search parallelism, and search optimization. Experimental results show that, under the two search optimization strategies, the index construction time and query time of the framework are reduced to 0.6%/0.5% and 1%/10%, respectively, of those of a traditional database, and the index storage volume is reduced to 55.0%.

Key words: Full-text search, Search optimization, Search parallelism, Spark SQL, Translation framework

CLC Number: 

  • TP391.3