计算机科学 ›› 2018, Vol. 45 ›› Issue (6): 51-56.doi: 10.11896/j.issn.1002-137X.2018.06.009
• 第十四届全国Web信息系统及其应用学术会议 • 上一篇 下一篇
王占兵, 宋伟, 彭智勇, 杨先娣, 崔一辉, 申远
WANG Zhan-bing, SONG Wei, PENG Zhi-yong, YANG Xian-di, CUI Yi-hui, SHEN Yuan
摘要: 精准医疗是一种强烈依赖病人基因组分析结果的医疗模式,而子串检索是执行基因组分析的重要方法。近年来,基因数据的数据量急剧增长,其存储代价和处理复杂度已远超医疗方可承受的范围。于是,利用云服务提供商廉价的存储设备和强大的计算能力,将基因数据托管至云服务提供商成为切实可行的解决方案。考虑到云服务提供商并不完全可信,在数据上传至云端之前执行数据加密是保证数据安全性和隐私性的有效方法。然而,如何基于加密数据执行序列检索成为亟待解决的问题。针对这一问题,对基因数据处理和密文检索领域进行调研,提出采用q-gram技术对序列数据的定长窗口创建前缀签名的方案,并在执行查询时在每个窗口中完成前缀查询的解决方案。在子序列查询过程中,云端并不能获取用户数据明文。最后通过实验验证了所提方案具有较好的性能和存储开销,例如当窗口大小为100且q取6时,对100000长序列串执行构建索引耗时15.06 s。与GPSE相比,所提方法的性能更优。
中图分类号:
[1]WHEELER D A,SRINIVASAN M,EGHOLM M,et al.The complete genome of an individual by massively parallel DNA sequencing[J].Nature,2008,452(7189):872-876. [2]WANG J Y,WANG B,YANG X C.Efficient compressed genomic data oriented query approach[J].Journal of Software,2016,27(7):1715-1728.(in Chinese) 王佳英,王斌,杨晓春.面向压缩生物基因数据的高效的查询方法[J].软件学报,2016,27(7):1715-1728. [3]SCHNEEBERGER K,HAGMANN J,OSSOWSKI S,et al. Simultaneous alignment of short reads against multiple genomes[J].Genome biology,2009,10(9):98. [4]FERRADA H,GAGIE T,HIRVOLA T,et al.Hybrid indexes for repetitive datasets[J].Philosophical Transactions of the Royal Society a Mathematical Physical and Engineering Scien-ces,2013,372(372):20130137. [5]KOBORI Y,MIZUTA S.Similarity Estimation Between DNA Sequences Based on Local Pattern Histograms of Binary Images[J].Genomics,Proteomics & Bioinformatics,2016,14(2):103-112. [6]CLAUDE F,FARINA A,MARTÍNEZ-PRIETO M A,et al. Compressed q-gram indexing for highly repetitive biological sequences[C]//IEEE International Conference on Bioinformatics and Bioengineering.2010:86-91. [7]YANG X,WANG B,LI C,et al.Efficient direct search on compressed genomic data[C]//IEEE International Confernece on Data Engineering.2013:961-972. [8]AYDAY E,HUBAUX J P.Privacy and Security in the Genomic Era[C]//Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.ACM,2016:1863-1865. [9]AZIZ A,MOMIN M,HASAN M Z,et al.Secure and efficient multiparty computation on genomic data[C]//Proceedings of the 20th International Database Engineering & Applications Symposium.ACM,2016:278-283. [10]BANERJEE S S,ATHREYA A P,MAINZER L S,et al.Efficient and Scalable Workflows for Genomic Analyses[C]//Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing.ACM,2016:27-36. [11]CERI S,KAITOUA A,MASSEROLI M,et al.Data Management for Next Generation Genomic Computing[C]//EDBT.2016:485-490. [12]FRIZZO-BARKER J,CHOW-WHITE P A,CHARTERS A,et al.Genomic Big Data and Privacy:Challenges and Opportunities for Precision Medicine[J].Computer Supported Cooperative Work (CSCW),2016,25(2-3):115-136. [13]QIN Y,YALAMANCHILI H K,QIN J,et al.The current status and challenges in computational analysis of genomic big data[J].Big Data Research,2015,2(1):12-18. [14]KANG S,AUNG K M M,VEERAVALLI B.Towards Secure and Fast Mapping of Genomic Sequences on Public Clouds[C]//Proceedings of the 4th ACM International Workshop on Security in Cloud Computing.ACM,2016:59-66. [15]KOBORI Y,MIZUTA S.Similarity Estimation Between DNA Sequences Based on Local Pattern Histograms of Binary Images[J].Genomics,Proteomics & Bioinformatics,2016,14(2):103-112. [16]CLAUDE F,FARINA A,MARTÍNEZ-PRIETO M A,et al. Compressed q-gram indexing for highly repetitive biological sequences[C]//IEEE International Conference on Bioinformatics and Bioengineering.2010:86-91. [17]SONG W,WANG B,WANG Q,et al.A privacy-preserved full-text retrieval algorithm over encrypted data for cloud storage applications[J].Journal of Parallel and Distributed Computing,2017,99:14-27. [18]WANG B,SONG W,LOU W,et al.Inverted index based multi-keyword public-key searchable encryption with strong privacy guarantee[C]//IEEE Conference on Computer Communiations.2015:2092-2100. [19]WANG D,JIA X,WANG C,et al.Generalized pattern matching string search on encrypted data in cloud systems[C]//IEEE Conference on Computer Communiations.2015:2101-2109. |
[1] | 崔光范, 许利杰, 刘杰, 叶丹, 钟华. 基于Spark SQL的分布式全文检索框架的设计与实现 Design and Implementation of Distributed Full-text Search Framework Based on Spark SQL 计算机科学, 2018, 45(9): 104-112. https://doi.org/10.11896/j.issn.1002-137X.2018.09.016 |
[2] | 秦杰,宋金玉,张广星. 基于Lucene的本地搜索引擎研究与实现 Research and Application of Local Search Engine Based on Lucene 计算机科学, 2014, 41(Z11): 368-370. |
[3] | 霍林,黄保华,鲍洋,胡和平. 用于对等全文检索的安全覆盖网 Secure Overlay Network for Peer-to-Peer Full Text Search 计算机科学, 2011, 38(1): 104-106. |
[4] | 申展 江宝林 陈祎 唐磊 胡运发. 全文检索模型综述 计算机科学, 2004, 31(5): 61-64. |
[5] | 饶祎 郭辉 蔡庆生. 一种基于全文检索系统的文档关联研究与实现 计算机科学, 2003, 30(12): 78-79. |
[6] | 邹涛 王继成. 文本信息检索技术 计算机科学, 1999, 26(9): 72-75. |
|