计算机科学 ›› 2018, Vol. 45 ›› Issue (6): 51-56.doi: 10.11896/j.issn.1002-137X.2018.06.009

• 第十四届全国Web信息系统及其应用学术会议 • 上一篇    下一篇

一种面向密文基因数据的子序列外包查询方法

王占兵, 宋伟, 彭智勇, 杨先娣, 崔一辉, 申远   

  1. 武汉大学计算机学院 武汉430070
  • 出版日期:2018-06-15 发布日期:2018-07-24
  • 作者简介:王占兵(1992-),男,硕士生,主要研究方向为云安全,E-mail:bingo711x@whu.edu.cn;宋 伟(1978-),男,副教授,主要研究方向为云安全、大数据安全、应用密码学等,E-mail:songwei@whu.edu.cn(通信作者);彭智勇(1963-),男,教授,主要研究方向为数据库等,E-mail:peng@whu.edu.cn;杨先娣(1974-),女,副教授,主要研究方向为可信数据管理等,E-mail:xiandiy@whu.edu.cn;崔一辉(1981-),男,博士生,主要研究方向为可信数据管理等,E-mail:cuiyihui@whu.edu.cn;申 远(1983-),男,博士生,主要研究方向为隐私保护等,E-mail:shenyuan@whu.edu.cn
  • 基金资助:
    本文受国家自然科学基金(61232002,61572378)资助

Subsequence Outsourcing Query Method over Encrypted Genomic Data

WANG Zhan-bing, SONG Wei, PENG Zhi-yong, YANG Xian-di, CUI Yi-hui, SHEN Yuan   

  1. School of Computer,Wuhan University,Wuhan 430070,China
  • Online:2018-06-15 Published:2018-07-24

摘要: 精准医疗是一种强烈依赖病人基因组分析结果的医疗模式,而子串检索是执行基因组分析的重要方法。近年来,基因数据的数据量急剧增长,其存储代价和处理复杂度已远超医疗方可承受的范围。于是,利用云服务提供商廉价的存储设备和强大的计算能力,将基因数据托管至云服务提供商成为切实可行的解决方案。考虑到云服务提供商并不完全可信,在数据上传至云端之前执行数据加密是保证数据安全性和隐私性的有效方法。然而,如何基于加密数据执行序列检索成为亟待解决的问题。针对这一问题,对基因数据处理和密文检索领域进行调研,提出采用q-gram技术对序列数据的定长窗口创建前缀签名的方案,并在执行查询时在每个窗口中完成前缀查询的解决方案。在子序列查询过程中,云端并不能获取用户数据明文。最后通过实验验证了所提方案具有较好的性能和存储开销,例如当窗口大小为100且q取6时,对100000长序列串执行构建索引耗时15.06 s。与GPSE相比,所提方法的性能更优。

关键词: 精准医疗, 密文查询, 全文检索, 子序列检索

Abstract: Precision medicine is a medical model that relies heavily on patient genome analysis.The subsequence search plays an important role in performing genome analysis.Recently,the amount of genomic data are increasing dramatically,and the storage cost and processing complexity of them have been far beyond the capacity of hospitals.So,utilizing the powerful cloud computing capability to analyze and process such massive genomic sequence data is becoming popular.Considering that cloud service provider is not completely trusted,encrypting genomic data before uploading is a straightforward and effective solution to guarantee the privacy and security of DNA sequence data.However,how to perform queries over the encrypted genomic sequence data becomes another difficult problem.To address this problem,this paper made a detailed survey on genomic data processing and full-text retrieval fields.It constructed indexes on fix-length windows of the genomic sequence using q-gram mapping,and performed queries in every window.If the query sequence is the prefix of any window in genomic sequence,the query hits.Throughout all the processes,cloud service provider stores indexes and performs subsequence query,without obtaining any privacy details.Moreover,this paper set up the system model and several security assumptions,and proved their security.Experiments were carried out to evaluate the performance of scheme on a public dataset.The results show that the proposed solution achieves better performance in time cost and storage cost,i.e.when w is 100 and q is 6,the building index algorithm costs 15.60s for sequence of 100000 length.Compared with GPSE,the proposed solution has higher execution efficiency in performing queries.

Key words: Ciphertext query, Full-text query, Precision medicine, Subsequence query

中图分类号: 

  • TP309.2
[1]WHEELER D A,SRINIVASAN M,EGHOLM M,et al.The complete genome of an individual by massively parallel DNA sequencing[J].Nature,2008,452(7189):872-876.
[2]WANG J Y,WANG B,YANG X C.Efficient compressed genomic data oriented query approach[J].Journal of Software,2016,27(7):1715-1728.(in Chinese)
王佳英,王斌,杨晓春.面向压缩生物基因数据的高效的查询方法[J].软件学报,2016,27(7):1715-1728.
[3]SCHNEEBERGER K,HAGMANN J,OSSOWSKI S,et al. Simultaneous alignment of short reads against multiple genomes[J].Genome biology,2009,10(9):98.
[4]FERRADA H,GAGIE T,HIRVOLA T,et al.Hybrid indexes for repetitive datasets[J].Philosophical Transactions of the Royal Society a Mathematical Physical and Engineering Scien-ces,2013,372(372):20130137.
[5]KOBORI Y,MIZUTA S.Similarity Estimation Between DNA Sequences Based on Local Pattern Histograms of Binary Images[J].Genomics,Proteomics & Bioinformatics,2016,14(2):103-112.
[6]CLAUDE F,FARINA A,MARTÍNEZ-PRIETO M A,et al. Compressed q-gram indexing for highly repetitive biological sequences[C]//IEEE International Conference on Bioinformatics and Bioengineering.2010:86-91.
[7]YANG X,WANG B,LI C,et al.Efficient direct search on compressed genomic data[C]//IEEE International Confernece on Data Engineering.2013:961-972.
[8]AYDAY E,HUBAUX J P.Privacy and Security in the Genomic Era[C]//Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.ACM,2016:1863-1865.
[9]AZIZ A,MOMIN M,HASAN M Z,et al.Secure and efficient multiparty computation on genomic data[C]//Proceedings of the 20th International Database Engineering & Applications Symposium.ACM,2016:278-283.
[10]BANERJEE S S,ATHREYA A P,MAINZER L S,et al.Efficient and Scalable Workflows for Genomic Analyses[C]//Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing.ACM,2016:27-36.
[11]CERI S,KAITOUA A,MASSEROLI M,et al.Data Management for Next Generation Genomic Computing[C]//EDBT.2016:485-490.
[12]FRIZZO-BARKER J,CHOW-WHITE P A,CHARTERS A,et al.Genomic Big Data and Privacy:Challenges and Opportunities for Precision Medicine[J].Computer Supported Cooperative Work (CSCW),2016,25(2-3):115-136.
[13]QIN Y,YALAMANCHILI H K,QIN J,et al.The current status and challenges in computational analysis of genomic big data[J].Big Data Research,2015,2(1):12-18.
[14]KANG S,AUNG K M M,VEERAVALLI B.Towards Secure and Fast Mapping of Genomic Sequences on Public Clouds[C]//Proceedings of the 4th ACM International Workshop on Security in Cloud Computing.ACM,2016:59-66.
[15]KOBORI Y,MIZUTA S.Similarity Estimation Between DNA Sequences Based on Local Pattern Histograms of Binary Images[J].Genomics,Proteomics & Bioinformatics,2016,14(2):103-112.
[16]CLAUDE F,FARINA A,MARTÍNEZ-PRIETO M A,et al. Compressed q-gram indexing for highly repetitive biological sequences[C]//IEEE International Conference on Bioinformatics and Bioengineering.2010:86-91.
[17]SONG W,WANG B,WANG Q,et al.A privacy-preserved full-text retrieval algorithm over encrypted data for cloud storage applications[J].Journal of Parallel and Distributed Computing,2017,99:14-27.
[18]WANG B,SONG W,LOU W,et al.Inverted index based multi-keyword public-key searchable encryption with strong privacy guarantee[C]//IEEE Conference on Computer Communiations.2015:2092-2100.
[19]WANG D,JIA X,WANG C,et al.Generalized pattern matching string search on encrypted data in cloud systems[C]//IEEE Conference on Computer Communiations.2015:2101-2109.
[1] 崔光范, 许利杰, 刘杰, 叶丹, 钟华.
基于Spark SQL的分布式全文检索框架的设计与实现
Design and Implementation of Distributed Full-text Search Framework Based on Spark SQL
计算机科学, 2018, 45(9): 104-112. https://doi.org/10.11896/j.issn.1002-137X.2018.09.016
[2] 秦杰,宋金玉,张广星.
基于Lucene的本地搜索引擎研究与实现
Research and Application of Local Search Engine Based on Lucene
计算机科学, 2014, 41(Z11): 368-370.
[3] 霍林,黄保华,鲍洋,胡和平.
用于对等全文检索的安全覆盖网
Secure Overlay Network for Peer-to-Peer Full Text Search
计算机科学, 2011, 38(1): 104-106.
[4] 申展 江宝林 陈祎 唐磊 胡运发.
全文检索模型综述

计算机科学, 2004, 31(5): 61-64.
[5] 饶祎 郭辉 蔡庆生.
一种基于全文检索系统的文档关联研究与实现

计算机科学, 2003, 30(12): 78-79.
[6] 邹涛 王继成.
文本信息检索技术

计算机科学, 1999, 26(9): 72-75.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!