计算机科学 ›› 2024, Vol. 51 ›› Issue (2): 322-332.doi: 10.11896/jsjkx.230600142

• 信息安全 • 上一篇    下一篇

本地差分隐私下的高维数据发布方法

蔡梦男1,2, 沈国华1,2,3, 黄志球1,2,3, 杨阳1,2   

  1. 1 南京航空航天大学计算机科学与技术学院 南京211106
    2 南京航空航天大学高安全系统的软件开发与验证技术工业和信息化部重点实验室 南京211106
    3 软件新技术与产业化协同创新中心 南京210093
  • 收稿日期:2023-06-16 修回日期:2023-11-15 出版日期:2024-02-15 发布日期:2024-02-22
  • 通讯作者: 沈国华(ghshen@nuaa.edu.cn)
  • 作者简介:(cai_mengnan@nuaa.edu.cn)
  • 基金资助:
    国家自然科学基金(U2241216,61772270);民航应急科学与技术重点实验室开放基金(NJ2022022)

High-dimensional Data Publication Under Local Differential Privacy

CAI Mengnan1,2, SHEN Guohua1,2,3, HUANG Zhiqiu1,2,3, YANG Yang1,2   

  1. 1 College of Computer Science and Technology,Nanjing University of Aeronautics and Astronautics,Nanjing 211106,China
    2 Key Laboratory of Safety-Critical Software,Ministry of Industry and Information Technology,Nanjing University of Aeronautics and Astronautics,Nanjing 211106,China
    3 Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210093,China
  • Received:2023-06-16 Revised:2023-11-15 Online:2024-02-15 Published:2024-02-22
  • About author:CAI Mengnan,born in 2000,postgra-duate.Her main research interests include privacy protection and deep lear-ning.SHEN Guohua,born in 1976,Ph.D,associate professor.His main research interests include software engineering,privacy protection,semantic web and description logic.
  • Supported by:
    National Natural Science Foundation of China(U2241216,61772270) and Opening Fund of Key Laboratory of Civil Aviation Emergency Science&Technology(CAAC)(NJ2022022).

摘要: 从众多用户收集的高维数据可用性越来越高,庞大的高维数据涉及用户个人隐私,如何在使用高维数据的同时保护用户的隐私极具挑战性。文中主要关注本地差分隐私下的高维数据发布问题。现有的解决方案首先构建概率图模型,生成输入数据的一组带噪声的低维边缘分布,然后使用它们近似输入数据集的联合分布以生成合成数据集。然而,现有方法在计算大量属性对的边缘分布构建概率图模型,以及计算概率图模型中规模较大的属性子集的联合分布时存在局限性。基于此,提出了一种本地差分隐私下的高维数据发布方法PrivHDP(High-dimensional Data Publication Under Local Differential Privacy)。首先,该方法使用随机采样响应代替传统的隐私预算分割策略扰动用户数据,提出自适应边缘分布计算方法计算成对属性的边缘分布构建Markov网。其次,使用新的方法代替互信息度量成对属性间的相关性,引入了基于高通滤波的阈值过滤技术缩减概率图构建过程的搜索空间,结合充分三角化操作和联合树算法获得一组属性子集。最后,基于联合分布分解和冗余消除,计算属性子集上的联合分布。在4个真实数据集上进行实验,结果表明,PrivHDP 算法在k-way 查询和 SVM 分类精度方面优于同类算法,验证了所提方法的可用性与高效性。

关键词: 本地差分隐私, 高维数据, 数据发布, 边缘分布, 联合分布

Abstract: With the increasing availability of high-dimensional data collected from numerous users,preserving user privacy while utilizing high-dimensional data poses significant challenges.This paper focuses on the problem of high-dimensional data publication under local differential privacy.State-of-the-art solutions first construct probabilistic graphical models to generate a set of noisy low-dimensional marginal distributions of the input data,and then use them to approximate the joint distribution of the input dataset for generating synthetic datasets.However,existing methods have limitations in computing marginal distributions for a large number of attribute pairs to construct probabilistic graphical models,as well as in calculating joint distributions for attribute subsets within the probabilistic graphical models.To address these limitations,this paper proposes a method PrivHDP(high-dimensional data publication under local differential privacy) for high-dimensional data publication under local differential privacy.Firstly,it uses random sampling response instead of the traditional privacy budget splitting strategy to perturb user data.It proposes an adaptive marginal distribution computation method to compute the marginal distributions of pairwise attributes and construct a Markov network.Secondly,it employs a novel method to measure the correlation between pairwise attributes,replacing mutual information.This method introduces a threshold technique based on high-pass filtering to reduce the search space during the construction of the probabilistic graphical model.It combines sufficient triangulation operations and a joint tree algorithm to obtain a set of attribute subsets.Finally,based on joint distribution decomposition and redundancy elimination,the proposed method computes the joint distribution over attribute subsets.Experimental results on four real datasets demonstrate that the PrivHDP algorithm outperforms similar algorithms in terms of k-way query and SVM classification accuracy,validating its effectiveness and efficiency.

Key words: Local differential privacy, High-dimensional data, Data publication, Marginal distribution, Joint distribution

中图分类号: 

  • TP309
[1]SHEN H,ZHANG M W,SHEN J.Efficient privacy-preserving cube-data aggregation scheme for smart grids[J].IEEE Tran-sactions on Information Forensics and Security,2017,12(6):1369-1381.
[2]NAVEED M,AYDAY E,CLAYTON E W,etal.Privacy in the genomic era[J].ACM Computing Surveys(CSUR),2015,48(1):1-44.
[3]KHAN F,REHMAN A U,ZHENG J,et al.Mobile crowdsen-sing:A survey on privacy-preservation,task management,assignment models,and incentives mechanisms[J].Future Generation Computer Systems,2019,100:456-472.
[4]DWORK C.Differential privacy[C]//International colloquium on automata,languages,and programming.Berlin:Springer,2006:1-12.
[5]DWORK C,ROTH A.The algorithmic foundations ofdifferential privacy[J].Foundations and Trends© in Theoretical Computer Science,2014,9(3/4):211-407.
[6]DUCHI J C,JORDAN M I,WAINWRIGHT M J.Local privacy and statistical minimax rates[C]//2013 IEEE 54th Annual Symposium on Foundations of Computer Science.IEEE,2013:429-438.
[7]REN X B,YU C M,YU W,et al.LoPub:high-dimensionalcrowdsourced data publication with local differential privacy[J].IEEE Transactions on Information Forensics and Security,2018,13(9):2151-2166.
[8]ZHANG J,CORMODE G,PROCOPIUC C M,et al.Privbayes:Private data release via bayesian networks[J].ACM Transactions on Database Systems(TODS),2017,42(4):1-41.
[9]CHEN R,XIAO Q,ZHANG Y,et al.Differentially private high-dimensional data publication via sampling-based inference[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2015:129-138.
[10]CHEN X,LIU J,FENG X Q,et al.Differentially Private Synthetic Dataset Releasing Algorithm Based on Naive Bayes[J].Computer Science,2015,42(1):236-238.
[11]ZHANG Z K,WANG T H,LI N H,et al.PrivSyn:Differentially Private Data Synthesis[C]//30th USENIX Security Symposium(USENIX Security 21).2021:929-946.
[12]ZHANG X J,CHEN L,JIN K Z,et al.Private High-DimensionalData Publication with Junction Tree [J].Computer Research and Development,2018,55(12):2794-2809.
[13]LI H,XIONG L,JIANG X.Differentially private synthesization of multi-dimensional data using copula functions[C]//Advances in Database Technology:Proceedings.International Conference on Extending Database Technology.NIH Public Access,2014.
[14]CAI K T,LEI X Y,WEI J X,et al.DataSynthesis via Differen-tially Private Markov Random Fields[C]//Proceedings of the VLDB Endowment.New York:ACM,2021:2190-2202.
[15]GIUSEPPE V,GRACE T,MARK B,et al.New oracle-efficient algorithms for private synthetic data release[C]//International Conference on Machine Learning(ICML).Vienna:ACM,2020:9765-9774.
[16]FANTI G,PIHUR V,ERLINGSSON L.Building a RAPPORwith the Unknown:Privacy-Preserving Learning of Associations and Data Dictionaries[C]//Proceedings on Privacy Enhancing Technologies.2016:41-61.
[17]REN X B,YU C M,YU W,et al.High-dimensional crowd-sourced data distribution estimation with local privacy[C]//2016 IEEE International Conference on Computer and Information Technology(CIT).IEEE,2016:226-233.
[18]CORMODE G,KULKARNI T,SRIVASTAVA D.Marginal release under local differential privacy[C]//Proceedings of the 2018 International Conference on Management of Data.2018:131-146.
[19]WANG T,YANG X Y,REN X B,et al.Locally private high-dimensional crowdsourced data release based on copula functions[J].IEEE Transactions on Services Computing,2019,15(2):778-792.
[20]DU L K,ZHANG Z K,BAI S J,et al.AHEAD:adaptive hierarchical decomposition for range query under local differential privacy[C]//Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security.2021:1266-1288.
[21]KULKARNI T.Answering range queries under local differential privacy[C]//Proceedings of the 2019 International Conference on Management of Data.2019:1832-1834.
[22]LIU L K,ZHOU C L.RCP:Mean Value Protection Technology Under Local Differential Privacy[J].Computer Science,2023,50(2):333-345.
[23]WANG N,XIAO X K,YANG Y,et al.Collecting and analyzing multidimensional data with local differential privacy[C]//2019 IEEE 35th International Conference on Data Engineering(ICDE).IEEE,2019:638-649.
[24]WARNER S L.Randomized response:A survey technique foreliminating evasive answer bias[J].Journal of the American Statistical Association,1965,60(309):63-69.
[25]KAIROUZ P,BONAWITZ K,RAMAGE D.Discrete distribu-tion estimation under local privacy[C]//International Confe-rence on Machine Learning.PMLR,2016:2436-2444.
[26]WANG T H,BLOCKI J,LI N H,et al.Locally differentiallyprivate protocols for frequency estimation[C]//26th USENIX Security Symposium(USENIX Security 17).2017:729-745.
[27]QARDAJI W,YANG W N,LI N H.Priview:practical differentially private release of marginal contingency tables[C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.2014:1435-1446.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!