计算机科学 ›› 2021, Vol. 48 ›› Issue (8): 278-283.doi: 10.11896/jsjkx.201200122

• 信息安全 • 上一篇    下一篇

基于本地化差分隐私的键值数据关联分析

孙林, 平国楼, 叶晓俊   

  1. 清华大学软件学院 北京100084
  • 收稿日期:2020-12-13 修回日期:2021-05-13 发布日期:2021-08-10
  • 通讯作者: 叶晓俊(yexj@tsinghua.edu.cn)
  • 基金资助:
    国家重点研发计划项目(2019QY1402)

Correlation Analysis for Key-Value Data with Local Differential Privacy

SUN Lin, PING Guo-lou, YE Xiao-jun   

  1. School of Software,Tsinghua University,Beijing 100084,China
  • Received:2020-12-13 Revised:2021-05-13 Published:2021-08-10
  • About author:SUN Lin,born in 1993,Ph.D.His main research interests include privacy protection and data mining.(sunl16@mails.tsinghua.edu.cn)YE Xiao-jun,born in 1964,professor.His main research interests include cloud data management,data security and privacy,and database system testing.
  • Supported by:
    National Key Research and Development Program of China(2019QY1402).

摘要: 在群智感知系统中,从分布式数据源中持续收集和分析数据可以为先进的数据挖掘模型提供决策支持。由于数据中可能包含个人相关的信息,数据的采集和分析过程中通常伴随着隐私泄露的风险。本地化差分隐私作为先进的隐私保护方案可在用户的隐私性和数据的可用性之间提供较好的权衡。当前,键值数据作为异构类型数据,其同时含有分类数据和数值数据,基于本地化差分隐私在多维度下对键值数据进行关联分析面临着一定的挑战。针对隐私保护前提下键值数据的发布和关联分析问题,首先定义了键值数据的频率关联和均值关联问题,然后提出了适用于键值对的索引独热编码,为键值数据提供本地化差分隐私保护,最后在扰动的数据上对键值数据进行关联分析。基于仿真数据集和真实数据集的实验和理论分析验证了所提方案的有效性。

关键词: 本地化差分隐私, 关联分析, 键值数据, 均值估计, 频率估计

Abstract: Crowdsourced data from distributed sources are routinely collected and analyzed to produce effective data-mining mo-dels in crowdsensing systems.Data usually contains personal information,which leads to possible privacy leakage in data collection and analysis.The local differential privacy (LDP) has been deemed as the de facto measure for trade-off between privacy guarantee and data utility.Currently,the key-value data is a kind of heterogeneous data types in which the key is categorical data and the value is numerical data.Achieving LDP for key-value data is challenging.This paper focuses on key-value data publishing and correlation analysis under the framework of LDP.Firstly,the frequency correlation and mean correlation in key-value data are defined.Then the indexing one-hot perturbation mechanism is proposed to provide LDP guarantees.At last,the correlation results can be estimated in the perturbed space.Theoretical analysis and experimental results on both real-word and synthetic dataset va-lidate the effectiveness of proposed mechanism.

Key words: Correlation analysis, frequency estimation, Key-value data, Local differential privacy, Mean estimation

中图分类号: 

  • TP391
[1]YANG G M,YANG J,ZHANG J P.Research on Data Publi-shing of Privacy Preserving[J].Computer Science,2011(9):17-23.
[2]WANG B,YANG J.Research on Anonymity Technique forPersonalization Privacy-preserving Data Publishing[J].Compu-ter Science,2012,39(4):168-171.
[3]DWORK C,MCSHERRY F,NISSIM K,et al.Calibrating noise to sensitivity in private data analysis[C]//Theory of Cryptography Conference.Heidelberg:Springer,2006:265-284.
[4]ABADI M,CHU A,GOODFELLOW I,et al.Deep learningwith differential privacy[C]//Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.Vienna:ACM,2016:308-318.
[5]PHAN N H,WANG Y,WU X,et al.Differential privacy preservation for deep auto-encoders:an application of human behavior prediction[C]//Proceedings of the AAAI Conference on Artificial Intelligence.Arizona:AAAI Press,2016:1309-1315.
[6]CHEN R,XIAO Q,ZHANG Y,et al.Differentially privatehigh-dimensional data publication via sampling-based inference[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Sydeny:ACM,2015:129-138.
[7]DUCHI J C,JORDAN M I,WAINWRIGHT M J.Local privacy and statistical minimax rates[C]//2013 IEEE 54th Annual Symposium on Foundations of Computer Science.Washington:IEEE,2013:429-438.
[8]BLOOM B H.Space/time trade-offs in hash coding with allowa-ble errors[J].Communications of the ACM,1970,13(7):422-426.
[9]ERLINGSSON U,PIHUR V,KOROLOVA A.RAPPOR:Randomized aggregatable privacy-preserving ordinal response[C]//Proceedings of the 2014 ACM SIGSAC Conference on Compu-ter and Communications Security.Arizona:ACM,2014:1054-1067.
[10]NGUYEN T T,XIAO X X,YANG Y,et al.Collecting and Ana-lyzing Data from Smart Device Users with Local Differential Privacy[EB/OL].https://arxiv.org/abs/1606.05053.
[11]WANG N,XIAO X K,YANG Y,et al.Collecting and Analyzing Multidimensional Data with Local Differential Privacy[C]//IEEE 35th International Conference on Data Engineering.Macau:IEEE Press,2019:638-649.
[12]QIN Z,YANG Y,YU T,et al.Heavy hitter estimation over set-valued data with local differential privacy[C]//Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.Vienna:ACM,2016:192-203.
[13]SWANG T H,LI N H,JHA S.Locally differentially private frequent itemset mining[C]//2018 IEEE Symposium on Security and Privacy.San Francisco:IEEE Press,2018:127-143.
[14]BASSILY R,NISSIM K,STEMMER U,et al.Practical locally private heavy hitters[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.New York:Curran Associates Inc,2017:2285-2293.
[15]CORMODE G,KULKARNI T,SRIVASTAVA D.Marginalrelease under local differential privacy[C]//Proceedings of the 2018 International Conference on Management of Data.Houston:ACM,2018:131-146.
[16]YE Q Q,HU H B,MENG X F.PrivKV:Key-Value Data Collection with Local Differential Privacy[C]//2019 IEEE Symposium on Security and Privacy (SP).San Francisco:IEEE Press,2019:317-331.
[17]GU X L,LI M,CHENG Y Q,et al.PCKV:Locally Differentially Private Correlated Key-Value Data Collection with Optimized Utility[C]//29th USENIX Security Symposium.USENIX Association,2020:967-984.
[18]ZHANG J,CORMODE G,PROCOPIUC C M,et al.Privbayes:Private data release via bayesian networks[J].ACM Transactions on Database Systems,2017,42(4):1-41.
[19]BARAK B,CHAUDHURI K,DWORK C,et al.Privacy,accuracy,and consistency too:a holistic solution to contingency table release[C]//Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.New York:ACM,2007:273-282.
[20]DING B L,KULKARNI J,YEKHANIN S.Collecting telemetry data privately[C]//Proceedings of the 31st International Conference on Neural Information Processing.California:ACM,2017:3574-3583.
[21]DUCHI J C,JORDAN M I.Minimax Optimal Procedures forLocally Private Estimation[J].Journal of the American Statistical Association,2018,113(521):182-201.
[22]WANG T H,BLOCKI J,LI N H,et al.Locally differentiallyprivate protocols for frequency estimation[C]//26th USENIX Security Symposium.Vancouver:USENIX Association,2017:729-745.
[1] 黄觉, 周春来.
基于本地化差分隐私的频率特征提取
Frequency Feature Extraction Based on Localized Differential Privacy
计算机科学, 2022, 49(7): 350-356. https://doi.org/10.11896/jsjkx.210900229
[2] 李思颖, 徐杨, 王欣, 赵若成.
基于关联分析的铁路旅客同行预测方法
Railway Passenger Co-travel Prediction Based on Association Analysis
计算机科学, 2021, 48(9): 95-102. https://doi.org/10.11896/jsjkx.200700097
[3] 孙明玮, 司维超, 董琪.
基于多维度数据的网络服务质量的综合评估研究
Research on Comprehensive Evaluation of Network Quality of Service Based on Multidimensional Data
计算机科学, 2021, 48(6A): 246-249. https://doi.org/10.11896/jsjkx.200900131
[4] 彭春春, 陈燕俐, 荀艳梅.
支持本地化差分隐私保护的k-modes聚类方法
k-modes Clustering Guaranteeing Local Differential Privacy
计算机科学, 2021, 48(2): 105-113. https://doi.org/10.11896/jsjkx.200700172
[5] 张琴, 陈红梅, 封云飞.
一种基于粗糙集和密度峰值的重叠社区发现方法
Overlapping Community Detection Method Based on Rough Sets and Density Peaks
计算机科学, 2020, 47(5): 72-78. https://doi.org/10.11896/jsjkx.190400160
[6] 李刚, 王超, 韩德鹏, 刘强伟, 李莹.
基于深度主成分相关自编码器的多模态影像遗传数据研究
Study on Multimodal Image Genetic Data Based on Deep Principal Correlated Auto-encoders
计算机科学, 2020, 47(4): 60-66. https://doi.org/10.11896/jsjkx.190300073
[7] 戴欢, 蒋敬敬, 束沁冬, 石鹏展, 史文华.
基于信道状态相位信息的生命体征监测算法
Vital Signs Monitoring Method Based on Channel State Phase Information
计算机科学, 2020, 47(10): 48-54. https://doi.org/10.11896/jsjkx.200500057
[8] 王妍, 韩笑, 曾辉, 刘荆欣, 夏长清.
边缘计算环境下服务质量可信的任务迁移节点选择
Task Migration Node Selection with Reliable Service Quality in Edge Computing Environment
计算机科学, 2020, 47(10): 240-246. https://doi.org/10.11896/jsjkx.190900054
[9] 鲁显光, 杜学绘, 王文娟.
基于改进FP growth的告警关联算法
Alert Correlation Algorithm Based on Improved FP Growth
计算机科学, 2019, 46(8): 64-70. https://doi.org/10.11896/j.issn.1002-137X.2019.08.010
[10] 付泽强, 王晓锋, 孔军.
高性能网络安全告警信息的关联分析方法
High-performance Association Analysis Method for Network Security Alarm Information
计算机科学, 2019, 46(5): 116-121. https://doi.org/10.11896/j.issn.1002-137X.2019.05.018
[11] 茹锋, 徐锦, 常琪, 阚丹会.
一种用于影像遗传学关联分析的高阶统计量结构化稀疏算法
High Order Statistics Structured Sparse Algorithm for Image Genetic Association Analysis
计算机科学, 2019, 46(4): 66-72. https://doi.org/10.11896/j.issn.1002-137X.2019.04.010
[12] 李广璞, 黄妙华.
频繁项集挖掘的研究进展及主流方法
Research Progress and Mainstream Methods of Frequent Itemsets Mining
计算机科学, 2018, 45(11A): 1-11.
[13] 吴珺,王春枝.
面向大数据的多维粒矩阵关联分析及应用
Multiple Correlation Analysis and Application of Granular Matrix Based on Big Data
计算机科学, 2017, 44(Z11): 407-410. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.086
[14] 琚安康,郭渊博,朱泰铭,王通.
网络安全事件关联分析技术与工具研究
Survey on Network Security Event Correlation Analysis Methods and Tools
计算机科学, 2017, 44(2): 38-45. https://doi.org/10.11896/j.issn.1002-137X.2017.02.004
[15] 胡文生,杨剑锋,赵明.
类设计质量评估方法的研究
Methodology for Classes Design Quality Assessment
计算机科学, 2017, 44(12): 150-155. https://doi.org/10.11896/j.issn.1002-137X.2017.12.029
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!