计算机科学 ›› 2021, Vol. 48 ›› Issue (2): 105-113.doi: 10.11896/jsjkx.200700172

• 数据库&大数据&数据科学 • 上一篇    下一篇

支持本地化差分隐私保护的k-modes聚类方法

彭春春, 陈燕俐, 荀艳梅   

  1. 南京邮电大学计算机学院、软件学院、网络空间安全学院 南京210003
  • 收稿日期:2020-07-25 修回日期:2020-09-01 出版日期:2021-02-15 发布日期:2021-02-04
  • 通讯作者: 陈燕俐(chenyl@njupt.edu.cn)
  • 作者简介:1219043729@njupt.edu.cn
  • 基金资助:
    国家自然科学基金(61572263,61272084)

k-modes Clustering Guaranteeing Local Differential Privacy

PENG Chun-chun, CHEN Yan-li, XUN Yan-mei   

  1. College of Computer Science,Nanjing University of Posts and Telecommunications,Nanjing 210003,China
  • Received:2020-07-25 Revised:2020-09-01 Online:2021-02-15 Published:2021-02-04
  • About author:PENG Chun-chun,born in 1996,postgraduate.His main research interests include privacy preserving and data mining.
    CHEN Yan-li,born in 1969,Ph.D,professor.Her main research interests include network security and computer architecture.
  • Supported by:
    The National Natural Science Foundation of China(61572263,61272084).

摘要: 如何在保护数据隐私的同时进行可用性的数据挖掘已成为热点问题。鉴于在很多实际应用场景中,很难找到一个真正可信的第三方对用户的敏感数据进行处理,文中首次提出了一种支持本地化差分隐私技术的聚类方案——LDPK-modes(Local Differential Privacy K-modes)。与传统的基于中心化差分隐私的聚类算法相比,其不再需要一个可信的第三方对数据进行收集和处理,而由用户担任数据隐私化的工作,极大地降低了第三方窃取用户隐私的可能性。用户使用满足本地d-隐私(带有距离度量的本地差分隐私技术)定义的随机响应机制对敏感数据进行扰动,第三方收集到用户扰动数据后,恢复其统计特征,生成合成数据集,并进行k-modes聚类。在聚类过程中,将数据集上频繁出现的特征分配给初始聚类中心点,进一步提高了聚类结果的可用性。理论分析和实验结果表明了LDPK-modes的隐私性和聚类可用性。

关键词: d-隐私, k-modes, 本地化差分隐私, 聚类, 隐私保护

Abstract: How to conduct usability data mining while protecting data privacy has become a hot issue.In many practical scena-rios,it is difficult to find a trusted third party to process the sensitive data.This paper proposes the first locally differentially private k-modes mechanism(LDPK-modes) under this distributed scenario.Differing from standard differentially private clustering mechanisms,the proposed mechanism doesn't need any trusted third party to collect and preprocess users data.Users disturb their data using a random response mechanism that satisfies the definition of local d-privacy (local differential privacy with distance metric).When the third party collects the user's disturbed data,it restores its statistical features and generates a synthetic data set.The frequent attributes on the data set are assigned to the initial cluster center and then start k-modes clustering.Theoretical analysis shows that the proposed algorithm satisfies local d-privacy.Experimental results show that our proposal can well preserve the quality of clustering results without a trusted third-party data collector.

Key words: Clustering, d-privacy, k-modes, Local differential privacy, Privacy preserving

中图分类号: 

  • TP309
[1] DWORK C.Differential privacy:A survey of results[C]//International Conference on Theory and Applications of Models of Computation.Springer,Berlin,Heidelberg,2008:1-19.
[2] SWEENEY L.k-anonymity:A model for protecting privacy[J].International Journal of Uncertainty,Fuzziness and Knowledge-Based Systems,2002,10(5):557-570.
[3] MACHANAVAJJHALA A,KIFER D,GEHRKE J,et al.l-diversity:Privacy beyond k-anonymity[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2007,1(1):3.
[4] YE Q Q,MENG X F,ZHU M J,et al.Survey on local differen-tial privacy[J].Journal of Software,2018,29(7):1981-2005.
[5] DUCHI J C,JORDAN M I,WAINWRIGHT M J.Local privacy and statistical minimax rates[C]//2013 IEEE 54th Annual Symposium on Foundations of Computer Science.IEEE,2013:429-438.
[6] KASIVISWANATHAN S P,LEE H K,NISSIM K,et al.What can we learn privately[C]// Proc.of the 49th Annual IEEE Symp.on Foundations of Computer Science (FOCS).IEEE,2008:531-540.
[7] HOPE T,CHAN J,KITTUR A,et al.Accelerating innovation through analogy mining[C]//Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2017:235-243.
[8] MENDES R,VILELA J P.Privacy-preserving data mining:methods,metrics,and applications[J].IEEE Access,2017,5:10562-10582.
[9] HAMMING R W.Error detecting and error correcting codes[J].The Bell System Technical Journal,1950,29(2):147-160.
[10] GU X,LI M,CAO Y,et al.Supporting both range queries and frequency estimation with local differential privacy[C]//2019 IEEE Conference on Communications and Network Security (CNS).IEEE,2019:124-132.
[11] WARNER S L.Randomized response:A survey technique foreliminating evasive answer bias[J].Journal of the American Statistical Association,1965,60(309):63-69.
[12] ERLINGSSON Ú,PIHUR V,KOROLOVA A.Rappor:Randomized aggregatable privacy-preserving ordinal response[C]//Proceedings of the 2014 ACM SIGSAC Conference on Compu-ter and Communications Security.2014:1054-1067.
[13] BLOOM B H.Space/Time trade-offs in hash coding with allowable errors[J].Communications of the ACM,1970,13(7):422-426.
[14] BASSILY R,SMITH A.Local,private,efficient protocols forsuccinct histograms[C]//Proc.of the 47th Annual ACM on Symp.on Theory of Computing.ACM,2015:127-135.
[15] DUCHI J C,JORDAN M I,WAINWRIGHT M J.Local privacy,data processing inequalities,and statistical minimax rates[J].arXiv:1302.3203,2013.
[16] WAINWRIGHT M J,JORDAN M I,DUCHI J C.Privacy aware learning[C]//Advances in Neural Information Processing Systems.2012:1430-1438.
[17] NGUYÊN T T,XIAO X,YANG Y,et al.Collecting and analyzing data from smart device users with local differential privacy[J].arXiv:1606.05053,2016.
[18] BLUM A,DWORK C,MCSHERRY F,et al.Practical privacy:the SuLQ framework[C]//Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.2005:128-138.
[19] REN J,XIONG J,YAO Z,et al.DPLK-means:A novel Differential Privacy K-means Mechanism[C]//2017 IEEE Second International Conference on Data Science in Cyberspace (DSC).IEEE,2017:133-139.
[20] FU Y M,LI Z Z.Research on k-means++ Clustering Algorithm Based on Laplace Mechanism for Differential Privacy Protection[J].Netinfo Security,2019,19(2):43-52.
[21] HU C,YANG G,BAI Y L.Clustering Algorithm in Differential Privacy Preserving[J].Computer Science,2019,46(2):120-126.
[22] XIA C,HUA J,TONG W,et al.Distributed K-Means clustering guaranteeing local differential privacy[J].Computers & Security,2020,90:1-11.
[23] NGUYEN H H.Privacy-preserving mechanisms for k-modesclustering[J].Computers & Security,2018,78:60-75.
[24] LYU Z,WANG L,GUAN Z,et al.An optimizing and differentially private clustering algorithm for mixed data in SDN-based smart grid[J].IEEE Access,2019,7:45773-45782.
[25] WANG T,BLOCKI J,LI N,et al.Locally differentially private protocols for frequency estimation[C]//26th Security Sympo-sium (Security 17).2017:729-745.
[26] NEWEY K W,MCFADDEN D.Large sample estimation andhypothesis[J].Handbook of Econometrics,1994,4:2111-2245.
[27] NISSIM K,RASKHODNIKOVA S,SMITH A.Smooth sensitivity and sampling in private data analysis[C]//Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing.2007:75-84.
[28] JIANG H,YI S,LI J,et al.Ant clustering algorithm with K-harmonic means clustering[J].Expert Systems with Applications,2010,37(12):8679-8684.
[1] 鲁晨阳, 邓苏, 马武彬, 吴亚辉, 周浩浩.
基于分层抽样优化的面向异构客户端的联邦学习
Federated Learning Based on Stratified Sampling Optimization for Heterogeneous Clients
计算机科学, 2022, 49(9): 183-193. https://doi.org/10.11896/jsjkx.220500263
[2] 汤凌韬, 王迪, 张鲁飞, 刘盛云.
基于安全多方计算和差分隐私的联邦学习方案
Federated Learning Scheme Based on Secure Multi-party Computation and Differential Privacy
计算机科学, 2022, 49(9): 297-305. https://doi.org/10.11896/jsjkx.210800108
[3] 吕由, 吴文渊.
隐私保护线性回归方案与应用
Privacy-preserving Linear Regression Scheme and Its Application
计算机科学, 2022, 49(9): 318-325. https://doi.org/10.11896/jsjkx.220300190
[4] 柴慧敏, 张勇, 方敏.
基于特征相似度聚类的空中目标分群方法
Aerial Target Grouping Method Based on Feature Similarity Clustering
计算机科学, 2022, 49(9): 70-75. https://doi.org/10.11896/jsjkx.210800203
[5] 黄觉, 周春来.
基于本地化差分隐私的频率特征提取
Frequency Feature Extraction Based on Localized Differential Privacy
计算机科学, 2022, 49(7): 350-356. https://doi.org/10.11896/jsjkx.210900229
[6] 刘丽, 李仁发.
医疗CPS协作网络控制策略优化
Control Strategy Optimization of Medical CPS Cooperative Network
计算机科学, 2022, 49(6A): 39-43. https://doi.org/10.11896/jsjkx.210300230
[7] 鲁晨阳, 邓苏, 马武彬, 吴亚辉, 周浩浩.
基于DBSCAN聚类的集群联邦学习方法
Clustered Federated Learning Methods Based on DBSCAN Clustering
计算机科学, 2022, 49(6A): 232-237. https://doi.org/10.11896/jsjkx.211100059
[8] 郁舒昊, 周辉, 叶春杨, 王太正.
SDFA:基于多特征融合的船舶轨迹聚类方法研究
SDFA:Study on Ship Trajectory Clustering Method Based on Multi-feature Fusion
计算机科学, 2022, 49(6A): 256-260. https://doi.org/10.11896/jsjkx.211100253
[9] 毛森林, 夏镇, 耿新宇, 陈剑辉, 蒋宏霞.
基于密度敏感距离和模糊划分的改进FCM算法
FCM Algorithm Based on Density Sensitive Distance and Fuzzy Partition
计算机科学, 2022, 49(6A): 285-290. https://doi.org/10.11896/jsjkx.210700042
[10] 陈景年.
一种适于多分类问题的支持向量机加速方法
Acceleration of SVM for Multi-class Classification
计算机科学, 2022, 49(6A): 297-300. https://doi.org/10.11896/jsjkx.210400149
[11] 王健.
基于隐私保护的反向传播神经网络学习算法
Back-propagation Neural Network Learning Algorithm Based on Privacy Preserving
计算机科学, 2022, 49(6A): 575-580. https://doi.org/10.11896/jsjkx.211100155
[12] 陈佳舟, 赵熠波, 徐阳辉, 马骥, 金灵枫, 秦绪佳.
三维城市场景中的小物体检测
Small Object Detection in 3D Urban Scenes
计算机科学, 2022, 49(6): 238-244. https://doi.org/10.11896/jsjkx.210400174
[13] 李利, 何欣, 韩志杰.
群智感知的隐私保护研究综述
Review of Privacy-preserving Mechanisms in Crowdsensing
计算机科学, 2022, 49(5): 303-310. https://doi.org/10.11896/jsjkx.210400077
[14] 邢云冰, 龙广玉, 胡春雨, 忽丽莎.
基于SVM的类别增量人体活动识别方法
Human Activity Recognition Method Based on Class Increment SVM
计算机科学, 2022, 49(5): 78-83. https://doi.org/10.11896/jsjkx.210400024
[15] 朱哲清, 耿海军, 钱宇华.
面向化学结构的线段聚类算法
Line-Segment Clustering Algorithm for Chemical Structure
计算机科学, 2022, 49(5): 113-119. https://doi.org/10.11896/jsjkx.210700131
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!