计算机科学 ›› 2021, Vol. 48 ›› Issue (2): 105-113.doi: 10.11896/jsjkx.200700172

• 数据库&大数据&数据科学 • 上一篇    下一篇

支持本地化差分隐私保护的k-modes聚类方法

彭春春, 陈燕俐, 荀艳梅   

  1. 南京邮电大学计算机学院、软件学院、网络空间安全学院 南京210003
  • 收稿日期:2020-07-25 修回日期:2020-09-01 出版日期:2021-02-15 发布日期:2021-02-04
  • 通讯作者: 陈燕俐(chenyl@njupt.edu.cn)
  • 作者简介:1219043729@njupt.edu.cn
  • 基金资助:
    国家自然科学基金(61572263,61272084)

k-modes Clustering Guaranteeing Local Differential Privacy

PENG Chun-chun, CHEN Yan-li, XUN Yan-mei   

  1. College of Computer Science,Nanjing University of Posts and Telecommunications,Nanjing 210003,China
  • Received:2020-07-25 Revised:2020-09-01 Online:2021-02-15 Published:2021-02-04
  • About author:PENG Chun-chun,born in 1996,postgraduate.His main research interests include privacy preserving and data mining.
    CHEN Yan-li,born in 1969,Ph.D,professor.Her main research interests include network security and computer architecture.
  • Supported by:
    The National Natural Science Foundation of China(61572263,61272084).

摘要: 如何在保护数据隐私的同时进行可用性的数据挖掘已成为热点问题。鉴于在很多实际应用场景中,很难找到一个真正可信的第三方对用户的敏感数据进行处理,文中首次提出了一种支持本地化差分隐私技术的聚类方案——LDPK-modes(Local Differential Privacy K-modes)。与传统的基于中心化差分隐私的聚类算法相比,其不再需要一个可信的第三方对数据进行收集和处理,而由用户担任数据隐私化的工作,极大地降低了第三方窃取用户隐私的可能性。用户使用满足本地d-隐私(带有距离度量的本地差分隐私技术)定义的随机响应机制对敏感数据进行扰动,第三方收集到用户扰动数据后,恢复其统计特征,生成合成数据集,并进行k-modes聚类。在聚类过程中,将数据集上频繁出现的特征分配给初始聚类中心点,进一步提高了聚类结果的可用性。理论分析和实验结果表明了LDPK-modes的隐私性和聚类可用性。

关键词: 本地化差分隐私, k-modes, d-隐私, 聚类, 隐私保护

Abstract: How to conduct usability data mining while protecting data privacy has become a hot issue.In many practical scena-rios,it is difficult to find a trusted third party to process the sensitive data.This paper proposes the first locally differentially private k-modes mechanism(LDPK-modes) under this distributed scenario.Differing from standard differentially private clustering mechanisms,the proposed mechanism doesn't need any trusted third party to collect and preprocess users data.Users disturb their data using a random response mechanism that satisfies the definition of local d-privacy (local differential privacy with distance metric).When the third party collects the user's disturbed data,it restores its statistical features and generates a synthetic data set.The frequent attributes on the data set are assigned to the initial cluster center and then start k-modes clustering.Theoretical analysis shows that the proposed algorithm satisfies local d-privacy.Experimental results show that our proposal can well preserve the quality of clustering results without a trusted third-party data collector.

Key words: Local differential privacy, k-modes, d-privacy, Clustering, Privacy preserving

中图分类号: 

  • TP309
[1] DWORK C.Differential privacy:A survey of results[C]//International Conference on Theory and Applications of Models of Computation.Springer,Berlin,Heidelberg,2008:1-19.
[2] SWEENEY L.k-anonymity:A model for protecting privacy[J].International Journal of Uncertainty,Fuzziness and Knowledge-Based Systems,2002,10(5):557-570.
[3] MACHANAVAJJHALA A,KIFER D,GEHRKE J,et al.l-diversity:Privacy beyond k-anonymity[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2007,1(1):3.
[4] YE Q Q,MENG X F,ZHU M J,et al.Survey on local differen-tial privacy[J].Journal of Software,2018,29(7):1981-2005.
[5] DUCHI J C,JORDAN M I,WAINWRIGHT M J.Local privacy and statistical minimax rates[C]//2013 IEEE 54th Annual Symposium on Foundations of Computer Science.IEEE,2013:429-438.
[6] KASIVISWANATHAN S P,LEE H K,NISSIM K,et al.What can we learn privately[C]// Proc.of the 49th Annual IEEE Symp.on Foundations of Computer Science (FOCS).IEEE,2008:531-540.
[7] HOPE T,CHAN J,KITTUR A,et al.Accelerating innovation through analogy mining[C]//Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2017:235-243.
[8] MENDES R,VILELA J P.Privacy-preserving data mining:methods,metrics,and applications[J].IEEE Access,2017,5:10562-10582.
[9] HAMMING R W.Error detecting and error correcting codes[J].The Bell System Technical Journal,1950,29(2):147-160.
[10] GU X,LI M,CAO Y,et al.Supporting both range queries and frequency estimation with local differential privacy[C]//2019 IEEE Conference on Communications and Network Security (CNS).IEEE,2019:124-132.
[11] WARNER S L.Randomized response:A survey technique foreliminating evasive answer bias[J].Journal of the American Statistical Association,1965,60(309):63-69.
[12] ERLINGSSON Ú,PIHUR V,KOROLOVA A.Rappor:Randomized aggregatable privacy-preserving ordinal response[C]//Proceedings of the 2014 ACM SIGSAC Conference on Compu-ter and Communications Security.2014:1054-1067.
[13] BLOOM B H.Space/Time trade-offs in hash coding with allowable errors[J].Communications of the ACM,1970,13(7):422-426.
[14] BASSILY R,SMITH A.Local,private,efficient protocols forsuccinct histograms[C]//Proc.of the 47th Annual ACM on Symp.on Theory of Computing.ACM,2015:127-135.
[15] DUCHI J C,JORDAN M I,WAINWRIGHT M J.Local privacy,data processing inequalities,and statistical minimax rates[J].arXiv:1302.3203,2013.
[16] WAINWRIGHT M J,JORDAN M I,DUCHI J C.Privacy aware learning[C]//Advances in Neural Information Processing Systems.2012:1430-1438.
[17] NGUYÊN T T,XIAO X,YANG Y,et al.Collecting and analyzing data from smart device users with local differential privacy[J].arXiv:1606.05053,2016.
[18] BLUM A,DWORK C,MCSHERRY F,et al.Practical privacy:the SuLQ framework[C]//Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.2005:128-138.
[19] REN J,XIONG J,YAO Z,et al.DPLK-means:A novel Differential Privacy K-means Mechanism[C]//2017 IEEE Second International Conference on Data Science in Cyberspace (DSC).IEEE,2017:133-139.
[20] FU Y M,LI Z Z.Research on k-means++ Clustering Algorithm Based on Laplace Mechanism for Differential Privacy Protection[J].Netinfo Security,2019,19(2):43-52.
[21] HU C,YANG G,BAI Y L.Clustering Algorithm in Differential Privacy Preserving[J].Computer Science,2019,46(2):120-126.
[22] XIA C,HUA J,TONG W,et al.Distributed K-Means clustering guaranteeing local differential privacy[J].Computers & Security,2020,90:1-11.
[23] NGUYEN H H.Privacy-preserving mechanisms for k-modesclustering[J].Computers & Security,2018,78:60-75.
[24] LYU Z,WANG L,GUAN Z,et al.An optimizing and differentially private clustering algorithm for mixed data in SDN-based smart grid[J].IEEE Access,2019,7:45773-45782.
[25] WANG T,BLOCKI J,LI N,et al.Locally differentially private protocols for frequency estimation[C]//26th Security Sympo-sium (Security 17).2017:729-745.
[26] NEWEY K W,MCFADDEN D.Large sample estimation andhypothesis[J].Handbook of Econometrics,1994,4:2111-2245.
[27] NISSIM K,RASKHODNIKOVA S,SMITH A.Smooth sensitivity and sampling in private data analysis[C]//Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing.2007:75-84.
[28] JIANG H,YI S,LI J,et al.Ant clustering algorithm with K-harmonic means clustering[J].Expert Systems with Applications,2010,37(12):8679-8684.
[1] 邓丽, 武金达, 李科学, 卢亚康. 基于TPE的SpaRC算法超参数优化方法[J]. 计算机科学, 2021, 48(2): 70-75.
[2] 王晓飞, 周超, 刘利刚. 基于特征聚类的轻量级图像搜索系统[J]. 计算机科学, 2021, 48(2): 148-152.
[3] 郭上铜, 王瑞锦, 张凤荔. 区块链技术原理与应用综述[J]. 计算机科学, 2021, 48(2): 271-281.
[4] 余雪勇, 陈涛. 边缘计算场景中基于虚拟映射的隐私保护卸载算法[J]. 计算机科学, 2021, 48(1): 65-71.
[5] 梁伟, 段晓东, 徐健锋. 基于差异性度量的基础聚类三支过滤算法[J]. 计算机科学, 2021, 48(1): 136-144.
[6] 张煜, 陆亿红, 黄德才. 基于密度峰值的加权犹豫模糊聚类算法[J]. 计算机科学, 2021, 48(1): 145-151.
[7] 李彦, 申德荣, 聂铁铮, 寇月. 面向加密云数据的多关键字语义搜索方法[J]. 计算机科学, 2020, 47(9): 318-323.
[8] 杨帆, 王俊斌, 白亮. 基于安全性的成对约束扩充算法[J]. 计算机科学, 2020, 47(9): 324-329.
[9] 徐守坤, 倪楚涵, 吉晨晨, 李宁. 基于YOLOv3的施工场景安全帽佩戴的图像描述[J]. 计算机科学, 2020, 47(8): 233-240.
[10] 陈庆超, 王韬, 冯文博, 尹世庄, 刘丽君. 基于最长连续间隔的未知二进制协议格式推断[J]. 计算机科学, 2020, 47(8): 313-318.
[11] 曹素娥, 杨泽民. 基于聚类分析算法和优化支持向量机的无线网络流量预测[J]. 计算机科学, 2020, 47(8): 319-322.
[12] 高方远, 王秀美. 一种基于块对角表示和近邻约束的子空间聚类方法[J]. 计算机科学, 2020, 47(7): 66-70.
[13] 李向利, 贾梦雪. 基于预处理的超图非负矩阵分解算法[J]. 计算机科学, 2020, 47(7): 71-77.
[14] 高玉潼, 雷为民, 原玥. 复杂环境下基于聚类分析的人脸目标识别[J]. 计算机科学, 2020, 47(7): 111-117.
[15] 张衡, 马明栋, 王得玉. 基于聚类网络的文本-视频特征学习[J]. 计算机科学, 2020, 47(7): 125-129.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 王海涛, 宋丽华, 向婷婷, 刘力军. 人工智能发展的新方向——人机物三元融合智能[J]. 计算机科学, 2020, 47(11A): 1 -5 .
[2] 陈训敏, 叶书函, 詹瑞. 基于多任务学习及由粗到精的卷积神经网络人群计数模型[J]. 计算机科学, 2020, 47(11A): 183 -187 .
[3] . 目录[J]. 计算机科学, 2020, 47(12): 0 .
[4] . 复杂系统的软件工程和需求工程专题前言[J]. 计算机科学, 2020, 47(12): 2 .
[5] 孟繁祎, 王莹, 于海, 朱志良. 复杂软件系统的重构技术:现状、问题与展望[J]. 计算机科学, 2020, 47(12): 1 -10 .
[6] 吴文峻, 于鑫, 蒲彦均, 汪群博, 于笑明. 微服务时代的复杂服务软件开发[J]. 计算机科学, 2020, 47(12): 11 -17 .
[7] 杨经纬, 魏子麒, 刘璘. 用户如何看待产品中的预测分析功能?——面向非功能性需求的调研报告[J]. 计算机科学, 2020, 47(12): 18 -24 .
[8] 贾经冬, 张筱曼, 郝璐, 谭火彬. 工业界需求工程关注点分析[J]. 计算机科学, 2020, 47(12): 25 -34 .
[9] 周凯, 任怡, 汪哲, 管剑波, 张芳, 赵言亢. 基于主题模型的Ubuntu操作系统缺陷报告的分类及分析[J]. 计算机科学, 2020, 47(12): 35 -41 .
[10] 杨立, 马佳佳, 江华禧, 马肖肖, 梁赓, 左春. 面向机器学习系统的需求建模与决策选择[J]. 计算机科学, 2020, 47(12): 42 -49 .