计算机科学 ›› 2022, Vol. 49 ›› Issue (2): 162-173.doi: 10.11896/jsjkx.201200008

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于差分隐私的K-means算法优化研究综述

孔钰婷, 谭富祥, 赵鑫, 张正航, 白璐, 钱育蓉   

  1. 新疆大学软件学院 乌鲁木齐830000
    新疆维吾尔自治区信号检测与处理重点实验室 乌鲁木齐830046
    新疆大学软件工程重点实验室 乌鲁木齐830000
  • 收稿日期:2020-12-01 修回日期:2021-03-29 出版日期:2022-02-15 发布日期:2022-02-23
  • 通讯作者: 钱育蓉(qyr@xju.edu.cn)
  • 作者简介:1565066023@qq.com
  • 基金资助:
    国家自然科学基金(61966035);自治区科技厅国际合作项目(2020E01023);自治区研究生科研创新项目(XJ2019G072)

Review of K-means Algorithm Optimization Based on Differential Privacy

KONG Yu-ting, TAN Fu-xiang, ZHAO Xin, ZHANG Zheng-hang, BAI Lu, QIAN Yu-rong   

  1. College of Software,Xinjiang University,Urumqi 830000,China
    Key Laboratory of Signal Detection & Processing in Xinjiang Autonomous Region,Xinjiang University,Urumqi 830046,China
    Key Laboratory of Software Engineering,Xinjiang University,Xinjiang University,Urumqi 830000,China
  • Received:2020-12-01 Revised:2021-03-29 Online:2022-02-15 Published:2022-02-23
  • About author:KONG Yu-ting,born in 1997,postgra-duate,is a member of China Computer Federation.Her main research interests include data mining and data security.
    QIAN Yu-rong,born in 1980,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.Her main research interests include network computing and remote sensing image processing.
  • Supported by:
    National Natural Science Foundation of China(61966035),International Cooperation Project of the Science and Technology Department of the Autonomous Region(2020E01023) and Autonomous Region Graduate Research Innovation Project(XJ2019G072).

摘要: 差分隐私K-means算法(Differential Privacy K-means Algorithm,DP K-means)作为一种基于差分隐私技术的隐私保护数据挖掘(Privacy Preserving Data Mining,PPDM)模型,因简单高效且可保障数据的隐私而备受研究者的关注。文中首先阐述了差分隐私K-means算法的原理、隐私攻击模型,以分析算法的不足。然后从数据预处理、隐私预算分配、聚簇划分等3个角度讨论分析DP K-means算法改进研究的优缺点,并对研究中的相关数据集和通用评价指标进行了总结。最后指出DP K-means算法改进研究中亟待解决的挑战性问题,并展望了DP K-means算法的未来发展趋势。

关键词: 差分隐私, 差分隐私K-means算法, 隐私保护, 隐私保护数据挖掘

Abstract: Differential privacy K-means algorithm (DP K-means),as a kind of privacy preserving data mining (PPDM) model based on differential privacy technology,has attracted much attention from researchers because of its simplicity,efficiency and ability to guarantee data privacy.Firstly,the principle and privacy attack model of differential privacy K-means Algorithm are described,and the shortcomings of the algorithm are analyzed.Then,the advantages and disadvantages of the improvement research of DP K-means algorithm are discussed and analyzed from three perspectives,including data preprocessing,privacy budget allocation and cluster partition,and the relevant data sets and common evaluation indexes in the research are summarized.At last,the challenging problems to be solved in the improvement research of DP K-means algorithm are pointed out,and the future development trend of DP K-means algorithm is prospected.

Key words: Differential privacy, Differential privacy K-means algorithm, Privacy preservation, Privacy preserving data mining

中图分类号: 

  • TP309
[1]GAO Z,SUN Y,CUI X,et al.Privacy-Preserving Hybrid K-Means[J].International Journal of Data Warehousing and Mi-ning,2018,14(2):1-17.
[2]NELSON B,OLOVSSON T.Security and privacy for big data:A systematic literature review[C]//IEEE International Confe-rence on Big Data.IEEE,2017.
[3]ZHANG X J,YANG H Y,LI Z,et al.Differentially Private Location Privacy-preserving Scheme with Semantic Location[J].Computer Science,2021,48(8):300-308.
[4]PENG C C,CHEN Y L,XUN Y M.k-modes Clustering Gua-ranteeing Local Differential Privacy[J].Computer Science,2021,48(2):105-113.
[5]ZHOU S G,LI F,TAO Y F,et al.Privacy Preservation in Database Applications:A Survey[J].Chinese Journal of Computers,2009,32(5):847-861.
[6]MIN Z,YANG G,SANGAIAH A K,et al.A privacy protection-oriented parallel fully homomorphic encryption algorithm in cyber physical systems[J].EURASIP Journal on Wireless Communications and Networking,2019,2019(1):15.
[7]LIU J,YIN S L,LI H,et al.A density-based clustering method for k-anonymity privacy protection[J].Journal of Information Hiding and Multimedia Signal Processing,2017,8(1):12-18.
[8]TEMUUJIN O,AHN J,IM D H.Efficient L-Diversity algo-rithm for preserving privacy of dynamically published datasets[J].IEEE Access,2019,197:122878-122888.
[9]DWORK C.Differential Privacy:A Survey of Results[C]//Theory and Applications of Models of Computation(TAMC 2008).Springer,2008:1-19.
[10]FANG Y,ZHU J,ZHOU W,et al.A Survey on Data Mining Privacy Protection Algorithms[J].Netinfo Security,2017,2:6-11.
[11]WANG J Y,LIU C,FU X C,et al.Crucial Patterns Mining with Differential Privacy over Data Streams[J].Journal of Software,2019,30(3):648-666.
[12]INAN A,GURSOY M E,SAYGIN Y.Sensitivity analysis fornon-interactive differential privacy:bounds and efficient algorithms[J].IEEE Transactions on Dependable and Secure Computing,2020,17(1):194-207.
[13]GAO Z Q,WANG Y T.Survey on differential privacy and its progress[J].Journal on Communications,2017,38:151-155.
[14]BLUM A,DWORK C,MCSHERRY F,et al.Practical Privacy:The SuLQ Framework[C]//Proceedings of the Twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems(PODS '05).2016:128-138.
[15]MCSHERRY F D.Privacy integrated queries:an extensible platform for privacy-preserving data analysis[C]//Proceedings of the 2009 ACM SIGMOD International Conference on Management of data.ACM,2019:19-30.
[16]NISSIM K,RASKHODNIKOVA S,SMITH A.Smooth sensitivity and sampling in private data analysis[C]//Thirty-ninth ACM Symposium on Theory of Computing.ACM,2007.
[17]FELDMAN D,FIAT A,KAPLAN H,et al.Private coresets[C]//Proceedings of the 41st Annual ACM Symposium on Theory of Computing,STOC 2009.Bethesda,MD,USA:ACM,2009.
[18]MOHAN P,THAKURTA A,SHI E,et al.GUPT:privacy preserving data analysis made easy[C]//Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data.2012:349-360.
[19]FLETCHER S,ISLAM M Z.Decision tree classification withdifferential privacy:A survey[J].ACM Computing Surveys,2019,52(4):1-33.
[20]HAY M,MACHANAVAJJHALA A,MIKLAU G,et al.Principled Evaluation of Differentially Private Algorithms using DPBench[C]//ACM.2016:139-154.
[21]YE Q Q,MENG X F,ZHU M J,et al.Survey on Local Differential Privacy[J].Journal of Software,2018,29(7):1981-2005.
[22]HE X M,WANG X Y,CHEN H H.Study on choosing the parameter ε in differential privacy[J].Journal on Communications,2015,36(12):124-130.
[23]CONSUL S,WILLIAMSON S.Differentially Private MedianForests for Regression and Classification[J].arXiv:2006.08795.
[24]ALVIM M,CHATZIKOKOLAKIS K,PALAMIDESSI C,et al.Invited Paper:Local Differential Privacy on Metric Spaces:Optimizing the Trade-Off with Utility[C]//2018 IEEE 31st Computer Security Foundations Symposium (CSF).IEEE,2018:262-267.
[25]BUN M,STEINKE T.Average-Case Averages:Private Algo-rithms for Smooth Sensitivity and Mean Estimation[J].Advances in Neural Information Processing Systems,2019,17:181-191.
[26]WANG H,XU Z,XIONG L,et al.Conducting Correlated La-place Mechanism for Differential Privacy[C]//International Conference on Cloud Computing and Security.Cham:Springer,2017.
[27]DONG J,DURFEE D,ROGERS R.Optimal Differential Privacy Composition for Exponential Mechanisms and the Cost of Adaptivity[J].arXiv:1909.13830.
[28]YANG X,WANG T,REN X,et al.Survey on Improving Data Utility in Differentially Private Sequential Data Publishing[J].IEEE Transactions on Big Data,2021,7(4):729-749.
[29]LIU F.Generalized Gaussian Mechanism for Differential Privacy[J].IEEE Transactions on Knowledge and Data Engineering,2018,31(4):747-756.
[30]TALWAR K,HARDT M A W.Geometric mechanism for privacy-preserving answers:U.S.Patent 8,661,047[P].2014-2-25.
[31]LI C,MIKLAU G,HAY M,et al.The matrix mechanism:optimizing linear counting queries under differential privacy[J].VLDB Journal — the International Journal on Very Large Data Bases,2015,24(6):757-781.
[32]ZHAO J,CHEN Y,ZHANG W.Differential Privacy Preservation in Deep Learning:Challenges,Opportunities and Solutions[J].IEEE Access,2019,7:48901-48911.
[33]GONG M,XIE Y,PAN K,et al.A Survey on Differentially Private Machine Learning[J].IEEE Computational Intelligence Magazine,2020,15(2):49-64.
[34]DWORK C,MCSHERRY F,NISSIM K,et al.Calibrating noise to sensitivity in private data analysis[C]//Theory of cryptography conference.Springer,2006:265-284.
[35]LIN W C,TSAI C F,KE S W,et al.Top 10 data mining techniques in business applications:a brief survey[J].Kybernetes,2017,46(7):1158-1170.
[36]XIONG J,REN J,CHEN L,et al.Enhancing privacy and availability for data clustering in intelligent electrical service of IoT[J].IEEE Internet of Things Journal,2018,6(2):1530-1540.
[37]DWORK C,ROTHBLUM G N,VADHAN S P.Boosting andDifferential Privacy[C]//2010 IEEE 51st Annual Symposium on Foundations of Computer Science.IEEE,2010:51-60.
[38]YANG J C,ZHAO C.A Survey on K-Means Clustering Algorithm[J].Computer Engineering and Applications,2019,55(23):7-14,63.
[39]REN Y H.Survey of K-means algorithms on big data[J/OL].Application Research of Computers.https://doi.org/10.19734/j.issn.1001-3695.2019.10.0581.
[40]XING Y N,QIAN Y R,NAN F Z,et al.Survey of optimization on K-means algorithm in Spark[J].2020,37(3):641-647.
[41]LI Y,HAO Z F,WEN W,et al.Research on Differential Privacy Preserving k-means Clustering[J].Computer Science,2013,3:287-290.
[42]REN J,XIONG J,YAO Z,et al.DPLK-Means:A Novel Diffe-rential Privacy K-Means Mechanism[C]//IEEE Second International Conference on Data Science in Cyberspace.IEEE,2017.
[43]ZHANG G,ZHANG C,ZHANG H.Improved K-means Algo-rithm Based on Density Canopy[J].Knowledge-Based Systems,2018,145:289-297.
[44]XIA D,NING F,HE W.Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform[J].Journal of Grid Computing,2020,18(2):263-273.
[45]LI L F.The analysis of K-means Clustering with Differential Privacy[D].Sichuan:Southwest Jiaotong University,2016.
[46]YAO S.An Improved Differential Privacy K-Means Algorithm Based on MapReduce[C]//2018 11th International Symposium on Computational Intelligence and Design (ISCID).IEEE,2018:141-145.
[47]SHANG T,ZHAO Z,GUAN Z,et al.A DP canopy k-means algorithm for privacy preservation of Hadoop Platform[C]//International Symposium on Cyberspace Safety and Security.Cham:Springer,2017:189-198.
[48]SHAO R M,ZHANG L,LIU Y,et al.A-PAM Clustering Algorithm Based on Differential Privacy Preserving[C]//2015 International Conference on Software,Multimedia and Communication Engineering.2015.
[49]YU Q,LUO Y,CHEN C,et al.Outlier-eliminated k-means clustering algorithm based on differential privacy preservation[J].Applied Intelligence,2016,45(4):1179-1191.
[50]FAN Y K,LIU J W.Parallel K-means algorithm with differen-tial privacy preservation and outlier pruning[J].Application Research of Computers,2019,6:1776-1781.
[51]MENG Q,ZHOU L.Research On Differential Privacy Preserving Clustering Algorithm Based On Spark Platform[J].Journal of Computers (Taiwan),2018,29(1):47-62.
[52]DWORK C.Differential privacy[C]//International Colloquiumon Automata,Languages,and Programming.Springer,2006:1-12.
[53]FAN Z,XU X.APDP k-means:A New Differential PrivacyClustering Algorithm Based on Arithmetic Progression Privacy Budget Allocation[C]//2019 IEEE 21st International Confe-rence on High Performance Computing and Communications.IEEE,2019.
[54]SU D,CAO J,LI N,et al.Differentially Private K-Means Clustering[C]//ACM.2016:26-37.
[55]FU Y M,LI Z D.Research on k-means++ Clustering Algorithm Based on Laplace Mechanism for Differential Privacy Protection[J].Netinfo Security,2019,218(2):49-58.
[56]ZHANG Y,LIU N,WANG S,et al.A differential privacy protecting K-means clustering algorithm based on contour coefficients[J].Plos One,2018,13(11):1-15.
[57]MO R,LIU J,YU W,et al.A Differential Privacy-Based Protecting Data Preprocessing Method for Big Data Mining[C]//2019 18th IEEE International Conference on Trust,Security And Privacy In Computing And Communications.IEEE,2019.
[58]DONG S U,CAO J N E,NINGHUI L I,et al.Differentially Private K-Means Clustering and a Hybrid Approach to Private Optimization[J].ACM Transaction on Information & System Security,2017,20(4):16.1-16.33.
[59]ZHANG J,XIAO X,YANG Y,et al.PrivGene:differentially private model fitting using genetic algorithms[C]//Proceedings of the 2013 ACM SIGMOD International Conference on Ma-nagement of Data.2013:665-676.
[60]LU Z,SHEN H.A convergent differentially private k-means clustering algorithm[C]//Pacific-Asia Conference on Know-ledge Discovery and Data Mining.Springer,2019:612-624.
[61]LU Z,SHEN H.Differentially Private k-Means Clustering with Guaranteed Convergence[J].arXiv:2002.01043,2020.
[62]ACS G,MELIS L,CASTELLUCCIA C,et al.Differentially private mixture of generative neural networks[J].IEEE Transactions on Knowledge and Data Engineering,2018,31(6):1109-1121.
[63]LV Z,WANG L,GUAN Z,et al.An Optimizing and Differen-tially Private Clustering Algorithm for Mixed Data in SDN based Smart Grid[J].IEEE Access,2019,7:45773-45782.
[64]NGUYEN H H.Privacy-Preserving Mechanisms for k-ModesClustering[J].Computers & Security,2018,78:60-75.
[65]NI T,QIAO M,CHEN Z,et al.Utility-efficient Differentially Private K-means Clustering based on Cluster Merging-ScienceDirect[J].Neurocomputing,2021,424:205-214.
[66]GAO Z Q,ZHANG L J.DPHKMS:An Efficient Hybrid Clus-tering Preserving Differential Privacy in Spark[C]//Internatio-nal Conference on Emerging Internetworking.Cham:Springer,2017.
[67]FRANK A,ASUNCION A.UCI machine learning repository[EB/OL]. [2022-01-13]. https://archive.ics.uci.edu/ml/index.php.
[68]PENG H L,JIN K Z,FU C C,et al.Private Time Series Pattern Mining with Sequential Lattice[J].Acta Electronica Sinica,2020,48(1):153-163.
[69]LI H C,WU X P,CHEN Y.k-means clustering method preserving differential privacy in MapReduce framework[J].Journal on Communications,2016,37(2):124-130.
[70]XIA C,HUA J,TONG W,et al.Distributed K-Means clustering guaranteeing local differential privacy[J].Computers & Security,2020,90:101699.1-101699.11.
[71]NGUYEN T D,GUPTA S,RANA S,et al.Privacy Aware K-Means Clustering with High Utility[C]//Pacific-Asia Confe-rence on Knowledge Discovery & Data Mining.Springer International Publishing,2016:388-400.
[72]ZHANG T,ZHU T,XIONG P,et al.Correlated Differential Privacy:Feature Selection in Machine Learning[J].IEEE Transactions on Industrial Informatics,2019,16(3):2115-2124.
[73]RATHI M,RAJAVAT A.High Dimensional Data Processing in Privacy Preserving Data Mining[C]//2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT).IEEE,2020.
[74]ZHANG T,ZHU T,LIU R,et al.Correlated Data in Differential Privacy:Definition and Analysis[J].Concurrency and Computation:Practice and Experience,2020:e6015.
[75]TRAN H Y,HU J.Privacy-preserving big data analytics a comprehensive survey[J].Journal of Parallel and Distributed Computing,2019,134:207-218.
[76]GUAN Z,LV Z,DU X,et al.Achieving Data Utility-PrivacyTradeoff in Internet of Medical Things:A Machine Learning Approach[J].Future Generation Computer Systems,2019,98:60-68.
[77]SAKELLARIOU G,GOUNARIS A.Homomorphically encryp-ted k-means on cloud-hosted servers with low client-side load[J].Computing,2019,101(12):1813-1836.
[1] 鲁晨阳, 邓苏, 马武彬, 吴亚辉, 周浩浩.
基于分层抽样优化的面向异构客户端的联邦学习
Federated Learning Based on Stratified Sampling Optimization for Heterogeneous Clients
计算机科学, 2022, 49(9): 183-193. https://doi.org/10.11896/jsjkx.220500263
[2] 汤凌韬, 王迪, 张鲁飞, 刘盛云.
基于安全多方计算和差分隐私的联邦学习方案
Federated Learning Scheme Based on Secure Multi-party Computation and Differential Privacy
计算机科学, 2022, 49(9): 297-305. https://doi.org/10.11896/jsjkx.210800108
[3] 吕由, 吴文渊.
隐私保护线性回归方案与应用
Privacy-preserving Linear Regression Scheme and Its Application
计算机科学, 2022, 49(9): 318-325. https://doi.org/10.11896/jsjkx.220300190
[4] 黄觉, 周春来.
基于本地化差分隐私的频率特征提取
Frequency Feature Extraction Based on Localized Differential Privacy
计算机科学, 2022, 49(7): 350-356. https://doi.org/10.11896/jsjkx.210900229
[5] 王健.
基于隐私保护的反向传播神经网络学习算法
Back-propagation Neural Network Learning Algorithm Based on Privacy Preserving
计算机科学, 2022, 49(6A): 575-580. https://doi.org/10.11896/jsjkx.211100155
[6] 李利, 何欣, 韩志杰.
群智感知的隐私保护研究综述
Review of Privacy-preserving Mechanisms in Crowdsensing
计算机科学, 2022, 49(5): 303-310. https://doi.org/10.11896/jsjkx.210400077
[7] 王美珊, 姚兰, 高福祥, 徐军灿.
面向医疗集值数据的差分隐私保护技术研究
Study on Differential Privacy Protection for Medical Set-Valued Data
计算机科学, 2022, 49(4): 362-368. https://doi.org/10.11896/jsjkx.210300032
[8] 吕由, 吴文渊.
基于同态加密的线性系统求解方案
Linear System Solving Scheme Based on Homomorphic Encryption
计算机科学, 2022, 49(3): 338-345. https://doi.org/10.11896/jsjkx.201200124
[9] 金华, 朱靖宇, 王昌达.
视频隐私保护技术综述
Review on Video Privacy Protection
计算机科学, 2022, 49(1): 306-313. https://doi.org/10.11896/jsjkx.201200047
[10] 雷羽潇, 段玉聪.
面向跨模态隐私保护的AI治理法律技术化框架
AI Governance Oriented Legal to Technology Bridging Framework for Cross-modal Privacy Protection
计算机科学, 2021, 48(9): 9-20. https://doi.org/10.11896/jsjkx.201000011
[11] 董晓梅, 王蕊, 邹欣开.
面向推荐应用的差分隐私方案综述
Survey on Privacy Protection Solutions for Recommended Applications
计算机科学, 2021, 48(9): 21-35. https://doi.org/10.11896/jsjkx.201100083
[12] 孙林, 平国楼, 叶晓俊.
基于本地化差分隐私的键值数据关联分析
Correlation Analysis for Key-Value Data with Local Differential Privacy
计算机科学, 2021, 48(8): 278-283. https://doi.org/10.11896/jsjkx.201200122
[13] 张学军, 杨昊英, 李桢, 何福存, 盖继扬, 鲍俊达.
融合语义位置的差分私有位置隐私保护方法
Differentially Private Location Privacy-preserving Scheme withSemantic Location
计算机科学, 2021, 48(8): 300-308. https://doi.org/10.11896/jsjkx.200900198
[14] 陈天荣, 凌捷.
基于特征映射的差分隐私保护机器学习方法
Differential Privacy Protection Machine Learning Method Based on Features Mapping
计算机科学, 2021, 48(7): 33-39. https://doi.org/10.11896/jsjkx.201200224
[15] 王辉, 朱国宇, 申自浩, 刘琨, 刘沛骞.
基于用户偏好和位置分布的假位置生成方法
Dummy Location Generation Method Based on User Preference and Location Distribution
计算机科学, 2021, 48(7): 164-171. https://doi.org/10.11896/jsjkx.200800069
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!