公平谱聚类方法用于提高簇的公平性

doi:10.11896/jsjkx.211100279

摘要/Abstract

摘要： 最近,算法的公平性问题引起了机器学习领域学者的广泛讨论。鉴于谱聚类在现代数据科学中的广泛流行,研究谱聚类的算法公平性是一个至关重要的话题。现有的公平谱聚类算法主要存在两个缺点:1)公平性能差;2)仅在单个敏感属性下工作。文中将公平问题视为一种约束谱聚类问题,通过求解约束谱聚类的可行解集,提出了一种非规范化公平谱聚类方法(Unnormalized Fair Spectral Clustering,UFSC),用于提升公平性能。此外,文中还提出了一种适用于多个敏感属性约束的公平聚类算法(Multi-sensitive Attributes Fair Spectral Clustering,MFSC)。在多个真实数据集上进行了实验,结果表明,UFSC和MFSC算法比现有的公平谱聚类算法生成的聚类结果更加公平。

关键词: 算法公平性, 公平谱聚类, 约束谱聚类, 机器学习, 数据分析

Abstract: Recently,the fairness of the algorithm has aroused extensive discussion in the machine learning community.Given the widespread popularity of spectral clustering in modern data science,studying the algorithm fairness of spectral clustering is a crucial topic.Existing fair spectral clustering algorithms have two shortcomings:1) poor fairness performance;2) work only for single sensitive attribute.In this paper,the fair spectral clustering problem is regarded as a constrained spectral clustering problem.By solving the feasible solution set of constrained spectral clustering,an unnormalized fair spectral clustering(UFSC) method is proposed to improve fairness performance.In addition,the paper also proposes a fair clustering algorithm suitable for multiple sensitive attribute constraints.Experimental results on multiple real-world datasets demonstrate that the UFSC and MFSC are fairer than the existing fair spectral clustering algorithms.

Key words: Algorithm fairness, Fair spectral clustering, Constrained spectral clustering, Machine learning, Data analysis

中图分类号:

TP301

徐夏, 张晖, 杨春明, 李波, 赵旭剑. 公平谱聚类方法用于提高簇的公平性[J]. 计算机科学, 2023, 50(2): 158-165. https://doi.org/10.11896/jsjkx.211100279

XU Xia, ZHANG Hui, YANG Chunming, LI Bo, ZHAO Xujian. Fair Method for Spectral Clustering to Improve Intra-cluster Fairness[J]. Computer Science, 2023, 50(2): 158-165. https://doi.org/10.11896/jsjkx.211100279

参考文献

[1]USHIODA A.Hierarchical clustering of words and application to NLP tasks[C]//Fourth Workshop on Very Large Corpora.1996.
[2]WU Z,LEAHY R.An optimal graph theoretic approach to data clustering:Theory and its application to image segmentation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,1993,15(11):1101-1113.
[3]HUTTENHOWER C,FLAMHOLZ A I,LANDIS J N,et al.NearestNeighbor Networks:clustering expression data based on gene neighborhoods[J].BMC Bioinformatics,2007,8(1):1-13.
[4]CHUANG K H,CHIU M J,LIN C C,et al.Model-free functional MRI analysis using Kohonen clustering neural network and fuzzy C-means[J].IEEE Transactions on Medical Imaging,1999,18(12):1117-1128.
[5]IYER N S,KANDEL A,SCHNEIDER M.Feature-based fuzzy classification for interpretation of mammograms[J].Fuzzy Sets and Systems,2000,114(2):271-280.
[6]DWORK C,HARDT M,PITASSI T,et al.Fairness throughawareness[C]//Proceedings of the 3rd Innovations in Theoretical Computer Science Conference.New York:ACM,2012:214-226.
[7]KLEINBERG J,LAKKARAJU H,LESKOVEC J,et al.Human decisions and machine predictions[J].The Quarterly Journal of Economics,2018,133(1):237-293.
[8]ROMEI A,RUGGIERI S.A multidisciplinary survey on dis-crimination analysis[J].The Knowledge Engineering Review,2014,29(5):582-638.
[9]FELDMAN M,FRIEDLER S A,MOELLER J,et al.Certifying and removing disparate impact[C]//Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Disco-very and Data Mining.New York:ACM,2015:259-268.
[10]HARDT M,PRICE E,SREBRO N.Equality of opportunity in supervised learning[C]//Advances in Neural Information Processing Systems.2016:3315-3323.
[11]NARAYANAN A.Translation tutorial:21 fairness definitions and their politics[C]//Proceedings of the Conference on Fairness Accountability Transp.New York,USA,2018.
[12]ZAFAR M B,VALERA I,GOMEZ RODRIGUEZ M,et al.Fairness beyond disparate treatment & disparate impact:Lear-ning classification without disparate mistreatment[C]//Procee-dings of the 26th International Conference on World Wide Web.Republic and Canton of Geneva,CHE:International World Wide Web Conferences Steering Committee,2017:1171-1180.
[13]HASHIMOTO T,SRIVASTAVA M,NAMKOONG H,et al.Fairness without demographics in repeated loss minimization[C]//International Conference on Machine Learning.PMLR,2018:1929-1938.
[14]CHIERICHETTI F,KUMAR R,LATTANZI S,et al.Fair clustering through fairlets[C]//Advances in Neural Information Processing Systems.2017:5029-5037.
[15]RÖSNER C,SCHMIDT M.Privacy Preserving Clustering with Constraints[C]//45th International Colloquium on Automata,Languages,and Programming(ICALP 2018).Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik,2018.
[16]BERCEA I O,GROß M,KHULLER S,et al.On the Cost of Essentially Fair Clusterings[C]//Approximation,Randomization,and Combinatorial Optimization.Algorithms and Techniques(APPROX/RANDOM 2019).Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik,2019.
[17]BACKURS A,INDYK P,ONAK K,et al.Scalable fair clustering[C]//International Conference on Machine Learning.PMLR,2019:405-413.
[18]KLEINDESSNER M,SAMADI S,AWASTHI P,et al.Guarantees for spectral clustering with fairness constraints[C]//International Conference on Machine Learning.PMLR,2019:3458-3467.
[19]BERA S K,CHAKRABARTY D,FLORES N J,et al.Fair algorithms for clustering[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems.2019:4954-4965.
[20]AHMADIAN S,EPASTO A,KUMAR R,et al.Clusteringwithout over-representation[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.New York:ACM,2019:267-275.
[21]ABRAHAM S S,PADMANABHAN D,SUNDARAM S S.Fairness in Clustering with Multiple Sensitive Attributes[C]//EDBT/ICDT 2020 Joint Conference.2020:287-298.
[22]SCHMIDT M,SCHWIEGELSHOHN C,SOHLER C.Fair coresets and streaming algorithms for fair k-means[C]//International Workshop on Approximation and Online Algorithms.Cham:Springer,2019:232-251.
[23]AHMADIAN S,EPASTO A,KUMAR R,et al.Fair correlation clustering[C]//International Conference on Artificial Intelligence and Statistics.PMLR,2020:4195-4205.
[24]KLEINDESSNER M,AWASTHI P,MORGENSTERN J.Fairk-center clustering for data summarization[C]//International Conference on Machine Learning.PMLR,2019:3448-3457.
[25]ZIKO I M,GRANGER E,YUAN J,et al.Clustering with fairness constraints:A flexible and scalable approach[J].arXiv:1906.08207,2019.
[26]CHEN X,FAIN B,LYU L,et al.Proportionally fair clustering[C]//International Conference on Machine Learning.PMLR,2019:1032-1041.
[27]JUNG C,KANNAN S,LUTZ N.Service in Your Neighbor-hood:Fairness in Center Location[C]//1st Symposium on Foundations of Responsible Computing(FORC 2020).Schloss Dagstuhl-Leibniz-Zentrum für Informatik,2020.
[28]DAVIDSON I,RAVI S S.Making existing clusterings fairer:Algorithms,complexity results and insights[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:3733-3740.
[29]GHADIRI M,SAMADI S,VEMPALA S.Socially fair k-means clustering[C]//Proceedings of the 2021 ACM Conference on Fairness,Accountability,and Transparency.2021:438-448.
[30]ABBASI M,BHASKARA A,VENKATASUBRAMANIAN S.Fair clustering via equitable group representations[C]//Proceedings of the 2021 ACM Conference on Fairness,Accountabi-lity,and Transparency.2021:504-514.
[31]MAKARYCHEV Y,VAKILIAN A.Approximation Algorithms for Socially Fair Clustering[J].arXiv:2103.02512,2021.
[32]CHHABRA A,MOHAPATRA P.Fair algorithms for hierarchical agglomerative clustering[J].arXiv:2005.03197,2020.
[33]AHMADIAN S,EPASTO A,KNITTEL M,et al.Fair Hierarchical Clustering[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems.2020:21050-21060.
[34]QUY T L,ROY A,FRIEGE G,et al.Fair-Capacitated Clustering[J].arXiv:2104.12116,2021.
[35]VON LUXBURG U.A tutorial on spectral clustering[J].Statistics and Computing,2007,17(4):395-416.
[36]YU S X,SHI J.Segmentation given partial grouping constraints[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2004,26(2):173-183.
[37]BERTSEKAS D P.Nonlinear programming[J].Journal of the Operational Research Society,1997,48(3):334-334.
[38]SHI J,MALIK J.Normalized cuts and image segmentation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(8):888-905.
[39]WEEKS M R,CLAIR S,BORGATTI S P,et al.Social networks of drug users in high-risk sites:Finding the connections[J].AIDS and Behavior,2002,6(2):193-206.
[40]ZHOU Z H,CHEN Z Q.Hybrid decision tree[J].Knowledge-based systems,2002,15(8):515-528.
[41]PALECHOR F M,DE LA HOZ MANOTAS A.Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia,Peru and Mexico[J].Data in Brief,2019,25:104344.
[42]MORO S,CORTEZ P,RITA P.A data-driven approach to predict the success of bank telemarketing[J].Decision Support Systems,2014,62:22-31.
[43]MEEK C,THIESSON B,HECKERMAN D.The learning-curve sampling method applied to model-based clustering[J].Journal of Machine Learning Research,2002,2(2):397-418.

相关文章 15

[1]	彭钺峰, 赵波, 刘会, 安杨. 针对机器学习的成员推断攻击综述 Survey on Membership Inference Attacks Against Machine Learning 计算机科学, 2023, 50(3): 351-359. https://doi.org/10.11896/jsjkx.220100016
[2]	王艺潭, 王一舒, 袁野. 学习索引研究综述 Survey of Learned Index 计算机科学, 2023, 50(1): 1-8. https://doi.org/10.11896/jsjkx.211000149
[3]	陈得鹏, 刘肖, 崔杰, 何道敬. 面向机器学习的成员推理攻击综述 Survey of Membership Inference Attacks for Machine Learning 计算机科学, 2023, 50(1): 302-317. https://doi.org/10.11896/jsjkx.220800227
[4]	冷典典, 杜鹏, 陈建廷, 向阳. 面向自动化集装箱码头的AGV行驶时间估计 Automated Container Terminal Oriented Travel Time Estimation of AGV 计算机科学, 2022, 49(9): 208-214. https://doi.org/10.11896/jsjkx.210700028
[5]	宁晗阳, 马苗, 杨波, 刘士昌. 密码学智能化研究进展与分析 Research Progress and Analysis on Intelligent Cryptology 计算机科学, 2022, 49(9): 288-296. https://doi.org/10.11896/jsjkx.220300053
[6]	何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇. 基于大数据的进化网络影响力分析研究综述 Survey of Influence Analysis of Evolutionary Network Based on Big Data 计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240
[7]	李瑶, 李涛, 李埼钒, 梁家瑞, Ibegbu Nnamdi JULIAN, 陈俊杰, 郭浩. 基于多尺度的稀疏脑功能超网络构建及多特征融合分类研究 Construction and Multi-feature Fusion Classification Research Based on Multi-scale Sparse Brain Functional Hyper-network 计算机科学, 2022, 49(8): 257-266. https://doi.org/10.11896/jsjkx.210600094
[8]	张光华, 高天娇, 陈振国, 于乃文. 基于N-Gram静态分析技术的恶意软件分类研究 Study on Malware Classification Based on N-Gram Static Analysis Technology 计算机科学, 2022, 49(8): 336-343. https://doi.org/10.11896/jsjkx.210900203
[9]	陈明鑫, 张钧波, 李天瑞. 联邦学习攻防研究综述 Survey on Attacks and Defenses in Federated Learning 计算机科学, 2022, 49(7): 310-323. https://doi.org/10.11896/jsjkx.211000079
[10]	李亚茹, 张宇来, 王佳晨. 面向超参数估计的贝叶斯优化方法综述 Survey on Bayesian Optimization Methods for Hyper-parameter Tuning 计算机科学, 2022, 49(6A): 86-92. https://doi.org/10.11896/jsjkx.210300208
[11]	赵璐, 袁立明, 郝琨. 多示例学习算法综述 Review of Multi-instance Learning Algorithms 计算机科学, 2022, 49(6A): 93-99. https://doi.org/10.11896/jsjkx.210500047
[12]	肖治鸿, 韩晔彤, 邹永攀. 基于多源数据和逻辑推理的行为识别技术研究 Study on Activity Recognition Based on Multi-source Data and Logical Reasoning 计算机科学, 2022, 49(6A): 397-406. https://doi.org/10.11896/jsjkx.210300270
[13]	姚烨, 朱怡安, 钱亮, 贾耀, 张黎翔, 刘瑞亮. 一种基于异质模型融合的 Android 终端恶意软件检测方法 Android Malware Detection Method Based on Heterogeneous Model Fusion 计算机科学, 2022, 49(6A): 508-515. https://doi.org/10.11896/jsjkx.210700103
[14]	王飞, 黄涛, 杨晔. 基于Stacking多模型融合的IGBT器件寿命的机器学习预测算法研究 Study on Machine Learning Algorithms for Life Prediction of IGBT Devices Based on Stacking Multi-model Fusion 计算机科学, 2022, 49(6A): 784-789. https://doi.org/10.11896/jsjkx.210400030
[15]	许杰, 祝玉坤, 邢春晓. 机器学习在金融资产定价中的应用研究综述 Application of Machine Learning in Financial Asset Pricing:A Review 计算机科学, 2022, 49(6): 276-286. https://doi.org/10.11896/jsjkx.210900127

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed