Coherent Legal Governance of Synthetic Data in AI Training

doi:10.11896/jsjkx.240900163

Abstract

Abstract: Artificial intelligence requires large-scale, diverse, and high-quality data to train machine learning models, yet collecting such real-world data can be costly and may threaten personal privacy, introduce bias or discrimination, and infringe copyright. In practice, synthetic data has attracted wide attention as an alternative and is increasingly used to train machine learning models. From the perspective of data jurisprudence, and drawing on research in data science and computer science, this paper explores a governance framework for synthetic data in AI training. It first analyzes, at the normative level, the logical premise behind the attention paid to synthetic data in AI training: the protection of “small privacy” pursued by personal information protection law is clearly incompatible with the “big data” demands of AI training, which makes the development of training data challenging, while existing legal and technical solutions both suffer from weak governance effectiveness. On this basis, it discusses the application scenarios and risk types of synthetic data in AI training. Finally, guided by “Law 3.0 theory” and “data governance theory”, it proposes building a coherent legal governance framework for synthetic data in AI training along three lines: formulating rules for processing synthetic data, strengthening process governance of synthetic data, and developing assessment tools for synthetic data.

Key words: Artificial intelligence, Synthetic data, Law 3.0, Coherent governance, Data law

CLC number: P181

ZHANG Tao. Coherent Legal Governance of Synthetic Data in AI Training[J]. Computer Science, 2025, 52(2): 20-32. https://doi.org/10.11896/jsjkx.240900163

