Computer Science ›› 2025, Vol. 52 ›› Issue (2): 20-32. doi: 10.11896/jsjkx.240900163

• Discipline Frontier •

Coherent Legal Governance of Synthetic Data in AI Training

ZHANG Tao   

  1. Ministry of Education Laboratory of Philosophy and Social Sciences-Data Law Laboratory of China University of Political Science and Law, Beijing 100088, China
    Institute for Data Law, China University of Political Science and Law, Beijing 100088, China
    Institute of Digital Society Governance, China University of Political Science and Law, Beijing 100088, China
  • Received: 2024-09-27 Revised: 2024-12-09 Online: 2025-02-15 Published: 2025-02-17
  • About author: ZHANG Tao, born in 1991, Ph.D., associate professor, is a member of CCF (No. Y1355M). His main research interests include data law, administrative law, and artificial intelligence law.

Abstract: Training machine learning models requires large, diverse, and high-quality data, yet collecting such real-world data can be very difficult and can threaten individual privacy, trigger bias or discrimination, and infringe copyright. In practice, synthetic data has received widespread attention as an alternative solution and is increasingly used to train machine learning models. This paper explores the governance framework for synthetic data in AI training from the perspective of data jurisprudence, drawing on research from both data science and computer science. It first analyzes, at the normative level, the logical premise for the importance of synthetic data in AI training: there is an obvious incompatibility between the protection of “small privacy” pursued by personal information protection law and the demand for “big data” in AI training. As a result, the development of training data faces challenges, while existing legal and technological solutions suffer from ineffective governance. On this basis, the application scenarios and risk types of synthetic data in AI training are discussed. Finally, guided by the “Law 3.0 theory” and “data governance theory”, the paper proposes building a coherent legal governance framework for synthetic data in AI training from three aspects: formulating rules for handling synthetic data, strengthening process governance of synthetic data, and developing assessment tools for synthetic data.

Key words: Artificial intelligence, Synthetic data, Law 3.0, Coherent governance, Data law

CLC Number: 

  • P181