Computer Science, 2025, Vol. 52, Issue (10): 3-12. doi: 10.11896/jsjkx.250800044

• Frontiers of Digital-Intelligence-Empowered Financial Technology •

Survey of Tabular Data Generation Techniques

WANG Yongxin1,2, XU Xin3, ZHU Hongbin1,2

    1 Institute of Financial Technology, Fudan University, Shanghai 200433, China
    2 College of Computer Science and Artificial Intelligence, Fudan University, Shanghai 200433, China
    3 School of Computer Science and Artificial Intelligence, Shanghai Lixin University of Accounting and Finance, Shanghai 201209, China
  • Received: 2025-08-12 Revised: 2025-09-20 Online: 2025-10-15 Published: 2025-10-14
  • Corresponding author: ZHU Hongbin (zhuhb@fudan.edu.cn)
  • About author: WANG Yongxin (yongxinwang24@m.fudan.edu.cn), born in 2001, postgraduate. His main research interest is tabular data generation.
    ZHU Hongbin, born in 1991, assistant professor, is a member of CCF (No. Q5992M). His main research interests include generative AI, graph learning and financial technology.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (62306077) and National Key Research and Development Program of China (2023YFC3305304).

Abstract: Tabular data holds significant value because of its widespread use in critical domains such as finance and healthcare. However, the effective use of tabular data is often constrained by data scarcity, class imbalance, and stringent privacy regulations. To address these challenges, using generative models to synthesize samples that are statistically highly similar to real data has emerged as a solution that aims to enhance data availability and protect user privacy. Techniques in this field have evolved from traditional deep learning models to more recent paradigms. Early explorations are represented by Variational Autoencoders and Generative Adversarial Networks, but these methods often suffer from training instability and mode collapse, which degrade the quality of the generated data. To overcome these difficulties, diffusion models have emerged, showing significant advantages in generating high-fidelity and diverse samples through a progressive denoising process. Nevertheless, the core of these models remains the imitation of statistical distributions, and they lack an understanding of real-world common sense. Consequently, the latest research has shifted towards methods based on Large Language Models (LLMs), leveraging their rich world knowledge to generate synthetic tabular data that is not only statistically realistic but also more reasonable in logic and semantics. This systematic review of the field aims to give researchers and practitioners a comprehensive understanding of the technology and to offer a reference for selecting the most appropriate technical path in different application scenarios.

Key words: Tabular data generation, Large language model, Generative methods
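The abstract's notion of synthetic samples being "statistically highly similar to real data" can be made concrete with a per-column distribution check. The sketch below is only an illustration, not the evaluation protocol of the survey: it assumes the two-sample Kolmogorov-Smirnov statistic as the similarity measure, and the function name `ks_statistic`, the column names, and the toy values are all hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of the two samples:
    0.0 means the empirical distributions coincide, 1.0 means they do not overlap.
    """
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap is
    # attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical toy tables (column name -> numeric values), for illustration only.
real = {"age": [23, 35, 41, 52, 60], "income": [30, 42, 55, 61, 80]}
synthetic = {"age": [25, 33, 44, 50, 58], "income": [90, 95, 99, 120, 150]}

# One score per column; lower means the synthetic column tracks the real one more closely.
report = {col: ks_statistic(real[col], synthetic[col]) for col in real}
print(report)
```

In this toy case the synthetic `age` column follows the real one closely, while the synthetic `income` column lies entirely above the real values and gets the maximum score. A check of this kind covers only marginal distributions of numeric columns; it says nothing about cross-column dependencies or about the logical and semantic plausibility that the abstract associates with LLM-based methods.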

CLC Number:

  • TP183