Computer Science ›› 2025, Vol. 52 ›› Issue (10): 3-12. doi: 10.11896/jsjkx.250800044

• Digital Intelligence Enabling FinTech Frontiers •

Survey of Tabular Data Generation Techniques

WANG Yongxin1,2, XU Xin3, ZHU Hongbin1,2

  1. Institute of Financial Technology, Fudan University, Shanghai 200433, China
    2. College of Computer Science and Artificial Intelligence, Fudan University, Shanghai 200433, China
    3. School of Computer Science and Artificial Intelligence, Shanghai Lixin University of Accounting and Finance, Shanghai 201219, China
  • Received: 2025-08-12  Revised: 2025-09-20  Online: 2025-10-15  Published: 2025-10-14
  • About author: WANG Yongxin, born in 2001, postgraduate. His main research interest is tabular data generation.
    ZHU Hongbin, born in 1991, assistant professor, is a member of CCF (No. Q5992M). His main research interests include generative AI, graph learning and financial technology.
  • Supported by:
    National Natural Science Foundation of China (62306077) and National Key Research and Development Program of China (2023YFC3305304).

Abstract: Tabular data holds significant value owing to its widespread use in critical domains such as finance and healthcare. However, its effective utilization is often constrained by data scarcity, class imbalance, and stringent privacy regulations. To address these challenges, synthesizing samples that are statistically highly similar to real data with generative models has emerged as a promising solution, aiming to enhance data availability while protecting user privacy. The technology in this field has progressively evolved from traditional deep learning models to cutting-edge paradigms. Early explorations are represented by Variational Autoencoders and Generative Adversarial Networks, but these methods often suffer from training instability and mode collapse, which degrade the quality of the generated data. To overcome these difficulties, diffusion models have emerged, demonstrating clear advantages in generating high-fidelity and diverse samples through a progressive denoising process. Nevertheless, such models still essentially imitate statistical distributions and lack an understanding of real-world common sense. Consequently, the latest research has shifted towards methods based on Large Language Models (LLMs), leveraging their rich world knowledge to generate synthetic tabular data that is not only statistically faithful but also more reasonable in logic and semantics. This survey systematically reviews the field, aiming to provide researchers and practitioners with a comprehensive understanding of the technology and to offer decision-making references for selecting the most appropriate technical path in different application scenarios.
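
To make the LLM-based paradigm concrete, the following minimal Python sketch illustrates only the row-to-text serialization step that most LLM-based tabular generators share: each record is flattened into a "column is value" sentence that a language model can be fine-tuned on or prompted with, and generated sentences are parsed back into records. The toy schema, helper names, and the mocked sampling function are illustrative assumptions, not the method of any specific work surveyed here.

    # Sketch of the row <-> text serialization used by LLM-based tabular
    # generators. Hypothetical toy schema; the LLM call itself is mocked
    # with random sampling for the sake of a self-contained example.
    import random

    COLUMNS = ["age", "income", "occupation"]  # assumed toy schema

    def serialize_row(row: dict) -> str:
        """Flatten one record into a comma-separated 'col is val' sentence."""
        return ", ".join(f"{col} is {row[col]}" for col in COLUMNS)

    def deserialize_row(text: str) -> dict:
        """Parse a generated sentence back into a record (values stay strings)."""
        row = {}
        for part in text.split(", "):
            col, _, val = part.partition(" is ")
            row[col] = val
        return row

    def mock_llm_sample() -> str:
        """Stand-in for sampling one serialized row from a fine-tuned LLM."""
        row = {
            "age": random.randint(22, 65),
            "income": random.randint(20, 150) * 1000,
            "occupation": random.choice(["teacher", "engineer", "nurse"]),
        }
        return serialize_row(row)

    if __name__ == "__main__":
        real = {"age": 34, "income": 72000, "occupation": "engineer"}
        print(serialize_row(real))                 # training text for the LLM
        print(deserialize_row(mock_llm_sample()))  # synthetic record

In practice the mock_llm_sample stub would be replaced by sampling from a language model fine-tuned (or prompted) on the serialized real rows, and the parsed records would then be filtered for schema and type validity before use.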

Key words: Tabular data generation, Large language model, Generative methods

CLC Number: TP183