Computer Science ›› 2026, Vol. 53 ›› Issue (3): 151-157. doi: 10.11896/jsjkx.250600149

• Database & Big Data & Data Science •

  • Corresponding author: HUANG Shengjun (huangsj@nuaa.edu.cn)
  • About author: GE Zeqing (gezeqing@nuaa.edu.cn)

Semi-supervised Learning Method for Multi-label Tabular Data

GE Zeqing, HUANG Shengjun   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Received: 2025-06-24  Revised: 2025-08-22  Online: 2026-03-12
  • About author: GE Zeqing, born in 2000, postgraduate. His main research interests include multi-label learning and semi-supervised learning.
    HUANG Shengjun, born in 1987, Ph.D., professor, Ph.D. supervisor, is a member of CCF (No. 42916S). His main research interests include machine learning and pattern recognition.
  • Supported by:
    Excellent Young Scientists Fund of the National Natural Science Foundation of China (62222605) and the Ye Qisun Science Fund (U2441285).


Abstract: Tabular data is ubiquitous in industrial applications, spanning fields such as medicine, finance, and manufacturing, where each sample is characterized by heterogeneous features. Multi-label classification for tabular data is crucial for capturing the complex, interconnected nature of real-world phenomena, yet obtaining large-scale labeled datasets is often costly. While semi-supervised learning has shown success on image and text data by leveraging unlabeled samples, its application to tabular data remains challenging: tabular data lacks the inherent spatial or semantic structure that makes conventional augmentation and consistency-based methods effective. To address these challenges, this paper proposes a novel semi-supervised learning framework tailored for multi-label tabular data. The approach introduces a structure-preserving data augmentation method that adds Gaussian noise in the feature representation space, preserving the original data structure, together with a consistency-based regularization technique between samples and their perturbed versions to enhance generalization. Additionally, an attention-based mechanism is developed to selectively aggregate neighborhood information from labeled data, allowing the model to exploit local feature correlations effectively. For unlabeled data, a state-of-the-art pseudo-labeling strategy enables iterative refinement of model predictions. Extensive experiments on ten public multi-label tabular datasets covering various domains validate the robustness of the proposed method, and the results demonstrate its effectiveness, advancing the state of semi-supervised multi-label learning for tabular data.
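The three ingredients named in the abstract — Gaussian-noise augmentation in a representation space, a consistency loss between a sample and its perturbed version, and attention-weighted aggregation of labels from labeled neighbors — can be sketched in a few lines of NumPy. This is an illustrative reconstruction only, not the authors' implementation: all function names and hyperparameters (`sigma`, `tau`) are assumptions, and the consistency loss here compares raw representations as a stand-in for model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_augment(z, sigma=0.1, rng=rng):
    """Structure-preserving augmentation: perturb a representation
    with isotropic Gaussian noise (sigma is an assumed value)."""
    return z + rng.normal(0.0, sigma, size=z.shape)

def consistency_loss(p, p_aug):
    """Mean squared difference between outputs for a sample
    and for its perturbed version."""
    return float(np.mean((p - p_aug) ** 2))

def attention_aggregate(z_query, z_labeled, y_labeled, tau=1.0):
    """Softmax attention over labeled neighbors: weight each labeled
    sample by its similarity to the query, then aggregate its labels."""
    scores = z_labeled @ z_query / tau          # (n_labeled,)
    scores -= scores.max()                      # numerical stability
    w = np.exp(scores)
    w /= w.sum()                                # attention weights sum to 1
    return w @ y_labeled                        # (n_labels,) soft label vector

# Toy demo: 5 labeled samples, 8-dim representations, 3 labels.
z_lab = rng.normal(size=(5, 8))
y_lab = rng.integers(0, 2, size=(5, 3)).astype(float)
z_q = z_lab[0] + 0.01 * rng.normal(size=8)      # query near the first sample

agg = attention_aggregate(z_q, z_lab, y_lab, tau=0.1)
z_aug = gaussian_augment(z_q, sigma=0.05)
loss = consistency_loss(z_q, z_aug)
```

Because the attention weights form a convex combination, `agg` stays inside the convex hull of the binary label vectors, i.e. each entry lies in [0, 1] and can be read as a soft label for the query.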

Key words: Tabular data, Multi-label classification, Semi-supervised learning, Data augmentation, Attention mechanism

CLC number: TP181