Computer Science ›› 2024, Vol. 51 ›› Issue (1): 143-149. doi: 10.11896/jsjkx.230600079

• Database & Big Data & Data Science •

Pre-training of Heterogeneous Graph Neural Networks for Multi-label Document Classification

WU Jiawei1, FANG Quan2, HU Jun2, QIAN Shengsheng2   

  1. Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450002, China
    2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2023-06-08  Revised: 2023-10-09  Online: 2024-01-15  Published: 2024-01-12
  • Corresponding author: FANG Quan (qfang@nlpr.ia.ac.cn)
  • About author: WU Jiawei (robin.wujw@gmail.com), born in 1998, postgraduate. His main research interests include multi-label classification, graph neural networks and knowledge graphs.
    FANG Quan, born in 1988, associate professor. His main research interest is multimedia knowledge computing.
  • Supported by:
    National Natural Science Foundation of China(62072456,62036012,62106262).

Abstract: Multi-label document classification aims to associate document instances with their relevant labels, and has received increasing research attention in recent years. Existing multi-label document classification methods attempt to fuse information beyond the text itself, such as document metadata or label structure. However, these methods either exploit only the surface semantics of the metadata or ignore the long-tail distribution of labels, thereby missing the higher-order relationships between documents and their metadata as well as the distribution pattern of labels, which hurts classification accuracy. This paper therefore proposes a new multi-label document classification method based on pre-training of heterogeneous graph neural networks. The method constructs a heterogeneous graph from documents and their metadata, adopts two contrastive pre-training objectives to capture the relationships between documents and their metadata, and mitigates the long-tail label distribution with a balanced loss function. Experimental results on the benchmark dataset show that the proposed method outperforms Transformer, BertXML and MATCH in accuracy by 8%, 4.75% and 1.3%, respectively.
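
To make the method description concrete, the following PyTorch sketch illustrates the two ingredients named in the abstract: a contrastive pre-training objective over document-metadata pairs and a loss that reweights a long-tailed label distribution. It is a minimal illustration, not the authors' implementation: it uses a standard in-batch InfoNCE objective in the spirit of [22] and the class-balanced reweighting of [24]; all function names, tensor shapes and hyperparameter values (temperature, beta) are assumptions.

# Illustrative sketch only; names, shapes and hyperparameters are
# assumptions, not the authors' actual implementation.
import torch
import torch.nn.functional as F

def info_nce_loss(doc_emb, meta_emb, temperature=0.07):
    # In-batch InfoNCE (Oord et al. [22]): the i-th document node
    # embedding should score highest against its own metadata neighbor
    # embedding; the other rows of the batch act as negatives.
    doc_emb = F.normalize(doc_emb, dim=-1)    # (batch, dim)
    meta_emb = F.normalize(meta_emb, dim=-1)  # (batch, dim)
    logits = doc_emb @ meta_emb.t() / temperature           # (batch, batch)
    targets = torch.arange(doc_emb.size(0), device=doc_emb.device)
    return F.cross_entropy(logits, targets)

def class_balanced_bce(logits, labels, samples_per_label, beta=0.9999):
    # Class-balanced BCE (Cui et al. [24]): weight each label by the
    # inverse "effective number of samples" (1 - beta^n) / (1 - beta),
    # so tail labels are not drowned out by head labels.
    effective_num = 1.0 - torch.pow(beta, samples_per_label.float())
    weights = (1.0 - beta) / effective_num                  # (num_labels,)
    weights = weights / weights.sum() * weights.numel()     # mean weight = 1
    return F.binary_cross_entropy_with_logits(
        logits, labels.float(), weight=weights.expand_as(logits))

A training step could simply sum the two terms, e.g. loss = class_balanced_bce(logits, labels, counts) + 0.1 * info_nce_loss(doc_emb, meta_emb). Note that the paper may instead follow the distribution-balanced loss of [25], which additionally corrects for label co-occurrence.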

Key words: Multi-label document classification, Metadata, Heterogeneous graph neural network, Pre-training, Long-tail distribution

CLC number: TP391
[1]DONG Y,MA H,SHEN Z,et al.A century of science:Globalization of scientific collaborations,citations,and innovations[C]//Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2017:1437-1446.
[2]WANG K,SHEN Z,HUANG C,et al.Microsoft Academic Graph:When experts are not enough[J].Quantitative Science Studies,2020,1(1):396-413.
[3]MINAEE S,KALCHBRENNER N,CAMBRIA E,et al.Deep learning-based text classification:a comprehensive review[J].ACM Computing Surveys(CSUR),2021,54(3):1-40.
[4]ZHANG Y,SHEN Z,DONG Y,et al.MATCH:Metadata-aware text classification in a large hierarchy[C]//Proceedings of the Web Conference 2021.2021:3246-3257.
[5]AGGARWAL C C,ZHAI C X.A survey of text classification algorithms[M]//Mining text data.2012:163-222.
[6]HAO C,QIU H P,SUN Y,et al.Research Progress of Multi-label Text Classification[J].Computer Engineering and Applications,2021,57(10):48-56.
[7]LIU J,CHANG W C,WU Y,et al.Deep learning for extreme multi-label text classification[C]//Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.2017:115-124.
[8]YOU R,ZHANG Z,WANG Z,et al.AttentionXML:Label tree-based attention-aware deep model for high-performance extreme multi-label text classification[C]//Advances in Neural Information Processing Systems 32:Annual Conference on Neural Information Processing Systems.2019:5812-5822.
[9]ZHANG W,YAN J,WANG X,et al.Deep extreme multi-label learning[C]//Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval.2018:100-107.
[10]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of deep bidirectional transformers for language understanding[J].arXiv:1810.04805,2018.
[11]HUANG Y,GILEDERELI B,KÖKSAL A,et al.Balancing methods for multi-label text classification with long-tailed class distribution[J].arXiv:2109.04712,2021.
[12]CHANG W C,YU H F,ZHONG K,et al.Taming pretrained transformers for extreme multi-label text classification[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2020:3163-3171.
[13]GONG J,TENG Z,TENG Q,et al.Hierarchical graph transformer-based deep learning model for large-scale multi-label text classification[J].IEEE Access,2020,8:30885-30896.
[14]MA Y L,LIU X F,ZHAO L J,et al.Hybrid embedding-based text representation for hierarchical multi-label text classification[J].Expert Systems with Applications,2022,187:115905.
[15]TANG D,QIN B,LIU T.Learning semantic representations of users and products for document level sentiment classification[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing(volume 1:long papers).2015:1014-1023.
[16]KIM J,AMPLAYO R K,LEE K,et al.Categorical metadata representation for customized text classification[J].Transactions of the Association for Computational Linguistics,2019,7:201-215.
[17]ZHANG Y,SHEN Z,WU C H,et al.Metadata-induced contrastive learning for zero-shot multi-label text classification[C]//Proceedings of the ACM Web Conference 2022.2022.
[18]YANG P,SUN X,LI W,et al.SGM:sequence generation model for multi-label classification[J].arXiv:1806.04822,2018.
[19]WANG J,CHEN Z,LI H,et al.Hierarchical multi-label classification using incremental hypernetwork[J].Journal of Chongqing University of Posts & Telecommunications(Natural Science Edition),2019,31(4):12.
[20]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017:6000-6010.
[21]JIANG X,JIA T,FANG Y,et al.Pre-training on large-scale heterogeneous graph[C]//Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.2021:756-766.
[22]OORD A,LI Y,VINYALS O.Representation learning with contrastive predictive coding[J].arXiv:1807.03748,2018.
[23]BA J L,KIROS J R,HINTON G E.Layer normalization[J].arXiv:1607.06450,2016.
[24]CUI Y,JIA M,LIN T Y,et al.Class-balanced loss based on effective number of samples[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:9268-9277.
[25]WU T,HUANG Q,LIU Z,et al.Distribution-balanced loss for multi-label classification in long-tailed datasets[C]//Computer Vision-ECCV 2020:16th European Conference.Springer International Publishing,2020:162-178.
[26]LU Z Y.PubMed and beyond:a survey of web tools for searching biomedical literature[J/OL].https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025693/pdf/baq036.pdf.
[27]XUN G,JHA K,YUAN Y,et al.MeSHProbeNet:a self-attentive probe net for MeSH indexing[J].Bioinformatics,2019,35(19):3794-3802.
[28]GUO Q,QIU X,LIU P,et al.Star-transformer[J].arXiv:1902.09113,2019.
[29]XUN G,JHA K,SUN J,et al.Correlation networks for extreme multi-label text classification[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2020:1074-1082.
[30]PENNINGTON J,SOCHER R,MANNING C D.GloVe:Global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing(EMNLP).2014:1532-1543.