Computer Science ›› 2024, Vol. 51 ›› Issue (11): 112-132.doi: 10.11896/jsjkx.231100089

• Computer Graphics & Multimedia •

Review of Visual Representation Learning

WANG Shuaiwei1, LEI Jie1, FENG Zunlei2, LIANG Ronghua1   

  1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
  2. College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
  • Received: 2023-11-14  Revised: 2024-04-11  Online: 2024-11-15  Published: 2024-11-06
  • About author: WANG Shuaiwei, born in 1999, postgraduate. His main research interests include few-shot video segmentation and visual representation learning.
    LEI Jie, born in 1991, assistant professor. His main research interests include deep network optimization and visual representation learning.
  • Supported by:
    National Natural Science Foundation of China (62106226, 62036009) and Natural Science Foundation of Zhejiang Province, China (LQ22F020013, LDT23F0202).

Abstract: Representation learning is an important step in artificial intelligence algorithms, where a well-designed representation can boost downstream tasks. With the development of deep learning in computer vision, visual representation learning has become increasingly important, aiming to transform complex visual information into representations that are easier for artificial intelligence algorithms to learn. In this paper, we focus on current research widely used in visual representation learning, which is categorized, according to the degree and type of data dependency, as pre-trained visual representation learning, generative visual representation learning, contrastive visual representation learning, decoupled visual representation learning, and visual representation learning combined with language information. Specifically, pre-trained visual representation learning applies supervised pre-training models to visual representation learning; generative visual representation learning uses generative models to learn visual representations; and contrastive visual representation learning focuses on the various network frameworks that use contrastive learning to learn visual representations. Besides, the paper presents the applications of VAE and GAN in decoupled visual representation learning, as well as various approaches to improving visual representation learning with language information. Finally, evaluation metrics in visual representation learning and future perspectives are summarized.
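To make the contrastive branch of the taxonomy concrete, the following is a minimal NumPy sketch of an InfoNCE-style loss, the objective underlying frameworks such as SimCLR and MoCo that the survey covers. It is an illustrative sketch, not code from the paper: the function names, the temperature value, and the toy embeddings are all assumptions chosen for the example.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of paired embeddings.

    z_a, z_b: (N, D) arrays of L2-normalised embeddings, where row i of
    z_a and row i of z_b are two views of the same image (a positive
    pair) and all other rows in z_b act as negatives for row i.
    """
    # Cosine-similarity matrix between the two views, scaled by temperature.
    logits = z_a @ z_b.T / temperature            # shape (N, N)
    # Numerically stable softmax cross-entropy with the diagonal
    # (matching pairs) as the targets.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
z = l2_normalize(rng.normal(size=(8, 32)))
# Identical views: positives dominate, so the loss is near its minimum.
aligned = info_nce_loss(z, z)
# Unrelated random views: the loss sits near its chance level.
random_views = info_nce_loss(z, l2_normalize(rng.normal(size=(8, 32))))
print(f"aligned pairs loss: {aligned:.4f}")
print(f"random pairs loss:  {random_views:.4f}")
```

The loss rewards each embedding for being closer to its positive pair than to every negative in the batch, which is the "alignment and uniformity" behaviour the contrastive section of the survey analyses.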

Key words: Visual representation learning, Artificial intelligence algorithm, Decoupled visual representation learning, Language information

CLC Number: TP391