Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 250300040-9.doi: 10.11896/jsjkx.250300040

• Image Processing & Multimedia Technology •

Outdoor Self-supervised Monocular Depth Estimation Method Based on Gram Matrix Attention

JIA Hongjun1, ZHANG Hailong3, LI Jingguo1, ZHANG Huimin4, HAN Chenggong4, JIANG He2,4   

  1 Inner Mongolia Baiyinhua Mongdong Open-Pit Coal Industry Co.,Ltd.,Xilin Gol League,Inner Mongolia 026200,China
    2 Key Laboratory of Pattern Recognition and Intelligent Information Processing,Institutions of Higher Education of Sichuan Province,Chengdu University,Chengdu 610106,China
    3 School of Computer Science and Technology,China University of Mining and Technology,Xuzhou,Jiangsu 221116,China
    4 School of Information and Control Engineering,China University of Mining and Technology,Xuzhou,Jiangsu 221116,China
  • Online:2025-11-15 Published:2025-11-10
  • Supported by:
    National Natural Science Foundation of China(52304182,52204177),Open Fund of the Sichuan Provincial University Key Laboratory of Pattern Recognition and Intelligent Information Processing,Chengdu University(MSSB-2024-04),Open Fund of the Key Laboratory of System Control and Information Processing,Ministry of Education(SCIP20240105) and Cooperative Project of the Technological Innovation Base for Mine Fine Exploration and Intelligent Monitoring(2023MPIM03).

Abstract: In outdoor depth estimation tasks, traditional U-Net-based models often ignore the correlations and differences among features during the feature extraction and fusion stages, and thus fail to fully exploit the interaction information between features. To address this problem, this study proposes an outdoor monocular depth estimation method based on Gram matrix attention. Specifically, a correlation matrix and a difference matrix between features are first designed by exploiting the decomposition properties of the Gram matrix, thereby enhancing information interaction among features and their representational capacity. On this basis, the mask generated by the Gram matrix attention mechanism is deeply fused with the features extracted by the convolutional layers. By combining the salient features attended to by the attention mechanism with the fine details captured by the convolutional layers, diversity and completeness of feature representation are achieved. Extensive experiments show that introducing the Gram matrix attention mechanism improves network performance on the outdoor scene dataset KITTI: the proposed method raises the δ1 metric to 0.880 while reducing the absolute relative error to 0.112. In addition, test results on the Make3D dataset further validate the superiority of the proposed model, with the absolute relative error, squared relative error, and root-mean-square error reaching 0.318, 3.174, and 7.163, respectively.
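The mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `GramMatrixAttention`, the softmax normalizations, and the residual blend weight `gamma` are all assumptions, since the abstract only states that correlation and difference matrices are derived from a Gram-matrix decomposition and fused with the convolutional features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GramMatrixAttention(nn.Module):
    """Illustrative sketch of a channel-wise Gram-matrix attention block.

    Assumptions: the correlation matrix is a softmax over the raw channel
    Gram matrix, and the difference matrix measures how far each pairwise
    similarity deviates from the channel's self-similarity.
    """

    def __init__(self):
        super().__init__()
        # learnable fusion weight, zero-initialized so the block starts
        # as an identity mapping over the convolutional features
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        f = x.view(b, c, h * w)                     # (B, C, N)
        gram = torch.bmm(f, f.transpose(1, 2))      # (B, C, C) channel Gram matrix
        corr = F.softmax(gram, dim=-1)              # correlation matrix
        diag = gram.diagonal(dim1=1, dim2=2).unsqueeze(-1)  # (B, C, 1) self-similarity
        diff = F.softmax(diag - gram, dim=-1)       # difference matrix
        # attention mask built from both views, applied to the flattened features
        mask = torch.bmm(corr + diff, f).view(b, c, h, w)
        # deep fusion with the convolutional features via a residual blend
        return x + self.gamma * mask
```

Because `gamma` starts at zero, the block initially passes the convolutional features through unchanged and learns during training how strongly to mix in the Gram-matrix attention mask.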

Key words: Depth estimation, Gram matrix, Correlation matrix, Difference matrix, Feature fusion

CLC Number: TP391.4
[1]YANG R,DING Z,YANG J,et al.Simulation system of mine unmanned vehicle based on parallel control theory[J].Industry and Mine Automation,2022,48(11):180-183.
[2]IZADINIA H,SHAN Q,SEITZ S M.IM2CAD[C]//IEEE Conference on Computer Vision and Pattern Recognition.IEEE Press,2017:2422-2431.
[3]LADICKY L,SHI J,POLLEFEYS M.Pulling things out of perspective [C]//IEEE Conference on Computer Vision and Pattern Recognition.2014:89-96.
[4]GEIGER A,LENZ P,STILLER C,et al.Vision meets robotics:The KITTI dataset[J].The International Journal of Robotics Research,2013,32(11):1231-1237.
[5]LUO Q,GUO T,YAO Y.Laser ranging cooperative target for CubeSat[J].Optics and Precision Engineering,2017,25(7):1705-1713.
[6]CHENG Y,ZHAO Y,LUO Z,et al.Uncertainty evaluation analysis of surface features in surface structured light measurement[J].Optics and Precision Engineering,2022,30(17):2039-2049.
[7]VARMA A,CHAWLA H,ZONOOZ B,et al.Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics[J].arXiv:2202.03131,2022.
[8]ZHANG N,NEX F,VOSSELMAN G,et al.Lite-Mono:A Lightweight CNN and Transformer Architecture for Self-supervised Monocular Depth Estimation[C]//IEEE Conference on Computer Vision and Pattern Recognition(CVPR).2023:18537-18546.
[9]XIANG X,WANG Y,ZHANG L,et al.Self-supervised Mono-cular Depth Estimation with Large Kernel Attention[J].arXiv:2409.17895,2024.
[10]CHEN R,LUO H,ZHAO F,et al.Structure-Centric Robust Monocular Depth Estimation via Knowledge Distillation[C]//Asian Conference on Computer Vision.Singapore:Springer,2025.
[11]LIU J,GUO Z Y,PING P,et al.Channel Interaction and Transformer Depth Estimation Network:Robust Self-Supervised Depth Estimation Under Varied Weather Conditions[J].Sustainability,2024,16(20):9131.
[12]GODARD C,MAC AODHA O,BROSTOW G J.Unsupervised monocular depth estimation with left-right consistency[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:270-279.
[13]XIE J,GIRSHICK R,FARHADI A.Deep3D:Fully automatic 2D-to-3D video conversion with deep convolutional neural networks[C]//European Conference on Computer Vision.Springer International Publishing,2016:842-857.
[14]GARG R,BG V K,CARNEIRO G,et al.Unsupervised CNN for single view depth estimation:Geometry to the rescue[C]//European Conference on Computer Vision.Springer International Publishing,2016:740-756.
[15]BADKI A,TROCCOLI A,KIM K,et al.Bi3D:Stereo depth estimation via binary classifications[C]//IEEE Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2020:1597-1605.
[16]DU Q,LIU R,PAN Y,et al.Depth estimation with multi-resolution stereo matching[C]//IEEE Visual Communications and Image Processing.IEEE Press,2019:1-4.
[17]JOHNSTON A,CARNEIRO G.Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume[C]//IEEE Conference on Computer Vision and Pattern Recognition.IEEE Press,2020:4755-4764.
[18]ZHOU T,BROWN M,SNAVELY N,et al.Unsupervised learning of depth and ego-motion from video[C]//IEEE Conference on Computer Vision and Pattern Recognition.2017:1851-1858.
[19]VIJAYANARASIMHAN S,RICCO S,SCHMID C,et al.Sfm-net:Learning of structure and motion from video[J/OL].CoRR,2017,abs/1704.07804.
[20]YIN Z,SHI J.GeoNet:Unsupervised learning of dense depth,optical flow and camera pose[C]//IEEE Conference on Computer Vision and Pattern Recognition.2018:1983-1992.
[21]JANAI J,GUNEY F,RANJAN A,et al.Unsupervised learning of multi-frame optical flow with occlusions[C]//European Conference on Computer Vision.2018:690-706.
[22]WAN Y,ZHAO Q,GUO C H,et al.Multi-sensor fusion self-supervised deep odometry and depth estimation[J].Remote Sensing,2022,14(5):1228.
[23]MAHJOURIAN R,WICKE M,ANGELOVA A.Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints[C]//IEEE Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE,2018:5667-5675.
[24]ZHANG X,LI C,WANG Y,et al.Light field depth estimation for scene with occlusion[J].Control and Decision,2018,33(12):2122-2130.
[25]MNIH V,HEESS N,GRAVES A.Recurrent models of visual attention[J].arXiv:1406.6247,2014.
[26]BAHDANAU D,CHO K,BENGIO Y.Neural machine translation by jointly learning to align and translate[C]//International Conference on Learning Representations.2015.
[27]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017:6000-6010.
[28]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:transformers for image recognition at scale[C]//International Conference on Learning Representations.OpenReview.net,2021.
[29]ZHOU H,GREENWOOD D,TAYLOR S.Self-Supervised Monocular Depth Estimation with Internal Feature Fusion[C]//British Machine Vision Conference 2021.BMVA Press,2021:378.
[30]VARMA A,CHAWLA H,ZONOOZ B,et al.Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics[C]//Proceedings of the 17th International Joint Conference on Computer Vision.2022:758-769.
[31]ZHAO C,ZHANG Y,POGGI M,et al.MonoViT:Self-Supervised Monocular Depth Estimation with a Vision Transformer[C]//International Conference on 3D Vision.Czech Republic,2022:668-678.
[32]LEE Y,KIM J,WILLETTE J,et al.MPViT:Multi-path vision transformer for dense prediction[C]//IEEE Conference on Computer Vision and Pattern Recognition.2022:7287-7296.
[33]WANG Z,BOVIK A C,SHEIKH H R,et al.Image quality assessment:from error visibility to structural similarity[J].IEEE Transactions on Image Processing,2004,13(4):600-612.
[34]GODARD C,MAC AODHA O,FIRMAN M,et al.Digging into self-supervised monocular depth estimation[C]//IEEE International Conference on Computer Vision.2019:3828-3838.
[35]EIGEN D,PUHRSCH C,FERGUS R.Depth map prediction from a single image using a multi-scale deep network[C]//NIPS'14,2014.
[36]WONG A,HONG B W,SOATTO S.Bilateral Cyclic Constraint and Adaptive Regularization for Unsupervised Monocular Depth Prediction[C]//IEEE Conference on Computer Vision and Pattern Recognition.2019:5644-5653.
[37]ZOU Y,LUO Z,HUANG J B.Df-net:Unsupervised joint lear-ning of depth and flow using cross-task consistency[C]//Proceedings of the European Conference on Computer Vision.2018:36-53.
[38]RANJAN A,JAMPANI V,BALLES L,et al.Competitive col-laboration:Joint unsupervised learning of depth,camera motion,optical flow and motion segmentation[C]//IEEE Conference on Computer Vision and Pattern Recognition.2019:12240-12249.
[39]CASSER V,PIRK S,MAHJOURIAN R,et al.Depth prediction without the sensors:Leveraging structure for unsupervised learning from monocular videos[C]//Proceedings of The AAAI Conference on Artificial Intelligence.2019:8001-8008.
[40]ZHOU Z,FAN X,SHI P,et al.R-MSFM:Recurrent multi-scale feature modulation for monocular depth estimating[C]//IEEE International Conference on Computer Vision.2021:12777-12786.
[41]SURI Z K.Pose Constraints for Consistent Self-supervised Monocular Depth and Ego-Motion [C]//Scandinavian Conference on Image Analysis.Springer,2023:340-353.
[42]BAE J,MOON S,IM S.Deep digging into the generalization of self-supervised monocular depth estimation[C]//36th AAAI Conference on Artificial Intelligence.2023:187-196.
[43]WANG W,HAN J,ZOU X,et al.Research on Fast Monocular Depth Estimation Algorithm Based on Edge Devices[J].Computer Measurement & Control,2025,33(4):262-269.