Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 250300040-9. doi: 10.11896/jsjkx.250300040

• Computer Graphics & Multimedia •


Outdoor Self-supervised Monocular Depth Estimation Method Based on Gram Matrix Attention

JIA Hongjun1, ZHANG Hailong3, LI Jingguo1, ZHANG Huimin4, HAN Chenggong4, JIANG He2,4   

  1 Inner Mongolia Baiyinhua Mongdong Open-Pit Coal Industry Co., Ltd., Xilin Gol League, Inner Mongolia 026200, China
    2 Key Laboratory of Pattern Recognition and Intelligent Information Processing, Institutions of Higher Education of Sichuan Province, Chengdu University, Chengdu 610106, China
    3 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
    4 School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
  • Online: 2025-11-15  Published: 2025-11-10
  • Corresponding author: JIANG He (jianghe@cumt.edu.cn)
  • About author: byhmdmygs@163.com
  • Supported by:
National Natural Science Foundation of China (52304182, 52204177), Open Fund of the Sichuan Provincial University Key Laboratory of Pattern Recognition and Intelligent Information Processing, Chengdu University (MSSB-2024-04), Open Fund of the Key Laboratory of System Control and Information Processing, Ministry of Education (SCIP20240105) and Cooperative Project of the Technological Innovation Base for Mine Fine Exploration and Intelligent Monitoring (2023MPIM03).


Abstract: In outdoor depth estimation, traditional U-Net-based models often ignore the correlations and differences between features during the feature extraction and fusion stages, and thus fail to fully exploit the interaction information among features. To address this problem, this study proposes an outdoor self-supervised monocular depth estimation method based on Gram matrix attention. Specifically, a correlation matrix and a difference matrix between features are first constructed by exploiting the decomposition properties of the Gram matrix, which strengthens both the information interaction among features and their representational power. On this basis, the mask generated by the Gram matrix attention mechanism is deeply fused with the features extracted by the convolutional layers. By combining the salient features highlighted by the attention mechanism with the fine details captured by the convolutional layers, the method achieves diverse and complete feature representations. Extensive experiments show that introducing the Gram matrix attention mechanism improves network performance on the outdoor scene dataset KITTI: the proposed method raises the δ1 metric to 0.880 and reduces the absolute error metric to 0.112. In addition, tests on the Make3D dataset further validate the superiority of the model, with the absolute relative error, root-mean-square relative error, and root-mean-square error reaching 0.318, 3.174, and 7.163, respectively.
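To make the mechanism described in the abstract concrete, the following is a minimal PyTorch sketch of a Gram-matrix channel attention block. It is an illustrative reading, not the authors' implementation: the correlation matrix is taken to be the normalized channel-wise Gram matrix, the difference matrix its pairwise squared-distance counterpart derived from the same Gram entries, and the softmax mask, 1×1-convolution projection, and residual fusion with the convolutional features are all hypothetical design choices (the class name GramMatrixAttention is likewise invented for this sketch).

```python
# Minimal sketch of a Gram-matrix channel attention block (assumptions noted above).
import torch
import torch.nn as nn

class GramMatrixAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv to project the re-mixed channels back (hypothetical fusion choice)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        f = x.flatten(2)                                      # (B, C, N), N = H*W
        # Correlation matrix: normalized Gram matrix of channel features
        gram = torch.bmm(f, f.transpose(1, 2)) / f.shape[-1]  # (B, C, C)
        diag = torch.diagonal(gram, dim1=1, dim2=2)           # (B, C) channel self-energies
        # Difference matrix: ||f_i - f_j||^2 / N = g_ii + g_jj - 2*g_ij
        diff = diag.unsqueeze(2) + diag.unsqueeze(1) - 2.0 * gram
        # Combine correlation and difference cues into a channel-mixing mask
        mask = torch.softmax(gram - diff, dim=-1)             # (B, C, C)
        attended = torch.bmm(mask, f).view(b, c, h, w)        # re-mix channels by the mask
        # Fuse the attention output with the original convolutional features (residual)
        return self.proj(attended) + x

# Usage: refine an encoder feature map before it enters the decoder stage.
feat = torch.randn(2, 64, 48, 160)
out = GramMatrixAttention(64)(feat)
print(out.shape)  # torch.Size([2, 64, 48, 160])
```

The residual form keeps the fine convolutional details intact while the mask injects the cross-channel correlation/difference cues, mirroring the fusion of attention masks with convolutional features that the abstract describes.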

Key words: Depth estimation, Gram matrix, Correlation matrix, Difference matrix, Feature fusion

CLC number: 

  • TP391.4