Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230300045-6. doi: 10.11896/jsjkx.230300045

• Image Processing & Multimedia Technology •


UFormer:An End-to-End Feature Point Scene Matching Algorithm Based on Transformer and U-Net

XIN Rui, ZHANG Xiaoli, PENG Xiafu, CHEN Jinwen   

  1. School of Aerospace Engineering,Xiamen University,Xiamen,Fujian 361005,China
  • Published:2023-11-09
  • Corresponding author:ZHANG Xiaoli(zhangxl_xmu@163.com)
  • About author:XIN Rui(xr13013@163.com),born in 1998,is a master student.His main research interests include scene matching and computer vision navigation.
    ZHANG Xiaoli,born in 1970,Ph.D,associate professor.His main research interests include theoretical analysis of nonlinear systems,deep learning and integrated navigation.
  • Supported by:
    Aeronautical Science Foundation of China(201958068002).

Abstract: At present,most scene matching algorithms rely on traditional feature point matching,in which the pipeline consists of separate feature detection and feature matching stages,and both accuracy and matching success rate are low for weakly textured scenes.UFormer proposes an end-to-end solution that performs Transformer-based feature extraction and matching,using an attention mechanism to improve the algorithm's ability to deal with weakly textured scenes.Inspired by the U-Net architecture,UFormer builds a coarse-to-fine,sub-pixel-level mapping between images on top of an encoder-decoder structure.The encoder uses interleaved self-cross attention to detect and extract relevant features at each scale of the image pair,establish feature connections,and perform down-sampling for coarse-grained matching,which provides initial positions.The decoder up-samples to restore image resolution and fuses the attention feature maps at each scale to match at the fine-grained level,refining the matches to sub-pixel precision by taking an expectation over a local correlation distribution.A ground-truth homography matrix is introduced to compute the Euclidean distance between the coordinates of the coarse- and fine-grained matching point pairs and their ground-truth positions,and this distance serves as the feedback loss that supervises network training.By unifying feature detection and feature matching,UFormer has a simpler structure and improves real-time performance while maintaining accuracy,and it can cope with weakly textured scenes to a certain extent.On a collected dataset of UAV flight trajectories,compared with SIFT,coordinate accuracy improves by 0.183 pixels,matching time is reduced to 0.106 s,and the matching success rate on weakly textured scene images is higher.
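To make the encoder concrete, below is a minimal PyTorch sketch of the interleaved self-cross attention described above. The block name SelfCrossBlock, the feature dimension, and the use of nn.MultiheadAttention are illustrative assumptions rather than the authors' implementation; the point is only the interleaving, in which each image first attends to itself and then queries the other image to establish feature connections before down-sampling.

```python
import torch
from torch import nn

class SelfCrossBlock(nn.Module):
    """One hypothetical encoder stage: self attention within each image,
    then cross attention between the pair, with residual connections."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (B, L, dim) flattened feature maps of the image pair.
        # Self attention: each image aggregates context from itself.
        feat_a = feat_a + self.self_attn(feat_a, feat_a, feat_a)[0]
        feat_b = feat_b + self.self_attn(feat_b, feat_b, feat_b)[0]
        # Cross attention: each image queries the other, building the
        # feature connections used for coarse matching after down-sampling.
        new_a = feat_a + self.cross_attn(feat_a, feat_b, feat_b)[0]
        new_b = feat_b + self.cross_attn(feat_b, feat_a, feat_a)[0]
        return new_a, new_b
```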
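The expectation-based sub-pixel refinement also admits a small sketch: a softmax turns the local correlation window around each coarse match into a probability distribution, and the expectation of the coordinate grid under that distribution (a spatial soft-argmax) yields a sub-pixel offset. The window size and softmax temperature below are assumptions for illustration, not values from the paper.

```python
# Spatial soft-argmax sketch of expectation-based sub-pixel refinement.
import torch

def subpixel_refine(heatmap: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """heatmap: (N, W, W) correlation scores in a WxW window centered on
    each coarse match. Returns (N, 2) sub-pixel offsets (dx, dy)."""
    n, w, _ = heatmap.shape
    # Softmax turns the window scores into a probability distribution.
    prob = torch.softmax(heatmap.view(n, -1) / temperature, dim=-1).view(n, w, w)
    # Coordinate grid centered on the coarse match position.
    coords = torch.arange(w, device=heatmap.device, dtype=heatmap.dtype) - w // 2
    grid_y, grid_x = torch.meshgrid(coords, coords, indexing="ij")
    # Expectation of (x, y) under the distribution = sub-pixel offset.
    dx = (prob * grid_x).sum(dim=(1, 2))
    dy = (prob * grid_y).sum(dim=(1, 2))
    return torch.stack([dx, dy], dim=-1)
```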
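Finally, a hedged sketch of the supervision signal: the ground-truth homography warps matched coordinates from one image into the other, and the Euclidean distance between the warped ground-truth positions and the predicted positions is the feedback loss applied at both the coarse and the fine stage. The function names and loss weights are hypothetical.

```python
# Hypothetical sketch of the homography-supervised Euclidean loss.
import torch

def warp_points(pts: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Apply a 3x3 homography H to (N, 2) pixel coordinates."""
    ones = torch.ones(pts.shape[0], 1, device=pts.device, dtype=pts.dtype)
    homog = torch.cat([pts, ones], dim=-1)      # (N, 3) homogeneous coords
    warped = homog @ H.T                        # (N, 3)
    return warped[:, :2] / warped[:, 2:3]       # perspective divide

def matching_loss(pts_a, coarse_b, fine_b, H_gt, w_coarse=1.0, w_fine=1.0):
    """Euclidean distance between predictions in image B and the positions
    obtained by warping the image-A points with the ground-truth homography."""
    gt_b = warp_points(pts_a, H_gt)
    loss_coarse = (coarse_b - gt_b).norm(dim=-1).mean()
    loss_fine = (fine_b - gt_b).norm(dim=-1).mean()
    return w_coarse * loss_coarse + w_fine * loss_fine
```

Because both stages share the same ground truth, the coarse term anchors the initial positions while the fine term drives the sub-pixel accuracy.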

Key words: Scene matching, Attention mechanism, Visual localization, Deep learning

CLC Number: TN967.2
[1]JIANG X Y,MA J Y,XIAO G B,et al.A review of multimodal image matching:Methods and applications[J].Information Fusion,2021,73:22-71.
[2]LENG C C,ZHANG H,LI B,et al.Local feature descriptor for image matching:A survey[J].IEEE Access,2018,7:6424-6434.
[3]MIAN A S,BENNAMOUN M,OWENS R,et al.Keypoint detection and local feature matching for textured 3D face recognition[J].International Journal of Computer Vision,2008,79(1):1-12.
[4]LI J,ALLINSON N M.A comprehensive review of current local features for computer vision[J].Neurocomputing,2008,71(10/11/12):1771-1787.
[5]CHEN L,ROTTENSTEINER F,HEIPKE C,et al.Feature detection and description for image matching:from hand-crafted design to deep learning[J].Geo-spatial Information Science,2021,24(1):58-74.
[6]SARLIN P E,DETONE D,MALISIEWICZ T,et al.Superglue:Learning feature matching with graph neural networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:4938-4947.
[7]RONNEBERGER O,FISCHER P,BROX T,et al.U-net:Convolutional networks for biomedical image segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention.Springer,2015:234-241.
[8]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[J].arXiv:1706.03762,2017.
[9]SUN J,SHEN Z,WANG Y,et al.LoFTR:Detector-free local feature matching with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:8922-8931.
[10]LIN T Y,DOLLÁR P,GIRSHICK R,et al.Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:2117-2125.
[11]LOWE D G.Distinctive image features from scale-invariant keypoints[J].International Journal of Computer Vision,2004,60(2):91-110.
[12]BAY H,TUYTELAARS T,VAN GOOL L.Surf:Speeded up robust features[C]//European Conference on Computer Vision.Springer,2006:404-417.
[13]SILPA-ANAN C,HARTLEY R.Optimised KD-trees for fast image descriptor matching[C]//2008 IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2008:1-8.
[14]CALONDER M,LEPETIT V,STRECHA C,et al.BRIEF:Binary robust independent elementary features[C]//European Conference on Computer Vision.Springer,2010:778-792.
[15]VISWANATHAN D G.Features from accelerated segment test (fast)[C]//Proceedings of the 10th Workshop on Image Analysis for Multimedia Interactive Services.London,UK,2009:6-8.
[16]RUBLEE E,RABAUD V,KONOLIGE K,et al.ORB:An efficient alternative to SIFT or SURF[C]//International Conference on Computer Vision.IEEE,2011:2564-2571.
[17]DETONE D,MALISIEWICZ T,RABINOVICH A,et al.Superpoint:Self-supervised interest point detection and description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.2018:224-236.
[18]BARROSO-LAGUNA A,RIBA E,PONSA D,et al.Key.net:Keypoint detection by handcrafted and learned cnn filters[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:5836-5844.
[19]SHEN X L,WANG C,LI X,et al.Rf-net:An end-to-end image matching network based on receptive field[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:8132-8140.
[20]LEE J,KIM B,CHO M S,et al.Self-Supervised Equivariant Learning for Oriented Keypoint Detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2022:4847-4857.
[21]DUSMANU M,ROCCO I,PAJDLA T,et al.D2-net:A trainable cnn for joint description and detection of local features[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:8092-8101.
[22]ONO Y,TRULLS E,FUA P,et al.LF-Net:Learning local features from images[C]//Advances in Neural Information Processing Systems.2018.
[23]REVAUD J,WEINZAEPFEL P,DE SOUZA C,et al.R2D2:repeatable and reliable detector and descriptor[J].arXiv:1906.06195,2019.
[24]YIN J,LIU Q,MENG F,et al.STCDesc:Learning deep local descriptor using similar triangle constraint[J].Knowledge-Based Systems,2022,248.
[25]TIAN Y,FAN B,WU F.L2-Net:Deep Learning of Discriminative Patch Descriptor in Euclidean Space[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2017.
[26]MISHCHUK A,MISHKIN D,RADENOVIC F,et al.Working hard to know your neighbor’s margins:Local descriptor learning loss[C]//Advances in Neural Information Processing Systems.2017.
[27]DANG Z,DENG C,YANG X,et al.Nearest neighbor matching for deep clustering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:13693-13702.
[28]FISCHLER M A,BOLLES R C.Random sample consensus:a paradigm for model fitting with applications to image analysis and automated cartography[J].Communications of the ACM,1981,24(6):381-395.
[29]WANG Q,ZHANG J,YANG K,et al.MatchFormer:Interleaving Attention in Transformers for Feature Matching[J].arXiv:2203.09645,2022.