Computer Science ›› 2024, Vol. 51 ›› Issue (8): 133-142. doi: 10.11896/jsjkx.230700207

• Computer Graphics & Multimedia •

Scene Segmentation Model Based on Dual Learning

LIU Sichun, WANG Xiaoping, PEI Xilong, LUO Hangyu   

  1. School of Electronics and Information Engineering,Tongji University,Shanghai 200092,China
  • Received:2023-07-27 Revised:2023-11-22 Online:2024-08-15 Published:2024-08-13
  • Corresponding author:WANG Xiaoping(xpwang6510@tongji.edu.cn)
  • About author:LIU Sichun(sichun_liu@tongji.edu.cn),born in 1998,postgraduate.Her main research interests include deep learning and computer vision.
    WANG Xiaoping,born in 1965,Ph.D,professor.His main research interests include AI algorithms,deep learning and computer vision.
  • Supported by:
    National Key Research and Development Program of China(2022YFB4300504-4).

Abstract: Complex tasks such as urban scene segmentation suffer from low utilization of spatial information in feature maps,imprecise segmentation boundaries,and excessive network parameters.To address these problems,DualSeg,a scene segmentation model based on dual learning,is proposed.First,depthwise separable convolution is adopted to significantly reduce the number of model parameters.Second,accurate context information is obtained by fusing an atrous spatial pyramid pooling module with a dual attention mechanism module.Finally,dual learning is used to construct a closed-loop feedback network in which the dual relationship constrains the mapping space:the two tasks of “image scene segmentation” and “dual image reconstruction” are trained jointly,which assists the training of the segmentation model and helps it better perceive category boundaries and improve recognition ability.Experimental results show that the DualSeg model built on the Xception backbone network achieves 81.3% mIoU and 95.1% global accuracy on the natural scene segmentation dataset PASCAL VOC,and 77.4% mIoU on the Cityscapes dataset,while the number of model parameters is reduced by 18.45%,verifying the effectiveness of the model.A more effective attention mechanism will be explored in future work to further improve segmentation accuracy.
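To make the parameter-saving step concrete, here is a minimal PyTorch sketch of a depthwise separable convolution of the kind used in Xception-style backbones. It is an illustration under our own naming, not the authors' code; the dilation argument only hints at how such a layer pairs with atrous convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv.

    Parameter count: k*k*C_in + C_in*C_out, versus k*k*C_in*C_out for a
    standard convolution -- the source of the parameter savings the
    abstract reports.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # preserve spatial size
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 32, 32)
print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 32, 32])
```

For a 3x3 layer mapping 64 to 128 channels this needs 3*3*64 + 64*128 = 8 768 weights instead of 3*3*64*128 = 73 728, roughly an 8.4x reduction for that layer.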

Key words: Scene segmentation, Image reconstruction, Dual learning, Attention mechanism, Depthwise separable convolution, Multi-level feature fusion
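The dual-learning and image-reconstruction keywords correspond to the closed-loop training the abstract describes. The sketch below shows one way such a joint objective can be wired: a segmentation network is trained alongside a dual reconstruction network that maps predictions back to the image. The placeholder networks and the loss weight lam are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's networks: seg_net maps an image to
# per-pixel class logits; rec_net maps the predicted segmentation back to an
# image (the dual task that closes the loop).
seg_net = nn.Conv2d(3, 21, kernel_size=1)   # placeholder segmentation model
rec_net = nn.Conv2d(21, 3, kernel_size=1)   # placeholder reconstruction model

seg_loss_fn = nn.CrossEntropyLoss()         # primal task: scene segmentation
rec_loss_fn = nn.L1Loss()                   # dual task: image reconstruction

def dual_learning_step(image, label, lam=0.1):
    """One closed-loop step: segment, reconstruct, combine both losses.

    lam weights the duality constraint; 0.1 is an assumed value, not taken
    from the paper.
    """
    logits = seg_net(image)
    probs = logits.softmax(dim=1)
    recon = rec_net(probs)                  # close the feedback loop
    return seg_loss_fn(logits, label) + lam * rec_loss_fn(recon, image)

img = torch.randn(2, 3, 64, 64)
lbl = torch.randint(0, 21, (2, 64, 64))
print(dual_learning_step(img, lbl).item())
```

Because the reconstruction loss is differentiable through the segmentation probabilities, its gradients flow back into the segmentation network, which is how the dual task can sharpen category boundaries during training.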

CLC Number: TP391