Computer Science, 2023, Vol. 50, Issue (9): 220-226. doi: 10.11896/jsjkx.220900256
LI Xiang (李祥)1, FAN Zhiguang (范志广)2, LIN Nan (林楠)1, CAO Yangjie (曹仰杰)1, LI Xuexiang (李学相)1
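Abstract: In recent years, visual question answering has become one of the research hotspots in computer vision. Most current work addresses question answering over 2D images, but 2D images suffer from spatial ambiguity introduced by viewpoint changes, occlusion, and reprojection. Real-world human-computer interaction usually takes place in 3D scenes, so 3D question answering has greater practical value. Existing 3D question answering algorithms can perceive 3D objects and their spatial relationships and can answer semantically complex questions. However, a 3D scene represented as a point cloud and a natural-language question are data of two different modalities. The gap between the two modalities is pronounced, they are difficult to align, and their latent correlated features are easily overlooked. To address this problem, a self-supervised learning based method for question answering in real-world 3D scenes, 3DSSQA, is proposed. The method introduces contrastive learning into a 3D question answering model for the first time: 3D cross-modal contrastive learning aligns the 3D scene with the question, narrows the heterogeneity gap between the two modalities, and mines their correlated features. In addition, a deep interactive attention network processes the 3D scene and the question so that objects in the scene and keywords in the question interact fully. Extensive experiments on the ScanQA dataset show that 3DSSQA reaches an accuracy of 24.3% on the main metric, EM@1, surpassing the current state-of-the-art models.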
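The cross-modal alignment idea in the abstract can be illustrated with a small sketch. The abstract does not state which contrastive objective 3DSSQA uses, so the code below assumes a symmetric InfoNCE-style loss, a common choice for aligning two modalities. The function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_modal_info_nce(scene_feats, question_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption: scene_feats[i] and question_feats[i] form a matching
    (positive) pair; every other combination in the batch is a negative.
    """
    assert scene_feats and len(scene_feats) == len(question_feats)
    s = [l2_normalize(v) for v in scene_feats]
    q = [l2_normalize(v) for v in question_feats]
    n = len(s)

    # sim[i][j]: cosine similarity of scene i and question j, divided by temperature.
    sim = [
        [sum(a * b for a, b in zip(s[i], q[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def softmax_cross_entropy(logits, target):
        # Numerically stable -log(softmax(logits)[target]).
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_sum - logits[target]

    # Scene-to-question direction: each row of sim, positive on the diagonal.
    scene_to_question = sum(softmax_cross_entropy(sim[i], i) for i in range(n)) / n
    # Question-to-scene direction: each column of sim, positive on the diagonal.
    question_to_scene = sum(
        softmax_cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (scene_to_question + question_to_scene)


# Toy batch: well-aligned pairs should give a lower loss than mismatched pairs.
aligned = cross_modal_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
swapped = cross_modal_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

Minimizing a loss of this kind pulls each scene embedding toward its paired question embedding and pushes it away from the other questions in the batch, which is one way the heterogeneity gap between modalities can be narrowed.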