Self-supervised Learning for 3D Real-scenes Question Answering

doi:10.11896/jsjkx.220900256

Computer Science, 2023, Vol. 50, Issue 9: 220-226. doi: 10.11896/jsjkx.220900256

• Database & Big Data & Data Science •

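LI Xiang¹, FAN Zhiguang², LIN Nan¹, CAO Yangjie¹, LI Xuexiang¹

1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450000, China
2 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China

Received: 2022-09-28 Revised: 2023-03-28 Online: 2023-09-15 Published: 2023-09-01
Corresponding author: LI Xuexiang (lxx@zzu.edu.cn)
About author: (lixiang.zg@qq.com)
Supported by:
General Project of the National Natural Science Foundation of China (61972092); Collaborative Innovation Major Project of Zhengzhou (20XTZX06013)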

About author: LI Xiang, born in 1997, is a postgraduate student. His main research interests include visual question answering.
LI Xuexiang, born in 1965, is a professor and master's supervisor. His main research interests include high-performance computing and cloud computing.

Abstract

Visual question answering (VQA) has become a research hotspot in computer vision in recent years. Most current research addresses question answering on 2D images, which suffer from spatial ambiguity introduced by viewpoint changes, occlusion, and reprojection. In practice, human-computer interaction usually takes place in 3D scenes, so 3D question answering has greater practical value. Existing 3D question answering algorithms can perceive 3D objects and their spatial relationships and can answer complex questions. However, a 3D scene represented as a point cloud and a natural-language question belong to two different modalities, which differ markedly and are difficult to align, so their latent related features are easily overlooked. To address this problem, this paper proposes a self-supervised learning-based question answering method for real 3D scenes, called 3D self-supervised question answering (3DSSQA). 3DSSQA introduces contrastive learning into a 3D question answering model for the first time: a 3D cross-modal contrastive learning (3DCMCL) model first aligns point-cloud data with question data globally to narrow the heterogeneity gap between the two modalities, and then mines their related features. In addition, a deep interactive attention (DIA) network is adapted to align 3D objects with question keywords at a finer granularity, enabling sufficient interaction between them. Extensive experiments on the ScanQA dataset show that 3DSSQA achieves an accuracy of 24.3% on the main metric, EM@1, surpassing state-of-the-art models.

Key words: 3D question answering, Self-supervised learning, Contrastive learning, Point clouds, Deep interactive attention
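
The abstract does not give the exact form of the 3D cross-modal contrastive objective. As an illustration only, the sketch below shows one common way to align two modalities globally: a symmetric InfoNCE loss over a batch of paired scene and question embeddings. The function names, the temperature value of 0.07, and the toy 3-dimensional embeddings are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def cross_modal_contrastive_loss(scene_embs, question_embs, temperature=0.07):
    """Symmetric InfoNCE over paired (scene, question) embeddings.

    Pair i is the positive; every other item in the batch is a negative.
    The temperature default is an assumed value, not one from the paper.
    """
    scenes = [l2_normalize(v) for v in scene_embs]
    questions = [l2_normalize(v) for v in question_embs]
    n = len(scenes)
    # logits[i][j]: temperature-scaled cosine similarity of scene i and question j
    logits = [
        [sum(a * b for a, b in zip(s, q)) / temperature for q in questions]
        for s in scenes
    ]
    # Scene -> question: each row should peak on its own column.
    s2q = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Question -> scene: each column should peak on its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    q2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (s2q + q2s)


# Toy global embeddings (hypothetical): row i of each list describes the same sample.
scene_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
question_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

aligned = cross_modal_contrastive_loss(scene_embs, question_embs)
# Rotating the questions breaks every pairing, so the loss should rise.
shuffled = cross_modal_contrastive_loss(
    scene_embs, question_embs[1:] + question_embs[:1]
)
print(aligned < shuffled)
```

Minimizing a loss of this kind pulls matched scene and question embeddings together and pushes mismatched ones apart, which is one way to narrow the heterogeneity gap the abstract describes. In a real model the embeddings would come from a point-cloud encoder and a text encoder rather than from hand-written lists.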

CLC Number: TP181

LI Xiang, FAN Zhiguang, LIN Nan, CAO Yangjie, LI Xuexiang. Self-supervised Learning for 3D Real-scenes Question Answering[J]. Computer Science, 2023, 50(9): 220-226. https://doi.org/10.11896/jsjkx.220900256

