Computer Science, 2023, Vol. 50, Issue (9): 220-226. doi: 10.11896/jsjkx.220900256

• Database & Big Data & Data Science •

Self-supervised Learning for 3D Real-scenes Question Answering

LI Xiang1, FAN Zhiguang2, LIN Nan1, CAO Yangjie1, LI Xuexiang1   

  1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450000, China
  2 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
  • Received: 2022-09-28 Revised: 2023-03-28 Online: 2023-09-15 Published: 2023-09-01
  • About author: LI Xiang, born in 1997, postgraduate. His main research interests include visual question answering, among others.
    LI Xuexiang, born in 1965, professor, master supervisor. His main research interests include high performance computing and cloud computing.
  • Supported by:
    General Project of the National Natural Science Foundation of China (61972092) and Collaborative Innovation Major Project of Zhengzhou (20XTZX06013).

Abstract: Visual question answering (VQA) has become a research hotspot in recent years. Most current question-answering research is based on 2D images and often suffers from spatial ambiguity introduced by viewpoint changes, occlusion, and reprojection. In practice, human-computer interaction scenarios are often three-dimensional, which creates a demand for question answering based on 3D scenes. Existing 3D question answering algorithms can perceive 3D objects and their spatial relationships and can answer complex questions. However, the point clouds that represent 3D scenes and the target questions belong to two different modalities that are extremely difficult to align, so their subtle related features are easily overlooked. To address this problem, this paper proposes a novel learning-based question answering method for realistic 3D scenes, called 3D self-supervised question answering (3DSSQA). Within 3DSSQA, a 3D cross-modal contrastive learning model (3DCMCL) first aligns point-cloud data with question data globally to reduce the heterogeneity gap between the modalities, and then mines the related features between the two. In addition, a deep interactive attention (DIA) network is adapted to align 3D objects with keywords at a finer granularity, facilitating sufficient interaction between them. Extensive experiments on the ScanQA dataset demonstrate that 3DSSQA achieves an accuracy of 24.3% on the main EM@1 metric, notably surpassing state-of-the-art models.
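
The global alignment step can be illustrated with a generic cross-modal contrastive objective. The abstract does not give the exact loss used by 3DCMCL, so the following is only a minimal sketch assuming a symmetric InfoNCE-style loss over matched (scene, question) pairs in a batch. The function names, the temperature value of 0.07, and the toy 2-D features are all hypothetical choices for illustration, not details from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy of picking candidates[i] for anchors[i].

    Every other candidate in the batch acts as a negative.
    The temperature is an assumed hyperparameter, not taken from the paper.
    """
    total = 0.0
    for i, a in enumerate(anchors):
        # Scaled dot-product similarity to every candidate in the batch.
        logits = [sum(x * y for x, y in zip(a, c)) / temperature
                  for c in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(scene_feats, question_feats, temperature=0.07):
    """Average of scene-to-question and question-to-scene InfoNCE terms."""
    s = [l2_normalize(v) for v in scene_feats]
    q = [l2_normalize(v) for v in question_feats]
    return 0.5 * (info_nce(s, q, temperature) + info_nce(q, s, temperature))


# Toy global features: row i of each list belongs to the same pair.
scenes = [[1.0, 0.0], [0.0, 1.0]]
aligned_questions = [[0.9, 0.1], [0.1, 0.9]]
swapped_questions = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(scenes, aligned_questions)
loss_swapped = symmetric_contrastive_loss(scenes, swapped_questions)
```

In this toy setup the loss is lower when each question feature lies close to its own scene feature than when the pairs are swapped, which is the behavior a global alignment objective of this kind is meant to reward.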

Key words: 3D question answering, Self-supervised learning, Contrastive learning, Point clouds, Deep interactive attention

CLC Number: 

  • TP181