Design of Visual Context-driven Interactive Bot System
(视觉情境感知驱动的虚拟机器人交互系统)

Computer Science, 2023, Vol. 50, Issue (9): 260-268. doi: 10.11896/jsjkx.230200167

LIU Yubo, GUO Bin, MA Ke, QIU Chen, LIU Sicong

School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China

Received: 2023-02-22 Revised: 2023-07-01 Online: 2023-09-15 Published: 2023-09-01
Corresponding author: GUO Bin (guob@nwpu.edu.cn)
Author contact: redcontritio@qq.com
About author: LIU Yubo, born in 2000, postgraduate, is a member of China Computer Federation. His main research interest is multi-modal QA.
GUO Bin, born in 1980, Ph.D, professor, doctoral supervisor. His main research interests include ubiquitous computing, mobile crowd sensing, and big data intelligence.
Supported by:
National Science Fund for Distinguished Young Scholars (62025205) and National Natural Science Foundation of China (62032020, 61725205, 62102317).

Abstract

Bots are intelligent software that interact with people, and are usually characterized by real-time operation and interactivity. This paper takes bots driven by visual context awareness as its theme and explores four aspects: lightweight object detection models and their compression, real-time key frame extraction, system optimization, and interaction strategy. On this basis, it builds a bot system with strong real-time performance, high interactivity, and high scalability on resource-constrained edge platforms. Specifically, for lightweight object detection and compression, the performance and accuracy of the SSD model under different backbone networks are first examined; the SSD model based on the VGG16 network is then compressed with int8 quantization and pruning, which raises the frame rate by 187% over the original model while keeping the accuracy loss within 0.1%. For real-time key frame extraction, the input video stream is pre-screened using edge feature strength and HOG features to reduce system load, which is equivalent to reducing inference latency by 90%. For system optimization, adopting microservices reduces cold-start latency by about 98%. For interaction strategy, a state machine with a timer models the context to achieve context-driven behavior, and the output of human-computer interaction is delivered as speech.

Key words: Resource-constrained, Lightweight model, Model compression, Object detection, Context-driven
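
The key frame pre-screening mentioned in the abstract can be illustrated with a minimal sketch. It covers only the edge-strength part (the HOG comparison is omitted), and it is not the authors' implementation: the Sobel-based `edge_strength` measure, the `select_key_frames` rule, and the `rel_threshold` value are all assumptions made for illustration. Frames are plain nested lists of grayscale intensities so that the example needs no third-party library.

```python
from typing import List, Optional, Sequence

Frame = Sequence[Sequence[float]]  # grayscale image as rows of pixel intensities


def edge_strength(frame: Frame) -> float:
    """Mean Sobel gradient magnitude over the interior pixels (assumed measure)."""
    h, w = len(frame), len(frame[0])
    if h < 3 or w < 3:
        return 0.0
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal and vertical Sobel responses at (y, x).
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            total += (gx * gx + gy * gy) ** 0.5
    return total / ((h - 2) * (w - 2))


def select_key_frames(frames: Sequence[Frame], rel_threshold: float = 0.2) -> List[int]:
    """Return indices of frames whose edge strength changed enough since the last key frame.

    rel_threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    keys: List[int] = []
    last: Optional[float] = None
    for i, frame in enumerate(frames):
        s = edge_strength(frame)
        # The first frame is always kept; later frames need a relative change above the threshold.
        if last is None or abs(s - last) > rel_threshold * max(last, 1e-9):
            keys.append(i)
            last = s
    return keys
```

Only the frames returned by `select_key_frames` would then be passed to the object detector, which is how pre-screening lowers the effective inference load.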

CLC number: TP391
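
The interaction strategy in the abstract models context with a state machine that carries a timer. The sketch below shows one way such a machine could look; it is an assumption-laden illustration, not the system described in the paper. The class name `TimedContextMachine`, the two states `absent` and `present`, the dwell time, and the utterance strings are all hypothetical. The returned text stands in for the speech output; speech synthesis itself is not shown.

```python
import time
from typing import Callable, Optional


class TimedContextMachine:
    """Two-state context model ('absent' / 'present') with a dwell timer.

    State names, dwell time, and utterances are hypothetical examples.
    """

    def __init__(self, dwell_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.dwell_seconds = dwell_seconds
        self.clock = clock  # injectable so the timer can be driven in tests
        self.state = "absent"
        self._seen_since: Optional[float] = None  # time the target was first seen

    def update(self, target_detected: bool) -> Optional[str]:
        """Feed one detection result; return an utterance when the context changes."""
        now = self.clock()
        if not target_detected:
            # Losing the target resets the timer and returns to 'absent'.
            self._seen_since = None
            if self.state == "present":
                self.state = "absent"
                return "Goodbye."
            return None
        if self._seen_since is None:
            self._seen_since = now
        # Switch to 'present' only after the target has persisted long enough.
        if self.state == "absent" and now - self._seen_since >= self.dwell_seconds:
            self.state = "present"
            return "Hello, I can see you."
        return None
```

The dwell timer keeps a single spurious detection from triggering speech: the machine reacts only when the detected context has held for `dwell_seconds`.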

LIU Yubo, GUO Bin, MA Ke, QIU Chen, LIU Sicong. Design of Visual Context-driven Interactive Bot System[J]. Computer Science, 2023, 50(9): 260-268. https://doi.org/10.11896/jsjkx.230200167
