Computer Science ›› 2020, Vol. 47 ›› Issue (8): 233-240. doi: 10.11896/jsjkx.190600109

• Computer Graphics & Multimedia •


Image Caption of Safety Helmets Wearing in Construction Scene Based on YOLOv3

XU Shou-kun, NI Chu-han, JI Chen-chen, LI Ning   

  1. School of Information Science and Engineering, Changzhou University, Changzhou, Jiangsu 213164, China
  • Revised: 2019-06-19 Online: 2020-08-15 Published: 2020-08-10
  • Corresponding author: XU Shou-kun (17000210@smail.cczu.edu.cn)
  • About author: XU Shou-kun, born in 1972, Ph.D, professor, is a member of China Computer Federation. His main research interests include artificial intelligence and pervasive computing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61803050).



Abstract: In recent years, construction accidents caused by workers not wearing safety helmets have occurred frequently. To reduce the accident rate, this paper studies the image captioning of workers' safety helmet wearing. Existing neural-network-based image caption methods lack interpretability and sufficient descriptive detail, and research on image captioning for construction scenes is relatively scarce. To address this problem, this paper combines the YOLOv3 detection algorithm with a method based on semantic rules and sentence templates to progressively generate descriptions of safety helmet wearing. Firstly, images are collected to build a safety helmet wearing detection dataset and an image caption dataset. Secondly, the K-means algorithm is used to determine anchor box parameters suited to the dataset, which are then used for training and detection with the YOLOv3 network. Thirdly, a semantic rule is predefined and combined with the detection results to extract visual concepts. Finally, the extracted visual concepts are filled into sentence templates generated from the image caption annotations, producing description sentences about whether the workers in the construction scene are wearing safety helmets. The experimental environment is built with Ubuntu 16.04 and the Keras deep learning framework, and different algorithms are compared on the self-made safety helmet wearing dataset. The experimental results show that the proposed method not only effectively determines the numbers of safety helmet wearers and non-wearers, but also achieves scores of 0.722 and 0.957 on the BLEU-1 and CIDEr evaluation metrics, which are 6.9% and 14.8% higher than those of other methods, demonstrating the effectiveness and superiority of the proposed method.
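The anchor box step follows the dimension-clustering idea introduced with YOLOv2 and carried over to YOLOv3: k-means over the ground-truth box shapes using a 1 - IoU distance rather than Euclidean distance, so that large boxes do not dominate the clustering. The sketch below is a minimal NumPy version of that procedure, assuming (width, height) pairs as input; the median centroid update and k = 9 reflect common YOLOv3 practice, not settings confirmed by the paper.

    import numpy as np

    def iou_wh(box, clusters):
        # IoU between one (w, h) pair and every centroid, with all boxes
        # anchored at the same corner, so only shapes matter.
        inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
        union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
        return inter / union

    def kmeans_anchors(boxes, k=9, seed=0):
        # boxes: (N, 2) array of ground-truth (width, height) pairs.
        boxes = np.asarray(boxes, dtype=float)
        rng = np.random.default_rng(seed)
        clusters = boxes[rng.choice(len(boxes), size=k, replace=False)]
        prev = np.full(len(boxes), -1)
        while True:
            # Assign each box to the centroid with the highest IoU,
            # i.e. the smallest d = 1 - IoU.
            nearest = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
            if (nearest == prev).all():
                break
            # Move each centroid to the median shape of its members.
            for i in range(k):
                members = boxes[nearest == i]
                if len(members):
                    clusters[i] = np.median(members, axis=0)
            prev = nearest
        # Sort by area so the anchors can be split across detection scales.
        return clusters[np.argsort(clusters[:, 0] * clusters[:, 1])]

The rule-and-template stage can be illustrated the same way. In the sketch below the class labels ("helmet", "no_helmet") and the template wording are hypothetical stand-ins: the paper derives its templates from the image caption annotations, while this fragment only shows how counted visual concepts are slotted into fixed sentence positions.

    def describe(detected_labels):
        # detected_labels: class names emitted by the trained detector for one
        # image; "helmet" and "no_helmet" are illustrative label names.
        wearing = detected_labels.count("helmet")
        bare = detected_labels.count("no_helmet")
        parts = []
        if wearing:
            parts.append(f"{wearing} worker(s) wearing safety helmets")
        if bare:
            parts.append(f"{bare} worker(s) not wearing safety helmets")
        if not parts:
            return "No workers are detected in the construction scene."
        # Fill the counted concepts into a fixed sentence template.
        return "In the construction scene there are " + " and ".join(parts) + "."

    print(describe(["helmet", "helmet", "no_helmet"]))
    # In the construction scene there are 2 worker(s) wearing safety helmets
    # and 1 worker(s) not wearing safety helmets.

Separating detection from sentence generation is what keeps the output interpretable: every phrase in the generated sentence traces back to a counted detection rather than to an opaque decoder state.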

Key words: Image caption method, K-means clustering algorithm, Safety helmet wearing, Semantic rules, Sentence template, YOLOv3 network

CLC Number: TP391