Computer Science (计算机科学), 2021, Vol. 48, Issue 1: 167-174. doi: 10.11896/jsjkx.200800198

• Computer Graphics & Multimedia •


Multi-label Video Classification Assisted by Danmaku

CHEN Jie-ting, WANG Wei-ying, JIN Qin   

  1. School of Information,Renmin University of China,Beijing 100872,China
  • Received: 2020-08-29  Revised: 2020-10-05  Online: 2021-01-15  Published: 2021-01-15
  • Corresponding author: JIN Qin (qjin@ruc.edu.cn)
  • First author: CHEN Jie-ting (jietingchen@ruc.edu.cn)
  • About author: CHEN Jie-ting, born in 1997, postgraduate, is a member of China Computer Federation. Her main research interests include multimedia computing.
    JIN Qin, born in 1972, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. Her main research interests include multimedia computing and human-computer interaction.
  • Supported by:
    National Natural Science Foundation of China(61772535),Beijing Municipal Natural Science Foundation(4192028) and National Key Research and Development Plan(2016YFB1001202).

Abstract: This work explores multi-label video classification assisted by danmaku. Multi-label video classification associates multiple tags with a video from different aspects, which benefits applications such as video recommendation. The task faces two main challenges: the high annotation cost of building multi-label video datasets, and the need to understand video content from multiple aspects and modalities. Danmaku is a recently popular form of online commenting. Because user engagement is high, videos on danmaku websites carry a large number of tags added spontaneously by users, which serve as natural multi-label data. This work collects a multi-label danmaku video dataset and, for the first time on danmaku video data, builds a hierarchical label correlation structure; the dataset will be released in the future. Danmaku text also contains abundant fine-grained information related to the video content, so this paper introduces the danmaku text modality on top of previous work that fuses only the visual and audio modalities. The cluster-based NeXtVLAD model, the attention-based Dbof model, and the temporal GRU model are used as baselines. Experiments show that danmaku data is helpful, improving GAP by up to 0.23. This paper also explores the use of label correlation, transforming the video labels with a label relation matrix so that label semantics are integrated into training. Experiments show that leveraging label correlation improves Hit@1 by 0.15. In addition, MAP on fine-grained small classes improves by 0.04, indicating that label semantic information benefits the prediction of classes with few samples and is worth further study.
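As a rough illustration of the multimodal fusion described above, the sketch below shows one way frame-level visual, audio, and danmaku text features could be concatenated and fed to a GRU-based multi-label classifier. The feature dimensions, the simple concatenation fusion, and the class count are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DanmakuFusionGRU(nn.Module):
    """Minimal sketch: a temporal (GRU) multi-label classifier that adds a
    danmaku text modality to the usual visual + audio inputs. Dimensions and
    the concatenation-based fusion are assumptions for illustration."""

    def __init__(self, vis_dim=1024, aud_dim=128, text_dim=300,
                 hidden=512, num_classes=200):
        super().__init__()
        self.gru = nn.GRU(vis_dim + aud_dim + text_dim, hidden,
                          batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, vis, aud, text):
        # vis:  (batch, frames, vis_dim)   frame-level visual features
        # aud:  (batch, frames, aud_dim)   frame-level audio features
        # text: (batch, frames, text_dim)  danmaku embeddings aligned to frames
        x = torch.cat([vis, aud, text], dim=-1)       # early fusion by concatenation
        _, h = self.gru(x)                            # h: (num_layers, batch, hidden)
        return torch.sigmoid(self.classifier(h[-1]))  # independent per-label scores
```

A binary cross-entropy loss over the sigmoid outputs would then treat each label independently, which is the standard setup for multi-label video classification.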

Key words: Classification, Danmaku, Label correlation, Multi-label, Multi-modal, Video
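The abstract also mentions transforming the ground-truth labels with a label relation matrix so that hierarchical label semantics enter training. A minimal sketch of that idea, with a hypothetical tag hierarchy and propagation weight (neither taken from the paper), might look like this:

```python
import numpy as np

# Hypothetical fine-grained tags and their coarse parent categories;
# the paper builds its hierarchy from danmaku-site video tags instead.
labels = ["dance", "street_dance", "ballet", "music", "piano"]
parent = {"street_dance": "dance", "ballet": "dance", "piano": "music"}
index = {name: i for i, name in enumerate(labels)}

# Label relation matrix R: the diagonal keeps each label itself, and
# R[child, parent] = alpha propagates part of the label mass upward.
alpha = 0.3  # hypothetical propagation weight
R = np.eye(len(labels))
for child, par in parent.items():
    R[index[child], index[par]] = alpha

# Original binary multi-label target for one video: {street_dance, piano}.
y = np.zeros(len(labels))
y[[index["street_dance"], index["piano"]]] = 1.0

# Softened target: parent classes now receive partial positive weight,
# so the classifier is also trained toward the coarser semantics.
y_soft = np.clip(y @ R, 0.0, 1.0)
print({name: float(v) for name, v in zip(labels, y_soft)})
# expected: dance and music receive partial credit (~0.3 each)
```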

CLC Number: 

  • TP399
[1] LIN R,XIAO J,FAN J.NeXtVLAD:An efficient neural network to aggregate frame-level features for large-scale video classification[C]//Proceedings of the European Conference on Computer Vision (ECCV).Munich,Germany,2018.
[2] GARG S.Learning video features for multi-label classification[C]//Proceedings of the European Conference on Computer Vision (ECCV).Munich,Germany,2018.
[3] ABU-EL-HAIJA S,KOTHARI N,LEE J,et al.Youtube-8m:A large-scale video classification benchmark[J].arXiv:1609.08675.
[4] CHO K,VAN MERRIENBOER B,BAHDANAU D,et al.On the properties of neural machine translation:Encoder-decoder approaches[J].arXiv:1409.1259.
[5] LEE J,NATSEV A,READE W,et al.The 2nd YouTube-8M Large-Scale Video Understanding Challenge[C]//Proceedings of the European Conference on Computer Vision (ECCV).Munich,Germany,2018:193-205.
[6] YANG W,RUAN N,GAO W,et al.Crowdsourced time-sync video tagging using semantic association graph[C]//2017 IEEE International Conference on Multimedia and Expo (ICME).Hong Kong,China,2017:547-552.
[7] LIAO Z,XIAN Y,YANG X,et al.TSCSet:A crowdsourced time-sync comment dataset for exploration of user experience improvement[C]//23rd International Conference on Intelligent User Interfaces.Tokyo,Japan,2018:641-652.
[8] BAI Q,HU Q V,GE L,et al.Stories That Big Danmaku Data Can Tell as a New Media[J].IEEE Access,2019,7:53509-53519.
[9] MA S,CUI L,DAI D,et al.Livebot:Generating live video comments based on visual and textual contexts[C]//Proceedings of the AAAI Conference on Artificial Intelligence.Hilton Hawaiian Village,Honolulu,Hawaii,USA,2019,33:6810-6817.
[10] OLSEN D R,MOON B.Video summarization based on user interaction[C]//Proceedings of the 9th European Conference on Interactive TV and Video.Lisbon,Portugal,2011:115-122.
[11] WANG X,JIANG Y G,CHAI Z,et al.Real-time summarization of user-generated videos based on semantic recognition[C]//Proceedings of the 22nd ACM International Conference on Multimedia.Orlando,Florida,USA,2014:849-852.
[12] SÁNCHEZ J,PERRONNIN F,MENSINK T,et al.Image classification with the fisher vector:Theory and practice[J].International Journal of Computer Vision,2013,105(3):222-245.
[13] JÉGOU H,DOUZE M,SCHMID C,et al.Aggregating local descriptors into a compact image representation[C]//2010 IEEE computer society conference on computer vision and pattern recognition.San Francisco,California,USA,2010:3304-3311.
[14] HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural computation,1997,9(8):1735-1780.
[15] MIECH A,LAPTEV I,SIVIC J.Learnable pooling with context gating for video classification[J].arXiv:1706.06905.
[16] JÉGOU H,DOUZE M,SCHMID C,et al.Aggregating local descriptors into a compact image representation[C]//2010 IEEE computer society conference on computer vision and pattern recognition.San Francisco,California,USA,2010:3304-3311.
[17] PENG H,LI J,HE Y,et al.Large-scale hierarchical text classification with recursively regularized deep graph-cnn[C]//Proceedings of the 2018 World Wide Web Conference.Lyon,France,2018:1063-1072.
[18] WANG L,CHEN S,ZHOU H.Boosting Up Segment-level Video Classification Performance with Label Correlation and Reweighting[EB/OL].https://static.googleusercontent.com/media/research.google.com/zh-CN//youtube8m/workshop2019/c_07.pdf.
[19] BANERJEE S,AKKAYA C,PEREZ-SORROSAL F,et al.Hierarchical Transfer Learning for Multi-label Text Classification[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.Fortezza da Basso,Florence,Italy,2019:6295-6300.
[20] CHEN B,HUANG X,XIAO L,et al.Hyperbolic Capsule Networks for Multi-Label Classification[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.Seattle,Washington,USA,2020:3115-3124.
[21] POUYANFAR S,WANG T,CHEN S C.Residual Attention-Based Fusion for Video Classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.Long Beach,California,USA,2019.
[22] WANG Z,KUAN K,RAVAUT M,et al.Truly multi-modal youtube-8m video classification with video,audio,and text[J].arXiv:1706.05461.
[23] HE X,PENG Y.Fine-grained image classification via combining vision and language[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Honolulu,Hawaii,USA,2017:5994-6002.
[24] Chinese Association for Artificial Intelligence,Zhihu.2017 Zhihu Kanshan Cup Machine Learning Challenge[EB/OL].https://www.biendata.xyz/competition/zhihu/.
[25] PARTALAS I,KOSMOPOULOS A,BASKIOTIS N,et al.LSHTC:A benchmark for large-scale text classification[J].arXiv:1503.08581.
[26] HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas,NV,USA,2016:770-778.