Started in January 1974 (Monthly)
Supervised and Sponsored by Chongqing Southwest Information Co., Ltd.
ISSN 1002-137X
CN 50-1075/TP
CODEN JKIEBK
    Content of Computer Graphics & Multimedia in our journal
    Overview of Person Re-identification for Complex Scenes
    ZHANG Min, YU Zeng, HAN Yun-xing, LI Tian-rui
    Computer Science    2022, 49 (10): 138-150.   DOI: 10.11896/jsjkx.211200207
    Person re-identification (Re-ID) studies the matching of specific persons across multiple disjoint cameras. To the best of our knowledge, this is the first work that takes the types of challenges Re-ID technology must overcome in complex scenes as its classification basis, sorting the Re-ID articles published during 2010-2021 into seven categories: person posture issues, occlusion issues, lighting issues, viewpoint issues, background issues, resolution issues and other open issues. This classification makes it convenient for researchers to start from actual needs and find solutions matched to their problems. Firstly, the paper reviews the research background, significance and current status of Re-ID, summarizes the mainstream Re-ID framework, counts the papers published in the three top computer vision conferences (CVPR, ICCV and ECCV), and counts the Re-ID related projects among national fund projects since 2013. Secondly, with regard to the seven types of challenges faced in complex scenarios, the existing literature is classified and analyzed in detail from two aspects: the causes of the problems and their solutions, and the mainstream methods for handling each type of challenge are summarized. Afterwards, Re-ID methods with high generalization are summarized and the difficulties of current Re-ID research are listed. Finally, the future development trend of Re-ID is discussed.
    Spatial Encoding and Multi-layer Joint Encoding Enhanced Transformer for Image Captioning
    FANG Zhong-jun, ZHANG Jing, LI Dong-dong
    Computer Science    2022, 49 (10): 151-158.   DOI: 10.11896/jsjkx.210900159
    Image captioning is one of the hot research topics in the field of computer vision. It is a cross-media data analysis task that combines computer vision and natural language processing: it describes an image by understanding its content and generating captions that are both semantically and grammatically correct. Existing image captioning methods mostly use the encoder-decoder model, but they largely ignore the relative position relationships between visual objects when extracting visual object features, and these relationships are very important for generating accurate captions. Based on this, this paper proposes a spatial encoding and multi-layer joint encoding enhanced Transformer for image captioning. In order to make better use of the position information contained in the image, this paper proposes a spatial encoding mechanism for visual objects, which converts the independent spatial information of each visual object into relative spatial relationships between visual objects, helping the model recognize how the objects are positioned with respect to one another. At the same time, in the encoder, the top encoding layer retains more semantic information that fits the image but loses part of the image's visual information. Taking this into account, this paper proposes a multi-layer joint encoding mechanism that enriches the semantic information in the top encoding layer by integrating the image feature information contained in each shallow encoding layer, so as to obtain richer semantic features that fit the image. The proposed method is evaluated with multiple metrics (BLEU, METEOR, ROUGE-L, CIDEr, etc.) on the MSCOCO dataset. Ablation experiments prove that the spatial encoding mechanism and the multi-layer joint encoding mechanism both help generate more accurate and effective image captions, and comparative experiments show that the proposed method produces accurate and effective captions and is superior to most of the latest methods.
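    The paper's code is not reproduced here, but the general idea of relative spatial encoding between detected objects can be sketched. The PyTorch snippet below follows the common geometry-aware attention formulation rather than the authors' exact design: it turns per-object bounding boxes into pairwise, translation- and scale-invariant geometry features that can be projected into a bias on the attention logits.

        import torch

        def relative_spatial_encoding(boxes):
            """Pairwise relative geometry between visual objects.

            boxes: (N, 4) tensor of (x_center, y_center, w, h) per object.
            Returns an (N, N, 4) tensor of relative position features;
            an illustrative sketch, not the paper's exact encoding.
            """
            x, y, w, h = boxes.unbind(-1)                  # each (N,)
            dx = (x[None, :] - x[:, None]) / w[:, None]    # relative x offset
            dy = (y[None, :] - y[:, None]) / h[:, None]    # relative y offset
            dw = torch.log(w[None, :] / w[:, None])        # relative width
            dh = torch.log(h[None, :] / h[:, None])        # relative height
            return torch.stack(
                (torch.log(dx.abs().clamp(min=1e-3)),
                 torch.log(dy.abs().clamp(min=1e-3)),
                 dw, dh), dim=-1)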
    Robust Hash Learning Method Based on Dual-teacher Self-supervised Distillation
    MIAO Zhuang, WANG Ya-peng, LI Yang, WANG Jia-bao, ZHANG Rui, ZHAO Xin-xin
    Computer Science    2022, 49 (10): 159-168.   DOI: 10.11896/jsjkx.210800050
    In order to improve the performance of unsupervised hash learning and achieve robust hash-based image retrieval, this paper proposes a novel robust hash learning method based on dual-teacher self-supervised distillation. Specifically, the proposed method contains two stages: a self-supervised dual-teacher learning stage and a robust hash learning stage. In the first stage, a modified clustering algorithm is designed to effectively improve the accuracy of hard pseudo labels, and the teacher networks are fine-tuned on these hard pseudo labels to obtain the initial soft pseudo labels. In the second stage, the initial soft pseudo labels are filtered by a soft pseudo label denoising method that combines a hybrid denoising strategy with a dual-teacher denoising strategy, and the student network is then trained on the denoised soft pseudo labels by knowledge distillation, yielding robust hash codes for label-free images. Extensive experiments on the CIFAR-10, FLICKR25K and EuroSAT datasets show that the proposed method outperforms state-of-the-art methods: its mAP is 18.6% higher than that of TBH on CIFAR-10, 2.4% higher than that of DistillHash on FLICKR25K, and 18.5% higher than that of ETE-GAN on EuroSAT.
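    As a rough illustration of how two teachers can denoise soft pseudo labels, the sketch below keeps only the samples on which the teachers' predictions agree most. The agreement measure (symmetric KL divergence) and the keep ratio are illustrative assumptions, not the paper's exact hybrid/dual-teacher rule.

        import torch
        import torch.nn.functional as F

        def denoise_soft_labels(p_teacher1, p_teacher2, keep_ratio=0.8):
            """Retain the samples on which the two teachers agree most.

            p_teacher1, p_teacher2: (B, C) softmax outputs of the two
            fine-tuned teachers on the same unlabeled batch. Returns the
            averaged soft labels and a boolean mask of retained samples.
            """
            # Symmetric KL as a per-sample disagreement score (assumption).
            kl = (F.kl_div(p_teacher2.log(), p_teacher1, reduction="none").sum(-1)
                  + F.kl_div(p_teacher1.log(), p_teacher2, reduction="none").sum(-1))
            k = max(1, int(keep_ratio * kl.numel()))
            threshold = kl.kthvalue(k).values      # k-th smallest disagreement
            mask = kl <= threshold
            soft_labels = 0.5 * (p_teacher1 + p_teacher2)
            return soft_labels, mask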
    Mutual Learning Knowledge Distillation Based on Multi-stage Multi-generative Adversarial Network
    HUANG Zhong-hao, YANG Xing-yao, YU Jiong, GUO Liang, LI Xiang
    Computer Science    2022, 49 (10): 169-175.   DOI: 10.11896/jsjkx.210800250
    Aiming at the problems of insufficient distillation efficiency, single-stage training, complex training processes and difficult convergence in traditional knowledge distillation methods for image classification, this paper designs a mutual learning knowledge distillation method based on multi-stage multi-generative adversarial networks (MS-MGANs). Firstly, the whole training process is divided into several stages, and teacher models of different stages are obtained to guide the student model toward better accuracy. Secondly, a layer-wise greedy strategy is introduced to replace the traditional end-to-end training mode: a layer-wise training strategy based on convolution blocks reduces the number of parameters to be optimized in each iteration and further improves the distillation efficiency of the model. Finally, a generative adversarial structure is introduced into the knowledge distillation framework, with the teacher model as the feature discriminator and the student model as the feature generator, so that the student model can better follow, or even surpass, the performance of the teacher model while continuously imitating it. The proposed method is compared with other advanced knowledge distillation methods on several public image classification datasets, and the experimental results show that it achieves better performance in image classification.
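    The adversarial part of such a framework can be sketched as follows: a small discriminator learns to tell teacher feature maps from student feature maps, while the student is trained to fool it. This is a minimal PyTorch sketch of adversarial feature distillation in general, not the authors' MS-MGANs implementation.

        import torch
        import torch.nn as nn

        class FeatureDiscriminator(nn.Module):
            """Distinguishes teacher feature maps from student feature maps."""
            def __init__(self, channels):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1))

            def forward(self, feat):
                return self.net(feat)

        def adversarial_distill_step(f_teacher, f_student, disc):
            """One step of the feature-level generator/discriminator game.

            d_loss trains the discriminator to separate the two feature
            distributions; g_loss trains the student ("generator") to make
            its features indistinguishable from the teacher's.
            """
            bce = nn.BCEWithLogitsLoss()
            real = torch.ones(f_teacher.size(0), 1)
            fake = torch.zeros(f_student.size(0), 1)
            d_loss = (bce(disc(f_teacher.detach()), real)
                      + bce(disc(f_student.detach()), fake))
            g_loss = bce(disc(f_student), real)
            return d_loss, g_loss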
    Study on 3D Motion-in-Depth Perception Based on Binocular Vision
    LU Ping, ZHANG Di, XIAO Jun-feng, BI Ke
    Computer Science    2022, 49 (10): 176-182.   DOI: 10.11896/jsjkx.220500265
    Obtaining stereoscopic information is one of the basic human abilities for perceiving the world. Through stereo vision, we can judge the shape, size, distance and relative position of objects, as well as the direction and speed of object motion. The perception of moving objects plays an important role in stereo vision: acquiring motion information is not only a key ability of biological vision systems surviving in a dynamic world, but also an important means for artificial vision systems to process stereoscopic video efficiently. Therefore, in order to design a 3D motion-in-depth perception model that conforms to the visual characteristics of human eyes, the salient features of human stereoscopic motion perception must be explicitly uncovered through designed experiments. In this paper, motion stereo videos based on monocular and binocular cues are designed as visual stimuli, and subjective experiments are designed using the control variable method. The experiments explore two questions: the influence of the relative distance between the target and the reference sphere on the subjects' perception ability, and the relationship between the target's actual movement direction and the subjects' perceived direction. The experimental data are analyzed using two behavioral measures: the percentage of successfully intercepted targets and the perceptual bias. The conclusions are as follows. First, the smaller the relative distance between the target and the reference, the higher the interception success rate; the target velocity and the reference's motion radius affect this relative distance. This indicates that the relative positional relationship between target and reference plays an important role in human perception of moving objects: motion perception is relative, and motion is easier to perceive at positions with a reference point and positions close to it. Second, perceptions elicited by motion in depth are more pronounced than those induced by lateral motion: the correct interception rate in the depth direction is 42.67%~47.01% higher than that for lateral motion. This shows that the visual stimulation brought by motion in depth is more obvious and that the perception of objects moving in different directions is asymmetric. However, when interception errors occur in the depth direction, the perceptual deviation is larger, about 0.1583~0.3665. This study explores the salient characteristics of human motion perception and provides a new subjective comparison standard for judging the perception effect of subsequent 3D motion perception models, refining the original stereo perception ability indices.
    Neural Architecture Search for Light-weight Medical Image Segmentation Network
    ZHANG Fu-chang, ZHONG Guo-qiang, MAO Yu-xu
    Computer Science    2022, 49 (10): 183-190.   DOI: 10.11896/jsjkx.210800052
    Most existing medical image segmentation models with excellent performance are manually designed by domain experts, a process that usually requires extensive professional knowledge and repeated experiments. In addition, overly complex segmentation models not only place high demands on hardware resources but also segment inefficiently. A neural architecture search method named Auto-LW-MISN (Automatically Light-weight Medical Image Segmentation Network) is proposed to construct light-weight medical image segmentation networks automatically. By building a light-weight search space, designing a search supernetwork for medical image segmentation, and devising a differentiable search strategy with complexity constraints, a neural architecture search framework for automatically finding light-weight medical image segmentation networks is established. Experimental results on microscope cell images, liver CT images and prostate MR images show that Auto-LW-MISN can automatically construct light-weight segmentation models for different modalities of medical images, and that its segmentation accuracy improves upon U-net, Attention U-net, Unet++ and NAS-Unet.
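    A differentiable search with a complexity constraint is typically realized DARTS-style: each edge holds a softmax-weighted mixture of candidate operations, and the expected cost of the mixture is penalized in the loss. The PyTorch sketch below illustrates that general pattern; the operation set, cost model and penalty weight are assumptions, not Auto-LW-MISN's actual search space.

        import torch
        import torch.nn as nn

        class MixedOp(nn.Module):
            """Softmax-weighted mixture of candidate ops (DARTS-style relaxation).

            ops: list of candidate modules; flops: their per-op costs.
            Architecture weights alpha are learned jointly with the network,
            and expected_cost() can be added to the loss to bias the search
            toward light-weight cells, mimicking a complexity constraint.
            """
            def __init__(self, ops, flops):
                super().__init__()
                self.ops = nn.ModuleList(ops)
                self.register_buffer("flops", torch.tensor(flops, dtype=torch.float))
                self.alpha = nn.Parameter(torch.zeros(len(ops)))

            def forward(self, x):
                w = torch.softmax(self.alpha, dim=0)
                return sum(wi * op(x) for wi, op in zip(w, self.ops))

            def expected_cost(self):
                return (torch.softmax(self.alpha, dim=0) * self.flops).sum()

        # Hypothetical total loss during search:
        # loss = seg_loss + lambda_c * sum(m.expected_cost() for m in mixed_ops)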
    Cross-scale Feature Fusion Self-attention for Image Captioning
    WANG Ming-zhan, JI Jun-zhong, JIA Ao-zhe, ZHANG Xiao-dan
    Computer Science    2022, 49 (10): 191-197.   DOI: 10.11896/jsjkx.220600009
    In recent years, the encoder-decoder framework based on the self-attention mechanism has become the mainstream model for image captioning. However, self-attention in the encoder only models the visual relations of low-scale features, ignoring effective information in high-scale visual features and thus affecting the quality of the generated descriptions. To solve this problem, this paper proposes a cross-scale feature fusion self-attention (CFFSA) method for image captioning. Specifically, CFFSA integrates low-scale and high-scale visual features within self-attention to broaden the range of attention from a visual perspective, which increases effective visual information and reduces noise, thereby learning more accurate visual and semantic relationships. Experiments on the MS COCO dataset show that the proposed method more accurately captures the relationships between cross-scale visual features and generates more accurate descriptions. In addition, CFFSA is a general method that can further improve performance when combined with other self-attention based image captioning methods.
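    One plausible form of cross-scale fusion inside self-attention is to draw queries from one scale while keys and values mix both scales, so every attention step can pull in higher-level semantics. The single-head PyTorch sketch below illustrates that idea under stated assumptions; it is not the published CFFSA module.

        import torch
        import torch.nn as nn

        class CrossScaleFusionAttention(nn.Module):
            """Self-attention whose keys/values mix two feature scales.

            low:  (B, N, d) features from a shallow (low-scale) level.
            high: (B, M, d) features from a deeper (high-scale) level.
            Queries come from the low-scale stream; keys and values are
            drawn from both scales. Single-head variant for brevity.
            """
            def __init__(self, d):
                super().__init__()
                self.q = nn.Linear(d, d)
                self.kv = nn.Linear(d, 2 * d)
                self.scale = d ** -0.5

            def forward(self, low, high):
                q = self.q(low)                                     # (B, N, d)
                k, v = self.kv(torch.cat([low, high], dim=1)).chunk(2, dim=-1)
                attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
                return attn @ v                                     # (B, N, d)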
    Object Detection Algorithm Based on Improved Split-attention Network
    PAN Yi, WANG Li-ping
    Computer Science    2022, 49 (10): 198-206.   DOI: 10.11896/jsjkx.210800214
    Most recent object detection algorithms based on convolutional neural networks fail to make reasonable use of meaningful contextual information and easily miss hard targets. To solve these problems, this paper proposes an object detection algorithm based on an improved split-attention network. Firstly, the split-attention mechanism is introduced, combining a multi-path structure with feature-map attention to improve feature representations. Then, in the convolution layers, poly-scale convolution replaces vanilla convolution to enhance the scale sensitivity of the network. Finally, the proposed algorithm is applied to Faster R-CNN, and experiments are carried out on the Pascal VOC and MS COCO datasets. Compared with the original algorithm, the mAP of the proposed algorithm improves by 1.6% and 2.4% respectively without introducing additional parameters or computational complexity, and it is also higher than that of other algorithms, which verifies the good performance of the proposed method.
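    Poly-scale convolution can be approximated by splitting the input channels into groups and convolving each group with a different dilation rate, so a single layer covers several receptive-field sizes at once. The PyTorch sketch below is a simplified stand-in for that idea, not the exact PSConv formulation used in the paper.

        import torch
        import torch.nn as nn

        class PolyScaleConv(nn.Module):
            """Drop-in replacement for a 3x3 conv that mixes dilation rates.

            Input channels are split into groups; each group is convolved
            with its own dilation, giving one layer several receptive fields.
            """
            def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
                super().__init__()
                assert in_ch % len(dilations) == 0 and out_ch % len(dilations) == 0
                cin, cout = in_ch // len(dilations), out_ch // len(dilations)
                self.branches = nn.ModuleList(
                    nn.Conv2d(cin, cout, 3, padding=d, dilation=d) for d in dilations)

            def forward(self, x):
                chunks = x.chunk(len(self.branches), dim=1)
                return torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)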
    Voxel Deformation Network Based on Environmental Information Mining
    LIU Na-li, TIAN Yan, SONG Ya-dong, JIANG Teng-fei, WANG Xun, YANG Bai-lin
    Computer Science    2022, 49 (10): 207-213.   DOI: 10.11896/jsjkx.210900066
    3D deformation is one of the hot topics in the field of computer graphics. Current 3D deformation methods mainly learn the changes before and after deformation by aggregating localized adjacent voxel features; they fail to exploit the interrelationships between non-local voxel features, and the absence of contextual information prevents them from capturing more discriminative features. To address these problems, this paper designs a voxel deformation network based on environmental information mining, which extracts local and environmental information simultaneously, drawing environmental information from different spatial domains to improve the representation ability of the network and to better model the relationship between an object before and after deformation. Firstly, a novel self-attention mechanism is introduced that learns the non-local dependencies between different voxels to improve voxel discrimination. Then, a multi-scale analysis method extracts environmental information over different receptive fields via multiple dilated convolutions with different dilation rates, providing more informative contextual features for the subsequent modules. In addition, this paper analyzes the impact of feature fusion on the model and designs an encoder-decoder feature fusion method that adaptively fuses the features extracted by the encoder and decoder to improve the nonlinear mapping capability of the model. Extensive experiments on our tooth dataset show that the deformation prediction accuracy of the proposed method improves upon existing methods.
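    Non-local self-attention over voxels can be illustrated with a generic non-local block on a (B, C, D, H, W) grid, in which every voxel attends to every other voxel. The PyTorch sketch below is one plausible reading of the paper's self-attention mechanism, not its published code.

        import torch
        import torch.nn as nn

        class VoxelNonLocal(nn.Module):
            """Non-local attention over a voxel grid (B, C, D, H, W).

            Every voxel attends to every other voxel, capturing long-range
            dependencies that local 3D convolutions miss. Assumes C is even.
            """
            def __init__(self, c):
                super().__init__()
                self.theta = nn.Conv3d(c, c // 2, 1)
                self.phi = nn.Conv3d(c, c // 2, 1)
                self.g = nn.Conv3d(c, c // 2, 1)
                self.out = nn.Conv3d(c // 2, c, 1)

            def forward(self, x):
                b, c, d, h, w = x.shape
                q = self.theta(x).flatten(2).transpose(1, 2)   # (B, N, C/2)
                k = self.phi(x).flatten(2)                     # (B, C/2, N)
                v = self.g(x).flatten(2).transpose(1, 2)       # (B, N, C/2)
                attn = torch.softmax(q @ k / (c // 2) ** 0.5, dim=-1)
                y = (attn @ v).transpose(1, 2).reshape(b, c // 2, d, h, w)
                return x + self.out(y)                         # residual connection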
    Face Image Synthesis Driven by Geometric Feature and Attribute Label
    DAI Fu-yun, CHI Jing, REN Ming-guo, ZHANG Qi-dong
    Computer Science    2022, 49 (10): 214-223.   DOI: 10.11896/jsjkx.210900080
    Aiming at the problems in current face image synthesis, such as the lack of diversity in synthesized appearances and expressions, the low realism of facial expressions and the low synthesis efficiency, this paper proposes a novel face synthesis network driven by facial geometric features and attribute labels. Given a source face image, a target face image and an attribute (e.g., hair color, gender, age) label, the model generates a highly realistic face image that has the expression of the source face, the identity of the target face and the specified attributes. The model consists of two parts: a facial landmark generator (FLMG) and a geometry and attribute aware generator (GAAG). FLMG encodes the expression information with facial geometric feature points and transfers the expression from the source to the target face in the form of feature points. Combining the transferred feature points, the specified attribute label and the target face image, GAAG generates a face image with the specified appearance and expression. A novel soft margin triplet perception loss is introduced into GAAG, which makes the synthesized face more natural, preserves the identity of the target face well, and speeds up convergence. Experimental results show that the face images generated by the proposed approach have more diverse appearances and more realistic expressions. In addition, the model only needs to be trained once to realize transfer between arbitrary expressions, so it is highly efficient.
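    A soft margin triplet loss on perceptual embeddings is commonly written as softplus(d_pos - d_neg) over features from a fixed embedding network, which gives a smooth, always-informative gradient. The PyTorch sketch below illustrates that common reading; the choice of embedding network and distance are assumptions, and the authors' exact soft margin triplet perception loss may differ.

        import torch
        import torch.nn.functional as F

        def soft_margin_triplet_perceptual_loss(anchor, positive, negative, embed):
            """Soft-margin triplet loss on perceptual embeddings.

            embed: a fixed feature extractor (e.g. a pretrained face-identity
            network, an assumption here) mapping images to vectors.
            softplus replaces the hard margin of the standard triplet loss.
            """
            fa, fp, fn = embed(anchor), embed(positive), embed(negative)
            d_pos = F.pairwise_distance(fa, fp)   # anchor-positive distance
            d_neg = F.pairwise_distance(fa, fn)   # anchor-negative distance
            return F.softplus(d_pos - d_neg).mean()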