Started in January 1974 (Monthly)
Supervised and Sponsored by Chongqing Southwest Information Co., Ltd.
ISSN 1002-137X
CN 50-1075/TP
CODEN JKIEBK
    Content of Computer Graphics & Multimedia in our journal
    Image Segmentation Based on Deep Learning: A Survey
    HUANG Wenke, TENG Fei, WANG Zidan, FENG Li
    Computer Science    2024, 51 (2): 107-116.   DOI: 10.11896/jsjkx.230900002
    Image segmentation is a fundamental task in computer vision, and its main purpose is to extract meaningful and coherent regions from the input image. Over the years, a wide variety of techniques have been developed in this field, including those based on traditional methods as well as more recent techniques utilizing convolutional neural networks. With the development of deep learning, more and more deep learning algorithms have been applied to image segmentation tasks. In particular, there has been a surge of scholarly interest in deep learning over the past two years, and many deep learning algorithms have emerged for image segmentation. However, most of these new algorithms have not been summarized or analyzed, which hinders the progress of subsequent research. This paper provides a comprehensive review of the literature on deep learning-based image segmentation published in the past two years. First, it briefly introduces common datasets for image segmentation. Next, it presents a new classification of deep learning-based image segmentation methods. Finally, the existing challenges are discussed and future research directions are outlined.
    Unsupervised Learning of Monocular Depth Estimation: A Survey
    CAI Jiacheng, DONG Fangmin, SUN Shuifa, TANG Yongheng
    Computer Science    2024, 51 (2): 117-134.   DOI: 10.11896/jsjkx.230400197
    As a key component of 3D reconstruction, autonomous driving and visual SLAM, depth estimation has always been a hot research direction in computer vision. Among these approaches, monocular depth estimation based on unsupervised learning has received wide attention from academia and industry because of its advantages such as convenient deployment and low computational cost. Firstly, this paper reviews the basic knowledge and current research status of depth estimation, and briefly introduces the advantages and disadvantages of depth estimation based on parametric learning, non-parametric learning, supervised learning, semi-supervised learning and unsupervised learning. Secondly, the research progress of unsupervised monocular depth estimation is comprehensively summarized in five categories: methods combined with interpretable masks, with visual odometry, with prior auxiliary information, with generative adversarial networks, and real-time lightweight networks, and typical framework models are introduced and compared. Then, the applications of unsupervised monocular depth estimation in medicine, autonomous driving, agriculture, the military and other fields are introduced. Finally, the common datasets used for unsupervised depth estimation are briefly introduced, and future research directions of unsupervised monocular depth estimation are proposed, along with the prospects of the various research directions in this rapidly growing field.
    Medical Image Segmentation Algorithm Based on Self-attention and Multi-scale Input-Output
    DING Tianshu, CHEN Yuanyuan
    Computer Science    2024, 51 (2): 135-141.   DOI: 10.11896/jsjkx.221100260
    Refined segmentation of diabetic retinopathy lesions in fundus images can better assist doctors in diagnosis. The appearance of large-scale, high-resolution segmentation datasets provides favorable conditions for more refined segmentation. The mainstream segmentation networks based on U-Net rely on convolution, a local operation, and therefore cannot fully exploit global information when making pixel-wise predictions. These networks also adopt a single-input single-output structure, which makes it difficult to obtain multi-scale feature information. In order to make full use of the existing large-scale, high-resolution fundus image lesion segmentation datasets and achieve more refined segmentation, better segmentation methods need to be designed. In this paper, U-Net is modified with a self-attention mechanism and a multi-scale input/output structure, and a new segmentation network, SAM-Net, is proposed. Self-attention modules replace the traditional convolutional modules, increasing the network's ability to capture global information. Multi-scale input and multi-scale output structures are introduced so that the network can more easily obtain multi-scale feature information. An image slicing method is used to reduce the input size of the model, preventing the training difficulty of the neural network from increasing due to the large pixel count of the input images. Finally, experimental results on the IDRiD and FGADR datasets show that SAM-Net achieves better performance than other methods.
    Multi-guided Point Cloud Registration Network Combined with Attention Mechanism
    LIU Xuheng, BAI Zhengyao, XU Zhu, DU Jiajin, XIAO Xiao
    Computer Science    2024, 51 (2): 142-150.   DOI: 10.11896/jsjkx.230200073
    This paper proposes a point cloud registration network, AMGNet, which uses both the matching-probability matrix between point clouds and the spatial-information feature matrix of the point clouds to search for correspondences and determine the weights of the corresponding points. First, a point cloud feature extraction network is used to obtain high-dimensional features of the two unaligned point clouds, and then a Transformer is used to fuse the independent features with contextual information. Weight assignment uses a strategy in which the two matrices are determined jointly. Finally, singular value decomposition is used to obtain the required rigid transformation matrix. Several experiments are conducted on synthetic datasets such as ModelNet40, as well as on 7Scenes and real scenes. The results show that the mean square errors of the rotation matrix and translation vector in the ModelNet40 unseen-target experiments are reduced to 0.025 and 0.004 6, respectively. AMGNet achieves high registration accuracy, strong resistance to interference, and good generalization ability.
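    The final step described above, recovering the rigid transformation from weighted correspondences via singular value decomposition, is a standard weighted Procrustes/Kabsch solve. Below is a minimal Python sketch of that step alone, assuming corresponding point pairs and per-pair weights are already available; the correspondence search and the weighting strategy, which are the paper's contribution, are not reproduced here.

        import numpy as np

        def weighted_rigid_transform(src, tgt, w):
            """Estimate R, t minimizing sum_i w[i] * ||R @ src[i] + t - tgt[i]||^2.

            src, tgt: (N, 3) corresponding points; w: (N,) non-negative weights.
            """
            w = w / w.sum()
            mu_src = (w[:, None] * src).sum(0)          # weighted centroids
            mu_tgt = (w[:, None] * tgt).sum(0)
            src_c, tgt_c = src - mu_src, tgt - mu_tgt
            H = (w[:, None] * src_c).T @ tgt_c          # 3x3 weighted covariance
            U, _, Vt = np.linalg.svd(H)
            d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
            R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
            t = mu_tgt - R @ mu_src
            return R, t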
    Infrared Small Target Detection Based on Dilated Convolutional Conditional Generative Adversarial Networks
    ZHANG Guodong, CHEN Zhihua, SHENG Bin
    Computer Science    2024, 51 (2): 151-160.   DOI: 10.11896/jsjkx.221200045
    Deep-learning-based object detection methods have achieved great performance in general object detection tasks by virtue of their powerful modeling capabilities. However, deeper network designs and the overuse of pooling operations also lead to the loss of semantic information, which degrades performance when detecting infrared small targets, which have a low signal-to-noise ratio and occupy only a few pixels with weak essential features. This paper proposes a novel infrared small target detection algorithm based on a dilated convolutional conditional generative adversarial network. A generative network stacked with dilated convolutions makes full use of context information to establish layer-to-layer correlations and helps retain the semantic information of infrared small targets in the deep network. In addition, the generative network integrates a mixed channel-spatial attention module, which selectively amplifies target information and suppresses background clutter. Furthermore, a self-attention association module is proposed to resolve the semantic conflicts that arise during fusion between layers. A variety of evaluation metrics are used to compare the proposed method with current state-of-the-art methods and demonstrate its superiority in complex backgrounds. On the public SIRST dataset, the F-score of the proposed model is 64.70%, which is 8.29% higher than the traditional method and 7.29% higher than the deep learning method. On the public ISOS dataset, the F-score is 64.54%, which is 23.59% higher than the traditional method and 6.58% higher than the deep learning method.
    Hierarchical Conformer Based Speech Synthesis
    WU Kewei, HAN Chao, SUN Yongxuan, PENG Menghao, XIE Zhao
    Computer Science    2024, 51 (2): 161-171.   DOI: 10.11896/jsjkx.221100125
    Speech synthesis requires converting input text into a speech signal containing phonemes, words and utterances. Existing speech synthesis methods treat the utterance as a whole and find it difficult to synthesize speech signals of different lengths accurately. In this paper, we analyze the hierarchical relationships embedded in speech signals, design a Conformer-based hierarchical text encoder and a Conformer-based hierarchical speech encoder, and propose a speech synthesis model based on the hierarchical text-speech Conformer. First, the model constructs hierarchical text encoders according to the length of the input text signal, comprising phoneme-level, word-level and utterance-level text encoders. Each level of text encoder describes text information of a different length and uses the Conformer's attention mechanism to learn the relationships between temporal features in the signal at that length. Using the hierarchical text encoder, the model can find the information that needs to be emphasized at different lengths in the utterance and effectively extract text features at different lengths, alleviating the uncertainty in the duration of the synthesized speech signal. Second, the hierarchical speech encoder comprises phoneme-level, word-level and utterance-level speech encoders. In each level of speech encoder, the text features are used as the query vectors of the Conformer, and the speech features are used as its key and value vectors, in order to extract the matching relationship between text features and speech features. The problem of inaccurate synthesis of speech signals of different lengths can be alleviated by using the hierarchical speech encoder and the text-speech matching relations. The hierarchical text-speech encoder modeled in this paper can be flexibly embedded into a variety of existing decoders to provide more reliable speech synthesis results through the complementarity between text and speech. Experimental validation is performed on two datasets, LJSpeech and LibriTTS, and the results show that the Mel cepstral distortion of the proposed method is smaller than that of existing speech synthesis methods.
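    The matching step described above, with text features as queries and speech features as keys and values, is a cross-attention pattern. The following is a minimal PyTorch sketch of that single step, assuming pre-computed text and speech feature sequences with a shared dimension; the paper's actual encoders use full Conformer blocks at the phoneme, word and utterance levels, which are not reproduced here.

        import torch
        import torch.nn as nn

        class TextSpeechCrossAttention(nn.Module):
            """Text features attend to speech features (query = text, key/value = speech)."""
            def __init__(self, dim=256, heads=4):
                super().__init__()
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.norm = nn.LayerNorm(dim)

            def forward(self, text_feat, speech_feat):
                # text_feat: (B, T_text, dim), speech_feat: (B, T_speech, dim)
                matched, _ = self.attn(query=text_feat, key=speech_feat, value=speech_feat)
                return self.norm(text_feat + matched)   # residual + layer norm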
    Two-stage Visible Watermark Removal Model Based on Global and Local Features for Document Images
    ZHAO Jiangfeng, HE Hongjie, CHEN Fan, YANG Shubin
    Computer Science    2024, 51 (2): 172-181.   DOI: 10.11896/jsjkx.230600144
    A visible watermark is a common measure for digital image copyright protection. Analyzing watermark removal results can verify the effectiveness of watermarks on images and provide reference and inspiration for watermark designers. Currently, most watermark removal methods are studied on natural images, while document images are also widely used in daily life. However, due to the lack of publicly available datasets for watermark removal from document images, research on removing watermarks from such images is relatively limited. To explore the effectiveness of watermark removal methods on document images, a dataset for removing watermarks from single document images, the single document image watermark removal dataset (SDIWRD), is constructed. In studying watermark removal from document images, it is found that the results of existing removal methods often leave watermark artifacts, such as body artifacts or outline artifacts. To address this problem, a two-stage watermark removal model based on global and local features is proposed, which uses a coarse-to-fine two-stage half-instance normalization encoder-decoder architecture. In the coarse stage, a global and local feature extraction module is designed to enhance the capture of global spatial features while preserving the extraction of local details, which helps with watermark removal. In the fine stage, the fine network shares the weights of the coarse stage and constructs a recurrent feature fusion module to fully exploit the important features of the coarse-stage encoder and provide rich context information for the fine stage, helping with detailed watermark removal. In addition, a structural similarity loss is used to improve the visual quality after watermark removal. The proposed method is tested on the SDIWRD dataset, and the results show a peak signal-to-noise ratio (PSNR) of 41.21 dB, a structural similarity (SSIM) of 99.07%, and a root mean square error (RMSE) of 3.64, which are better than existing methods. The proposed method is also tested on the publicly available CLWD color watermark removal dataset, and the results show a PSNR of 39.31 dB, an SSIM of 98.81%, and an RMSE of 3.50, also better than existing watermark removal methods. These experimental results demonstrate that the proposed method generalizes well and can effectively alleviate the problem of watermark artifacts. Finally, some suggestions for preventing watermark removal are proposed. The proposed method and dataset are publicly accessible at the corresponding website.
    Cross-scene Gesture Recognition Based on Point Cloud Trajectories and Compressed Doppler
    ZHANG Hongwang, ZHOU Rui, CHENG Yu, LIU Chenxu
    Computer Science    2024, 51 (2): 182-188.   DOI: 10.11896/jsjkx.230400184
    Millimeter-wave radar can be used for various sensing tasks, such as activity recognition, gesture recognition and heart rate sensing. Among them, gesture recognition is a research hotspot, as it enables contactless human-computer interaction. Most existing studies on gesture recognition feed point clouds or range-Doppler maps into neural networks for pattern recognition. However, there are some problems. First, the robustness of these methods is poor: changes in the user and his or her location affect the received millimeter-wave signals, reducing the accuracy of the sensing model. Second, these methods feed the complete range-Doppler map into the neural network, which makes the model complicated and makes it difficult for the model to focus on the sensing task, because many regions are unrelated to that task. To solve these problems, this paper first builds the gesture trajectory from multiple consecutive frames of point clouds, and then cuts and compresses the corresponding consecutive range-Doppler maps to obtain a two-dimensional local Doppler map. Finally, features are extracted from the point cloud trajectory and the two-dimensional local Doppler map by separate neural networks, concatenated, and classified by a fully connected neural network. Experiments show that the proposed method focuses on gestures and achieves a recognition accuracy of 98%, and achieves 93% accuracy for new users and 92% for new locations when the user or location changes, better than the state of the art.
    LNG-Transformer: An Image Classification Network Based on Multi-scale Information Interaction
    WANG Wenjie, YANG Yan, JING Lili, WANG Jie, LIU Yan
    Computer Science    2024, 51 (2): 189-195.   DOI: 10.11896/jsjkx.221100218
    Owing to the superior representation capability of the Transformer's self-attention mechanism, several researchers have developed self-attention-based image processing models and achieved great success. However, traditional self-attention-based networks for image classification cannot account for global information while keeping computational complexity manageable, which limits the wide application of self-attention. This paper proposes an efficient and scalable attention module, Local Neighbor Global Self-Attention (LNG-SA), that can interact with local, neighbor and global information at any stage. By cascading LNG-SA modules, a brand-new network called LNG-Transformer is created. LNG-Transformer adopts a hierarchical structure that provides excellent flexibility and has a computational complexity proportional to image resolution. The properties of LNG-SA enable LNG-Transformer to interact with local, neighbor and global information even in the early, high-resolution stages, resulting in increased efficiency and enhanced learning capacity. Experimental results show that LNG-Transformer performs well at image classification.
    Novel Image Classification Model Based on Depth-wise Convolution Neural Network and Visual Transformer
    ZHANG Feng, HUANG Shixin, HUA Qiang, DONG Chunru
    Computer Science    2024, 51 (2): 196-204.   DOI: 10.11896/jsjkx.221100234
    Deep learning-based image classification models have been successfully applied in various scenarios. Current image classification models can be categorized into two classes: CNN-based classifiers and Transformer-based classifiers. Due to its limited receptive field, a CNN-based classifier cannot model global relations in an image, which decreases classification accuracy, while Transformer-based classifiers usually segment the image into non-overlapping patches of equal size, which harms the local information between adjacent patches. Additionally, Transformer-based classification models often require pre-training on large datasets, resulting in high computational costs. To tackle these problems, an efficient pyramid vision Transformer (EPVT) based on depth-wise convolution is proposed in this paper to extract both the local and global information between adjacent image patches at low computational cost. The EPVT model consists of three key components: a local perception module (LP), a spatial information fusion module (SIF) and a convolutional feed-forward network module (CFFN). The LP module captures the local correlation of image patches. The SIF module fuses local information between adjacent image patches and improves the feature expression ability of EPVT by utilizing the long-distance dependence between different image patches. The CFFN module encodes the location information and reconstructs the tensors between feature image patches. To validate the performance of the proposed EPVT model, various experiments are conducted on benchmark datasets, and the results show that EPVT achieves 82.6% classification accuracy on ImageNet-1K, outperforming most SOTA models with lower computational complexity.
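    For readers unfamiliar with the building block named in the title, the sketch below shows a generic depth-wise plus point-wise convolution module of the kind EPVT builds on; the kernel size, residual connection and exact placement inside the LP/SIF/CFFN modules are assumptions rather than the paper's design.

        import torch.nn as nn

        class DepthwiseConvBlock(nn.Module):
            """Depth-wise convolution (one filter per channel) followed by point-wise channel mixing."""
            def __init__(self, channels, kernel_size=3):
                super().__init__()
                # groups=channels convolves each channel independently (depth-wise)
                self.dw = nn.Conv2d(channels, channels, kernel_size,
                                    padding=kernel_size // 2, groups=channels)
                self.pw = nn.Conv2d(channels, channels, 1)   # 1x1 point-wise mixing

            def forward(self, x):
                return self.pw(self.dw(x)) + x               # residual connection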
    Recursive Gated Convolution Based Super-resolution Network for Remote Sensing Images
    LIU Changxin, WU Ning, HU Lirui, GAO Ba, GAO Xueshan
    Computer Science    2024, 51 (2): 205-216.   DOI: 10.11896/jsjkx.230800017
    Due to hardware manufacturing constraints, it is usually difficult to obtain high-resolution (HR) images in remote sensing. Reconstructing a high-resolution image from a low-resolution remote-sensing image via single-image super-resolution (SISR) is a common approach. Recently, convolutional neural networks (CNN) were introduced to the field of super-resolution image reconstruction and effectively improved reconstruction performance. However, classic CNN-based approaches typically use low-order attention to extract deep features, which limits their reconstruction ability, and their receptive field is limited, so they lack the ability to learn long-range dependencies. To solve these problems, a recursive gated convolution-based super-resolution method for remote sensing images (RGCSR) is proposed. RGCSR introduces recursive gated convolution (gnConv) to learn global dependencies and local details, and high-order features are acquired through high-order spatial interactions. First, a high-order interaction feed-forward network (HFB), consisting of a high-order interaction sub-module (HorBlock) and a feed-forward network (FFN), is applied to extract high-order features. Then, a feature optimization module (FOB) containing channel attention (CA) and gnConv is used to optimize the output features of each intermediate module. Finally, comparison results on multiple datasets show that RGCSR has better reconstruction and visualization performance than existing CNN-based solutions.
    Survey of Image Data Augmentation Techniques Based on Deep Learning
    SUN Shukui, FAN Jing, SUN Zhongqing, QU Jinshuai, DAI Tingting
    Computer Science    2024, 51 (1): 150-167.   DOI: 10.11896/jsjkx.230500103
    In recent years, deep learning has demonstrated excellent performance in many computer vision tasks such as image classification, object detection and image segmentation. Deep neural networks usually rely on a large amount of training data to avoid overfitting, so excellent performance is inseparable from the support of massive image data. However, in many real-world applications, it is often difficult to obtain sufficient image data, and data collection is expensive and time-consuming. Image data augmentation has effectively alleviated the problem of insufficient data; as an effective way to increase the quantity, quality and diversity of training data, it has become a necessary component for the successful application of deep learning models to image data. Understanding existing algorithms helps in choosing appropriate methods and developing new ones. This paper elaborates on the research motivation of image data augmentation, systematically classifies the numerous data augmentation algorithms, analyzes each type in detail, and then points out some considerations in the design of data augmentation algorithms and their scope of application. The effectiveness of data augmentation is demonstrated through three computer vision tasks. Finally, the paper summarizes and proposes some prospects for future research directions in data augmentation.
    Multimodal Pre-training Method for Multi-view Contrastive Learning and Semantic Enhancement
    TANG Jia, GUO Yan, YE Mingwei, WU Guixing
    Computer Science    2024, 51 (1): 168-174.   DOI: 10.11896/jsjkx.230700084
    Visual language pretraining (VLP) models have shown impressive performance on multimodal tasks through contrastive learning and other methods. However, existing research has overlooked the benefits of multi-view descriptions and the importance of semantics and grammar. To address this issue, this paper proposes multi-view learning and semantic enhancement for multimodal pre-training (MulSE), which consists of three main components: 1) introducing multi-view contrastive learning with a generator in the fused encoder model; 2) proposing multimodal text reordering as a novel self-supervised visual language pretraining task; 3) increasing and exploring the optimal MLM masking ratio, maximizing the ability to use visual information. By improving the pretraining tasks and employing multiple optimal strategies, our experiments demonstrate that MulSE enhances intra-modal and inter-modal understanding and improves the comprehension of syntax and semantics within text. With only 4M pre-training samples, it matches the results that previous models obtained with much larger datasets on the image-text retrieval task, and its evaluation results on visual question answering and visual entailment tasks outperform previous comprehension-oriented VLP models.
    Method of Infrared Small Target Detection Based on Multi-depth Feature Connection
    WANG Weijia, XIONG Wenzhuo, ZHU Shengjie, SONG Ce, SUN He, SONG Yulong
    Computer Science    2024, 51 (1): 175-183.   DOI: 10.11896/jsjkx.230200037
    Infrared small targets have few pixels and appear against complex backgrounds, which leads to low detection accuracy and high time consumption. This paper proposes a multi-depth feature connection network. First, the model uses a multi-depth cross-connected backbone to increase feature transfer between different layers and enhance feature extraction. Second, an attention-guided pyramid structure is designed to enhance the deep features and separate the background from the target. Third, an asymmetric fusion decoding structure is proposed to better preserve texture and position information during decoding. Finally, the model introduces a point regression loss to obtain the center coordinates. The proposed network is trained and tested on the SIRST dataset and a self-built infrared small target dataset. Experimental results show that, compared with existing data-driven and model-driven algorithms, the proposed model achieves higher detection accuracy and faster speed in complex scenes. Compared with the suboptimal model, the average precision is improved by 5.41%, and the detection speed reaches 100.8 FPS.
    Weighted-loss-based Up-sampling for Point Cloud Occupancy Map Video
    CHEN Hang, LI Li, LIU Dong, LI Houqiang
    Computer Science    2024, 51 (1): 184-189.   DOI: 10.11896/jsjkx.230600161
    In video-based point cloud compression (V-PCC), a 3D point cloud is divided into hundreds of patches and then mapped onto a 2D grid, generating a texture video that captures texture information and a geometry video that captures geometry information. Meanwhile, an occupancy map video is also generated to record whether each pixel in the former two videos corresponds to a point in the reconstructed point cloud. The quality of the occupancy map video is therefore directly linked to the quality of the reconstructed point cloud. To save bits, the occupancy map video is down-sampled at the encoder and up-sampled with a simplistic method at the decoder. This paper uses a deep learning-based up-sampling method to replace the simple up-sampling method in the original V-PCC, improving the quality of the up-sampled occupancy map videos as well as that of the reconstructed point cloud. A weighted distortion loss function is introduced into the network training process so that, when reconstructing a point cloud, as few normal points as possible are removed while as many noisy points as possible are removed. Experimental results show that the proposed method significantly improves the subjective and objective performance of V-PCC.
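    The exact form of the weighted distortion loss is not given in the abstract; one plausible reading is a per-pixel weighted binary cross-entropy whose weights differ for occupied and empty ground-truth pixels, so that the trade-off between dropping normal points (missed occupied pixels) and keeping noisy points (falsely occupied pixels) can be tuned. The sketch below illustrates that idea; the loss form and weight values are assumptions.

        import torch
        import torch.nn.functional as F

        def weighted_occupancy_loss(pred_logits, target, w_occupied=1.0, w_empty=4.0):
            """pred_logits, target: (B, 1, H, W); target holds 0/1 occupancy labels.

            A larger w_empty penalizes falsely occupied pixels (sources of noisy points) more heavily.
            """
            weight = torch.where(target > 0.5,
                                 torch.full_like(target, w_occupied),
                                 torch.full_like(target, w_empty))
            return F.binary_cross_entropy_with_logits(pred_logits, target, weight=weight)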
    Raindrop In-Situ Captured Benchmark Image Dataset and Evaluation
    CHEN Tianyi, XUE Wen, QUAN Yuhui, XU Yong
    Computer Science    2024, 51 (1): 190-197.   DOI: 10.11896/jsjkx.230500125
    When taking photos through glass windows on rainy days, the raindrops adhering to the glass surface usually appear in the images, which not only degrades the visibility of the image but also prevents many computer vision algorithms from functioning properly. Raindrop removal research aims to remove such raindrops from rainy images. Single-image raindrop removal presents significant challenges due to the diverse and unique forms of raindrops found in nature. The varying transparency of raindrops further complicates the task of removing raindrop artifacts and degrades the imaging quality of background scenes, adversely impacting the performance of existing raindrop removal algorithms. To facilitate a comprehensive understanding of this research area, this paper provides a detailed introduction to single-image raindrop removal, covering two main aspects: single-image raindrop removal algorithms and joint raindrop removal algorithms for single images. A summary and evaluation of existing algorithms in this field are also presented. In deep learning-based methods, the performance of an algorithm is often limited by the quality and quantity of the dataset, yet existing raindrop datasets commonly suffer from low-quality raindrop images and insufficient image quantities. This paper therefore proposes the higher education megacenter (HEMC) dataset. Camera shake, window reflections and other external disturbances are avoided as much as possible, improving the image quality of the training set and the accuracy of the test set, and indirectly improving the performance of raindrop removal methods. HEMC is evaluated in various aspects using visual results and objective metrics. Experimental results show the diversity of the raindrop images in HEMC and the stability of the objective metrics, and verify the universality and stability of HEMC for raindrop removal methods.
    Seal Removal Based on Generative Adversarial Gated Convolutional Network
    WU Guibin, YANG Zongyuan, XIONG Yongping, ZHANG Xing, WANG Wei
    Computer Science    2024, 51 (1): 198-206.   DOI: 10.11896/jsjkx.230500232
    Seals on invoices and documents seriously affect the accuracy of text recognition, so seal elimination techniques play an important role in the pre-processing stage of document analysis and document enhancement. However, threshold-based methods and deep learning-based methods suffer from incomplete seal elimination and modification of background pixels. This paper therefore proposes a two-stage seal elimination network, SealErase. The first stage is a U-shaped segmentation network that generates a binarized mask with the seal position, and the second stage is an inpainting network for refined seal elimination. Due to the lack of publicly available paired datasets for seal elimination, existing methods cannot design pixel-level evaluation metrics to measure the quality of the generated images. Moreover, training the neural network with a paired training set can effectively improve its performance. To this end, this paper constructs a highly realistic seal elimination dataset containing 8 000 samples, taking into account generalization to real scenes and robustness to noise. The seals are divided into two types: seals in real document images and synthetic seals. To objectively evaluate the performance of SealErase, a comprehensive evaluation metric is devised, based on image generation quality and the recognition accuracy of characters obscured by seals, to evaluate the elimination performance of the SealErase network. Existing seal elimination methods are compared on the seal elimination dataset, and experimental results show that SealErase improves the peak signal-to-noise ratio by 26.79% and the mean structural similarity by 4.48% over state-of-the-art methods in terms of image generation quality. After seal elimination with SealErase, the recognition accuracy of characters obscured by seals is improved by 38.86%. Experimental results show that SealErase is equally effective at eliminating seals and preserving the obscured characters in real scenes.
    Error-bounded Compatible High-order Remeshing
    ZHANG Wenxiang, GUO Jiapeng, FU Xiaoming
    Computer Science    2024, 51 (1): 207-214.   DOI: 10.11896/jsjkx.230700116
    This paper proposes a method to construct high-quality, compatible high-order surface meshes with bounded approximation errors. Given two closed, oriented and topologically equivalent surfaces and a sparse set of corresponding landmarks, the proposed method contains two steps: (1) generate compatible high-order meshes with bounded approximation errors, and (2) reduce mesh complexity while ensuring that approximation errors are always bounded, and reduce both the distortion between the compatible meshes and the approximation errors with respect to the original meshes by optimizing the control vertices. The first step generates compatible linear meshes with bounded approximation errors and then upgrades them to high-order meshes. In the second step, mesh complexity is effectively reduced by iteratively performing edge-based remeshing and increasing the compatible target edge lengths. The Jacobian matrix of the mapping between 3D Bézier triangles is derived in tangent space, so the distortion energy can be effectively optimized. By optimizing the distortion energy and the approximation-error energy, the distortion between compatible meshes and the approximation errors are effectively reduced. Tests on various pairs of complex models demonstrate the efficacy and practicability of our method for constructing high-quality compatible high-order meshes with bounded approximation errors.
    B-spline Functional Model of Terrestrial Sunshape Based on Measured Data
    SHEN Tong, ZHAO Le, FENG Jieqing
    Computer Science    2024, 51 (1): 215-224.   DOI: 10.11896/jsjkx.230700209
    The function describing the distribution of solar radiative energy received on the ground is called the surface sunshape model. It is important for accurately simulating the distribution of radiative flux density on the receiver in a solar power tower. The percentage of halo radiative energy in the total solar radiative energy is called the circumsolar ratio (CSR), which is a key parameter of the surface sunshape model. At present, the commonly used surface sunshape models have drawbacks such as low accuracy, CSR misalignment, discontinuity, and not being analytically integrable. To address these problems, a new sunshape model in terms of a tensor product B-spline function is proposed based on observation data. First, the two observation datasets are processed via data cleaning, de-noising, normalization, averaging and concatenation. As a result, 84 sets of data with different CSR values are obtained; each set corresponds to a solar radiative energy scanning profile and varies with the incident angle θ. Then the dataset with CSR = 0.005, which varies most drastically, is chosen as the sample case for constrained B-spline function fitting, whose knot vector and number of control coefficients are determined through a differential evolution algorithm and experiments, respectively. The other 83 sets of data, corresponding to the remaining CSR values, are then fitted using the same knot vector and number of control coefficients. Finally, the 84 univariate B-spline functions are adopted as inputs, and the CSR value is used as a variable to perform B-spline fitting on their control coefficients; the knot vector and the number of control vertices are again determined using the above methods. As a result, a surface sunshape model is obtained in terms of a tensor product B-spline function with 12×15 control coefficients and variables CSR and θ. Compared with existing models, the proposed B-spline model is C2 continuous and has the advantages of CSR alignment, high fitting accuracy, and analytical integration of the radiative energy distribution.
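    Written out, the fitted model described above has the standard tensor-product B-spline form; the index ordering over the two variables and the degree (taken as cubic, consistent with the stated C2 continuity) are assumptions:

        \Phi(\mathrm{CSR}, \theta) = \sum_{i=0}^{11} \sum_{j=0}^{14} c_{ij}\, N_{i,3}(\mathrm{CSR})\, N_{j,3}(\theta)

    where N_{i,3} and N_{j,3} are B-spline basis functions over the two fitted knot vectors and c_{ij} are the 12×15 control coefficients.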
    Local Progressive and Iterative Approximation for Least Squares B-spline Curve and Surface Fitting
    GAO Yang, JIANG Yini, LIN Hongwei
    Computer Science    2024, 51 (1): 225-232.   DOI: 10.11896/jsjkx.230700152
    Progressive and iterative approximation for least squares B-spline curve and surface fitting (LSPIA), as an effective method for fitting large data sets, has attracted the attention of many researchers. To address the problem that the LSPIA algorithm is less effective at fitting local data points, a local LSPIA algorithm, called LOCAL-LSPIA, is proposed. First, an initial curve is given and some of the data points are selected from the given data. Then, the control points to be adjusted are selected on the initial curve. Finally, LOCAL-LSPIA generates a series of locally varying fitted curves (surfaces) by iteratively adjusting only this subset of control points, while ensuring that the limit of the generated curves (surfaces) is the least-squares fit of the selected data points when only these control points are adjusted. Experimental results on multiple curve and surface fittings show that LOCAL-LSPIA requires fewer steps and less time than LSPIA to achieve the same local fitting accuracy. Therefore, LOCAL-LSPIA is effective and converges faster than LSPIA when fitting local data.
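    For reference, the global LSPIA update that LOCAL-LSPIA restricts to a subset of control points has the following well-known form (notation assumed: data points Q_i at parameters t_i, B-spline basis functions N_j, step size \mu chosen for convergence):

        P_j^{(k+1)} = P_j^{(k)} + \mu \sum_i N_j(t_i) \Big( Q_i - \sum_l N_l(t_i)\, P_l^{(k)} \Big)

    In LOCAL-LSPIA the update is applied only for indices j in the selected subset and only over the selected data points, while the remaining control points stay fixed.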
    FeaEM: Feature Enhancement-based Method for Weakly Supervised Salient Object Detection via Multiple Pseudo Labels
    SHI Dianxi, LIU Yangyang, SONG Linna, TAN Jiefu, ZHOU Chenlei, ZHANG Yi
    Computer Science    2024, 51 (1): 233-242.   DOI: 10.11896/jsjkx.230500035
    Salient object detection aims to detect the most conspicuous regions of an image. Traditional methods based on a single label are inevitably affected by the refinement algorithm and show biased characteristics, which further degrades the detection performance of the saliency network. To solve this problem, based on a multi-instruction filter structure, this paper proposes a feature enhancement-based method for weakly supervised salient object detection via multiple pseudo labels (FeaEM), which integrates more comprehensive and accurate saliency cues from multiple labels to effectively improve detection performance. The core of FeaEM is a new multi-instruction filter structure that uses multiple pseudo labels to avoid the negative effects of a single label. By introducing a feature selection mechanism into the instruction filter, more accurate saliency cues are extracted and filtered from noisy pseudo labels, so that more effective representative features are learned. In addition, existing weakly supervised object detection methods are very sensitive to the scale of the input image, and the predictions for differently sized inputs of the same image deviate considerably. A scale feature fusion mechanism is introduced to ensure that the outputs for different sizes of the same image are consistent, effectively improving the scale generalization ability of the model. Extensive experiments on multiple datasets show that the proposed FeaEM method outperforms the most representative methods.
    Weakly Supervised Video Anomaly Detection Based on Dual Dynamic Memory Network
    ZHOU Wenhao, HU Hongtao, CHEN Xu, ZHAO Chunhui
    Computer Science    2024, 51 (1): 243-251.   DOI: 10.11896/jsjkx.230300134
    Video anomaly detection aims to identify frame-level abnormal behaviors in videos. Weakly supervised methods use both normal and abnormal videos, supplemented by video-level labels, for training, and show better performance than unsupervised methods. However, current weakly supervised video anomaly detection methods cannot record long-term patterns of the video. At the same time, some methods use information from future frames to achieve better detection results, which makes online application impossible. For this reason, a weakly supervised video anomaly detection method based on a dual dynamic memory network is proposed for the first time in this paper. The memory network contains two memory modules, designed to record the long-term normal and abnormal patterns of the video, respectively. To realize the collaborative update of video features and memory items, a read operation enhances video frame features based on the memory items in the memory module, and a write operation updates the memory items based on the video frame features. The number of memory items is dynamically adjusted during training to meet the needs of different video surveillance scenarios. During training, a modality separation loss is proposed to increase the discrimination between memory items. During testing, only the memory items are needed, without the participation of future video frames, so accurate online detection can be achieved. Experimental results on two public weakly supervised video anomaly detection datasets show that the proposed method is superior to all methods that can be applied online, and is also highly competitive compared with offline methods.
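    The read operation described above, which enhances frame features with the memory items they attend to, can be sketched as a soft attention lookup. The following minimal PyTorch sketch reflects that reading only; the write/update rule, the dynamic adjustment of the number of items and the modality separation loss are not shown, and the concatenation-based enhancement is an assumption.

        import torch
        import torch.nn.functional as F

        def memory_read(features, memory):
            """features: (B, T, D) frame features; memory: (M, D) learned memory items."""
            attn = F.softmax(features @ memory.t(), dim=-1)   # (B, T, M) similarity weights
            read = attn @ memory                              # (B, T, D) retrieved memory content
            return torch.cat([features, read], dim=-1)        # enhanced features, (B, T, 2D)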
    Transformer Object Detection Algorithm Based on Multi-granularity
    XU Fang, MIAO Duoqian, ZHANG Hongyun
    Computer Science    2023, 50 (11): 143-150.   DOI: 10.11896/jsjkx.230600028
    Unlike objects at other scales, small objects carry less semantic information and have few training samples, so current object detection algorithms suffer from low detection accuracy on small objects. To address this problem, a Transformer object detection algorithm based on multi-granularity is proposed. First, adopting the multi-granularity idea, a new Transformer serialization method is designed to predict the object position from coarse to fine granularity, thereby improving the localization performance of the model. Then, based on the three-way decision idea, fine-grained mining of small-object samples and regular-scale object samples increases the number of small-object samples and hard negative samples. Finally, experimental results on the COCO dataset show that the small object average precision (APs) of the algorithm reaches 31.5%, and the mean average precision (mAP) reaches 49.1%. Compared with the baseline model, APs is improved by 1.4% and mAP by 2.2%. The algorithm effectively improves the detection of small objects and significantly improves the overall accuracy of object detection.
    Surface Anomaly Detection Based on Image Reconstruction and Semantic Difference Discrimination
    WANG Shangshang, JIN Cheng
    Computer Science    2023, 50 (11): 151-159.   DOI: 10.11896/jsjkx.221100023
    Reconstruction-based methods are widely used for surface anomaly detection. These methods are expected to reconstruct only normal patterns well, detecting and localizing anomalies via the larger reconstruction error in anomalous areas. Previous methods either tend to "generalize" too well, resulting in high-fidelity reconstruction of anomalies, or measure reconstruction differences in image space, which does not really capture semantic differences. To tackle these problems, this paper proposes a model consisting of a reconstruction network and a discrimination network. In the reconstruction network, a multi-scale location-augmented dynamic prototype unit is designed to reinforce the learning of normal patterns. In the discrimination network, the multi-scale deep features of the input image and its anomaly-free reconstruction are fused to utilize the multi-scale semantic difference information before and after reconstruction, which reinforces the discrimination of semantic differences. On the MVTec dataset, the method reaches 99.5% AUROC in the detection task, and 98.5% AUROC and 95.0% PRO in the localization task, outperforming previous reconstruction-based methods by a large margin.
    Deepfake Face Tampering Video Detection Method Based on Non-critical Masks and Attention Mechanism
    YU Yang, YUAN Jiabin, CAI Jiyuan, ZHA Keke, CHEN Zhangyu, DAI Jiawei, FENG Yuxiang
    Computer Science    2023, 50 (11): 160-167.   DOI: 10.11896/jsjkx.221100109
    Since the introduction of Deepfake technology, its illegal application has had a harmful impact on individuals, society and national security, and it poses huge hidden dangers. Deepfake detection for face videos is therefore a hot and difficult problem in computer vision. In view of these problems, this paper proposes a deepfake video detection method based on non-critical masks and a CA_S3D model. It first divides the face image into critical and non-critical regions; masking the non-critical regions improves the deep neural network's attention to the critical regions of the face image and reduces the influence and interference of irrelevant information. It then introduces a contextual attention module into the S3D network, which enhances the ability to capture long-range dependencies in the sample data and improves attention to key channels and features. Experimental results show that the proposed method improves the performance of the deep neural network on the DFDC dataset: the accuracy increases from 83.85% to 90.10%, and the AUC increases from 0.931 to 0.979. Comparison with existing deepfake video detection methods shows that the proposed method performs better, verifying its effectiveness.
    Robust Video Watermarking Scheme Based on QDCT Global Equalization Strategy
    TAO Xinyu, XIONG Lizhi, ZHANG Xiang
    Computer Science    2023, 50 (11): 168-176.   DOI: 10.11896/jsjkx.221000228
    As a promising technology for copyright protection, video watermarking has attracted more and more attention in recent years. Unlike original-domain schemes, compressed-domain schemes do not need to fully encode and decode the video, so they are more efficient, and video storage and transmission generally require compression encoding anyway. Robust video watermarking in the compressed domain has therefore become a research hotspot. However, most existing compressed-domain schemes embed the watermark in individual QDCT coefficients, which makes them less robust. To improve the robustness of compressed-domain algorithms, a robust video watermarking scheme based on a QDCT global equalization strategy is proposed in this paper. First, blocks with both texture and high spatial complexity are selected as watermark blocks according to the number of non-zero coefficients, and then the sum of all coefficients in each of the two blocks of a pair is calculated. According to the coefficient sums and the watermark information, all non-zero coefficients in the sequence block are modified with the global equalization strategy so that the block-pair coefficients satisfy the embedding rule, and the watermark is thus embedded. Experimental results show that, while maintaining the high visual quality of the watermarked video, the proposed scheme is more robust than existing robust video watermarking schemes against both recompression and noise attacks, with improvements of 8% and 9%, respectively.
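    The block-pair rule is described only qualitatively above; the sketch below illustrates the general idea of spreading the required adjustment over all non-zero coefficients of one block so that the ordering of the two blocks' coefficient sums encodes the watermark bit. The margin, the choice of which block to modify and the exact rule are assumptions, not the paper's scheme.

        import numpy as np

        def embed_bit(block_a, block_b, bit, margin=4.0):
            """Adjust block_a's non-zero coefficients so that
            sum(block_a) - sum(block_b) >= margin encodes bit 1, and <= -margin encodes bit 0."""
            block_a = block_a.astype(np.float64)
            target = margin if bit == 1 else -margin
            diff = (block_a.sum() - block_b.sum()) - target
            if (bit == 1 and diff < 0) or (bit == 0 and diff > 0):
                nz = np.flatnonzero(block_a)
                if nz.size:
                    block_a.flat[nz] += -diff / nz.size   # spread the change over non-zero coefficients
            return block_a, block_b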
    Three-dimensional AI Clone Speech Source Identification Method Based on Improved MFCC Feature Model
    WANG Xueguang, ZHU Junwen, ZHANG Aixin
    Computer Science    2023, 50 (11): 177-184.   DOI: 10.11896/jsjkx.221000024
    The emergence of AI voice cloning technology will have a devastating impact on the legal order of modern society. In recent years, researchers have focused on AI-synthesized speech whose content is the same as the sample speech, but little research has been done on identifying AI-synthesized speech whose content differs from the sample. This paper therefore proposes a three-dimensional model for identifying AI-cloned speech sources based on an improved MFCC feature model. First, it verifies the characteristics of AI-cloned speech analyzed manually by previous scholars, and summarizes the characteristics of an "abnormally active formant F5" and "abnormal mutations of the energy, formant and pitch curves" for computer identification. Second, it uses the second-order difference to correct the MFCC coefficients based on the characteristics of AI-cloned speech, uses an "inverse logic deduction" method to further quantify and sample the mutation characteristics of the energy, formant and pitch curves, and defines these as the feature vector triple for speech identification. Then, taking the feature vector triples as input, it uses the D-S evidence synthesis rule to fuse the results of comparing the three groups of questioned materials with the samples. Finally, a three-dimensional questioned-material evaluation model based on improved MFCC feature parameters is formed. In random sampling experiments on a crowd, the AI clone source identification method identifies AI clones synthesized from the same human source with an average probability of 67.324% and a standard deviation of 7.32%, which is quite effective.
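    The second-order-difference (delta-delta) correction of the MFCCs mentioned above can be computed with standard tooling. A minimal sketch follows, assuming librosa's default framing and 13 cepstral coefficients; the paper's exact parameters and the subsequent quantization and D-S fusion steps are not shown.

        import librosa
        import numpy as np

        def mfcc_with_second_order_delta(path, n_mfcc=13):
            y, sr = librosa.load(path, sr=None)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
            delta2 = librosa.feature.delta(mfcc, order=2)      # second-order difference
            return np.concatenate([mfcc, delta2], axis=0)      # (2 * n_mfcc, n_frames)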
    Fusion Tracker: Single-object Tracking Framework Fusing Image Features and Event Features
    WANG Lin, LIU Zhe, SHI Dianxi, ZHOU Chenlei, YANG Shaowu, ZHANG Yongjun
    Computer Science    2023, 50 (10): 96-103.   DOI: 10.11896/jsjkx.220900075
    Object tracking is a fundamental research problem in computer vision. As the mainstream sensor for object tracking, conventional cameras provide rich scene information. However, due to their sampling principle, conventional cameras suffer from over- or under-exposure under extreme lighting conditions and from motion blur in high-speed motion scenes. In contrast, the event camera is a bionic sensor that senses light intensity changes and outputs event streams, with the advantages of high dynamic range and high temporal resolution, but it has difficulty capturing static targets. Inspired by the characteristics of conventional and event cameras, a dual-modal fusion single-object tracking method, called fusion tracker, is proposed. The method adaptively fuses visual cues from conventional and event camera data through feature enhancement, and designs an attention-based feature matching network that matches object cues in template frames with search frames to establish long-term feature associations and make the tracker focus on object information. The fusion tracker can solve the semantic loss problem caused by correlation operations during feature matching and improve tracking performance. Experiments on two publicly available datasets demonstrate the superiority of our approach, and ablation experiments validate the effectiveness of the key components of the fusion tracker. The fusion tracker can effectively improve the robustness of object tracking in complex scenarios and provide reliable tracking results for downstream applications.
    Unbiased Scene Graph Generation Based on Adaptive Regularization Algorithm
    LI Haochen, CAO Fuyuan, QIAO Shichang
    Computer Science    2023, 50 (10): 104-111.   DOI: 10.11896/jsjkx.221000084
    The purpose of scene graph generation is, given a picture, to obtain visual triplets of entities and the relationships between them through an object detection module, namely subject, relationship and object, and to construct a semantically structured representation. Scene graphs can be applied to downstream tasks such as image retrieval and visual question answering. However, due to the long-tail distribution of relationships between entities in the dataset, existing models tend to predict coarse-grained head relationships, and such scene graphs cannot support downstream tasks. Previous works generally adopt rebalancing strategies such as resampling and reweighting to solve the long-tail problem, but because the models repeatedly learn the tail relationship samples, they are prone to overfitting. To solve these problems, an adaptively regularized unbiased scene graph generation method is proposed in this paper. Specifically, the method adaptively adjusts the weights of the model's fully connected classifier by designing a regularization term based on the prior relation frequency, so as to achieve balanced prediction. The proposed method is tested on the Visual Genome dataset, and the experimental results show that it not only prevents the model from overfitting but also alleviates the negative impact of the long-tail distribution on scene graph generation, and that state-of-the-art scene graph generation methods combined with the proposed method can more effectively improve the performance of unbiased scene graph generation.
    Forgery Face Detection Based on Multi-scale Transformer Fusing Multi-domain Information
    MA Xin, JI Lixin, LI Shaomei
    Computer Science    2023, 50 (10): 112-118.   DOI: 10.11896/jsjkx.220900048
    At present, the proliferation of "face-swapping" fake videos generated by deep forgery technologies such as Deepfakes poses a considerable threat to citizens' privacy and national political security. It is therefore of great significance to study the detection of deep-faked faces in videos. Aiming at the problems of insufficient facial feature extraction and weak generalization ability in existing forged face detection methods, this paper proposes a fake face detection method based on a multi-scale Transformer that fuses multi-domain information. First, based on the idea of multi-domain feature fusion, features are extracted from both the frequency domain and the RGB domain of video frames, which improves the generalization of the model. Second, EfficientNet and a multi-scale Transformer are combined to design a multi-level feature extraction network that extracts more elaborate forgery features. Test results on open-source datasets show that the proposed method has better detection performance than existing methods, and cross-dataset experiments show that the proposed model generalizes better.