Started in January 1974 (Monthly)
Supervised and Sponsored by Chongqing Southwest Information Co., Ltd.
ISSN 1002-137X
CN 50-1075/TP
CODEN JKIEBK
    Computer Architecture content in our journal
    Review of Visualization Drawing Methods of Flow Field Based on Streamlines
    ZHANG Qian, XIAO Li
    Computer Science    2021, 48 (12): 1-7.   DOI: 10.11896/jsjkx.201200108
    Flow visualization is an important branch of scientific visualization. It mainly visualizes the simulation results of computational fluid dynamics and provides researchers with intuitive graphical images to facilitate analysis. Known flow-visualization techniques include geometry-based methods, such as streamlines and particle tracking, and texture-based methods, such as LIC, spot noise and IBFV. Streamline visualization is an important and commonly used geometric method for flow-field visualization. In streamline visualization, the placement of streamlines is the focus of the whole process: the number and positions of streamlines determine the visualization effect. Placing too many streamlines causes visual clutter, while too few leave the flow-field information incompletely expressed and unable to be conveyed to domain experts. To achieve an accurate display of scientific data, streamline visualization has produced two important research directions: seed point placement and streamline reduction. This article reviews the related research on seed point placement and streamline reduction methods, summarizes the problems and solutions adopted in 2D and 3D flow fields, and discusses the needs of streamline visualization in view of ever-growing scientific data.
    Anomaly Propagation Based Fault Diagnosis for Microservices
    WANG Tao, ZHANG Shu-dong, LI An, SHAO Ya-ru, ZHANG Wen-bo
    Computer Science    2021, 48 (12): 8-16.   DOI: 10.11896/jsjkx.210100149
    Microservice architectures separate a large-scale complex application into multiple independent microservices. These microservices, built on various technology stacks, communicate with lightweight protocols to enable agile development and continuous delivery. Since an application using a microservice architecture has a large number of microservices communicating with each other, a faulty microservice can cause the microservices interacting with it to exhibit anomalies. How to detect anomalous microservices and locate the root-cause microservice has become key to ensuring the reliability of a microservice-based application. To address this issue, this paper proposes an anomaly-propagation-based fault diagnosis approach for microservices that considers how faults propagate. First, we monitor the interactions between microservices to construct a service dependency graph that characterizes anomaly propagation. Second, we build a regression model between metrics and API calls to detect anomalous services. Third, we obtain the fault propagation subgraph by combining the service dependency graph with the detected anomalous services. Finally, we calculate the anomaly degree of each microservice with a PageRank algorithm to locate the most likely root cause of the fault. The experimental results show that our approach can locate faulty microservices with low overhead.
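As a rough illustration of the PageRank-style root-cause ranking described in this abstract (not the authors' implementation), the sketch below runs personalized PageRank on a reversed service dependency graph; the service names, call edges and anomaly scores are made up.

```python
import networkx as nx

# Hypothetical service dependency graph: an edge u -> v means u calls v.
calls = [("frontend", "orders"), ("frontend", "users"),
         ("orders", "inventory"), ("orders", "payments"),
         ("users", "payments")]
G = nx.DiGraph()
G.add_edges_from(calls)

# Services flagged as anomalous by some detector (assumed given here).
anomalous = {"frontend": 0.6, "orders": 1.0}

# Walk against the call direction so scores accumulate on likely root causes,
# personalizing the walk towards the detected anomalies.
scores = nx.pagerank(G.reverse(copy=True),
                     alpha=0.85,
                     personalization={n: anomalous.get(n, 0.0) for n in G})
for svc, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{svc:10s} {s:.3f}")
```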
    Method of Service Decomposition Based on Microservice Architecture
    JIANG Zheng, WANG Jun-li, CAO Rui-hao, YAN Chun-gang
    Computer Science    2021, 48 (12): 17-23.   DOI: 10.11896/jsjkx.210500078
    Decomposing a monolithic system into microservices can effectively alleviate the redundancy and maintenance difficulties of the monolithic architecture. However, existing microservice decomposition methods fail to make full use of the attribute information of the microservice architecture, which lowers the rationality of the decomposition results. This paper proposes a service decomposition method based on the microservice architecture. The method constructs an entity-attribute relationship graph from the association information between system services and attributes. Then, service decomposition rules are formulated by combining the feature information of the microservice architecture with the requirements of the target system, the associations between the two types of vertices are quantified, and a weighted entity-attribute graph is generated. Finally, a weighted GN algorithm is applied to decompose the system into microservices automatically. The experimental results show that the method greatly improves the timeliness of service decomposition, and the generated microservice decomposition scheme performs better on various evaluation metrics.
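The weighted GN (Girvan-Newman) step mentioned above can be sketched with networkx; the entity-attribute graph and its weights below are invented for illustration, and the paper's actual rule-based weighting is not reproduced.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Hypothetical weighted entity-attribute graph: entities (services) and
# attributes as vertices, edge weights quantifying how strongly they belong together.
G = nx.Graph()
G.add_weighted_edges_from([
    ("order_svc", "order_id", 3.0), ("order_svc", "user_id", 1.0),
    ("user_svc", "user_id", 3.0), ("user_svc", "email", 2.0),
    ("billing_svc", "order_id", 1.5), ("billing_svc", "invoice", 3.0),
])

def heaviest_betweenness_edge(g):
    # Remove the edge with the highest weighted betweenness first.
    bc = nx.edge_betweenness_centrality(g, weight="weight")
    return max(bc, key=bc.get)

# Each Girvan-Newman step yields a finer partition; take the first split here.
partitions = girvan_newman(G, most_valuable_edge=heaviest_betweenness_edge)
print([sorted(c) for c in next(partitions)])
```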
    Parallel WMD Algorithm Based on GPU Acceleration
    HU Rong, YANG Wang-dong, WANG Hao-tian, LUO Hui-zhang, LI Ken-li
    Computer Science    2021, 48 (12): 24-28.   DOI: 10.11896/jsjkx.210600213
    Word Mover's Distance (WMD) is a method for measuring text similarity. It defines the difference between two texts as the minimum distance between their word embedding vectors. WMD uses the vocabulary to represent a text as a normalized bag-of-words vector. Since the words of one text occupy only a small proportion of the corpus, the document vectors generated by the bag-of-words model are very sparse. Multiple documents form a high-dimensional sparse matrix, and such a sparse matrix induces many unnecessary operations. By computing the WMD from a single source document to multiple target documents at once, the calculation can be highly parallelized. Targeting the sparsity of text vectors, this paper proposes a GPU-based parallel Sinkhorn-WMD algorithm, which stores the target texts in a compressed format to improve memory utilization and exploits the sparse structure to reduce intermediate calculations. Pre-trained word embedding vectors are used to compute the word distance matrix, the WMD algorithm is improved accordingly, and the optimized algorithm is verified on two public news data sets. The experimental results show that, on an NVIDIA TITAN RTX, the parallel algorithm achieves a speedup of up to 67.43x over the serial CPU algorithm.
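A minimal dense NumPy sketch of the Sinkhorn iteration that approximates WMD between two bag-of-words histograms is shown below; it ignores the compressed storage and GPU parallelization that are the paper's contribution, and the embeddings and histograms are synthetic.

```python
import numpy as np

def sinkhorn_wmd(a, b, M, reg=0.1, n_iter=200):
    """Entropy-regularized approximation of Word Mover's Distance.
    a, b : normalized bag-of-words histograms of the two documents
    M    : pairwise word-embedding distance matrix, shape (len(a), len(b))
    """
    K = np.exp(-M / reg)                      # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):                   # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]           # approximate transport plan
    return float((P * M).sum())

# Toy example with made-up embeddings for a 3-word and a 2-word document.
rng = np.random.default_rng(0)
emb_src, emb_tgt = rng.normal(size=(3, 50)), rng.normal(size=(2, 50))
M = np.linalg.norm(emb_src[:, None, :] - emb_tgt[None, :, :], axis=-1)
a = np.array([0.5, 0.3, 0.2]); b = np.array([0.6, 0.4])
print(sinkhorn_wmd(a, b, M))
```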
    High Performance Implementation and Optimization of Trigonometric Functions Based on SIMD
    YAO Jian-yu, ZHANG Yi-wei, ZHANG Guang-ting, JIA Hai-peng
    Computer Science    2021, 48 (12): 29-35.   DOI: 10.11896/jsjkx.201200135
    As basic mathematical operations, high-performance implementations of trigonometric functions are of great significance to building the basic software ecosystem of a processor. In particular, current processors have adopted SIMD architectures, so implementing high-performance trigonometric functions on SIMD has important research significance and application value. This paper uses numerical analysis methods to implement and optimize five commonly used trigonometric functions (sin, cos, tan, atan, atan2) with high performance. Based on an analysis of the IEEE 754 floating-point standard, an efficient trigonometric function algorithm is designed. The accuracy is then further improved by applying the Taylor formula, Padé approximation and the Remez algorithm for polynomial approximation. Finally, performance is further improved by instruction pipelining and SIMD optimization. The experimental results show that, while satisfying the accuracy requirements, the implemented trigonometric functions achieve a large performance improvement over the libm and ARM_M algorithm libraries on the ARM V8 computing platform: their time performance is 1.77 to 6.26 times that of libm and 1.34 to 1.5 times that of ARM_M.
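A simplified sketch of the usual recipe (argument reduction plus a short polynomial on the reduced range) is given below in NumPy; it uses low-degree Taylor polynomials for readability rather than the higher-accuracy Remez/Padé coefficients and SIMD assembly the paper describes.

```python
import numpy as np

# Low-degree Taylor polynomials, accurate enough for illustration on [-pi/4, pi/4].
def _sin_poly(r):
    r2 = r * r
    return r * (1.0 + r2 * (-1.0/6 + r2 * (1.0/120 - r2 / 5040)))

def _cos_poly(r):
    r2 = r * r
    return 1.0 + r2 * (-0.5 + r2 * (1.0/24 - r2 / 720))

def vec_sin(x):
    x = np.asarray(x, dtype=np.float64)
    k = np.round(x / (np.pi / 2))            # quadrant index
    r = x - k * (np.pi / 2)                  # reduced argument in [-pi/4, pi/4]
    k = k.astype(np.int64) & 3
    s, c = _sin_poly(r), _cos_poly(r)
    out = np.where(k % 2 == 0, s, c)         # odd quadrants use the cosine branch
    return np.where(k >= 2, -out, out)       # quadrants 2 and 3 flip the sign

xs = np.linspace(-10, 10, 7)
print(np.max(np.abs(vec_sin(xs) - np.sin(xs))))   # small residual of the low-degree fit
```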
    Quantum Fourier Transform Simulation Based on “Songshan” Supercomputer System
    XIE Jing-ming, HU Wei-fang, HAN Lin, ZHAO Rong-cai, JING Li-na
    Computer Science    2021, 48 (12): 36-42.   DOI: 10.11896/jsjkx.201200023
    The “Songshan” supercomputer system is a new generation of heterogeneous supercomputing cluster independently developed by China; the CPUs and DCU accelerators it carries are also domestically developed. In order to expand the scientific computing ecosystem of the platform and verify the feasibility of quantum computing research on it, this paper uses a heterogeneous programming model to implement a heterogeneous version of quantum Fourier transform simulation on the “Songshan” system. The computational hotspots of the program are offloaded to the DCUs; MPI is then used to run multiple processes on a single compute node so that data transmission and computation on the DCU accelerators proceed concurrently; finally, overlapping computation with communication keeps the DCUs from idling during data transfers. The experiments implement a 44-qubit quantum Fourier transform simulation on a supercomputing system for the first time. The results show that the heterogeneous version of the quantum Fourier transform module makes full use of the computing resources of the DCU accelerators, achieves a speedup of 11.594 over the traditional CPU version, and scales well on the cluster. This implementation provides a reference for simulating and optimizing other quantum algorithms on the “Songshan” supercomputer system.
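For reference, the sketch below only checks the mathematical definition of the QFT on a full state vector via its equivalence to a scaled inverse DFT; it says nothing about the gate-level, MPI+DCU implementation in the paper.

```python
import numpy as np

def qft_statevector(state):
    """Apply the quantum Fourier transform to a full state vector.
    For an n-qubit state of length N = 2**n,
    QFT|j> = (1/sqrt(N)) * sum_k exp(2*pi*i*j*k/N) |k>,
    which is exactly a scaled inverse DFT of the amplitude vector."""
    n = int(np.log2(len(state)))
    assert len(state) == 2 ** n
    return np.fft.ifft(state) * np.sqrt(len(state))

# 3-qubit example: QFT of the basis state |001>.
psi = np.zeros(8, dtype=complex)
psi[1] = 1.0
out = qft_statevector(psi)
print(np.round(out, 3))                        # equal magnitudes, phases e^{2*pi*i*k/8}
print(np.isclose(np.linalg.norm(out), 1.0))    # the transform is unitary
```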
    DGX-2 Based Optimization of Application for Turbulent Combustion
    WEN Min-hua, WANG Shen-peng, WEI Jian-wen, LI Lin-ying, ZHANG Bin, LIN Xin-hua
    Computer Science    2021, 48 (12): 43-48.   DOI: 10.11896/jsjkx.201200129
    Numerical simulation of turbulent combustion is a key tool for aeroengine design. Because high-precision models of the Navier-Stokes equations are required, turbulent combustion simulations demand a huge amount of computation, and the physicochemical models make the flow field extremely complicated, so load balancing becomes a bottleneck for large-scale parallelization. We port and optimize a numerical simulation method for turbulent combustion on a powerful computing server, the DGX-2. We design the threading of the flux calculation and use the Roofline model to guide the optimization. In addition, we design an efficient communication method and propose a multi-GPU parallel method for turbulent combustion based on the high-speed interconnect of the DGX-2. The results show that the performance of a single V100 GPU is 8.1x that of a dual-socket Intel Xeon 6248 CPU node with 40 cores, and the multi-GPU version on a DGX-2 with 16 V100 GPUs achieves a 66.1x speedup, exceeding the best performance obtained on the CPU cluster.
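The Roofline model mentioned above bounds attainable performance by arithmetic intensity; in its standard form (not specific to this paper):

```latex
% Roofline bound on attainable performance P for a kernel with arithmetic
% intensity I = FLOPs / bytes moved, peak compute P_peak and memory bandwidth B:
P_{\text{attainable}}(I) \;=\; \min\bigl(P_{\text{peak}},\; B \cdot I\bigr)
```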
    Loop Fusion Strategy Based on Data Reuse Analysis in Polyhedral Compilation
    HU Wei-fang, CHEN Yun, LI Ying-ying, SHANG Jian-dong
    Computer Science    2021, 48 (12): 49-58.   DOI: 10.11896/jsjkx.210200071
    Existing polyhedral compilation tools often use simple heuristic strategies to find loop fusion decisions, so the fusion strategy has to be adjusted manually to obtain the best performance for different programs. To solve this problem, a fusion strategy based on data reuse analysis is proposed for multi-core CPU platforms. This strategy avoids unnecessary fusion constraints that hinder the exploitation of data locality. Parallelism constraints for different parallel levels are proposed for different scheduling stages, and a tiling constraint for CPU cache optimization is proposed for statements with complex array accesses. Compared with previous loop fusion strategies, this strategy takes changes in spatial locality into account when computing fusion profits. The strategy is implemented in Polly, the polyhedral compilation module of the LLVM compilation framework, and test cases from suites such as PolyBench are selected for evaluation. In single-core tests, the average performance improves by 14.9% to 62.5% compared with existing fusion strategies; in multi-core tests, the average performance improves by 19.7% to 94.9%, with speedups of up to 1.49x to 3.07x.
    Performance Skeleton Analysis Method Towards Component-based Parallel Applications
    FU Tian-hao, TIAN Hong-yun, JIN Yu-yang, YANG Zhang, ZHAI Ji-dong, WU Lin-ping, XU Xiao-wen
    Computer Science    2021, 48 (6): 1-9.   DOI: 10.11896/jsjkx.201200115
    Performance skeleton analysis technology (PSTAT) provides input parameters for performance modeling of parallel applications by describing their program structure, and it is the basis of performance analysis and optimization for large-scale parallel applications. Targeting a class of component-based parallel applications in the field of numerical simulation, and building on dynamic and static program-structure analysis techniques for general binary files, this paper proposes and implements an automatic performance skeleton generation method based on a "component-loop-call" tree. On this foundation, a performance skeleton analysis toolkit, CLCT-STAT (Component-Loop-Call-Tree SkeleTon Analysis Toolkit), is developed. The method automatically identifies the function symbols of component class members in component-based applications and generates the performance skeleton of a parallel application with components as the smallest unit. Compared with generating performance skeletons manually through analytical modeling, the proposed method provides more program structure information and saves the cost of manual analysis.
    Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor
    ZHU Yu, PANG Jian-min, XU Jin-long, TAO Xiao-han, WANG Jun
    Computer Science    2021, 48 (6): 10-18.   DOI: 10.11896/jsjkx.200700059
    Stencil computation is an important part of scientific computing and large-scale applications, and tiling is a widely used technique for exploiting the data locality of stencil computations. Existing 3D stencil optimizations on the SW26010 rarely use time tiling, and tile sizes need to be tuned manually. To solve this problem, this paper introduces time tiling and proposes an adaptive tile-size algorithm for 3D stencil computation on the SW26010 many-core processor. By establishing a performance analysis model, we systematically analyze the influence of tile size on the performance of 3D stencil computation, identify the performance bottleneck and guide the optimization direction under the hardware resource constraints. Based on the performance model, the adaptive algorithm predicts the optimal tile size, which helps deploy 3D stencils rapidly on the SW26010 processor. 3D-7P and 3D-27P stencils are selected for experiments. Compared with versions without time tiling, the speedups of the two examples with the optimal tile size given by our algorithm reach 1.47 and 1.29, and the optimal tile size found experimentally is consistent with that given by our model, which verifies the proposed performance analysis model and adaptive tile-size algorithm.
    List-based Software and Hardware Partitioning Algorithm for Dynamic Partial Reconfigurable System-on-Chip
    GUO Biao, TANG Qi, WEN Zhi-min, FU Juan, WANG Ling, WEI Ji-bo
    Computer Science    2021, 48 (6): 19-25.   DOI: 10.11896/jsjkx.200700198
    Parallel computing is an important means of improving the utilization of system resources, and more and more multiprocessor systems-on-chip meet the requirements of different computing tasks by integrating processors with different functional characteristics. Dynamically partially reconfigurable heterogeneous multiprocessor systems-on-chip (DPR-HMPSoC) are widely used because of their good parallelism and high computing efficiency, and a software/hardware partitioning algorithm with low complexity and high solution quality is an important guarantee for exploiting their computational performance. Existing software/hardware partitioning algorithms have high time complexity and insufficient support for the DPR-HMPSoC platform. To address these problems, this paper proposes a list-based heuristic software/hardware partitioning and scheduling algorithm that builds a scheduling list based on task priority and then completes task scheduling, mapping, partitioning of the FPGA dynamically reconfigurable region and other operations. The paper introduces the software application model, the computing platform model and the detailed design of the proposed algorithm. Simulation results show that the proposed algorithm effectively reduces the solution time compared with the MILP and ACO algorithms, and the time advantage grows with the task scale; in terms of schedule length, the average performance improves by about 10%.
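A bare-bones list-scheduling sketch in the spirit described above (priority list plus earliest-finish-time assignment) follows; it ignores communication costs, reconfiguration overhead and FPGA region partitioning, and the task graph and execution costs are made up.

```python
# Task DAG as adjacency lists; cost[task][proc] = execution time on that processor.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
pred = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
cost = {"A": [4, 2], "B": [3, 5], "C": [2, 2], "D": [6, 3]}   # [CPU, FPGA] times

def upward_rank(t):
    # Priority = average cost + longest remaining path (communication ignored here).
    base = sum(cost[t]) / len(cost[t])
    return base + max((upward_rank(s) for s in succ[t]), default=0.0)

ready_time = {p: 0.0 for p in range(2)}       # next free time of each processor
finish = {}                                   # task -> (processor, finish time)
for t in sorted(cost, key=upward_rank, reverse=True):   # list order by priority
    est = max((finish[p][1] for p in pred[t]), default=0.0)
    # Pick the processor giving the earliest finish time for this task.
    proc = min(range(2), key=lambda p: max(est, ready_time[p]) + cost[t][p])
    start = max(est, ready_time[proc])
    finish[t] = (proc, start + cost[t][proc])
    ready_time[proc] = finish[t][1]

for t, (p, f) in finish.items():
    print(f"task {t} -> proc {p}, finishes at {f}")
```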
    Implementation of Transcendental Functions on Vectors Based on SIMD Extensions
    LIU Dan, GUO Shao-zhong, HAO Jiang-wei, XU Jin-chen
    Computer Science    2021, 48 (6): 26-33.   DOI: 10.11896/jsjkx.200400007
    The basic mathematical function library is a critical software module in a computer system. However, long-vector transcendental functions on the domestic Shenwei platform can currently only be implemented indirectly by calling the system's scalar functions in a loop, which limits the computing capability of the platform's SIMD extensions. To solve this problem, this paper implements long-vector transcendental functions with low-level optimization of the Shenwei SIMD extensions, proposes a floating-point computing fusion algorithm to address the difficulty of vectorizing two-branch algorithm structures, and proposes an implementation of high-degree polynomials based on dynamic grouping in the Estrin algorithm, which improves the pipelining of polynomial evaluation. This is the first long-vector transcendental function library on the Shenwei platform; the provided interfaces include trigonometric, inverse trigonometric, logarithmic and exponential functions. The experimental results show that the maximum error of the double-precision version is kept below 3.5 ULP (units in the last place) and the maximum error of the single-precision version below 0.5 ULP. Compared with the scalar functions of the Shenwei platform, performance is significantly improved, with an average speedup of 3.71.
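Estrin's scheme, mentioned above for polynomial evaluation, pairs coefficients so that independent sub-expressions can be evaluated concurrently; the NumPy sketch below illustrates the recurrence only and is not the paper's Shenwei SIMD implementation.

```python
import numpy as np

def estrin_eval(coeffs, x):
    """Evaluate sum_i coeffs[i] * x**i with Estrin's scheme.
    Adjacent coefficients are paired and the pairs combined with powers
    x^2, x^4, ..., exposing more independent operations than Horner's rule,
    which helps pipelining and vectorization."""
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:                    # pad to an even count
            terms.append(np.zeros_like(x) if np.ndim(x) else 0.0)
        terms = [terms[i] + power * terms[i + 1] for i in range(0, len(terms), 2)]
        power = power * power                 # x^2, x^4, x^8, ...
    return terms[0]

# Degree-7 polynomial evaluated at a vector of points, checked against NumPy.
c = [1.0, -0.5, 0.25, -0.125, 0.0625, -0.03125, 0.015625, -0.0078125]
xs = np.linspace(-1, 1, 5)
print(np.allclose(estrin_eval(c, xs), np.polyval(c[::-1], xs)))
```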
    Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform
    HE Ya-ru, PANG Jian-min, XU Jin-long, ZHU Yu, TAO Xiao-han
    Computer Science    2021, 48 (6): 34-40.   DOI: 10.11896/jsjkx.201100051
    The Floyd algorithm for finding shortest paths in a weighted graph is a key building block used frequently in a variety of practical applications. However, the Floyd algorithm cannot scale to large graphs because of its time complexity, so parallel implementations for different architectures have been proposed and proved effective. To address the lack of an efficient parallel implementation of the Floyd algorithm on domestically designed processors, this paper implements and optimizes the Floyd algorithm for the Sunway platform. Specifically, the algorithm is implemented with the programming model designed for the heterogeneous architecture of the Sunway TaihuLight, and the performance bottlenecks of the execution on the target are identified. The performance of the Floyd algorithm is then improved by means of algorithmic optimization, array partitioning and double buffering. The experimental results show that the implementation on the Sunway platform achieves a speedup of up to 106x over the sequential version executed on the management processing element of the SW26010 processor.
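For orientation, a dense Floyd-Warshall reference in NumPy is shown below; each pivot step is one vectorized min-plus update, which is the structure that blocked and many-core versions such as the paper's distribute across processing elements.

```python
import numpy as np

def floyd_warshall(dist):
    """All-pairs shortest paths, in place on a dense distance matrix.
    The two inner loops become one vectorized min-plus update per pivot k."""
    n = dist.shape[0]
    for k in range(n):
        # dist[i, j] = min(dist[i, j], dist[i, k] + dist[k, j]) for all i, j
        np.minimum(dist, dist[:, k:k+1] + dist[k:k+1, :], out=dist)
    return dist

INF = np.inf
graph = np.array([[0, 3, INF, 7],
                  [8, 0, 2, INF],
                  [5, INF, 0, 1],
                  [2, INF, INF, 0]], dtype=float)
print(floyd_warshall(graph.copy()))
```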
    Automatic Porting of Basic Mathematics Library for 64-bit RISC-V
    CAO Hao, GUO Shao-zhong, LIU Dan, XU Jin-chen
    Computer Science    2021, 48 (6): 41-47.   DOI: 10.11896/jsjkx.201200058
    Constrained by core technologies, intellectual property rights and other objective conditions, the research and development of domestic chips has long been restricted. RISC-V, an open-source instruction set architecture (ISA), has the advantages of simplicity and modularity and is becoming a new choice for domestic processors. As one of the most fundamental core software libraries of a computer system, the basic mathematics library is particularly important to the software ecosystem and healthy development of domestic processors; however, RISC-V currently has no such library. This paper therefore ports the basic mathematics library of the domestic Shenwei processor to the 64-bit RISC-V platform. To port the library efficiently, an automatic porting framework is designed first, which achieves high scalability through loose coupling between functional modules. Secondly, based on the characteristics of the 64-bit RISC-V ISA, a global active register allocation method and a hierarchical instruction selection strategy are proposed. Finally, the framework is applied to port some typical functions of the Shenwei basic mathematics library. Test results show that the ported functions work correctly and perform better than GLIBC.
    Efficient Implementation of Generalized Dense Symmetric Eigenproblem Standardization Algorithm on GPU Cluster
    LIU Shi-fang, ZHAO Yong-hua, YU Tian-yu, HUANG Rong-feng
    Computer Science    2020, 47 (4): 6-12.   DOI: 10.11896/jsjkx.191000009
    The solution of generalized dense symmetric eigenproblems is a main task in many applied sciences and engineering fields and an important part of computations in electromagnetics, electronic structure, finite element models and quantum chemistry. Transforming a generalized symmetric eigenproblem into a standard symmetric eigenproblem is an important computational step in its solution. For GPU clusters, a blocked standardization algorithm for generalized dense symmetric eigenproblems based on MPI+CUDA is presented. To adapt to the architecture of GPU clusters, the algorithm combines the Cholesky decomposition of the positive definite matrix with the traditional blocked standardization algorithm, which reduces unnecessary communication overhead and increases parallelism. Moreover, in the MPI+CUDA based algorithm, data transfer between GPU and CPU is used to mask data copy operations inside the GPU, eliminating the time spent on copying and improving program performance. A fully parallel point-to-point transposition algorithm between the row and column communication domains of the two-dimensional communication grid is also presented, as well as a parallel blocked MPI+CUDA algorithm for solving triangular matrix equations BX=A with multiple right-hand sides. On the supercomputer "Era" of the Computer Network Information Center, Chinese Academy of Sciences, where each compute node is configured with 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors, matrices of different sizes are tested with up to 32 GPUs. The MPI+CUDA standardization algorithm achieves good acceleration and scalability; when a 50000×50000 matrix is tested with 32 GPUs, the peak performance reaches approximately 9.21 TFlops.
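The standardization step itself (not the MPI+CUDA blocked algorithm) can be sketched with SciPy as below: with B = LL^T, the generalized problem Ax = λBx becomes the standard problem Cy = λy with C = L^{-1}AL^{-T} and x = L^{-T}y. The small random matrices are only for checking the transformation.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, eigh

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n)); A = (A + A.T) / 2                # symmetric A
Bh = rng.normal(size=(n, n)); B = Bh @ Bh.T + n * np.eye(n)   # SPD B

# Standardization: B = L L^T, C = L^{-1} A L^{-T}.
L = cholesky(B, lower=True)
tmp = solve_triangular(L, A, lower=True)                      # L^{-1} A
C = solve_triangular(L, tmp.T, lower=True).T                  # L^{-1} A L^{-T}
w, Y = np.linalg.eigh(C)                                      # standard eigenproblem
X = solve_triangular(L, Y, lower=True, trans='T')             # x = L^{-T} y

# Check against SciPy's generalized solver.
w_ref = eigh(A, B, eigvals_only=True)
print(np.allclose(np.sort(w), np.sort(w_ref)))
```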
    Extreme-scale Simulation Based LBM Computing Fluid Dynamics Simulations
    LV Xiao-jing, LIU Zhao, CHU Xue-sen, SHI Shu-peng, MENG Hong-song, HUANG Zhen-chun
    Computer Science    2020, 47 (4): 13-17.   DOI: 10.11896/jsjkx.191000010
    The Lattice Boltzmann Method (LBM) is a computational fluid dynamics method based on mesoscopic simulation scales that has been widely used in theoretical research and engineering problems. Improving the parallel simulation capability of LBM computational fluid software is an important topic for high performance computing and its applications. This work designs and implements a highly scalable LBM computational fluid dynamics software, SWLBM, on the "Sunway TaihuLight" supercomputing system. Based on the architecture of the domestic many-core processor SW26010, several multi-level parallel optimization techniques are designed to boost the simulation speed and improve the scalability of SWLBM, including data reuse for the 19-point stencil, vectorization of the collision process and overlapping communication with computation. With these optimizations, numerical simulations with over 10 million cores and up to 5.6 trillion grid points are tested; SWLBM delivers up to a 172x speedup and a sustained floating-point performance of 4.7 PFlops. For the 10000×10000×5000-grid wind field simulation on a million cores, SWLBM achieves a core efficiency of 87%. The test results show that SWLBM can provide practical large-scale parallel simulation solutions for industrial applications.
    Efficient MILP Model for HW/SW Partitioning of Dynamic Partial Reconfigurable SoC
    ZHU Li-hua, WANG Ling, TANG Qi, WEI Ji-bo
    Computer Science    2020, 47 (4): 18-24.   DOI: 10.11896/jsjkx.190300001
    Heterogeneous System-on-Chip (SoC) integrates multiple types of processors on the same chip and has great advantages in processing capacity, size, weight and power consumption, so it is widely used in many fields. The SoC with dynamic partial reconfigurability (DPR-SoC) is an important type of heterogeneous SoC that combines software flexibility with hardware efficiency. The design of such systems usually involves hardware/software co-design, and partitioning an application between hardware and software is the key technology for ensuring the real-time performance of the system. The HW/SW partitioning problem in DPR-SoC can be formulated as a combinatorial optimization problem whose goal is to obtain the schedule of minimum length, including task mapping, ordering and timing. Mixed integer linear programming (MILP) is an effective method for solving combinatorial optimization problems, but building a proper model for a specific problem is the key to solving it and has a great impact on solving time. Existing MILP models for HW/SW partitioning of DPR-SoC have many variables and constraint equations, and the redundant ones increase solving time; moreover, their solutions do not match actual applications because too many assumptions are made. Addressing these problems, this paper proposes a novel model that reduces model complexity and improves suitability to the application: the application is modeled as a DAG and the problem is solved with an integer linear programming solver. Extensive results show that the proposed model reduces model complexity and shortens solving time, and the reduction becomes more significant as the problem scale grows.
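A stripped-down MILP of the kind described (illustrative only; it omits the resource-contention, communication and reconfiguration constraints a real DPR-SoC model needs) can be written as:

```latex
% x_{ip} = 1 if task i runs on processing resource p (SW core or HW region),
% s_i = start time, c_{ip} = execution time of i on p, C_max = makespan.
\begin{aligned}
\min\quad & C_{\max} \\
\text{s.t.}\quad
 & \textstyle\sum_{p} x_{ip} = 1 && \forall i \\
 & s_j \ \ge\ s_i + \textstyle\sum_{p} c_{ip}\, x_{ip} && \forall (i \rightarrow j) \in E \\
 & s_i + \textstyle\sum_{p} c_{ip}\, x_{ip} \ \le\ C_{\max} && \forall i
\end{aligned}
```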
    Extraction Algorithm of NDVI Based on GPU Multi-stream Parallel Model
    ZUO Xian-yu, ZHANG Zhe, SU Yue-han, LIU Yang, GE Qiang, TIAN Jun-feng
    Computer Science    2020, 47 (4): 25-29.   DOI: 10.11896/jsjkx.190500029
    Normalized Difference Vegetation Index (NDVI) extraction algorithms optimized on GPUs usually adopt a GPU multi-thread parallel model, in which data transfers between CPU and GPU and weakly correlated computations take considerable time and limit further performance improvement. Aiming at these problems and the characteristics of NDVI, an NDVI extraction algorithm based on a GPU multi-stream parallel model is proposed. Using the features of CUDA streams and Hyper-Q, the multi-stream parallel model overlaps not only data transfer with kernel execution but also kernel execution with kernel execution, further improving the parallelism and resource utilization of the GPU. Firstly, the NDVI algorithm is optimized with the GPU multi-thread parallel model, and the optimized procedure is decomposed to find the parts with data transfer or weakly correlated computation. Secondly, those parts are restructured and optimized with the multi-stream parallel model so that weakly correlated computations overlap with each other or with data transfers. Finally, experiments with both GPU parallel models are carried out on remote sensing images taken by the GF1 satellite. The results show that, for images larger than 12000×13400 pixels, the proposed algorithm achieves about 1.5x acceleration compared with the traditional parallel NDVI algorithm based on the GPU multi-thread model and about 260x acceleration compared with the sequential NDVI extraction algorithm, showing better performance and parallelism.
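The per-pixel formula being accelerated is NDVI = (NIR − Red)/(NIR + Red); a plain NumPy version is sketched below with synthetic bands, the CUDA multi-stream overlap being the paper's actual contribution.

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel.
    eps guards against division by zero over dark pixels."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + eps)

# Synthetic 4x4 tiles standing in for the NIR and red bands of a scene.
rng = np.random.default_rng(1)
nir_band = rng.integers(0, 1024, size=(4, 4))
red_band = rng.integers(0, 1024, size=(4, 4))
print(np.round(ndvi(nir_band, red_band), 3))   # values fall in [-1, 1]
```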
    Application of Atomic Dynamics Monte Carlo Program MISA-KMC in Study of Irradiation Damage of Reactor Pressure Vessel Steel
    WANG Dong, SHANG Hong-hui, ZHANG Yun-quan, LI Kun, HE Xin-fu, JIA Li-xia
    Computer Science    2020, 47 (4): 30-35.   DOI: 10.11896/jsjkx.191100045
    With the rapid development of materials science, the irradiation damage of the microstructure of nuclear materials (reactor pressure vessel steel) has become an important research topic, and the solute precipitation behavior in reactor pressure vessel steel can be simulated with the kinetic Monte Carlo method. In order to provide a theoretical basis for studying the microstructure evolution and performance changes of nuclear materials after long-term service, this paper introduces the parallel strategy and large-scale test results of the self-developed MISA-KMC program. After verifying the correctness of the program, the precipitation process of solute atoms in reactor pressure vessel steel is studied with MISA-KMC. The results show that, after a long period of evolution, solute atoms aggregate to form Cu-rich clusters, which are one of the main microstructural causes of the embrittlement of reactor pressure vessel steel. The accuracy of the MISA-KMC simulation results, the scale of simulation it supports and the diversity of simulated elements provide a foundation for subsequent research on material performance changes.
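A generic residence-time (BKL-style) kinetic Monte Carlo step, the basic building block such codes parallelize, is sketched below with made-up event rates; it is not MISA-KMC's algorithm or data layout.

```python
import numpy as np

def kmc_step(rates, rng):
    """One residence-time kinetic Monte Carlo step.
    rates : array of rates of all currently possible events (e.g. vacancy jumps)
    Returns (chosen event index, time increment)."""
    total = rates.sum()
    cumulative = np.cumsum(rates)
    event = int(np.searchsorted(cumulative, rng.random() * total))
    dt = -np.log(rng.random()) / total          # exponentially distributed waiting time
    return event, dt

# Toy loop: three event channels with fixed, invented rates (events per second).
rng = np.random.default_rng(42)
rates = np.array([1.0, 0.5, 0.1])
t, counts = 0.0, np.zeros(3, dtype=int)
for _ in range(10_000):
    e, dt = kmc_step(rates, rng)
    counts[e] += 1
    t += dt
print(counts / counts.sum())    # event frequencies ~ rates / rates.sum()
print(t)                        # total simulated time
```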
    Streaming Parallel Text Proofreading Based on Spark Streaming
    YANG Zong-lin, LI Tian-rui, LIU Sheng-jiu, YIN Cheng-feng, JIA Zhen, ZHU Jie
    Computer Science    2020, 47 (4): 36-41.   DOI: 10.11896/jsjkx.190300070
    The rapid development of the Internet has prompted the generation of massive amounts of online text, which poses new performance challenges for traditional serial text proofreading algorithms. Although automatic text proofreading has received more and more attention in recent years, related work mostly focuses on serial algorithms and rarely involves parallel proofreading. This paper first generalizes serial proofreading algorithms and gives a general framework for serial proofreading. Then, in view of the shortcomings of serial proofreading for large-scale text, three general parallelization methods are proposed: 1) a multi-thread parallel proofreading method, which parallelizes over paragraphs and proofreading functions simultaneously based on a thread pool; 2) a batch parallel proofreading method based on Spark MapReduce, which proofreads paragraphs in parallel through RDD parallel computing; 3) a parallel proofreading approach based on Spark Streaming, which converts the real-time computation over text streams into a series of small batch jobs over time slices, effectively avoiding fixed overhead and significantly reducing proofreading latency. Because streaming computing has the advantages of low latency and high throughput, the paper finally builds the parallel proofreading system on the streaming method. Performance comparison experiments demonstrate that thread parallelism is suitable for proofreading small-scale text, batch processing is suitable for off-line proofreading of large-scale text, and streaming parallel proofreading effectively eliminates a fixed delay of about 110 seconds; compared with batch proofreading, streaming proofreading with a real-time computing framework achieves a large performance improvement.
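A minimal Spark Streaming skeleton of the third approach is sketched below using the legacy DStream API; the socket source, batch interval and the toy check_paragraph function are placeholders, not the paper's proofreading pipeline.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def check_paragraph(text):
    # Stand-in for a real proofreading function: flag consecutive doubled words.
    words = text.split()
    errors = [w for a, w in zip(words, words[1:]) if a == w]
    return (text[:30], errors)

sc = SparkContext(appName="StreamingProofread")
ssc = StreamingContext(sc, 5)                         # 5-second micro-batches

paragraphs = ssc.socketTextStream("localhost", 9999)  # one paragraph per line
results = paragraphs.map(check_paragraph).filter(lambda kv: kv[1])
results.pprint()                                      # report paragraphs with findings

ssc.start()
ssc.awaitTermination()
```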
    Design of Fault-tolerant L1 Cache Architecture at Near-threshold Voltage
    CHENG Yu, LIU Wei, SUN Tong-xin, WEI Zhi-gang, DU Wei
    Computer Science    2020, 47 (4): 42-49.   DOI: 10.11896/jsjkx.190300088
    With aggressive silicon integration and increasing clock frequencies, power consumption and heat dissipation have become key challenges in the design of high-performance processors. Near-threshold computing (NTC) is emerging as a promising solution to achieve an order-of-magnitude reduction in energy consumption in future processors. However, reducing the supply voltage to near-threshold levels significantly increases SRAM bit-cell failures, leading to a high error rate in the L1 cache. Researchers have proposed techniques that correct errors in the L1 cache either by sacrificing capacity or by incurring additional latency, but most schemes only adapt to low bit-cell error rates and perform poorly at high error rates. This paper proposes a fault-tolerant first-level cache design (FTFLC) based on conventional 6T SRAM cells to solve the reliability challenges at high error rates. FTFLC adopts a two-level mapping mechanism that uses a block mapping mechanism and a bit correction mechanism to protect faulty bits in cache lines. In addition, an FTFLC initialization algorithm is proposed to improve the available cache capacity by combining the two mapping mechanisms. Experimental results show that, compared with three existing schemes, FTFLC improves performance by 3.86% and increases L1 cache capacity by 12.5% while maintaining low area and energy consumption.
    High Performance Computing and Astronomical Data:A Survey
    WANG Yang, LI Peng, JI Yi-mu, FAN Wei-bei, ZHANG Yu-jie, WANG Ru-chuan, CHEN Guo-liang
    Computer Science    2020, 47 (1): 1-6.   DOI: 10.11896/jsjkx.190900042
    Data is an important driver of astronomical development. Distributed storage and high performance computing (HPC) have a positive effect on handling the complexity, irregular storage and computation of massive astronomical data. The integration of multi-source information and multiple disciplines in astronomical research has become inevitable, and astronomical big data has entered the era of large-scale computing. HPC provides a new means for processing and analyzing astronomical big data and offers solutions to problems that cannot be solved by traditional methods. Based on the classification and characteristics of astronomical data, and with HPC as the supporting technology, this paper studies data fusion, efficient access, analysis, subsequent processing and visualization of astronomical big data, and summarizes the current state of the art. Furthermore, this paper summarizes the technical characteristics of the current stage, puts forward research strategies and technical methods for dealing with astronomical big data, and discusses the problems and development trends in its processing.
    Research on Locality-aware Design Mechanism of State-of-the-art Parallel Programming Languages
    YUAN Liang, ZHANG Yun-quan, BAI Xue-rui, ZHANG Guang-ting
    Computer Science    2020, 47 (1): 7-16.   DOI: 10.11896/jsjkx.181202409
    The memory-access locality of a parallel program has become an increasingly important factor for extracting performance from the increasingly complex memory hierarchies of current multi-core processors. In this paper, two different kinds of locality, horizontal locality and vertical locality, are proposed and defined. State-of-the-art parallel programming languages are investigated and analyzed, and the methods and mechanisms by which these languages describe and control memory-access locality are examined in detail from the perspectives of horizontal and vertical locality. Finally, future research directions for parallel programming languages are summarized, in particular the importance of integrating and supporting both horizontal and vertical locality in future parallel programming language research.
    Large-scale High-performance Lattice Boltzmann Multi-phase Flow Simulations Based on Python
    XU Chuan-fu, WANG Xi, LIU Shu, CHEN Shi-zhao, LIN Yu
    Computer Science    2020, 47 (1): 17-23.   DOI: 10.11896/jsjkx.190500009
    Thanks to its plentiful third-party libraries and development productivity, Python is becoming increasingly popular as a programming language in areas such as data science and artificial intelligence, and it also provides fundamental support for scientific and engineering computing; for example, libraries such as NumPy and SciPy provide efficient data structures for multi-dimensional arrays and rich numerical functions. Traditionally, Python was used as a script language that glues preprocessors, solvers and postprocessors together and improves the automation of numerical simulations. Recently, some researchers have implemented their solvers in Python and parallelized them on high performance computers, with impressive results achieved. Because of the intrinsic features of the language, implementing and optimizing high-performance large-scale numerical simulations in Python differs considerably from traditional languages such as C/C++ and Fortran. This paper presents PyLBMFlow, a large-scale parallel open-source 3D lattice Boltzmann multi-phase flow simulation code written in Python, and investigates large-scale parallel computing and performance optimization for Python numerical applications. LBM flow data structures and computational kernels are designed with NumPy multi-dimensional arrays and universal functions. Through a range of optimizations, including the reconstruction of boundary processing, the Python computation is accelerated by about 100x over the baseline version on one CPU core. Furthermore, a 3D decomposition method is designed and hybrid MPI+OpenMP parallelization is implemented using mpi4py and Cython. Tests on a 3D multi-phase (liquid and gas) problem with about 10 billion lattice points, simulating drop impact under gravity with the D3Q19 lattice Boltzmann discretization and the Shan-Chen BGK single-relaxation-time collision model, achieve a weak-scaling parallel efficiency above 90% when going from 64 to 1024 compute nodes.
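For context, the single-relaxation-time BGK update that D3Q19 Shan-Chen-type codes such as PyLBMFlow implement per lattice point is (standard form):

```latex
% f_i: distribution along lattice direction e_i, tau: relaxation time,
% f_i^{eq}: local equilibrium distribution.
f_i(\mathbf{x} + \mathbf{e}_i \,\Delta t,\; t + \Delta t)
  \;=\; f_i(\mathbf{x}, t) \;-\; \frac{1}{\tau}\,
        \bigl[f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t)\bigr]
```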
    Research on Adaptation of CFD Software Based on Many-core Architecture of 100P Domestic Supercomputing System
    LI Fang, LI Zhi-hui, XU Jin-xiu, FAN Hao, CHU Xue-sen, LI Xin-liang
    Computer Science    2020, 47 (1): 24-30.   DOI: 10.11896/jsjkx.181102176
    The domestic many-core supercomputing system provides two programming languages with different programming difficulty, and how well CFD software adapts to the many-core architecture decides which language should be used. This paper first briefly introduces the many-core architecture, the programming model and the programming languages. Then the challenges in adapting CFD software are analyzed, including the data dependences of implicit methods, the solution of large sparse linear systems, multigrid methods and unstructured grids; for each challenge a corresponding countermeasure is provided. Finally, the paper gives the speedups of some typical fluid dynamics software based on theoretical analysis and experiments. The results show that most CFD software adapts well to the domestic many-core architecture and can use the simpler programming language to obtain good parallel speedup on a million cores.
    High-performance Implementation Method for Even Basis of Cooley-Tukey FFT
    GONG Tong-yan, ZHANG Guang-ting, JIA Hai-peng, YUAN Liang
    Computer Science    2020, 47 (1): 31-39.   DOI: 10.11896/jsjkx.190900179
    Fast Fourier transform (FFT) is one of the most important basic algorithms and is widely used in scientific computing, signal processing, image processing and other fields. As real-time requirements in these application fields keep rising, FFT algorithms face higher and higher performance demands. In existing FFT libraries, the solution speed and calculation accuracy of the FFT are limited to some extent, and few researchers have proposed corresponding optimization strategies or conducted in-depth research on implementing the Cooley-Tukey fast Fourier transform for even radices. Based on this, this paper puts forward a set of optimization strategies and methods for even radices of the Cooley-Tukey FFT. Firstly, a butterfly network friendly to mixed SIMD computation is constructed. Secondly, according to the characteristics of even-radix twiddle factors, the complexity of the butterfly computation is reduced as far as possible. Thirdly, the implementation is optimized through SIMD assembly optimization, instruction rearrangement and selection, register allocation strategies and a high-performance matrix transpose algorithm. Finally, a high-performance FFT algorithm library is obtained. The most popular and widely used FFT libraries at present are FFTW and Intel MKL. Experimental results show that, on an X86 computing platform, the performance of the FFT library based on the Cooley-Tukey FFT is better than MKL and FFTW. The proposed optimization methods and implementation techniques can be generalized from even radices to the implementation and optimization of other radices, helping to break through FFT performance bottlenecks on the hardware platform and to build a high-performance FFT library for a specific platform.
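The even/odd (radix-2) split at the heart of the Cooley-Tukey factorization can be written in a few lines of NumPy; this recursive reference is only for orientation and is far from the SIMD-optimized library the paper builds.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey DFT (length must be a power of two).
    Illustrates the even/odd split and the twiddle factors; production libraries
    use iterative, vectorized, mixed-radix versions of the same idea."""
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    even = fft_radix2(x[0::2])                     # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])                      # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

x = np.random.default_rng(0).normal(size=16)
print(np.allclose(fft_radix2(x), np.fft.fft(x)))   # True
```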