Started in January 1974 (Monthly)
Supervised and Sponsored by Chongqing Southwest Information Co., Ltd.
ISSN 1002-137X
CN 50-1075/TP
CODEN JKIEBK
    Content of Computer Architecture in our journal
    Review of Visualization Drawing Methods of Flow Field Based on Streamlines
    ZHANG Qian, XIAO Li
    Computer Science    2021, 48 (12): 1-7.   DOI: 10.11896/jsjkx.201200108
    Flow visualization is an important branch of scientific visualization. It mainly visualizes the simulation results of computational fluid dynamics, providing researchers with intuitive graphical images to facilitate analysis. Known techniques for flow visualization include geometry-based methods, such as streamlines and particle tracking, and texture-based methods, such as LIC, spot noise and IBFV. Streamline visualization is an important and commonly used geometric method for flow field visualization. In streamline visualization, the placement of streamlines is the focus of the entire process, and the number and position of streamlines determine the overall visualization effect. Placing too many streamlines causes visual clutter, while too few leave the flow field information incompletely expressed and unable to be conveyed to domain experts. To achieve an accurate display of scientific data, streamline visualization has produced two important research directions: seed point placement and streamline reduction. This article surveys research on seed point placement and streamline reduction methods, summarizes problems and solutions adopted in 2D and 3D flow fields, and discusses the needs of streamline visualization in view of ever-growing scientific data.
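As a minimal sketch of the two streamline questions the survey raises (where to place seeds, how a line is traced), the following uses a uniform seed grid and explicit Euler integration in an analytic circular flow field. The field, step size and grid density are illustrative assumptions, not taken from the survey.

```python
import numpy as np

def velocity(p):
    # Analytic circular flow v(x, y) = (-y, x), standing in for a
    # simulated flow field sampled on a grid.
    return np.array([-p[1], p[0]])

def trace_streamline(seed, h=0.01, steps=200):
    """Integrate a streamline from a seed point with explicit Euler steps."""
    pts = [np.asarray(seed, dtype=float)]
    for _ in range(steps):
        pts.append(pts[-1] + h * velocity(pts[-1]))
    return np.array(pts)

def uniform_seeds(xmin, xmax, ymin, ymax, n):
    """Simplest seeding strategy: an n x n uniform grid of seed points."""
    xs = np.linspace(xmin, xmax, n)
    ys = np.linspace(ymin, ymax, n)
    return [np.array([x, y]) for x in xs for y in ys]

seeds = uniform_seeds(-1, 1, -1, 1, 4)
lines = [trace_streamline(s) for s in seeds]
```

The seeding methods the survey covers (feature-aware, similarity-based, etc.) replace `uniform_seeds` with smarter placement; streamline reduction would then prune `lines` by similarity.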
    Anomaly Propagation Based Fault Diagnosis for Microservices
    WANG Tao, ZHANG Shu-dong, LI An, SHAO Ya-ru, ZHANG Wen-bo
    Computer Science    2021, 48 (12): 8-16.   DOI: 10.11896/jsjkx.210100149
    Microservice architectures separate a large-scale complex application into multiple independent microservices. These microservices, built on various technology stacks, communicate with lightweight protocols to enable agile development and continuous delivery. Since an application using a microservice architecture has a large number of microservices communicating with each other, a faulty microservice can cause the microservices interacting with it to exhibit anomalies. Detecting anomalous microservices and locating the root-cause microservice has therefore become key to ensuring the reliability of a microservice-based application. To address this issue, this paper proposes an anomaly-propagation-based fault diagnosis approach for microservices. First, we monitor the interactions between microservices to construct a service dependency graph that characterizes anomaly propagation. Second, we construct a regression model between metrics and API calls to detect anomalous services. Third, we obtain the fault propagation subgraph by combining the service dependency graph with the detected anomalous services. Finally, we calculate the anomaly degree of microservices with a PageRank algorithm to locate the most likely root cause of the fault. The experimental results show that our approach can locate faulty microservices with low overhead.
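The final step, ranking candidate root causes with PageRank over the dependency graph, can be sketched as a personalized PageRank biased by anomaly scores. The call graph, anomaly scores and damping factor below are hypothetical, and the detection step (the paper's metric/API regression model) is assumed to have already produced the scores.

```python
import numpy as np

# Hypothetical five-service call graph: an edge (u, v) means u calls v.
edges = [("gateway", "orders"), ("gateway", "users"),
         ("orders", "inventory"), ("orders", "payments"),
         ("users", "payments")]
# Made-up anomaly scores from some upstream detector.
anomaly = {"gateway": 0.9, "orders": 0.8, "users": 0.1,
           "inventory": 0.05, "payments": 0.95}

nodes = sorted({x for e in edges for x in e})
idx = {x: i for i, x in enumerate(nodes)}
n = len(nodes)

# Random walker moves from a caller to one of its callees: a fault at a
# callee explains anomalies observed upstream at its (transitive) callers.
out = {x: 0 for x in nodes}
for u, v in edges:
    out[u] += 1
A = np.zeros((n, n))
for u, v in edges:
    A[idx[v], idx[u]] = 1.0 / out[u]

p = np.array([anomaly[x] for x in nodes], dtype=float)
p /= p.sum()                      # personalization: bias toward anomalies

r, d = p.copy(), 0.85
for _ in range(100):              # personalized PageRank power iteration
    r = d * (A @ r) + (1 - d) * p
root_cause = nodes[int(np.argmax(r))]
```

Here `payments` wins because it is both highly anomalous itself and reachable from the other anomalous services.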
    Method of Service Decomposition Based on Microservice Architecture
    JIANG Zheng, WANG Jun-li, CAO Rui-hao, YAN Chun-gang
    Computer Science    2021, 48 (12): 17-23.   DOI: 10.11896/jsjkx.210500078
    Microservice decomposition of a monolithic system can effectively alleviate the redundancy and maintenance difficulties of the monolithic architecture. However, existing microservice decomposition methods fail to make full use of the attribute information of the microservice architecture, which leads to poorly reasoned service decomposition results. This paper proposes a service decomposition method based on the microservice architecture. The method constructs an entity-attribute relationship graph from the association information of system services and attributes. Then service decomposition rules are formulated by combining the feature information of the microservice architecture with the demand information of the target system, the association information between the two types of vertices is quantified, and a weighted entity-attribute graph is generated. Finally, a weighted GN algorithm is applied to decompose the system into microservices automatically. The experimental results show that the method greatly improves the timeliness of service decomposition, and the generated microservice decomposition scheme performs better on various evaluation metrics.
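The weighted GN (Girvan-Newman) step can be illustrated with a brute-force sketch: repeatedly compute edge betweenness (counting one shortest path per vertex pair, with distance taken as the inverse of edge weight so strongly associated vertices are "close") and remove the most-between edge until the graph splits into the desired number of services. The toy graph is invented; a production GN uses Brandes' betweenness, not this all-pairs enumeration.

```python
import heapq
from collections import defaultdict

def shortest_path(adj, src, dst):
    """Dijkstra with path recovery; distance of an edge is 1/weight."""
    dist = {src: 0.0}; prev = {}
    pq = [(0.0, src)]; seen = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in seen: continue
        seen.add(u)
        if u == dst: break
        for v, w in adj[u].items():
            nd = d + 1.0 / w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd; prev[v] = u
                heapq.heappush(pq, (nd, v))
    if dst not in prev and dst != src: return None
    path = [dst]
    while path[-1] != src: path.append(prev[path[-1]])
    return path[::-1]

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen: continue
        stack, comp = [s], set()
        while stack:
            u = stack.pop()
            if u in comp: continue
            comp.add(u); stack.extend(adj[u])
        seen |= comp; comps.append(comp)
    return comps

def weighted_gn(edge_list, k):
    """Remove the highest-betweenness edge until k components remain."""
    adj = defaultdict(dict)
    for u, v, w in edge_list:
        adj[u][v] = w; adj[v][u] = w
    while len(components(adj)) < k:
        count = defaultdict(int)
        nodes = list(adj)
        for i, s in enumerate(nodes):
            for t in nodes[i + 1:]:
                p = shortest_path(adj, s, t)
                if not p: continue
                for a, b in zip(p, p[1:]):
                    count[frozenset((a, b))] += 1
        u, v = tuple(max(count, key=count.get))
        del adj[u][v]; del adj[v][u]
    return components(adj)

# Two tightly-coupled clusters joined by one weak association edge.
edges = [("A", "B", 3.0), ("B", "C", 3.0), ("A", "C", 3.0),
         ("D", "E", 3.0), ("E", "F", 3.0), ("D", "F", 3.0),
         ("C", "D", 1.0)]
services = weighted_gn(edges, 2)
```

The weak bridge C-D carries every cross-cluster shortest path, so it is removed first, yielding the two candidate microservices.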
    Parallel WMD Algorithm Based on GPU Acceleration
    HU Rong, YANG Wang-dong, WANG Hao-tian, LUO Hui-zhang, LI Ken-li
    Computer Science    2021, 48 (12): 24-28.   DOI: 10.11896/jsjkx.210600213
    Word Mover's Distance (WMD) is a method of measuring text similarity. It defines the difference between two texts as the minimum distance between the word embedding vectors of the texts. WMD uses the vocabulary to represent each text as a normalized bag-of-words vector. Since the words of a text occupy only a small proportion of the corpus, the document vector generated by the bag-of-words model is very sparse. Multiple documents form a high-dimensional sparse matrix, and such a sparse matrix incurs a lot of unnecessary operations. By calculating the WMD of a single source document against multiple target documents at once, the calculation process can be highly parallelized. Aiming at the sparsity of text vectors, this paper proposes a GPU-based parallel Sinkhorn-WMD algorithm, which stores the target texts in a compressed format to improve memory utilization and reduces intermediate calculations by exploiting the sparse structure. Pre-trained word embedding vectors are used to calculate the word distance matrix, and the WMD algorithm is improved accordingly; the optimized algorithm is verified on two public news datasets. The experimental results show that the parallel algorithm on an NVIDIA TITAN RTX can achieve a speedup of up to 67.43x over the CPU serial algorithm.
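The Sinkhorn iteration at the heart of Sinkhorn-WMD can be sketched densely in a few lines: given two normalized bag-of-words vectors and a word-distance matrix, alternate row/column scalings of a Gibbs kernel until the transport plan's marginals match, then read off the transport cost. The vocabulary, embeddings and regularization strength are illustrative; the paper's contribution (compressed sparse storage and GPU batching over many targets) is not reproduced here.

```python
import numpy as np

def sinkhorn_wmd(a, b, M, reg=0.1, iters=200):
    """Entropy-regularized approximation of WMD between nBOW vectors a, b
    given word-distance matrix M, via Sinkhorn iterations."""
    K = np.exp(-M / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):        # alternate marginal corrections
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]   # transport plan
    return float((T * M).sum())

# Toy 3-word vocabulary with 1-D "embeddings" at positions 0, 1, 2.
e = np.array([0.0, 1.0, 2.0])
M = np.abs(e[:, None] - e[None, :])   # word distance matrix
a = np.array([0.5, 0.5, 0.0])         # doc 1: words 0 and 1
b = np.array([0.0, 0.5, 0.5])         # doc 2: words 1 and 2
d = sinkhorn_wmd(a, b, M)
```

Rows of zero mass in `a` (and columns for `b`) drop out of the plan automatically, which is the property the sparse/compressed variant exploits.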
    High Performance Implementation and Optimization of Trigonometric Functions Based on SIMD
    YAO Jian-yu, ZHANG Yi-wei, ZHANG Guang-ting, JIA Hai-peng
    Computer Science    2021, 48 (12): 29-35.   DOI: 10.11896/jsjkx.201200135
    As a basic mathematical operation, the high-performance implementation of trigonometric functions is of great significance to the construction of the processor's basic software ecosystem. In particular, current processors have adopted SIMD architectures, so high-performance SIMD-based implementations of trigonometric functions have important research significance and application value. In this regard, this paper uses numerical analysis methods to implement and optimize five commonly used trigonometric functions (sin, cos, tan, atan and atan2) with high performance. Based on an analysis of the IEEE 754 floating-point standard, an efficient trigonometric function algorithm is designed. Then, the accuracy is further improved by applying the Taylor formula, Pade approximation and the Remez algorithm for polynomial approximation. Finally, performance is further improved by using instruction pipelining and SIMD optimization. The experimental results show that, while satisfying the accuracy requirements, the implemented trigonometric functions achieve a large performance improvement on the ARM V8 computing platform: their time performance is 1.77~6.26 times that of the libm algorithm library and 1.34~1.5 times that of the ARM_M algorithm library.
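The standard structure behind such kernels (range reduction by quadrants of pi/2, then a short polynomial on the reduced argument) can be sketched as follows. For brevity the coefficients are plain Taylor terms rather than a tuned Remez/minimax fit, and the reduction uses naive arithmetic rather than the extended-precision reduction a real libm needs for huge arguments.

```python
import math

# Taylor coefficients for sin(x)/x and cos(x) on |x| <= pi/4.
SIN_C = [1.0, -1.0/6, 1.0/120, -1.0/5040, 1.0/362880]
COS_C = [1.0, -1.0/2, 1.0/24, -1.0/720, 1.0/40320]

def _poly(coeffs, x2):
    # Horner evaluation in x^2 (even/odd polynomial trick).
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = acc * x2 + c
    return acc

def fast_sin(x):
    """sin(x) via quadrant range reduction + polynomial core."""
    k = round(x / (math.pi / 2))      # x = k*(pi/2) + r, |r| <= pi/4
    r = x - k * (math.pi / 2)
    q = k % 4                         # which quadrant formula applies
    if q == 0:
        return r * _poly(SIN_C, r * r)
    if q == 1:
        return _poly(COS_C, r * r)
    if q == 2:
        return -r * _poly(SIN_C, r * r)
    return -_poly(COS_C, r * r)
```

Because the core polynomial is branch-free, the same code maps directly onto SIMD lanes once `round` and the quadrant select are expressed with vector instructions.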
    Quantum Fourier Transform Simulation Based on “Songshan” Supercomputer System
    XIE Jing-ming, HU Wei-fang, HAN Lin, ZHAO Rong-cai, JING Li-na
    Computer Science    2021, 48 (12): 36-42.   DOI: 10.11896/jsjkx.201200023
    The “Songshan” supercomputer system is a new generation of heterogeneous supercomputer cluster independently developed by China; the CPUs and DCU accelerators it carries are also developed domestically. In order to expand the scientific computing ecosystem of the platform and verify the feasibility of quantum computing research on it, this paper uses a heterogeneous programming model to implement a heterogeneous version of quantum Fourier transform simulation on the “Songshan” system. The computing hotspots of the program are offloaded to the DCUs; then MPI is used to run multiple processes on a single computing node so that data transmission and computation on the DCU accelerators proceed concurrently; finally, computation is overlapped with communication so that the DCUs do not sit idle during data transmission. The experiment implements a 44-qubit quantum Fourier transform simulation on the supercomputing system for the first time. The results show that the heterogeneous version of the quantum Fourier transform module makes full use of the computing resources of the DCU accelerators, achieves a speedup of 11.594 over the traditional CPU version, and scales well on the cluster. This implementation provides a reference for simulating and optimizing other quantum algorithms on the “Songshan” supercomputer system.
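What a QFT state-vector simulation actually computes can be shown at toy scale: apply the standard QFT circuit (Hadamards, controlled phase rotations, final bit reversal) to a 2^n-element amplitude vector and check it matches the DFT matrix. This is a single-node sketch in the textbook gate convention; at 44 qubits the same vector is what gets sharded across nodes and DCUs.

```python
import numpy as np

def apply_1q(state, gate, t, n):
    """Apply a 2x2 gate to qubit t (qubit 0 = most significant bit)."""
    s = state.reshape(2**t, 2, 2**(n - t - 1))
    return np.einsum('ab,ibj->iaj', gate, s).reshape(-1)

def apply_cphase(state, c, t, phase, n):
    """Multiply amplitudes where qubits c and t are both 1 by `phase`."""
    out = state.copy()
    for i in range(len(state)):
        if (i >> (n - 1 - c)) & 1 and (i >> (n - 1 - t)) & 1:
            out[i] *= phase
    return out

def qft(state):
    n = int(np.log2(len(state)))
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    for i in range(n):
        state = apply_1q(state, H, i, n)
        for k in range(2, n - i + 1):
            state = apply_cphase(state, i + k - 1, i,
                                 np.exp(2j * np.pi / 2**k), n)
    # Final bit-reversal permutation of the qubit order.
    out = np.empty_like(state)
    for i in range(len(state)):
        out[int(format(i, f'0{n}b')[::-1], 2)] = state[i]
    return out
```

The per-gate reshape in `apply_1q` is the operation whose strided memory access dominates on accelerators, which is why communication hiding matters at scale.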
    DGX-2 Based Optimization of Application for Turbulent Combustion
    WEN Min-hua, WANG Shen-peng, WEI Jian-wen, LI Lin-ying, ZHANG Bin, LIN Xin-hua
    Computer Science    2021, 48 (12): 43-48.   DOI: 10.11896/jsjkx.201200129
    Numerical simulation of turbulent combustion is a key tool for aeroengine design. Because high-precision models of the Navier-Stokes equations are required, numerical simulation of turbulent combustion demands a huge amount of computation, and the physicochemical models make the flow field extremely complicated, making load balancing a bottleneck for large-scale parallelization. We port and optimize a numerical simulation method for turbulent combustion on a powerful computing server, the DGX-2. We design a threading method for flux calculation and use the Roofline model to guide the optimization. In addition, we design an efficient communication method and propose a multi-GPU parallel method for turbulent combustion based on the high-speed interconnect of the DGX-2. The results show that the performance of a single V100 GPU is 8.1x higher than that of a dual-socket Intel Xeon 6248 CPU node with 40 cores, and the multi-GPU version on the DGX-2 with 16 V100 GPUs achieves a 66.1x speedup, exceeding the best performance on a CPU cluster.
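The Roofline reasoning used to guide the optimization fits in one formula: attainable throughput is the minimum of peak compute and arithmetic intensity times memory bandwidth. The hardware numbers below are approximate V100-class figures and the kernel counts are made up for illustration; the paper's actual flux-kernel numbers are not given here.

```python
def roofline_gflops(peak_gflops, mem_bw_gbs, flops, bytes_moved):
    """Attainable GFLOP/s = min(peak, arithmetic_intensity * bandwidth)."""
    ai = flops / bytes_moved          # FLOP per byte moved to/from memory
    return min(peak_gflops, ai * mem_bw_gbs)

# Approximate V100 figures: ~7000 GFLOP/s FP64 peak, ~900 GB/s HBM2.
# A flux-like kernel doing 8 FLOPs per 16 bytes is memory-bound:
flux_bound = roofline_gflops(7000.0, 900.0, flops=8.0, bytes_moved=16.0)
```

A kernel sitting on the memory roof (like `flux_bound` here) is optimized by cutting bytes moved (fusion, blocking), not by adding FLOP throughput.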
    Loop Fusion Strategy Based on Data Reuse Analysis in Polyhedral Compilation
    HU Wei-fang, CHEN Yun, LI Ying-ying, SHANG Jian-dong
    Computer Science    2021, 48 (12): 49-58.   DOI: 10.11896/jsjkx.210200071
    Existing polyhedral compilation tools often use simple heuristic strategies to find loop fusion decisions, and the fusion strategy must be adjusted manually to get the best performance for different programs. To solve this problem, a fusion strategy based on data reuse analysis is proposed for multi-core CPU platforms. This strategy avoids unnecessary fusion constraints that hinder the mining of data locality. For different stages of scheduling, parallelism constraints for different parallel levels are proposed, and a tiling constraint for CPU cache optimization is proposed for statements with complex array accesses. Compared with previous loop fusion strategies, this strategy takes into account changes in spatial locality when calculating fusion profits. The strategy is implemented in the polyhedral compilation module Polly of the LLVM compilation framework, and test cases from suites such as Polybench are selected for evaluation. In single-core tests, compared with existing fusion strategies, the average performance is improved by 14.9%~62.5%; in multi-core tests it is improved by 19.7%~94.9%, with speedups of up to 1.49x~3.07x.
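The data-reuse payoff of loop fusion can be shown with the textbook transformation the strategy decides about: two nests where the second re-reads what the first produced are fused into one pass, so the intermediate value is consumed while still hot (and here even scalarized away). This is a hand-written Python illustration of the transformation's legality and effect, not Polly's actual scheduling.

```python
import numpy as np

def unfused(a):
    b = np.empty_like(a)
    c = np.empty_like(a)
    for i in range(len(a)):      # nest 1: writes all of b
        b[i] = 2.0 * a[i]
    for i in range(len(a)):      # nest 2: re-reads b after it left cache
        c[i] = b[i] + 1.0
    return c

def fused(a):
    c = np.empty_like(a)
    for i in range(len(a)):      # one pass: b[i] consumed immediately
        bi = 2.0 * a[i]          # intermediate scalarized, never stored
        c[i] = bi + 1.0
    return c
```

Fusion is legal here because iteration i of the second nest depends only on iteration i of the first; a reuse-based profit model would count the eliminated traversal of `b` as the gain.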
    Performance Skeleton Analysis Method Towards Component-based Parallel Applications
    FU Tian-hao, TIAN Hong-yun, JIN Yu-yang, YANG Zhang, ZHAI Ji-dong, WU Lin-ping, XU Xiao-wen
    Computer Science    2021, 48 (6): 1-9.   DOI: 10.11896/jsjkx.201200115
    Performance skeleton analysis technology (PSTAT) provides input parameters for performance modeling of parallel applications by describing their program structure, and is the basis of performance analysis and optimization for large-scale parallel applications. Aiming at a class of component-based parallel applications in the field of numerical simulation, and building on dynamic and static program structure analysis for general binary files, this paper proposes and implements an automatic performance skeleton generation method based on a “component-loop-call” tree. On this foundation, a performance skeleton analysis toolkit, CLCT-STAT (Component-Loop-Call-Tree SkeleTon Analysis Toolkit), is developed. The method can automatically identify the function symbols of component class members in component-based applications and generate the performance skeleton of a parallel application with components as the smallest unit. Compared with manually generating performance skeletons through analytical modeling, the proposed method provides more program structure information and saves the cost of manual analysis.
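The "component-loop-call" tree itself is just a nesting structure recovered from observed region entries and exits. A minimal sketch, with an invented trace format (the toolkit's real input is binary-level analysis, not such a trace):

```python
class Node:
    def __init__(self, kind, name):
        self.kind, self.name, self.children = kind, name, []

def build_skeleton(events):
    """Fold a flat enter/exit trace into a component-loop-call tree.
    Each event is (op, kind, name) with op in {'enter', 'exit'}."""
    root = Node("root", "app")
    stack = [root]
    for op, kind, name in events:
        if op == "enter":
            node = Node(kind, name)
            stack[-1].children.append(node)
            stack.append(node)
        else:                    # 'exit' closes the innermost open region
            stack.pop()
    return root

# Hypothetical trace: one component containing a loop with two calls.
trace = [("enter", "component", "Solver"),
         ("enter", "loop", "time_step"),
         ("enter", "call", "compute_flux"), ("exit", "call", "compute_flux"),
         ("enter", "call", "update_state"), ("exit", "call", "update_state"),
         ("exit", "loop", "time_step"),
         ("exit", "component", "Solver")]
skeleton = build_skeleton(trace)
```

Performance-model parameters (iteration counts, per-call costs) would then be attached to the tree's nodes.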
    Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor
    ZHU Yu, PANG Jian-min, XU Jin-long, TAO Xiao-han, WANG Jun
    Computer Science    2021, 48 (6): 10-18.   DOI: 10.11896/jsjkx.200700059
    Stencil computation is an important part of scientific computing and large-scale applications. Tiling is a widely used technique to exploit the data locality of stencil computation. In existing methods of 3D stencil optimization on SW26010, time tiling is rarely used and the tiling size must be tuned manually. To solve this problem, this paper introduces time tiling and proposes an adaptive tiling size algorithm for 3D stencil computation on the SW26010 many-core processor. By establishing a performance analysis model, we systematically analyze the influence of tiling size on the performance of 3D stencil computation, identify the performance bottleneck, and guide the optimization direction under hardware resource constraints. Based on this model, the adaptive tiling size algorithm provides a predicted optimal tiling size, which helps deploy 3D stencils rapidly on the SW26010 processor. 3D-7P and 3D-27P stencils are selected for experiments. Compared with results without time tiling, the speedups of the two examples with the optimal tiling size given by our algorithm reach 1.47 and 1.29, and the optimal tiling size found in the experiments is consistent with that given by our model, which verifies the proposed performance analysis model and adaptive tiling size algorithm.
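What spatial tiling of a 3D-7P stencil means can be shown concretely: the same 7-point update computed tile by tile, where the tile extents are the parameters the paper's model would choose so a tile (plus halo) fits in the SW26010 scratch-pad memory. The averaging stencil and tile sizes below are illustrative; time tiling (blocking across sweeps) is not shown.

```python
import numpy as np

def stencil7_ref(u):
    """Reference 3D 7-point update over the interior (vectorized)."""
    v = u.copy()
    v[1:-1, 1:-1, 1:-1] = (u[1:-1, 1:-1, 1:-1]
                           + u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1]
                           + u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1]
                           + u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]) / 7.0
    return v

def stencil7_tiled(u, tx, ty, tz):
    """Same update, computed tile by tile of extent (tx, ty, tz)."""
    v = u.copy()
    n0, n1, n2 = u.shape
    for i0 in range(1, n0 - 1, tx):
        for j0 in range(1, n1 - 1, ty):
            for k0 in range(1, n2 - 1, tz):
                for i in range(i0, min(i0 + tx, n0 - 1)):
                    for j in range(j0, min(j0 + ty, n1 - 1)):
                        for k in range(k0, min(k0 + tz, n2 - 1)):
                            v[i, j, k] = (u[i, j, k]
                                          + u[i-1, j, k] + u[i+1, j, k]
                                          + u[i, j-1, k] + u[i, j+1, k]
                                          + u[i, j, k-1] + u[i, j, k+1]) / 7.0
    return v
```

Any tile size gives the same result; the performance model's job is to pick the extents that balance halo overhead against on-chip memory capacity.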
    List-based Software and Hardware Partitioning Algorithm for Dynamic Partial Reconfigurable System-on-Chip
    GUO Biao, TANG Qi, WEN Zhi-min, FU Juan, WANG Ling, WEI Ji-bo
    Computer Science    2021, 48 (6): 19-25.   DOI: 10.11896/jsjkx.200700198
    Parallel computing is an important means to improve the utilization of system resources, and more and more multiprocessor systems-on-chip meet the requirements of different computing tasks by integrating processors with different functional characteristics. The dynamically partially reconfigurable heterogeneous multiprocessor system-on-chip (DPR-HMPSoC) is widely used because of its good parallelism and high computing efficiency, and a software/hardware partitioning algorithm with low complexity and high solution quality is an important guarantee for fully exploiting its computational performance advantages. Existing software/hardware partitioning algorithms have high time complexity and insufficient support for the DPR-HMPSoC platform. To address these problems, this paper proposes a list-based heuristic software/hardware partitioning and scheduling algorithm. By constructing a scheduling list based on task priority, it completes task scheduling, mapping, and FPGA dynamic partial reconfigurable area partitioning. The paper introduces the software application model, the computing platform model, and the detailed design of the proposed algorithm. Simulation results show that the proposed algorithm can effectively reduce solution time compared with MILP and ACO algorithms, with a time advantage proportional to the task scale. In terms of scheduling length, the average performance of the proposed algorithm is improved by about 10%.
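The list-scheduling core (priority list from critical-path ranks, then greedy earliest-finish mapping onto heterogeneous resources) can be sketched as below. The task graph and costs are invented, "cpu"/"fpga" stand in for the software and hardware sides, and FPGA reconfiguration overhead and area partitioning (key parts of the paper) are deliberately omitted.

```python
def list_schedule(tasks, preds, cost, procs):
    """Priority-list scheduling onto heterogeneous processors.
    preds[t] = predecessor tasks; cost[t][p] = runtime of t on p."""
    succs = {t: [] for t in tasks}
    for t, ps in preds.items():
        for p in ps:
            succs[p].append(t)

    rank = {}                        # upward rank: critical-path priority
    def upward(t):
        if t not in rank:
            w = sum(cost[t].values()) / len(procs)   # mean exec time
            rank[t] = w + max((upward(s) for s in succs[t]), default=0.0)
        return rank[t]
    for t in tasks:
        upward(t)

    free = {p: 0.0 for p in procs}   # time each processor becomes free
    finish, mapping = {}, {}
    for t in sorted(tasks, key=lambda t: -rank[t]):
        ready = max((finish[p] for p in preds.get(t, [])), default=0.0)
        # Greedy: pick the processor giving the earliest finish time.
        p = min(procs, key=lambda p: max(free[p], ready) + cost[t][p])
        finish[t] = max(free[p], ready) + cost[t][p]
        free[p] = finish[t]
        mapping[t] = p
    return mapping, max(finish.values())

tasks = ["A", "B", "C", "D"]
preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
cost = {"A": {"cpu": 2.0, "fpga": 4.0},
        "B": {"cpu": 3.0, "fpga": 1.0},
        "C": {"cpu": 3.0, "fpga": 1.0},
        "D": {"cpu": 2.0, "fpga": 4.0}}
mapping, makespan = list_schedule(tasks, preds, cost, ["cpu", "fpga"])
```

Because an upward rank is always larger than that of any successor, processing the list in rank order never schedules a task before its predecessors.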
    Implementation of Transcendental Functions on Vectors Based on SIMD Extensions
    LIU Dan, GUO Shao-zhong, HAO Jiang-wei, XU Jin-chen
    Computer Science    2021, 48 (6): 26-33.   DOI: 10.11896/jsjkx.200400007
    The basic mathematics function library is a critical software module in a computer system. However, long-vector transcendental functions on the domestic Shenwei platform can currently only be implemented indirectly by cyclically invoking the system's scalar functions, which limits the computing capability of the Shenwei platform's SIMD extensions. To solve this problem effectively, this paper implements long-vector transcendental functions through low-level optimization of the Shenwei SIMD extensions, and proposes a floating-point computing fusion algorithm to address the difficulty of vectorizing two-branch algorithm structures. It also proposes an implementation method for high-degree polynomials based on dynamic grouping in the Estrin algorithm, which improves the pipelining performance of polynomial evaluation. This is the first implementation of a long-vector transcendental function library on the Shenwei platform. The provided function interfaces include trigonometric, inverse trigonometric, logarithmic and exponential functions. The experimental results show that the maximum error of the double-precision version is below 3.5 ULP (units in the last place) and the maximum error of the single-precision version is below 0.5 ULP. Compared with the scalar functions of the Shenwei platform, performance is significantly improved, with an average speedup of 3.71.
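The Estrin idea the abstract mentions contrasts with Horner evaluation: Horner forms one long dependence chain, while Estrin pairs adjacent coefficients so the pairs can be evaluated independently (and thus pipelined or vectorized), combining them with successive squarings of x. A compact sketch of both, without the paper's dynamic-grouping refinement:

```python
def horner(coeffs, x):
    """Sequential Horner evaluation: each step depends on the previous."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def estrin(coeffs, x):
    """Estrin's scheme: combine independent coefficient pairs with
    powers x, x^2, x^4, ... halving the dependence depth each level."""
    terms = list(coeffs)
    p = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)    # pad to an even count
        terms = [terms[i] + terms[i + 1] * p
                 for i in range(0, len(terms), 2)]
        p = p * p
    return terms[0]
```

Each level's pair-combines are mutually independent, which is exactly what keeps a SIMD pipeline full for high-degree polynomials.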
    Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform
    HE Ya-ru, PANG Jian-min, XU Jin-long, ZHU Yu, TAO Xiao-han
    Computer Science    2021, 48 (6): 34-40.   DOI: 10.11896/jsjkx.201100051
    The Floyd algorithm for finding shortest paths in a weighted graph is a key building block used frequently in a variety of practical applications. However, the Floyd algorithm cannot scale to large graphs due to its time complexity, so parallel implementations for different architectures have been proposed and proved effective. To address the lack of an efficient parallel implementation of the Floyd algorithm on domestically designed processors, this paper implements and optimizes the Floyd algorithm for the Sunway platform. More specifically, this paper implements the algorithm using the programming model designed for the heterogeneous architecture of the Sunway TaihuLight and identifies the performance bottleneck when it executes on the target. It then improves the performance of the Floyd algorithm by means of algorithmic optimization, array partitioning and double buffering. The experimental results show that the implementation on the Sunway platform achieves a speedup of up to 106x over the sequential version executed on the management processing element of the SW26010 processor.
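The restructuring that makes Floyd-Warshall parallelizable can be sketched in numpy: within each k-step, every (i, j) relaxation is independent because row k and column k do not change during that step, so the whole step becomes one broadcast-minimum. This is a generic illustration of the data-parallel form, not the Sunway-specific partitioned/double-buffered implementation.

```python
import numpy as np

def floyd_naive(D):
    """Textbook triple-loop Floyd-Warshall on a cost matrix."""
    D = D.copy()
    n = len(D)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i, j] = min(D[i, j], D[i, k] + D[k, j])
    return D

def floyd_vectorized(D):
    """Row/column broadcast form of the k-th relaxation: the whole
    (i, j) plane updates at once, which is what maps onto many cores."""
    D = D.copy()
    for k in range(len(D)):
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D
```

Array partitioning then splits the (i, j) plane across compute elements, with row k and column k broadcast to each tile per step.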
    Automatic Porting of Basic Mathematics Library for 64-bit RISC-V
    CAO Hao, GUO Shao-zhong, LIU Dan, XU Jin-chen
    Computer Science    2021, 48 (6): 41-47.   DOI: 10.11896/jsjkx.201200058
    Subject to objective conditions such as core technologies and intellectual property rights, the research and development of domestic independent chips is highly restricted. As an open-source instruction set architecture (ISA), RISC-V has the advantages of simplicity and modularity, and it is becoming a new choice for domestic processors. As one of the most basic core software libraries of a computer system, the basic mathematics library is particularly important to the software ecosystem and healthy development of domestic processors. However, RISC-V has no such basic mathematics library at present. Therefore, this paper ports the basic mathematics library of the domestic Shenwei processor to the 64-bit RISC-V platform. To port the library efficiently, an automatic porting framework is first designed, which achieves high scalability through loose coupling between functional modules. Second, based on the characteristics of the 64-bit RISC-V ISA, a global active register allocation method and a hierarchical instruction selection strategy are proposed. Finally, the framework is applied to port some typical functions of the Shenwei basic mathematics library. Test results show that the ported functions work correctly and their performance is improved compared with GLIBC.