Started in January 1974 (Monthly)
Supervised and Sponsored by Chongqing Southwest Information Co., Ltd.
ISSN 1002-137X
CN 50-1075/TP
CODEN JKIEBK
    Content of High Performance Computing in our journal
    Performance Optimization of Complex Stencil in Weather Forecast Model WRF
    DI Jianqiang, YUAN Liang, ZHANG Yunquan, ZHANG Sijia
    Computer Science    2024, 51 (4): 56-66.   DOI: 10.11896/jsjkx.231000124
    The Weather Research and Forecasting model (WRF) is a widely used mesoscale numerical weather prediction system that plays an important role in atmospheric research and operational meteorological forecasting. Stencil computation is a common nested-loop pattern in scientific and engineering applications, and WRF performs a large number of complex stencil computations on spatial grids to solve the numerical equations of atmospheric dynamics and thermodynamics. The stencils in WRF are characterized by multi-dimensionality, multiple variables, the particularity of physical model boundaries, and the complexity of the physical and dynamical processes. This study analyzes the typical stencil patterns in WRF, identifies and abstracts the concept of the “intermediate variable”, and implements three optimization schemes: intermediate-variable computation merging, intermediate-variable dimensionality-reduction storage, and intermediate-variable extraction. These schemes effectively improve data locality, increase data reuse and spatial reuse rates, and reduce redundant computation and memory access overhead. The results show that typical hotspot functions of WRF 4.2 achieve significant performance improvements on both Intel and Hygon CPUs, with maximum speedups of 21.3% and 17.8%, respectively.
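    As an illustration of the ideas only (not the authors' WRF code; the arrays and the stencil itself are invented for the example), the C++ sketch below shows two of the optimizations in miniature: the intermediate quantity is computed inside the consumer loop (computation merging) and kept in a one-dimensional scratch column instead of a full 3D array (dimensionality-reduction storage).

    #include <cstdio>
    #include <vector>
    #include <cstddef>

    // Hypothetical 3D stencil: out depends on an intermediate t that
    // originally lived in a full nk*nj*ni array filled by a separate loop nest.
    void stencil_merged(const std::vector<double>& a, std::vector<double>& out,
                        std::size_t nk, std::size_t nj, std::size_t ni) {
        auto idx = [=](std::size_t k, std::size_t j, std::size_t i) {
            return (k * nj + j) * ni + i;
        };
        std::vector<double> t(nk);   // 1D scratch column: dimensionality-reduction storage
        for (std::size_t j = 1; j + 1 < nj; ++j)
            for (std::size_t i = 1; i + 1 < ni; ++i) {
                // Intermediate variable computed on the fly for this (j,i) column
                // right before its consumer: computation merging.
                for (std::size_t k = 0; k < nk; ++k)
                    t[k] = 0.5 * (a[idx(k, j, i - 1)] + a[idx(k, j, i + 1)]);
                // The consumer stencil reuses t[k] and t[k-1] without re-reading a[].
                for (std::size_t k = 1; k < nk; ++k)
                    out[idx(k, j, i)] = a[idx(k, j, i)] + 0.25 * (t[k] - t[k - 1]);
            }
    }

    int main() {
        const std::size_t nk = 8, nj = 16, ni = 16;
        std::vector<double> a(nk * nj * ni, 1.0), out(nk * nj * ni, 0.0);
        stencil_merged(a, out, nk, nj, ni);
        std::printf("out[center] = %f\n", out[(nk / 2 * nj + nj / 2) * ni + ni / 2]);
        return 0;
    }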
    Transplantation and Optimization of Graph Matching Algorithm Based on Domestic DCU Heterogeneous Platform
    HAO Meng, TIAN Xueyang, LU Gangzhao, LIU Yi, ZHANG Weizhe, HE Hui
    Computer Science    2024, 51 (4): 67-77.   DOI: 10.11896/jsjkx.230800193
    Subgraph matching is a basic graph algorithm that is widely used in fields such as social networks and graph neural networks. As the scale of graph data grows, there is an increasing need for efficient subgraph matching algorithms. GENEVA is a GPU-based parallel subgraph matching algorithm that uses an interval-index graph storage structure and parallel matching optimizations to greatly reduce storage overhead and improve subgraph matching performance. However, due to differences in the underlying hardware architecture and compilation environment, GENEVA cannot be applied directly to the domestic DCU platform. To solve this problem, this paper proposes a transplantation and optimization scheme of GENEVA for the domestic DCU. IO time is the main performance bottleneck of the GENEVA algorithm, and this paper proposes three optimization strategies, page-locked memory, preloading, and a scheduler, to alleviate it. Page-locked memory avoids the additional data transfer from pageable memory to temporary page-locked memory and greatly reduces IO transfer time on the DCU platform; preloading overlaps IO data transfer with DCU kernel computation to hide IO time; the scheduler reduces redundant data transfer while satisfying the preloading requirements. Experiments are carried out on three real-world datasets of different sizes, and the results show that the optimization strategies significantly improve algorithm performance. In 92.6% of the test cases, the execution time of the optimized GENEVA-HIP on the Sugon DCU platform is lower than that of the unported GENEVA on the GPU server. On a larger dataset, the execution time of the optimized GENEVA-HIP algorithm on the DCU platform is reduced by 52.73% compared with the pre-port GENEVA algorithm on the GPU server.
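    The page-locked memory and preloading optimizations can be illustrated with a generic HIP double-buffering sketch (this is not GENEVA's code; the kernel, buffer sizes, and chunk count are placeholders). Pinned host memory lets hipMemcpyAsync transfer directly without an extra staging copy, and two streams let one chunk's transfer overlap the previous chunk's kernel.

    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void process(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;                  // placeholder kernel work
    }

    int main() {
        const int chunk = 1 << 20, nChunks = 8;
        float *hIn, *hOut, *dIn[2], *dOut[2];
        // Page-locked (pinned) host buffers: async copies go straight to the device
        // without an extra staging copy through temporary pinned memory.
        hipHostMalloc((void**)&hIn, sizeof(float) * chunk * nChunks, hipHostMallocDefault);
        hipHostMalloc((void**)&hOut, sizeof(float) * chunk * nChunks, hipHostMallocDefault);
        for (int i = 0; i < chunk * nChunks; ++i) hIn[i] = 1.0f;

        hipStream_t s[2];
        for (int b = 0; b < 2; ++b) {
            hipMalloc((void**)&dIn[b], sizeof(float) * chunk);
            hipMalloc((void**)&dOut[b], sizeof(float) * chunk);
            hipStreamCreate(&s[b]);
        }
        // Preloading by double buffering: while one stream computes chunk c,
        // the other stream is already transferring chunk c+1.
        for (int c = 0; c < nChunks; ++c) {
            int b = c & 1;
            hipMemcpyAsync(dIn[b], hIn + (size_t)c * chunk, sizeof(float) * chunk,
                           hipMemcpyHostToDevice, s[b]);
            hipLaunchKernelGGL(process, dim3((chunk + 255) / 256), dim3(256), 0, s[b],
                               dIn[b], dOut[b], chunk);
            hipMemcpyAsync(hOut + (size_t)c * chunk, dOut[b], sizeof(float) * chunk,
                           hipMemcpyDeviceToHost, s[b]);
        }
        hipDeviceSynchronize();
        printf("hOut[0] = %f\n", hOut[0]);                 // expect 2.0
        hipHostFree(hIn); hipHostFree(hOut);
        return 0;
    }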
    Auto-vectorization Cost Model Based on Instruction MKS
    WANG Zhen, NIE Kai, HAN Lin
    Computer Science    2024, 51 (4): 78-85.   DOI: 10.11896/jsjkx.230200024
    The auto-vectorization cost model is an important component of a compiler's auto-vectorization optimization. Its role is to evaluate whether code will gain performance after a vectorization transformation is applied; when the cost model is inaccurate, the compiler applies vectorization transformations with negative benefit, reducing the execution efficiency of the program. Aiming at the inaccuracy of the GCC compiler's default cost model, an auto-vectorization cost model based on instruction MKS is proposed for the Intel Xeon Silver 4214R CPU. The model fully considers the machine mode, operation type, and operation intensity of instructions, and uses a gradient descent algorithm to automatically search for the approximate costs of different instruction types. Single-thread tests are carried out on SPEC2006 and SPEC2017. Experimental results show that the model reduces the error of benefit estimation: compared with the vector programs generated by the default cost model, GCC with the MKS cost model achieves a maximum speedup of 4.72% on the SPEC2006 benchmark and 7.08% on the SPEC2017 benchmark.
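    The cost search described above can be pictured with a small self-contained sketch (not the paper's implementation; the instruction classes, counts, and learning rate are made up): per-loop instruction-class counts and measured cycle counts are fitted by least-squares gradient descent to recover one cost per instruction class.

    #include <vector>
    #include <cstdio>

    // counts[i][k]: how many instructions of class k loop i executes per iteration.
    // cycles[i]  : measured cost of one iteration of loop i.
    // Fit cost[k] so that sum_k counts[i][k]*cost[k] is close to cycles[i].
    std::vector<double> fit_costs(const std::vector<std::vector<double>>& counts,
                                  const std::vector<double>& cycles,
                                  int iters = 20000, double lr = 1e-3) {
        const std::size_t n = counts.size(), m = counts[0].size();
        std::vector<double> cost(m, 1.0);                 // initial guess: 1 cycle each
        for (int it = 0; it < iters; ++it) {
            std::vector<double> grad(m, 0.0);
            for (std::size_t i = 0; i < n; ++i) {
                double pred = 0.0;
                for (std::size_t k = 0; k < m; ++k) pred += counts[i][k] * cost[k];
                double err = pred - cycles[i];            // residual of the cost model
                for (std::size_t k = 0; k < m; ++k) grad[k] += 2.0 * err * counts[i][k];
            }
            for (std::size_t k = 0; k < m; ++k) cost[k] -= lr * grad[k] / n;
        }
        return cost;
    }

    int main() {
        // Two hypothetical instruction classes, three sample loops.
        std::vector<std::vector<double>> counts = {{4, 1}, {2, 3}, {6, 2}};
        std::vector<double> cycles = {7, 11, 12};         // generated with true costs 1 and 3
        auto c = fit_costs(counts, cycles);
        std::printf("fitted costs: %.2f %.2f\n", c[0], c[1]);
        return 0;
    }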
    Floating-point Expression Precision Optimization Method Based on Multi-type Calculation Rewriting
    HAO Jiangwei, YANG Hongru, XIA Yuanyuan, LIU Yi, XU Jinchen, PANG Jianmin
    Computer Science    2024, 51 (4): 86-94.   DOI: 10.11896/jsjkx.221200072
    Expression rewriting is an emerging method in the field of precision optimization. Its core idea is to transform an expression into a semantically equivalent one, without changing its precision representation, in order to improve accuracy. However, given the large number of transformation rules and the huge transformation space, the key problem for rewriting methods is how to choose an appropriate transformation strategy. To address this problem, this paper proposes a precision optimization method for floating-point expressions based on multi-type calculation rewriting, which supports expressions containing both mathematical function calls and the four basic arithmetic operations, and implements an expression rewriting tool, exprAuto. Unlike other precision optimization tools that focus on replacing sub-expressions, exprAuto pays more attention to transforming the order of operations in an expression. After the expression is simplified and mathematically transformed, exprAuto obtains different calculation orders through polynomial transformation and tries to improve precision by reducing the number of operations. Finally, exprAuto generates a set of equivalent expressions with different calculation orders and selects the final precision optimization result through sorting, screening, and error detection. In this paper, 41 expressions from the FPBench benchmark suite and 18 approximating polynomials of common mathematical functions are selected as test cases. After optimization by exprAuto, the maximum error is reduced by 45.92% and the average error by 34.98% compared with the original expressions; for the 18 approximating polynomials, the maximum error is reduced by 58.35% and the average error by 43.73%. Experimental results show that exprAuto can effectively improve the precision of expressions, especially polynomials.
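    A familiar stand-alone example of the same idea of changing the calculation order (generic background, not output of exprAuto): evaluating a polynomial in Horner form instead of summing separately computed powers reduces the operation count and usually the accumulated rounding error.

    #include <cmath>
    #include <cstdio>

    // Naive evaluation: each power is computed separately.
    double poly_naive(const double* c, int n, double x) {
        double s = 0.0;
        for (int i = 0; i <= n; ++i) s += c[i] * std::pow(x, i);
        return s;
    }

    // Horner form: one multiply and one add per coefficient, usually more accurate.
    double poly_horner(const double* c, int n, double x) {
        double s = c[n];
        for (int i = n - 1; i >= 0; --i) s = s * x + c[i];
        return s;
    }

    int main() {
        // Degree-6 Taylor coefficients of exp(x) around 0, used here only as a demo polynomial.
        const double c[] = {1, 1, 1.0 / 2, 1.0 / 6, 1.0 / 24, 1.0 / 120, 1.0 / 720};
        double x = 0.731;
        std::printf("naive : %.17g\n", poly_naive(c, 6, x));
        std::printf("horner: %.17g\n", poly_horner(c, 6, x));
        return 0;
    }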
    Survey on High-performance Computing Technology and Standards
    LU Pingjing, XIONG Zeyu, LAI Mingche
    Computer Science    2023, 50 (11): 1-7.   DOI: 10.11896/jsjkx.221100021
    As an indispensable support for knowledge and technological innovation, high-performance computing (HPC) is an important component of the scientific and technological innovation system. In the new era, as an alternative mode of scientific research, it is as important as theory and experiment. Over the past thirty years, HPC has achieved remarkable improvement and has entered the era of exascale computing. China has made notable progress in HPC, with a series of achievements represented by Tianhe, Sunway, and Dawning, and the development level of China's high-performance systems ranks among the best internationally. After the end of Moore's law, performance gains through semiconductor miniaturization become challenging; in the post-Moore era, opportunities for growth in computing performance will increasingly come from software, algorithms, and hardware architecture. Meanwhile, there are still many deficiencies in the development of HPC standards. This paper analyzes the current status and development trends of HPC technology and standards, reviews the state of the art of current HPC standards, and argues for the necessity and importance of developing national HPC standards.
    Acceleration Design and FPGA Implementation of CNN Scene Matching Algorithm
    WANG Xiaofeng, LI Chaoran, LU Kunfeng, LUAN Tianjiao, YAO Na, ZHOU Hui, XIE Yujia
    Computer Science    2023, 50 (11): 8-14.   DOI: 10.11896/jsjkx.221100104
    Compared with traditional methods, CNN-based scene matching algorithms have higher matching accuracy, better adaptability, and stronger anti-interference ability. However, such algorithms have massive computing and storage requirements, which makes them difficult to deploy at the edge. To improve real-time computing, an efficient edge-side acceleration scheme is designed and implemented. On the basis of analyzing the computational characteristics and overall architecture of the algorithm, a correlation-specific accelerator (CSA) is designed based on the Winograd fast convolution method, and an acceleration scheme is proposed in which the CSA and a deep-learning processor unit (DPU) compute the feature correlation layer and the feature extraction network in a pipelined manner. Experiments on Xilinx's ZCU102 development board show that the peak performance of the CSA reaches 576 GOPS, its actual performance reaches 422.08 GOPS, and the DSP usage efficiency reaches 4.5 operations/clock. The peak performance of the acceleration system reaches 1 600 GOPS, and the processing latency of the algorithm is reduced to 157.89 ms. Experimental results show that the acceleration scheme can efficiently utilize the computing resources of the FPGA and realize real-time computing of the CNN-based scene matching algorithm.
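    As background on the Winograd fast convolution method underlying the CSA (generic textbook form, not the paper's FPGA design), the 1D transform F(2,3) below produces two outputs of a 3-tap filter with 4 multiplications instead of 6; 2D CNN kernels nest the same construction.

    #include <cstdio>

    // Winograd F(2,3): two outputs of a 3-tap filter g over inputs d[0..3]
    // using 4 multiplications (direct convolution would need 6).
    void winograd_f23(const double d[4], const double g[3], double y[2]) {
        double m1 = (d[0] - d[2]) * g[0];
        double m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) * 0.5;
        double m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) * 0.5;
        double m4 = (d[1] - d[3]) * g[2];
        y[0] = m1 + m2 + m3;        // equals d0*g0 + d1*g1 + d2*g2
        y[1] = m2 - m3 - m4;        // equals d1*g0 + d2*g1 + d3*g2
    }

    int main() {
        double d[4] = {1, 2, 3, 4}, g[3] = {0.5, 0.25, 0.125}, y[2];
        winograd_f23(d, g, y);
        std::printf("%.3f %.3f\n", y[0], y[1]);   // expect 1.375 and 2.250
        return 0;
    }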
    Fast Performance Evaluation Method for Processor Design
    DENG Lin, ZHANG Yao, LUO Jiahao
    Computer Science    2023, 50 (11): 15-22.   DOI: 10.11896/jsjkx.220900250
    In the face of increasingly complex processor designs and limited design cycles, how to perform performance evaluation efficiently and quickly is a problem faced by every processor design team. A complete performance test suite requires a long run time, especially in the pre-silicon validation phase, and this high time cost makes it impossible for design teams to use the full suite for performance evaluation and analysis. This paper proposes Fast-Eval, a general processor performance evaluation method based on the SimPoint technique; through a fast parallel-BBV method, the selection of optimal simulation points, and hot migration of simulation points, it significantly reduces both performance test time and BBV generation time. Experimental results show that on an ARM64 processor the performance evaluation time is reduced to 16.88% of the original and the BBV generation time to 1.26% of the original, with an average relative error of the performance evaluation results of 0.53%. On an FPGA board, the average relative error of the test set reaches 0.40%, and the running time is only 0.93% of the full run time.
    Convergence Analysis of Multigrid Solver for Cahn-Hilliard Equation
    GUO Jing, QI Deyu
    Computer Science    2023, 50 (11): 23-31.   DOI: 10.11896/jsjkx.220800030
    The Cahn-Hilliard (CH) equation is a fundamental nonlinear equation in the phase-field model and is usually analyzed using numerical methods. Numerical discretization yields a nonlinear system of equations, and the full approximation scheme (FAS) is an efficient multigrid iterative scheme for solving such nonlinear systems. The numerous articles on solving the CH equation focus mainly on the convergence of the numerical scheme, without addressing the stability of the solver. In this paper, the convergence of the multigrid algorithm for the nonlinear system arising from the discretized CH equation is established, which theoretically guarantees the reliability of the computation. For a finite-difference scheme of the CH equation that is second-order in both space and time, the fast subspace descent (FASD) framework is used to estimate the convergence constant of its FAS multigrid solver. First, the original difference problem is transformed into a fully equivalent finite element problem, which is shown to arise from the minimization of a convex energy functional. It is then verified that the energy functional and the spatial decomposition satisfy the assumptions of the FASD framework, and the convergence coefficient estimate of the original multigrid algorithm is obtained. The results show that in the nonlinear case the parameter ε in the CH equation imposes a restriction on the grid size, and the numerical calculation may fail to converge when ε is too small. Finally, the spatial and temporal accuracy of the numerical scheme is verified by numerical experiments, and the dependence of the convergence coefficient on the equation parameters and grid scale is analyzed.
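    For reference, the CH equation in its commonly written constant-mobility form (the paper's exact discretization is not reproduced here) is the fourth-order nonlinear evolution equation

        \partial_t u = \Delta\left( f'(u) - \varepsilon^2 \Delta u \right), \qquad f(u) = \tfrac{1}{4}(u^2 - 1)^2, \quad f'(u) = u^3 - u,

    where ε is the interface-width parameter whose size restriction on the grid is discussed in the abstract.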
    Study on Cross-platform Heterogeneous Parallel Computing for Lattice Boltzmann Multi-phase Flow Simulations Based on SYCL
    DING Yue, XU Chuanfu, QIU Haozhong, DAI Weixi, WANG Qingsong, LIN Yongzhen, WANG Zhenghua
    Computer Science    2023, 50 (11): 32-40.   DOI: 10.11896/jsjkx.230300123
    Heterogeneous parallel architecture is an important technology trend in current high-performance computing. Since different heterogeneous platforms usually support different programming models, developing cross-platform, performance-portable heterogeneous parallel applications is difficult. SYCL is a single-source, cross-platform parallel programming open standard based on C++. Current research on SYCL mainly focuses on performance comparisons with other parallel programming models; there is little research on the different parallel kernel implementations provided by SYCL and their performance optimization. To address this, the open-source multi-phase flow simulation software openLBMflow is implemented based on the SYCL programming model for cross-platform heterogeneous parallel simulation, and performance optimization methods for SYCL parallel applications are systematically summarized by comparing the basic parallel version, the fine-grained tuned ND-range parallel version, and a many-to-one mapping of computation to work-items. The results show that on an Intel Xeon Platinum 9242 CPU and an NVIDIA Tesla V100 GPU, the basic parallel kernel achieves a speedup of 2.91 on the CPU without additional tuning compared to the optimized OpenMP parallel implementation, indicating the out-of-the-box performance advantage of SYCL. Using the basic parallel version as a baseline, the ND-range parallel version achieves up to 1.45x speedup on the CPU and 2.23x on the GPU by changing the work-group size and shape. By changing and optimizing the number and shape of lattices processed per work-item, the many-to-one mapping approach achieves up to 1.57x speedup on the CPU and 1.34x on the GPU compared to the basic parallel version. The results suggest that SYCL applications benefit most from the many-to-one mapping of computation to work-items on the CPU and from ND-range parallel kernels on the GPU.
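    The two kernel styles compared above can be sketched in a minimal SYCL 2020 example (generic code, not openLBMflow): a basic parallel_for over a range, where the runtime picks the work-group size, versus an nd_range kernel in which the work-group size (here 256) is chosen explicitly, which is the knob the fine-grained tuning varies.

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;                                   // default device (CPU or GPU)
        const size_t n = 1 << 20;
        float* a = sycl::malloc_shared<float>(n, q);
        for (size_t i = 0; i < n; ++i) a[i] = 1.0f;

        // Basic data-parallel kernel: the runtime chooses the work-group size.
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            a[i[0]] = a[i[0]] * 2.0f;
        }).wait();

        // ND-range kernel: the work-group size is set explicitly to 256.
        q.parallel_for(sycl::nd_range<1>(sycl::range<1>(n), sycl::range<1>(256)),
                       [=](sycl::nd_item<1> it) {
            size_t i = it.get_global_id(0);
            a[i] = a[i] + 1.0f;
        }).wait();

        sycl::free(a, q);
        return 0;
    }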
    Many-core Optimization Method for the Calculation of Ab initio Polarizability
    LUO Haiwen, WU Yangjun, SHANG Honghui
    Computer Science    2023, 50 (6): 1-9.   DOI: 10.11896/jsjkx.220700162
    Density-functional perturbation theory (DFPT) based on quantum mechanics can be used to calculate a variety of physicochemical properties of molecules and materials and is now widely used in the research of new materials. Meanwhile, heterogeneous many-core processor architectures are becoming the mainstream of supercomputing. Therefore, redesigning and optimizing DFPT programs for heterogeneous many-core processors to improve their computational efficiency is of great importance for the computation of physicochemical properties and their scientific applications. In this work, the computation of the first-order response density and the first-order response Hamiltonian matrix in DFPT is optimized for the many-core processor architecture and verified on the new-generation Sunway processors. The optimization techniques include loop tiling, discrete memory access processing, and collaborative reduction. Loop tiling divides tasks so that they can be executed by many cores in parallel; discrete memory access processing converts discrete accesses into more efficient contiguous memory accesses; collaborative reduction solves the write-conflict problem. Experimental results show that the performance of the optimized program improves by a factor of 8.2 to 74.4 over the pre-optimization program on one core group, with good strong and weak scalability.
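    Two of the techniques, loop tiling and conflict-free reduction, can be sketched in portable C++ with OpenMP (the Sunway-specific athread interface is not shown, and the array is a stand-in for the real data):

    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 4096, tile = 256;
        std::vector<double> a(static_cast<size_t>(n) * n, 1.0);
        double total = 0.0;

        // Loop tiling: the row range is split into tiles that cores process in parallel.
        // The OpenMP reduction clause plays the role of a collaborative reduction:
        // each core accumulates privately and partial sums are combined without
        // write conflicts on the shared variable.
        #pragma omp parallel for reduction(+ : total) schedule(static)
        for (int ti = 0; ti < n; ti += tile) {
            double local = 0.0;
            for (int i = ti; i < ti + tile; ++i)
                for (int j = 0; j < n; ++j)          // contiguous (row-major) accesses
                    local += a[static_cast<size_t>(i) * n + j];
            total += local;
        }
        std::printf("sum = %.1f (expected %.1f)\n", total, double(n) * n);
        return 0;
    }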
    Implementation and Optimization of Apache Spark Cache System Based on Mixed Memory
    WEI Sen, ZHOU Haoran, HU Chuang, CHENG Dazhao
    Computer Science    2023, 50 (6): 10-21.   DOI: 10.11896/jsjkx.220900261
    With increasing data scale in the big-data era, in-memory computing frameworks have grown significantly. The mainstream in-memory computing framework Apache Spark uses memory to cache intermediate results, which greatly improves data-processing performance. At the same time, non-volatile memory (NVM) with fast read and write performance has great prospects in in-memory computing, so there is huge promise in building Spark's cache with a mix of DRAM and NVM. In this paper, a Spark cache system based on DRAM-NVM hybrid memory is proposed. A flat hybrid cache model is selected as the design scheme, a dedicated data structure is designed for the cache-block management system, and the overall architecture of the hybrid cache system for Spark is presented. In addition, to keep frequently accessed cache blocks in the DRAM cache, a hybrid cache management strategy based on the minimum reuse cost of cache blocks is proposed: the future reuse of an RDD is obtained from the DAG information, cache blocks with high future reuse counts are stored in the DRAM cache first, and the migration cost is considered when a cache block is migrated. Experiments show that the DRAM-NVM hybrid cache achieves an average performance improvement of 53.06% over the original cache system, and the proposed strategy improves on the default cache strategy by 35.09% on average for the same hybrid memory. Moreover, with only 1/4 DRAM and 3/4 NVM as the cache, the designed hybrid system achieves 85.49% of the running-time performance of an all-DRAM cache.
    Parallel DVB-RCS2 Turbo Decoding on Multi-core CPU
    ZHAI Xulun, ZHANG Yongguang, JIN Anzhao, QIANG Wei, LI Mengbing
    Computer Science    2023, 50 (6): 22-28.   DOI: 10.11896/jsjkx.230300005
    DVB-RCS2 is widely used in satellite broadcasting, maritime satellite communication, and military satellite communication. For high-throughput software decoding of the dual-binary Turbo codes in DVB-RCS2 and their application on software-defined radio platforms, a high-speed parallel software decoding scheme based on multi-core CPUs is proposed. First, the computational complexity of dual-binary Turbo codes and traditional binary Turbo codes is compared and analyzed. Then, a parallel decoding implementation based on multi-core CPUs is designed, and the memory footprint and input quantization method for parallel computing with 8-bit integer data are analyzed and optimized. Finally, the software decoder exceeds 169 Mbps of information throughput using SSE instructions on a 12-core Intel CPU, and the BER performance degradation is less than 0.1 dB compared to a floating-point decoder. The results show that the proposed implementation is a competitive alternative to GPU implementations in terms of throughput and energy efficiency and has high application value in high-speed satellite receivers.
    Lock-free Parallel Semi-naive Algorithm Based on Multi-core CPU
    YU Ting, WANG Lisong, QIN Xiaolin
    Computer Science    2023, 50 (6): 29-35.   DOI: 10.11896/jsjkx.220800050
    Datalog systems have a wide range of applications in fields such as graph databases, networking, and static program analysis. When dealing with massive data, serial Datalog evaluation strategies cannot fully utilize the computational resources of modern multi-core processors. To address this problem, a lock-free parallelized semi-naive algorithm based on multi-core CPUs, parallel semi-naive (PSN), is proposed to support efficient Datalog evaluation. PSN uses a B+ tree index to partition data and allocates the partitions to different threads for computation. The intermediate result tuples generated from different partitions are disjoint, which enables lock-free parallelism during the computation. PSN indexes intermediate results with a two-level hash table to improve the efficiency of duplicate checking; each thread only performs computations in its own region, without using locks to prevent write conflicts. PSN also adopts a task queue and a thread pool to assign tasks to idle threads and achieve multi-thread load balancing. Experimental results on public datasets from the Koblenz network collection (KONECT) and the Stanford network analysis platform (SNAP) show that PSN can improve the query performance of Datalog rules.
    Virtual Machine Consolidation Algorithm Based on Decision Tree and Improved Q-learning by Uniform Distribution
    SHI Liang, WEN Liangming, LEI Sheng, LI Jianhui
    Computer Science    2023, 50 (6): 36-44.   DOI: 10.11896/jsjkx.220300192
    As the scale of cloud data centers expands, problems caused by sub-optimal virtual machine consolidation algorithms, such as high energy consumption, low resource utilization, and reduced quality of service, become increasingly prominent. Therefore, this paper proposes DTQL-UD, a virtual machine consolidation algorithm based on decision trees and Q-learning improved by uniform distribution. It uses a decision tree to characterize states, selects the next action by uniform distribution when evaluating the next state-action value, and optimizes decision-making with real-time feedback from the state of the cloud data center during the virtual machine migration process. In addition, to narrow the gap between the simulator and the real world in reinforcement learning, the simulator is trained with a supervised learning model on a large amount of real cluster load traces to improve its fidelity. Experimental results show that, compared with existing heuristic methods, DTQL-UD improves energy consumption, resource utilization, quality of service, the number of virtual machine migrations, and the number of remaining active hosts by 14%, 12%, 21%, 40%, and 10%, respectively. Meanwhile, owing to the stronger feature extraction capability of decision trees on tabular data, DTQL-UD learns better scheduling strategies than existing deep reinforcement learning (DRL) methods, and as the cluster size increases, the proposed algorithm reduces the training time of traditional reinforcement learning models by 60% to 92% in our experiments.
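    For context, the tabular Q-learning update that such methods build on is the standard rule below; as I read the abstract, the "uniform distribution" improvement replaces the max over next actions with a uniformly sampled or averaged next action when evaluating the next state-action value, but that reading is an interpretation of the abstract, not the paper's exact formula.

        Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]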
    QR Decomposition Based on Double-double Precision Gram-Schmidt Orthogonalization Method
    JIN Jiexi, XIE Hehu, DU Peibing, QUAN Zhe, JIANG Hao
    Computer Science    2023, 50 (6): 45-51.   DOI: 10.11896/jsjkx.230200209
    The Gram-Schmidt orthogonalization algorithm and its modified variants often show numerical instability when computing ill-conditioned or large-scale matrices. To address this problem, this paper studies the accumulation of round-off errors in the modified Gram-Schmidt algorithm (MGS), and then designs and implements a double-double precision modified Gram-Schmidt orthogonalization algorithm (DDMGS) based on error-free transformation techniques and double-double precision arithmetic. A variety of accuracy tests show that DDMGS has better numerical stability than the BMGS_SVL, BMGS_CWY, BCGS_PIP, and BCGS_PIO variants, demonstrating that DDMGS can effectively reduce the loss of orthogonality of the matrix and improve numerical accuracy. In the performance test, the floating-point operation counts (flops) of the different algorithms are calculated and DDMGS is compared with the modified Gram-Schmidt algorithm on ARM and Intel processors; the runtime of the proposed DDMGS is about 5.03 and 18.06 times that of MGS, respectively, but the accuracy is improved significantly.
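    The error-free transformations on which double-double arithmetic rests can be shown in a few lines (a generic sketch, not the DDMGS code): Knuth's TwoSum and an FMA-based TwoProd return both the rounded result and its exact rounding error, and a double-double number carries that error term as a second component.

    #include <cmath>
    #include <cstdio>

    // Error-free transformation of a sum (Knuth): s + e == a + b exactly,
    // where s is the rounded sum and e the rounding error.
    void two_sum(double a, double b, double& s, double& e) {
        s = a + b;
        double bb = s - a;
        e = (a - (s - bb)) + (b - bb);
    }

    // Error-free transformation of a product using fused multiply-add:
    // p + e == a * b exactly.
    void two_prod(double a, double b, double& p, double& e) {
        p = a * b;
        e = std::fma(a, b, -p);
    }

    int main() {
        double s, e;
        two_sum(1.0, 1e-17, s, e);      // 1e-17 is lost in s but recovered in e
        std::printf("s = %.17g, e = %.17g\n", s, e);
        double p, pe;
        two_prod(1.0 + 1e-8, 1.0 + 1e-8, p, pe);
        std::printf("p = %.17g, e = %.17g\n", p, pe);
        return 0;
    }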
    Lattice QCD Calculation and Optimization on ARM Processors
    SUN Wei, BI Yujiang, CHENG Yaodong
    Computer Science    2023, 50 (6): 52-57.   DOI: 10.11896/jsjkx.230200159
    Lattice quantum chromodynamics (lattice QCD) is one of the most important applications of large-scale parallel computing in high-energy physics. Research in this field usually consumes a large amount of computing resources, and its core is solving large-scale sparse linear systems. Based on the domestic Kunpeng 920 ARM processor, this paper studies the hot spot of lattice QCD calculation, the Dslash operator, which is run on up to 64 nodes (6 144 cores) and shows linear scalability. Based on the roofline performance model, we find that lattice QCD is a typical memory-bound application, and by compressing the 3×3 complex unitary matrices in Dslash based on symmetry, the performance of Dslash is improved by 22%. For solving large-scale sparse linear systems, we also compare the usual Krylov subspace iterative algorithms such as BiCGStab with the newly developed state-of-the-art multigrid algorithm on the same ARM processor, and find that in practical physics calculations the multigrid algorithm is several times to an order of magnitude faster than BiCGStab, even including the multigrid setup time. Moreover, using the NEON vectorization instructions on the Kunpeng 920 yields up to a further 20% improvement for the multigrid algorithm. Therefore, the use of multigrid algorithms on ARM processors can speed up physics research tremendously.
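    The compression of 3×3 complex unitary matrices mentioned above is the standard lattice-QCD two-row trick: an SU(3) link matrix is determined by its first two rows, and the third row is the complex conjugate of the cross product of the first two, so only 12 of the 18 real numbers need to be stored or transferred. A sketch independent of the paper's code:

    #include <complex>
    #include <cstdio>

    using cd = std::complex<double>;

    // Reconstruct row 2 of an SU(3) matrix from rows 0 and 1:
    // u[2] = conj(u[0] x u[1]), valid because U is unitary with det(U) = 1.
    void reconstruct_third_row(cd u[3][3]) {
        u[2][0] = std::conj(u[0][1] * u[1][2] - u[0][2] * u[1][1]);
        u[2][1] = std::conj(u[0][2] * u[1][0] - u[0][0] * u[1][2]);
        u[2][2] = std::conj(u[0][0] * u[1][1] - u[0][1] * u[1][0]);
    }

    int main() {
        // The identity is a trivial SU(3) element; rows 0 and 1 suffice to rebuild row 2.
        cd u[3][3] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 0}};
        reconstruct_third_row(u);
        std::printf("row 2 = (%g,%g) (%g,%g) (%g,%g)\n",
                    u[2][0].real(), u[2][0].imag(),
                    u[2][1].real(), u[2][1].imag(),
                    u[2][2].real(), u[2][2].imag());   // expect (0,0) (0,0) (1,0)
        return 0;
    }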
    CP2K Software Porting and Optimization Based on Domestic c86 Processor
    FAN Lilin, QIAO Yihang, LI Junfei, CHAI Xuqing, CUI Rongpei, HAN Bingyu
    Computer Science    2023, 50 (6): 58-65.   DOI: 10.11896/jsjkx.230200213
    CP2K is currently the fastest open-source first-principles materials calculation and simulation software, and the part of its source code that calls the coprocessor is written for the CUDA architecture. Because of differences in the underlying hardware architecture and compilation environment, native CP2K cannot call the DCU on the domestic c86 processor platform, preventing cross-platform use. To solve this problem, a CP2K porting scheme for this platform is proposed. The core idea is to analyze the code of the DBCSR library in CP2K, which is mainly based on CUDA interfaces, disassemble the encapsulation of the corresponding structures and classes, and re-implement and package them according to the HIP programming standard. The HIP version of the DBCSR library is compiled and installed on the domestic c86 processor platform and linked into CP2K, so that CP2K can finally run with the DCU. Two test cases are then selected and optimized at the compilation level and the runtime level, and it is found that removing the FFTW library automatically installed by the CP2K toolchain script improves the accuracy of the calculation results. Experimental results show that the optimization methods used can significantly improve the computational efficiency and accuracy of CP2K and contribute to the porting, optimization, and localization of open-source software for domestic platforms.
    Online Service Function Chain Orchestration Method for Profit Maximization
    HUANG Hua, JIANG Jun, YANG Yongkang, CAO Bin
    Computer Science    2023, 50 (6): 66-73.   DOI: 10.11896/jsjkx.220400156
    With the development of network function virtualization technology, how to deploy service function chains flexibly to maximize profit has become one of the major challenges for network service providers. In this paper, we formulate the service function chain orchestration problem for multiple data centers as a 0-1 integer program with the goal of maximizing profit, and propose a two-stage heuristic algorithm to solve it. In the first stage, the weights of nodes and links are calculated according to load conditions and deployment cost, the service function chain is deployed on the node with the highest priority, and the highest-priority link that satisfies the bandwidth constraint is then selected according to the load condition. In the second stage, by analogy with the longest effective function sequence method, a virtualized network function migration strategy is proposed to reduce the consumption of deployment resources. Simulation experiments are designed based on the NSFNET and USNET network topologies. Experimental results show that, compared with existing algorithms, the proposed method improves both the total profit and the deployment success rate.
    Simulation Implementation of HHL Algorithm Based on Songshan Supercomputer System
    XIE Haoshan, LIU Xiaonan, ZHAO Chenyan, LIU Zhengyu
    Computer Science    2023, 50 (6): 74-80.   DOI: 10.11896/jsjkx.220500108
    Quantum computing is a new computing mode that follows the laws of quantum mechanics to manipulate quantum information units, and a quantum algorithm is composed of a series of quantum gates realized as a quantum circuit. Quantum circuits operate on qubits, using qubits as the basic storage unit and connecting quantum logic gates to achieve specific computing functions. This paper uses the MPI+OpenMP hybrid parallel programming model on the “Songshan” supercomputer to construct large-scale quantum circuits by splitting them across different nodes, which speeds up circuit construction. For inter-node communication, serialization and deserialization functions are designed to ensure the transmission of data between nodes. To address the exponential differences in the amount of work allocated to each node, an optimization that splits the workload and processes the nodes in a round-robin manner is designed to achieve load balance between nodes. Finally, the construction of a large-scale quantum phase estimation circuit is implemented on the supercomputer's CPU cluster, achieving a speedup of 8.63 compared with a single node. The HHL algorithm is used to verify the correctness of the designed parallel phase estimation sub-module, which provides a reference for implementing the large-scale HHL algorithm on supercomputing platforms.
    Study of Iterative Solution Algorithm of Response Density Matrix in Density Functional Perturbation Theory
    LIU Renyu, XU Zhiqian, SHANG Honghui, ZHANG Yunquan
    Computer Science    2023, 50 (6): 81-85.   DOI: 10.11896/jsjkx.220500252
    For the problem of calculating the response density matrix in density-functional perturbation theory (DFPT), a new parallel solution method for the Sternheimer equation is proposed: the equation is solved with a conjugate gradient algorithm and with a direct matrix decomposition algorithm, and both are implemented in the first-principles molecular simulation software FHI-aims. Experimental results show that the results obtained with the conjugate gradient and direct matrix decomposition algorithms are more accurate, with smaller errors than those of the traditional method, and scalable, which verifies the correctness and validity of the new linear-equation solvers for the Sternheimer equation.
    GPU Shared Scheduling System Under Deep Learning Container Cloud Platform
    WANG Zhuang, WANG Pinghui, WANG Bincheng, WU Wenbo, WANG Bin, CONG Pengyu
    Computer Science    2023, 50 (6): 86-91.   DOI: 10.11896/jsjkx.220900110
    In recent years, containers have gradually replaced virtual machines and are widely used in deep learning cloud platforms because they are lightweight and highly scalable. However, deep learning cloud platforms still have deficiencies in GPU resource management, mainly in that multiple containers cannot share GPU resources due to the limitations of container orchestration technology. For small-scale model training tasks and model inference tasks, a single task cannot fully utilize the computing resources of an entire GPU card, so the current exclusive mode results in a waste of expensive GPU resources and reduces resource efficiency and service availability. To address this problem, this paper proposes a GPU sharing scheduling system. On the one hand, the Kubernetes Operator mechanism is used to extend the existing cluster functions so that multiple Pods can share GPU resources, and an agent mechanism is designed to ensure compatibility with native Kubernetes. On the other hand, based on GPU time slicing and a preemption mechanism, dynamic management and scheduling of GPU resources is realized, with fine-grained coordination among tasks to reduce interference. Experimental results show that, compared with the native Kubernetes scheduling system, the proposed system reduces the completion time of a group of deep learning training tasks by about 20% on average and increases cluster GPU utilization by about 10% on average. When the GPU is shared, the performance loss of high-priority tasks is less than 5% compared to exclusive GPU use, while low-priority tasks can run on the same GPU at 20% of their stand-alone performance.
    Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture
    YE Yue-jin, LI Fang, CHEN De-xun, GUO Heng, CHEN Xin
    Computer Science    2022, 49 (6): 73-80.   DOI: 10.11896/jsjkx.210900045
    How to efficiently handle the discrete memory accesses of unstructured grids is one of the hot issues in parallel algorithms and applications for scientific and engineering computing. The distributed block reconnection optimization algorithm, designed for the domestic Sunway heterogeneous many-core architecture, can maintain high computing performance when dealing with the unstructured sparsity in applications. After a deep analysis of the on-chip communication mechanism of the many-core architecture, an efficient message grouping strategy is designed to improve the bandwidth utilization of the on-chip array on the slave cores, combined with a barrier-free data distribution algorithm to give full play to the network performance of the domestic heterogeneous many-core architecture. Performance modeling and experimental analysis show that the average memory bandwidth of the proposed algorithm reaches more than 70% of the theoretical value under different memory access patterns. Compared with the serial algorithm on the master core, it achieves an average speedup of 10x and a maximum of 45x. Application tests in different fields demonstrate the general applicability of the algorithm.
    Architecture Design for Particle Transport Code Acceleration
    FU Si-qing, LI Tie-jun, ZHANG Jian-min
    Computer Science    2022, 49 (6): 81-88.   DOI: 10.11896/jsjkx.210600179
    The stochastic simulation method of particle transport is usually used to solve for characteristic quantities of a large number of moving particles. Particle transport problems are widely found in medicine, astrophysics, and nuclear physics. The main challenge for current stochastic particle transport simulations is the gap between the number of simulation samples and the simulation timescale supported by computers and what researchers need to study practical problems. Since processor performance development has entered a new stage as process scaling stagnates, integrating ever more complex on-chip structures no longer meets current requirements. For particle transport programs, this paper carries out a series of architecture design studies: by analyzing and exploiting the parallelism and memory access characteristics of the program, a simplified core and a reconfigurable cache are designed to speed it up. Experiments show that compared to a traditional architecture composed of multiple out-of-order cores, this architecture achieves more than 4.5x performance per watt and 2.78x performance per area, which lays a foundation for further study of large-scale many-core particle transport accelerators.
    Survey on Multithreaded Data Race Detection Techniques
    ZHAO Jing-wen, FU Yan, WU Yan-xia, CHEN Jun-wen, FENG Yun, DONG Ji-bin, LIU Jia-qi
    Computer Science    2022, 49 (6): 89-98.   DOI: 10.11896/jsjkx.210700187
    Nowadays, multi-core processors and multi-threaded parallel programs are increasingly widely used. However, the nondeterminism of multi-threaded programs leads to concurrency problems such as data races, atomicity violations, order violations, and deadlocks at run time. Recent research shows that data races account for the largest proportion of concurrency bugs, and that most atomicity and order violations are caused by data races. This paper surveys the related detection techniques of recent years. First, the related concepts, causes, and main ideas of data race detection are introduced. Then, existing data race detection techniques for multi-threaded programs are classified into three types, static analysis, dynamic analysis, and hybrid techniques; their characteristics are summarized comprehensively and compared in detail. Next, the limitations of existing data race detection tools are discussed. Finally, future research directions and challenges in this field are discussed.
    Parallel Optimization Method of Unstructured-grid Computing in CFD for Domestic Heterogeneous Many-core Architecture
    CHEN Xin, LI Fang, DING Hai-xin, SUN Wei-ze, LIU Xin, CHEN De-xun, YE Yue-jin, HE Xiang
    Computer Science    2022, 49 (6): 99-107.   DOI: 10.11896/jsjkx.210400157
    Sunway TaihuLight ranked first on the global TOP500 supercomputer list from 2016 to 2018 with a peak performance of 125.4 PFlops; its computing power comes mainly from the domestic SW26010 many-core RISC processor. Porting and optimizing CFD unstructured-grid computations on domestic many-core supercomputers has always been a challenge because of their complex topology, severe discrete memory access problems, and the strongly correlated solution of linear equations. To give full play to the computing efficiency of the domestic heterogeneous many-core architecture, a data reconstruction model is first proposed to improve data locality and parallelism and to make the data structure better suited to the many-core architecture. Second, aiming at the discrete memory access problem caused by the unordered storage of unstructured-grid data, a discrete memory access optimization method based on pre-stored relation information is proposed, which transforms discrete accesses into contiguous accesses. Finally, a pipeline parallelism mechanism within the core array is introduced to realize many-core parallel solution of the strongly correlated linear equations. Experiments show that the overall performance of unstructured-grid computing in CFD is improved by more than 4x and is 1.2x faster than a general-purpose CPU; the computation scales to 624 000 cores with a parallel efficiency of 64.5%.
    GPU-based Parallel DILU Preconditioning Technique
    WANG Jin, LIU Jiang
    Computer Science    2022, 49 (6): 108-118.   DOI: 10.11896/jsjkx.210300259
    Large sparse linear systems often arise in scientific computing and engineering, and there are many iterative methods and preconditioning techniques for solving them. Diagonal-based incomplete LU (DILU) is a preconditioning technique similar to incomplete LU (ILU) factorization. DILU is used in OpenFOAM, an open-source computational fluid dynamics package, where it is a very important preconditioner, but it has not received much attention outside OpenFOAM and there has been no complete GPU-based implementation so far. This paper compares DILU-preconditioned BiCGStab with ILU-preconditioned BiCGStab, as well as the time spent constructing the preconditioners; the numerical experiments suggest that DILU may be more efficient and stable than ILU. For GPU-based parallel implementations, this paper discusses two parallel schemes, the level-set scheme and the synchronization-free scheme, and gives the related algorithms and some code for both. It compares the performance of the DILU preconditioner under the two schemes; the results show that each scheme has its own advantages and disadvantages on different equations, and one can be selected according to its performance in practice. Finally, the paper compares the performance of DILU preconditioning on GPU and CPU; the results show that the GPU is more competitive, and applications whose performance bottleneck is the solution of linear systems can benefit from moving to GPU platforms.
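    As background on the technique itself (my reading of the textbook description of DILU, not code from the paper or from OpenFOAM): DILU keeps the off-diagonal entries of A and only computes a modified diagonal Dt so that the preconditioner is M = (Dt + L) Dt^{-1} (Dt + U), where L and U are the strict lower and upper parts of A; applying M^{-1} is one forward and one backward sweep. A serial CSR sketch (the GPU level-set and synchronization-free parallelizations are omitted):

    #include <vector>
    #include <cstdio>

    // Sparse matrix in CSR form.
    struct Csr {
        std::vector<int> rowPtr, col;
        std::vector<double> val;
        int n = 0;
    };

    // Look up entry (r, c) by scanning row r (adequate for a sketch).
    static double entry(const Csr& A, int r, int c) {
        for (int k = A.rowPtr[r]; k < A.rowPtr[r + 1]; ++k)
            if (A.col[k] == c) return A.val[k];
        return 0.0;
    }

    // Modified diagonal of DILU: dt[i] = a_ii - sum_{j<i} a_ij * a_ji / dt[j].
    std::vector<double> dilu_diag(const Csr& A) {
        std::vector<double> dt(A.n, 0.0);
        for (int i = 0; i < A.n; ++i) {
            dt[i] = entry(A, i, i);
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
                int j = A.col[k];
                if (j < i) dt[i] -= A.val[k] * entry(A, j, i) / dt[j];
            }
        }
        return dt;
    }

    // Apply z = M^{-1} r: forward sweep with the strict lower part,
    // then backward sweep with the strict upper part.
    void dilu_apply(const Csr& A, const std::vector<double>& dt,
                    const std::vector<double>& r, std::vector<double>& z) {
        std::vector<double> y(A.n);
        for (int i = 0; i < A.n; ++i) {                 // (Dt + L) y = r
            double s = r[i];
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                if (A.col[k] < i) s -= A.val[k] * y[A.col[k]];
            y[i] = s / dt[i];
        }
        z.assign(A.n, 0.0);
        for (int i = A.n - 1; i >= 0; --i) {            // (Dt + U) z = Dt y
            double s = 0.0;
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                if (A.col[k] > i) s += A.val[k] * z[A.col[k]];
            z[i] = y[i] - s / dt[i];
        }
    }

    int main() {
        // 3x3 tridiagonal test matrix: diag 4, off-diagonals -1.
        Csr A;
        A.n = 3;
        A.rowPtr = {0, 2, 5, 7};
        A.col    = {0, 1, 0, 1, 2, 1, 2};
        A.val    = {4, -1, -1, 4, -1, -1, 4};
        std::vector<double> dt = dilu_diag(A), r = {1, 1, 1}, z;
        dilu_apply(A, dt, r, z);
        std::printf("z = %g %g %g\n", z[0], z[1], z[2]);
        return 0;
    }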