Computer Science

Error Log Analysis and System Optimization for Lustre Cluster Storage

CHENG Wen, LI Yan, ZENG Ling-fang, WANG Fang, TANG Shi-cheng, YANG Li-ping, FENG Dan, ZENG Wen-jun

Computer Science 2022, 49 (10): 1-9. DOI: 10.11896/jsjkx.220100134

Error messages from cluster storage systems can help improve the availability and reliability of storage systems. Previous research on storage system error analysis focuses on local file systems or on parts of a cluster storage system. Long-term, multi-dimensional studies of storage system error messages in practical deployments are lacking. As new functional modules are continuously integrated, cluster storage systems grow increasingly complex and the errors they produce keep emerging, which poses challenges for researchers and developers. To address these problems, we conduct a comprehensive study of Lustre system error logs. We collect error logs over 1 673 consecutive days, study nearly 2.26 GB of Lustre error logs, and analyze the characteristics and problems of Lustre system errors across multiple Lustre versions. We show the correlated errors between different subsystems and study the factors that may affect different Lustre versions. We also summarize the common errors in the Lustre system and present the corresponding solutions. We derive numerous new insights into the Lustre development process and report 14 findings. Finally, we collect new error logs over 333 consecutive days to verify the 14 findings and present several error optimization cases. Experimental results show that these cases can significantly reduce the number of errors and improve the availability and stability of the system. Our results and suggestions should be useful both for the development of cluster storage systems and for Lustre operation and maintenance.

Study on Implementation and Optimization of ARM-based Image Geometric Transformation Library

WANG Lu-han, JIA Hai-peng, ZHANG Yun-quan, ZHANG Guang-ting

Computer Science 2022, 49 (10): 10-17. DOI: 10.11896/jsjkx.220100128

Intel Integrated Performance Primitives (IPP) is a high-performance multimedia acceleration library for signal and image processing. However, there is currently no high-performance IPP library for the ARM architecture. This paper implements PerfIPP, a high-performance algorithm library on the ARM computing platform for basic image geometric transformation algorithms such as mirror, remap, and affine/perspective transformation. PerfIPP, optimized through SIMD assembly, memory alignment, data pre-calculation, and high-performance matrix optimization techniques, significantly improves the performance of these algorithms. This paper also summarizes the key techniques for implementing and optimizing image geometric transformation algorithms on the ARM computing platform by comparing the performance differences caused by different instruction combinations, instruction arrangements, and memory access methods. Experimental results show that, on the Huawei Kunpeng 920 platform, PerfIPP achieves a 108.08%~435.5% performance improvement in image transformation over the open source computer vision library while meeting accuracy requirements. It also achieves 83.79% of the average performance of the Intel IPP library on an Intel Xeon E5-2640 processor.

Prediction of Optimal Loop Tiling Size for Stencil Computation Based on Neural Network Model

BAO Yi-kun, ZHANG Peng, XU Xiao-wen, MO Ze-yao

Computer Science 2022, 49 (10): 18-26. DOI: 10.11896/jsjkx.220100147

Stencil computation is one of the most important kinds of loop kernels in scientific and engineering computing applications. Loop tiling can effectively improve the data locality and the degree of computational parallelism of stencil computation, but the best tile size is hard to choose. Traditional tile size selection methods usually fall short in time overhead, labor cost, or model accuracy. This paper proposes a tile size selection method based on an artificial neural network to predict the optimal tile size of three-dimensional Jacobi stencil loop programs. Experimental results show that, for 11 real stencil programs, the programs using the predicted tile size improve performance over the untiled versions by 2% and 35% in serial and parallel tests, respectively. Compared with the well-known grid search method, our method has similar prediction accuracy but takes only one thirty-thousandth of the online time cost. In addition, compared with the Turbo-tiling method, our method improves the performance of tiled code by nearly 9% on average.
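
For readers unfamiliar with loop tiling, the following is a minimal sketch of what a tiled three-dimensional Jacobi sweep looks like, assuming a 7-point stencil on a cubic grid. The function names, the grid initialization, and the tile shape are illustrative assumptions and are not taken from the paper; the neural network that predicts the tile size is not shown.

```python
def make_grid(n):
    """Build an n*n*n grid with distinct values (illustrative initialization)."""
    return [[[float(i * n * n + j * n + k) for k in range(n)]
             for j in range(n)] for i in range(n)]


def update(src, dst, i, j, k):
    """7-point Jacobi update: average of the six face neighbours."""
    dst[i][j][k] = (src[i - 1][j][k] + src[i + 1][j][k] +
                    src[i][j - 1][k] + src[i][j + 1][k] +
                    src[i][j][k - 1] + src[i][j][k + 1]) / 6.0


def jacobi_step(src, dst, n):
    """One untiled sweep over the interior points."""
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                update(src, dst, i, j, k)


def jacobi_step_tiled(src, dst, n, tile):
    """The same sweep, visiting the interior tile by tile.

    `tile` is a (ti, tj, tk) triple; this is the parameter a tile size
    selection method would have to choose.
    """
    ti, tj, tk = tile
    for ii in range(1, n - 1, ti):
        for jj in range(1, n - 1, tj):
            for kk in range(1, n - 1, tk):
                # Inner loops stay inside one tile, clipped at the boundary.
                for i in range(ii, min(ii + ti, n - 1)):
                    for j in range(jj, min(jj + tj, n - 1)):
                        for k in range(kk, min(kk + tk, n - 1)):
                            update(src, dst, i, j, k)
```

Because a Jacobi sweep reads only from `src` and writes only to `dst`, the tiled and untiled sweeps compute the same values; tiling changes only the traversal order, which is what affects data locality on real hardware.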

Design and Implementation of Multithreaded Reproducible DGEMV for Phytium Processor

CHEN Lei, TANG Tao, QI Hai-jun, JIANG Hao, HE Kang

Computer Science 2022, 49 (10): 27-35. DOI: 10.11896/jsjkx.220100125

In high-performance computing, the accumulation of rounding errors when solving large-scale, long-running, and ill-conditioned problems can lead to invalid results. Reproducible results help developers debug programs and check their correctness, so the reproducibility of an algorithm's numerical results becomes very important. Based on the OpenBLAS framework, and combining Demmel's reproducible method in ReproBLAS with the multilayer block technique proposed by Castaldo, this paper designs a reproducible multithreaded DGEMV algorithm for the Phytium processor, using rounding error analysis and error-free transformation. Numerical experiments show that the output of the algorithm is identical to that of ReproBLAS, which verifies its reproducibility. Our algorithm is up to 2x faster than the one in ReproBLAS. Compared with the DGEMV function of OzBLAS proposed by Mukunoki, our algorithm runs at least 20x faster with a single thread and 9x faster with multiple threads. Theoretical analysis and numerical experiments show that the improved algorithm is accurate, validated, and efficient.
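
To illustrate why summation order matters and what an error-free transformation does, here is a small sketch using Knuth's TwoSum. It is not the reproducible method of ReproBLAS or the algorithm proposed in the paper; the function names are illustrative, and the simple compensated sum shown here is more accurate than naive summation but is not guaranteed to be reproducible for arbitrary inputs.

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def naive_sum(values):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for v in values:
        total = total + v
    return total


def compensated_sum(values):
    """Accumulate the rounding error of every addition and add it back at the end."""
    total, comp = 0.0, 0.0
    for v in values:
        total, err = two_sum(total, v)
        comp += err
    return total + comp


if __name__ == "__main__":
    xs = [1e16, 1.0, -1e16]
    ys = [1e16, -1e16, 1.0]  # same numbers, different order
    print(naive_sum(xs), naive_sum(ys))
    print(compensated_sum(xs), compensated_sum(ys))
```

In this example the naive sums of the two orderings differ, because the 1.0 is lost when it is added to 1e16, while the compensated sums agree. In a multithreaded DGEMV the order of partial sums can vary between runs, which is the reproducibility problem the abstract refers to.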

“AI+HPC”-based Time Prediction for the First Principle Calculations and Its Applications in Biomed Community

LI Zhi-ying, MA Shuo, ZHOU Chao, MA Ying-jin, LIU Qian, JIN Zhong

Computer Science 2022, 49 (10): 36-43. DOI: 10.11896/jsjkx.220100129

Among the commonly used first-principles methods, density functional theory (DFT) offers low scaling and high accuracy, so it is increasingly widely used in chemistry, biology, medicine, and other fields. However, in practical applications, its relatively high computational cost poses challenges for users choosing calculation parameters and for computing centers assigning tasks. We recently developed a machine-learning-based time prediction system for DFT calculations, which predicts the actual computational cost before a calculation runs. The mean relative errors are normally less than 0.15, which meets the prediction accuracy requirements of actual scenarios. In this work, we further extend and improve the prediction system: we provide multi-GPU parallel computing functions and modular additions to the machine learning models; we combine it with the biomed community to display in real time the computing tasks submitted to the platform, which makes it convenient for users to coordinate; and we develop an intelligent load balancing module, which can improve the efficiency of first-principles calculations for super-large molecules and cluster systems. These efforts improve the practicality of the prediction system, and preliminary applications are reported for both the community platform and parallel computing.

Matrix Multiplication Vector Code Generation Based on Polyhedron Model

WANG Bo-yang, PANG Jian-min, XU Jin-long, ZHAO Jie, TAO Xiao-han, ZHU Yu

Computer Science 2022, 49 (10): 44-51. DOI: 10.11896/jsjkx.210800247

Matrix multiplication is at the core of many scientific computations, and vectorized programming is one of the main means of improving its performance. Existing vectorization optimizations often require manual tuning and must be mapped to the hardware structure. To address this, building on the polyhedron compiler PPCG, this paper introduces vector code generation into the polyhedron model and proposes a matrix multiplication vector code generation framework based on the polyhedron model. A profit analysis of candidate matrix multiplication vectorization schemes determines the scheme to use, which then guides the framework's code generation. The framework facilitates rapid vectorization optimization of matrix multiplication. Experiments are conducted on 13 matrix multiplication cases with scales between 64×64×64 and 1 024×1 024×1 024. The results show that the framework generates vectorized code correctly. Compared with the automatic vectorization of the baseline compiler ICC, the vectorized code generated by the framework has a speedup of 5.09 times and an average speedup of 3.39 times.

Distributed Lock with Inter-core Passing for SW26010 Processor

LI Ming-liang, PANG Jian-min, YUE Feng

Computer Science 2022, 49 (10): 52-58. DOI: 10.11896/jsjkx.210800091

In parallel programs, a mutual exclusion lock is often used to avoid conflicts when accessing shared resources. The SW26010 processor, deployed in the Sunway TaihuLight supercomputer, is a heterogeneous many-core processor that has no hardware lock mechanism for its co-processing cores. Developers have built a software lock mechanism based on atomic instructions, but the software lock incurs significant overhead and degrades the performance of parallel programs. To solve this issue, this paper proposes HDT-LOCK, a distributed lock mechanism with inter-core passing. First, a hybrid distributed lock is proposed and implemented on the scratchpad memory of the co-processing cores to mitigate memory congestion. Furthermore, an inter-core passing mechanism using register communication and single-instruction multiple-data (SIMD) instructions is developed to improve the throughput of HDT-LOCK. Experimental results show that HDT-LOCK mitigates memory congestion and scales better. In addition, the lock passing mechanism improves HDT-LOCK throughput by up to 5.6x.

CPU Power Model for ARM Architecture Cloud Servers

JIN Yu-yan, YU Tian-hao, WANG Song-bo, LIN Wei-wei, PAN Yu-cong

Computer Science 2022, 49 (10): 59-65. DOI: 10.11896/jsjkx.210800103

The power model of cloud servers is an important topic in research on energy consumption optimization for cloud data centers, and the CPU power model is an important part of it. However, existing CPU power models do not consider CPU heterogeneity; for example, CPU power models for ARM architecture cloud servers are under-studied. Based on an investigation and analysis of existing ARM architecture CPU power models, this paper proposes a new CPU power model for the ARM architecture, namely the hybrid based model (HBM). HBM comprehensively considers modeling features such as CPU utilization and CPU performance events. Compared with the existing PMC based model, which has high measurement accuracy, HBM has similar measurement accuracy and a lower model training cost. HBM is thus more suitable for CPU power modeling of ARM servers. This paper uses the Sysbench benchmark to verify HBM, and experimental results show that the mean relative error (MRE) of HBM is within 1%, which means HBM has high measurement accuracy. Cross-experiments are also conducted on x86 and ARM architecture servers, and the results show that the CPU power behaviors of servers with different architectures differ, so different CPU power modeling methods should be used.
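
As a reading aid, the mean relative error quoted above can be computed as in the sketch below. It assumes the common definition (the average of |predicted - measured| / |measured| over all samples); the abstract does not state the exact formula the authors use, and the function name and sample values are illustrative only.

```python
def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / |measured| over paired samples.

    Assumes the common MRE definition; measured values must be non-zero.
    """
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, m in zip(predicted, measured):
        total += abs(p - m) / abs(m)
    return total / len(measured)


if __name__ == "__main__":
    # Hypothetical power readings in watts, not data from the paper.
    model_watts = [99.0, 202.0]
    meter_watts = [100.0, 200.0]
    print(mean_relative_error(model_watts, meter_watts))
```

Under this definition, an MRE within 1% means the modeled power deviates from the measured power by at most 1% on average across the samples.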

Parallel Optimization of Computational Fluid Dynamics Application Palabos Based on Next-Generation Sunway Supercomputer

LIU An-jun, YIN Hong-hui, WANG Li, LIU Zhi-xiang, KONG Bo, GUO Meng, CHEN Cheng-min, YANG Mei-hong

Computer Science 2022, 49 (10): 66-73. DOI: 10.11896/jsjkx.220100089

Parallel lattice Boltzmann (Palabos) is a widely used computational fluid dynamics software package based on the lattice Boltzmann method (LBM). Owing to its excellent computing power, it is widely applied to porous media, free interfaces, particle motion, blood flow, and other fields. Palabos has a broad user base, which makes it urgent to port it to the Sunway supercomputer and to optimize and accelerate it in parallel to serve the energy and chemical industry. In this paper, a heterogeneous parallel design of Palabos is carried out on the new generation Sunway supercomputer system (SW26010pro). The data structures and template programming of Palabos are not suited to the heterogeneous parallelism of the Sunway supercomputer system, so we design parallel optimization techniques called direct getting address, polymorphic tag processing, and data slicing to handle them. Exploiting the characteristics of the new generation Sunway supercomputer system, we also adopt optimizations based on shared memory and register memory access (RMA). With 64 computing processing elements (CPEs), the speedup is 2~6 times. Palabos achieves million-core-scale parallel computing of a two-phase flow algorithm for complex multi-scale chemical processes on the new generation Sunway supercomputer system. The parallel efficiency at one million cores is more than 40% relative to 64 000 cores.

Implementation of FPGA-based High-performance and Scalable SM4-GCM Algorithm

ZHAI Jia-qi, LI Bin, ZHOU Qing-lei, CHEN Xiao-jie

Computer Science 2022, 49 (10): 74-82. DOI: 10.11896/jsjkx.210900137

With the vigorous development of big data and 5G technology, information encryption in high-speed communication systems has become a new research hotspot. How to increase data throughput and make encryption algorithms easier to adapt to different application scenarios while ensuring high data security has become an important research topic. Traditional software implementations of the SM4-GCM algorithm have low throughput and are difficult to apply in changing 5G and big data scenarios. To address this problem, this paper analyzes the characteristics of the SM4-GCM algorithm and, based on the reconfigurability of FPGAs and using the Mastrovito, Karatsuba, and fast remainder algorithms, designs two high-performance, data/control-separated, and scalable circuit structures. Full-pipeline and four-way parallel techniques are used to accelerate the SM4-GCM algorithm. While ensuring high security, the designs achieve high throughput and can be flexibly ported to various application scenarios. Experimental results show that the throughput of the two proposed designs for a single SM4-GCM module reaches 28.16 Gbps and 28.8 Gbps, respectively, which is superior to similar published designs in terms of performance and scalability.
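
The Karatsuba algorithm mentioned above is typically applied to the carry-less (GF(2) polynomial) multiplication at the heart of the GCM authentication step. The sketch below shows that idea in software, with integers used as bit vectors of polynomial coefficients. It is an illustrative assumption of how the technique works, not the paper's circuit: the function names and the 8-bit recursion cutoff are made up for this example, and the modular reduction and GCM's bit ordering are omitted.

```python
def clmul(a, b):
    """Schoolbook carry-less multiplication of two GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2) is XOR
        a <<= 1
        b >>= 1
    return result


def karatsuba_clmul(a, b, bits=128):
    """Karatsuba carry-less multiplication of operands below 2**bits.

    `bits` is assumed to be a power of two. Each level replaces four
    half-size products with three, at the cost of a few extra XORs.
    """
    if bits <= 8:  # illustrative cutoff for the recursion
        return clmul(a, b)
    half = bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = karatsuba_clmul(a_lo, b_lo, half)
    hi = karatsuba_clmul(a_hi, b_hi, half)
    # (a_lo + a_hi)(b_lo + b_hi) = lo + hi + cross terms, with + being XOR.
    mid = karatsuba_clmul(a_lo ^ a_hi, b_lo ^ b_hi, half) ^ lo ^ hi
    return (hi << (2 * half)) ^ (mid << half) ^ lo
```

In hardware, the three half-size products can be computed in parallel, which is one reason Karatsuba-style multipliers are attractive for pipelined designs; the 255-bit product would then be reduced modulo the field polynomial, a step not shown here.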
