High-performance FPGA Implementation of Elliptic Curve ECC on Binary Domain
YOU Wen-zhu, GE Hai-bo
Computer Science    2020, 47 (8): 127-131.   DOI: 10.11896/jsjkx.200600112
In recent years, the communications field has developed tremendously. Applications such as online banking and mobile communications have raised the security requirements in resource-constrained environments. Compared with traditional cryptographic algorithms, the elliptic curve cryptosystem (ECC) provides better security and more room for optimizing performance parameters. Therefore, an efficient elliptic curve cipher hardware design scheme is proposed. Building on existing research, the proposed scheme uses the López-Dahab (LD) projective-coordinate Montgomery ladder algorithm for the core scalar multiplication operation in ECC, and uses parallel scheduling to reduce latency in the group operation layer. For finite field operations, a bit-parallel multiplication algorithm and an improved Euclidean inversion algorithm are adopted. The architecture is implemented on Xilinx Virtex-5 and Virtex-7 FPGA devices over binary fields of lengths 163, 233 and 283 respectively. Experimental results show that the proposed scheme consumes fewer FPGA resources and computes faster: compared with other methods, hardware resource consumption is reduced by 52.9% and scalar multiplication is 3.7 times faster, so the design is better suited to resource-constrained devices.
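The core of such a design is the Montgomery ladder for scalar multiplication, which performs one point addition and one point doubling per key bit regardless of the bit's value. The sketch below is a minimal, software-only illustration of that ladder on a toy prime-field curve; the paper's hardware works in LD projective coordinates over GF(2^m) with bit-parallel field multipliers, so the curve, field and coordinate system here are illustrative assumptions only.

```python
# Illustrative sketch: Montgomery-ladder scalar multiplication on a toy
# short-Weierstrass curve over a small prime field. The uniform
# one-add-one-double structure per key bit is what enables the parallel
# scheduling of group operations described in the abstract.
P_MOD, A, B = 97, 2, 3                     # toy curve y^2 = x^3 + 2x + 3 mod 97
INF = None                                 # point at infinity

def add(p, q):
    if p is None: return q
    if q is None: return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def montgomery_ladder(k, p):
    """Scalar multiplication k*p with one add and one double per key bit."""
    r0, r1 = INF, p
    for bit in bin(k)[2:]:                 # scan key bits MSB -> LSB
        if bit == '1':
            r0, r1 = add(r0, r1), add(r1, r1)
        else:
            r0, r1 = add(r0, r0), add(r0, r1)
    return r0

if __name__ == "__main__":
    base = (3, 6)                          # a point on the toy curve
    print(montgomery_ladder(13, base))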
ENLHS:Sampling Approach to Auto Tuning Kafka Configurations
XIE Wen-kang, FAN Wei-bei, ZHANG Yu-jie, XU He, LI Peng
Computer Science    2020, 47 (8): 119-126.   DOI: 10.11896/jsjkx.200300010
When Kafka is deployed in a production environment, its performance is limited not only by the machine's hardware and system platform; its own configuration items are the key factor in whether it can reach the desired performance under limited hardware resources. These items are tuned manually, so the efficiency of modifying and tuning them is extremely poor, and with default configuration parameters that are not adapted to the actual resource environment, Kafka cannot guarantee its performance in every production environment. Because Kafka's configuration space is extremely large, traditional adaptive algorithms perform poorly in large-scale production systems. Therefore, to improve Kafka's adaptive ability, remove complexity from the system and obtain better operating performance, an adaptive performance tuning method for Kafka is proposed that fully considers the influence weights of Kafka's characteristic parameters on performance. It uses sampling to generate data sets more efficiently and narrow the range of data selection, which improves modeling efficiency and reduces the complexity of the optimization method. Experiments show that the algorithm improves the throughput and latency of the open-source version of Kafka: under a given set of system resources, throughput increases and latency decreases.
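The "LHS" in ENLHS suggests Latin hypercube sampling; assuming that reading, the sketch below shows how candidate Kafka configurations could be drawn so that each configuration item's range is covered evenly by a small sample. The parameter ranges are invented for illustration and are not taken from the paper.

```python
# Hypothetical sketch: Latin hypercube sampling over a few Kafka broker
# configuration items. Each parameter's range is split into n strata and every
# stratum is hit exactly once, so even a small sample covers the space.
import random

PARAM_RANGES = {                          # illustrative ranges only
    "num.network.threads": (2, 16),
    "num.io.threads": (4, 32),
    "socket.send.buffer.bytes": (64 * 1024, 1024 * 1024),
    "log.segment.bytes": (128 * 1024 * 1024, 1024 * 1024 * 1024),
}

def latin_hypercube(n_samples, ranges=PARAM_RANGES, seed=0):
    rng = random.Random(seed)
    configs = [{} for _ in range(n_samples)]
    for name, (lo, hi) in ranges.items():
        strata = list(range(n_samples))
        rng.shuffle(strata)               # one stratum per sample, random order
        for cfg, s in zip(configs, strata):
            frac = (s + rng.random()) / n_samples
            cfg[name] = int(lo + frac * (hi - lo))
    return configs

if __name__ == "__main__":
    for cfg in latin_hypercube(5):
        print(cfg)   # each config would be benchmarked to build the tuning model
```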
Energy Efficient Scheduling Algorithm of Workflows with Cost Constraint in Heterogeneous Cloud Computing Systems
ZHANG Long-xin, ZHOU Li-qian, WEN Hong, XIAO Man-sheng, DENG Xiao-jun
Computer Science    2020, 47 (8): 112-118.   DOI: 10.11896/jsjkx.200300038
Cloud computing has become a very important computing service mode in various industries. Traditional studies on cloud computing mainly focus on aspects of service quality such as the pricing mode, profit maximization and execution efficiency of cloud services, while green computing is the development trend of distributed computing. Aiming at the problem of scheduling workflow task sets that must meet cloud users' computing cost constraints in a heterogeneous cloud environment, an energy-aware budget-level scheduling algorithm (EABL) with low time complexity is proposed. The EABL algorithm consists of three main stages: task priority establishment, task budget cost allocation, and selection of the optimal execution virtual machine and energy-efficient frequency for the parallel task set, so as to minimize the energy consumed in executing the task set under the budget cost constraint. Large-scale real-world workflow task sets are used to test the algorithm extensively in the experiments. Compared with the well-known algorithms EA_HBCS and MECABP, EABL can effectively reduce the energy consumption of cloud data centers by making full use of the budget cost.
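As a rough illustration of the third stage, the sketch below greedily picks, for one task, the (virtual machine, frequency) pair with the lowest energy among those whose cost fits the task's allocated budget. The price and power models and all numbers are toy assumptions, not the paper's formulation.

```python
# Minimal sketch of the spirit of EABL's third stage, under assumptions:
# each task has a per-VM execution time at the highest frequency, each VM has
# discrete frequency levels, power scales roughly with f^2, and a per-task
# budget (from stage two) caps the monetary cost.
def pick_vm_and_freq(task_budget, exec_time, vms):
    """exec_time[vm] = time at the VM's highest frequency;
    vms[vm] = (price_per_second, [frequency levels as fractions of f_max])."""
    best, best_energy = None, float("inf")
    for vm, (price, freqs) in vms.items():
        for f in freqs:
            time = exec_time[vm] / f              # slower at lower frequency
            cost = price * time
            energy = (f ** 2) * time              # toy power model ~ f^2
            if cost <= task_budget and energy < best_energy:
                best, best_energy = (vm, f), energy
    return best

if __name__ == "__main__":
    vms = {"vm_small": (0.02, [0.6, 0.8, 1.0]), "vm_large": (0.09, [0.6, 0.8, 1.0])}
    print(pick_vm_and_freq(0.5, {"vm_small": 20.0, "vm_large": 6.0}, vms))
```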
Real-time SIFT Algorithm Based on GPU
WANG Liang, ZHOU Xin-zhi, YAN Hua
Computer Science    2020, 47 (8): 105-111.   DOI: 10.11896/jsjkx.190700036
Aiming at the high complexity and poor real-time performance of the SIFT feature extraction algorithm, a real-time SIFT algorithm based on GPU, called CoSift (CUDA Optimized SIFT), is proposed. Firstly, the algorithm uses the CUDA stream concurrency mechanism to construct the SIFT scale space. In this process, the high-speed memory in the CUDA memory model is fully utilized to improve data access speed, and the two-dimensional Gaussian convolution kernel is optimized to reduce the amount of computation. Then, a warp-based histogram policy is designed to rebalance the workload during feature description. Compared with the traditional CPU version and improved GPU versions, the proposed algorithm greatly improves the real-time performance of SIFT without reducing the accuracy of feature extraction, and the optimization effect is greater on large images. CoSift extracts features within 7.7~8.8 ms (116.28~129.87 fps) on a GTX 1080Ti. The algorithm effectively reduces the complexity of the traditional SIFT pipeline, improves real-time performance, and is easy to apply in scenarios with strict real-time requirements on SIFT.
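One of the optimizations named in the abstract is reducing the cost of the 2D Gaussian convolution used to build the scale space. A standard way to do this, assumed here, is to exploit the kernel's separability and run two 1D passes instead of one k×k pass; the NumPy sketch below shows the idea on the CPU and is not the paper's CUDA implementation.

```python
# Illustrative NumPy sketch: separable Gaussian blur for scale-space
# construction, cutting the work per pixel from O(k^2) to O(2k).
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def separable_blur(img, sigma):
    k = gaussian_kernel_1d(sigma)
    # horizontal pass then vertical pass, one row/column at a time
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

if __name__ == "__main__":
    image = np.random.rand(64, 64).astype(np.float32)
    blurred = separable_blur(image, sigma=1.6)   # 1.6 is SIFT's base sigma
    print(blurred.shape)
```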
Optimization of BFS on Domestic Heterogeneous Many-core Processor SW26010
YUAN Xin-hui, LIN Rong-fen, WEI Di, YIN Wan-wang, XU Jin-xiu
Computer Science    2020, 47 (8): 98-104.   DOI: 10.11896/jsjkx.191000013
In recent years, growing attention has been paid to the processing capability for data-intensive tasks. Breadth-first search (BFS) is a typical data-intensive problem that is widely used in a variety of graph algorithms. The Graph 500 benchmark, with the BFS algorithm at its core, has become the standard for evaluating the processing capability for data-intensive tasks. The Sunway TaihuLight supercomputer topped the Top 500 list four consecutive times from June 2016 to November 2017; its processor, SW26010, is the first Chinese homegrown heterogeneous many-core processor. This paper studies how to use the architectural characteristics of SW26010 to accelerate the BFS algorithm. A direction-optimizing hybrid BFS algorithm based on a single core group (CG) is implemented on SW26010, using a bytemap to remove the data dependencies in inner loops, hiding SPM access overhead behind computation with asynchronous DMA, exploiting the heterogeneous architecture for collaborative computation, and preprocessing the graph. Eventually, with Graph 500 as the benchmark on a scale-22 graph, a single CG of the SW26010 processor achieves a performance of 457.54 MTEPS.
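The direction-optimizing scheme itself is independent of the SW26010-specific optimizations: BFS runs top-down while the frontier is small and switches to bottom-up once it grows large. The plain-Python sketch below shows that switching logic only; the bytemap, asynchronous DMA and master/slave-core cooperation are not modeled, and the switch threshold is an arbitrary illustrative value.

```python
# Sketch of direction-optimizing (hybrid) BFS on an undirected graph given as
# adjacency lists: top-down scans the frontier's edges, bottom-up lets each
# unvisited vertex look for any parent already in the frontier.
def hybrid_bfs(adj, source, switch_ratio=0.3):
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    while frontier:
        if len(frontier) < switch_ratio * n:          # top-down step
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        nxt.append(v)
        else:                                         # bottom-up step
            in_frontier = set(frontier)
            nxt = []
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in in_frontier:
                            parent[v] = u
                            nxt.append(v)
                            break
        frontier = nxt
    return parent

if __name__ == "__main__":
    graph = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
    print(hybrid_bfs(graph, 0))
```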
Large-scale Quantum Fourier Transform Simulation Based on SW26010
LIU Xiao-nan, JING Li-na, WANG Li-xin, WANG Mei-ling
Computer Science    2020, 47 (8): 93-97.   DOI: 10.11896/jsjkx.200300015
Quantum computing has a natural parallel advantage due to entanglement and superposition. However, current quantum computing equipment is limited by the technological level of physical realization, and achieving enormous computing power that solves problems of practical significance will still require time for accumulation and breakthroughs. Therefore, using classical computers to simulate quantum computing has become an effective way to verify quantum algorithms. The Quantum Fourier Transform is a key part of many quantum algorithms, including phase estimation, order finding and factoring. Research on the Quantum Fourier Transform and its large-scale simulation can effectively promote the study, verification and optimization of related quantum algorithms. In this paper, a large-scale Quantum Fourier Transform is simulated on the Sunway TaihuLight supercomputer, independently developed by China. According to the heterogeneous parallel characteristics of the SW26010 processor, MPI, the accelerated thread library, and communication-computation overlapping are adopted for optimization. The correctness of the Quantum Fourier Transform simulation is verified by finding the period in Shor's algorithm, and the simulation and optimization of a 46-qubit Quantum Fourier Transform are realized, which provides a reference for verifying and optimizing other quantum algorithms on the supercomputing platform and for proposing new quantum algorithms.
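On a full state vector the QFT acts as a normalized discrete Fourier transform of the amplitudes, which is what a classical simulation ultimately computes. The sketch below builds that unitary densely for a handful of qubits and checks it on a basis state; at 46 qubits the 2^46-amplitude state vector must instead be updated gate by gate and distributed across nodes with MPI, which is the part the paper optimizes.

```python
# Small-scale illustration only: dense QFT unitary applied to a basis state.
import numpy as np

def qft_matrix(n_qubits):
    """Dense QFT unitary for n qubits: F[j, k] = exp(2*pi*i*j*k/N) / sqrt(N)."""
    N = 2 ** n_qubits
    j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.exp(2j * np.pi * j * k / N) / np.sqrt(N)

if __name__ == "__main__":
    n = 4                                   # 4 qubits -> 16 amplitudes
    state = np.zeros(2 ** n, dtype=complex)
    state[3] = 1.0                          # basis state |0011>
    out = qft_matrix(n) @ state
    # For a basis state |x>, the QFT output equals the x-th DFT column.
    expected = np.exp(2j * np.pi * 3 * np.arange(2 ** n) / 2 ** n) / 4
    print(np.allclose(out, expected))
```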
Large Scalability Method of 2D Computation on Shenwei Many-core
ZHUANG Yuan, GUO Qiang, ZHANG Jie, ZENG Yun-hui
Computer Science    2020, 47 (8): 87-92.   DOI: 10.11896/jsjkx.191000011
With the development of supercomputers and their programming environments, multilevel parallelism on heterogeneous system architectures is a promising trend, and applications ported to Sunway TaihuLight are typical examples. Since Sunway TaihuLight was opened to the public in 2016, many scholars have focused on method studies and application verification, and much experience with Shenwei many-core programming methods has been accumulated. However, when the CESM model is ported to the Shenwei many-core architecture, some two-dimensional computations in the ported POP model show quite good results with 1024 processes, yet with 16800 processes they perform much worse than the original version and the apparent acceleration ratios turn out to be false. To address this problem, a new parallel method based on slave-core partitions is proposed. Under the new method, the 64 slave cores in a core group are divided into disjoint small partitions, so that different, independent computing kernels can run on different slave-core partitions simultaneously. Computing kernels can be loaded onto partitions that suit their data size and computational load, and the number of slave cores in each partition can be set properly according to the computing scale, so the slave cores' computing capability is fully utilized. Based on the new parallel method, together with loop combination and function expansion, the slave cores are fully employed and part of the computing time of the concurrently running code sections is hidden. Furthermore, the method effectively extends the parallel granularity at which kernels can be threaded. With these methods, the simulation speed of the POP model in the high-resolution CESM G-compset is improved by 0.8 simulated years per day on 1.1 million cores.
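The central idea is sizing disjoint slave-core partitions in proportion to the load of the independent kernels that will run on them concurrently. The sketch below is a hypothetical illustration of that sizing step for one 64-core core group; the kernel names and load figures are invented, and the real scheme must also map data and launch the kernels on the chosen cores.

```python
# Hypothetical sketch: split the 64 slave cores of one SW26010 core group into
# disjoint partitions sized roughly in proportion to each independent kernel's
# computational load, so the kernels can run concurrently instead of
# serializing on all 64 cores.
def partition_slave_cores(kernel_loads, total_cores=64):
    total = sum(kernel_loads.values())
    sizes = {k: max(1, round(total_cores * w / total)) for k, w in kernel_loads.items()}
    # fix rounding so the partition sizes sum exactly to total_cores
    drift = total_cores - sum(sizes.values())
    heaviest = max(sizes, key=sizes.get)
    sizes[heaviest] += drift
    partitions, start = {}, 0
    for kernel, size in sizes.items():
        partitions[kernel] = list(range(start, start + size))
        start += size
    return partitions

if __name__ == "__main__":
    loads = {"advection_2d": 5.0, "boundary_update": 1.0, "diagnostics": 2.0}
    for kernel, cores in partition_slave_cores(loads).items():
        print(kernel, f"{len(cores)} slave cores", cores[:4], "...")
```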
Cuckoo Hash Table Based on Smart Placement Strategy
JIN Qi, WANG Jun-chang, FU Xiong
Computer Science    2020, 47 (8): 80-86.   DOI: 10.11896/jsjkx.191200109
Because its query time is O(1), the Cuckoo hash table has been widely used in big data, cloud computing and other fields. However, insertion in existing Cuckoo hash tables generally uses a random replacement strategy to evict existing entries when conflicts occur. On the one hand, writers are prone to high-latency insertions and infinite loops, especially when the load factor of the hash table is high, and may even have to rebuild the entire table; on the other hand, the random replacement strategy scatters entries across the buckets of the hash table, and the lack of spatial locality among entries reduces the efficiency of positive queries. To solve these problems, this paper proposes Cuckoo hash tables based on a smart placement strategy. Specifically, to improve insertion efficiency, a load-balanced Cuckoo hash table (LBCHT) is proposed, which limits the load of each bucket in real time and uses breadth-first search to find the best Cuckoo path, ensuring load balancing among buckets. Experimental results show that LBCHT can effectively reduce the long-tail effect of insertion under high load factors. To improve lookup efficiency, a Cuckoo hash table that makes full use of the locality principle (LPCHT) is proposed: by fully exploiting the spatial locality among entries, the CPU cache miss rate caused by lookups is reduced and the efficiency of positive queries is improved. Experiments show that in a high-load stress test environment, compared with libcuckoo, LBCHT increases insertion throughput by 50% and LPCHT improves positive query efficiency by 7%.
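A minimal sketch of BFS-driven cuckoo insertion is given below as a simplified stand-in for LBCHT: two hash functions, one entry per bucket, and Python's built-in hash in place of real hash functions. Instead of kicking entries along a random path, a breadth-first search finds the shortest eviction path to an empty slot, which bounds insertion latency and spreads entries more evenly.

```python
# Simplified sketch of BFS-based cuckoo insertion (not the paper's LBCHT).
from collections import deque

class BFSCuckoo:
    def __init__(self, size=11):
        self.size = size
        self.slots = [None] * size            # each slot holds one key or None

    def _positions(self, key):
        return [hash(("h1", key)) % self.size, hash(("h2", key)) % self.size]

    def insert(self, key):
        # BFS over eviction paths; stop at the first empty slot found.
        start = self._positions(key)
        queue = deque((pos, [pos]) for pos in start)
        seen = set(start)
        while queue:
            pos, path = queue.popleft()
            if self.slots[pos] is None:
                # shift displaced keys backwards along the path, then place key
                for dst, src in zip(reversed(path), reversed(path[:-1])):
                    self.slots[dst] = self.slots[src]
                self.slots[path[0]] = key
                return True
            for alt in self._positions(self.slots[pos]):
                if alt not in seen:
                    seen.add(alt)
                    queue.append((alt, path + [alt]))
        return False                           # no path: table needs resizing

    def lookup(self, key):
        return any(self.slots[p] == key for p in self._positions(key))

if __name__ == "__main__":
    t = BFSCuckoo()
    for k in ["a", "b", "c", "d", "e"]:
        t.insert(k)
    print(t.lookup("c"), t.lookup("zzz"))
```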
Algorithm Design of Variable Precision Transcendental Functions
HAO Jiang-wei, GUO Shao-zhong, XIA Yuan-yuan, XU Jin-chen
Computer Science    2020, 47 (8): 71-79.   DOI: 10.11896/jsjkx.200200013
Transcendental functions are a major part of the fundamental mathematical software library, and their accuracy and performance largely determine those of upper-layer applications. Aiming at the tedious and error-prone implementation of transcendental functions as well as the differing accuracy requirements of applications, a variable precision transcendental function algorithm is proposed that considers both generality and the mathematical characteristics of the functions. Based on the similarity among transcendental functions, a transformation-reduction-approximation-reconstruction algorithm template is constructed to unify the implementations of common transcendental functions, and the template parameters are adjusted to control error and generate function code at different precision levels. Experimental results show that the algorithm can generate code for common transcendental functions at different precision levels and has performance advantages over the corresponding functions in the standard mathematical software library.
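The sketch below instantiates the transformation-reduction-approximation-reconstruction template for one function, exp, with the polynomial degree serving as the precision knob. It illustrates the shape of the template only; the paper's generator covers many functions and tunes the reduction and approximation parameters against formal error bounds.

```python
# Illustrative template instance for exp: reduce x to r with x = k*ln2 + r,
# approximate e^r with a truncated Taylor polynomial of chosen degree, then
# reconstruct e^x = 2^k * e^r. Higher degree -> smaller error.
import math

def exp_variable_precision(x, degree=6):
    # reduction: x = k*ln2 + r, |r| <= ln2/2
    k = round(x / math.log(2))
    r = x - k * math.log(2)
    # approximation: Taylor polynomial evaluated Horner-style
    poly = 0.0
    for n in range(degree, -1, -1):
        poly = poly * r + 1.0 / math.factorial(n)
    # reconstruction: scale by 2^k
    return math.ldexp(poly, k)

if __name__ == "__main__":
    for deg in (3, 6, 10):
        approx = exp_variable_precision(5.3, degree=deg)
        print(deg, abs(approx - math.exp(5.3)) / math.exp(5.3))
```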
Prediction of Loop Tiling Size Based on Neural Network
CHI Hao-yu, CHEN Chang-bo
Computer Science    2020, 47 (8): 62-70.   DOI: 10.11896/jsjkx.191200180
Optimization of loop programs is an important topic in program optimization. As a typical loop optimization technique, loop tiling has been widely studied and applied. The selection of tile size, which is complex and highly dependent on the program and the hardware, has a great impact on loop performance. Traditional approaches based on static analysis and heuristic search are costly in labor and time, and lack generality and portability. Therefore, a neural network, which is good at representing high-dimensional information, is used to learn the hidden correlation between tile size and program performance that results from the complex interaction between program and hardware. A new group of 29-dimensional features is extracted based on the problem size, the loop structure, and the locality of operations within the loop. Experiments are carried out on hundreds of thousands of instances of six kinds of kernel programs (3D loops, 2D data) with random sizes in the range of 1024 to 2048. The sequential model (TSS-T6) achieves an average speedup of 6.64 over GCC -O2, reaches 98.5% of the average maximum performance found by exhaustive search, and gives an average 9.9% improvement over Pluto. The parallel model (TSSP-T6-Search) achieves an average speedup of 2.41 over the OpenMP default optimization, reaches 91.7% of the average maximum performance found by exhaustive search, and gives an average 9% improvement over Pluto's default parallel tiling optimization.
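As an illustration of the learning setup (not the paper's feature set or data), the sketch below trains a small MLP that maps a 29-dimensional feature vector, here random, with the first component standing in for a normalized candidate tile size, to a synthetic performance score, then picks the candidate tile with the best predicted score.

```python
# Illustrative sketch only: synthetic data and an invented feature layout.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((2000, 29))                      # 29-dim features incl. tile size
y = np.sin(X[:, 0] * 6) + X[:, 1] * 0.5 + rng.normal(0, 0.05, 2000)  # fake perf

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(X[:1500], y[:1500])
print("held-out R^2:", model.score(X[1500:], y[1500:]))

# tuning: score candidate tile sizes for one fixed program feature vector
base = rng.random(29)
candidates = [16, 32, 64, 128]
feats = np.array([np.concatenate(([t / 128.0], base[1:])) for t in candidates])
print("best candidate tile:", candidates[int(np.argmax(model.predict(feats)))])
```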
Design and Optimization of Two-level Particle-mesh Algorithm Based on Fixed-point Compression
CHENG Sheng-gan, YU Hao-ran, WEI Jian-wen, James LIN
Computer Science    2020, 47 (8): 56-61.   DOI: 10.11896/jsjkx.200200112
Large-scale N-body simulation is of great significance for the study of modern physical cosmology. One of the most popular N-body simulation methods is the particle-mesh (PM) algorithm. However, PM-based algorithms consume considerable amounts of memory, which becomes the bottleneck when scaling N-body simulations on modern supercomputers. Therefore, this paper proposes to use fixed-point compression to reduce the memory footprint per N-body particle to only 6 bytes, nearly an order of magnitude lower than traditional PM-based algorithms. The paper implements the two-level particle-mesh algorithm with fixed-point compression and optimizes it with mixed-precision computation and communication optimizations. These optimizations significantly reduce the performance loss caused by fixed-point compression: the proportion of compression and decompression in the program's total time drops from 21% to 8%, and up to 2.3 times speedup is achieved on computing hotspots, so the algorithm maintains high efficiency and scalability with low memory consumption.
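As a rough illustration of how 6 bytes per particle can be reached, the sketch below stores each coordinate as a 16-bit fixed-point offset within its coarse mesh cell, giving 3 × 2 bytes of offsets per particle; the exact encoding, and keeping the cell index implicit via a cell-sorted layout, are assumptions rather than the paper's precise format.

```python
# Sketch of fixed-point position compression: 16-bit per-coordinate offsets
# relative to the coarse mesh cell. In a real cell-sorted layout the cell
# index would be implicit from the particle's storage position.
import numpy as np

MESH = 64                 # coarse mesh cells per dimension, box normalized to 1

def compress(pos):
    """pos: (N, 3) float64 in [0, 1) -> (cell indices, (N, 3) uint16 offsets)."""
    cell = np.floor(pos * MESH).astype(np.int32)
    frac = pos * MESH - cell                     # position inside the cell, [0, 1)
    return cell, np.round(frac * 65535).astype(np.uint16)

def decompress(cell, q):
    return (cell + q.astype(np.float64) / 65535) / MESH

if __name__ == "__main__":
    particles = np.random.rand(1000, 3)
    cell, q = compress(particles)
    err = np.abs(decompress(cell, q) - particles).max()
    print(f"{q.nbytes // len(particles)} bytes/particle of offsets, max error {err:.2e}")
```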
Parallel Algorithm of Deep Transductive Non-negative Matrix Factorization for Speech Separation
LI Yu-rong, LIU Jie, LIU Ya-lin, GONG Chun-ye, WANG Yong
Computer Science    2020, 47 (8): 49-55.   DOI: 10.11896/jsjkx.190900202
Non-negative matrix factorization can preserve the non-negative features of the speech signal and is an important method for speech separation. However, it involves complicated data manipulation and heavy computation, so a parallel method is needed to reduce computation time. Aiming at the computational cost of the pre-training and separation stages of speech separation, this paper proposes a multi-level parallel algorithm for deep transductive non-negative matrix factorization, which takes the data dependencies of the iterative update process into account and parallelizes both between tasks and within tasks. At the task level, decomposing the training speech to obtain the corresponding basis matrices is treated as two independent tasks computed in parallel. At the process level within a task, the matrices are partitioned by rows and columns: the master process distributes matrix blocks to the slave processes; each slave process receives its current sub-matrix, computes the corresponding sub-block of the result matrix, and passes its block on to the next process, so that every block of the second matrix traverses all processes and the products for the corresponding result sub-blocks are computed; finally, the master process gathers the sub-blocks from the slave processes. At the thread level, the sub-matrix multiplications are accelerated by spawning multiple threads that exchange data through shared memory. This is the first parallel implementation of the deep transductive non-negative matrix factorization algorithm. Experiments are performed on the Tianhe-2 platform. The results show that, when separating mixed multi-speaker speech signals without changing the separation quality, the proposed parallel algorithm achieves a speedup of 18 with 64 processes in the pre-training stage and a speedup of 24 in the separation stage. Compared with the serial and pure-MPI versions, the hybrid version greatly shortens separation time, which shows that the proposed parallel algorithm can effectively improve the efficiency of speech separation.
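At the core of both stages is the NMF update itself. The single-process NumPy sketch below shows standard multiplicative updates on a spectrogram-like matrix; the paper's contribution is distributing the deep transductive variant across tasks, MPI processes and threads, which this sketch does not reproduce.

```python
# Baseline NMF with Lee-Seung multiplicative updates (illustration only).
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9):
    """Factor V (m x n, non-negative) into W (m x rank) @ H (rank x n)."""
    rng = np.random.default_rng(0)
    m, n = V.shape
    W, H = rng.random((m, rank)), rng.random((rank, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)      # multiplicative updates keep
        W *= (V @ H.T) / (W @ H @ H.T + eps)      # W and H non-negative
    return W, H

if __name__ == "__main__":
    # magnitude-spectrogram-like toy data: 257 frequency bins x 400 frames
    V = np.abs(np.random.default_rng(1).normal(size=(257, 400)))
    W, H = nmf(V, rank=20)
    print("relative error:", np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```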
Computing Resources Allocation with Load Balance in Modern Processor
WANG Guo-peng, YANG Jian-xin, YIN Fei, JIANG Sheng-jian
Computer Science    2020, 47 (8): 41-48.   DOI: 10.11896/jsjkx.191000148
To improve program efficiency, modern superscalar processors are equipped with multiple functional units that support executing instructions in parallel, and the allocation policy for these computing resources plays an important role in taking full advantage of them. Although policies for allocating computing resources and scheduling instructions have been well studied in the literature, the proposed solutions concentrate almost entirely on compile-time optimization, which is largely static, inflexible and inefficient because pipeline information is not available at run time. To mitigate the negative impact of improper resource allocation and maximize the power of multiple functional units, this paper abstracts a mathematical model of the run-time resource allocation problem and studies fine-grained automatic hardware methods for symmetric and asymmetric functional unit configurations, in order to make dynamic, well-informed resource allocation decisions at instruction issue in the general case. As a result, a load-balanced greedy resource allocation strategy is proposed and evaluated. Experimental results show that the policy effectively minimizes the blocking time caused by unfair allocation of computing resources; furthermore, the more computing resources are provided, the better the performance it yields.
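The sketch below illustrates one plausible reading of a load-balanced greedy policy: at issue time, each instruction goes to the capable functional unit with the least pending work, so no unit blocks while a sibling sits idle. The unit mix, latencies and instruction stream are invented, and real hardware would track per-cycle occupancy rather than a simple backlog counter.

```python
# Toy sketch of load-balanced greedy issue to functional units.
UNITS = {                      # unit name -> (supported op classes, busy cycles)
    "alu0": ({"add", "logic"}, 0),
    "alu1": ({"add", "logic"}, 0),
    "mul0": ({"mul"}, 0),
    "lsu0": ({"load", "store"}, 0),
}
LATENCY = {"add": 1, "logic": 1, "mul": 3, "load": 4, "store": 1}

def issue(op):
    capable = {u: busy for u, (ops, busy) in UNITS.items() if op in ops}
    unit = min(capable, key=capable.get)          # least pending work wins
    ops, busy = UNITS[unit]
    UNITS[unit] = (ops, busy + LATENCY[op])       # grow that unit's backlog
    return unit

if __name__ == "__main__":
    for op in ["add", "mul", "add", "logic", "load", "add"]:
        print(op, "->", issue(op))
```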
Parallelizing Multigrid Application Using Data-driven Programming Model
GUO Jie, GAO Xi-ran, CHEN Li, FU You, LIU Ying
Computer Science    2020, 47 (8): 32-40.   DOI: 10.11896/jsjkx.200500093
Multigrid is an important family of algorithms for accelerating the convergence of iterative solvers for linear systems, and it plays an important role in large-scale scientific computing. At present, distributed-memory systems have evolved into large-scale systems based on multi-core nodes or heterogeneous nodes with accelerators, and legacy applications urgently need to be ported to modern supercomputers with diverse node-level architectures. In this paper, a data-driven programming language, AceMesh, is introduced, and with this directive language NAS MG is ported to two Chinese home-grown supercomputers, Tianhe-2 and Sunway TaihuLight. The paper shows how to taskify computation loops and communication-related code in AceMesh, and analyzes the characteristics of the resulting task graph and of its computation-communication overlapping. Experimental results show that, compared with traditional programming models, the AceMesh versions achieve relative speedups of up to 1.19X and 1.85X on Sunway TaihuLight and Tianhe-2 respectively. Analysis shows that the performance improvements come from two main sources: memory-related optimization and communication overlapping. Finally, future directions are put forward to further optimize inter-process communication in the AceMesh version.
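The sketch below illustrates the general data-driven idea behind such a directive language, not AceMesh's actual syntax or runtime: each task declares the data it reads and writes, flow dependencies are inferred from those declarations, and a task starts as soon as its producers finish, which is what allows smoothing, restriction and halo-exchange tasks to overlap.

```python
# Illustrative data-driven task graph (only read-after-write dependencies are
# tracked here; a real runtime also handles write-after-read/write ordering).
import concurrent.futures as cf

class TaskGraph:
    def __init__(self):
        self.pool = cf.ThreadPoolExecutor()
        self.last_writer = {}                 # data name -> future producing it

    def task(self, fn, reads=(), writes=()):
        deps = [self.last_writer[d] for d in reads if d in self.last_writer]
        fut = self.pool.submit(self._run, fn, deps)
        for d in writes:
            self.last_writer[d] = fut
        return fut

    @staticmethod
    def _run(fn, deps):
        for d in deps:                        # wait for producers (data-driven)
            d.result()
        return fn()

if __name__ == "__main__":
    g = TaskGraph()
    g.task(lambda: print("smooth fine grid"), writes=["u_fine"])
    g.task(lambda: print("exchange halos"), reads=["u_fine"], writes=["halo"])
    g.task(lambda: print("restrict to coarse grid"), reads=["u_fine"], writes=["u_coarse"])
    g.pool.shutdown(wait=True)
```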
Joint Optimization Algorithm for Partition-Scheduling of Dynamic Partial Reconfigurable Systems Based on Simulated Annealing
WANG Zhe, TANG Qi, WANG Ling, WEI Ji-bo
Computer Science    2020, 47 (8): 26-31.   DOI: 10.11896/jsjkx.200500110
FPGA-based dynamic partial reconfiguration (DPR) technology has many applications in high-performance computing because of its advantages in processing efficiency and power consumption. In a DPR system, the partitioning of the reconfigurable region and the task scheduling determine the performance of the entire system. Therefore, modeling the logic resource partitioning and scheduling of the DPR system and devising an efficient solution algorithm are the keys to ensuring system performance. On top of the established partitioning and scheduling model, a joint partition-scheduling optimization algorithm based on simulated annealing (SA) is designed to optimize reconfigurable region partitioning and task scheduling together. A new method is proposed to skip infeasible and poor solutions effectively, which accelerates the search of the solution space and increases the convergence speed of the SA algorithm. Experimental results show that, compared with mixed integer linear programming (MILP) and IS-k, the proposed SA-based algorithm has lower time complexity, and for large-scale applications it can find better partitioning and scheduling results in a short time.
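The sketch below is a generic simulated-annealing skeleton with the abstract's "skip infeasible and poor solutions" idea reduced to rejecting infeasible neighbors outright; the cost function, feasibility test and neighborhood are a toy load-balancing stand-in, not the paper's DPR partition-scheduling model.

```python
# Generic SA skeleton with infeasible-neighbor skipping (toy problem).
import math, random

def anneal(init, cost, neighbor, feasible, t0=10.0, cooling=0.995, steps=3000):
    cur, cur_cost = init, cost(init)
    best, best_cost = cur, cur_cost
    t = t0
    for _ in range(steps):
        cand = neighbor(cur)
        if not feasible(cand):                 # skip infeasible neighbors outright
            continue
        delta = cost(cand) - cur_cost
        if delta < 0 or random.random() < math.exp(-delta / t):
            cur, cur_cost = cand, cur_cost + delta
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
        t *= cooling
    return best, best_cost

if __name__ == "__main__":
    random.seed(0)
    # toy stand-in: assign 8 tasks to 3 partitions, minimize the largest load
    loads = [5, 3, 8, 2, 7, 4, 6, 1]

    def part_load(assign, k):
        return sum(l for l, p in zip(loads, assign) if p == k)

    def cost(assign):
        return max(part_load(assign, k) for k in range(3))

    def neighbor(assign):
        cand = list(assign)
        cand[random.randrange(len(cand))] = random.randrange(3)   # move one task
        return cand

    def feasible(assign):
        return all(part_load(assign, k) <= 15 for k in range(3))  # capacity limit

    print(anneal([i % 3 for i in range(8)], cost, neighbor, feasible))
```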
Communication Optimization Method of Heterogeneous Cluster Application Based on New Language Mechanism
CUI Xiang, LI Xiao-wen, CHEN Yi-feng
Computer Science    2020, 47 (8): 17-25.   DOI: 10.11896/jsjkx.200100124
Compared with traditional clusters, heterogeneous clusters have an obvious advantage in cost-performance. However, compared with the rapidly developing hardware technology, current software technology lags behind and cannot keep up with constantly updated heterogeneous hardware and ultra-large-scale parallel computing environments. Currently, the common solution is to directly combine the parallel programming tools of the different hardware. The disadvantage of this combined solution is that the programming level is low, making development, modification and debugging difficult. This paper introduces a new language mechanism to describe the multi-dimensional regular structure, layout and communication patterns of data and threads, and proposes a new method for migrating and optimizing software across heterogeneous systems based on this mechanism. Taking direct numerical simulation of turbulence as an example, communication optimization and fast migration to different heterogeneous systems are realized.
Survey of Heterogeneous Hybrid Parallel Computing
YANG Wang-dong, WANG Hao-tian, ZHANG Yu-feng, LIN Sheng-le, CAI Qin-yun
Computer Science    2020, 47 (8): 5-16.   DOI: 10.11896/jsjkx.200600045
With the rapid increase in the computing power demanded by applications such as artificial intelligence and big data, and the diversification of application scenarios, heterogeneous hybrid parallel computing has become a research focus. This paper introduces the current main heterogeneous computer architectures, including CPU/coprocessor, CPU/many-core processor, CPU/ASIC and CPU/FPGA heterogeneous architectures. The changes that heterogeneous hybrid parallel programming models have undergone with the development of these architectures are briefly described: transforming and re-implementing an existing language, extending an existing heterogeneous programming language, heterogeneous programming with directive statements, or container-based collaborative programming. The analysis shows that heterogeneous hybrid parallel computing architectures will further strengthen support for AI and will also enhance software versatility. The paper reviews the key technologies in heterogeneous hybrid parallel computing, including parallel task partitioning, task mapping, data communication, data access between heterogeneous processors, parallel synchronization of heterogeneous collaboration, and pipeline parallelism across heterogeneous resources. Based on these key technologies, the challenges faced by heterogeneous hybrid parallel computing are analyzed, such as programming difficulty, poor portability, large data communication overhead, complex data access, complex parallel control, and uneven resource load. The paper concludes that the current key core technologies need breakthroughs in integrating general-purpose and AI-specific heterogeneous computing, seamless migration across heterogeneous architectures, a unified programming model, the integration of storage and computing, and intelligent task division and allocation.
Development of Parallel Computing Subject
CHEN Guo-liang, ZHANG Yu-jie
Computer Science    2020, 47 (8): 1-4.   DOI: 10.11896/jsjkx.200600027
Computational science has become the third pillar of science alongside traditional theoretical science and experimental science; they complement each other and together promote the development of science and technology and the progress of social civilization. The research frontiers of key scientific and economic problems in the 21st century may be solved by computing technology and computational science. High-performance computing is a manifestation of a country's comprehensive national strength and one of the key technologies supporting its continued development, with an important strategic position in national defense security, high-tech development and national economic construction. Through more than 40 years of development, we have focused on parallel computers, parallel algorithms and parallel programming, integrating parallel computer architecture, numerical and non-numerical parallel algorithm design, and parallel program design, and forming the parallel computing "architecture-algorithm-programming-application" disciplinary system and curriculum framework. This article reviews the work we have done in the development of parallel computing, and introduces computational methods for non-numerical computing and research on new non-von Neumann computer architectures.