Friday, December 22, 2017, 17:00 - 18:30


            Title: Performance Optimizations on a Many-core Processor

            Speaker: Xinhua (James) Lin, Ronpaku student, Department of Mathematical and Computing Sciences, Tokyo Institute of Technology

            Location: Global Scientific Information and Computing Center, Room 206, 2F (GSIC 2F large meeting room), Ookayama Campus,

            Tokyo Institute of Technology


           [Date/Time] Friday, December 22, 17:00 - 18:30

           [Venue] Global Scientific Information and Computing Center (Information Building), 2F meeting room





            The recent end of frequency scaling and Dennard scaling has driven processor vendors to keep adding cores, leading to many-core
            processors. Compared with conventional multi-core processors, heterogeneous many-core processors adopt more energy-efficient
            designs in their architecture, including in-order cores, an explicit memory hierarchy, and direct communication among cores.
            However, these three energy-efficient designs also expose more explicit parallelism to programmers, resulting in growing
            programming challenges.


            Without loss of generality, this study takes the SW26010 many-core processor, used in China's home-grown TaihuLight
            supercomputer, as an example. First, the SW26010 adopts a dual-issue pipeline; due to the lack of efficient compiler support,
            programmers need to schedule instructions manually. Second, the SW26010 uses a 64 KB scratchpad memory (SPM) as the local
            data memory (LDM) of each compute processing element (CPE). The LDM is managed by software, so programmers must use DMA to
            transfer data explicitly between this local storage and main memory. Third, the register-level communication (RLC) adopted
            in the SW26010 uses an anonymous producer-consumer protocol, meaning that RLC messages carry no ID tags. Addressing this
            constraint requires programmers to orchestrate the message sequences manually.
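            As an illustration only, the tagless protocol can be pictured as a plain FIFO channel between two cores: because messages
            carry no ID, the producer and consumer must agree on the send/receive order in advance, and the meaning of each message is
            determined purely by its position in the sequence. The sketch below is hypothetical (names and sizes are ours, not the
            SW26010 API):

```c
#include <stddef.h>

/* Hypothetical sketch of an anonymous producer-consumer channel: a plain
   FIFO with no per-message tag, so correctness depends entirely on both
   sides agreeing on the message order beforehand. */
#define CHAN_CAP 8

typedef struct {
    double buf[CHAN_CAP];
    size_t head, tail;            /* FIFO indices; no ID travels with the data */
} rlc_chan;

static void chan_init(rlc_chan *c) { c->head = c->tail = 0; }

static int chan_put(rlc_chan *c, double v) {      /* producer side */
    if (c->tail - c->head == CHAN_CAP) return 0;  /* channel full */
    c->buf[c->tail++ % CHAN_CAP] = v;
    return 1;
}

static int chan_get(rlc_chan *c, double *v) {     /* consumer side */
    if (c->head == c->tail) return 0;             /* channel empty */
    *v = c->buf[c->head++ % CHAN_CAP];
    return 1;
}
```

            If the producer ever deviates from the agreed order, the consumer silently interprets the wrong value, which is exactly why
            the message sequences must be orchestrated by hand.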


            This study aims to tackle the programming challenges posed by the explicit parallelism of heterogeneous many-core processors.
            This has been achieved through three interrelated studies. The purpose of the first study was to uncover the underlying
            architecture of the SW26010. To this end, we developed a micro-benchmark suite in assembly language and used it to evaluate
            the key features of the SW26010, including its CPE pipelines, RLC, and on-/off-chip memory bandwidth. We found that the
            latency of RLC's broadcast mode is the same as that of its P2P mode. We then summarized our key findings in comparison with
            the publicly available data.


            Based on the findings of the first study, we conducted a second one in which we used two compute-bound kernels to identify
            potential programming challenges. The first kernel is double-precision general matrix multiplication (DGEMM). The initial
            implementation of DGEMM had a much lower arithmetic intensity than the SW26010 requires. We therefore designed a novel
            RLC-based algorithm to reuse data already residing in the 64 CPEs, and applied several communication-oriented optimizations.
            These efforts improved efficiency to up to 88.7% on one core group of the SW26010. The second kernel is a direct N-body
            simulation. Due to the lack of efficient hardware support, reciprocal square root (rsqrt) operations turned out to be the
            performance bottleneck of N-body on the SW26010. Guided by a semi-empirical performance model that we developed, we applied
            computation-oriented optimizations, including strength reduction and instruction mixing. These optimizations achieved about
            25% efficiency on one core group of the SW26010.
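            To give the flavor of such strength reduction (a generic sketch, not the SW26010 implementation): a reciprocal square root
            can be refined from a cheap initial estimate with Newton-Raphson steps that use only multiplies and adds, so no divide or
            sqrt instruction appears in the refinement loop.

```c
#include <stdint.h>

/* Generic strength-reduction sketch: compute 1/sqrt(x) with no divide or
   sqrt instruction in the refinement loop. The magic-constant initial
   estimate is the classic bit-level trick adapted to doubles. */
static double rsqrt_estimate(double x) {
    union { double d; uint64_t u; } v;
    v.d = x;
    v.u = 0x5FE6EB50C7B537A9ULL - (v.u >> 1);  /* crude initial guess */
    return v.d;
}

static double rsqrt(double x, int iters) {
    double y = rsqrt_estimate(x);
    while (iters-- > 0)
        y = y * (1.5 - 0.5 * x * y * y);       /* Newton step: mul/add only */
    return y;
}
```

            Each Newton step roughly squares the relative error, so a handful of iterations recovers full double precision from the
            crude estimate.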


            Despite these comprehensive optimizations, a single memory-bound kernel, such as SpMV, still cannot perform well on the
            SW26010 because of the processor's limited memory bandwidth. Instead, overall performance may be improved effectively by
            overlapping multiple memory-bound kernels within one application. The aim of the third study is to optimize the memory-bound
            preconditioned conjugate gradient (PCG) solver. First, we proposed the RLC-friendly non-blocking PCG (RNPCG) to minimize the
            allreduce communication cost. We designed a software RLC_allreduce operation on the 64 CPEs and manually scheduled its
            instructions to overlap communication with computation. Next, to further improve RNPCG performance, we optimized the other
            three key kernels of PCG, including proposing a localized version of the diagonal-based incomplete Cholesky (LDIC)
            preconditioner. We implemented RNPCG in OpenFOAM, and the experimental results show that our RNPCG method and LDIC
            preconditioner achieve speedups of up to 8.9x and 3.1x, respectively, over the default OpenFOAM implementations.


            About the Speaker

            Xinhua (James) Lin is a Ronpaku (thesis-only PhD) student at Tokyo Institute of Technology. He is also the vice director of
            the High Performance Computing Center at Shanghai Jiao Tong University and a visiting associate of the Global Scientific
            Information and Computing Center at Tokyo Institute of Technology. His research interests mainly lie in the area of high
            performance computing.



  • Prof. Satoshi Matsuoka, Professor, Global Scientific Information and Computing Center & Dept. of Mathematical and Computing Sciences at Tokyo Institute of Technology (Advisor)

  • Prof. Osamu Watanabe, Professor, Dean, School of Computing at Tokyo Institute of Technology
  • Prof. Ken Wakita, Associate Professor, Dept. of Mathematical and Computing Sciences at Tokyo Institute of Technology
  • Prof. Toshio Endo, Associate Professor, Global Scientific Information and Computing Center at Tokyo Institute of Technology
  • Prof. Rio Yokota, Associate Professor, Global Scientific Information and Computing Center at Tokyo Institute of Technology