Friday, December 22, 2017, 17:00 - 18:30
Title: Performance Optimizations on a Many-core Processor
Speaker: Xinhua (James) Lin, Ronpaku student, Department of Mathematical and Computing Sciences, Tokyo Institute of Technology
Location: Global Scientific Information and Computing Center, Rm206, 2 Fl (GSIC 2F large meeting room), Ookayama Campus,
Tokyo Institute of Technology
The end of frequency scaling and Dennard scaling has led processor vendors to keep adding cores to their processors, which
has given rise to many-core processors. Compared with conventional multi-core processors, heterogeneous many-core processors adopt more
energy-efficient architectural designs, including in-order cores, an explicit memory hierarchy, and direct communication among
cores. However, these three energy-efficient designs expose more explicit parallelism to
programmers, which makes programming increasingly challenging.
Without loss of generality, this study takes the SW26010 many-core processor, used in China's home-grown TaihuLight
supercomputer, as an example. First, the SW26010 adopts a dual-issue pipeline. Due to the lack of efficient compiler support,
programmers need to schedule instructions manually. Second, the SW26010 uses a 64 KB scratchpad memory (SPM) as the local data
memory (LDM) for each computing processing element (CPE). The LDM is managed by software, so programmers must use DMA to
explicitly transfer data between the LDM and main memory. Third, the register-level communication (RLC) adopted in the SW26010
uses an anonymous producer-consumer protocol, so an RLC message contains no ID tag. Working around this constraint requires
programmers to orchestrate the message sequences manually.
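The software-managed LDM described above can be pictured with a small simulation. The Python sketch below is only illustrative: `dma_get`, `dma_put`, and the decision to use half of the 64 KB for one buffer are assumptions made here, not the SW26010 programming interface.

```python
LDM_BYTES = 64 * 1024                   # per-CPE LDM size stated in the abstract
DOUBLE_BYTES = 8                        # size of one double-precision value
TILE = LDM_BYTES // DOUBLE_BYTES // 2   # assumption: one buffer uses half the LDM


def dma_get(main_memory, start, count):
    """Hypothetical stand-in for a DMA read: copy a slice into a local buffer."""
    return list(main_memory[start:start + count])


def dma_put(main_memory, start, local_buffer):
    """Hypothetical stand-in for a DMA write: copy the local buffer back."""
    main_memory[start:start + len(local_buffer)] = local_buffer


def scale_in_tiles(main_memory, factor):
    """Scale every element in place, touching main memory only via explicit transfers.

    Returns the number of simulated DMA transfers that were issued.
    """
    transfers = 0
    for start in range(0, len(main_memory), TILE):
        local = dma_get(main_memory, start, TILE)   # main memory -> local buffer
        local = [factor * x for x in local]         # compute on local data only
        dma_put(main_memory, start, local)          # local buffer -> main memory
        transfers += 2
    return transfers
```

The point of the sketch is that the tile size, the transfer order, and the transfer count are all decided by the programmer rather than by a hardware cache.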
This study aims to tackle the programming challenges posed by the explicit parallelism of heterogeneous many-core processors. This has
been achieved through three interrelated studies. The purpose of the first study was to understand the underlying architecture of the
SW26010 by means of a micro-benchmark suite. Specifically, we developed the micro-benchmark suite in assembly language and used it to
evaluate the key features of the SW26010, including its CPE pipelines, RLC, and on/off-chip memory bandwidth. We found that the latency
of the broadcast mode of RLC is the same as that of the P2P mode. We then summarize our key findings in comparison with the public data.
Based on the findings from the first study, we conducted the second one, in which we used two compute-bound kernels to identify the
potential programming challenges. The first kernel is double-precision general matrix multiplication (DGEMM). The initial implementation
of DGEMM had a much lower arithmetic intensity than the SW26010 needs to approach its peak performance. We thus designed a novel
algorithm that uses RLC to reuse the data already residing in the 64 CPEs, and applied several communication-oriented optimizations.
These efforts raised the efficiency to as much as 88.7% in one core group of the SW26010. The second kernel is the direct N-body
simulation. Due to the lack of efficient hardware support, the reciprocal square root (rsqrt) operations turned out to be the
performance bottleneck of N-body on the SW26010. Guided by a semi-empirical performance model that we developed, we applied
computation-oriented optimizations, including strength reduction and instruction mix. These optimizations achieved about 25%
efficiency in one core group of the SW26010.
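The abstract does not say which strength reduction was applied to rsqrt. One common approach, sketched below in Python as an assumption rather than as the speaker's method, replaces the divide and square root with a crude initial guess refined by Newton-Raphson steps that use only multiplications and subtractions. The names `rsqrt_newton` and `pair_accel`, the initial guess, and the iteration count are all hypothetical choices for illustration.

```python
import math


def rsqrt_newton(x, iterations=8):
    """Approximate 1/sqrt(x) for x > 0 without dividing or taking a square root.

    Each step computes y <- y * (1.5 - 0.5 * x * y * y). The crude power-of-two
    starting guess is an assumption of this sketch; hardware would normally
    supply a better estimate and need far fewer steps.
    """
    _, exponent = math.frexp(x)                   # x = m * 2**exponent, 0.5 <= m < 1
    y = math.ldexp(1.0, -(exponent // 2))         # guess: roughly 2**(-exponent / 2)
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)           # Newton-Raphson refinement
    return y


def pair_accel(dx, dy, dz, mass, softening2=0.0):
    """Acceleration on one body from another, using rsqrt instead of sqrt and divide."""
    r2 = dx * dx + dy * dy + dz * dz + softening2
    inv_r = rsqrt_newton(r2)
    scale = mass * inv_r * inv_r * inv_r          # mass / r**3
    return dx * scale, dy * scale, dz * scale
```

In a direct N-body kernel, `pair_accel` would sit in the innermost loop, which is why the cost of the rsqrt step dominates.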
Despite comprehensive optimizations, a single memory-bound kernel, such as SpMV, still cannot perform well on the SW26010
due to the limited memory bandwidth of the processor. Instead, the overall performance may be improved effectively by overlapping
multiple memory-bound kernels within one application. The aim of the third study is to optimize the memory-bound preconditioned
conjugate gradient (PCG) method. First, we proposed the RLC-friendly Non-blocking PCG (RNPCG) to minimize the all_reduce
communication cost. We designed a software RLC_allreduce operation on 64 CPEs and manually scheduled the RLC_allreduce instructions
to overlap communication with computation. Next, to improve the RNPCG performance further, we optimized the other three key kernels
of PCG, which included proposing a localized version of the Diagonal-based Incomplete Cholesky (LDIC) preconditioner. We implemented
the RNPCG in OpenFOAM, and the experimental results show that our RNPCG method and LDIC preconditioner achieve speedups of up to
8.9X and 3.1X, respectively, compared with the default implementations in OpenFOAM.
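To show where the all_reduce cost in PCG comes from, the following Python sketch implements the textbook blocking PCG, not RNPCG. A simple diagonal (Jacobi) preconditioner stands in for the DIC and LDIC preconditioners, and the function names and dense-matrix representation are assumptions of this sketch. Each `dot` call marks a point where a parallel implementation would need a global reduction across all cores.

```python
def dot(u, v):
    """Inner product; in a parallel run this is where an all_reduce would occur."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, x):
    """Dense matrix-vector product (a real solver would use sparse SpMV)."""
    return [dot(row, x) for row in matrix]


def pcg(matrix, rhs, tol=1e-10, max_iter=100):
    """Solve matrix * x = rhs for a symmetric positive definite matrix.

    Returns the solution and the number of iterations performed.
    """
    n = len(rhs)
    x = [0.0] * n
    r = list(rhs)                                      # residual for x = 0
    inv_diag = [1.0 / matrix[i][i] for i in range(n)]  # Jacobi preconditioner
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)                                     # reduction
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:                    # reduction (convergence check)
            return x, iteration
        ap = matvec(matrix, p)
        alpha = rz / dot(p, ap)                        # reduction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)                             # reduction
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Because every reduction result is needed immediately by the next update, the blocking form leaves no room to hide communication, which is the motivation for non-blocking variants such as the one described above.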
About the Speaker
Xinhua (James) Lin is a Ronpaku (thesis-only PhD) student at Tokyo Institute of Technology. He is also the vice director of the High
Performance Center at Shanghai Jiao Tong University and a visiting associate of the Global Scientific Information and Computing Center
at Tokyo Institute of Technology. His research interests lie mainly in high performance computing.
- Prof. Satoshi Matsuoka, Professor, Global Scientific Information and Computing Center & Dept. of Mathematical and Computing
Sciences at Tokyo Institute of Technology (Advisor)
- Prof. Osamu Watanabe, Professor, Dean, School of Computing at Tokyo Institute of Technology
- Prof. Ken Wakita, Associate Professor, Dept. of Mathematical and Computing Sciences at Tokyo Institute of Technology
- Prof. Toshio Endo, Associate Professor, Global Scientific Information and Computing Center at Tokyo Institute of Technology
- Prof. Rio Yokota, Associate Professor, Global Scientific Information and Computing Center at Tokyo Institute of Technology