Presentation
(RP28) Performance Tuning of Deep Learning Framework Chainer on the K Computer.
Session: Research Posters Session
Event Type: Research Poster

AI/Machine Learning/Deep Learning
Performance Analysis and Optimization
Time: Tuesday, June 18th, 8:30am - 10am CEST
Location: Substanz 1, 2
Description: Recently, applications and research in machine learning based on deep learning have become popular on GPUs. However, it is also possible to perform many of these calculations on the CPUs of massively parallel computers. Here we introduce several performance tuning procedures for Chainer, a representative machine learning framework, on the K computer.
Chainer expresses the hierarchical structure of deep learning in Python, and all calculations can be realized with NumPy without special libraries. By optimizing the handling of floating-point underflow exceptions when building Python, the elapsed time was reduced to 1/3.39 of the original. Moreover, by replacing the SSL2 GEMM library called from Python with its thread-parallel version, the elapsed time of that section was reduced to 1/4.54, the total elapsed time to 1/1.15, and the performance efficiency improved by about 47.0%.
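As a rough illustration of why the GEMM library matters, the forward pass of a fully connected layer in a Chainer-style network reduces to a single NumPy matrix multiplication, which NumPy dispatches to the linked BLAS (SSL2 on the K computer). The sketch below is illustrative only; the function name and shapes are assumptions, not Chainer's internals.

```python
import numpy as np

def linear_forward(x, W, b):
    # Fully connected layer: y = x W^T + b.
    # np.dot on 2-D float arrays is dispatched to the linked BLAS GEMM
    # (SSL2 on the K computer), so swapping in the thread-parallel GEMM
    # directly changes this kernel's elapsed time.
    return x.dot(W.T) + b

# Toy shapes: batch of 32 samples, 784 inputs, 1000 hidden units.
x = np.random.rand(32, 784).astype(np.float32)
W = np.random.rand(1000, 784).astype(np.float32)
b = np.zeros(1000, dtype=np.float32)
y = linear_forward(x, W, b)
```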
Much of the remaining cost was the square-root computation and elementwise arithmetic in the filter updates and activation functions. These operations are not optimized when computed with NumPy and are particularly slow on the K computer. By replacing these kernels with a Fortran library using software pipelining and SIMD optimization, the kernel elapsed time was reduced to 1/11.08 and the total elapsed time to 1/16.23.
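For example, an Adam-style filter update contains exactly the kind of elementwise square root and arithmetic described above. The plain NumPy baseline below is a minimal sketch of such a kernel; the hypothetical library name and entry point in the trailing comments are assumptions, not the authors' actual Fortran code.

```python
import numpy as np

def adam_update_numpy(param, grad, m, v,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Plain NumPy filter update: the elementwise moment updates and the
    # square root in the step size are the operations that remain slow
    # on the K computer when left to NumPy.
    m += (1 - beta1) * (grad - m)
    v += (1 - beta2) * (grad * grad - v)
    param -= lr * m / (np.sqrt(v) + eps)

# Toy usage on a small filter tensor.
shape = (64, 3, 3, 3)
param = np.random.rand(*shape).astype(np.float32)
grad = np.random.rand(*shape).astype(np.float32)
m = np.zeros(shape, dtype=np.float32)
v = np.zeros(shape, dtype=np.float32)
adam_update_numpy(param, grad, m, v)

# Hypothetical replacement: call a software-pipelined, SIMD-optimized
# Fortran kernel instead of the NumPy expressions above, e.g. via ctypes
# (library name and entry point are assumptions):
#   lib = ctypes.CDLL("libadam_kernel.so")
#   lib.adam_update(param.ctypes.data, grad.ctypes.data, ...)
```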
There are some limitations on the use of Chainer on the K computer. However, with these tuning effects and the CPU-parallel version of Chainer, deep learning calculations have become possible on the K computer and the post-K computer.