(RP21) Optimizing Deep Learning LSTM Topologies on Intel Xeon Architecture
AI/Machine Learning/Deep Learning
Performance Analysis and Optimization
Time: Tuesday, June 18th, 8:30am - 10am CEST
Location: Substanz 1, 2
Description: Long short-term memory (LSTM) is a type of recurrent neural network which is well-suited for processing temporal data. In this work, we present an optimized implementation of the LSTM cell for Intel Xeon architecture. Typical implementations of the LSTM cell employ one or two large GEMM calls and then apply element-wise operations (sigmoid/tanh) to the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse depends on how the GEMMs are parallelized and is sub-optimal for GEMM sizes stemming from small minibatches. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM, which is typically compute-bound. To address this discrepancy, we implemented a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, and (c) fuse the element-wise operations after partial GEMM blocks are computed, while they are still hot in cache. Additionally, we bring the time-step loop inside our cell to further increase the weight reuse and to amortize the overhead of transforming the weights into a blocked layout. The results show that our forward pass can be up to 1.4x faster than the MKL-DNN implementation, whereas the backward/update pass can be up to 1.3x faster. Furthermore, we modified the TensorFlow framework to use our LSTM cell for end-to-end training of Google’s neural machine translation application and attained an identical BLEU score in the same number of iterations as the original TensorFlow implementation, while showing a 1.9x speed-up for an 8-layer German-to-English translation model.
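To make the baseline that the description contrasts against concrete, the "large GEMM followed by element-wise operations" structure of one LSTM time step can be sketched in NumPy. This is an illustrative sketch, not the talk's implementation: the tensor names, shapes, and the i/c̃/f/o gate order below are assumptions, and the actual work uses a blocked C/C++ GEMM rather than NumPy.

```python
import numpy as np

def sigmoid(x):
    # Logistic activation applied element-wise after the GEMM.
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, R, b):
    """One LSTM time step in the 'large GEMM + element-wise' style.

    x:      (N, C)  minibatch input at this time step
    h_prev: (N, K)  previous hidden state
    c_prev: (N, K)  previous cell state
    W:      (C, 4K) input weights for all four gates
    R:      (K, 4K) recurrent weights for all four gates
    b:      (4K,)   bias
    The i, c~, f, o gate ordering is an illustrative choice.
    """
    K = h_prev.shape[1]
    # Two large GEMMs producing all four gate pre-activations at once
    # (often fused into a single GEMM by concatenating [x, h_prev]).
    g = x @ W + h_prev @ R + b            # (N, 4K)
    # Element-wise pass over the GEMM result -- bandwidth-bound,
    # which is the discrepancy the blocked/fused approach addresses.
    i  = sigmoid(g[:, 0*K:1*K])           # input gate
    ct = np.tanh(g[:, 1*K:2*K])           # candidate cell state
    f  = sigmoid(g[:, 2*K:3*K])           # forget gate
    o  = sigmoid(g[:, 3*K:4*K])           # output gate
    c = f * c_prev + i * ct               # new cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c
```

The optimization described above instead computes the GEMM in blocks and applies the sigmoid/tanh to each partial block while it is still in cache, rather than in a separate pass over the full (N, 4K) result.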