Research Poster
(RP03) Design of an FPGA-based Matrix Multiplier with Task Parallelism
Research Poster Authors
Event Type
Research Poster
HPC Accelerators
Time: Tuesday, June 18th, 8:30am - 10am
Location: Substanz 1, 2
Description: Matrix multiplication is one of the fundamental building blocks of linear algebra, and it demands substantial computing capability as the problem size grows. In this research, an FPGA-based matrix multiplier with task parallelism is designed and implemented on the DE5a-NET FPGA board. The multiplier is based on a systolic array architecture with 10 × 16 processing elements. All modules except the data-loading modules are autorun to hide computation overhead. To reuse data, elements of matrix A are shifted from left to right and elements of matrix B are moved from top to bottom through the systolic array. After implementation on the FPGA, the proposed matrix multiplier uses more DSP blocks and achieves a much higher clock frequency than Intel's OpenCL example with data parallelism on FPGA. For single-precision floating-point data, the proposed multiplier achieves on average about 785 GFLOPS in computation throughput and 81 GFLOPS/W in energy efficiency. Compared with Intel's OpenCL example with data parallelism on FPGA and with software simulations based on the Intel MKL and OpenBLAS libraries, the proposed matrix multiplier outperforms them on average by factors of 3.2, 1.3, and 1.6 in computation throughput and by factors of 3.4, 12.7, and 14.6 in energy efficiency, respectively, even though the FPGA is fabricated in 20 nm technology while the CPU is fabricated in 14 nm.
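The data movement described above (matrix A flowing left to right, matrix B flowing top to bottom) can be illustrated with a small software model. The following is a minimal sketch, not the poster's OpenCL design. It assumes an output-stationary arrangement in which each processing element accumulates one element of the result, which the abstract does not state, and it sizes the array to the output matrix rather than to the 10 × 16 grid. The function name `systolic_matmul` and the register names are hypothetical.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle model of an output-stationary systolic array.

    Assumption: PE (i, j) accumulates C[i][j]; the poster does not
    specify where partial sums are held.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    a_reg = [[0.0] * cols for _ in range(rows)]  # A values held in each PE
    b_reg = [[0.0] * cols for _ in range(rows)]  # B values held in each PE
    acc = [[0.0] * cols for _ in range(rows)]    # per-PE accumulators

    # The last operand pair reaches the bottom-right PE after this many cycles.
    total_cycles = inner + rows + cols - 2

    for t in range(total_cycles):
        # Shift A one PE to the right; feed row i at the left edge,
        # delayed (skewed) by i cycles.
        for i in range(rows):
            for j in range(cols - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < inner else 0.0

        # Shift B one PE downward; feed column j at the top edge,
        # delayed (skewed) by j cycles.
        for j in range(cols):
            for i in range(rows - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < inner else 0.0

        # Every PE performs one multiply-accumulate per cycle.
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc
```

Because each value of A is passed along a whole row of processing elements and each value of B along a whole column, every operand is fetched from memory once and reused by multiple multiply-accumulate units, which is the data reuse the description refers to. The input skew ensures that matching operands A[i][k] and B[k][j] arrive at PE (i, j) in the same cycle.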