Convolutional computation kernels are fundamental to today's edge computing applications. Multi-threaded processor cores are a promising approach for pursuing high energy efficiency and low hardware cost in edge computing systems, yet they require hardware acceleration schemes to handle heavy computational workloads such as convolutional algorithms. Following a vector approach to accelerating convolutions, this study explores alternative implementations of vector coprocessing units, showing that the optimal balance among the hardware architecture parameters is application-dependent.
A set of more than twelve coprocessor acceleration schemes will be designed, implemented in RTL VHDL/SystemVerilog code, synthesized on FPGA, and compared with respect to several performance figures: total clock-cycle count and absolute execution time of the convolution kernels, hardware resource utilization, and energy efficiency. The goal of the research is to demonstrate that pure data-level parallelism is not the most efficient way to achieve speed, and that a hybrid combination of data-level and thread-level parallelism is likely the optimal choice.
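To make the target workload concrete, the sketch below shows a plain 1D convolution kernel in C. This is an illustrative example only, not one of the coprocessor schemes from the study: it highlights that each output element is an independent multiply-accumulate chain, which is exactly the structure that a vector (data-level parallel) unit can exploit, while independent output ranges can additionally be split across threads (thread-level parallelism).

```c
#include <stddef.h>

/* Illustrative 1D convolution (hypothetical example, not the study's
 * RTL implementation). Each iteration of the outer loop produces one
 * independent output element, so outputs can be computed in parallel
 * by vector lanes (DLP) or partitioned across threads (TLP). */
static void conv1d(const int *in, size_t n,
                   const int *coef, size_t k,
                   int *out)
{
    for (size_t i = 0; i + k <= n; i++) {   /* one output per position */
        int acc = 0;
        for (size_t j = 0; j < k; j++)      /* MAC chain over the taps */
            acc += in[i + j] * coef[j];
        out[i] = acc;
    }
}
```

A hybrid scheme, as the study argues, would assign contiguous ranges of `i` to different hardware threads while each thread's coprocessor vectorizes the inner multiply-accumulate loop.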