Data-parallel programming languages are an important component of today's parallel computing landscape. They include domain-specific languages such as shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and "general-purpose" languages such as CUDA and OpenCL. To achieve maximum performance for such languages on CPUs, one has to exploit both multi-threading and the additional intra-core parallelism provided by the processors' SIMD instruction sets (such as Intel's SSE, AVX, and LRBni). This intra-core parallelism is becoming more important as SIMD registers grow wider (e.g. SSE = 128 bit, AVX = 256 bit, LRBni = 512 bit).
Whole-Function Vectorization is an algorithm that transforms a scalar function in such a way that it computes W executions of the original code in parallel using SIMD instructions (W is the chosen vectorization factor which usually depends on the target architecture's SIMD width). Our implementation of the algorithm ("libWFV") is a language- and platform-independent code transformation that works on low-level intermediate code given by a control-flow graph in SSA form (LLVM bitcode).
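To make the effect of the transformation concrete, the following sketch shows a scalar function next to a hand-written counterpart that computes W = 4 executions per call. It is only an illustration of the input/output relationship, not output produced by libWFV: the names `scale_add`, `scale_add_wfv`, `Vec`, and `check_equivalence` are hypothetical, and a plain `std::array` stands in for a SIMD register so that the example stays portable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumed vectorization factor: 4 lanes of 32-bit floats fill one
// 128-bit SSE register.
constexpr std::size_t W = 4;

// Stand-in for one SIMD register (hypothetical type, for illustration).
using Vec = std::array<float, W>;

// Original scalar function: one execution per call.
float scale_add(float x, float a, float b) { return a * x + b; }

// Hand-written vectorized counterpart: one call performs W executions.
// `x` differs per instance, so it becomes a vector; `a` and `b` are the
// same for all instances and stay scalar.
Vec scale_add_wfv(const Vec& x, float a, float b) {
    Vec result{};
    for (std::size_t lane = 0; lane < W; ++lane) {
        result[lane] = a * x[lane] + b;  // one SIMD multiply-add in real code
    }
    return result;
}

// Checks that every lane equals the scalar result for that lane.
// Exact comparison is only meaningful for inputs whose results are
// exactly representable, such as small integers.
void check_equivalence(const Vec& x, float a, float b) {
    const Vec packed = scale_add_wfv(x, a, b);
    for (std::size_t lane = 0; lane < W; ++lane) {
        assert(packed[lane] == scale_add(x[lane], a, b));
    }
}
```

Calling `scale_add_wfv` once on four inputs replaces four calls to `scale_add`; on real hardware the loop body would map to SIMD instructions instead of a per-lane loop.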
Highlights of the implementation include the ability to deal with arbitrary control flow structures even on architectures without explicit predicated execution, advanced analyses and algorithms to exploit "uniform" control flow, robust handling of non-vectorizable operations, and a slim interface for efficient integration into LLVM-based compilers.
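One common way to handle control flow without hardware predication is to execute both sides of a branch and blend the results under a per-lane mask, while a branch whose condition is the same for all instances ("uniform") can stay an ordinary branch. The sketch below illustrates this general idea by hand. It is not libWFV's generated code, and all names (`shade`, `shade_wfv`, `blend`, `Vec`, `Mask`, `check_equivalence`) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t W = 4;            // assumed vectorization factor
using Vec  = std::array<float, W>;      // stand-in for a SIMD register
using Mask = std::array<bool, W>;       // one predicate bit per lane

// Scalar code: the first branch depends on per-instance data (varying),
// the second on a flag shared by all instances (uniform).
float shade(float x, bool flip) {
    float y;
    if (x < 0.0f) y = -x;
    else          y = 2.0f * x;
    if (flip) y = -y;
    return y;
}

// Per-lane select: takes `t` where the mask is set, `f` elsewhere.
Vec blend(const Mask& m, const Vec& t, const Vec& f) {
    Vec r{};
    for (std::size_t lane = 0; lane < W; ++lane) r[lane] = m[lane] ? t[lane] : f[lane];
    return r;
}

Vec shade_wfv(const Vec& x, bool flip) {
    Mask is_negative{};
    Vec then_value{}, else_value{};
    for (std::size_t lane = 0; lane < W; ++lane) {
        is_negative[lane] = x[lane] < 0.0f;
        then_value[lane]  = -x[lane];        // both sides are computed ...
        else_value[lane]  = 2.0f * x[lane];
    }
    Vec y = blend(is_negative, then_value, else_value);  // ... then blended

    // Uniform condition: all lanes agree, so the branch is kept as-is.
    if (flip) {
        for (std::size_t lane = 0; lane < W; ++lane) y[lane] = -y[lane];
    }
    return y;
}

// Checks each lane against the scalar function.
void check_equivalence(const Vec& x, bool flip) {
    const Vec packed = shade_wfv(x, flip);
    for (std::size_t lane = 0; lane < W; ++lane) {
        assert(packed[lane] == shade(x[lane], flip));
    }
}
```

Keeping the uniform branch avoids computing and blending a path that no lane needs, which is the kind of saving that analyses of uniform control flow aim for.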
We have successfully integrated libWFV into various applications, including our own shading system and OpenCL driver as well as commercial systems of industry partners.
The basic algorithm was published at CGO 2011; an extension appeared at CC 2012 (see Publications).
We have integrated libWFV into AnySL, a shading system for real-time ray tracers. Shaders (programs that compute the visual appearance of objects) written in the RenderMan shading language are automatically vectorized.
Compared to sequential execution, we obtain an average speedup factor of 3.9 for the entire rendering process of a packet ray tracer. At the same time, we reach over 90% of the performance of the few existing, hand-optimized SIMD shaders.
For more information on the shading system, please see the AnySL project page.
We have implemented a custom CPU OpenCL driver on the basis of LLVM. This driver uses libWFV in addition to multi-threading to fully exploit the available data-parallelism by executing as many kernel instances in parallel as possible.
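The combination of multi-threading and vectorization can be pictured as follows: the kernel instances are grouped into packets of W, each packet is handled by one call to a vectorized kernel, and the packets are distributed over threads. The sketch below is a simplified illustration under several assumptions and does not reflect the driver's actual scheduling: `square_kernel`, `square_kernel_wfv`, and `run_range` are hypothetical names, the round-robin distribution is chosen for brevity, and the scalar remainder loop is an addition for inputs whose size is not a multiple of W. Some toolchains require a flag such as `-pthread` to build code that uses `std::thread`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t W = 4;  // assumed vectorization factor

// Hypothetical scalar kernel: one instance per call.
void square_kernel(const float* in, float* out, std::size_t id) {
    out[id] = in[id] * in[id];
}

// Hypothetical vectorized kernel: instances [first, first + W) per call.
// The per-lane loop stands in for SIMD instructions.
void square_kernel_wfv(const float* in, float* out, std::size_t first) {
    for (std::size_t lane = 0; lane < W; ++lane) {
        out[first + lane] = in[first + lane] * in[first + lane];
    }
}

// Runs one kernel instance per input element using `num_threads` threads.
void run_range(const std::vector<float>& in, std::vector<float>& out,
               unsigned num_threads) {
    assert(in.size() == out.size());
    num_threads = std::max(1u, num_threads);
    const std::size_t packets = in.size() / W;  // full packets of W instances

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        // Thread t takes packets t, t + num_threads, ... (round-robin).
        // Packets write to disjoint output ranges, so no locking is needed.
        workers.emplace_back([&in, &out, t, num_threads, packets] {
            for (std::size_t p = t; p < packets; p += num_threads) {
                square_kernel_wfv(in.data(), out.data(), p * W);
            }
        });
    }
    for (auto& worker : workers) worker.join();

    // Leftover instances that do not fill a packet run the scalar kernel.
    for (std::size_t id = packets * W; id < in.size(); ++id) {
        square_kernel(in.data(), out.data(), id);
    }
}
```

With 10 inputs and W = 4, two packets run through the vectorized kernel and the last two instances fall back to the scalar one.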
The driver currently implements a subset of the OpenCL API that is sufficient to run various benchmarks. In addition to vectorization via libWFV, the driver includes a unique implementation for barrier synchronization in software.
The performance of our driver is on par with Intel's implementation and outperforms all other available CPU drivers.
The LLVM-based implementation of the Whole-Function Vectorization algorithm (libWFV) and the OpenCL driver are publicly available under the LLVM license (BSD-style).
The first alpha versions of libWFV and the OpenCL driver are available on GitHub.
If you are interested in trying out the second alpha version of libWFV, contact Ralf Karrenberg.
- Presburger Arithmetic in Memory Access Optimization for Data-Parallel Languages - FroCoS 2013
Karrenberg, R., Kosta, M. and Sturm, T.
Frontiers of Combining Systems, 2013. [bib]
- Improving Performance of OpenCL on CPUs - CC 2012
Karrenberg, R. and Hack, S.
Compiler Construction, 2012. [url] [bib]
- Whole Function Vectorization - CGO 2011
Karrenberg, R. and Hack, S.
Code Generation and Optimization, 2011. [doi] [url] [slides] [bib]
Jobs & Thesis Topics
Please contact Ralf Karrenberg for more details.