Whole-Function Vectorization

Whole-Function Vectorization enables data-parallel languages to use SIMD instruction sets, exploiting data-level parallelism in addition to multi-threading. Our LLVM-based implementation achieves an average speedup factor of 3.9 for RenderMan shaders in a ray tracer and factors between 0.6 and 5.2 for OpenCL kernels (on hardware with SIMD width 4). The algorithm operates on SSA form and can vectorize arbitrary control-flow graphs.

About

Data-parallel programming languages are an important component in today's parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and "general-purpose" languages like CUDA or OpenCL. To achieve maximum performance for such languages on CPUs, one has to exploit both multi-threading and the additional intra-core parallelism provided by the SIMD instruction sets of those processors (like Intel's SSE, AVX, and LRBni). This intra-core parallelism matters more as SIMD registers grow wider (e.g. SSE = 128 bit, AVX = 256 bit, LRBni = 512 bit, i.e. 4, 8, and 16 single-precision floats per register).

Whole-Function Vectorization is an algorithm that transforms a scalar function in such a way that it computes W executions of the original code in parallel using SIMD instructions (W is the target architecture's SIMD width). Our implementation of the algorithm is a language- and platform-independent code transformation that works on low-level intermediate code given by an arbitrary control-flow graph in SSA form (LLVM bitcode).
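To make the idea of "W executions in parallel" concrete, the following is a minimal hand-written sketch, not output of the packetizer. It assumes W = 4 and emulates a SIMD register with a plain array. The names shade_scalar, shade_vectorized, and matches_scalar are hypothetical. The branch is handled by evaluating both sides for all lanes and merging the results with a per-lane mask, which is one common way to map control flow onto SIMD code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t W = 4;        // assumed SIMD width (4 floats per register)

using Vec  = std::array<float, W>;  // stands in for one SIMD register
using Mask = std::array<bool, W>;   // one predicate per lane

// Original scalar function: one execution per call, containing a branch.
float shade_scalar(float x) {
    if (x > 0.0f) {
        return x * 2.0f;
    }
    return -x;
}

// Hypothetical W-wide counterpart: computes W executions of shade_scalar.
// Both sides of the branch are evaluated for every lane; the mask records
// which lanes took the "then" side, and a per-lane select merges the results.
Vec shade_vectorized(const Vec& x) {
    Mask taken{};
    Vec then_val{}, else_val{}, result{};
    for (std::size_t i = 0; i < W; ++i) taken[i]    = x[i] > 0.0f;
    for (std::size_t i = 0; i < W; ++i) then_val[i] = x[i] * 2.0f;
    for (std::size_t i = 0; i < W; ++i) else_val[i] = -x[i];
    for (std::size_t i = 0; i < W; ++i) result[i]   = taken[i] ? then_val[i] : else_val[i];
    return result;
}

// Checks the defining property: each lane of the vectorized result equals
// the scalar function applied to the corresponding input lane.
bool matches_scalar(const Vec& x) {
    const Vec v = shade_vectorized(x);
    for (std::size_t i = 0; i < W; ++i) {
        if (v[i] != shade_scalar(x[i])) return false;
    }
    return true;
}
```

Each of the four inner loops corresponds to what would be a single SIMD instruction on real hardware; a compiler targeting SSE would emit a compare, a multiply, a negate, and a blend.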

The algorithm was developed and implemented as part of the master's and PhD theses of Ralf Karrenberg and has been published at CGO 2011.

Vectorized RenderMan Shaders

We have integrated the "packetizer"—the implementation of the Whole-Function Vectorization algorithm—into the shading system of the real-time ray tracer RTfact. Shaders (programs that compute the visual appearance of objects) written in the RenderMan shading language are automatically vectorized.

Compared to sequential execution, we obtain an average speedup factor of 3.9 for the entire rendering process of the ray tracer. At the same time, we reach over 90% of the performance of the few existing hand-optimized SIMD shaders.

Vectorized OpenCL Driver

We have implemented a custom CPU OpenCL driver on the basis of LLVM and the AMD APP SDK. This driver employs Whole-Function Vectorization in addition to multi-threading to fully exploit the available data-parallelism by executing as many kernel instances in parallel as possible.
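As a rough illustration of combining the two levels of parallelism, the sketch below splits a one-dimensional range of kernel instances into packets of W consecutive instances and distributes the packets across threads. It is not the driver's code and uses no OpenCL API. The names kernel_scalar, kernel_packet, and run_kernel are hypothetical, W = 4 is assumed, and the scalar fallback for leftover instances is one possible choice among several.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t W = 4;  // assumed SIMD width

// Hypothetical scalar kernel: instance `gid` squares one element.
void kernel_scalar(const float* in, float* out, std::size_t gid) {
    out[gid] = in[gid] * in[gid];
}

// Hypothetical W-wide version of the kernel: one call covers instances
// first_gid .. first_gid + W - 1 (the loop stands in for SIMD instructions).
void kernel_packet(const float* in, float* out, std::size_t first_gid) {
    for (std::size_t lane = 0; lane < W; ++lane) {
        out[first_gid + lane] = in[first_gid + lane] * in[first_gid + lane];
    }
}

// Runs `global_size` kernel instances: threads share the packets, and any
// instances that do not fill a whole packet run through the scalar kernel.
void run_kernel(const float* in, float* out,
                std::size_t global_size, unsigned num_threads) {
    assert(in != nullptr && out != nullptr);
    num_threads = std::max(1u, num_threads);
    const std::size_t num_packets = global_size / W;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        // Each thread gets a contiguous, non-overlapping range of packets.
        const std::size_t begin = num_packets * t / num_threads;
        const std::size_t end   = num_packets * (t + 1) / num_threads;
        workers.emplace_back([=] {
            for (std::size_t p = begin; p < end; ++p) {
                kernel_packet(in, out, p * W);
            }
        });
    }
    for (auto& w : workers) w.join();

    // Remainder: fewer than W instances are left over.
    for (std::size_t gid = num_packets * W; gid < global_size; ++gid) {
        kernel_scalar(in, out, gid);
    }
}
```

With 10 instances and 3 threads, for example, two packets cover instances 0 to 7 and the scalar kernel handles instances 8 and 9.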

The driver currently implements a subset of the OpenCL API that is sufficient to run benchmarks from the SDK for comparison purposes. Supported features include multi-dimensional kernels and barrier synchronization.

Our benchmarks show speedup factors between 0.6 and 5.2 on a variety of applications (e.g. BlackScholes, NBody, Mandelbrot, FastWalshTransform, Histogram). A factor below 1 indicates a slowdown, so vectorization does not pay off for every kernel.

Open Source

The LLVM-based implementation of the Whole-Function Vectorization algorithm as well as the OpenCL driver will be made publicly available in the near future.

If you are interested in trying out the "packetizer" and/or the OpenCL driver, contact Ralf Karrenberg for an alpha version licensed under the GPL.

Publications

Conferences

MSc Thesis

People

Links