Whole-Function Vectorization

Automatically transform arbitrary scalar functions into their SIMD equivalents.
Integrate into any data-parallel language to exploit SIMD instruction sets.
Implemented in LLVM: source-language and target-platform independent.

About

Data-parallel programming languages are an important component in today's parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and "general-purpose" languages like CUDA or OpenCL. To achieve maximum performance for such languages on CPUs, one has to exploit both multi-threading and the additional intra-core parallelism provided by the SIMD instruction sets of those processors (like Intel's SSE, AVX, and LRBni). This intra-core parallelism is becoming more important as SIMD registers grow wider (e.g. SSE = 128 bit, AVX = 256 bit, LRBni = 512 bit).

Whole-Function Vectorization is an algorithm that transforms a scalar function so that it computes W executions of the original code in parallel using SIMD instructions. W is the chosen vectorization factor, which usually depends on the target architecture's SIMD width. Our implementation of the algorithm ("libWFV") is a language- and platform-independent code transformation that works on low-level intermediate code given by a control-flow graph in SSA form (LLVM bitcode).
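
To make "W executions in parallel" concrete, here is a minimal sketch in plain Python that emulates SIMD lanes with lists. The names scalar_kernel and kernel_w4 are hypothetical and the W-wide version is written by hand; this illustrates the intended input/output relationship only, not libWFV's interface or the code it generates.

```python
W = 4  # assumed vectorization factor, e.g. four 32-bit lanes of a 128-bit register


def scalar_kernel(x, y):
    """Original scalar function: computes one result per call."""
    return x * y + 1.0


def kernel_w4(xs, ys):
    """Hand-written W-wide counterpart: each statement acts on all W lanes."""
    assert len(xs) == W and len(ys) == W
    prod = [a * b for a, b in zip(xs, ys)]  # stands in for one SIMD multiply
    return [p + 1.0 for p in prod]          # stands in for one SIMD add


xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.5, 0.5, 2.0, 2.0]

# One call to the W-wide function matches W calls to the scalar function.
assert kernel_w4(xs, ys) == [scalar_kernel(x, y) for x, y in zip(xs, ys)]
```

The property checked in the last line is the one the transformation has to preserve: lane i of the vectorized result equals the scalar function applied to lane i of the inputs.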

Highlights of the implementation include the ability to deal with arbitrary control flow structures even on architectures without explicit predicated execution, advanced analyses and algorithms to exploit "uniform" control flow, robust handling of non-vectorizable operations, and a slim interface for efficient integration into LLVM-based compilers.
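
The following sketch shows one common way to handle a branch when lanes may disagree and the target has no predicated execution: evaluate both sides for all lanes and blend the results with a per-lane mask. It again emulates lanes with Python lists, and all function names are hypothetical. The run-time all/any test is only a stand-in for exploiting "uniform" control flow; how libWFV actually detects and exploits uniformity is described in the publications, not here.

```python
W = 4  # assumed vectorization factor


def scalar_fn(x):
    """Scalar function with a data-dependent branch."""
    if x < 0.0:
        return -x
    return 2.0 * x


def select(mask, a, b):
    """Per-lane blend: take a where the mask is set, b elsewhere."""
    return [ai if m else bi for m, ai, bi in zip(mask, a, b)]


def fn_w4(xs):
    """W-wide version: the branch becomes a mask plus a blend."""
    assert len(xs) == W
    mask = [x < 0.0 for x in xs]  # vector compare yields the branch mask

    # Uniform case: all lanes agree, so only one side needs to run.
    if all(mask):
        return [-x for x in xs]
    if not any(mask):
        return [2.0 * x for x in xs]

    # Divergent case: compute both sides for every lane, then blend.
    then_val = [-x for x in xs]
    else_val = [2.0 * x for x in xs]
    return select(mask, then_val, else_val)


xs = [-1.0, 2.0, -3.0, 4.0]
assert fn_w4(xs) == [scalar_fn(x) for x in xs]
```

In the divergent case both sides are computed for every lane, which is why skipping one side when the condition is uniform can pay off.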

We have successfully integrated libWFV into various applications, including our own shading system and OpenCL driver as well as commercial systems of industry partners.

The basic algorithm was published at CGO 2011; an extension appeared at CC 2012 (see Publications).

Download

The LLVM-based implementation of the Whole-Function Vectorization algorithm (libWFV) and the OpenCL driver are publicly available under the LLVM license (BSD-style).

The first alpha versions of libWFV and the OpenCL driver are available on GitHub.

If you are interested in trying out the second alpha version of libWFV, contact Ralf Karrenberg.

Download libWFV
Download WFVOpenCL

Publications

Conferences

Presburger Arithmetic in Memory Access Optimization for Data-Parallel Languages - FroCoS 2013
Karrenberg, R., Kosta, M. and Sturm, T.
Frontiers of Combining Systems, 2013. [bib]

@CONFERENCE{KKS:2013:frocos,
 	author = {Ralf Karrenberg and Marek Kosta and Thomas Sturm},
 	title = {Presburger Arithmetic in Memory Access Optimization for Data-Parallel Languages},
 	booktitle = {Frontiers of Combining Systems},
 	booktitle_short = {FroCoS 2013},
 	year = {2013},
 }

Improving Performance of OpenCL on CPUs - CC 2012
Karrenberg, R. and Hack, S.
Compiler Construction, 2012. [url] [bib]

@CONFERENCE{KH:2012:opencl,
 	author = {Ralf Karrenberg and Sebastian Hack},
 	title = {Improving Performance of OpenCL on CPUs},
 	booktitle = {Compiler Construction},
 	booktitle_short = {CC},
 	year = {2012},
 	url = {http://www.cdl.uni-saarland.de/papers/karrenberg_opencl.pdf}
 }

Whole Function Vectorization
Karrenberg, R. and Hack, S.
International Symposium on Code Generation and Optimization, 2011. [doi] [url] [slides] [bib]

@CONFERENCE{KH:2011:cgo,
 	author = {Ralf Karrenberg and Sebastian Hack},
 	title = {{W}hole {F}unction {V}ectorization},
 	booktitle = {International Symposium on Code Generation and Optimization},
 	series = {CGO},
 	year = {2011},
 	doi = {10.1109/CGO.2011.5764682},
 	abstract = {
 		Data-parallel programming languages are an important component
 		in today's parallel computing landscape. Among those are
 		domain-specific languages like shading languages in graphics (HLSL,
 		GLSL, RenderMan, etc.) and "general-purpose" languages like CUDA or
 		OpenCL. Current implementations of those languages on CPUs solely rely
 		on multi-threading to implement parallelism and ignore the additional
 		intra-core parallelism provided by the SIMD instruction set of those
 		processors (like Intel's SSE and the upcoming AVX or Larrabee
 		instruction sets).
 		In this paper, we discuss several aspects of implementing data-parallel
 		languages on machines with SIMD instruction sets. Our main contribution
 		is a language- and platform-independent code transformation that
 		performs whole-function vectorization on low-level intermediate code
 		given by a control flow graph in SSA form.
 		We evaluate our technique in two scenarios: First, incorporated in a
 		compiler for a domain-specific language used in real-time ray tracing.
 		Second, in a stand-alone OpenCL driver. We observe average speedup
 		factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for
 		different OpenCL kernels.
 	},
 	webslides = {http://www.cdl.uni-saarland.de/projects/wfv/wfv_cgo11_slides.pdf},
 	url = {http://www.cdl.uni-saarland.de/papers/karrenberg_wfv.pdf},
 	acc_rate = {26.7},
 	accepted = {28},
 	submitted = {105},
 }

MSc Thesis

Automatic Packetization
Karrenberg, R.
M.Sc. Thesis, Saarland University, 2009. [pdf] [bib]

@MASTERSTHESIS{Karrenberg:2009:MSc,
     author  = {Ralf Karrenberg},
     title   = {{Automatic Packetization}},
     school  = {Saarland University},
     year    = {2009},
     month   = {July},
     webpdf  = {http://www.cdl.uni-saarland.de/publications/theses/karrenberg_msc.pdf},
 	abstract = {
 		Modern processor architectures provide the possibility to execute an
 		instruction on multiple values at once. So-called SIMD (Single
 		Instruction, Multiple Data) instructions work on packets (or vectors)
 		of data instead of scalar values. They offer a significant performance
 		boost for data-parallel algorithms that perform the same operations on
 		large amounts of data, e.g. data encoding and decoding, image
 		processing, or ray tracing.
 		However, the performance gain comes at a price: programming languages
 		provide no elegant means to exploit SIMD instruction sets. Packet
 		operations have to be coded by hand, which is complicated, unintuitive,
 		and error-prone. Thus, packetization - the transformation of scalar
 		code to packet form - is mostly applied automatically by local compiler
 		optimizations (e.g. during loop vectorization) or with a lot of manual
 		effort at performance-critical parts of a system.
 		This thesis describes an algorithm for automatic packetization that
 		allows a programmer to write scalar functions but use them on packets
 		of data. A compiler pass automatically transforms those functions to
 		work on packets of the target architecture's SIMD width. The resulting
 		packetized function computes the same results as multiple executions of
 		the scalar code.
 		The algorithm is implemented in a source-language and
 		target-architecture independent intermediate representation (the Low
 		Level Virtual Machine (LLVM)), which enables its use in many different
 		environments. The performance of the generated code is shown in a
 		real-world case study in the context of real-time ray tracing: serial shader
 		code written in C++ is automatically specialized, optimized, and
 		packetized at runtime. The packetized shaders outperform their scalar
 		counterparts by an average factor of 3.6 on a standard SSE architecture
 		of SIMD width 4.
 	}
 }

Compiler Design Lab