Research Interests
-
Compiler Construction (Code Generation, Automatic Parallelization)
-
Computer Graphics (Ray Tracing, Shading Languages, Acceleration Structures)
-
SIMD Programming
Projects
Teaching
Summer Term 2011
Winter Term 2010
Summer Term 2010
Publications
Conferences
- Improving the Performance of OpenCL on CPUs - CC 2012
Karrenberg, R. and Hack, S.
Compiler Construction, 2012.
[bib]
@CONFERENCE{KH:2012:opencl,
author = {Ralf Karrenberg and Sebastian Hack},
title = {Improving the Performance of OpenCL on CPUs},
booktitle = {Compiler Construction},
booktitle_short = {CC 2012},
year = {2012},
},
- Whole Function Vectorization - CGO 2011
Karrenberg, R. and Hack, S.
Code Generation and Optimization, 2011.
[doi]
[slides]
[bib]
@CONFERENCE{KH:2011:cgo,
author = {Ralf Karrenberg and Sebastian Hack},
title = {{W}hole {F}unction {V}ectorization},
booktitle = {Code Generation and Optimization},
booktitle_short = {CGO 2011},
year = {2011},
doi = {10.1109/CGO.2011.5764682},
abstract = {
Abstract—Data-parallel programming languages are an important component
in today's parallel computing landscape. Among those are domain-
specific languages like shading languages in graphics (HLSL, GLSL,
RenderMan, etc.) and "general-purpose" languages like CUDA or OpenCL.
Current implementations of those languages on CPUs solely rely on multi-
threading to implement parallelism and ignore the additional intra-core
parallelism provided by the SIMD instruction set of those processors
(like Intel's SSE and the upcoming AVX or Larrabee instruction sets).
In this paper, we discuss several aspects of implementing data-parallel
languages on machines with SIMD instruction sets. Our main contribution
is a language- and platform-independent code transformation that
performs whole-function vectorization on low-level intermediate code
given by a control flow graph in SSA form.
We evaluate our technique in two scenarios: First, incorporated in a
compiler for a domain-specific language used in real-time ray tracing.
Second, in a stand-alone OpenCL driver. We observe average speedup
factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for
different OpenCL kernels.
},
webslides = {http://www.cdl.uni-saarland.de/projects/wfv/wfv_cgo11_slides.pdf}
acc_rate = {26.7},
accepted = {28},
submitted = {105},
}
- AnySL: Efficient and Portable Shading for Ray Tracing - HPG 2010
Karrenberg, R., Rubinstein, D., Slusallek, P. and Hack, S.
Proceedings of the Conference on High Performance Graphics, pages 97–105, Eurographics Association, 2010.
[more]
[slides]
[bib]
@CONFERENCE{KRSH:2010:hpg,
author = {Ralf Karrenberg and Dmitri Rubinstein and Philipp Slusallek and Sebastian Hack},
title = {{AnySL: Efficient and Portable Shading for Ray Tracing}},
booktitle = {Proceedings of the Conference on High Performance Graphics},
series = {HPG '10},
year = {2010},
location = {Saarbrucken, Germany},
pages = {97--105},
numpages = {9},
url = {http://portal.acm.org/citation.cfm?id=1921479.1921495},
acmid = {1921495},
publisher = {Eurographics Association},
address = {Aire-la-Ville, Switzerland, Switzerland},
booktitle_short = {HPG 2010},
abstract = {
While a number of different shading languages have been developed,
their efficient integration into an existing renderer is notoriously
difficult, often boiling down to implementing an entire compiler
toolchain for each language. Furthermore, no shading language is
broadly supported across the variety of rendering systems.
AnySL attacks this issue from multiple directions: We compile shaders
from different languages into a common, portable representation, which
uses subroutine threaded code: Every language operator is translated to
a function call. Thus, the compiled shader is generic with respect to
the used types and operators.
The key component of our system is an embedded compiler that
instantiates this generic code in terms of the renderer's native types
and operations. It allows for flexible code transformations to match
the internal structure of the renderer and eliminates all overhead due
to the subroutine threaded code. For SIMD architectures we
automatically perform vectorization of scalar shaders which speeds up
rendering by a factor of 3.9 on average on SSE. The results are highly
optimized, parallel shaders that operate directly on the internal data
structures of a renderer. We show that both traditional shading
languages such as RenderMan, but also C/C++-based shading languages,
can be fully supported and deliver high performance across different
CPU renderers.
},
webslides = {http://www.cdl.uni-saarland.de/projects/anysl/anysl_hpg10_slides.pdf}
}
MSc Thesis
- Automatic Packetization
Karrenberg, R.
M.Sc. Thesis, Saarland University, 2009.
[pdf]
[bib]
@MASTERSTHESIS{Karrenberg:2009:MSc,
author = {Ralf Karrenberg},
title = {{Automatic Packetization}},
school = {Saarland University},
year = {2009},
month = {July},
webpdf = {http://www.cdl.uni-saarland.de/publications/theses/karrenberg_msc.pdf},
abstract = {
Modern processor architectures provide the possibility to execute an
instruction on multiple values at once. So-called SIMD (Single
Instruction, Multiple Data) instructions work on packets (or vectors)
of data instead of scalar values. They offer a significant performance
boost for data-parallel algorithms that perform the same operations on
large amounts of data, e.g. data encoding and decoding, image
processing, or ray tracing.
However, the performance gain comes at a price: programming languages
provide no elegant means to exploit SIMD instruction sets. Packet
operations have to be coded by hand, which is complicated, unintuitive,
and error prone. Thus, packetization - the transformation of scalar
code to packet form - is mostly applied automatically by local compiler
optimizations (e.g. during loop vectorization) or with a lot of manual
effort at performance-critical parts of a system.
This thesis describes an algorithm for automatic packetization that
allows a programmer to write scalar functions but use them on packets
of data. A compiler pass automatically transforms those functions to
work on packets of the target-architecture's SIMD width. The resulting
packetized function computes the same results as multiple executions of
the scalar code.
The algorithm is implemented in a source-language and target-
architecture independent intermediate representation (the Low Level
Virtual Machine (LLVM)), which enables its use in many different
environments. The performance of the generated code is shown in a real-
world case study in the context of real-time ray tracing: serial shader
code written in C++ is automatically specialized, optimized, and
packetized at runtime. The packetized shaders outperform their scalar
counterparts by an average factor of 3.6 on a standard SSE architecture
of SIMD width 4.
}
}
BSc Thesis
- Memory Aware Realtime Ray Tracing: The Bounding Plane Hierarchy
Karrenberg, R.
B.Sc. Thesis, Saarland University, 2007.
[pdf]
[bib]
@BACHELORSTHESIS{Karrenberg:2007:BSc,
author = { Ralf Karrenberg },
title = { {M}emory {A}ware {R}ealtime {R}ay {T}racing: {T}he {B}ounding {P}lane {H}ierarchy },
school = {Saarland University},
year = { 2007 },
month = { August },
webpdf = {http://www.cdl.uni-saarland.de/publications/theses/karrenberg_bsc.pdf},
abstract = {
Realtime Ray Tracing has become available on single desktop environments
over the last few years, recently moving on to supporting dynamic
scenes. Approaches either use general-purpose CPUs, high-end
programmable graphics cards or specialised custom hardware. One of the
most important factors influencing the performance of all
implementations is the algorithm’s dependence on a spatial index
structure. Its random memory access is very slow compared to the
computational power of current computer hardware, often being the
bottleneck of the system.
The goal of this thesis was the design and implementation of a spatial
index structure that focuses entirely on improving memory efficiency in
trade for computational complexity. Two main starting points were
followed: The memory requirement of the whole structure and of each of
its parts on one hand and the caching efficiency on the other. The
resulting contributions are the following:
• The Bounding Plane Hierarchy (BPH) is a complete acceleration
structure equivalent to a Bounding Volume Hierarchy (BVH) with axis-
aligned bounding boxes (AABBs) that needs less than half the size of
current BVH-implementations.
• Treelets are groups of interconnected nodes that are traversed
independently from the rest of the acceleration structure by using a new
two-stage algorithm. They are employed to enhance cache efficiency -
especially on parallel and/or multi-threaded systems like the CELL -
and can be applied to any spatial index structure.
A first implementation of the BPH with Treelets using Packet Tracing
and SIMD-instructions renders animations in real-time while still
offering many possibilities for optimization.
}
}
Jobs
Please consider the project pages of AnySL and Whole-Function Vectorization for thesis topics and "HiWi" job offers.