
GPGPU::Basic Math Tutorial



Contents

  1. Introduction

    1. Prerequisites

    2. Hardware requirements

    3. Software requirements

    4. Alternatives

  2. Setting up OpenGL

    1. GLUT

    2. OpenGL extensions

    3. Preparing OpenGL for offscreen rendering

  3. GPGPU concept 1: Arrays = textures

    1. Creating arrays on the CPU

    2. Creating floating point textures on the GPU

    3. One-to-one mapping from array index to texture coordinates

    4. Using textures as render targets

    5. Transferring data from CPU arrays to GPU textures

    6. Transferring data from GPU textures to CPU arrays

    7. A short example program

  4. GPGPU concept 2: Kernels = shaders

    1. Loop-oriented CPU implementation vs. kernel-oriented data-parallel implementation

    2. Creating a shader with the Cg shading language

    3. Setting up the Cg runtime

    4. Setting up a shader for computation with the OpenGL shading language

  5. GPGPU concept 3: Computing = drawing

    1. Preparing the computational kernel

    2. Setting input arrays / textures

    3. Setting output arrays / textures

    4. Performing the computation

  6. GPGPU concept 4: Feedback

    1. Multiple rendering passes

    2. The ping pong technique

  7. Putting it all together

    1. A brief overview of the accompanying source code

    2. Variable parts of the implementation

    3. Command line parameters

    4. The test mode

    5. The benchmark mode

  8. Appendices

    1. Summary of differences between Windows and Linux, ATI and NVIDIA

    2. Issues

    3. Error checking in OpenGL

    4. Error checking with FBOs

    5. Error checking with Cg

    6. Error checking with GLSL

  9. Acknowledgements

  10. Disclaimer and Copyright

Source code

Single-file sources for three different versions of the tutorial code are available:

Additionally, a Linux Makefile and a Visual Studio 2003 solution file are provided.

Citation

In case you want to cite this tutorial, please use this BibTeX citation.




Introduction

The goal of this tutorial is to explain the background and all necessary steps that are required to implement a simple linear algebra operator on the GPU: saxpy() as known from the BLAS library. For two vectors x and y of length N and a scalar value alpha, we want to compute a scaled vector-vector addition: y = y + alpha * x. The saxpy() operation requires almost no background in linear algebra, and serves well to illustrate all entry-level GPGPU concepts. The techniques and implementation details introduced in this tutorial can easily be extended to more complex calculations on GPUs.

Prerequisites

This tutorial is not intended to explain every single detail from scratch. It is written for programmers with a basic understanding of OpenGL, its state machine concept and the way OpenGL models the graphics pipeline. It is recommended to go through the examples and exercises of the OpenGL Programming Guide (Red Book). PDF and HTML versions and sample programs are freely available on the internet. Additionally, chapter 1 of the Orange Book ("OpenGL Shading Language"), titled "A Review of OpenGL Basics", provides an excellent summary. NeHe's OpenGL tutorials are also recommended.

This tutorial is based on OpenGL, simply because the target platform should not be limited to MS Windows. Most concepts explained here, however, translate directly to DirectX.

For a good overview and pointers to reading material, please refer to the FAQ at the community site GPGPU.org. The sections 'Where can I learn about OpenGL and Direct3D?', 'How does the GPU pipeline work?' and 'In what ways is GPU programming similar to CPU programming?' are highly recommended.

Hardware requirements

You will need at least a NVIDIA GeForce FX or an ATI RADEON 9500 graphics card. Older GPUs do not provide the features (most importantly, single precision floating point data storage and computation) which we require.

Software requirements

First of all, a C/C++ compiler is required. Visual Studio .NET 2003 and 2005, Eclipse 3.x plus CDT/MinGW, the Intel C++ Compiler 9.x and GCC 3.4+ have been successfully tested. Up-to-date drivers for the graphics card are essential. At the time of writing, using an ATI card only works with Windows, whereas NVIDIA drivers support both Windows and Linux (and in theory, also Solaris x86 and FreeBSD, though I never tested this myself).

The accompanying code uses two external libraries, GLUT and GLEW. For Windows systems, GLUT is available here; on Linux, the packages freeglut and freeglut-devel ship with most distributions. GLEW can be downloaded from SourceForge. Header files and binaries must be installed in a location where the compiler can locate them; alternatively, the locations need to be added to the compiler's include and library paths. Shader support for GLSL is built into the driver; to use Cg, the Cg Toolkit is required.

Alternatives

For a similar example program done in DirectX, refer to Jens Krüger's Implicit Water Surface demo (a version based on OpenGL is also available). Note, however, that this is well-commented example code rather than a tutorial.

GPU metaprogramming languages abstract from the graphical context completely. Both BrookGPU and Sh are recommended. Note that Sh is continued as RapidMind these days.




Setting up OpenGL

GLUT

GLUT, the OpenGL Utility Toolkit, provides functions to handle window events, create simple menus etc. Here, we just use it to set up a valid OpenGL context (allowing us access to the graphics hardware through the GL API later on) with as few lines of code as possible. Additionally, this approach is completely independent of the window system that is actually running on the computer (MS Windows, or XFree86/X.Org on Linux/Unix and Mac).

// include the GLUT header file
#include <GL/glut.h>

// call this and pass the command line arguments from main()
void initGLUT(int argc, char **argv) {
    glutInit(&argc, argv);
    glutCreateWindow("SAXPY TESTS");
}

OpenGL extensions

Most of the features that are required to perform general floating-point computations on the GPU are not part of core OpenGL. OpenGL extensions, however, provide a mechanism to access features of the hardware through extensions to the OpenGL API. They are typically not supported by every type of hardware and every driver release, because they are designed to expose new features of the hardware (such as those we need) to application programmers. In a real application, you should carefully check whether the necessary extensions are supported and fall back to a software implementation otherwise. In this tutorial, we skip this to keep the code uncluttered.

A list of (almost all) OpenGL extensions is available at the OpenGL Extension Registry.

The extensions actually required for this implementation will be presented when we need the functionality they provide in our code. To check whether the hardware and driver support a given extension, you can use the small tool glewinfo that ships with GLEW, any other OpenGL extension viewer, or OpenGL itself (an example can be found by following the link above).

Obtaining pointers to the functions the extensions define is an advanced issue, so in this example, we use GLEW as an extension loading library that wraps everything we need up nicely with a minimalistic interface:

void initGLEW(void) {
    // init GLEW, obtain function pointers
    GLenum err = glewInit();
    // Warning: this does not check whether all extensions used
    // in a given implementation are actually supported.
    // Function entry points created by glewInit() will be
    // NULL in that case!
    if (GLEW_OK != err) {
        printf("%s", (char*)glewGetErrorString(err));
        exit(ERROR_GLEW);
    }
}
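As a hedged sketch of the per-extension check that a real application would perform: after a successful glewInit(), GLEW exposes one boolean flag per extension. The two extensions tested below are illustrative choices for this tutorial's code paths; the exact set your implementation needs may differ by vendor and code path, and this check requires a valid OpenGL context.

```c
#include <stdio.h>
#include <stdlib.h>
#include <GL/glew.h>

// call after initGLEW(); exits if the (assumed) required extensions are missing
void checkRequiredExtensions(void) {
    if (!GLEW_EXT_framebuffer_object || !GLEW_ARB_texture_float) {
        fprintf(stderr, "Required OpenGL extensions are not supported.\n");
        exit(1);
    }
}
```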

Preparing OpenGL for offscreen rendering

In the GPU pipeline, the traditional end point of every rendering operation is the frame buffer, a special chunk of graphics memory from which the image that appears on the display is read. Depending on the display settings, the most we can get is 32 bits of color depth, shared among the red, green, blue and alpha channels of your display: Eight bits to represent the amount of "red" in an image (same for green etc.: RGBA) is all we can expect and in fact need on the display. This already sums up to more than 16 million different colors. Since we want to work with floating point values, 8 bits is clearly insufficient with respect to precision. Another problem is that the data will always be clamped to the range of [0/255; 255/255] once it reaches the framebuffer.

How can we work around this? We could invent a cumbersome arithmetic that maps the sign-mantissa-exponent data format of an IEEE 32-bit floating point value into the four 8-bit channels. But luckily, we don't have to! First, 32-bit floating point values on GPUs are provided through a set of OpenGL extensions (see section 2). Second, an OpenGL extension called EXT_framebuffer_object allows us to use an offscreen buffer as the target for rendering operations such as our vector calculations, providing full precision and removing all the unwanted clamping issues. The commonly used abbreviation is FBO, short for framebuffer object.

A few lines of code suffice to use this extension, turn off the traditional framebuffer, and direct our calculations to an offscreen buffer (surface). Note that binding FBO number 0 will restore the window-system specific framebuffer at any time. This is useful for advanced applications but beyond the scope of this tutorial.

GLuint fb;

void initFBO(void) {
    // create FBO (off-screen framebuffer)
    glGenFramebuffersEXT(1, &fb);
    // bind offscreen buffer
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fb);
}



