What follows is a set of ideas I've been babbling to people over the past few days:
A recent call for papers to contribute to NVIDIA's GPU Computing Gems has prompted some brainstorming about potential uses for GPU Ocelot. It occurs to me that I haven't yet written a post addressing an important segment of GPU Ocelot's user base: CUDA developers.
While Ocelot makes a good infrastructure for researching GPU architecture, programming models, compiler transformations, and operating systems, it's also incredibly useful for CUDA application development and performance tuning. Ocelot consists of an implementation of the CUDA Runtime API with three backends on which to execute CUDA kernels: a full-featured PTX emulator with instrumentation support, an efficient translator to multicore CPUs, and CUDA-capable GPUs themselves. I'll be focusing on the capabilities enabled by Ocelot's PTX emulator for most of this post.
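For reference, the backend is selected through a configuration file named configure.ocelot in the application's working directory. The sketch below is from memory, and the exact option names may differ between Ocelot versions, so treat it as illustrative and check it against the default configuration shipped with the source:

    {
        ocelot: "ocelot",
        executive: {
            devices: [ emulated ],    // emulated, llvm, or nvidia
            optimizationLevel: full
        },
        trace: {
            memoryChecker: { enabled: true },    // bounds checking, discussed below
            raceDetector:  { enabled: true }     // shared memory race detection
        }
    }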
PTX is the virtual instruction set that the CUDA compiler targets. Ocelot includes an instruction-by-instruction emulation of PTX kernels in a manner that is faithful to the PTX execution model and very similar to how GPUs might execute the kernel. Since Ocelot explicitly computes the state of a PTX kernel for every instruction and every thread, detailed instrumentation and analysis is possible. Here are some of the features that greatly assist in developing CUDA applications and tuning them for efficient execution on GPUs.
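To make that concrete, here is a trivial CUDA kernel:

    __global__ void saxpy(float a, float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        y[i] = a * x[i] + y[i];
    }

Its body lowers to PTX along these lines. This is a hand-written approximation for illustration; actual nvcc output includes parameter-loading boilerplate and may choose different registers and instruction orderings:

    mov.u32        %r1, %ctaid.x;          // blockIdx.x
    mov.u32        %r2, %ntid.x;           // blockDim.x
    mov.u32        %r3, %tid.x;            // threadIdx.x
    mad.lo.u32     %r4, %r1, %r2, %r3;     // i = blockIdx.x * blockDim.x + threadIdx.x
    ld.global.f32  %f1, [%rd1];            // x[i]
    ld.global.f32  %f2, [%rd2];            // y[i]
    mad.f32        %f3, %f4, %f1, %f2;     // a * x[i] + y[i]
    st.global.f32  [%rd2], %f3;            // y[i] = ...

Ocelot's emulator steps through instructions like these one at a time, for every thread in the launch.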
Ocelot's PTX emulator is capable of checking for out-of-bounds memory accesses. Without Ocelot, such an access might simply cause your application to seg-fault on the GPU, possibly crashing your video driver in the process. There aren't really any tools currently available to help you find these problems, and padding added to memory allocations by the CUDA driver may conceal them on the system you test with, only for them to explode on your users' systems, which may behave differently.
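As a simple illustration, the kernel below omits its bounds check. Launched over more threads than there are elements, the extra threads write past the end of the allocation; under Ocelot's emulator the offending access is reported at the faulting instruction (the precise report format may vary), whereas on hardware the write may silently land in padding or in a neighboring allocation:

    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // BUG: missing 'if (i < n)' guard. With n = 1000 and a launch of
        // scale<<<4, 256>>>(d_data, 1000), threads 1000..1023 write out
        // of bounds.
        data[i] *= 2.0f;
    }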
Beyond checking global memory accesses, Ocelot's emulator detects race conditions in shared memory that necessitate the use of synchronization, as well as deadlocks in which not all threads reach a synchronization point. To help applications achieve maximum performance from the memory system, Ocelot can also determine whether an application takes advantage of spatial locality, that is, whether threads issue accesses to nearby data that can be coalesced in time. We do not yet detect shared memory bank conflicts, in which multiple concurrent threads access the same bank of shared memory; on a GPU these accesses must be serialized, greatly impacting performance. I plan to implement that soon.
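The race detector catches kernels like the reversal below, where each thread reads a shared memory location written by a different thread without an intervening barrier; inserting __syncthreads() between the write and the read fixes it:

    __global__ void reverse(float* data) {    // launched as reverse<<<1, 256>>>
        __shared__ float buf[256];
        int i = threadIdx.x;
        buf[i] = data[i];
        // RACE: a __syncthreads() is required here. Without it, the read
        // of buf[255 - i] may occur before another thread's write to it.
        data[i] = buf[255 - i];
    }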
Finally, Ocelot supports user-contributed trace generators. You could write, for example, a trace generator that watches a particular location in memory and identifies when that location was updated and by which thread(s). I expect to write that one myself and demonstrate in a future post how to extend it.
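As a sketch of what that could look like: trace generators subclass trace::TraceGenerator and receive a TraceEvent for each dynamic instruction the emulator executes. The header paths and member names below (memory_addresses, PC, addTraceGenerator) follow my reading of the Ocelot source and may shift as the interface evolves, so verify them against your checkout:

    #include <ocelot/trace/interface/TraceGenerator.h>
    #include <ocelot/api/interface/ocelot.h>
    #include <iostream>

    // Logs every dynamic instruction that touches a watched global
    // memory address.
    class WatchpointGenerator : public trace::TraceGenerator {
    public:
        explicit WatchpointGenerator(unsigned long long watched)
            : _watched(watched) {}

        virtual void event(const trace::TraceEvent& e) {
            // memory_addresses is assumed to hold one address per active
            // thread that issued a memory operation for this instruction;
            // mapping a slot back to a thread ID uses the event's active
            // thread mask.
            for (size_t t = 0; t < e.memory_addresses.size(); ++t) {
                if (e.memory_addresses[t] == _watched) {
                    std::cout << "watched address accessed at PC "
                        << e.PC << " (active-thread slot " << t << ")\n";
                }
            }
        }

    private:
        unsigned long long _watched;
    };

    // Register before any kernels launch, e.g. at the top of main():
    //   static WatchpointGenerator watch(0x1000 /* address to watch */);
    //   ocelot::addTraceGenerator(watch);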
NVIDIA's CUDA does include an emulation target, but it is implemented completely differently, using one host thread per logical CUDA thread. This MIMD approach to kernel execution is terribly inefficient: most of the application's runtime is spent by the OS context-switching between threads. Moreover, it departs so radically from the way GPUs execute kernels that CUDA kernels which are invalid for subtle reasons may execute correctly on NVIDIA's emulator but incorrectly on a GPU. Ocelot's PTX emulator presents a much closer execution environment, one that captures the subtleties of the GPU execution model.
GPU Ocelot is fairly stable as of this writing. It has been validated against over a hundred applications from the CUDA SDK, third-party benchmark suites, CUDA-based libraries such as GPU VSIPL and Thrust, and custom CUDA applications. Ocelot has been used to capture performance characteristics of these applications as well as to find nuanced bugs in existing CUDA programs. It constitutes a great improvement over NVIDIA's emulation-mode CUDA runtime in terms of accuracy, observability of potential bugs, and emulation performance.