Wednesday, February 17, 2010

NP

I thought of this XKCD comic last night as I was showing that problems were in NP. Resurrected for your amusement.

NP-Complete

Read up on Karp's NP-Complete Problems.

Monday, February 15, 2010

Upwelling Within

I have the urge to build something in ways that do not involve typing. Something mechanical with moving parts.

In potentially related news, this looked neat: Crower six-stroke diesel.

Sunday, February 14, 2010

Motherboard Death Throes

My desktop is back. Evidently, one [of three] PCI-Express x16 slots is dead, as is one of the memory channels. The system refuses to POST with anything occupying the damaged connectors. This leaves two PCI-Express x16 slots [though one is inaccessible] and one memory channel still viable. I can make do with this configuration for the time being.

The experience has prompted the question of what I want in a home desktop. While there was a time when GPU VSIPL development and benchmarking demanded a high-end system, I now do most of that work on other resources funded by the research budget. I don't game, and I do most home development on my laptop, which is typically booted into Linux. I wouldn't need Windows at all except to synchronize my iPhone, though it's nice to have that kind of compatibility for special interests and the occasional foray into 3D graphics programming. All told, a low-end multicore CPU and a mid-range GPU would satisfy those needs quite well.

With a medium-performance laptop dual booting Windows and Linux, do I need the desktop at all?

The strongest reason I can see for wanting one is to have a stable place for an array of hard drives to ensure reliable access to a decade or so of accumulated data. That's a task easily serviced by network-attached storage [1].

This has disturbing implications for the value of my research interests at the consumer level...

[1] BUT NOT IF IT'S MADE BY LACIE - you're better off keeping your data on floppy disks under your mattress than shoveling it into that black hole of a storage medium; it's a place where entropy goes to die; fail fail fail.

Wednesday, February 10, 2010

GPU Ocelot

What follows is a set of ideas I've been babbling to people over the past few days:

A recent call for papers for NVIDIA's GPU Computing Gems has prompted some brainstorming about potential uses for GPU Ocelot. It occurs to me that I've not written a post addressing an important segment of the GPU Ocelot user space: CUDA developers.

While Ocelot makes a good infrastructure for researching GPU architecture, programming models, compiler transformations, and operating systems, it's also incredibly useful for CUDA application development and performance tuning. Ocelot consists of an implementation of the CUDA Runtime API with three backends for executing CUDA kernels: a full-featured PTX emulator with instrumentation support, an efficient translator targeting multicore CPUs, and native execution on a CUDA-capable GPU. I'll be focusing on the capabilities enabled by Ocelot's PTX emulator for most of this post.
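
To make this concrete: because Ocelot implements the CUDA Runtime API, an ordinary CUDA program like the minimal example below needs no source changes; you simply link it against Ocelot's runtime instead of NVIDIA's libcudart (the exact link flags depend on how you built Ocelot), and the kernel launch runs on whichever backend is selected. The snippet is written just for this post, not taken from the Ocelot distribution.

#include <cuda_runtime.h>
#include <cstdio>

// A trivial kernel: each thread doubles one element. nvcc lowers this to a
// handful of PTX instructions, which Ocelot can emulate, translate for a
// multicore CPU, or hand off to a real GPU.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = 2.0f * data[i];
    }
}

int main() {
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) {
        host[i] = float(i);
    }

    float *device = 0;
    cudaMalloc((void **)&device, n * sizeof(float));
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // When linked against Ocelot rather than libcudart, this launch runs on
    // whichever backend is selected: the PTX emulator, the multicore
    // translator, or a CUDA-capable GPU.
    scale<<<1, n>>>(device, n);

    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device);

    std::printf("host[3] = %f\n", host[3]); // expect 6.0
    return 0;
}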

PTX is the virtual instruction set that the CUDA compiler targets. Ocelot emulates PTX kernels instruction by instruction in a manner that is faithful to the PTX execution model and very close to how a GPU would actually execute the kernel. Since Ocelot explicitly computes the state of a PTX kernel at every instruction for every thread, detailed instrumentation and analysis are possible. Here are some of the features that greatly assist in developing CUDA applications and tuning them for efficient execution on GPUs.

Ocelot's PTX emulator is capable of checking for out-of-bounds memory accesses. Without Ocelot, your application might simply seg-fault on the GPU and possibly crash your video driver. There aren't really any tools available to help you find these problems, and padding added to memory allocations by the CUDA driver may conceal them on the system you test with, only to have them explode on users whose systems differ.
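
As an illustration (a toy kernel written just for this post, not one from the Ocelot tests), here is the kind of off-by-one bug the emulator's bounds checking is meant to catch. On hardware the stray read may silently land in allocation padding; under the emulator it can be flagged at the offending instruction.

#include <cuda_runtime.h>

// Computes a forward difference, but the last thread reads data[n], one
// element past the end of the allocation. On a GPU this may be hidden by
// padding; Ocelot's PTX emulator can flag the out-of-bounds access.
__global__ void forwardDiff(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i + 1] - data[i]; // BUG: should stop at i < n - 1
    }
}

int main() {
    const int n = 128;
    float *data = 0;
    cudaMalloc((void **)&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    forwardDiff<<<1, n>>>(data, n); // the emulator catches the stray read here
    cudaThreadSynchronize();

    cudaFree(data);
    return 0;
}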

Beyond global memory accesses, Ocelot's emulator detects race conditions on shared memory that indicate missing synchronization, as well as deadlocks in which not all threads reach a synchronization point. To help you get maximum performance from the memory system, Ocelot can also determine whether the application is taking advantage of spatial locality so that accesses from neighboring threads can be coalesced. At present, we do not detect shared memory bank conflicts, in which multiple concurrent threads access the same bank of shared memory; on a GPU these accesses must be serialized, greatly impacting performance. I plan to implement that soon.
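
For example (again a toy kernel written just for illustration), the classic way to trip a shared memory race detector is to drop the barrier in a kernel that stages data through shared memory:

#include <cuda_runtime.h>

#define TILE 256

// Reverses each block-sized tile of the input through shared memory. Without
// the __syncthreads() barrier, a thread may read tile[] before the thread
// responsible for that slot has written it: a shared memory race. The global
// loads and stores are contiguous across neighboring threads, which is the
// coalescing-friendly pattern the locality analysis looks for.
__global__ void reverseTile(float *out, const float *in) {
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];

    __syncthreads(); // remove this barrier and the race detector complains

    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

int main() {
    const int n = 4 * TILE;
    float *in = 0, *out = 0;
    cudaMalloc((void **)&in, n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    reverseTile<<<n / TILE, TILE>>>(out, in);
    cudaThreadSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}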

Finally, Ocelot supports user-contributed trace generators. You could write, for example, a trace generator that watches a particular location in memory and identifies when that location was updated and by which thread(s). I believe I will write this and then, in a blog post, demonstrate how to extend it.
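
Here is a rough sketch of what such a watchpoint generator might look like. A caveat: I'm writing this from memory, so treat the class name trace::TraceGenerator, the event() hook, the TraceEvent fields, and ocelot::addTraceGenerator as assumptions about the headers rather than gospel; the real interface may differ in its details.

// Sketch of a watchpoint trace generator. ASSUMPTION: the TraceGenerator
// interface, the TraceEvent fields, and ocelot::addTraceGenerator as written
// here are reconstructed from memory and may not match the actual headers.
#include <iostream>

#include <ocelot/api/interface/ocelot.h>
#include <ocelot/trace/interface/TraceGenerator.h>

class WatchpointGenerator : public trace::TraceGenerator {
public:
    explicit WatchpointGenerator(unsigned long long watched)
        : _watched(watched) {}

    // Invoked by the emulator for each dynamic instruction's trace event.
    virtual void event(const trace::TraceEvent &e) {
        // Scan the addresses touched by this instruction and report any
        // access to the watched device address.
        for (size_t i = 0; i < e.memory_addresses.size(); ++i) {
            if (e.memory_addresses[i] == _watched) {
                std::cout << "watched address touched at PC " << e.PC << "\n";
            }
        }
    }

private:
    unsigned long long _watched; // device address to watch
};

int main() {
    WatchpointGenerator generator(0x1000); // hypothetical device address
    ocelot::addTraceGenerator(generator);  // register before launching kernels
    // ... allocate memory and launch CUDA kernels as usual; the emulator
    // calls event() as it executes each instruction ...
    return 0;
}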

NVIDIA's CUDA does include an emulation target, but it is implemented completely differently, using one host thread per logical CUDA thread. This MIMD approach to kernel execution is terribly inefficient, as most of the application's runtime is spent by the OS context-switching between threads. Moreover, it departs so radically from the way GPUs execute kernels that CUDA kernels which are invalid for subtle reasons may run correctly on NVIDIA's emulator but incorrectly on a GPU. Ocelot's PTX emulator presents an execution environment much closer to the GPU's, one that captures the subtleties of the GPU execution model.

GPU Ocelot is fairly stable as of this writing. It has been validated with over a hundred applications from the CUDA SDK, third-party benchmark suites, CUDA-based libraries such as GPU VSIPL and Thrust, and custom CUDA applications. Ocelot has been used to capture performance characteristics of these applications as well as to find nuanced bugs in existing CUDA programs. It constitutes a great improvement over NVIDIA's emulation-mode CUDA runtime in terms of accuracy, observability of potential bugs, and emulation performance.

Tuesday, February 2, 2010

JSON

Last October, I wrote a JSON parser in C++ that constructs a document object structure. A visitor class provides convenient access to the document object, and a number of tests validate the parser. I've also relaxed the specification to permit strings without quotation marks, provided they follow the same rules as identifiers in a C-like programming language. Multi-byte encodings aren't as well tested or supported.

I'm preparing to make it available under a BSD license. Here's a sample usage scenario from the test suite.


//
// Example JSON usage
//
// Note: 'passes' and 'verbose' are defined by the surrounding test harness.
try {
    std::stringstream ss;
    json::Parser parser;

    // without quotes, and with weird spacing
    ss << "{objects:[{id:1, location : [ 0, 0,0]},{id:2,location:[1.25,0.25,4]}]}";

    // parse the stream into a document object, then wrap it in a visitor
    json::Object *object = parser.parse_object(ss);
    json::Visitor visitor(object);

    json::Visitor obj0 = visitor["objects"][0];
    json::Visitor obj1 = visitor["objects"][1];

    if ((int)(obj0["id"]) != 1 || (double)(obj0["location"][0]) != 0) {
        if (verbose) {
            std::cout << "visitor test 0 - failed for obj0\n";
        }
        passes = false;
    }

    if ((int)(obj1["id"]) != 2 || (double)(obj1["location"][0]) != 1.25 ||
        (double)(obj1["location"][1]) != 0.25 || (double)(obj1["location"][2]) != 4) {
        if (verbose) {
            std::cout << "visitor test 0 - failed for obj1\n";
        }
        passes = false;
    }

    delete object;
}
catch (common::exception &exp) {
    passes = false;
    std::cout << "parse failed for visitor test 0\n";
    std::cout << exp.message << std::endl;
}