Sunday, November 9, 2008

SSE3 and Manhattans

Now that the semester's coding for the PTX to SPU translator is complete, I spent the weekend researching some areas that I've been thinking about but haven't had much time to investigate. So, it was a weekend spent mostly* hacking.

1.)
Streaming SIMD Extensions (SSE) are a set of instructions added to the x86 instruction set beginning with the Pentium III. These instructions apply the same operator, typically floating-point {*, +, -, /}, to corresponding elements of 128-bit vector registers, each holding four single-precision floats. Since the four lanes are processed in parallel, you can typically perform more operations in a given number of clock cycles than with scalar floating-point code.
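To make "same operator on corresponding elements" concrete, here is a minimal sketch in C using the compiler intrinsics from <xmmintrin.h>. It assumes an x86 compiler with SSE enabled (the default on x86-64). The helper name madd4 is made up for this illustration and is not taken from the matrix class mentioned in this post.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps, ... */

/* Computes out[i] = a[i] * b[i] + c[i] for i = 0..3.
   madd4 is a hypothetical helper name used only for illustration. */
void madd4(const float *a, const float *b, const float *c, float *out)
{
    __m128 va = _mm_loadu_ps(a);   /* unaligned load of four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);

    __m128 prod = _mm_mul_ps(va, vb);          /* four multiplies in one instruction */
    _mm_storeu_ps(out, _mm_add_ps(prod, vc));  /* four adds in one instruction */
}
```

The scalar version would need four multiplies and four adds; here each is a single vector instruction, which is where the extra throughput comes from.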

SSE2 and SSE3 are revisions that added further instructions as programmers demanded them. If you have a Pentium 4 Prescott or better, you have SSE3. If you have a 2.2 GHz P4 Northwood, your CPU supports only up to SSE2, and you miss out on the faster-better-cheaper possibilities concomitant with SSE3.

I spent a few hours updating my hand-rolled matrix class with compiler intrinsics (statements with the semantics of C functions but a direct correspondence to CPU instructions; they are also portable across the major compilers), and I achieved a 2.5x speedup for matrix multiply. SSE3 provides support for horizontal operations, in which the operator is applied to adjacent elements within the same 128-bit register. This permits the implementation of dot products and complex arithmetic without shuffle instructions and makes the code a lot faster. If your CPU doesn't support SSE3, you should probably build a new system (and use the existing system as a dedicated build machine).
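As a rough illustration of the horizontal operations described above, the following sketch computes a 4-element dot product with the SSE3 intrinsic _mm_hadd_ps. It is not the code from the matrix class; the function name dot4 is hypothetical, and the target attribute assumes GCC or Clang (it lets the function use SSE3 even when the file is not built with -msse3).

```c
#include <pmmintrin.h>  /* SSE3 intrinsics, including _mm_hadd_ps */

/* Dot product of two 4-element float vectors.
   dot4 is a hypothetical name; the attribute is a GCC/Clang extension. */
__attribute__((target("sse3")))
float dot4(const float *a, const float *b)
{
    /* prod = [p0, p1, p2, p3], the four element-wise products */
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* First horizontal add: [p0+p1, p2+p3, p0+p1, p2+p3] */
    __m128 sum = _mm_hadd_ps(prod, prod);

    /* Second horizontal add: every lane holds p0+p1+p2+p3 */
    sum = _mm_hadd_ps(sum, sum);

    return _mm_cvtss_f32(sum);  /* extract the lowest lane */
}
```

One entry of a 4x4 matrix product is exactly such a dot product of a row with a column, so a routine like this could serve as the inner step of a small matrix multiply. Without SSE3, the two horizontal adds would have to be replaced by shuffle-and-add pairs.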

2.)
CUDA is interoperable with OpenGL and Direct3D9. I spent a few hours tonight writing a quick DirectX application that renders a textured quad and then performs post-processing (a separable 2D convolution) with a CUDA kernel. The immediate application of this would be to produce efficient visualizations for GPU-based simulations. Another idea is to perform 3D rendering with DirectX and image-space post-processing with CUDA, though Cg/HLSL is still probably the right way to implement that.

Also, the fragmented nature of OpenGL distributions across versions and driver providers made it more of a debugging hassle to get working than CUDA-DirectX interoperability.
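For readers unfamiliar with the separable 2D convolution mentioned above, here is a plain C reference sketch that runs on the CPU. It is not the CUDA kernel from the application; the function names, the scratch buffer, and the clamp-to-edge border handling are all assumptions made for illustration. The idea is that a 2D filter that factors into a row filter and a column filter can be applied as two 1D passes, costing about 2(2r+1) multiplies per pixel instead of (2r+1)^2 for radius r.

```c
/* Clamp an index into [0, n-1]; border policy is an assumption. */
static int clamp_index(int i, int n)
{
    if (i < 0) return 0;
    if (i >= n) return n - 1;
    return i;
}

/* Separable 2D convolution of a row-major width x height float image.
   kernel has 2*radius+1 taps and is used for both passes.
   tmp is a caller-provided scratch buffer of the same size as src. */
void convolve_separable(const float *src, float *tmp, float *dst,
                        int width, int height,
                        const float *kernel, int radius)
{
    /* Pass 1: convolve each row, writing into tmp. */
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int i = -radius; i <= radius; ++i)
                acc += kernel[i + radius] *
                       src[y * width + clamp_index(x - i, width)];
            tmp[y * width + x] = acc;
        }
    }

    /* Pass 2: convolve each column of tmp, writing into dst. */
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int j = -radius; j <= radius; ++j)
                acc += kernel[j + radius] *
                       tmp[clamp_index(y - j, height) * width + x];
            dst[y * width + x] = acc;
        }
    }
}
```

In a GPU version, each output pixel of a pass would typically be computed by its own thread, with the two passes run as two kernel launches; that mapping is a common approach, not a description of the code written for this post.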

3.)
Identified the need for a new power supply. Apparently, a GeForce GTX 280 has been purchased for me. I'd like to use it along with the GeForce 9800 GX2 giving me a grand total of 3 GPUs and 2 GB of GDDR3 memory. I'm working on ways to leverage all three at once, so this isn't a fool's errand. Unfortunately, my power supply cannot source enough current on enough lines to power both cards. During Christmas break, I'll make the transition.

*
During a trip to Harry and Sons, I decided to modify my usual order of Chicken Larb (Thai salad). I still ordered it, but I augmented it with a Manhattan. For those of you who don't know, a Manhattan is a cocktail of whiskey and sweet vermouth. Typically, I avoid cocktails because (1) I'd only really had bad examples and (2) cola + {rum, whiskey} is difficult to beat. The stigma of cocktails being girly drinks may have originated during my freshman year's first experiences with alcohol; vodka, grenadine, and orange juice are simply not something I'll ever combine again.

A well-made Manhattan, on the other hand, is quite strong yet simple enough to order from a busy bartender. In terms of flavor, it is quite divine. The vermouth dulls all of the whiskey's edge, leaving only the wonderful caramel flavoring. Typically, it's made with 3-4 shots of the principal, so one drink takes you a long way toward inebriation while looking classy the entire time. It's my new official drink.


CUDA, DirectX9, SSE3, and the Manhattan cocktail: all for the win!

1 comment:

Sara said...

Staying classy is the key....

:-)