Spatchcock: 2009

Monday, December 28, 2009

Southeastern Railway Museum

The day after Christmas, my parents, Emma, and I spent the afternoon at the Southeastern Railway Museum. Its large and growing collection includes a number of cosmetically restored pieces of rolling stock from railroads in the Southeast, including several large steam locomotives. This post covers some of the photographs from that visit.

S&A #750

Savannah and Atlanta #750 - This 4-6-2 "Light Pacific", built in 1910, pulled excursions all around the Southeastern U.S., including one in 1989 that I rode on. It was last operated that year and currently sits on static display, having been cosmetically restored.

Selection sort, 64-bit

Having haphazardly embarked on Agner Fog's informative C++ optimization guide [pdf], I decided it was time to write some x86-64 assembly to gain familiarity with the x86-64 instruction set.

Selection sort is an easy hello-world type application for assembly programming. On a Windows platform, getting Visual Studio and NASM to build a project together demanded a moment's tinkering, so I didn't want to be too ambitious.

The code listing follows. The best part is there is no stack frame to prepare or registers to spill. This doesn't do anything sophisticated like prefetch data, and of course isn't the best algorithm for sorting. It is pretty easy to spot the nested loops.

Posted for your amusement.


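section .text code align=16

global selsort

; void selsort(int *A, int N)
;
; Windows x64 calling convention:
; A: rcx
; N: edx (sign-extended into rdx below)
;
; rax = i, r8 = j, r10 = index of the smallest element seen so far
;
selsort:
 movsxd rdx, edx ; N is a 32-bit int; the upper half of rdx is not guaranteed
 test rdx, rdx
 jle L5 ; nothing to sort when N <= 0
 xor rax, rax ; i = 0
 
L1: ; outer loop over i
 mov r8, rax
 inc r8 ; j = i + 1
 mov r10, rax ; min = i
 
L2: ; inner loop: find the smallest element in A[i..N)
 cmp r8, rdx
 jz L3
 
 mov r11d, dword[rcx+r8*4] ; A[j]
 mov r9d, dword[rcx+r10*4] ; A[min]
 
 cmp r9d, r11d
 cmovg r10, r8 ; signed compare: if A[min] > A[j] then min = j

 inc r8
 jmp L2
 
L3: ; swap A[i] and A[min] if A[min] is smaller
 mov r11d, dword[rcx+rax*4] ; A[i]
 mov r9d, dword[rcx+r10*4] ; A[min]
 
 cmp r9d, r11d
 jge L4
 
 mov dword[rcx+r10*4], r11d
 mov dword[rcx+rax*4], r9d
 
L4:
 inc rax
 cmp rax, rdx
 jnz L1
 
L5:
 ret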
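
For comparison, here is a rough C++ sketch of the same nested loops. It is a hand-written reference, not something generated from the assembly, and the name selsort_reference is made up for this post. It uses signed comparisons throughout, which is what the int *A signature implies, and it simply does nothing when N is zero or negative.

```cpp
#include <cassert>
#include <utility>

// Hand-written reference for the selection sort above (hypothetical name).
// Sorts A[0..N) in ascending order using signed comparisons.
void selsort_reference(int *A, int N) {
    assert(N <= 0 || A != nullptr);
    for (int i = 0; i < N; ++i) {            // outer loop (L1), i plays the role of rax
        int min_index = i;                   // r10
        for (int j = i + 1; j < N; ++j) {    // inner loop (L2), j plays the role of r8
            if (A[min_index] > A[j]) {
                min_index = j;               // the conditional move
            }
        }
        if (A[min_index] < A[i]) {           // L3: swap only when it changes something
            std::swap(A[i], A[min_index]);
        }
    }
}
```

A version like this is handy as an oracle: feed both routines the same arrays, including ones with negative values, and compare the results.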

Monday, December 14, 2009

netherp - share your naughty bits!

At Josh's request, I threw the source code to netherp v2 up on a public Github repository. Here is the URL:

http://github.com/kerrmudgeon/kerrutils/blob/master/netherp

netherp was originally intended to be a minimal-configuration HTTP server for sharing files between two machines of arbitrary make and model. One day a few weeks ago I became enthusiastic about asynchronous I/O and non-threaded server architectures, so I wrote a second version from scratch. It doesn't support HTTP PUT or POST or any method other than GET. There is no authentication, and I suspect there may be ways of compromising it.

Nevertheless, there it is. It's single-threaded, platform-independent [within reason], and relatively fast. On my quad-core 64-bit Linux platform in the lab, I ran http_load requesting a variety of files from my local GPU Ocelot repository ranging in size from 1 kB to hundreds of kB. It handled http_load's maximum of 1,000 requests per second without trouble; in this case, the network was the bottleneck. It has also served a 400 MB file over a residential cable connection, this time running on a Windows machine.

So, it's vetted. It works.

There are several barbaric components that ought to be hastily buried underneath a pile of complexity theory [e.g. the MIME type selector], but this does demonstrate working use of select(), which lends some much-needed pedagogical value to this exercise.
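
For anyone who hasn't used select() before, here is a minimal C++ sketch of the core idea: hand the kernel a set of descriptors and ask which ones can be read without blocking. This is not lifted from netherp; the function name poll_readable and its parameters are made up for illustration, and it assumes a POSIX system.

```cpp
#include <cassert>
#include <sys/select.h>
#include <unistd.h>
#include <vector>

// Illustrative helper (not netherp code): returns the descriptors in `fds`
// that are readable within `timeout_ms` milliseconds.
std::vector<int> poll_readable(const std::vector<int> &fds, int timeout_ms) {
    fd_set read_set;
    FD_ZERO(&read_set);
    int max_fd = -1;
    for (int fd : fds) {
        assert(fd >= 0 && fd < FD_SETSIZE);  // select() cannot track larger descriptors
        FD_SET(fd, &read_set);
        if (fd > max_fd) max_fd = fd;
    }

    timeval timeout;
    timeout.tv_sec = timeout_ms / 1000;
    timeout.tv_usec = (timeout_ms % 1000) * 1000;

    std::vector<int> ready;
    // select() rewrites read_set in place, leaving only the ready descriptors.
    if (select(max_fd + 1, &read_set, nullptr, nullptr, &timeout) > 0) {
        for (int fd : fds) {
            if (FD_ISSET(fd, &read_set)) ready.push_back(fd);
        }
    }
    return ready;
}
```

A single-threaded server would call something like this in a loop, passing the listening socket plus every open client socket, and then accept or read only on the descriptors that come back.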

Additional comments are certainly welcome.

One other facet of interest is that I wanted this to be distributable as source, so I wrote a program to embed the binary image files used as logos and icons into the source code of the application. It expresses a binary file as a C++-style array declaration, e.g.:


// 'images/file.png'
const unsigned int image_file[] = {
 0x474e5089, 0xa1a0a0d, 0xd000000, 0x52444849, 
 0x30000000, 0x28000000, 0x608, 0x7987b800, 
 ...
};

May not be the best way, but at least it's out the door.
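
Here is a minimal C++ sketch of how such an embedding step might work. It is not netherp's actual generator: the function name embed_as_array and its parameters are made up for illustration. The little-endian packing into 32-bit words is inferred from the listing above, where the PNG signature bytes 89 50 4E 47 show up as 0x474e5089.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Illustrative helper (hypothetical name): renders raw bytes as a C++ array of
// 32-bit words, packed little-endian, four words per line. A trailing partial
// word is padded with zero bytes.
std::string embed_as_array(const std::string &symbol, const std::string &path,
                           const std::vector<std::uint8_t> &bytes) {
    assert(!symbol.empty());
    std::ostringstream out;
    out << "// '" << path << "'\n";
    out << "const unsigned int " << symbol << "[] = {\n";
    const std::size_t words = (bytes.size() + 3) / 4;
    for (std::size_t w = 0; w < words; ++w) {
        std::uint32_t value = 0;
        for (std::size_t b = 0; b < 4; ++b) {
            const std::size_t i = w * 4 + b;
            if (i < bytes.size()) {
                value |= static_cast<std::uint32_t>(bytes[i]) << (8 * b);
            }
        }
        if (w % 4 == 0) out << " ";
        out << "0x" << std::hex << value << std::dec << ",";
        out << ((w % 4 == 3 || w + 1 == words) ? "\n" : " ");
    }
    out << "};\n";
    return out.str();
}
```

Note that the original length in bytes is lost whenever the file size isn't a multiple of four, so a real tool would probably want to emit a separate size constant alongside the array.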

Monday, November 16, 2009

Frigid Digit

Georgia Tech Sailing Club hosted its intra-club regatta on Saturday. Four 420s sailed out into Lake Lanier and completed a few races. It was very informal, and the light winds made things maddeningly slow at times, although things were never so calm as to prevent deliberate maneuvering.

I didn't win any of the races, but my crew and I came in second place a few times [which doesn't technically matter - sailing rewards winners only]. There are subtle tactical and strategic differences that distinguish experts that I began to pick up on. A strong start is key to dominating a pack of boats - you receive "puffs" of stronger winds earlier and can speed up to stronger positions. Additionally, good positioning and tacking strategies tend to grant a smart skipper right of way during those moments when two boats come upon each other.

There were some tactical matters as well. One boat just seemed to "go faster" in identical wind conditions. Since this regatta consisted of boats of the same class and make, presumably the fastest were those trimmed the best. After we stopped racing, I asked the experienced crews to sail a straight course upwind so that I could follow along and adjust sail trim and crew positions to improve my speed relative to theirs. That helped considerably to identify the optimal sail positions.

Ralph's Picasa gallery of the day.

Here I am sailing downwind. This is the boring part of the race.

Image

The Nacra catamaran that capsized after flying one of its hulls:

Image

Update

While getting my haircut today, I thought about what it might take to determine the optimal strategy for a particular boat under a given set of wind conditions in a regatta. Simplifying assumptions would be:

* uniform wind direction and speed across the entire area
* no other boats to create local calm areas with bad air or force suboptimal maneuvering
* uniform sea state
* ideal sail and boat trim

This hypothetical systematic approach would attempt to minimize total time around a pair of marks. Specifically, this would determine:

* what course to make that minimizes time to reach the upwind mark
* which tack to start on
* when to tack
* whether to "run" or "broad reach" during the downwind leg

Since being at the right place at the right time with the wind on the starboard side is advantageous when encountering other boats, the model might favor a starboard tack at certain points in the course.

It seems like this could be computed given

* an accurate speed model of the Club 420 under typical loading conditions for each point of sail
* performance model of a well-executed tack
* performance model of a well-executed gybe

This would attack the problem that, I suppose, gets answered through intuition and experience by 'good' skippers. Of course, it completely ignores the influence of other boats whose presence could easily compromise a good single-boat strategy. It also assumes the skipper and crew are good enough to maneuver in ways that satisfy the performance models; this depends on training and skill. Nevertheless, it seems like having a good theory here would be a great way to quickly gain a practical intuition that could be deployed out on the water.

Someone get me a performance model!
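
In the meantime, here is a rough C++ sketch of the upwind half of the problem under the simplifying assumptions above. Everything in it is hypothetical: the PolarPoint table stands in for the speed model I don't have, and the tack model is just a fixed number of seconds lost per tack. It picks the heading with the best velocity made good (boat speed times the cosine of the angle off the wind) and then estimates the time to a mark dead upwind.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One row of a hypothetical speed model: boat speed at a given angle off the wind.
struct PolarPoint {
    double angle_deg;    // angle between heading and the true wind direction
    double speed_knots;  // boat speed on that heading
};

// Velocity made good toward an upwind mark: the upwind component of boat speed.
double vmg_knots(const PolarPoint &p) {
    const double kPi = 3.14159265358979323846;
    return p.speed_knots * std::cos(p.angle_deg * kPi / 180.0);
}

// Pick the table entry with the highest VMG. The table must not be empty.
PolarPoint best_upwind_heading(const std::vector<PolarPoint> &polar) {
    assert(!polar.empty());
    PolarPoint best = polar.front();
    for (const PolarPoint &p : polar) {
        if (vmg_knots(p) > vmg_knots(best)) best = p;
    }
    return best;
}

// Seconds to cover `distance_nm` dead upwind on the chosen heading, given
// `tacks` tacks and a fixed time cost for each one.
double upwind_leg_seconds(double distance_nm, const PolarPoint &heading,
                          int tacks, double seconds_lost_per_tack) {
    const double hours = distance_nm / vmg_knots(heading);
    return hours * 3600.0 + tacks * seconds_lost_per_tack;
}
```

With uniform wind and the same speed on both tacks, the distance sailed to a mark dead upwind doesn't depend on how many times you tack, so in this toy model every extra tack is pure loss. The interesting decisions only show up once the wind shifts or other boats get involved.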

Thursday, November 12, 2009

Git solves your problems

So last night, I wanted to install a Subversion server on Emma's darling quad-core behemoth hosting akerr.net among *other things*. Inspecting the installation, it seems mod_dav was not compiled into Apache2.x, and no one after Apache 1.3 seems to suggest loading it as a dynamic shared library. I successfully compiled several test builds of Apache2.2 in user space, but matching Emma's configuration in a maintainable way would have been a nightmare inside of doubtfulness wrapped in the promise of sleepless nights down the road if things go wrong.

Then I went looking for alternatives. It seems the only administrative prerequisites for hosting a remote Git repository are that the host runs sshd and that you have an account on said machine. After installing the lovely msysgit Git client for Windows on my home desktop, I performed the following.

On the remote side:


$ ssh akerr@akerr.net
$ mkdir akerr.git && cd akerr.git
$ git --bare init
Initialized empty Git repository in /home/akerr/akerr.git/
$

On the local side:


$ cd C:/research/akerr
$ git init
$ git remote add origin "ssh://akerr@akerr.net:<ssh port>/~/akerr.git"
$ git add infrastructure/
 ...
$ git commit -a -m "initial revision of libkerr"
$ git push origin master
 ...
$

This is all that is needed. To pull from the remote host, simply perform the following:


$ git pull origin master
 ...
$

The elegant components of this solution are:
- doesn't require superuser access or configuration changes on the host
- security is a function of SSH
- git is fast and functional in an isolated environment

Thank you, Emma, for the free hosting. This helps you too. : )

Thursday, November 5, 2009

Wheel of Reincarnation

I read an interesting paper describing the Wheel of Reincarnation - the phenomenon in which certain specialized computations are handled by coprocessors but then these computations are shifted back to the main processor as CPUs gain in performance.

General computing power, whatever its purpose, should come from the central resources of the system. If these resources should prove inadequate, then it is the system, not the display, that needs more computing power.

[Myer and Sutherland. "On the design of display processors"]

I guess this phenomenon is satisfied by general-purpose computing on GPUs that brings the graphics processor back into the 'system' instead of being a specialized co-processor for accelerating DirectX.

With GPU Ocelot, the system is getting more power.

Friday, October 9, 2009

Emma's Birthday!

Let us also observe that my darling Emma is twenty-six years awesome as of yesterday. We celebrated with supper, exquisite company, Maker's Mark, and cookie cake. Presumably, celebratory ice skating is upcoming.

Happy birthday, Emma!

Atlanta Calling

By now, you're surely tired of seeing the evidence of my awesome summer. To keep you interested in lieu of a complete blog post chronicling my travels to HPEC'09 and then IISWC, here is a Google Image Search that should elicit a chuckle among some of you.

Example:

backhoe

Real post coming soon.

Tuesday, September 8, 2009

Post Summer 2009

I am back from Seattle. Here is a taste of what I did this summer. Special thanks go to all of the people I met along the way.

Flying a Cessna 172S:

link

Sailing on Lake Union:

link

Bike ride to Lake Washington:

link

Helicopter flight over Seattle:

link

10,000 ft climb of Mt. Rainier:
10kft climb of Mt. Rainier

Things I want to accomplish personally this Fall:

* nail down research agenda working toward thesis
* paper at PACT
* sail considerably, taking you and others with me
* bicycle over a notable distance
* hike somewhere notable
* undertake a slaying expedition
* gain upper body strength for rock climbing options class next Spring

Let's see if a more productive Andykerr can accomplish all of this. I'll have to put considerably less whiskey in my coffee the next few mornings.

Wednesday, May 13, 2009

VSIPL Design Thoughts

As an implementer of the VSIPL API, I've thought of several small criticisms of the library. By posting them here, Google will record them for posterity.

* VSIPL Core Lite should require implementations to include vsip_complete() -
This could easily be a no-op on any implementation that doesn't perform some sort of lazy evaluation. Applications could call it without fear that things may break if they move to another implementation.

* VSIPL API should include vsip_usermalloc() -
This would allocate user-accessible data blocks in a manner that could be used efficiently by the library. The alternative is to let the user perform their own memory allocation, which has no regard for alignment or pinned (page-locked) memory.

* VSIPL should include more robust error reporting methods -
Reports might be implementation-dependent, but they would enable much more robust composition.
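
To make the vsip_usermalloc() suggestion a little more concrete, here is a small C++ sketch of the alignment half of the idea. It is not part of VSIPL or of any implementation: usermalloc_sketch, userfree_sketch, and is_aligned are hypothetical names, and the sketch assumes a POSIX system with posix_memalign(). Pinning the pages is a separate, platform-specific step that is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical stand-in for the proposed vsip_usermalloc(); not a VSIPL function.
// Returns `bytes` of storage aligned to `alignment`, or nullptr on failure.
// `alignment` must be a power of two and a multiple of sizeof(void *).
void *usermalloc_sketch(std::size_t bytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    void *block = nullptr;
    if (posix_memalign(&block, alignment, bytes) != 0) {
        return nullptr;
    }
    return block;
}

// Releases storage obtained from usermalloc_sketch().
void userfree_sketch(void *block) {
    free(block);
}

// True when `block` sits on an `alignment`-byte boundary.
bool is_aligned(const void *block, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(block) % alignment == 0;
}
```

The point of putting this behind the library is that the implementation, not the application, knows which alignment and paging behavior its backend actually wants.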

I am confident none of these requirements would break existing implementations or negatively impact users.

Monday, April 13, 2009

Inexpensive Performance

Taking a break from MemoryTraceGenerator.cpp to mention:

I'm replacing the motherboard and CPU in my dad's home desktop with an Athlon 64 x2 Kuma at 2.7 GHz and an $80 EVGA motherboard with a GeForce 8200 integrated video chipset. The total here is roughly $200 if you also purchase several GB of DDR2 RAM. This wouldn't be a terrible CUDA development platform, particularly if someone loaned you a higher-end GPU. Even without, its integrated GPU is still capable of accelerating typical applications such as video decoding.

I only mention this because what one might consider today's low-end desktop is astonishingly powerful.

It's also my first time touching AMD.

Monday, April 6, 2009

Winds

Design fail: high winds bristling down Peachtree St at the 130 ft level tend to pass over and around the balcony railings of our condo creating quite a bit of commotion. I would estimate the power spectrum has peaks at the 10 kHz frequency as well as the 5 Hz and ~100 Hz frequencies. The noise was so loud and so stationary that, at Emma's suggestion, I slept wearing ear plugs from my shooting kit. That worked for roughly two hours.

Can't we just remove the railings from the balcony during high winds?

Saturday, April 4, 2009

Webcam

Emma's webcam is now operational. I am using webcamXP 5 [free!] to make HTTP POST calls to upload images. At present, the server will store only the most recent image, so don't expect a log. It is updated once every hour. I may enable serving the most recent 24 hours, but that functionality will not come online today.

I used a PHP script on the server to receive the files and authenticate the poster. If you guessed the passphrase and spoofed our IP address, I guess you could upload an image of a giant eyeball and it would appear on our webcam page.

Here it is. Bookmark this:

webcam

Note the billboard between the Viewpoint and the T-Mobile sales office. The nude male ad has been replaced with an ad for The Mighty Boosh on Adult Swim.

Thursday, April 2, 2009

GPU VSIPL Press

GPU VSIPL is now featured on NVIDIA's CUDA Zone: GPU VSIPL. NVIDIA lists it as an application, although we only wrote the application to demonstrate the library (and achieved a 75x speedup over the world-famous TASP VSIPL distribution).

In unrelated news, Emma and I have completed our move. I need to set up her webcam.

Monday, March 9, 2009

GPGPU'09

The presentation at GPGPU'09 went well yesterday. I lasted several minutes over budget [as usual], but I felt good about it. When I get home and find my credentials to the GPU VSIPL website, I'll upload my presentation and cite my paper.

Today was the first day of ASPLOS. I wound up sleeping in and missed the keynote, but I sat at the keynote speaker's table during lunch. He's a Google guy and regaled the group with clever ideas, charm, and wit.

The trouble with conferences is they fill you with a billion great ideas and no time to implement them. The wild and crazy idea session included:

* bubble wrap cores - disposable CPU cores so you can run several hotter and faster and burn them out quickly
* on-chip power - nuclear or piezoelectric
* neuro-implants - CPUs in your brain
* purely speculative cores - run parts of your program long before you ever get to them

2 of 3 beer vouchers remain.

Thursday, March 5, 2009

$QR$ Presentation

I finally finished my GPGPU '09 presentation. I need to script it and make sure I can present all 45 slides in 20 minutes. Some of them are closer to animations than "view graphs" so the actual number of slides I'd spend more than a few sentences on is lower.

How do you guys prepare for presentations?

Word-for-word scripting, bullet points you remember to mention in the context of a larger discussion you have with your audience, mix Adderall and a vodka tonic shortly before the presentation and hope for the best? Comment!

Friday, February 27, 2009

GPU VSIPL

Released a new GPU VSIPL. Our range-doppler map application achieved 75x speedup on my machine at work.

GPU VSIPL range-doppler map application

Thursday, February 19, 2009

Productivity

Installing Wubi, Ubuntu for Windows. I rather wish I had a 64-bit operating system installed on this machine.

Monday, February 16, 2009

Traveling

I've been fairly remiss in keeping up this blog and in keeping up appearances in real life. I'll try to at least provide an overview in this post.

I'm typing this from a hotel room in Raleigh, NC. I'm attending the High-Performance Computer Architecture (HPCA) conference, which runs until Wednesday. Three other gentlemen of acclaim from my lab are here, though none of us are presenting. We're here to attend talks, ask questions, converse with presenters in the hallways, and cash in our liquor vouchers at the appropriate times.

Today's keynote was delivered by a fellow at HP and summarized the elements of HP Labs's research thrusts: intelligent infrastructure, content transformations, information management, analytics, cloud computing, and sustainability.

The first three "award" paper presentations covered (1) prefetching for linked data structures, (2) recovery and prediction from voltage emergencies, and (3) long-wire topologies for low-latency 3D networks on chip.

This afternoon, I attended several talks on multicore caching, coherence protocols, and a software-managed cache implementation for IBM Cell [looking shockingly similar and not materially superior to either a software-managed cache I implemented for our PTX->Cell runtime or the SMC that ships with IBM's SDK; I didn't pipe up and tell the presenter about this, as I don't think anyone expected it to be].

Georgia Tech is well represented here, with several full profs in attendance [not mine] and a number of students. The panel session today featured the legendary William "Bill" Dally, now of NVIDIA. As Mark Richards described, he is a brilliant and charismatic engineer who portrayed GPUs in a very positive light, having led a number of streaming architecture projects at Stanford in the past. The panelists all agreed that multicore is important, that computer architecture needs to inform computer science systems projects at various levels of abstraction [i.e. compiler writers, OS designers, and CPU architects need to have lunch at the same table every day], and that research projects need to be more aggressive and less product-like [i.e. make crazy and wild assumptions about the future target platforms and build from that; if your results show significant gains even though you couldn't actually build it today, maybe that's how future systems will actually be developed]. And apparently the difference between parallel computing and distributed computing is that in distributed computing, you don't assume all machines are working.

Overall, this conference is far more academic than HPEC, which is very much targeted at DoD types. Some very deep and nuanced computer architecture topics occupy many of the papers here. The award paper covering prefetching for linked data structures, for instance, discusses a method by which pointers may be identified on cache misses and automatically prefetched with the assumption that they point to nodes in a data structure likely to be fetched shortly anyway. The complete proceedings are presented to attendees both in print and, cleverly, on a complimentary 1 GB USB drive / laser pointer / ballpoint pen.

That's all for now. I'll be back at some point before Thursday at 10am. I've just submitted a revised edition of my paper to GPGPU 2009. I'll fly to DC on March 7 and come back Mar 12. For future reference, make sure your ACM submission has the copyright notice on the first page; they get mad if you don't.

See you soon, dear reader(s).

Wednesday, January 21, 2009

New Work

Just built the new PC at work.

andrew@:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
stepping : 10
cpu MHz : 2833.268
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips : 5666.53
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

It says the same for the other three cores.