I'm low on ideas for getting the VSIPL Lite + Integer functions to run any faster. The plan I was investigating was meant to mitigate the inefficient memory accesses you face when working with vectors of non-unit stride [i.e., consecutive elements of the vector are not consecutive in memory].
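For context, here is a minimal sketch in C of the kind of mitigation I mean (the function names are hypothetical, not the actual VSIPL Lite source): gather the strided operands into contiguous scratch storage so the inner loop runs at unit stride, at the cost of an extra copy.

#include <stddef.h>

/* Hypothetical illustration, not actual VSIPL Lite code: gather a
 * non-unit-stride integer vector into a contiguous scratch buffer so the
 * hot loop can run at unit stride (and with friendlier cache behavior). */
static void gather_strided(const int *src, ptrdiff_t stride, size_t n, int *dst)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i * stride];      /* touches a new cache line per element when stride is large */
}

/* Example: elementwise add on two strided vectors via contiguous staging. */
void vadd_strided(const int *a, ptrdiff_t sa,
                  const int *b, ptrdiff_t sb,
                  int *r, ptrdiff_t sr, size_t n,
                  int *scratch /* at least 2*n ints */)
{
    int *ta = scratch, *tb = scratch + n;
    gather_strided(a, sa, n, ta);
    gather_strided(b, sb, n, tb);
    for (size_t i = 0; i < n; ++i)     /* unit-stride inner loop */
        r[i * sr] = ta[i] + tb[i];     /* scatter back to the strided result */
}

Whether the copy pays for itself depends on how many times the gathered data gets reused, which is exactly the part I haven't found a good answer for.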
--
I'm investigating several parallel programming languages, "paper languages" as they have been called for their lack of substantial toolchain support and the fact that their only significant utility is that you can write papers about them. They all begin with well-constructed preambles on the woeful state of affairs in the parallel programming community, in which each language or solution represents only a single style of parallelism well. OpenMP, for instance, is really only suitable for data-level parallelism on a shared-memory machine. MPI is suitable for task-level parallelism on a cluster. Neither one serves both domains well, if at all. The ones I've encountered so far (StreamIt, Brook, Chapel) sound reasonable. But claims of productivity need to be backed up, so we're all going to implement useful things with them and then compare notes.
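To make the OpenMP half of that contrast concrete, here is a toy data-parallel loop in plain C (the array sizes and values are made up): shared memory, one pragma, no explicit communication. Expressing an MPI-style task decomposition in this model is where things get awkward.

#include <omp.h>
#include <stdio.h>

/* A minimal example of the data-level parallelism OpenMP handles well on a
 * shared-memory machine: one loop, many independent iterations, split
 * across threads by a single pragma. Compile with, e.g., gcc -fopenmp. */
int main(void)
{
    enum { N = 1 << 20 };
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f; }

    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];            /* each thread processes a chunk of the index range */

    printf("c[12345] = %f (threads available: %d)\n", c[12345], omp_get_max_threads());
    return 0;
}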
Here is a fairly comprehensive list of languages that capture the sentiment. Some of them you wouldn't actually want to use in general. Some are domain-specific (SystemC being the interesting one of those, I think), and others are narrowly scoped extensions of an existing paradigm (Intel SSE, for example, is a set of SIMD instructions added to x86: useful, but you'll probably never touch them yourself).
Parallel Languages
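As an aside, here's roughly what "touching them yourself" would look like: a hand-written SSE loop in C (a sketch only, with n assumed to be a multiple of four), which is exactly the kind of thing a compiler or library usually writes for you.

#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Illustration only: an SSE version of c = a + b on single-precision data.
 * Each _mm_add_ps processes four floats at once. */
void vadd_sse(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned 4-wide load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}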
I'm going to try to implement FFT using StreamIt, Unified Parallel C (UPC), and perhaps Brook or some other streaming language in the next few days.
--
I am now the proud owner of a GeForce 9800 GX2. Help me think of things to do with it besides using it as an alternative to central heating.
TSMC plays fast and loose with process design rules, so overclockability may be limited, though I'm still inclined to try. I didn't overclock the GF8800GTX because I needed an easily reproduced platform on which to perform benchmarks. One interesting aspect of the GX2 card is that while it has a total of 1GB of video memory, that memory is partitioned into two address spaces, each accessible by one GPU. Each device can only allocate buffers smaller than 512MB, and exchanging data between the GPUs is fairly slow [it goes over the SLI link]. Nevertheless, task-level parallelism and pipelining ought to perform well.
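For a sense of what that partitioning means in code, here's a host-side sketch using the CUDA runtime C API (the buffer size is made up, and it assumes a runtime where a single host thread can switch between devices; otherwise you'd dedicate a host thread per GPU): each GPU gets its own allocation, and moving data from one to the other is staged through host memory.

#include <cuda_runtime.h>
#include <stdlib.h>

/* Sketch of the GX2's two address spaces: each GPU allocates from its own
 * half of the 1GB, and a GPU-to-GPU "exchange" goes device -> host -> device. */
int main(void)
{
    const size_t bytes = 256u << 20;           /* 256MB, safely under the per-GPU limit */
    float *d0, *d1;
    float *host = (float *)malloc(bytes);      /* staging buffer in system memory */

    cudaSetDevice(0);
    cudaMalloc((void **)&d0, bytes);           /* lives in GPU 0's address space */

    cudaSetDevice(1);
    cudaMalloc((void **)&d1, bytes);           /* lives in GPU 1's address space */

    /* Move the contents of GPU 0's buffer to GPU 1 via the host. */
    cudaSetDevice(0);
    cudaMemcpy(host, d0, bytes, cudaMemcpyDeviceToHost);
    cudaSetDevice(1);
    cudaMemcpy(d1, host, bytes, cudaMemcpyHostToDevice);

    cudaSetDevice(0); cudaFree(d0);
    cudaSetDevice(1); cudaFree(d1);
    free(host);
    return 0;
}

The round trip through host memory is why I expect task-level parallelism and pipelining, which keep data resident on one GPU, to fare better than anything that shuffles data back and forth.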
--
The grand purchase will be delayed one more week, but no longer. There are strategic goals behind this.