Programming The Cell Processor

Part 2: SPE Programming

SPE Programming

The PPE is a general purpose processor so programming it is much the same as programming any other processor.  The big difference is in programming the SPEs, programming these is similar to programming SSE or Altivec but with a few differences.

The SPEs do not include Out-of-order execution, branch prediction or caches so it is left to the programmer to effectively do those functions in code, this may sound like a pain and will indeed require more work but it is a trade off which gives you more computing power.  This trade off means the cores are small and consume roughly 1/10 of the power of a conventional processor. This is why the Cell can cram in 8 SPEs per chip in addition to the PPE.

Managing Local Stores

Each SPE contains a small fast memory called a “local store”.  The local stores are not the same as conventional CPU caches because a cache is pretty much invisible to programmers and it does it’s thing in the background, you do not normally use the cache directly unless you make use of cache pre-fetch commands.

The local store is visible to Cell developers and you have no choice but to deal with it (at least until advanced compilers come along which can hide it).  The SPEs use the local store for holding their data and programs, as they cannot directly access main memory.  The SPEs can however move data to or from the rest of the system via DMA (Direct Memory Access) commands which usually move 16 bytes (or multiples thereof) up to 16 Kilobytes in one go.  DMA commands can move less (as little as 1 byte) but this is best avoided as it’s very inefficient.

Manual control of the local store may sound like a royal pain but it’s very useful since you can use it at times to completely hide memory access, something conventional processors have difficulty with even with large caches and aggressive pre-fetching engines.  If you can mange to hide memory accesses the SPE will keep processing data without interruption.  For example: by hiding 99% of the memory access, IBM managed to get a 50x speedup compared to a 2GHz G5 in their terrain renderer.

You hide the memory access by using the common techniques of double or multi-buffering.  You DMA in a block of data (lets call it A) and start processing it, while data A is being processed you DMA in a second block (B) of data.  Once data A is processed the SPE starts processing block B, while this is happening the results of A are DMA’d back to memory and a third block of data, C is read into where block A was sitting.  Once B is completed the SPE starts processing C and the process repeats.  Multi-buffering is much the same as double buffering but you use more than 2 buffers.

Double buffering is a very common technique so there’s nothing magic about it.  It is not typically used to read data into traditional processors though as there are no buffers to read data into, you can try using cache but it’s not quite the same thing as it’s not addressable and can be flushed.

It may sound like it will only work with array type of data but it can also be used on certain types of tree structures so it’s not as limited as it might seem.

DMA Lists

The SPEs can read both structured and unstructured data data into the local stores.  The way this is done is through a DMA list.  The programmer specifies what data is required from where in a list and requests the contents of the list in one go.  If you are using double buffering this can all be done in the background while something else is being processed and thus does not hold up processing.  When the data is required it’ll all be sitting in the local store ready for use.

A program for a traditional processor assumes data is available as it needs it, this is not the case as the processor will need to get it from memory.  Prefetching and out of order execution / speculation are used to guess which data is required next and move it into cache, the idea is to fetch data before it is needed to prevent processor stalls while waiting for data.  These techniques are limited in what can be fetched, require complex and power hungry hardware and can end up reading the wrong data.

Doing pre-loads in software again may seem like extra work but all it really involves is working out what is needed next in advance and moving the fetch to earlier in the program.  Doing this in software is more efficient than doing it in hardware as it allows better use of the memory bandwidth, this allows potential performance gains.

SPE Optimisation

Vectorisation is the most obvious method of boosting performance and this is why SPEs handle vectors natively.  There are a number of programming techniques also used to get more performance out of SPEs.  The SPE’s job will be to handle the heavy processing and most of this will involve intensive loop processing, thus most of these techniques apply to loops and specifically, to removing branches from them.

Branches are liberally spread throughout software but they have the disadvantage that they can stall the processor.  Most modern microprocessors include branch prediction hardware so the penalty associated with branches can be minimised.  The Cell includes branch prediction in the PPE but the SPEs do not include this hardware.  The way to get around missed branches in the SPEs is to either use branch hints or remove the branches altogether.  There are a number of methods by which you can do this.

Function Inlining

Function calls are often made in loops and each of these calls and returns imposes a time penalty.

If you are calling a small function regularly in the loop, you will pay the call penalty every time you call that function.  Inlining is the process whereby you move the code from the external function and place it into the loop.  Doing this means there is no time penalty when using that code.

Loop Unrolling

At the end of a loop a test is done and a branch taken or not depending on the outcome, this happens every single time the loop is iterated.  The aim of loop unrolling is to reduce the number of times those instructions are executed and more importantly remove a potentially costly branch.

Unrolling a loop involves duplicating the functionality of a loop so for each iteration it now executes twice.  There is no test done at the end of the first half so you have removed those 2 instructions for half the total iterations.  If the original loop had 10 instructions and executes 1000 times the total number of instructions computed is 10 000 instructions, Unrolling once gives 18 instructions executed 500 times or 9 000 total instructions.  That’s a 10% reduction in instructions for a single loop unroll.

Doing the test will be dependant on the result of a previous instruction and this could potentially cause a stall while that instruction is executed.  Unrolling the loop also thus removes half of these stalls.

Loop unrolling doesn’t have to be done once, it can be done many times if you wish but you need to be careful you don’t run out of registers or make the code too big.  IBM’s Cell simulator gives you these details so you can monitor them.

Software Pipelining

This is a technique used to avoid stalls caused by instruction dependancies.  Software piplelining works by interleaving non dependant instructions.

Every instruction executed must go through the pipeline and this takes a number of cycles, if one instruction (B) is dependant on data from the other (A) it will stall until that data becomes available.  The idea of software piplelining is to reorder instructions in such a way that while B is waiting on A to finish, other instructions are issued to the execution units keeping them busy.

Software pipelining works well with unrolled loops because you can interleave the loops together.  If you have unrolled the loops a number of times, interleaving them can potentially remove all the stalls and your loop will run at it’s maximum speed.

Predication

Predication is the process of choosing something based on the value of something else.  This is used in the SPEs as a way of removing some types of branches.

You may have a simple if-else construct which does a simple operation depending on the value of a variable.

if (A > B) C = C + 1;

else C = C + 2;

When compiled the code will include a branch, the select instruction can be used to avoid branches between a pair of simple choices.  In this case the can be avoided by executing both sides of the branch then selecting the correct result.  This may seem to be a slow way of doing things but in this case doing both is actually faster than missing the branch.

Hinting Branches

A branch hint allows the developer to indicate if a branch is likely to be taken.  If the hint is correct there is no penalty when the branch is executed.

The predication example above uses select instruction to remove the branch.  If the code conditionally executed is more complex predication is not be a good technique to use as the cost is greater than the branch, an alternative to this is to hint what the outcome of the test is.

Hinting can also be a dynamic process so, for example, you can use a branch hint in a loop.  You first indicate the loop will always iterate again.  At the end of the second to last iteration the hint is switched so it hints the loop is not iterated again.  The loop will not need to iterate again after the final iteration so the hint will thus be correct 100% of the time.

Data Structure Selection

Cell will perform very differently depending of the data structures you use so you need to be careful when selecting what structures you use.

Ideally you want structures which holds data in chunks which can be moved into an SPE’s local store.  Arrays are an obvious example but some sorts of tree structures can also work like this.  A big array of vectors is a perfect match for an SPE.

Structures which hold data irregularly scattered around memory are less ideal but these can be handled by issuing a list of DMA commands.

Structures which involve following a pointer randomly around memory should be avoided, reading through these will involve suffering the full memory latency every time you read an element.  Fragmented linked lists are pretty much the enemy, they are not an ideal structure for using with the Cell (or any other processor for that matter).  This is not to say you should never use them, but you should be aware of their limitations and avoid their use in compute sensitive code.

Searching through a series of arrays may be considerably faster than following a few links in a tree.  This may seem counter intuitive but if the nodes in a list or tree are small the potential difference between the read speed of these trees and read speed of arrays may be many thousand fold.

There are however techniques which can be used to alleviate problems with awkward data structures.

One of the Cell programming models allows you to run multiple threads in a single SPE simultaneously, if one thread is waiting for data the SPE switches to a different thread.  This is similar to the way the UltraSPARC T1 deals with latency and it works very effectively.

If the data is read repetitively in a manner difficult to predict one potential solution is to use a software cache. IBM have demonstrated this in a real time ray tracing application.

Another method is to use hybrid structures, instead of a or tree with an element per node, put a small array at each node, it involves adding code to deal with the arrays but will cut down on the number of latency hits you’ll suffer.

Another composite data structure could be the same in reverse.  You could use an array as an index to another data structure, searching through this array would be very, very fast and you could avoid a lot of memory hits this way.  If the index array is sufficiently small it could be held in an SPE’s local store, if so searching would be very fast indeed.

Even linked lists can be sped up.  If you rebuild the list and ensure the elements are close in memory, speculatively reading data should reduce any negative performance impact.  Rebuilding data structures can also be used to reduce computation.  If you have multiple threads accessing the structure you will need to prevent access during reconstruction to ensure threads don’t end up reading bad data.

The ultimate way to deal with a problematic data structure is to change it.  This may involve holding data in a way which appears inefficient and may involve the writing of extra code to periodically reorder data but the gain is extra performance, potentially a lot of extra performance.

And If All Else Fails...

If all else fails and you are stuck with branchy code or a data structure the SPE doesn’t like, the best idea is to give that section of code to the PPE.  The PPE was designed to handle less compute intensive conventional code which is often full of branches and all manner of data structures.

--

Programming The Cell Processor

Part 1: What You Need To Know

Part 2: SPE programming

Part 3: Programming Models

Other Articles

© Nicholas Blachford 2006