Programming The Cell Processor

Part 1: What You Need to Know

Introduction

The Cell processor has seemingly picked up a reputation for being incredibly difficult to program.  However, this seems to be based entirely on expectation rather than experience.  What’s the real story?  Is it really that difficult?  This article looks at the concepts involved in Cell programming.

While many say Cell is difficult to develop for, I’ve yet to hear an actual Cell developer describe it as “difficult”.  The terms used tend to be along the lines of “more involved” or “requires a different way of thinking”.

I myself am guilty of prejudging it.  In my first coverage of the Cell processor I said I expected the Cell would be a beast to program.  This however was based on reading the patent and on the prospect of programming the raw hardware, something very few people will ever do directly.  For most developers, compilers, operating systems, libraries and middleware will take care of a lot of the fine detail.  As this is a new architecture though, much of this is still in development, so for now you will have to do at least some of the hard work yourself.

It is important to note that while Cell offers huge potential computing performance, it doesn’t come free.  Don’t expect to get a magic speedup from existing legacy code.  If you just recompile code optimised for the PC it will only run on the PPE (Power Processor Element) and may actually run slower.

Cell is optimised for certain types of code and not everything will be able to take advantage of it immediately.  If your code does not run well on Cell you may need to look at changing the algorithms and / or data structures it uses.  Some applications may seem ill suited to Cell but this may not be the case: it may only be the implementation which is ill suited, and a different implementation may work very well.

Control And Compute

The vast majority of code running on a Cell will be “control” code running on the PPE.  This is programmed in exactly the same way as a desktop processor; in fact it uses the same instruction set as the PowerPC 970 (aka G5) found in PowerMacs.

To save space and power the PPE hardware was made simpler than most desktop processors, so there will be some gotchas when developing for it.  However, provided you avoid certain types of algorithms or data structures and are careful with compiler settings it should be fine.  Just don’t expect it to outgun a top end AMD / Intel chip.  The PPE usually isn’t so important for performance though, because the performance sensitive code will mostly run on the SPEs (Synergistic Processor Elements).

SPE development will be more complex but performance sensitive code is usually only a tiny percentage of the code which runs.  A lot of the code in an application is just glue code tying things together.  An example of this is GUI code, which mostly consists of calls to perform a function when you click on a gadget.

SPE programming is more involved because unlike most desktop processors the SPEs do not have things like Out-Of-Order execution, branch predictors and caches.  When you’re programming the SPEs you have full manual control, like it or not.  You’ll be glad to know you will not have to jump into assembly to program the SPEs: you can use C (with other languages following).  Compilers will eventually take care of a lot of the manual control for you but they’re still a work in progress.

Higher level languages (e.g. Perl, Python, Ruby etc.) will already run on the PPE as it’s a standard PowerPC, but will not take advantage of the SPEs until support is added.  You can be forgiven if you get the impression these languages are not suited to the SPEs but this is not the case.  Many of the functions these languages perform (e.g. text processing) could potentially run very fast on the SPEs.  An SPE based XML parser already exists [XML] and Java is said to be working well.

What You Need To Know

The SPEs are mainly designed as parallel vector / SIMD processors, so to make proper use of them you need to understand and use certain programming techniques.  Multithreading is used to spread a problem across multiple processors, and vector programming is used to make use of the Cell’s SIMD capabilities.

Multithreading

A Cell contains 8 SPEs (on the PS3 only 7 are enabled, and one of those is used by the OS).  To use them all you are going to have to learn about multithreading.  Multithreading is a technique by which you break processing into a series of sub programs (threads) which work independently of each other.

Multithreading requires planning as it involves changing the architecture of an application.  Applications break up in different ways: the threads can be anything from completely independent to needing extensive data sharing.  Some multithreaded applications are time dependent, so their threads will need synchronisation.

You also need to know what each thread depends on and where data is going to be shared.  If something is shared you may need to put a lock on it so only one thread can access it at a time.
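
As a rough illustration of the locking described above, here is a minimal sketch using POSIX threads on an ordinary processor (it is not SPE code).  Each thread sums its own slice of an array independently and only takes the lock for the brief moment it updates the shared total.  The names shared_total, worker_args, sum_worker and parallel_sum, and the limit of 16 workers, are all made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_WORKERS = 16 };   /* arbitrary limit chosen for this sketch */

/* Hypothetical shared state: one running total guarded by one lock. */
typedef struct {
    pthread_mutex_t lock;
    long total;
} shared_total;

/* Hypothetical per-thread arguments: the slice this thread owns. */
typedef struct {
    shared_total *shared;
    const int *data;
    size_t count;
} worker_args;

static void *sum_worker(void *p)
{
    worker_args *args = p;
    long local = 0;

    /* Independent work: no other thread touches this slice or 'local'. */
    for (size_t i = 0; i < args->count; i++)
        local += args->data[i];

    /* Shared data: only one thread at a time may update the total. */
    pthread_mutex_lock(&args->shared->lock);
    args->shared->total += local;
    pthread_mutex_unlock(&args->shared->lock);
    return NULL;
}

/* Sum data[0..n) using 'workers' threads (1 to MAX_WORKERS). */
long parallel_sum(const int *data, size_t n, int workers)
{
    assert(data != NULL && workers >= 1 && workers <= MAX_WORKERS);

    shared_total shared;
    pthread_t tid[MAX_WORKERS];
    worker_args args[MAX_WORKERS];
    bool started[MAX_WORKERS];
    size_t chunk = n / (size_t)workers;

    shared.total = 0;
    pthread_mutex_init(&shared.lock, NULL);

    for (int w = 0; w < workers; w++) {
        size_t first = (size_t)w * chunk;
        args[w].shared = &shared;
        args[w].data = data + first;
        /* The last thread also takes any leftover elements. */
        args[w].count = (w == workers - 1) ? n - first : chunk;
        started[w] = pthread_create(&tid[w], NULL, sum_worker, &args[w]) == 0;
        if (!started[w])
            sum_worker(&args[w]);   /* fall back to the calling thread */
    }
    for (int w = 0; w < workers; w++)
        if (started[w])
            pthread_join(tid[w], NULL);

    pthread_mutex_destroy(&shared.lock);
    return shared.total;
}
```

Keeping the per-thread work in a local variable and locking only for the final update means the threads spend very little time waiting on each other, which is the general aim whenever data has to be shared.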

The SPEs and PPE are designed to run different types of code so you will need to keep this in mind when breaking up operations into threads.  You obviously don’t want to run PPE suited code on the SPEs.

In the case of Cell the hardware supports 2 threads on the PPE and one thread per SPE, giving you a total of 10 hardware threads.  Software threads are not limited so you can have as many of these as you wish.  Compiler support for software threads on the SPEs is also in development.

There’s a lot involved in multithreaded programming and while it may seem complex it’s neither a new nor an obscure technique.  In fact you’re probably not going to be able to avoid it as all processor vendors are now going multicore and / or multithreaded.  Getting the maximum performance out of any of these processors will require multithreaded programming.

Communication

When the threads are running on the SPEs they all need to communicate with each other.  In a traditional multiprocessor system communication is usually done via shared memory.  Cell can use traditional techniques but also has additional communication methods.

Mailboxes and signals are methods of sending small messages or notifications between different parts of the chip.  If you want an SPE to wait for an event you can set it to wait on a mailbox or signal, which will be useful for real-time processing.

Data can also be DMA’d directly between SPEs but this technique is more likely to be used for moving bulk data than for sending messages.

The details of message passing will most likely be hidden by the threading model provided, so you won’t have to get into them unless you need to.

Vector Programming

Vector (or SIMD) programming is applicable to a very wide range of problems, not just media codecs and the like.  Searching, pattern matching, networking and even OS functions such as memory operations can be accelerated.  The performance difference is not small either: AltiVec programmers regularly talk of factors of 3x to 16x better performance in functions they’ve modified.

The SPEs have a reputation of being weird DSP type things which can only handle vector code.  This is not true as they are also quite capable of handling scalar (read: normal) code.  It’s better to use vector code however because processing 4 (or more) pieces of data at a time is obviously going to be faster.

SPEs can be programmed in assembly if you feel adventurous enough but it’s not necessary, as it is normal to program the SPEs in C.  C doesn’t have direct support for the SPE instruction set so you use a series of additional commands called “intrinsics”.

To vectorise an algorithm it will need to be restructured so it can deal with multiple elements in one go.  

This can be pretty easy:

for (count = 0; count < 100; count++)
{
    C[count] = A[count] + B[count];
}

Vectorises as:

/* VA, VB and VC hold the same 100 values as 25 vectors of 4 elements each */
for (count = 0; count < 25; count++)
{
    VC[count] = spu_add(VA[count], VB[count]);
}

The “spu_add” function is the intrinsic you use for add on an SPE.
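
If you don’t have an SPE compiler to hand, the same restructuring can be tried out in plain C.  The sketch below is an emulation rather than SPE code: vec4f is a made-up stand-in for a four-element vector and vec4f_add is a made-up stand-in for spu_add.  It also shows one detail the loop above glosses over, namely what to do when the number of elements is not a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector holding four floats. */
typedef struct { float lane[4]; } vec4f;

/* Hypothetical stand-in for spu_add: adds all four lanes in one call. */
static vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* c[i] = a[i] + b[i] for i in [0, n), handling four elements per step. */
void add_arrays(float *c, const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    size_t i = 0;

    /* Main loop: one "vector" add covers four scalar adds. */
    for (; i + 4 <= n; i += 4) {
        vec4f va, vb, vc;
        memcpy(va.lane, a + i, sizeof va.lane);
        memcpy(vb.lane, b + i, sizeof vb.lane);
        vc = vec4f_add(va, vb);
        memcpy(c + i, vc.lane, sizeof vc.lane);
    }

    /* Tail: up to three leftover elements are handled one at a time. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With 100 elements the main loop runs 25 times and the tail loop never runs, which matches the vectorised loop above.  On a real SPE the four lanes are added by a single instruction rather than by the inner loop used here.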

For vector processing data should be aligned properly and have a regular access pattern.  An array is obviously ideal (“vector processing” literally means “array processing”).

If your data is in an array but not properly aligned you can use the shuffle instruction to fix this.  If you want, shuffle commands allow you to manipulate two unaligned arrays and write back the results to another unaligned array.  Only a few shuffle commands are necessary per iteration to do this.  Obviously it’s better to use aligned data as you don’t need the extra operations, but it’s useful nevertheless.
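
To make the idea concrete, here is a sketch in plain C that imitates a byte shuffle and uses it to read 16 bytes from an unaligned position using only aligned 16-byte blocks.  It is not SPE code: the quad type and the shuffle_bytes and load_unaligned functions are made up for this example, and the model only covers selecting bytes from the two inputs, leaving out any special cases the real instruction has.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 16-byte vector register. */
typedef struct { uint8_t byte[16]; } quad;

/* Simplified model of a byte shuffle: each pattern byte picks one byte
   out of the 32 bytes formed by a followed by b (0-15 = a, 16-31 = b). */
static quad shuffle_bytes(quad a, quad b, quad pattern)
{
    quad r;
    for (int i = 0; i < 16; i++) {
        unsigned sel = pattern.byte[i] & 0x1Fu;
        r.byte[i] = (sel < 16) ? a.byte[sel] : b.byte[sel - 16];
    }
    return r;
}

/* Read 16 bytes starting at base + offset, touching memory only in
   16-byte blocks counted from base.  The buffer must extend at least
   one full block past the block that contains 'offset'. */
quad load_unaligned(const uint8_t *base, size_t offset)
{
    assert(base != NULL);

    size_t block = offset / 16;               /* block holding the first byte */
    unsigned shift = (unsigned)(offset % 16); /* misalignment within that block */
    quad lo, hi, pattern;

    /* Two "aligned loads": the block we start in and the one after it. */
    memcpy(lo.byte, base + 16 * block, 16);
    memcpy(hi.byte, base + 16 * block + 16, 16);

    /* Pattern shift, shift+1, ... selects the 16 bytes that straddle
       the two blocks; the largest index is 15 + 15 = 30. */
    for (int i = 0; i < 16; i++)
        pattern.byte[i] = (uint8_t)(shift + (unsigned)i);

    return shuffle_bytes(lo, hi, pattern);
}
```

Each unaligned read costs two block loads and one shuffle here.  In a loop over an array the second block of one iteration can be reused as the first block of the next, which is why only a few extra operations per iteration are needed.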

If you’ve ever programmed MMX / SSE or AltiVec you’ll immediately feel right at home working on an SPE as you use many of the same techniques.  AltiVec in particular is very similar.  It should be noted however that the SPE instruction set and AltiVec, while very similar, are not identical.  There are different instructions and different data types, the SPEs have more registers and of course the SPEs have a local store.


© Nicholas Blachford 2006