Programming The Cell Processor

Part 3: Programming Models

Cell Programming Models

There are different ways of spreading tasks out across the PPE and SPEs.  Different applications have different computation properties and so may suit different models.

Normally you write a program, compile it and run it.  Cell offers a range of programming models (i.e. ways a program can be organised), so you have many options open to you when architecting your application.  The following describes some of the programming models available for Cell:

Some applications use so little code and data that both can fit in a single SPE.

Others are a bit bigger, so data and / or code is stored in main RAM and DMA’d into the local store as necessary (see the sketch below).
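
A minimal sketch of the SPE side of this, assuming the SDK’s spu_mfcio.h interface (mfc_get and the tag-status calls); the buffer size, the function name and the idea of the PPE passing in the effective address are purely illustrative.

#include <spu_mfcio.h>

#define BLOCK_SIZE 16384                  /* bytes; DMA sizes must be multiples of 16 */
static char buffer[BLOCK_SIZE] __attribute__((aligned(128)));

void fetch_block(unsigned long long ea_in)  /* ea_in: effective address from the PPE */
{
    const unsigned int tag = 0;

    /* Queue a DMA transfer from main RAM into local store. */
    mfc_get(buffer, ea_in, BLOCK_SIZE, tag, 0, 0);

    /* Block until the tagged transfer has completed before using the data. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* ... process buffer ... */
}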

Existing applications will likely use the “off load” model, where the main application runs on the PPE while the compute-intensive parts utilise the SPEs.  An example of this would be a video editing application which has the GUI and control functions running on the PPE while the video processing and encoders / decoders run on the SPEs; a sketch of the PPE side follows.
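
The PPE side of the off-load model might look something like the following sketch.  It assumes the 2006-era libspe interface (spe_create_thread and spe_wait), and video_filter_spu is a hypothetical embedded SPE program; treat this as an outline rather than a recipe.

#include <libspe.h>
#include <stdio.h>

extern spe_program_handle_t video_filter_spu;   /* hypothetical embedded SPE program */

int run_filter_on_spe(void *frame)
{
    int status = 0;

    /* Hand the compute-heavy work to an SPE, passing a pointer to the frame. */
    speid_t spe = spe_create_thread(0, &video_filter_spu, frame, NULL, -1, 0);
    if (spe == NULL) {
        perror("spe_create_thread");
        return -1;
    }

    /* The PPE is free to run the GUI and control code here;
     * for simplicity this sketch just waits for the SPE to finish. */
    spe_wait(spe, &status, 0);
    return status;
}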

Other apps may not have much in the way of control functions, so will mostly run on the SPEs with the PPE having only a very small role.  In a distributed computing project such as BOINC, for example, all the processing would run on the SPEs while the main controlling app, which only monitors progress and handles communications, would run on the PPE.

One model involves the PPE very little indeed: the job queue model sets up a queue of tasks and each SPE takes a task as it becomes free.  This model is useful where jobs vary a lot in size; at least one Playstation 3 game uses this model.

A variation on job queues keeps the program the same but changes the data.  The data to be processed is held in memory and each SPE takes a part of it, processes it and writes the results back; when an SPE finishes one chunk of data it picks up the next free one and works on it.  The sketch below illustrates the idea.
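
As a rough illustration of this model (and of job queues generally), here is a minimal sketch in plain C.  The chunk sizes, array names and process_chunk function are all illustrative, and the GCC __sync_fetch_and_add builtin stands in for the atomic update a real SPE would perform through its MFC.

#include <stddef.h>

#define NUM_CHUNKS 256
#define CHUNK_SIZE 4096

static float input[NUM_CHUNKS][CHUNK_SIZE];    /* data to be processed */
static float output[NUM_CHUNKS][CHUNK_SIZE];   /* results written back */

static volatile unsigned int next_chunk = 0;   /* shared work counter */

static void process_chunk(const float *in, float *out)
{
    for (size_t i = 0; i < CHUNK_SIZE; i++)
        out[i] = in[i] * 2.0f;                 /* stand-in for the real work */
}

/* Each worker (an SPE in practice) keeps claiming the next free chunk
 * until none are left. */
void worker(void)
{
    for (;;) {
        unsigned int c = __sync_fetch_and_add(&next_chunk, 1);
        if (c >= NUM_CHUNKS)
            break;
        process_chunk(input[c], output[c]);
    }
}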

Stream processing involves setting up a number of SPEs in a chain.  Each takes some data in, processes it, then passes it along to the next SPE in the chain.  This is useful for tasks which can be broken into equal-sized chunks and processed in series.  An advantage of this model is that only the first and last SPEs communicate with memory, so memory bandwidth usage is minimised.  This model is possible because DMA transfers can be done directly, local store to local store, between different SPEs.  These transfers use the very high speed (200 Gigabytes per second) internal bus and do not generate any memory traffic.  Streaming can also involve the PPE by locking parts of its cache and streaming to or from them.  A sketch of one stage in such a chain follows.
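
A middle stage in such a chain might look roughly like this, again assuming spu_mfcio.h.  The effective address of the next SPE’s input buffer (next_stage_ea) is assumed to be obtained and passed in by the PPE, and the synchronisation between stages is omitted for brevity.

#include <spu_mfcio.h>

#define BUF_SIZE 4096
static char in_buf[BUF_SIZE]  __attribute__((aligned(128)));
static char out_buf[BUF_SIZE] __attribute__((aligned(128)));

void stream_stage(unsigned long long next_stage_ea, int blocks)
{
    const unsigned int tag = 0;

    for (int b = 0; b < blocks; b++) {
        /* ... wait here until the previous stage has filled in_buf ... */

        /* This stage's share of the algorithm. */
        for (int i = 0; i < BUF_SIZE; i++)
            out_buf[i] = in_buf[i] + 1;        /* stand-in for the real work */

        /* Push the result straight into the next SPE's local store:
         * local-store-to-local-store DMA, no main memory traffic. */
        mfc_put(out_buf, next_stage_ea, BUF_SIZE, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }
}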

While the SPEs do not support multiple hardware threads they can still utilise software threads: multiple threads are held in the local store and the SPE switches between them when a thread stalls on memory access.  This model is useful for running small threads with unpredictable memory access patterns on the SPEs; a lot of commercial applications are like this.  A conceptual sketch follows.
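
The following is only a conceptual sketch of such a cooperative scheduler, in plain C; the task structure and the convention that a task’s step function returns true when finished and false when it would stall are illustrative, not part of any Cell API.

#include <stdbool.h>

#define NUM_TASKS 4

typedef struct {
    bool (*step)(void *state);   /* run until finished (true) or stalled (false) */
    void *state;
    bool  done;
} task_t;

void run_tasks(task_t tasks[NUM_TASKS])
{
    int remaining = NUM_TASKS;

    while (remaining > 0) {
        for (int t = 0; t < NUM_TASKS; t++) {
            if (tasks[t].done)
                continue;
            /* Run the task; if it reports it is waiting on a memory
             * transfer, switch to the next task instead of stalling. */
            if (tasks[t].step(tasks[t].state)) {
                tasks[t].done = true;
                remaining--;
            }
        }
    }
}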

Orchestrating Data Flow

In order to get the maximum performance, data will need to flow through the processor much as water flows along a river.  This is an unusual way of thinking about programming, as we normally think in terms of program flow, but in many cases (e.g. media apps) data does exactly this.

The best illustration of this is the stream processing model.  Data is read from memory into an SPE, processed, and then sent on to the next SPE; this repeats through a number of other SPEs, each processing a different stage of the algorithm.  At the end the results are passed back to memory.

In this model data can be thought of as flowing from memory through the SPEs and then back to memory.  Obviously data will flow in different ways in the different models.

If data does not flow properly, the performance of the application in question will not be as good as it can be.  If one SPE takes longer to do its task than the others, the others will stall while waiting for it to catch up.  If the amount of data DMA’d in is too small to keep the SPEs busy, again there could be stalls.

The Cell Development Process

There’s a lot involved in getting a Cell application working properly, so a development flow has been devised to help you along the way.

Study Algorithm

The first thing is to understand what you are working on: will the algorithm in question work well on the PPE or the SPEs?  Can it be broken into threads?  Is the algorithm Cell friendly?  If not, are there others that can be used instead?

Experimental partitioning and mapping to Cell

Much of an application is control code which doesn’t do much processing; this should run on the PPE.  The compute-intensive parts should be broken up and executed on the SPEs.  You don’t do that at this stage, but you do need to decide what will be running on the PPE and what will be using the SPEs.

Data layout and flow analysis

In order for the Cell to work at its best, data needs to be read, processed and the results written as fast as possible.  This may involve changing the format or layout of the data prior to processing (an example follows below).  Once you have decided on the layout you need to see if it will flow around the processor properly; you also need to find out if there will be any bottlenecks and alter things accordingly.
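
A typical layout change is converting an array of structures into a structure of arrays, so each field becomes a contiguous, easily vectorised and DMA-friendly stream; the point structure below is purely illustrative.

#define NUM_POINTS 1024

/* Array of structures: x, y and z are interleaved in memory,
 * which is awkward for 128-bit SIMD loads. */
struct point { float x, y, z; };
static struct point points_aos[NUM_POINTS];

/* Structure of arrays: each field is contiguous and aligned,
 * ready for quadword loads and clean DMA transfers. */
static struct {
    float x[NUM_POINTS] __attribute__((aligned(16)));
    float y[NUM_POINTS] __attribute__((aligned(16)));
    float z[NUM_POINTS] __attribute__((aligned(16)));
} points_soa;

void aos_to_soa(void)
{
    for (int i = 0; i < NUM_POINTS; i++) {
        points_soa.x[i] = points_aos[i].x;
        points_soa.y[i] = points_aos[i].y;
        points_soa.z[i] = points_aos[i].z;
    }
}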

Write code for PPE

Once you have decided how things are going to work you sit down and write your program.  This is done entirely on the PPE at first, using scalar (non-vector) code, so you can verify the program works properly.  At this stage the code should be fully functional but only running on the PPE; it should, however, be partitioned so that moving parts to the SPEs is easy.  A scalar reference kernel might look like the sketch below.
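
For example, a scalar reference kernel (here a hypothetical y = a*x + y loop) might start out like this on the PPE; it will be vectorised for the SPEs later in the flow.

/* Scalar reference version, written and verified on the PPE first. */
void saxpy_scalar(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}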

Break SPE code off

Once the code is working properly you move the parts intended for the SPEs onto the SPEs.  This stage is about getting the PPE, the SPEs and all the communication between them in place and working correctly; a sketch of a simple mailbox handshake follows.
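
One simple way to handle the communication is a mailbox handshake.  The sketch below shows the SPE side, assuming the spu_mfcio.h mailbox calls; the message convention (an effective address split into two words, with 0/0 meaning “shut down”) is an assumption made for illustration.

#include <spu_mfcio.h>

void spe_main_loop(void)
{
    for (;;) {
        /* Block until the PPE posts a message: here the high and low
         * words of a 64-bit effective address, 0/0 meaning "shut down". */
        unsigned int ea_hi = spu_read_in_mbox();
        unsigned int ea_lo = spu_read_in_mbox();
        if (ea_hi == 0 && ea_lo == 0)
            break;

        unsigned long long ea = ((unsigned long long)ea_hi << 32) | ea_lo;
        /* ... DMA the work descriptor in from "ea" and process it ... */
        (void)ea;

        /* Tell the PPE this piece of work is finished. */
        spu_write_out_mbox(1);
    }
}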

Vectorise SPE code

Only after you have verified that everything is working, fully utilising the PPE and SPEs, do you vectorise your code.  The earlier scalar kernel might become something like the sketch below.
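
The scalar kernel from the earlier step might be vectorised for the SPE roughly as follows, assuming spu_intrinsics.h; for simplicity it assumes n is a multiple of 4 and the arrays are 16-byte aligned.

#include <spu_intrinsics.h>

void saxpy_spu(int n, float a, const float *x, float *y)
{
    const vector float *vx = (const vector float *)x;
    vector float       *vy = (vector float *)y;
    vector float        va = spu_splats(a);    /* a replicated into all four slots */

    for (int i = 0; i < n / 4; i++)
        vy[i] = spu_madd(va, vx[i], vy[i]);    /* four multiply-adds per iteration */
}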

Balance computation / data flow

Once everything is working and vectorisation is complete, you will need to check how data is flowing around the system.  Computation and data flow should be in balance, otherwise the application will not execute as fast as it can.  If the application is data bound it will do nothing while waiting for data; if it is compute bound the opposite occurs and memory bandwidth goes underused.  It should be noted, however, that efficiency is not the goal here, only performance.  Some applications are by their nature compute or data bound and there’s no point trying to change this if it slows them down.  Double buffering, sketched below, is a common way to keep computation and transfers overlapping.
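
A common way to keep computation and data transfer overlapping is double buffering: while one buffer is being processed, the next block is already being fetched.  The sketch below assumes spu_mfcio.h; the sizes and names are illustrative and the write-back of results is omitted.

#include <spu_mfcio.h>

#define BLOCK 16384
static char buf[2][BLOCK] __attribute__((aligned(128)));

static void process(char *data, int size)
{
    for (int i = 0; i < size; i++)
        data[i] += 1;                          /* stand-in for the real work */
}

void double_buffered(unsigned long long ea, int blocks)
{
    int cur = 0;

    /* Prime the pipeline: start fetching the first block. */
    mfc_get(buf[cur], ea, BLOCK, cur, 0, 0);

    for (int b = 0; b < blocks; b++) {
        int nxt = cur ^ 1;

        /* Kick off the next transfer before touching the current data. */
        if (b + 1 < blocks)
            mfc_get(buf[nxt], ea + (unsigned long long)(b + 1) * BLOCK,
                    BLOCK, nxt, 0, 0);

        /* Wait only for the buffer we are about to use (tags 0 and 1). */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process(buf[cur], BLOCK);
        cur = nxt;
    }
}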

Optimisation

Optimisation comes last, as it should.  This is where you unroll loops and pipeline your code (see the sketch below).  You can also look at vectorising some of the code on the PPE if possible; the PPE includes a VMX (aka Altivec) unit so you can use this if you wish.  You may also want to have a final check of your compute and data balance after doing this to ensure the optimisation hasn’t knocked things off balance.
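
As a small example of unrolling, the loop below does four independent multiply-adds per iteration, giving the compiler and the SPE’s pipelines more independent work to overlap; it assumes n is a multiple of 4.

void scale_add_unrolled(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i += 4) {
        /* Four independent operations per iteration help fill the pipelines. */
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
}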

The Ideal Solution Might Not Seem Ideal

It is important not to assume there is only one way to code a solution to a problem.  In the programming world there are usually multiple ways to express a solution, and not all of them will work well on Cell, so you should bear this in mind and be prepared to consider alternatives.  The solution you consider best may not be the best solution on Cell; in fact the best solution for Cell may be one you might not usually use, or even consider at all.

Conclusion: Is Cell Programming Difficult?

It’s pretty obvious that Cell programming is going to be a pretty involved process and there are a lot of things to be considered.  However, is it really that different from the way you might program other processors?

The part of an application running on the SPEs is likely to be only a small fraction of the whole, so you are not likely to be dealing with a large chunk of code.  Most of the code you write will be just normal code and will run on the PPE; this will be pretty much the same as programming any other system.  That said, as the initial Cell’s PPE is an in-order core and shares the memory system with the SPEs, you’ll probably want to be careful with anything compute-intensive.  Profiling for bottlenecks will of course be helpful.

The closest thing to programming the Cell today is programming Altivec on a top-end PowerMac.  The current top-end PowerMacs have 4 G5 cores, each of which has an Altivec / VMX engine, so you’ll be using the same multithreading and vector programming techniques.  Not only are the instruction sets very similar but you’ll use many of the same techniques (loop unrolling, pipelining, branch removal etc.).  The G5s do not have local stores, but Altivec programming uses cache control instructions to ensure good data flow, and these are analogous to the DMA commands on the Cell.  A small VMX kernel is sketched below for comparison.
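
For comparison, a VMX / Altivec version of a similar kernel might look like this (compile with -maltivec); the kernel itself and the assumption that n is a multiple of 4 and the arrays are 16-byte aligned are illustrative.

#include <altivec.h>

void scale_add_vmx(int n, const float *x, float *y)
{
    const vector float *vx  = (const vector float *)x;
    vector float       *vy  = (vector float *)y;
    const vector float  two = (vector float){2.0f, 2.0f, 2.0f, 2.0f};

    for (int i = 0; i < n / 4; i++)
        vy[i] = vec_madd(two, vx[i], vy[i]);   /* y = 2*x + y, four at a time */
}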

I think the ultimate answer to the question will be dependent on the individual programmer.

If you are used to programming in high level languages such as Perl or Python, this will all seem rather alien to you and you’ll be more likely to get the impression that Cell development will be difficult.

If you’ve developed optimised C or C++ you should have a better understanding of what’s involved and will recognise that extra work is involved.

If you’ve developed SSE and / or Altivec code in C and understand multithreading you’ll probably recognise almost everything here and be wondering what all the fuss is about.

If you’ve developed for the Playstation 2 or tried general purpose programming on GPUs you may be thinking Cell development looks easy!

Getting something running on Cell should be a pretty trivial process - if you don’t use the SPEs.  Getting the best out of it, however, is a different matter, as there will be a lot of work involved.  Most of what is involved is pretty straightforward, but there are a lot of things to be learned and considered which you don’t usually need to bother with.  Cell also offers new programming models, new thread communication techniques and DMA lists; all of these can be utilised to improve performance, but you will need to understand them in order to use them.

The part of Cell programming which looks hairiest is multithreading, which is already known to be difficult.  However, this is not a Cell-specific problem; with all vendors going multicore you’re most likely going to have to learn multithreading anyway.

The upside to all this is you can potentially get enormous speed ups for what is now a rapidly increasing range of applications.  Only you can judge if the extra work is worth it.


© Nicholas Blachford 2006