Cell Architecture Explained Version 2

Part 3: Programming The Cell

Developing for the Cell

While developing for the Cell may sound like it could be an exquisite form of torture, fortunately this is not going to be the case.  If you can understand multithreading, cache management and SSE / VMX / AltiVec [AltiVec/VMX] development it looks like you will have no problems with the Cell.

The primary language for developing on the Cell is expected to be C with normal thread synchronisation techniques used for controlling the execution on the different cores.  C++ is also supported to a degree and other languages are also in development (including apparently, Fortran).

Various systems are in development for controlling execution on the Cell so developers should have plenty of options.  This compares well with PS2 development, which was primarily done in assembly and was highly restrictive.

Task distribution to the SPEs can be handled by the OS, middleware, compiled into applications or if you're brave (mad?) enough you can drop into assembly and roll your own system.

One method described by Sony uses the SPEs to do programmer defined Jobs [Jobs].  Each job is put into a queue and when an SPE becomes free the next job in line is assigned to that SPE for execution.  In this scenario job assignment is controlled by the PPE, but other schemes have the SPEs running a tiny OS which allows each SPE to assign itself jobs.  The SPEs are then completely autonomous and operate with no guidance from the PPE at all.  The PPE just puts jobs into the job queue.  Jobs are self contained mini-programs: the SPE loads up the code, DMAs in the data and gets computing.
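The self-assigning job queue can be sketched on an ordinary machine with threads standing in for SPEs.  This is only an illustration of the scheme described above, not Sony's implementation: the names `run_jobs` and `spe_worker`, the `None` sentinel and the default of eight workers are all assumptions made for the example.

```python
import queue
import threading

def run_jobs(jobs, num_spes=8):
    """Run each (function, data) job on whichever simulated SPE is free."""
    job_queue = queue.Queue()
    results = {}
    lock = threading.Lock()

    def spe_worker(spe_id):
        # Each simulated SPE assigns itself the next job in line.
        while True:
            item = job_queue.get()
            if item is None:              # sentinel: no more work for this SPE
                break
            job_id, func, data = item
            value = func(data)            # "load the code, DMA in the data, compute"
            with lock:
                results[job_id] = (spe_id, value)

    workers = [threading.Thread(target=spe_worker, args=(i,))
               for i in range(num_spes)]
    for w in workers:
        w.start()
    # The "PPE" side does nothing except put jobs into the queue.
    for job_id, (func, data) in enumerate(jobs):
        job_queue.put((job_id, func, data))
    for _ in workers:
        job_queue.put(None)
    for w in workers:
        w.join()
    return results

jobs = [(sum, list(range(n))) for n in range(1, 21)]
results = run_jobs(jobs)
```

Note that the caller never says which worker runs which job; `results` simply records where each one happened to land.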

In all the above cases the SPEs are dynamically assigned, the developer does not need to worry about which SPE is doing what.

SPEs can multitask like normal CPUs and have running tasks switch in and out but this is not an entirely good idea as the context switch time is likely to be pretty horrendous - the entire state of the SPE needs to be saved, this includes not just the contents of the registers but also the entire local store.
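A rough back-of-the-envelope calculation shows why.  It uses only the figures given elsewhere in this article (128 registers of 128 bits each, a 256K local store, 128 byte transfers) and ignores any other state an SPE may hold, so treat it as a lower bound rather than a measured cost.

```python
REGISTERS = 128                   # 128 x 128 bit registers
REGISTER_BYTES = 128 // 8         # each register is 16 bytes
LOCAL_STORE_BYTES = 256 * 1024    # 256K local store
DMA_BLOCK = 128                   # bytes moved per transfer

register_state = REGISTERS * REGISTER_BYTES
full_state = register_state + LOCAL_STORE_BYTES
transfers_to_save = full_state // DMA_BLOCK

print(register_state, full_state, transfers_to_save)
# prints: 2048 264192 2064
```

The registers account for only 2K of the roughly 258K total, and a full switch has to move that amount twice: once to save the outgoing task and once to load the incoming one.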

Hello Tosh, gotta Toshiba?

Toshiba are developing software to run on Cells in their consumer goods.  They have talked about a “custom” OS (actually Linux) in which tasks are divided into SPE and PPE "modules".  Each SPE module is a sub-task which can operate using one or more SPEs depending on the compute power required, modules can also stream data to one another.

A complete task may consist of a number of modules which each do some processing then pass the results to the next module.  Toshiba have talked about a digital TV system which uses a set of SPE modules to decode digital TV signals.  To demonstrate they showed a single Cell (clock speed unknown) decoding 48 standard definition MPEG2 streams and scaling the results to a single HDTV screen.  The demo is said to run without frame drops and one of the SPEs was left idle while all this was going on.
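The shape of such a module chain can be sketched with generators, each one consuming the output of the previous stage as it is produced.  This is not Toshiba's API: the module names are hypothetical and the "decoding" and "scaling" are dummy arithmetic standing in for the real work.

```python
def decode_module(streams):
    """Stand-in for a decode module: yields one 'frame' per stream."""
    for stream_id, payload in streams:
        yield stream_id, [b * 2 for b in payload]    # fake "decoding"

def scale_module(frames, factor):
    """Stand-in for a scaling module: keeps every `factor`-th sample."""
    for stream_id, pixels in frames:
        yield stream_id, pixels[::factor]

def compose_module(frames):
    """Final module: gathers the scaled frames into one output 'screen'."""
    return {stream_id: pixels for stream_id, pixels in frames}

# 48 dummy streams, echoing the demo described above.
streams = [(i, list(range(8))) for i in range(48)]
screen = compose_module(scale_module(decode_module(streams), factor=4))
```

Because each stage only sees a stream of results from the one before it, a stage could in principle be spread over more SPEs without the others needing to know.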

Such a demo sounds fairly useless to most people but TV stations and media megalomaniacs usually pay a lot of money to do exactly this sort of thing.  That said, one of the demos at the PS3 launch show showed how this type of capability could be used on a TV to select which channel to watch.

Celling Penguins: Linux on Cell

Linux is already used in house by IBM as a “bring-up” platform for new hardware, so unsurprisingly Linux was one of the first operating systems (if not the first) to get up and running on the Cell.  In fact Linux support was added so early it was running on a simulated Cell before real hardware Cells had even been produced.

The fact Linux already runs on 64 bit PowerPC ISA based systems (PowerPC 970, POWER 4/5) is a big help as the PPE is compatible with these.

A release for Linux is expected at some point and full ISA details should be released then.  Experimental patches have already been posted for the Linux kernel under the name BPA (Broadband Processor Architecture) [BPA].  Cell support is also being added to the GCC compiler and GDB debugger [Dev].  In the meantime if you are lucky enough to have an NDA with one of the companies you can do a programming course and get access to Cell simulators until the real thing starts shipping.

The system currently being implemented into Linux treats the SPEs as a virtual file system [CellLinux] and you write data and programs into them via standard “write” operations.  There is also a Library interface which abstracts the interactions partially and this looks likely to be extended in the future.

Cell doesn’t appear to have any limitations regarding which Operating Systems can run, indeed the library abstraction has been designed to be portable.  Other Operating Systems than Linux are said to be running already, pretty much nothing has been said about what these others are though.

If you want to run different OSs or want to run multiple copies of one (useful for stability and OS development) Cell also supports a hypervisor allowing you to run several Operating Systems at the same time.  IBM has an open source hypervisor [Hyper] which was used in the validation of the Cell.

Converting Applications for Cell

As with any new architecture an application specifically designed for it will run best.  However this is not going to be an option in many cases, especially existing large applications.

The first step in getting an application running on the Cell is to port it to the PowerPC ISA.  Depending on the application this can be anything from pressing recompile to rewriting a whole heap of code.  Once the application is on PowerPC it should run on the Cell’s PPE without problem.

The next stage is to find out what should run on the SPEs.

SPEs are best suited to small repetitive tasks so pretty much any application should be able to make use of them.  However, profiling the code is important to find out what exactly gets used the most and this needs to be analysed to see if it is suitable for running on an SPE.

Large algorithms (>100KB compiled) or algorithms which jump randomly around memory accessing little pieces of data will likely be ill suited to running on the SPEs.  Pseudo-random accesses, however, are not a problem and can gain a large boost.  Vectorisable and/or parallelisable algorithms are good bets for targeting the SPE.

Once the code to be moved has been identified, it needs to be partitioned away from the rest of the code so it is self contained.  Only once it is fully self contained can it be moved over to the SPE.  If this code is already accessed as a plug-in or similar architecture this should be relatively easy.  If not, making the code self contained could require rather more work.

The initial port to the SPE is generally done as scalar (i.e. non-vector) code in order to get it up and running.  This will involve getting the synchronisation and communication code working.

Once it’s up and running the code can then be vectorised and the SIMD units used properly.  This isn’t the final stage however as the computations and data flow need to be balanced to make the most efficient use of the SPEs.  After this, other optimisations can then be investigated for inclusion.

There are numerous methodologies in development for Cell; one such development flow was presented at Power.org in Barcelona [CellDev].

Targeting SPEs

The SPEs will be considerably better at some things than others.  You probably don’t want to run the OS on one, but it appears that if you were sufficiently persistent you could get them to do pretty much anything you want.  I think the SPEs might be better described as “algorithm accelerators”.

The reason I say this is because while most code may be big and branchy, the stuff which actually does the majority of the work is small repetitive loops, exactly the sort of thing that SPEs should be good at.

Don’t believe me?  I’ll use my own desktop as an example:

I’m using a PowerBook running OS X Tiger, it’s currently running the Pages word processor, Safari (browser), Mail, Preview (Image / PDF viewer), Terminal and Desktop Manager.  I also regularly run various video players, iPhoto, Photoshop, SETI, Skype, OmniGraffle (diagramming) and Camino (browser).  E-UAE (Amiga emulator), GarageBand and even Xcode get fired up once in a while.  iTunes is pretty much permanently on (currently playing Garbage’s “Bleed like me”).  In addition to this lot the OS itself is doing umpteen jobs at any one time, I won’t bother listing them as it’ll just sound like a dodgy Apple ad...

It might surprise you to learn that almost everything in that list can be accelerated to some degree by the SPEs - even the OS itself can benefit in several areas.  Anything which uses graphics, video or audio is a good target as these work on chunks of data and are in many cases parallelisable and vectorisable.  Text rendering and anti-aliasing can be done in an SPE, as can searching.  Even if operations are not vectorisable, scalar operations run across different SPEs will still be of use.  Sorting and encryption are also potential targets.

There aren’t many applications in that list which will drive the system at 100% CPU usage for more than a few seconds at a time.  For the most part the types of applications which actually need power are the very ones which will benefit the most from running on SPEs.  Some of these kinds of applications can be accelerated to a very high degree (Photoshop, SETI, GarageBand).

I predicted SETI would be a good target in the first version of this article as it is based on FFTs.  Cells have been shown to run FFTs at ludicrously high speeds: a single SPE managed 19 GFLOPS at 3.2GHz [CellDev].  I think my prediction of high SETI performance will hold!  If you are not interested in looking for aliens though, you will find FFTs and similar algorithms are very widely used in compute intensive applications.

Note:  I said in the first version of this article that I thought Apple would be stark raving mad if they didn’t use the Cell.  I am still of that opinion.  However, if Apple won’t go to the Cell, you can be pretty sure the Cell will go to Apple: the aptly named Cell-industries [CI] is planning an add-on.

SPE Instruction Set

The exact specification of the SPE’s ISA hasn’t been released yet but it appears to be a cross between VMX and the PS2’s Emotion Engine.  It doesn’t contain the full VMX instruction set as some stuff was removed and other stuff added.  According to IBM using a subset of the VMX “intrinsics” you can compile to both the SPEs and standard VMX, the only difference is the local stores.  The actual code is different at the binary level so existing AltiVec / VMX code needs to be recompiled before it will run [CellLinux] .

Since the SPEs can act as independent processors they also include instructions used when running programs such as branches, loads and stores.  They can also operate as scalar processors so non-vectorisable code can also run.

SPE programs cannot directly access anything other than local store though data can be DMA’d to and from the local stores from other system addresses.  Exceptions can be generated by the SPEs but are handled by the PPE, even then the type of exceptions generated are limited.

Here is a roundup of what is currently known of the SPE’s Instruction set (this should not be considered an exhaustive list):

  1. Based on VMX / AltiVec - some instructions added, some removed.

  2. Includes some (all?) of the PS2’s Emotion Engine ISA.

  3. Supports vector or scalar operations.

  4. Includes loads, stores, branches and branch hints.

  5. 8, 16, 32 and 64 bit integer operations.

  6. Single and double precision floating point.

  7. Saturation arithmetic for FP (not integer).

  8. Simplified rounding modes for single precision FP.

  9. IEEE 754 support for double precision FP (not precise mode).

  10. Logical operations.

  11. Byte operations: Shuffle, Permute, Shift and Rotate (Shift / Rotate per Qword or slot).

  12. 128 x 128 bit Registers.

  13. Local Store DMA I/O (to / from any address in system).

  14. Commands for mailbox access, interrupts etc.

SPE Simulation

Until simulators or hardware become available it appears the best way of understanding SPE development would be to learn parallel programming techniques and VMX.  The local stores could probably be simulated by doing everything in a 256K block of RAM and only allowing access to the rest of RAM via aligned 128 byte transfers to / from it.  This will at least give you some idea of the issues you have to face.  You do however need to note the 256K will include your code so don’t use all of it for data.  This will not simulate the speed of the local store or the impact of the additional registers but should be useful nonetheless.  The impact of a smaller number of registers and CPU caching will mean performance of the final code will be very difficult to estimate until hardware or real simulators are available.
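A minimal sketch of that idea follows.  It is a toy model for getting a feel for the constraints, not a Cell simulator: the class and method names (`LocalStoreSim`, `dma_get`, `dma_put`) are made up for the example, and reserving the bottom of the block for code is simply one way of honouring the "256K includes your code" rule.

```python
LOCAL_STORE_SIZE = 256 * 1024    # code and data must both fit in here
DMA_BLOCK = 128                  # transfers are aligned 128 byte blocks

class LocalStoreSim:
    def __init__(self, main_memory, code_size):
        self.main = main_memory                      # bytearray standing in for RAM
        self.ls = bytearray(LOCAL_STORE_SIZE)
        # Reserve room for the program, rounded up to a whole block.
        self.data_base = -(-code_size // DMA_BLOCK) * DMA_BLOCK

    def _check(self, ls_addr, main_addr, size):
        if ls_addr % DMA_BLOCK or main_addr % DMA_BLOCK or size % DMA_BLOCK:
            raise ValueError("transfers must be aligned multiples of 128 bytes")
        if ls_addr < self.data_base or ls_addr + size > LOCAL_STORE_SIZE:
            raise ValueError("transfer falls outside the local store data area")
        if main_addr < 0 or main_addr + size > len(self.main):
            raise ValueError("transfer falls outside main memory")

    def dma_get(self, ls_addr, main_addr, size):
        """Copy from main memory into the local store."""
        self._check(ls_addr, main_addr, size)
        self.ls[ls_addr:ls_addr + size] = self.main[main_addr:main_addr + size]

    def dma_put(self, ls_addr, main_addr, size):
        """Copy from the local store back out to main memory."""
        self._check(ls_addr, main_addr, size)
        self.main[main_addr:main_addr + size] = self.ls[ls_addr:ls_addr + size]

# Usage: pull one block in, work on it locally, push it back out.
main = bytearray(range(256)) * 4
sim = LocalStoreSim(main, code_size=1000)
base = sim.data_base
sim.dma_get(base, 128, 128)
for i in range(base, base + 128):
    sim.ls[i] = (sim.ls[i] + 1) % 256
sim.dma_put(base, 128, 128)
```

All computation touches only `sim.ls`; anything in `main` has to be fetched and written back explicitly, which is the habit this exercise is meant to build.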

Cell Programming Issues

One article suggested that some developers had found the performance of both the Cell and the XBox360’s processors to be low.  This article was highly controversial and the original was removed but that didn’t stop it spreading [Comment].  It appears the developers had taken some decidedly single threaded game code and run it on pre-production hardware with a processor optimised for multi-threading and stream processing (most likely with an immature compiler).  That the performance wasn’t what they expected is not so much news as blindingly obvious.  Of course the more cynical might suggest that’s exactly what these developers expected...

Getting the full potential from a Cell will be more difficult than programming a single threaded PC application.  Multiple execution threads will be necessary as will careful choice of algorithms and data flow control.

These are not exactly new problems and solutions to them have long existed.  Multiprocessor or even uni-processor servers have been doing this sort of thing for donkey’s years.  It’s all old hat to BeOS programmers and many other programmers for that matter.  You will likely see technologies appearing from other areas (e.g. application servers) which take the pain out of thread management.

The same will also happen to the PC in time so they’re not going to get off lightly.  The problems that need to be solved for programming a Cell are exactly the same problems that need to be solved for programming a multi-core PC processor.

The types of algorithms which a Cell will be bad at are the very same algorithms that a PC processor is bad at.  Any algorithm which reads data from memory in a non-linear manner will cause your CPU to twiddle its thumbs as it sits waiting for the relevant memory to get pulled in.  Such algorithms are also likely to cause cache thrashing so the cache won’t help much.  Branch prediction is not much use either as it works on instructions, not data.

Cell will suffer here (possibly more so) but smart programmers can do tricks like splitting data into blocks and reading all the relevant data into the SPEs’ local memories in one go.  This will make algorithms at which the Cell is supposedly bad vastly quicker.  The SPEs can communicate internally and read from each other’s local stores, so if all SPEs are used for this nearly 2 MB can conceivably be used at once.  This will not be possible on a PC as cache works in a completely different way.
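One way to picture the blocking trick: instead of chasing each scattered access out to main memory, group the accesses by the block they fall in, pull each needed block into local memory once, and do all the random access there.  The sketch below assumes a hypothetical block size of 1024 items and uses a list slice to stand in for the bulk transfer.

```python
from collections import defaultdict

BLOCK_ITEMS = 1024    # hypothetical number of items that fit in a local store

def gather_blocked(table, indices):
    """Return ([table[i] for i in indices], blocks_loaded), loading each block once."""
    by_block = defaultdict(list)
    for pos, idx in enumerate(indices):
        by_block[idx // BLOCK_ITEMS].append((pos, idx % BLOCK_ITEMS))

    out = [None] * len(indices)
    blocks_loaded = 0
    for block_no, requests in sorted(by_block.items()):
        start = block_no * BLOCK_ITEMS
        local = table[start:start + BLOCK_ITEMS]    # one bulk "DMA" into local memory
        blocks_loaded += 1
        for pos, offset in requests:                # random access is now all local
            out[pos] = local[offset]
    return out, blocks_loaded

table = list(range(0, 20000, 2))
indices = [9999, 3, 5000, 4, 9998, 1024]
values, blocks_loaded = gather_blocked(table, indices)
```

Here six scattered lookups cost four block loads rather than six separate trips to memory, and the saving grows as more lookups land in the same blocks.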

The Cell also supports stream processing via the local stores, this can drastically reduce the need to go to memory and they will perform best in this configuration.  The Xenon has special cache modifications (locking cache sets) to allow streaming but PC CPUs appear to have no direct way to support this currently.

SPE Performance Issues

For the PPE to get round its issues it just needs software to be compiled for it with a decent compiler.  With the SPEs the issues are more complex as they are more optimised for specific types of code.  As with the PPE there is no OOO hardware so the compiler is again important, but with 128 registers there’s plenty of room to unroll loops.  Scalar (i.e. single operation) processing can work, but the SPEs were really designed for vector processing and will perform best (and considerably faster) when doing vector operations.

To get full use of an SPE the algorithm in use and at least some of the data needs to fit in a local store.  Anything which deals with chunks of data which fits entirely into the local stores should pretty much go like a bat out of hell.

All CPUs are held back by memory accesses, which can take hundreds of cycles, leaving the CPU sitting there doing nothing for long periods.  A processor will only run as fast as it can get data to process.  If the data is held in a low latency local memory getting data is not going to be a problem.  It is in conditions like these that the individual SPEs may approach their theoretical maximum processing speed.

Programs which need to access external memory can move data to and from the local stores but there are restrictions in that transfers need to be properly aligned and should be in chunks of 128 Bytes (transfers can be smaller but there’s no point as it’s designed to handle 128 Bytes).  Additionally, due to there being multiple processors in a Cell the memory access is shared so access requests need to be put in a queue.  Scheduling memory access early will be important for maximising memory throughput.
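Scheduling access early usually means asking for the next chunk before starting work on the current one, so the transfer and the computation overlap.  The sketch below shows only that ordering, using a single background thread as a stand-in for the DMA engine.  The function names are invented for the example, and in standard Python the overlap is not a real speed-up, so read it as a picture of the control flow rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 128    # bytes per transfer, matching the 128 byte size described above

def fetch(memory, offset):
    """Stand-in for a DMA transfer from main memory into a local buffer."""
    return bytes(memory[offset:offset + CHUNK])

def checksum_stream(memory):
    """Sum all bytes, requesting chunk n+1 before computing on chunk n."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, memory, 0)
        for offset in range(0, len(memory), CHUNK):
            current = pending.result()          # wait for the chunk in flight
            nxt = offset + CHUNK
            if nxt < len(memory):
                pending = dma.submit(fetch, memory, nxt)    # schedule early
            total += sum(current)               # compute while the next chunk loads
    return total

memory = bytes(range(256)) * 10
total = checksum_stream(memory)
```

Two buffers are live at any time (the one being computed on and the one being filled), which is why this pattern is commonly called double buffering.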

While conventional CPUs try to hide memory access with caches, the SPEs put it under the control of the programmer / compiler.  This adds complexity, but not all view compiler controlled memory access as a hindrance:

“An argument has been growing that processor instructions sets are just broken because they hide ‘memory’ (now in truth L2 cache) from ‘I/O’ (better known as memory) and the speed differences are now so huge that software can manage this better than hardware guesswork.” - Alan Cox [AC]

The SPEs are dual issue but only a single calculation instruction can be issued per cycle.  Clever compilers may make use of vector processing as it should be possible to schedule multiple scalar operations in a single vector operation if they are the same.
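To illustrate the packing idea: if a program contains several independent scalar operations of the same kind, a compiler could place them in separate lanes of one vector register and issue a single vector instruction.  The sketch below simulates this for multiply-adds, assuming four 32 bit lanes per 128 bit register.  Nothing here is real SPE code and the function names are invented for the example.

```python
LANES = 4    # a 128 bit register holds four 32 bit values

def vector_madd(a, b, c):
    """One simulated vector instruction: a*b + c in every lane at once."""
    assert len(a) == len(b) == len(c) == LANES
    return [x * y + z for x, y, z in zip(a, b, c)]

def pack_scalar_madds(ops):
    """Group independent scalar (a, b, c) multiply-adds into vector issues."""
    results, issues = [], 0
    for i in range(0, len(ops), LANES):
        group = ops[i:i + LANES]
        pad = [(0, 0, 0)] * (LANES - len(group))    # unused lanes do dummy work
        a, b, c = zip(*(group + pad))
        results.extend(vector_madd(a, b, c)[:len(group)])
        issues += 1
    return results, issues

ops = [(i, i + 1, 2) for i in range(10)]
packed_results, issues = pack_scalar_madds(ops)
```

Ten scalar multiply-adds become three vector issues here.  The last issue wastes two lanes, which is why this only pays off when enough identical operations are available to fill them.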

The PPE does have some branch prediction hardware but the SPE has none.  To get around this the SPE includes a “branch hint” instruction which the compiler can use.  In addition to this in some cases instructions can be used which remove the need for branches altogether.  Developers using GPUs for general purpose programming have more constraints than the SPEs and have developed techniques for reducing the costs of branches.  It’s quite possible that at least some of these can be applied to SPE programs.
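One common way of removing a branch is to compute both outcomes and then pick between them with a bit mask.  The sketch below simulates that on unsigned 32 bit lanes.  It assumes a compare operation that produces all ones or all zeros per lane and a bitwise select, which is a typical SIMD pattern rather than anything confirmed about the SPE instruction set; the Python conditional inside `compare_gt` merely stands in for that compare instruction.

```python
MASK32 = 0xFFFFFFFF

def compare_gt(a, b):
    """Per-lane compare: all ones where a > b, all zeros elsewhere."""
    return [MASK32 if x > y else 0 for x, y in zip(a, b)]

def select(mask, if_set, if_clear):
    """Per-lane bitwise select, with no branch on the data."""
    return [(t & m) | (f & ~m & MASK32)
            for m, t, f in zip(mask, if_set, if_clear)]

def clamp_to_limit(values, limit):
    """Equivalent to 'if v > limit: v = limit' for every lane, branch free."""
    limits = [limit] * len(values)
    return select(compare_gt(values, limits), limits, values)

clamped = clamp_to_limit([1, 300, 255, 256], 255)
```

Every lane does the same work regardless of its data, so there is nothing for the processor to predict.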

In the future, instead of having multiple discrete computers you'll have multiple computers acting as a single system.  Upgrading will not mean replacing an old system anymore, it'll mean enhancing it.  What's more your "computer" may in reality also include your PDA, TV, printer and Camcorder all co-operating and acting as one. The network will quite literally be the computer.

The Future: Multi-Cell'd Animals

One of the main points of the entire Cell architecture is parallel processing.  The original idea for Cells working across networks as mentioned in the patent appears to still be in development but probably won’t be in wide use for some time yet.  The idea is that “software cells” can be sent pretty much anywhere and don't depend on a specific transport means.

Want more computing power?  Plug in a few more Cells and there you have it.  If you have a few Cells sitting around talking to each other via WiFi connections the system can use them to distribute software cells for processing.  The idea is similar to the above mentioned job queues but rather than jobs being assigned locally they are assigned across a network to any Cell with spare processing capability.

[Figure: Cell_Distributed.gif]

The mechanism present in the software cells makes use of whatever networking technology is in use, which allows ad-hoc arrangements of Cells to be made.  This system essentially takes a lot of complexity which would normally be handled by hardware and moves it into the system software.  This usually slows things down but the benefit is flexibility: you give the system a set of software cells to compute and it figures out how to distribute them itself.  If your system changes (Cells added or removed) the OS should take care of this without the programmer needing to worry about it.

Writing software for parallel processing is usually highly difficult and this helps get around the problem.  You still of course have to parallelise the program into software cells / jobs but once that's done you don't have to worry if you have one Cell or ten (unless you’ve optimised for a specific number).

It's not clear how this system will operate in practice but it would have to be adaptive to allow resending of jobs when Cells appear and disappear on a network.  That said such systems already exist and are in very wide use.
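As a thought experiment, the resending behaviour might look something like the toy scheduler below.  Everything in it is hypothetical: the round-robin policy, the `drops_after` failure model and the device names are made up to show the idea of requeueing work when a Cell vanishes, and say nothing about how the real system will work.

```python
from collections import deque

def distribute(software_cells, cells, drops_after):
    """Assign software cells round-robin; requeue work from Cells that vanish.

    `drops_after` maps a Cell name to how many jobs it completes before it
    disappears from the network (a made-up failure model).
    """
    pending = deque(software_cells)
    alive = list(cells)
    completed = {name: 0 for name in cells}
    done_by = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no Cells left on the network")
        name = alive[turn % len(alive)]
        job = pending.popleft()
        if completed[name] >= drops_after.get(name, float("inf")):
            alive.remove(name)           # Cell has disappeared: resend the job
            pending.appendleft(job)
            continue
        done_by[job] = name
        completed[name] += 1
        turn += 1
    return done_by

done_by = distribute(list(range(7)), ["tv", "console", "pda"], {"pda": 1})
```

The caller hands over seven software cells and gets seven results back, without needing to know that one device dropped off part way through.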

This system was not designed to act like a “big iron” machine, that is, it is not arranged around a single shared or closely coupled set of memories.  All the memory may be addressable but each Cell has its own memory and they will work most efficiently in it.

It appears the IBM workstation / blade will have 2 Cells and can act as an SMP system (i.e. the Cells can share each other’s memory).  The patent specified a way to connect 8 Cells but given the size and likely cost of the first generation of Cells I doubt anything like this will appear soon.

Programming The Cell: Conclusion

From reading various articles and conference papers / presentations it’s clear that there are many different Cell development models under active development, some of these are complex, others designed to make things easy. These developments will take considerable time to become mature so won’t show their full potential for some time to come.

The patent described a highly complex system which would be very difficult to program, but a lot of work is going on to make sure this is not the case.  Tools and middleware will be developed to make things easier still.

The Cell was designed as a general purpose system but tuned for specific classes of algorithms.  These kinds of algorithm are very common and are utilised in a major chunk of the kinds of software which require high performance.  PC vendors have nothing to worry about, it’s the workstation vendors that should be worried.

In Part 4...

Why does the Cell work the way it does?  How and why did they come up with this design?

In part 4 I look into the reasons for the Cell’s design decisions and the impact they are likely to make.

© Nicholas Blachford 2005.