This article has been updated, see Version 2

Cell Architecture Explained (version 1) - Part 1: Inside The Cell

Getting the details on Cell is not that easy. The initial announcements were vague to say the least, and it wasn't until a patent [Cell Patent] appeared that any details emerged. Most people wouldn't have noticed this, but The Inquirer ran a story on it [INQ].

Unfortunately the patent reads like it was written by a robotic lawyer running Gentoo in text mode; you don't so much read it as decipher it. On top of this, the patent does not give the details of what the final system will look like, though it does describe a number of different options.

With the recent announcements about a new Cell workstation, and with some details [Recent Details] and specifications [Specs] being revealed, it's now possible to have a look at what a Cell based system may look like in the flesh.

The patent is a long and highly confusing document, but I think I've managed to understand it sufficiently to describe the system. It's important to note, though, that the actual Cell processors may differ from the description I give, as the patent does not describe everything and, even if it did, things can and do change.

Background

Although it's been primarily touted as the technology for the PlayStation 3, Cell is designed for much more. Sony and Toshiba, both being major electronics manufacturers, buy in all manner of different components, and one of the reasons for Cell's development is that they want to save costs by building their own. Next generation consumer technologies such as Blu-ray, HDTV, HD camcorders and of course the PS3 will all require a very high level of computing power, and this is going to need chips to provide it. Cell will be used for all of these and more. IBM will also be using the chips in servers, and they can also be sold to 3rd party manufacturers [3rd party].

Sony and Toshiba previously co-operated on the PlayStation 2, but this time the designs are more aggressive and required a third partner to help design and manufacture the new chips. IBM brings not only its chip design expertise but also its industry leading silicon process and its ability to get things to work - when even the biggest chip firms in the industry have problems, it's IBM who get the call to come and help. The companies they've helped are a who's who of the semiconductor industry.

The amount of money being spent on this project is vast: two 65nm chip fabrication facilities are being built at billions each, and Sony has paid IBM hundreds of millions to set up a production line in Fishkill. Then there's a few hundred million on development - all before a single chip rolls off the production lines.

So, what is Cell Architecture?

Cell is an architecture for high performance distributed computing. It is made up of hardware and software Cells. Software Cells consist of data and programs (known as apulets); these are sent out to the hardware Cells, where they are computed and the results returned.
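As a rough illustration of the idea, a software Cell can be pictured as a program bundled with its data and handed to whichever hardware Cell is available. The sketch below is purely illustrative: the names SoftwareCell, HardwareCell and dispatch are hypothetical rather than taken from the patent, a plain Python function stands in for the apulet, and the round-robin scheduling is an assumption made only to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SoftwareCell:
    """Hypothetical model: a program (apulet) bundled with its data."""
    apulet: Callable[[Sequence[float]], float]
    data: Sequence[float]


@dataclass
class HardwareCell:
    """Hypothetical model of a device able to execute software Cells."""
    name: str

    def run(self, cell: SoftwareCell) -> float:
        # Execute the apulet on the data that travelled with it.
        return cell.apulet(cell.data)


def dispatch(cells: List[SoftwareCell], hardware: List[HardwareCell]) -> List[float]:
    """Hand out software Cells round-robin and collect the results in order."""
    results = []
    for i, cell in enumerate(cells):
        target = hardware[i % len(hardware)]
        results.append(target.run(cell))
    return results


work = [SoftwareCell(sum, [1.0, 2.0, 3.0]), SoftwareCell(max, [4.0, 9.0, 2.0])]
devices = [HardwareCell("PS3"), HardwareCell("HDTV")]
results = dispatch(work, devices)
print(results)
```

The point of the sketch is only that the program travels with its data, so any device that understands the format can do the work.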

This architecture is not fixed in any way: if you have a computer, PS3 and HDTV which have Cell processors, they can co-operate on problems. They've been talking about this sort of thing for years of course, but the Cell is actually designed to do it. I for one quite like the idea of watching "Contact" on my TV while a PS3 sits in the background churning through a SETI@home [SETI] unit every 5 minutes. If you know how long a SETI unit takes, your jaw should have just hit the floor. Suffice to say, Cells are very, very fast [Calc].

It can go further though: there's no reason why your system can't distribute software Cells over a network or even all over the world. The Cell is designed to fit into everything from PDAs up to servers, so you can make an ad-hoc Cell computer out of completely different systems.

Cell APU Architecture diagram

Scaling is just one capability of Cell; the individual systems are going to be potent enough on their own. The single unit of computation in a Cell system is called a Processing Element (PE), and even an individual PE is one hell of a powerful processor, with a theoretical computing capability of 250 GFLOPS (Billion Floating Point Operations per Second) [GFLOPS]. In the computing world quoted figures (bandwidth, processing, throughput) are often theoretical maximums and rarely if ever met in real life. Cell may be unusual in that, given the right type of problem, it may actually be able to get close to its maximum computational figure.

Specifications

An individual Processing Element (i.e. Hardware Cell) is made up of a number of elements:

Cell Architecture diagram

The full specifications haven't been given out yet but some details [Specs] are out there:

All those internal processing units need to be fed, so a high speed memory and I/O system is an absolute necessity. For this purpose Sony and Toshiba have licensed the high speed "Yellowstone" and "Redwood" technologies from Rambus [Rambus]; the 6.4 Gb/s I/O was also designed in part by Rambus.

The Processor Unit (PU)

As we now know [Recent Details], the PU is a 64 bit "Power Architecture" processor. Power Architecture is a catch-all term IBM have been using for a while to describe both PowerPC and POWER processors. Currently there are only 3 CPUs which fit this description: POWER5, POWER4 and the PowerPC 970 (aka G5), which itself is a derivation of the POWER4.

The IBM press release indicates the Cell processor is "Multi-thread, multi-core", but since the APUs are almost certainly not multi-threaded it looks like the PU may be based on a POWER5 core - the very same core I expect to turn up in Apple machines in the form of the G6 [G6] in the not too distant future. IBM have acknowledged such a chip is in development but, as if to confuse us, call it a "next generation 970".

There is of course the possibility that IBM have developed a completely different 64 bit CPU which they've never mentioned before. This isn't a far fetched idea as this is exactly the sort of thing IBM tend to do, e.g. the 440 CPU used in the BlueGene supercomputer is still called a 440 but is very different from the chip you find in embedded systems.

If the PU is based on a POWER design, don't expect it to run at a high clock speed. POWER cores tend to be rather power hungry, so it may be clocked down to keep power consumption in check.

The PlayStation 3 is touted to have 4 Cells, so a system could potentially have 4 POWER5 based cores. This sounds pretty amazing until you realise that the PUs are really just controllers - the real action is in the APUs...

Attached Processor Units (APU)

Each Cell contains 8 APUs. An APU is a self contained vector processor which acts independently from the others. Each contains 128 X 128 bit registers, along with 4 floating point units capable of 32 GigaFlops and 4 integer units capable of 32 GOPS (Billions of Operations per Second). The APUs also include a small 128 Kilobyte local memory instead of a cache, and no virtual memory system is used at runtime.
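The figures above can be sanity-checked with a little arithmetic. The snippet below simply multiplies out the numbers quoted in the text; the assumption that the peak figure for a Cell is just the sum of its APUs (ignoring the PU) is mine, not something the patent states.

```python
APUS_PER_CELL = 8       # from the text
GFLOPS_PER_APU = 32     # from the text
REGISTERS = 128         # registers per APU, from the text
REGISTER_BITS = 128     # width of each register, from the text

# Assumption: the peak for a Cell is the sum of its APUs, with the PU ignored.
peak_gflops = APUS_PER_CELL * GFLOPS_PER_APU

# Size of one APU's register file in bytes.
register_file_bytes = REGISTERS * REGISTER_BITS // 8

print(f"Peak per Cell: {peak_gflops} GFLOPS")
print(f"Register file per APU: {register_file_bytes} bytes")
```

This gives 256 GFLOPS per Cell, which is in the same ballpark as the roughly 250 GFLOPS quoted earlier, and a 2 Kilobyte register file per APU.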

Independent processing
The APUs are not coprocessors; they are complete independent processors in their own right. The PU sets them up with a software Cell and then "kicks" them into action. Once running, the APU executes the apulet in the software Cell until it is complete or it is told to stop. The PU sets up the APUs using Remote Procedure Calls. These are not sent directly to the APUs but rather via the DMAC, which also performs any memory reads or writes required.

Vector processing
The APUs are vector [Vector] (or SIMD) processors, that is, they perform multiple operations simultaneously with a single instruction. Vector computing has been used in supercomputers since the 1970s and modern CPUs have media accelerators (e.g. SSE, AltiVec) which work on the same principle. Each APU appears to be capable of 4 X 32 bit operations per cycle (8 if you count multiply-adds). In order to work, the programs run will need to be "vectorised"; this can be done in many application areas such as video, audio, 3D graphics and many scientific areas.
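To make the SIMD idea concrete, here is a small sketch that counts "instructions" for a scalar add and for a 4-wide vector add over the same data. It is a toy model in plain Python, not APU code: the function names are made up for illustration, and the assumption that the input length is a multiple of 4 is there only to keep it short.

```python
from typing import List, Tuple

LANES = 4  # four 32 bit values fit in one 128 bit register


def scalar_add(a: List[float], b: List[float]) -> Tuple[List[float], int]:
    """Add element by element: one 'instruction' per element."""
    out, instructions = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions


def vector_add(a: List[float], b: List[float]) -> Tuple[List[float], int]:
    """Add LANES elements per 'instruction'.

    Assumes len(a) == len(b) and that the length is a multiple of LANES.
    """
    out, instructions = [], 0
    for i in range(0, len(a), LANES):
        # A real vector unit would perform these four additions in one step.
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
        instructions += 1
    return out, instructions


a = [float(i) for i in range(16)]
b = [1.0] * 16
scalar_result, scalar_count = scalar_add(a, b)
vector_result, vector_count = vector_add(a, b)
print(scalar_count, vector_count)
```

Both versions produce the same answer, but the vector version gets there in a quarter of the instructions. "Vectorising" a program means restructuring its loops so the data lines up in groups like this.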

AltiVec?
It has been speculated that the vector units are the same as the AltiVec units found in the PowerPC G4 and G5 processors. I consider this highly unlikely as there are several differences. Firstly, the number of registers is 128 instead of AltiVec's 32. Secondly, the APUs use a local memory whereas AltiVec does not. Thirdly, AltiVec is an add-on to the existing PowerPC instruction set and operates as part of a PowerPC processor, whereas the APUs are completely independent processors. There will no doubt be a great similarity between the two, but don't expect any direct compatibility. It should however be relatively simple to convert between the two.

Cell APU Architecture diagram

APU Local memory

The lack of cache and virtual memory systems means the APUs operate in a different way from conventional CPUs. This will likely make them harder to program but they have been designed this way to reduce complexity and increase performance.

Conventional Cache
Conventional CPUs perform all their operations in registers, which are read from or written to main memory. Operating directly on main memory is hundreds of times slower, so caches (a fast on-chip memory of sorts) are used to hide the effects of going to or from main memory. Caches work by storing part of the memory the processor is working on. If you are working on a 1MB piece of data it is likely only a small fraction of this (perhaps a few hundred bytes) will be present in cache. There are kinds of cache design which can store more or even all the data, but these are not used as they are too expensive or too slow.

If data being worked on is not present in the cache the CPU stalls and has to wait for this data to be fetched. This essentially halts the processor for hundreds of cycles. It is estimated that even high end server CPUs (POWER, Itanium, typically with very large fast caches) spend anything up to 80% of their time waiting for memory.
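A back-of-the-envelope model shows how quickly misses come to dominate. The sketch below uses the standard "average access time = hit time + miss rate x miss penalty" formula. The specific numbers are illustrative assumptions chosen by me, not measurements of any real CPU, and the model only covers time spent on memory accesses rather than total run time.

```python
def memory_stall_fraction(hit_cycles: float, miss_rate: float,
                          miss_penalty_cycles: float) -> float:
    """Fraction of memory-access time spent stalled on cache misses.

    Average cycles per access = hit_cycles + miss_rate * miss_penalty_cycles.
    """
    stall = miss_rate * miss_penalty_cycles
    return stall / (hit_cycles + stall)


# Illustrative assumptions: 1 cycle on a hit, 2% of accesses miss,
# and each miss costs 200 cycles.
fraction = memory_stall_fraction(hit_cycles=1, miss_rate=0.02,
                                 miss_penalty_cycles=200)
print(f"{fraction:.0%} of access time is spent waiting")
```

With these made-up numbers a miss rate of just 2% is enough for 80% of the access time to go on waiting, which is why a penalty of hundreds of cycles matters so much.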

Dual-core CPUs will become common soon and these usually have to share the cache. Additionally, if either of the cores or other system components try to access the same memory address, the data in the cache may become out of date and thus needs to be updated (made coherent).

Supporting all this complexity requires logic and takes time, and this limits the speed at which a conventional system can access memory. The more processors there are in a system, the more complex this problem becomes. Cache design in conventional CPUs speeds up memory access, but compromises are made to get it to work.

APU local memory - no cache
To solve the complexity associated with cache design and to increase performance the Cell designers took the radical approach of not including any. Instead they used a series of local memories: there are 8 of these, 1 in each APU.

The APUs operate on registers which are read from or written to the local memory. This local memory can access main memory in blocks of 1024 bits but the APUs cannot act directly on main memory.
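The load, compute, store pattern this implies can be sketched as follows. This is a toy model only: the LocalMemory class and its method names are hypothetical, and the requirement that transfers be block aligned is my assumption to keep the example tidy. The 1024 bit block size and 128 Kilobyte capacity are the figures quoted in this article.

```python
BLOCK_BYTES = 1024 // 8           # 1024 bit transfers, from the text
LOCAL_MEMORY_BYTES = 128 * 1024   # 128 Kilobyte local memory, from the text


class LocalMemory:
    """Hypothetical model of one APU's local memory."""

    def __init__(self) -> None:
        self.store = bytearray(LOCAL_MEMORY_BYTES)

    @staticmethod
    def _check(main_addr: int, local_addr: int) -> None:
        # Assumption: transfers must start on a block boundary.
        if main_addr % BLOCK_BYTES or local_addr % BLOCK_BYTES:
            raise ValueError("transfers must be block aligned")

    def load_block(self, main: bytearray, main_addr: int, local_addr: int) -> None:
        """Copy one block from main memory into local memory."""
        self._check(main_addr, local_addr)
        self.store[local_addr:local_addr + BLOCK_BYTES] = \
            main[main_addr:main_addr + BLOCK_BYTES]

    def store_block(self, main: bytearray, main_addr: int, local_addr: int) -> None:
        """Copy one block from local memory back to main memory."""
        self._check(main_addr, local_addr)
        main[main_addr:main_addr + BLOCK_BYTES] = \
            self.store[local_addr:local_addr + BLOCK_BYTES]


main_memory = bytearray(range(256)) * 2   # 512 bytes, i.e. 4 blocks
local = LocalMemory()

local.load_block(main_memory, main_addr=128, local_addr=0)   # fetch block 1
for i in range(BLOCK_BYTES):                                 # work on the local copy only
    local.store[i] = (local.store[i] + 1) % 256
local.store_block(main_memory, main_addr=128, local_addr=0)  # write the block back
```

Note that the compute loop never touches main_memory directly; everything goes through the local copy, which is the essential difference from a cached design where the hardware hides this step.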

By not using a caching mechanism the designers have removed the need for a lot of the complexity which goes along with a cache. The local memory can only be accessed by the individual APU; there is no coherency mechanism directly connected to the APU or local memory.

This may sound like an inflexible system which will be complex to program, and it most likely is, but this system will deliver data to the APU registers at a phenomenal rate. If 2 registers can be moved per cycle to or from the local memory it will, in its first incarnation, deliver 147 Gigabytes per second. That's for a single APU; the aggregate bandwidth for all local memories will be over a Terabyte per second - no CPU in the consumer market has a cache which will even get close to that figure. The APUs need to be fed with data and by using a local memory based design the Cell designers have provided plenty of it.
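The bandwidth figures can be worked backwards to see what they imply. The snippet below uses only numbers from this article (128 bit registers, 2 registers per cycle, 147 Gigabytes per second, 8 APUs); the clock speed it prints is an inference from those numbers rather than a quoted specification, and it assumes 1 Gigabyte means 10^9 bytes.

```python
REGISTER_BYTES = 128 // 8     # 128 bit registers, from the text
REGISTERS_PER_CYCLE = 2       # the "if" in the text
QUOTED_GB_PER_S = 147         # per APU, from the text
APUS = 8                      # from the text

# Bytes moved between local memory and registers each cycle.
bytes_per_cycle = REGISTER_BYTES * REGISTERS_PER_CYCLE

# Clock speed implied by the quoted bandwidth (an inference, not a spec).
implied_clock_ghz = QUOTED_GB_PER_S / bytes_per_cycle

# Total across all local memories in one Cell.
aggregate_gb_per_s = QUOTED_GB_PER_S * APUS

print(f"{bytes_per_cycle} bytes per cycle")
print(f"implied clock: {implied_clock_ghz:.2f} GHz")
print(f"aggregate: {aggregate_gb_per_s} GB/s")
```

At 32 bytes per cycle the quoted figure implies a clock of roughly 4.6 GHz, and 8 local memories together come to 1176 Gigabytes per second, hence "over a Terabyte".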

Coherency
While there is no coherency mechanism in the APUs, a mechanism does exist. To prevent problems occurring when 2 APUs use the same memory, a mechanism is used which involves some extra data stored in the RAM and an extra "busy" bit in the local storage. There are quite a number of diagrams to look at and a detailed explanation in the patent if you wish to read up on the exact mechanism used. However, the system is much simpler than trying to keep caches up to date, since it essentially just marks data as either readable or not and lists which APU tried to get it.
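The "readable or not, plus a note of who asked" idea can be sketched in a few lines. This is a heavily simplified illustration of the general principle and not the mechanism from the patent: the GuardedLocation class, its fields and its behaviour on a failed read are all hypothetical.

```python
from typing import List, Optional


class GuardedLocation:
    """Hypothetical memory location with a 'readable' flag and a waiting list."""

    def __init__(self) -> None:
        self.value: Optional[int] = None
        self.readable = False
        self.waiting: List[int] = []   # ids of APUs that tried to read too early

    def read(self, apu_id: int) -> Optional[int]:
        """Return the value if it is readable, otherwise record who asked."""
        if not self.readable:
            if apu_id not in self.waiting:
                self.waiting.append(apu_id)
            return None
        return self.value

    def write(self, value: int) -> List[int]:
        """Store a value, mark it readable and return the APUs to notify."""
        self.value = value
        self.readable = True
        notified, self.waiting = self.waiting, []
        return notified


loc = GuardedLocation()
first_try = loc.read(apu_id=3)     # data not ready yet, so APU 3 is noted
to_notify = loc.write(42)          # producer finishes and learns who is waiting
second_try = loc.read(apu_id=3)    # now the read succeeds
```

Compared with keeping several caches in step, the bookkeeping here is tiny: one flag and a short list per guarded piece of data.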

The system can complicate memory access though, and slow it down. The additional data stored in RAM could be moved on chip to speed things up, but this may not be worth the extra silicon and subsequent cost at this point in time.

Little is known at this point about the PUs apart from their being "Power Architecture", but as they are a conventional CPU design I think it's safe to assume there will be perfectly normal cache and coherency mechanisms used within them (presumably modified for the memory subsystem).

APUs on their own being well fed with data will make for some highly potent processors. But...

APUs can also be chained, that is, they can be set up to process data in a stream using multiple APUs in parallel. In this mode a Cell may approach its theoretical maximum processing speed of 250 GigaFlops. In part 2 I shall look at this, the rest of the internals of the Cell and other aspects of the architecture.
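As a small taste of what chaining means, the sketch below passes a stream of samples through three stages, each of which could in principle sit on its own APU. It is an illustration of the streaming idea only: the stage names are made up, and Python generators run one after another on a single processor rather than in parallel as chained APUs would.

```python
from typing import Iterable, Iterator


def scale(stream: Iterable[float], factor: float) -> Iterator[float]:
    """Stage 1: multiply every sample by a constant."""
    for x in stream:
        yield x * factor


def offset(stream: Iterable[float], amount: float) -> Iterator[float]:
    """Stage 2: add a constant to every sample."""
    for x in stream:
        yield x + amount


def clamp(stream: Iterable[float], low: float, high: float) -> Iterator[float]:
    """Stage 3: limit every sample to the range [low, high]."""
    for x in stream:
        yield min(max(x, low), high)


samples = [0.5, 1.0, 2.0, 4.0]

# Each stage consumes the output of the previous one, sample by sample.
pipeline = clamp(offset(scale(samples, 2.0), -1.0), 0.0, 5.0)
output = list(pipeline)
print(output)
```

Each sample flows through all three stages without the intermediate results ever being written out as a whole, which is the property that makes streaming attractive when every stage has its own fast local memory.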

Introduction and Index
Part 1: Inside The Cell
Part 2: Again Inside The Cell
Part 3: Cellular Computing
Part 4: Cell Vs the PC
Part 5: Conclusion and References
Part 6: Updates, Clarifications and Missing Bits

© Nicholas Blachford 2005.