This article has been updated; see Version 2
Cell Architecture Explained (Version 1) - Part 5: Conclusion and References
Short Overview
The Cell architecture consists of a number of elements:
The Cell Processor
This is a 9 core processor. One of these cores is similar to a PowerPC G5 and acts as a controller. The remaining 8 cores are called APUs and are very high performance vector processors. Each APU contains its own block of high speed RAM and is capable of 32 GigaFlops (32 bit). The APUs are independent processors and can act alone or can be set up to process a stream of data, with different APUs working on different stages. This ability to act as a "stream processor" gives access to the full processing power of a Cell, which is more than 10 times higher than even the fastest desktop processors.
In addition to the raw processing power the Cell includes a high performance multi-channel memory subsystem and a number of high speed interconnects for connecting to other Cells or I/O devices.
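The "stream processor" arrangement described above can be pictured with a small sketch. This is only an illustration of the idea of chaining stages, written in ordinary Python; the function names and the three stages are made up for the example and say nothing about how real APUs are programmed.

```python
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[float], float]


def run_stage(stage: Stage, items: Iterable[float]) -> Iterator[float]:
    """One hypothetical APU: apply a single stage to each item as it arrives."""
    for item in items:
        yield stage(item)


def stream_pipeline(stages: List[Stage], source: Iterable[float]) -> Iterator[float]:
    """Chain the stages so every item flows through them in order."""
    stream: Iterable[float] = source
    for stage in stages:
        stream = run_stage(stage, stream)
    return iter(stream)


# Three invented stages standing in for work placed on three different APUs.
def scale(x: float) -> float:
    return x * 2.0


def offset(x: float) -> float:
    return x + 1.0


def square(x: float) -> float:
    return x * x


results = list(stream_pipeline([scale, offset, square], [1.0, 2.0, 3.0]))
print(results)  # [9.0, 25.0, 49.0]
```

Each item passes from one stage to the next without going back to a central controller, which is the property the stream arrangement relies on. In this sketch the stages run one after another in a single thread; on a Cell they would be running at the same time on separate APUs.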
Distributing Processing
Cells are specifically designed to work together. While they can be directly connected via the high speed interconnects, they can also be connected in other ways or distributed over a network. The Cells are not gaming or computer specific; they can be in anything from PDAs to TVs, and all of them can be used together to act as a single system. The infrastructure for this is built into each Cell, as they operate on "Software Cells" which contain routing information as well as programs and data.
Parallel programming is usually complex, but in this case the OS will look at the resources it has and distribute tasks accordingly; this process does not involve re-programming. If you want more processing power you simply add more Cells. You do not need to replace the existing ones, as the new Cells will augment them.
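To make the idea concrete, here is a minimal sketch of how a unit of work carrying its own routing information, program and data might be handed out across whatever Cells are available. Everything in it is hypothetical: the `SoftwareCell` and `CellNode` classes, the "any" destination and the least-loaded rule are assumptions made for illustration and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SoftwareCell:
    """Hypothetical stand-in for a "Software Cell": routing info, program, data."""
    destination: str  # assumed routing field: a node name, or "any"
    program: Callable[[List[float]], float]
    data: List[float]


@dataclass
class CellNode:
    """Hypothetical hardware Cell with a queue of work assigned to it."""
    name: str
    queue: List[SoftwareCell] = field(default_factory=list)


def distribute(cells: List[SoftwareCell], nodes: List[CellNode]) -> None:
    """Send each software cell to its named node, else to the least-loaded node."""
    by_name = {node.name: node for node in nodes}
    for cell in cells:
        target = by_name.get(cell.destination)
        if target is None:
            target = min(nodes, key=lambda node: len(node.queue))
        target.queue.append(cell)


def run_all(nodes: List[CellNode]) -> Dict[str, List[float]]:
    """Run every queued program on its own data, grouped by node."""
    return {node.name: [c.program(c.data) for c in node.queue] for node in nodes}


# The same six units of work, first on one Cell and then on three.
work = [SoftwareCell("any", sum, [float(i), float(i + 1)]) for i in range(6)]

small = [CellNode("console")]
distribute(work, small)

large = [CellNode("console"), CellNode("tv"), CellNode("pda")]
distribute(work, large)

print([len(node.queue) for node in small])  # [6]
print([len(node.queue) for node in large])  # [2, 2, 2]
```

The work list is identical in both runs; only the list of nodes changes. That is the point being made above: adding Cells spreads the load without the programs themselves being rewritten.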
Overall the Cell architecture is an architecture for distributed, parallel processing using very powerful computational engines developed using a highly aggressive design strategy. These devices shall be produced in vast numbers so they will provide vast processing resources at a low cost.
Conclusion
The first Cell based desktop computer will be the fastest desktop computer in the industry by a very large margin. Even high end multi-core x86s will not get close. Companies who produce microprocessors or DSPs are going to have a very hard time fighting the power a Cell will deliver. We have never seen a leap in performance like this before and I don't expect we'll ever see one again. It'll send shock-waves through the entire industry and we'll see big changes as a result.
The sheer power and low cost of the Cell means it will present a challenge to the venerable PC. The PC has always been able to beat competition by virtue of its huge software base, but this base is not as strong as it once was. A lot of software now runs on Linux, and this is not dependent on x86 processors or Microsoft. Most PCs now provide more power than is necessary, and this fact combined with fast JIT emulators means that if necessary the Cell can provide PC compatibility without the PC.
The Cell will not just attack the PC industry; expect it to be widely used in embedded applications where high performance is required. This means it will be made in numbers potentially many times that of x86 CPUs, and this will reduce prices further. This will also hurt PC based vendors' desires to enter the home entertainment space, as PC based solutions [Entertainment] will be more complex and cost more than Cell based systems.
This is going to prove difficult for the PC, as CPU and GPU suppliers will have essentially nothing to fight back with. All they can hope to do is match a Cell's performance, but even that is going to be incredibly difficult given the Cell's aggressive Cray-esque design strategy.
Cell is going to turn the industry upside down. Nobody has ever produced such a leap in performance in one go, and certainly not at a low price. The CPU producers will be forced to fight back, and irrespective of how well the Cell actually does in the market you can be sure that in a few short years all CPUs will be providing vastly more processing resources than they do today. Even if the Cell were to fail we shall all gain from its legacy.
Not all companies will react correctly or in time, and this will provide opportunities for newer, smaller and smarter companies. Big changes are coming. They may take years, but the Cell means a decade from now the technology world is going to look very different.
References and Further Reading
Microprocessor Report Article on the Cell
I'm not the only one to decipher the patents, as Microprocessor Report have published an article on the subject (of course, just as I finished this). There is a short version, but you have to pay $50 to read the full article. I can't comment on the full article as I haven't read it, but it's probably a good read.
[Cell Patent]
The original Cell patent application by Masakazu Suzuoki and Takeshi Yamazaki of Sony Computer Entertainment Inc.
This article is based on the interpretation of this document.
The updated patent
[INQ]
The Inquirer ran a story on the Cell patent in early 2003
here.
[Recent Details]
Press release 1.
Press release 2.
[Specs]
Cell production specifications
Photo.
[3rd party]
Companies can sell Cell to their own customers, as mentioned in this
Toshiba press release.
[SETI]
SETI@home could benefit from the power of Cells.
SETI uses a lot of FFTs and these should benefit tremendously from the Cell's design. The use of local memory in the APUs will especially help.
SETI
[Rambus]
Sony and Toshiba licensed Rambus technology for use in the Cell.
Rambus
[GFLOPS]
It's not entirely clear how these are being counted; I assume this is 32 bit floating point operations.
Floating point operations can be 16 bit, 32 bit (single precision) or 64 bit (double precision). The Top500 supercomputer list counts double precision GFLOPS, so these are not comparable. Assuming the APUs are capable of it, a single PE should be capable of 128 GFLOPS (double precision), still over twenty times faster than any "normal" CPU.
[G6] I speculated on a POWER5 based G6 here
[Vector]
What is
vector processing
[Stream]
ACM Queue on
Streaming Processors
[PC + GPU]
Nvidia with
16 pixel processors.
[GPU]
Interesting paper on GPUs and clustering them to produce very high performance systems
GPU Cluster (Pdf)
[GPGPU]
GPUs can be used for general purpose computations. All these applications will also be useable on a Cell.
GPGPU
ShaderTech
[Nvidia]
PS3 to use Nvidia graphics technology.
Nvidia
[Top500]
The Top 500 Supercomputer list. Expect Cell to rapidly take over.
Top500.org
[PCShare]
Computer Market share, C64 was biggest in 1984
Market Share
[Intel+Nvidia]
Intel and Nvidia get cosy.
invidia?
[Project Z]
Intel does a deal with Nvidia, gets a new team from HP and then project Z is discovered.
There is talk of massive parallel CPUs but who knows what Intel is planning.
Project Z
[MultiCore]
8 core opterons
Opteropteropteron
[Transmeta]
Transmeta is considering getting out of chips.
Transmeta de-chips?
[Alien]
PCs can be expensive as well
Aliens
[Future]
PCs under threat in "The Future of Computing".
Future
[DirectX Next]
NextX
[EE-GPU]
EETimes has a series of articles on using GPU for general purpose work.
EE-GPU
[Cray] Supercomputer designer Seymour Cray also talked about Cells, of the biological kind: Interview
[Entertainment]
Intel are preparing PC based media
centres.
[Calc]
At this point we do not have any performance figures so any figures given are derived from either the figures in the patent or the 4.6GHz given for the chip itself. All performance figures are thus estimates based on hypothetical maximums.
5 minutes for a SETI unit? This is based on the difference between a 1.33GHz G4 (6 hours / unit @ 10 GFlops) and 4 X 250 GFlops Cells. This assumes the SETI client is using AltiVec on the G4 at full speed and could do the same in the PS3.
SETI uses lots and lots of FFTs and these involve lots and lots of Multiply-Add instructions. If we assume the G4 is processing at its maximum rate and the 4 Cells are doing the same, the 4 Cells should be delivering a result in well under 5 minutes.
G4 @ 1.33GHz gives 10.64 Giga Flops (8 instructions (Mul-Add) per cycle using AltiVec).
6 hours = 21,600 seconds
21,600 X 10.64 = 229,824
229,824 / 1,000 (4 Cells @ 250 GFlops)
= 229 seconds
= 3 minutes 49 seconds.
I rounded up to 5 minutes to be conservative.
Other figures (Opteron vs Cell, Top500) should be taken with similar amounts of salt.
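For anyone who wants to check the SETI estimate or try other numbers, the same arithmetic can be written as a few lines of Python. The inputs are the ones used above (a 1.33GHz G4 doing 8 operations per cycle, 6 hours per unit, 4 Cells at 250 GFlops each); the variable names are just for this sketch, and the result is only as good as those hypothetical maximums.

```python
# Inputs taken from the estimate above; all are theoretical peak figures.
G4_CLOCK_GHZ = 1.33        # G4 clock speed
OPS_PER_CYCLE = 8          # Mul-Add operations per cycle using AltiVec
G4_HOURS_PER_UNIT = 6      # time for the G4 to finish one SETI unit
CELL_GFLOPS = 250          # assumed peak of one Cell
NUM_CELLS = 4              # Cells assumed to be in the PS3

# Peak rate of the G4 in GFlops.
g4_gflops = G4_CLOCK_GHZ * OPS_PER_CYCLE

# Total work in one unit, in billions of floating point operations.
g4_seconds = G4_HOURS_PER_UNIT * 3600
work_gflop = g4_seconds * g4_gflops

# Time for the Cells to do the same work at their combined peak rate.
cell_seconds = work_gflop / (CELL_GFLOPS * NUM_CELLS)
minutes, seconds = divmod(int(cell_seconds), 60)

print(f"G4 peak: {g4_gflops:.2f} GFlops")
print(f"Work per unit: {work_gflop:,.0f} GFlop")
print(f"4 Cells: {int(cell_seconds)} seconds = {minutes} minutes {seconds} seconds")
```

Changing `CELL_GFLOPS` or `NUM_CELLS` shows how sensitive the result is to the assumed peak rate, which is exactly why it should be treated as an estimate.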
Introduction and Index
Part 1: Inside The Cell
Part 2: Again Inside The Cell
Part 3: Cellular Computing
Part 4: Cell Vs the PC
Part 5: Conclusion and References
Part 6: Updates, Clarifications and Missing Bits
About the Author
Nicholas Blachford lives in Paris. He is currently loitering with intent on the Yoper Linux distro, learning French and writing a softsynth on OS X.
© Nicholas Blachford 2005.