This article has been updated, see Version 2
Cell Architecture Explained (Version 1) - Part 2: Again Inside The Cell
Stream Processing
A big difference between Cells and normal CPUs is the ability of the APUs in a Cell to be chained together to act as a stream processor [Stream]. A stream processor takes data and processes it in a series of steps, and each of these steps can be performed by one or more APUs. To do stream processing, an APU reads data from an input into its local memory, performs its processing step, then writes the result to a pre-defined part of RAM. A second APU then takes the data just written, processes it and writes it to a second part of RAM. This sequence can use many APUs, and APUs can read or write different blocks of RAM depending on the application. If the computing power is not enough, the APUs in other Cells can also be used to form an even longer chain.
Stream processing does not generally require large memory bandwidth but Cell will have it anyway. According to the patent each Cell will have access to 64 Megabytes directly via 8 bank controllers (it indicates this as an "ideal", the maximum may be higher). If the stream processing is set up to use blocks of RAM in different banks, the different APUs processing the stream can be reading and writing simultaneously to the different blocks.
So you think your PC is fast...
If overclocked sufficiently (over 3.0GHz) and using some very optimised code (SSE assembly), 5 dual core Opterons directly connected via HyperTransport should be able to achieve a similar level of stream processing performance to a single Cell. Admittedly, this is purely theoretical and it depends on the Cell achieving its performance goals and a "perfect" application being used, but it does demonstrate the sort of processing capability the Cell potentially has. The PlayStation 3 is expected to have 4 Cells.
General purpose desktop CPUs are not designed for high performance vector processing. They all have vector units on board in the shape of SSE or AltiVec, but these are integrated into the CPU and have to share its resources. The APUs are dedicated high speed vector processors and, with their own local memory, don't need to share anything other than main memory. Add to this the fact that there are 8 of them and you can see why their computational capacity is so large.
Such a large performance difference may sound completely ludicrous but it's not without precedent. In fact, if you own a reasonably modern graphics card your existing system is capable of a lot more than you think:
"For example, the nVIDIA GeForce 6800 Ultra, recently released, has been observed to reach 40 GFlops in fragment processing. In comparison, the theoretical peak performance of the Intel 3GHz Pentium4 using SSE instructions is only 6GFlops." [GPU] The 3D Graphics chips in computers have long been capable of very much higher performance than general purpose CPUs. Previously they were restricted to 3D graphics processing but since the addition of shaders people have been using them for more general purpose tasks [GPGPU], this has not been without some difficulties but Shader 4.0 parts are expected to be a lot more general purpose than before. Existing GPUs can provide massive processing power when programmed properly, the difference is the Cell will be cheaper and several times faster.
It is where multiple memory banks are being used and the APUs are working on compute-heavy streaming applications that the Cell will be working hardest. It's in these applications that the Cell may get close to its theoretical maximum performance and perform over an order of magnitude more calculations per second than any desktop processor currently available.
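The chain of processing steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `run_stream`, `scale`, `offset` and `clamp` are made up for the example, the `ram` list stands in for the pre-defined regions of RAM, and the stages run one after another here, whereas real APUs would work on different blocks at the same time.

```python
from typing import Callable, List, Sequence

# A stage is one processing step: the work one APU (or group of APUs)
# performs on a block of data. All names here are hypothetical.
Stage = Callable[[List[float]], List[float]]


def run_stream(blocks: Sequence[List[float]],
               stages: Sequence[Stage]) -> List[List[float]]:
    """Push each input block through every stage in order.

    `ram` models the pre-defined regions of shared RAM: slot i holds the
    block most recently written by stage i, which stage i + 1 then reads.
    """
    results = []
    for block in blocks:
        ram = [None] * len(stages)
        data = list(block)            # the first stage reads from the input
        for i, stage in enumerate(stages):
            local = list(data)        # copy into the stage's "local memory"
            ram[i] = stage(local)     # process, then write to RAM region i
            data = ram[i]             # the next stage reads that region
        results.append(data)
    return results


def scale(xs: List[float]) -> List[float]:
    return [2.0 * x for x in xs]


def offset(xs: List[float]) -> List[float]:
    return [x + 1.0 for x in xs]


def clamp(xs: List[float]) -> List[float]:
    return [min(x, 10.0) for x in xs]


out = run_stream([[1.0, 2.0], [4.0, 8.0]], [scale, offset, clamp])
print(out)  # [[3.0, 5.0], [9.0, 10.0]]
```

Adding a step to the stream is just a matter of appending another stage to the list, which mirrors the idea of lengthening the chain with more APUs.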
Hard Real Time Processing
Some stream processing needs to be timed exactly and this has also been considered in the design to allow "hard" real time data processing. An "absolute timer" is used to ensure a processing operation falls within a specified time limit. This is useful on its own but also ensures compatibility with faster next generation Cells, since the timer is independent of the processing itself. Hard real time processing is usually controlled by specialist operating systems such as QNX which are specially designed for it. Cell's hardware support for it means pretty much any OS will be able to support it to some degree. This will however only apply to tasks using the APUs, so I don't see QNX going away anytime soon.
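As a rough software analogy for the time budget idea, the sketch below runs a task and reports whether it finished inside a budget measured in seconds rather than processor cycles. The function name `run_with_deadline` and its parameters are invented for this example, and a check made after the fact in Python is of course not hard real time; it only illustrates why a budget expressed in absolute time is independent of how fast the processor is.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_deadline(task: Callable[[], T],
                      budget_s: float,
                      clock: Callable[[], float] = time.monotonic
                      ) -> Tuple[T, float, bool]:
    """Run `task` and report whether it finished inside its time budget.

    The budget is given in seconds of elapsed time, not in cycles, so the
    same budget still makes sense on a faster processor. `clock` can be
    swapped out, which makes the function easy to test.
    """
    start = clock()
    result = task()
    elapsed = clock() - start
    return result, elapsed, elapsed <= budget_s


# Example: a small summing task with a 5 millisecond budget.
result, elapsed, on_time = run_with_deadline(lambda: sum(range(1000)), 0.005)
print(result, on_time)
```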
The DMAC
The DMAC (Direct Memory Access Controller) is a very important part of the Cell as it acts as a communications hub. The PU doesn't issue instructions directly to the APUs but rather issues them to the DMAC, which takes the appropriate actions. This makes sense as the actions usually involve loading or saving data, and it also removes the need for direct connections between the PU and APUs.
As the DMAC handles all data going into or out of the Cell it needs to communicate via a very high bandwidth bus system. The patent does not specify the exact nature of this bus other than saying it can be either a normal bus or a packet switched network. The packet switched network will take up more silicon but will also have higher bandwidth; I expect they've gone with the latter since this bus will need to transfer 10s of Gigabytes per second. What we do know from the patent is that this bus is huge: the patent specifies it at a whopping 1024 bits wide.
At the time the patent was written it appears the architecture of the DMAC had not been fully worked out, so as well as two potential bus designs there are different designs for the DMAC itself. Distributed and centralised architectures for the DMAC are both mentioned.
It's clear to me that the DMAC is one of the most important parts of the Cell design. It doesn't do processing itself but has to contend with 10s of Gigabytes of data flowing through it every second to many different destinations. If speculation is correct the PS3 will have a 100 Gigabyte / second memory interface; if this is spread over 4 Cells, each DMAC will need to handle at least 25 Gigabytes per second. It also has to handle the memory protection scheme and be able to issue memory access orders, as well as handling communication between the PU and APUs. It needs to be not only fast, it will also be a highly complex piece of engineering.
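To put the DMAC figures in perspective, the arithmetic can be worked through as below. The inputs are the numbers quoted above (a 1024-bit bus, a speculated 100 Gigabytes per second shared by 4 Cells); the assumptions that the load is spread evenly and that every transfer uses the full bus width are mine, so treat the result as a rough lower bound rather than a specification.

```python
BUS_WIDTH_BITS = 1024        # bus width quoted from the patent
TOTAL_BANDWIDTH_GB_S = 100   # speculated PS3 memory bandwidth
CELLS = 4                    # Cells the PS3 is expected to have

# Assumption: the load is split evenly and each transfer fills the bus.
bytes_per_transfer = BUS_WIDTH_BITS // 8                     # 128 bytes
per_cell_gb_s = TOTAL_BANDWIDTH_GB_S / CELLS                 # 25.0 GB/s
transfers_per_s = per_cell_gb_s * 1e9 / bytes_per_transfer   # 195,312,500

print(f"{bytes_per_transfer} bytes per bus transfer")
print(f"{per_cell_gb_s:.1f} GB/s per DMAC")
print(f"{transfers_per_s / 1e6:.1f} million full-width transfers per second")
```

Under these assumptions each DMAC would need to complete roughly 195 million full-width transfers every second.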
Memory
As with everything else in the Cell architecture the memory system is designed for raw speed: it will have both low latency and very high bandwidth. As mentioned previously, memory is accessed in blocks of 1024 bits. The reason for this is not mentioned in the patent but I have a theory: while this may reduce flexibility, it also decreases memory access latency - the single biggest factor holding back computers today.
The reason it's faster is that the finer the address resolution, the more complex the logic and the longer it takes to look an address up. The actual look-up may be insignificant on the memory chip, but each one requires a look-up transaction which involves sending an address from the bank controller to the memory device, and this takes time. This time is significant in itself as there is one transaction per memory access, but what's worse is that every extra bit of address resolution doubles the number of look-ups required.
If you have 512MB in your PC your RAM look-up resolution is 29 bits*, however the system will read a minimum of 64 bits (8 bytes) at a time so the resolution is 26 bits. The PC will probably read more than this at once, so you can probably really say 23 bits.
* Note: I'm not counting I/O or graphics address space which will require an extra bit or two.
In the Cell design there are 8 banks of 8MB each, and if the minimum read is 1024 bits (128 bytes) the resolution is 16 bits. An additional 3 bits are used to select the bank but this is done on-chip so will have little impact. Each bit doubles the number of memory look-ups, so the PC will be doing over a hundred times (2^7 = 128) more memory look-ups than the Cell for the same amount of data. The Cell's memory busses will have more time free to transfer data and thus will work closer to their maximum theoretical transfer rate. I'm not sure my theory is correct but CPU caches use a similar trick.
What is not theoretical is the fact the Cell will use very high speed memory connections - Sony and Toshiba licensed 3.2GHz memory technology from Rambus in 2003 [Rambus].
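The address resolution arithmetic in the theory above can be checked with a short calculation. The helper `address_bits` is a name made up for this sketch, and the 64-byte PC read is an assumption chosen because it matches the 23-bit figure. Treating the 1024-bit block as 128 bytes, the sketch arrives at 16 bits per bank; a 1024-byte block would give 13.

```python
MB = 1024 * 1024


def address_bits(capacity_bytes: int, block_bytes: int) -> int:
    """Bits needed to select one block of `block_bytes` in `capacity_bytes`."""
    blocks = capacity_bytes // block_bytes
    return (blocks - 1).bit_length()


pc_per_byte = address_bits(512 * MB, 1)        # 29 bits
pc_per_64_bits = address_bits(512 * MB, 8)     # 26 bits
pc_per_line = address_bits(512 * MB, 64)       # 23 bits (assumed 64-byte read)
cell_per_bank = address_bits(8 * MB, 1024 // 8)  # 16 bits for 128-byte blocks

# Each extra bit of resolution doubles the number of look-ups.
lookup_ratio = 2 ** (pc_per_line - cell_per_bank)

print(pc_per_byte, pc_per_64_bits, pc_per_line, cell_per_bank)  # 29 26 23 16
print(f"PC performs {lookup_ratio}x as many look-ups")          # 128x
```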
If each Cell has a total bandwidth of 25.6 Gigabytes per second, each bank transfers data at 3.2 Gigabytes per second. Even given this the buses are not large (64 data pins for all 8 banks), which is important as it keeps chip manufacturing costs down.
100 Gigabytes per second sounds huge until you consider that top end graphics cards are in the region of 50 Gigabytes per second already; doubling over a couple of years sounds fairly reasonable. But these are just theoretical figures and they never get reached. Assuming the system I described above is used, the bandwidth on the Cell should be much closer to its theoretical figure than competing systems and thus it will perform better.
APUs may need to access memory from different Cells, especially if a long stream is set up, thus the Cells include a high speed interconnect. Details of this are not known other than that it transfers data at 6.4 Gigabits / second per wire. I expect there will be busses of these wires between Cells to facilitate high speed transfer of data from one to another. This technology sounds not entirely unlike HyperTransport though the implementation may be very different.
In addition to this a switching system has been devised so that if more than 4 Cells are present they too can have fast access to memory. This system may be used in Cell based workstations. It's not clear how more than 8 Cells will communicate but I imagine the system could be extended to handle more. IBM have announced that a single rack based workstation will be capable of up to 16 TeraFlops; they'll need 64 Cells for this sort of performance so they have obviously found some way of connecting them.
Memory Protection
The memory system also has a memory protection scheme implemented in the DMAC. Memory is divided into "sandboxes" and a mask is used to determine which APU or APUs can access each one. This check is performed in the DMAC before any access takes place; if an APU attempts to read or write the wrong sandbox the memory access is forbidden.
Existing CPUs include hardware memory protection systems but they are a lot more complex than this. They use page tables which indicate the use of blocks of RAM and also indicate whether the data is in RAM or on disc. These tables can become large and don't fit on the CPU all at once, which means that in order to read a memory location the CPU may first have to read a page table from memory and read data in from disc - all before the data required is read. In the Cell an APU's memory access is either allowed or it is not; the table is held in a special SRAM in the DMAC and is never flushed. This system may lack flexibility but it is very simple and consistently very fast. This simple system most likely only applies to the APUs; I expect the PU will have a conventional memory protection system.
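The sandbox check can be pictured with a small sketch. The patent's actual table layout is not given here, so the `Sandbox` fields, the `Dmac` class and the one-bit-per-APU mask encoding are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sandbox:
    start: int      # first address covered by the sandbox
    size: int       # length in bytes
    apu_mask: int   # assumed encoding: bit n set means APU n may access it


class AccessDenied(Exception):
    """Raised when an APU touches memory outside its sandboxes."""


class Dmac:
    def __init__(self, sandboxes: Iterable[Sandbox]) -> None:
        # A small fixed table, standing in for the SRAM held in the DMAC.
        self._table = tuple(sandboxes)

    def check(self, apu_id: int, address: int) -> Sandbox:
        """Return the sandbox for `address`, or refuse the access."""
        for box in self._table:
            if box.start <= address < box.start + box.size:
                if box.apu_mask & (1 << apu_id):
                    return box
                raise AccessDenied(f"APU {apu_id} may not access {address:#x}")
        raise AccessDenied(f"address {address:#x} is in no sandbox")


dmac = Dmac([
    Sandbox(start=0x0000, size=0x400, apu_mask=0b0011),  # APUs 0 and 1
    Sandbox(start=0x0400, size=0x400, apu_mask=0b0100),  # APU 2 only
])
dmac.check(apu_id=0, address=0x0100)      # allowed
try:
    dmac.check(apu_id=2, address=0x0100)  # wrong sandbox
except AccessDenied as err:
    print("forbidden:", err)
```

The check is a single range comparison and a bit test with no page table to fetch, which is the source of the simplicity and consistent speed described above.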
Software Cells
Software Cells are containers which hold data and programs called apulets, as well as other data and instructions required to get the apulet running (memory required, number of APUs used etc.). The Cell contains source, destination and reply address fields; the nature of these depends on the network in use, so software Cells can be sent around to different hardware Cells. There are also network independent addresses which define the specific Cell exactly. This allows you to, say, send a software Cell to a hardware Cell in a specific computer on a network.
The APUs use virtual addresses but these are mapped to real addresses as soon as DMA commands are issued. The software Cell contains these DMA commands, which retrieve data from memory to process; if APUs are set up to process streams the Cell will contain commands which describe where to read data from and where to write results to. Once set up, the APUs are "kicked" into action.
It's not clear how this system will operate in practice but it would appear to include some adaptivity so as to allow Cells to appear and disappear on a network. This system is in effect a basic Operating System but could be implemented as a layer within an existing OS. There's no reason to believe Cell will have any limitations regarding which Operating Systems can run.
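One way to picture a software Cell is as a plain record carrying the fields listed above. The sketch below is not the patent's layout: every field name, the `DmaCommand` type and the `fits` helper are hypothetical and exist only to make the description concrete.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DmaCommand:
    op: str                # "read" or "write"
    virtual_address: int   # mapped to a real address when issued
    length: int            # bytes to transfer


@dataclass
class SoftwareCell:
    source: str            # network-dependent address fields
    destination: str
    reply_to: str
    cell_id: str           # network-independent identity
    apulet: bytes          # the program to run on the APUs
    memory_required: int   # bytes
    apus_required: int
    dma_commands: List[DmaCommand] = field(default_factory=list)


def fits(cell: SoftwareCell, free_apus: int, free_memory: int) -> bool:
    """Can a hardware Cell with these free resources run this software Cell?"""
    return (cell.apus_required <= free_apus
            and cell.memory_required <= free_memory)


job = SoftwareCell(
    source="10.0.0.2", destination="10.0.0.7", reply_to="10.0.0.2",
    cell_id="cell-0001", apulet=b"\x00\x01", memory_required=4096,
    apus_required=2,
    dma_commands=[DmaCommand("read", 0x1000, 128),
                  DmaCommand("write", 0x2000, 128)],
)
print(fits(job, free_apus=8, free_memory=65536))  # True
print(fits(job, free_apus=1, free_memory=65536))  # False
```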
Multi-Cell'd animals
One of the main points of the entire Cell architecture is parallel processing.
Software Cells can be sent pretty much anywhere and don't depend on a specific means of transport. The ability of software Cells to run on hardware Cells determined at runtime is a key feature of the Cell architecture. Want more computing power? Plug in a few more Cells and there you are. If you have a bunch of Cells sitting around talking to each other via WiFi connections the system can use them to distribute software Cells for processing.
The system was not designed to act like a big iron machine, that is, it is not arranged around a single shared or closely coupled set of memories. All the memory may be addressable but each Cell has its own memory, and Cells will work most efficiently in their own memory or at least in small groups where fast inter-links allow the memory to be shared. Going above this number of Cells isn't described in detail, but the mechanism in the software Cells for making use of whatever networking technology is in use allows ad-hoc arrangements of Cells to be made without having to rewrite software to take account of different network types.
The parallel processing system essentially takes a lot of complexity which would normally be handled by hardware and moves it into software. This usually slows things down but the benefit is flexibility: you give the system a set of software Cells to compute and it figures out how to distribute them itself. If your system changes (Cells added or removed) the OS should take care of this without user or programmer intervention. Writing software for parallel processing is usually highly difficult and this helps get around the problem. You still, of course, have to parallelise the program into Cells, but once that's done you don't have to worry whether you have one Cell or ten.
In the future, instead of having multiple discrete computers you'll have multiple computers acting as a single system. Upgrading will not mean replacing an old system anymore, it'll mean enhancing it. What's more, your "computer" may in reality also include your PDA, TV and camcorder, all co-operating and acting as one.
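How the distribution is actually decided is not described, so the following is purely an illustrative guess: a greedy scheme that hands each software Cell to whichever hardware Cell currently has the least work. The function name `distribute` and the idea of a per-job cost are assumptions. The point is only that the same list of jobs runs unchanged whether one hardware Cell is available or several.

```python
from typing import Dict, List, Sequence


def distribute(job_costs: Sequence[int],
               cell_names: Sequence[str]) -> Dict[str, List[int]]:
    """Assign each job (by index) to the currently least loaded Cell.

    Greedy least-loaded placement is an assumption for this sketch, not
    the mechanism described in the patent. Ties go to the Cell listed first.
    """
    if not cell_names:
        raise ValueError("at least one hardware Cell is required")
    load = {name: 0 for name in cell_names}
    plan: Dict[str, List[int]] = {name: [] for name in cell_names}
    for job, cost in enumerate(job_costs):
        target = min(cell_names, key=lambda name: load[name])
        plan[target].append(job)
        load[target] += cost
    return plan


jobs = [5, 3, 2, 4]
print(distribute(jobs, ["cell-a"]))            # {'cell-a': [0, 1, 2, 3]}
print(distribute(jobs, ["cell-a", "cell-b"]))  # {'cell-a': [0, 3], 'cell-b': [1, 2]}
```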
Concrete Processing
The Cell architecture goes against the grain in many areas but in one area it has gone in the complete opposite direction to the rest of the technology industry. Operating systems started as a rudimentary way for programs to talk to hardware without developers having to write their own drivers every time. As time went on operating systems evolved and took on a wide variety of complex tasks, and one way they did this was by abstracting more and more away from the hardware. Object oriented programming goes further and abstracts individual parts of programs away from each other. This has evolved into Java-like technologies which provide their own environment, thus abstracting the application away from the individual operating system. Web technologies do the same thing: the platform which is serving you this page is completely irrelevant, as is the platform viewing it. When writing this I did not have to make a Windows or Mac specific version of the HTML; the underlying hardware, OSs and web browsers are completely abstracted away.
Even hardware manufacturers have taken to abstraction. The Transmeta line of CPUs are sold as x86 CPUs but in reality they are not. They provide an abstraction in software which hides the inner details of the CPU, which is not only not x86 but a completely different architecture. This is not unique to Transmeta or even x86: the internal architecture of most modern CPUs is very different from their programming model.
If there is a law in computing, abstraction is it. It is an essential piece of today's computing technology and much of what we do would not be possible without it. Cell, however, has abandoned it. The programming model for the Cell will be concrete: when you program an APU you will be programming what is in the APU itself, not some abstraction. You will be "hitting the hardware" so to speak. While this may sound like sacrilege, and there are reasons why it is a bad idea in general, there is one big advantage: performance.
Every abstraction layer you add adds computations, and not by some small measure: an abstraction can decrease performance by as much as a factor of ten. Consider that in any modern system there are multiple abstraction layers on top of one another and you'll begin to see why a 50MHz 486 may have seemed fast years ago but runs like a dog these days; you need a more modern processor to deal with the abstractions added since.
The big disadvantage of removing abstractions is that it significantly adds complexity for the developer and limits how much the hardware designers can change the system. The latter has always been important and is essentially THE reason for abstraction, but if you've noticed, modern processors haven't really changed much in years. The Cell designers obviously don't expect their architecture to change significantly so have chosen to set it in stone from the beginning. That said, there is some flexibility in the system so it can change at least partially.
The Cell approach does give some of the benefits of abstraction though. Java has achieved cross platform compatibility by abstracting the OS and hardware away: it provides a software based "virtual machine" which is the same on all platforms, so the underlying hardware and OS can change but the virtual machine does not. Cell provides something similar but in a completely different way. It provides a machine as well, but does it in hardware: the equivalent of Java's virtual machine is the Cell's physical hardware. If I was to write Cell code on OS X the exact same Cell code would run on Windows, Linux or Zeta because in all cases it is the hardware Cells which execute it.
It should be pointed out that this does not mean you have to program the Cells in assembly; Cells will have compilers just like everything else. Java provides a virtual machine but you don't program it directly either.
DRM In The Hardware
Some will no doubt be turned off by the fact that DRM is built into the Cell hardware. Sony is a media company and, like the rest of the industry, that arm of the company is no doubt pushing for DRM type solutions. It must also be noted that the Cell is destined for HDTV and Blu-ray / HD-DVD systems. Any high definition recorded content is going to be very strictly controlled by DRM, so Sony have to add this capability, otherwise they would effectively be locking themselves out of a large chunk of their target market. Hardware DRM is no magic bullet however: hardware systems have been broken before, including set top boxes and even IBM's crypto hardware for their mainframes.
Other Options And The Future
There are plans for future technology in the Cell architecture. Optical interconnects appear to be planned; it's doubtful that these will appear in the PS3 but clearly the designers are planning for the day when copper wires hit their limit (thought to be around 10GHz). Materials other than silicon also appear to be under consideration for fabrication but this will be an even bigger undertaking.
The design of Cells is not entirely set in stone: there can be variable numbers of APUs and the APUs themselves can include more floating point or integer calculation units. In some cases APUs can be removed and other things such as I/O units or a graphics processor put in their place. Nvidia are providing the graphics hardware for the PS3 so this may be done within a modified Cell at some point.
As Moore's law moves forward and we get yet more transistors per chip I've no doubt the designers will take advantage of this. The idea of having 4 Cells per chip is mentioned in the patent but there are also other options for different applications of the Cell.
When multiple APUs are operating on streaming data it appears they write to RAM and read back again. It would be perfectly feasible, however, to add buffers to allow direct APU to APU writes. Direct transfers are mentioned in the patent but nothing much is said about them.
To Finish Up
The Cell architecture is essentially a general purpose PowerPC CPU with a set of 8 very high performance vector processors and a fast memory and I/O system. This is coupled with a very clever task distribution system which allows ad-hoc clusters to be set up.
What is not immediately apparent is the aggressiveness of the design. The lack of cache and of a runtime virtual memory system is highly unusual and has not been done on any modern general purpose CPU in the last 20 years. It can only be compared with the sorts of designs Seymour Cray produced. The Cell is not only going to be very fast, but because of the highly aggressive design the rest of the industry is going to have a very hard time catching up with it*.
To sum up there's really only one way of saying it:
This system isn't just going to rock, it's going to play German heavy metal.
* I expand on this line of thought in Part 4
© Nicholas Blachford 2005.