Cell Architecture Explained Version 2

Part 2: Again Inside The Cell

Stream Processing

One big difference between the Cell and conventional CPUs is the ability of the SPEs in a Cell to be chained together to act as a stream processor [Stream].  A stream processor takes data and processes it in a series of steps.

A Cell processor can be set up to perform streaming operations in a sequence, with one or more SPEs working on each step.  To do stream processing, an SPE reads data from an input into its local store, performs the processing step, then stores the result back into its local store.  The second SPE reads the output from the first SPE's local store, processes it, and stores the result in its own output area.

This sequence can use many SPEs, and the SPEs can access different blocks of memory depending on the application.  If that is still not enough computing power, the SPEs in other Cells can also be used to form an even longer chain.

Stream processing does not generally require large memory bandwidth, but the Cell has it anyway, and on top of this the internal interconnect allows multiple simultaneous communication streams between SPEs so they don't hold each other up.
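As a rough sketch of the pattern, here is what one stage of such a chain might look like in C.  The DMA helpers and buffer sizes are hypothetical stand-ins of my own; the real mechanism is the SPEs' DMA engines moving blocks between local stores (synchronisation between stages is omitted for brevity):

    /* Hypothetical sketch of one stage in an SPE processing chain.
       dma_get / dma_put stand in for the SPE's real DMA facilities. */
    #define BLOCK 16384                    /* bytes per data block */

    static char in_buf[BLOCK];             /* both buffers live in this */
    static char out_buf[BLOCK];            /* SPE's 256K local store    */

    extern void dma_get(void *dst, unsigned long src, int n);  /* assumed */
    extern void dma_put(unsigned long dst, void *src, int n);  /* assumed */
    extern void process(char *out, const char *in, int n);     /* this stage's work */

    void stage(unsigned long prev_out, unsigned long my_out, int nblocks)
    {
        int i;
        for (i = 0; i < nblocks; i++) {
            dma_get(in_buf, prev_out, BLOCK);  /* pull from previous SPE's store */
            process(out_buf, in_buf, BLOCK);   /* perform this step              */
            dma_put(my_out, out_buf, BLOCK);   /* publish for the next SPE       */
        }
    }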

[Figure: Stream.gif - data streaming through a chain of SPEs]

So you think your PC is fast...

It is when the SPEs are working on compute heavy streaming applications that the Cell will be working hardest.  It's in these applications that the Cell may get close to its theoretical maximum performance and perform an order of magnitude more calculations per second than any desktop processor currently available.

On the other hand, if the stream uses large amounts of bandwidth and the data blocks can fit into the local stores, the performance difference might actually be bigger.  Even if conventional CPUs are capable of processing the data at the same rate, the data flowing between the CPUs will be held up by the chip to chip transfers.  The Cell's internal interconnect allows transfers running into hundreds of Gigabytes per second; chip to chip interconnects allow transfers in the low tens of Gigabytes per second.

While conventional processors have vector units on board (SSE or VMX / AltiVec), they are not dedicated vector processors.  The vector processing capability is an add-on to the existing instruction sets and has to share the CPU's resources.  The SPEs are dedicated high speed vector processors and, with their own local stores, don't need to share anything other than main memory (and not even that if the data can fit in the local stores).  Add to this the fact that there are 8 of them and you can see why their potential computational capacity is so large.
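To put a rough number on that capacity: assuming each SPE can issue one 4-way single precision fused multiply-add per cycle (8 floating point operations), and taking the 4 GHz clock speed being talked about at the time:

    8 flops/cycle x 4 GHz = 32 GFlops per SPE
    32 GFlops x 8 SPEs    = 256 GFlops per Cell

Compare that with the 6 GFlops quoted for a 3GHz Pentium 4 below.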

Such a large performance difference may sound completely ludicrous but it's not without precedent; in fact, if you own a reasonably modern graphics card, your existing system is already capable of similar processing feats:

  "For example, the Nvidia GeForce 6800 Ultra, recently released, has been observed to reach 40 GFlops in fragment processing. In comparison, the theoretical peak performance of the Intel 3GHz Pentium4 using SSE instructions is only 6GFlops."   [GPU]

“GPUs are >10x faster than CPU for appropriate problems”   [GPU10]

The 3D graphics chips in computers have long been capable of very much higher performance than general purpose CPUs.  Previously they were restricted to 3D graphics processing, but since the addition of vertex and pixel shaders people have been using them for more general purpose tasks [GPGPU].  This has not been without difficulties, but Shader 4.0 parts are expected to be even more general purpose than before.

Existing GPUs can already provide massive processing power when programmed properly, but this is not exactly an easy task.  The difference with the Cell is that it will be cheaper, considerably easier to program and usable for a much wider class of problems.

The EIB and DMAC

The original patent application describes a DMAC (Direct Memory Access Controller) which controlled memory access for the SPEs / PPE (then known as the APUs and the PU) and connected everything with a 1024 bit bus.  The original DMAC design also included the memory protection system.

The DMAC has changed considerably as the system evolved from the original design into the final product.  The final version is quite different from the original but many parts still exist in one way or another.  A DMAC still exists and controls memory access but it no longer controls memory protection.  The 1024 bit interconnect bus was also replaced with a series of smaller ring busses called the EIB (Element Interconnect Bus).

The EIB consists of 4 x 16 byte rings which run at half the CPU clock speed, and each ring can carry up to 3 simultaneous transfers.  The theoretical peak of the EIB is 96 bytes per cycle (384 Gigabytes per second); however, according to IBM, only about two thirds of this is likely to be achieved in practice.
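The peak figure comes from straightforward arithmetic (the Gigabyte figure assumes the 4 GHz clock speed being quoted at the time; the bytes-per-cycle figure holds at any clock):

    4 rings x 3 transfers x 16 bytes = 192 bytes per bus cycle
    192 bytes at half the CPU clock  = 96 bytes per CPU cycle
    96 bytes x 4 GHz                 = 384 Gigabytes per second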

The original 1024 bit bus mentioned in the patent no longer connects the system together, but it still exists between the I/O buffer and the local stores.  The memory protection system was replaced completely by MMUs and moved into the SPEs.

It's clear to me that the DMAC and EIB combination is one of the most important parts of the Cell design.  It doesn't do any processing itself, but it has to contend with potentially hundreds of Gigabytes of data flowing through it at any one time to many different destinations, as well as handling a queue of some 128 memory requests.

Memory and I/O

All the internal processing units need to be fed, so a high speed memory and I/O system is an absolute necessity.  For this purpose Sony and Toshiba licensed the high speed "Yellowstone" and "Redwood" technologies from Rambus [Rambus]; these are used in the XDR RAM and FlexIO interfaces.

Both of these are interesting technologies, not only for their raw speed but also because they have been designed to simplify board layouts.  Engineers spend a lot of time making sure wires on motherboards are all exactly the same length so signals are synchronised.  FlexIO and XDR RAM both have a technology called "FlexPhase" [FlexPhase] which allows signals to arrive at different times, reducing the need for the wires to be exactly the same length; this will make life considerably easier for board designers working with the Cell.

As with everything else in the Cell architecture, the memory system is designed for raw speed: it will have both low latency and very high bandwidth.  As mentioned previously, the SPEs access memory in blocks of 128 bytes.  It's not clear how the PPE accesses memory, but 128 bytes happens to be a common cache line size so it may do the same.

The Cell will use high speed XDR RAM for main memory.  A Cell has a memory bandwidth of 25.6 Gigabytes per second, which is considerably higher than any PC but necessary as the SPEs will eat as much memory bandwidth as they can get.  Even so, the buses used are not large (72 data pins in total), which is important as it keeps chip manufacturing costs down.  The Cell runs its memory interface at 3.2 Gigabits per second per pin, though memory in production now is already capable of higher speeds than this.  XDR is designed to scale to 6.4 Gigabits per second, so memory bandwidth has the potential to double.
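The bandwidth follows from the pin count and signalling rate.  Assuming 64 of the 72 pins carry data (the split is my assumption; the remainder would carry control or error checking signals):

    64 pins x 3.2 Gigabits per second = 204.8 Gigabits per second
    204.8 Gigabits per second / 8     = 25.6 Gigabytes per second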

Memory Capacity

The total memory that can be attached is variable as the XDR interface is configurable; Rambus' site shows how 1GB can be connected [Xdimm].  Theoretically an individual Cell can be attached to many Gigabytes of memory depending on the density of the RAM chips in use, though this may involve using one pin per physical memory chip, which reduces bandwidth.

SPEs may need to access memory on different Cells, especially if a long stream is set up, so the Cells also include a high speed interconnect.  This consists of a set of 12 x 8 bit busses which run at 6.4 Gigabits per second per wire (76.8 Gigabytes per second in total).  The busses are directional, with 7 going out and 5 coming in.
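Those figures break down as follows:

    7 outgoing busses x 8 wires x 6.4 Gigabits per second = 44.8 Gigabytes per second out
    5 incoming busses x 8 wires x 6.4 Gigabits per second = 32.0 Gigabytes per second in
    Total                                                 = 76.8 Gigabytes per second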

The current system allows 2 Cells to be connected glue-less (i.e. without additional chips).  Connecting more Cells requires an additional chip.  This differs from the patent, which specified that 4 Cells could be directly connected and a further 4 connected via a switch.

IBM have announced a blade system made up of a series of dual Cell "workstations".  The system is rated at up to 16 TeraFlops, which will require 64 Cells (250 GigaFlops per Cell).

Memory Management Units

Memory management is used to stop programs interfering with each other and to move data to disc when it is not in use.

The original Cell patent had two mechanisms for memory protection, one of which was very simple and very fast while the second was very complex and slow.  These have been replaced by a more conventional system used when accessing main memory.

While there are no protection or coherency mechanisms within the local stores (again for simplicity and speed), the PPE and SPEs do contain Memory Management Units (MMUs) which are used when accessing main memory or other SPEs' local stores.

These act just like those in a conventional multiprocessor system and as such are much more sophisticated than the method described in the original patent.  Multiprocessor systems are an area IBM are long experienced in, so this change appears to have come from them.  While this may be slower than the fast system in the patent, the difference in practice is likely to be insignificant.

Processing Concrete

The Cell architecture goes against the grain in many areas but in one area it has gone in the complete opposite direction to the rest of the technology industry.

Operating systems started as a rudimentary way for programs to talk to hardware without developers having to write their own drivers every time.  As time went on, operating systems evolved and took on a wide variety of complex tasks; one way they have done this is by abstracting more and more away from the hardware.

Object oriented programming goes further and abstracts individual parts of programs away from each other.  This has evolved into Java-like technologies which provide their own environment, thus abstracting the application away from the individual operating system; many other languages including C#, Perl and Python do the same.  Web technologies do the same thing: the platform which is serving you this page is completely irrelevant, as is the platform viewing it.  When writing this I did not have to make a Windows or Mac specific version of the HTML; the underlying hardware, OSs and web browsers are completely abstracted away.

If there is a law in computing, abstraction is it.  It is an essential piece of today's computing technology; much of what we do would not be possible without it.

Cell, however, has gone against the grain and actually removed a level of abstraction.  The programming model for the Cell will be concrete: when you program an SPE you will be programming what is in the SPE itself, not some abstraction.  You will be "hitting the hardware" so to speak.  The SPE programming model will include 256K of local store and 128 registers, and the SPE itself will include 128 registers and 256K of local store, no less, no more.
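As a minimal illustration of what this means for code (the buffer names are my own invention, but the constraint is real), an SPE program's code, stack and data must together fit in the 256K local store, because that is all the memory the SPE directly sees:

    /* Hypothetical SPE-side code: everything here lives in the 256K
       local store; there is no cache or virtual memory behind it. */
    static float input_buf[16 * 1024];    /* 64K */
    static float output_buf[16 * 1024];   /* 64K */

    int main(void)
    {
        int i;
        /* code + stack + buffers must fit in 256K; exceed that and
           the program simply cannot be loaded into the SPE. */
        for (i = 0; i < 16 * 1024; i++)
            output_buf[i] = input_buf[i] * 2.0f;
        return 0;
    }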

In modern x86 or PowerPC / POWER designs the number of registers physically present is different from the number in the programming model; in the Transmeta CPUs it's not only the registers, the entire internal architecture is different from what you are programming.

While this may sound like sacrilege, and there are reasons why it is a bad idea in general, there is one big advantage: performance.  Every abstraction layer you add costs computation, and not by some small measure; an abstraction can decrease performance ten fold.  Consider that in any modern system there are multiple abstraction layers on top of one another and you'll begin to see why a 50MHz 486 may have seemed fast years ago but runs like a small dog these days: you need a more modern processor to deal with the abstractions that have been added since.

The big disadvantage of removing abstractions is that it adds complexity for the developer and limits how much the hardware designers can change the system.  The latter has always been important and is essentially THE reason for abstraction, but modern processors haven't really changed much in years.  AMD64 is a big improvement over the Athlon but the overall architecture is actually very similar, and Intel's Pentium-M traces its lineage right the way back to the Pentium Pro.  The new dual core devices from AMD, Intel and IBM have not changed the internal architecture at all.

The Cell designers obviously don't expect their architecture to change much either, so they have chosen to set it in stone from the beginning.  That said, there is some flexibility in the system so it can still evolve over time.

The Cell approach does give some of the benefits of abstraction though.  Java achieves cross platform compatibility by abstracting the OS and hardware away: it provides a "virtual machine" which is the same across all platforms, so the underlying hardware and OS can change but the virtual machine does not.

Cell does this in a completely different way.  Java provides a software based "virtual machine" which is the same on all platforms; Cell provides a machine as well, but in hardware - the equivalent of Java's virtual machine is the Cell's physical hardware.  If I were to write code for SPEs on OS X, the exact same Cell code would run on Windows, Linux or Zeta, because in all cases it is the hardware Cells which execute it.

This does not, however, mean you have to program the Cells in assembly; Cells have compilers just like everything else.  Java provides a "machine" but you don't program it directly either.

By actually providing 128 real registers (32 for the PPE) and 256K of local store they have made life more difficult for compiler writers, but in some respects they've also made it easier: it'll be a lot easier to figure out what's going on and thus what to optimise.  There's no need to try to figure out what will or will not be put in rename registers, since these don't exist.

Hard Real Time Processing

Some stream processing needs to be timed exactly, and this has also been considered in the design to allow "hard" real time data processing.  An "absolute timer" is used to ensure a processing operation falls within a specified time limit.  This is useful on its own but also ensures compatibility with faster next generation Cells, since the timer is independent of the processing itself.
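As a rough illustration of the idea (the timer function below is a hypothetical stand-in; the actual interface had not been published at the time of writing), code checks an absolute time source against a deadline rather than counting its own cycles, so the same binary behaves correctly on a faster chip:

    #include <stdint.h>

    extern uint64_t read_absolute_timer(void);  /* assumed: ticks at a fixed
                                                   rate, independent of clock */
    extern int process_one_block(void);         /* assumed: returns 0 when done */

    /* Process until finished or the deadline passes.  Because the timer rate
       is fixed, a faster Cell simply completes more blocks within the same
       wall-clock budget. */
    int process_with_deadline(uint64_t deadline)
    {
        while (read_absolute_timer() < deadline) {
            if (process_one_block() == 0)
                return 0;   /* finished within the time limit */
        }
        return -1;          /* deadline reached first */
    }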

Hard real time processing is usually controlled by specialist operating systems such as QNX which are specially designed for it.  Cell's hardware support means pretty much any OS will be able to support it to some degree or another.  This will not, however, magically provide everything an RT OS provides (they are designed for bomb proof reliability), so the likes of QNX won't be going away from safety critical areas anytime soon.

An interesting aspect of the Cell is that it includes IBM's virtualisation technology [Hyper], which allows multiple operating systems to run simultaneously.  This ability can be combined with the real time capability, allowing a real time OS to run alongside a non-real time OS.  Such an ability could be used in industrial equipment, with the real time OS performing the critical functions while a secondary non-real time OS acts as a monitoring, display and control system.  This could lead to all sorts of interesting hybrid operating systems - how about Linux doing the GUI and I/O while QNX handles the critical stuff?

To DRM or not to DRM?

Some will no doubt be turned off by the fact that DRM (Digital Rights Management) is said to be built into the Cell hardware.  Sony is a media company, and like the rest of the industry that arm of the company is no doubt pushing for DRM type solutions.  It must also be noted that the Cell is destined for HDTV and BluRay / HD-DVD systems.  Like it or not, all high definition recorded content is going to be very strictly controlled by DRM, so Sony have to add this capability; otherwise they would effectively be locking themselves out of a large chunk of their target market.  Hardware DRM is no magic bullet however; hardware systems have been broken before, including set top boxes and even IBM's crypto hardware for their mainframes.

But is it really DRM?

While the system is taken to mean DRM, I don't believe there is a specific scheme designed in; rather, I think it is more accurate to say the system has "hardware security" facilities.  Essentially there's a protection mechanism which can be used for DRM, but you could equally use it to protect your on-line banking.

The system enables an SPE to lock most of its local store for its own use only.  When this is done nothing else can access that memory, and in this case nothing really means nothing: neither the OS nor even the Hypervisor can access this memory while it is locked.  Only a small portion of the store remains unlocked so the SPE can still communicate with the rest of the system.

To give an example of how this could be useful, it could be used to address a perceived security weakness of the old WAP (Wireless Application Protocol) system - think internet for mobile phones (note: I don't know the current versions of the system so I'm referring to what was done in early versions).  Originally the phones didn't have enough computing power to use the web's encryption system, so a different system was used.  Unfortunately this meant that the data had to be decrypted from the web system and then re-encrypted into the WAP system, which poses an immediate security problem: if a hacker could get into the system and gain root access, they might be able to monitor the data while it was decrypted.

Cell could solve such a problem, as the decryption and re-encryption could be done in a protected memory segment.  Even if a hacker could get into the system, gaining root access would not help; even root cannot access that memory.  Even if they were to get into hypervisor space they still couldn't access the data; nothing gets in when it's protected.
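Since no programming interface for this has been released, the sketch below is entirely hypothetical; it only illustrates the shape of the idea: decrypt and re-encrypt inside the locked store, so plaintext never crosses the small unlocked communication area:

    /* Entirely hypothetical API - an illustration, not a real interface. */
    extern void lock_local_store(void);    /* assumed: seals the store     */
    extern void unlock_local_store(void);  /* assumed: opens it again      */
    extern void ssl_decrypt(const char *in, char *out, int n);  /* assumed */
    extern void wap_encrypt(const char *in, char *out, int n);  /* assumed */

    static char plaintext[4 * 1024];  /* exists only inside the locked store */

    void reencrypt(const char *ssl_in, char *wap_out, int n)
    {
        lock_local_store();    /* from here, neither root nor the hypervisor
                                  can read plaintext[]                      */
        ssl_decrypt(ssl_in, plaintext, n);
        wap_encrypt(plaintext, wap_out, n);
        unlock_local_store();  /* only ciphertext ever leaves the SPE       */
    }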

I doubt this system is still used by WAP gateways but if it is I’d expect WAP gateway manufacturers to suddenly become interested in Cell processors...

No details of how the system works have been released, and I doubt any system can be made un-hackable, but it doesn't seem like a direct attack would work in this case.

Other Options And The Future

There are plans for future technology in the Cell architecture.  Optical interconnects appear to be planned; it's doubtful these will appear anytime in the near future, but clearly the designers are planning for the day when copper wires hit their limit (thought to be around 10GHz).  Optical connections are quite rare and expensive at the moment but they look like they will become more common soon [Opti].

The design of Cells is not entirely set in stone: there can be variable numbers of SPEs, and the SPEs themselves can include more floating point or integer calculation units.  In some cases SPEs can be removed and other things such as I/O units or a graphics processor put in their place.  Nvidia are providing the graphics hardware for the PS3, so this may be done within a modified Cell at some point.

As Moore's law moves forward and we get yet more transistors per chip, I've no doubt the designers will take advantage of this.  The idea of having 4 Cells per chip is mentioned in the patent, but this isn't likely to happen for 3-4 years yet as they are too big to fit on a single reasonably priced chip right now.

The first version of the Cell has high double precision floating point performance but it could be higher and I expect a future version to boost it significantly, possibly by as much as 5x.

In Part 3...

Just how difficult is it going to be to code for this thing?  Not as bad as it appears.

In part 3 I look at the different options for programming the Cell.

Introduction and Index

Part 1: Inside The Cell

Part 2: Again Inside The Cell

Part 3: Programming the Cell

Part 4: Revenge of the RISC

Part 5: Conclusion and References

© Nicholas Blachford 2005.