Cell Architecture Explained Version 2

Part 4: Revenge of the RISC

Branching Orders: Is The Cell General Purpose?

There has been a lot of debate about how the Cell will perform on general purpose code, with many saying it will not do well as it is a “specialised processor”.  This is not correct; the Cell was designed as a general purpose processor, but optimised for high compute tasks.  The PPE is a conventional processor and will act like one.  The big difference will be in the SPEs: they were designed to accelerate specific types of code and will be notably better in some areas than others, however even the SPEs are general purpose.  The Cell has in essence traded running everything at moderate speed for the ability to run certain types of code at high speed.

The PPE is a normal general purpose core and should have no problems on most code.  That said, the PPE has a simplified architecture compared to other desktop processors and this seems to be taken in some quarters as a reason to expect not just low performance but very low performance on general purpose code.  Few care to explain why this is, or even what this “general purpose” code is.  Some do make vague references to the lack of Out-Of-Order hardware or the limited branch prediction hardware.

While I am of the opinion that the relative simplicity may be a disadvantage in some areas, I doubt it’s going to make that much of a difference: what the PPE may lose in Instructions Per Cycle (IPC) it will make up for with a higher clock speed.  However, the PPE does not exist in isolation; the heavy lifting work will be handed off to the SPEs.  Even if there is a performance difference, removing the heavy duty work will make a bigger difference than any loss due to simplification.

As for the SPEs, not everything can be vectorised but that doesn’t mean the SPEs suddenly become useless.  Just because a single SIMD instruction can perform 4 operations at once doesn’t mean you have to use all 4.  There’s no reason why you can’t issue an instruction which performs one calculation and three nothings.  In this case the SPE will in effect become a dual issue, in-order RISC.  The SPEs should be quite capable of acting as general purpose CPUs even on non-vectorisable code.  How well they perform will largely be down to the programmer, compiler and memory usage pattern: if the memory usage is predictable they should do well, if not they will do badly.  They will be somewhat inefficient running some types of general purpose code, but general purpose applications contain plenty of small intensive loops and much of the processing is concentrated in them.  It is in these areas where the SPEs will be fast and efficient.
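
To make “one calculation and three nothings” concrete, here is a minimal sketch of scalar work running on a 4-lane SIMD datapath.  It uses GCC’s generic vector extensions in plain C rather than real SPU intrinsics, so the type and function names are purely illustrative:

#include <stdio.h>

/* Illustrative only: generic GCC vector extensions, not actual SPU intrinsics. */
typedef float v4sf __attribute__((vector_size(16)));   /* 4 x 32-bit floats */

/* Add two scalars with a full-width vector add: one useful result,
   three "nothings" in the unused lanes. */
static float scalar_add_on_simd(float a, float b)
{
    v4sf va = { a, 0.0f, 0.0f, 0.0f };
    v4sf vb = { b, 0.0f, 0.0f, 0.0f };
    v4sf vr = va + vb;       /* one SIMD instruction, one meaningful lane */
    return vr[0];
}

int main(void)
{
    printf("%f\n", scalar_add_on_simd(1.5f, 2.25f));
    return 0;
}

The hardware does the same amount of work whether one lane or all four carry useful data; the point is simply that scalar code still runs, it just doesn’t use the full width of the unit.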

RISC Strikes Back

The trend in CPUs over the last 15 years has been to increase performance by not only increasing the clock rate but also by increasing the Instructions Per Cycle.  The designers have used the additional transistors each process shrink has brought to create increasingly sophisticated machines which can execute more and more instructions in one go by executing them Out-Of-Order (OOO).  Most modern desktop CPUs do this, the exceptions being Transmeta’s and VIA’s processors and Intel’s Itaniums.  OOO is something of a rarity in the embedded world as power consumption considerations don’t allow it.

The PPE, however, is completely different.  It uses a very simple design and contains no OOO hardware; this is the complete opposite approach to IBM's last PowerPC core, the 970FX (aka G5).

The reason for the complete change in direction in design philosophy is the physical limitations CPU designers are now hitting.  OOO CPUs are highly complex and use large numbers of transistors to achieve their goals; these all require power and need to be cooled.  Cooling is becoming an increasingly difficult problem as transistors are beginning to leak electrons, making them consume power even when they are not in active use.  The problem has got to the point where all the desktop CPU manufacturers have pretty much given up trying to gain performance by boosting clock speeds and have taken the multi-core approach instead.  They haven’t taken to simplification yet but almost certainly will have to in the future.

While this may appear to be a new problem it’s not, it has been predicted and investigated for many years.

The following quote is from a research paper by IBM “master inventor” Philip Emma in 1996:

If the goal of a microarchitecture is low power ... only those features that pervasively provide low CPI* are included. Features that only help CPI sometimes (and that hurt cycle time all of the time) are eliminated if low power is a goal. Those elements should also be eliminated if high performance is a goal.

As the industry pushes processor design into the GHz range and beyond, there will be a resurgence of the RISC approach. While superscalar design is very fashionable, it remains so largely because its impact on cycle time is not well understood. Complex superscalar design stands in the path of the highest performance; he who achieves the highest MHz runs the fastest. [Emma]

*CPI: Cycles Per Instruction.  A large part of the referenced paper argues this is a much better measure than IPC.

IBM, like any large technology company, does research.  In the following year (1997), long before GHz or 64 bit CPUs arrived on the desktop, IBM developed an experimental 64 bit PowerPC which ran at 1GHz.  Its snappy title was guTS (GigaHertz unit Test Site) [guTS].

The guTS and a later successor were designed to test circuit design techniques for high frequency, not low power.  However, since it was only for research, the architecture of the CPU was very simple: unlike other modern processors it was in-order and could only issue a single instruction at a time.  The first version only implemented part of the PowerPC instruction set; a later version in 2000 implemented it all.

It turns out that the power consumption problem has become so bad that if you want high clock frequency you now have to simplify; there is simply no choice in the matter.  If you don’t simplify, the CPU will consume so much power it will become very difficult to cool and thus the clock speed will be limited.  Both IBM and Intel have discovered this rather publicly: try buying a 3GHz G5 or a 4GHz P4.

When a low power, high clocked general purpose core was required for the Cell, this simple experimental CPU, although designed without power constraints in mind, turned out to be perfect.  The architecture has since been considerably modified; the now dual issue, dual-threaded PPE is a descendant of the guTS.

The Xbox 360’s “Xenon” [Xbox360] processor cores also appear to be derived from the guTS processor, although they are not quite the same as the PPE.  In the Cell the PPE uses the PowerPC instruction set and acts as a controller for the more specialised SPEs.  The Xenon cores use a modified version of the PowerPC instruction set with additional instructions and a beefed up 128 register VMX unit.

Discrete PowerPC parts based on this technology have long been rumoured; some rumours even suggest POWER6 might use similar technology to get the high frequency boost it’s promising.  While rising clock speeds seem to have been declared dead by the rest of the industry, evidently somebody forgot to tell IBM... [Ultra]

There is an unwritten assumption that OOO hardware and large branch predictors (the PPE has relatively simple branch predictors) play a major part in CPU performance.  This is not the case; just throwing transistors at a CPU doesn’t massively raise its speed.  Gelsinger’s law states: “Doubling the number of transistors increases performance by 40%”.  However, most of this 40% does not come from OOO hardware; most of it can be traced directly to increasing cache sizes.

While OOO CPUs allow more instructions to run concurrently than the PPE, they can't sustain this higher issue rate for any length of time.  IPC is the average number of instructions a CPU actually executes per cycle.  Measured IPC varies wildly, from above 3 in some instances to well below 1 in others; the average figure is below 2 [LowIPC].  In accordance with this, both the PPE and SPEs are designed to issue 2 instructions per cycle.

OOO CPUs are specifically designed to increase IPC but they can only do this in some types of code.  Code which contains dependencies has to be executed in order; OOO is no help on this sort of code.  OOO may actually decrease performance in some areas since CPUs with aggressive OOO hardware cannot run at as high clock rates as in-order CPUs, so algorithms which scale with clock speed are held back.  OOO takes a lot of room and a lot of power but only increases performance in specific areas; increasing the clock speed increases the performance of everything.

While it may be possible to increase the IPC of the PPE with OOO* hardware, the performance gain would be at best limited.  The PPE could be replaced by something like a 970FX but this has a larger core, so either the die size would have to grow to around 250mm² or a pair of SPEs would have to be removed.  The 970 also consumes hefty amounts of power at high clock speeds - more than the entire Cell.  In order to fit a 970 and keep it cool enough to run in a PS3, the frequency would probably have to be reduced to well under 3GHz, probably under 2.5GHz.  The end result would be a CPU which runs hotter, slower and costs more to build.

*Hope you’re keeping up with all these TLAs (Three Letter Acronyms).

OOO was a good way to boost performance in the past but as power consumption has become a limiting factor the entire industry has been looking for new ways of boosting performance.  x86 vendors have taken to using 2 cores instead of attempting to boost clock speed or IPC further; future generations are expected to have 4 then 8 cores.  Even then I expect x86 vendors to simplify their OOO capabilities for the same reason, though I doubt OOO will be removed completely as x86 CPUs gain more from OOO hardware than PowerPCs due to the smaller number of architectural* registers [x86vsPPC].

*Architectural registers are internal registers the programmer can use directly.  OOO CPUs have many more “rename” registers but the programmer cannot directly use them.

Predicting Branches

Advanced branch prediction hardware is like OOO hardware: it only boosts performance sometimes.

As an example, let’s say the CPU is processing a 100 iteration loop and it takes 10 cycles to complete an iteration.  Execution at the end of the loop can be assumed to branch back to the beginning of the loop, so the loop continues without stalling the pipeline.  On the final iteration of the loop the guessed branch will be incorrect and the PPE will incur a 7 cycle penalty.  A large branch predictor might avoid this but in doing so saves just 7 cycles out of 1000.  In this case the branch predictor delivers a performance boost of less than 1%.
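
The arithmetic can be sketched in a few lines of C.  The cycle counts below are the illustrative figures from the example, not measurements of any real hardware:

#include <stdio.h>

/* Illustrative figures from the example above, not measurements. */
#define ITERATIONS          100
#define CYCLES_PER_ITER      10
#define MISPREDICT_PENALTY    7   /* paid once, on the final iteration */

int main(void)
{
    int total   = ITERATIONS * CYCLES_PER_ITER;           /* 1000 cycles */
    double loss = (double)MISPREDICT_PENALTY / total;     /* 7 / 1000    */
    printf("penalty = %.1f%% of total cycles\n", loss * 100.0);   /* 0.7% */
    return 0;
}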

The usefulness of the branch predictor is dependent on the type of code being run, but paradoxically the OOO hardware can act against it.  The branch predictor can only work efficiently if it remains ahead of the execution hardware in the instruction stream.  If the OOO hardware works effectively the gap closes and the CPU effectively runs out of instructions [Emma]; in that case the branch predictor sits doing nothing.

Branch predictors are not useless (they wouldn’t be used otherwise) so the PPE does include one.  However, it’s not as large as those found in some desktop processors.

Duel Threading

Unlike the SPEs, the PPE is dual threaded.  Both threads issue alternately but if one gets held up the other takes over.  Two instructions can be issued per cycle (four are fetched every other cycle).  Hardware multithreading is similar to Intel’s “Hyperthreading”; it is already present in the POWER5 and you can be sure it will feature in many other processors in the future.

While the lack of OOO and a large branch predictor will have some impact, the ability to run a second thread will make up for it at least partially, as the second thread can utilise the full execution resources while the first thread is waiting.
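
As a rough illustration, here is an ordinary POSIX-threads sketch of two software threads.  Nothing in it is PPE-specific; the point is simply that a dual-threaded core can interleave the two in hardware, so while one is waiting the other can use the execution units:

#include <pthread.h>
#include <stdio.h>

/* Each thread does some independent arithmetic to keep the execution
   units busy. */
static void *worker(void *arg)
{
    long id = (long)arg;
    double sum = 0.0;
    for (long i = 1; i <= 1000000; i++)
        sum += 1.0 / i;
    printf("thread %ld: %f\n", id, sum);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}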

The Law of Diminishing Returns

Single chip CPUs were first produced in 1971 with Intel’s 4004.  In the 34 years since then various features have been added to enhance performance and functionality.  

  1.  Single chip CPUs started at just 4 bits but rapidly went upwards through 8 and 16 bits to 1979’s Motorola 68000, a 32 bit processor.  The first 64 bit processors did not appear until 1992.

  2.  Cache was next to be added, first in very small quantities (a few bytes in the 68010) but this has been rising ever since.

  3.  Soon after, external devices such as Memory Management Units and Floating Point Units were also integrated.

  4.  The 68040 and 80486 delivered pipelining; this was the beginning of the integration of RISC technologies into CISC chips.  Pipelining allows the CPU to operate on different stages of different operations simultaneously - e.g. it can be reading one instruction while executing another.

  5.  Superscalar execution was next; this gave processors the ability to execute multiple instructions simultaneously.  It arrived in the 80586 (aka Pentium) and 68060.

  6.  OOO (Out of Order) execution appeared in the Pentium Pro.  Along with it came things like speculative execution and pre-fetching.

  7.  x86 got SIMD / Vector capabilities in increments from MMX onwards.  PowerPC got it in one go with the introduction of AltiVec.

  8.  Intel introduced Hyperthreading in a version of the Pentium 4.

  9.  Most recently AMD have led the way with point to point busses, 64 bits, on-die memory controllers and more recently dual cores.

While the above list focuses on desktop processors, most of these technologies were previously used in high end RISC processors, which had in turn taken them from the mainframe and supercomputer worlds.  Many of these technologies were developed or implemented in the 1960s.

As these technologies have been added the performance of CPUs has gone up.  However the vast majority of that performance has come not from architecture enhancements but from clock speed.

Some enhancements have made a lot more difference than others.  Cache in particular has made a very big difference due to the disparity between CPU clock speeds and memory speeds; this gap is only growing larger, so cache is becoming ever more important.

Going superscalar allows more than one instruction to be calculated at once and this potentially doubles performance; however, it will not happen all the time as dependencies mean instructions often have to be processed one at a time.
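
A tiny, purely illustrative C fragment shows the difference.  In the first function each add depends on the result of the previous one and must be processed serially; in the second the adds are independent and a superscalar core can issue them together:

/* Illustration only: the compiler and CPU see the dependencies, not us. */
int dependent(int b, int c, int e)
{
    int a = b + c;      /* must complete before the next line can start */
    int d = a + e;      /* depends on a, so the adds run one at a time  */
    return d;
}

int independent(int b, int c, int e, int f)
{
    int a = b + c;      /* no dependency between these two adds,         */
    int d = e + f;      /* so a superscalar core can issue them together */
    return a + d;
}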

OOO hardware attempts to raise the IPC to higher levels but fundamental constraints in software mean that is not consistently possible; average IPC still remains below 2.

This is the law of diminishing returns, each new feature adds performance but the improvement each time gets smaller and smaller.  Unfortunately the opposite happens with power consumption as each improvement also seems to get progressively more complex.  Going superscalar increases performance and power consumption but OOO hardware increases performance less and power consumption more.

In the Cell the choice was between an OOO design with large branch predictors or a simpler, smaller and higher clock speed design.  Complex features like OOO only give a relatively small performance boost; they increase power consumption disproportionately to the performance they add.  In order to attain the performance and power consumption aims, these sorts of features could not be used in the Cell design.

A reasonably sized register set along with a good compiler allows instructions to be scheduled so that dependencies have less of an impact.  The compiler can in effect do the job of OOO hardware; Intel’s Itanium has clearly shown just how effective a compiler can be at doing this.
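
As a rough sketch of what such scheduling looks like, the reduction below is written twice: once with a single accumulator, giving one long dependency chain, and once with four accumulators, giving four independent chains that can be overlapped.  The function names, and the assumption that n is a multiple of 4, are only for illustration:

/* One long dependency chain: every add waits for the previous one. */
float sum_serial(const float *x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent chains: the kind of schedule a compiler can produce
   when it has registers to spare.  Assumes n is a multiple of 4. */
float sum_unrolled(const float *x, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}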

A General Purpose Conclusion

Cell will not magically accelerate a general purpose system, it will require considerable work to get the best out of it.  It’s not even clear if anyone actually intends to build a desktop system using the Cell after Apple went off and got some “Intel inside”.

The Cell designers have deliberately produced a simple design in order to get around the heat problems complex designs now suffer [HPCA] .  This has allowed them to produce a processor with 9 cores while the rest of the industry is shipping just 2.  On top of that they have also been able to boost the clock speed.

In order to create this kind of system and get around the rising power consumption problems the industry is facing, the Cell designers have produced cores with a relatively simple architecture, removing some common features which, while useful, are not critical to performance.

The resulting simplicity will impact performance in some areas but a combination of higher clock speeds, new compiler technology and smart algorithm selection should largely get around these problems.

The SPEs will be particularly sensitive to the type of code running as they are more “tuned” than conventional processors.  In these cases it will be important to allocate the correct type of code to the right processor.  This is why the PPE and SPEs are different.

Introduction and Index

Part 1: Inside The Cell

Part 2: Again Inside The Cell

Part 3: Programming the Cell

Part 4: Revenge of the RISC

Part 5: Conclusion and References

© Nicholas Blachford 2005.