The Big Crunch: The Downside Of Multicore

Part 2: Solutions?

Various solutions to these problems exist, and different vendors are trying out different spins on them. None should be considered lightly, as not all of these options are possible with all architectures.

Multithreading

Intel’s “Hyperthreading” in the Pentium 4 is the best known name for hardware multithreading, but it also features in other processors such as POWER5, Intel’s Itanium “Montecito”, the Xbox 360’s Xenon and the Cell’s PPE. Hardware threading is used very aggressively, and to great effect, in Sun’s UltraSPARC T1.

Sun’s UltraSPARC T1 processor has 8 cores, each with support for 4 threads. These cores are very simple: they use in-order execution, and there is no speculative execution or branch prediction. There are small L1 caches (8K data, 16K instruction) per core and a shared 3MB L2 cache.

Sun’s answer is, in effect, to ignore the problem completely by targeting only applications which are highly parallel and not latency sensitive. The processor performs spectacularly well on these jobs despite having very simple cores with very few compute resources. These sorts of applications generally have very low IPC (Instructions Per Cycle), so the lack of resources has little or no effect.

The ability to have multiple threads running simultaneously can get around high latency because when one thread stalls waiting for data another thread can be activated.
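A rough way to see why this works is a toy model. It is my own simplification rather than anything taken from Sun's documentation: assume each thread computes for `compute_cycles`, then stalls for `stall_cycles` waiting on memory, and that switching threads is free. The function name and the cycle counts below are purely illustrative.

```python
def core_utilization(threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles a core spends doing useful work.

    Assumptions (illustrative only): every thread alternates between
    `compute_cycles` of work and `stall_cycles` of waiting for data, and
    switching to another ready thread costs nothing.
    """
    if threads < 1 or compute_cycles < 1 or stall_cycles < 0:
        raise ValueError("threads and compute_cycles must be >= 1, stall_cycles >= 0")
    period = compute_cycles + stall_cycles   # one thread's work/stall cycle
    busy = threads * compute_cycles          # work available in that period
    return min(1.0, busy / period)           # a core cannot be more than 100% busy


# Hypothetical workload: 25 cycles of work, then a 75 cycle memory stall.
for n in (1, 2, 4):
    print(n, "thread(s):", core_utilization(n, compute_cycles=25, stall_cycles=75))
```

Under these made-up numbers a single thread keeps the core busy only a quarter of the time, while four threads are enough to cover the stall completely.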

This ability is obviously less useful if you only want to run a single thread - the sort of thing you are likely to do on a desktop today. This will need to change though: according to Intel’s Pat Gelsinger, Intel has 4 to 8 core processors under development which execute up to 4 threads per core [Intel]. Sun’s 32 thread UltraSPARC T1 may seem unusual today but this will be the norm within a few years.

Don’t Increase The Cores

One solution to the problems with lots of cores is to not use lots of cores. It sounds obvious, but it means you need to find other methods of boosting performance, and these days that’s not easy.

IBM was the first to go multicore with the dual core POWER4 in 2001. POWER5 added multithreading capability but kept the same number of cores. The natural evolution would be to give up on frequency boosts but increase the number of cores like everyone else.

With POWER6 they are instead doing the exact opposite, keeping the number of cores at 2 while doubling the frequency. Boosting the frequency of a design as complex as POWER5 would make it run very hot indeed, so they’ve reduced the complexity of the POWER6 core to allow this. Much of this complexity is hardware designed to increase the number of instructions per cycle the core can complete. However, since IPC is strongly subject to the law of diminishing returns, much of it can be removed with relatively little effect on IPC.

By keeping to 2 cores there will be no additional complexity in cache coherence operations and thus no more latency. Boosting the frequency, on the other hand, is a direct attack on Amdahl’s law: since serial application components will run faster, all applications will benefit from this approach.
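To put some numbers on that, here is a minimal sketch of Amdahl's law with a clock multiplier bolted on. The `clock_scale` parameter is my own addition, and it assumes run time scales perfectly with frequency, which real memory-bound code will not manage. The 50% parallel fraction is a made-up example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int, clock_scale: float = 1.0) -> float:
    """Speedup relative to one core running at the base clock.

    Amdahl's law gives 1 / (serial + parallel / cores). Multiplying by
    `clock_scale` assumes, optimistically, that a faster clock speeds up
    both the serial and the parallel parts by the same factor.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1 or clock_scale <= 0:
        raise ValueError("cores must be >= 1 and clock_scale must be > 0")
    serial_fraction = 1.0 - parallel_fraction
    return clock_scale / (serial_fraction + parallel_fraction / cores)


# Hypothetical program that is only half parallel.
print("2 cores, base clock:  ", round(amdahl_speedup(0.5, 2), 2))
print("8 cores, base clock:  ", round(amdahl_speedup(0.5, 8), 2))
print("2 cores, double clock:", round(amdahl_speedup(0.5, 2, clock_scale=2.0), 2))
```

For a program like this, two cores at double the clock come out ahead of eight cores at the base clock, because the extra cores do nothing for the serial half.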

Rock With Aggressive Caching

Sun’s next generation “Rock” processor [Rock] will have a whopping 16 cores. The processor is divided into 4 and each of those 4 parts contains 4 “processing engines” which can handle 2 threads each. Each set of 4 engines shares a pair of 32K L1 caches. These in turn connect to a crossbar switch which hooks up to 4 x 512K L2 caches.

This is a somewhat unusual arrangement and at first glance looks relatively weak for single threaded applications, having shared and relatively small caches. On the other hand it looks like exactly the sort of design that’ll be necessary to keep latency from spiralling out of control. By sharing L1s between 4 cores they only have 4 sets of L1 caches, and keeping these coherent is a much less demanding job than keeping 16 sets of L1s coherent.

Keeping the L2 caches relatively small reduces their latency and by breaking them into 4 they can communicate with the L1s more efficiently. If there were a single large L2 the latency would be higher and it would be limited to communicating with a single set of L1s at any one time, leaving the rest waiting for data.

This sharing will, however, take a toll on single threaded applications, so Rock will not run them as fast as a processor with fewer cores. You can expect this to happen in pretty much all large scale multicore processors, so as core numbers increase in other processor families they too will face the same problems.
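One crude way to get a feel for the difference is to count how many pairs of cache sets could, in principle, disagree with each other. This is my own back-of-the-envelope measure, not a description of Rock's actual coherence protocol, and real snooping or directory schemes do not literally compare every pair.

```python
def coherence_pairs(cache_sets: int) -> int:
    """Number of distinct pairs among `cache_sets` caches.

    Used here only as a rough proxy for how coherence work can grow
    with the number of caches that must be kept in agreement.
    """
    if cache_sets < 0:
        raise ValueError("cache_sets must not be negative")
    return cache_sets * (cache_sets - 1) // 2


print("4 shared L1 sets:  ", coherence_pairs(4), "pairs")
print("16 private L1 sets:", coherence_pairs(16), "pairs")
```

By this measure, going from 4 sets to 16 multiplies the number of relationships by 20 rather than by 4.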

Another interesting technology Sun will be using in Rock is called “Run-ahead”. This is a software system designed to scan through running threads pre-fetching data before it is needed. Doing this will also help keep down memory latency.

Build A New Architecture

Only one architecture will not have these scaling issues. The STI Cell processor will largely get around them because the main computation engines, the Synergistic Processing Elements (SPEs), do not use cache.

The SPEs use a block of low latency on-board memory called “local store”. The crucial difference between it and cache is that it does not “represent” memory. Data may be copied from memory to the local store, but if it changes in memory there is no mechanism to change the local store copy: there is no coherence mechanism in the SPEs’ local stores. The local store design places an additional burden on the developer / compiler, but the advantage is that a Cell processor can be built with as many SPEs as you like, yet there will be no impact on local store latency whatsoever.
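The behaviour described above can be mimicked with a toy model. To be clear, the `LocalStore` class and its `get` and `put` methods are names I have made up for illustration; this is not the Cell SDK and it ignores DMA entirely. It only shows that copies are explicit and that nothing keeps the two sides in step.

```python
class LocalStore:
    """Toy software-managed memory: explicit copies, no coherence."""

    def __init__(self, size: int) -> None:
        self.data = [0] * size

    def _check(self, memory: list, start: int, count: int) -> None:
        if count < 0 or start < 0 or count > len(self.data) or start + count > len(memory):
            raise ValueError("transfer does not fit")

    def get(self, memory: list, start: int, count: int) -> None:
        """Copy `count` items from main memory into the local store."""
        self._check(memory, start, count)
        self.data[:count] = memory[start:start + count]

    def put(self, memory: list, start: int, count: int) -> None:
        """Copy `count` items from the local store back to main memory."""
        self._check(memory, start, count)
        memory[start:start + count] = self.data[:count]


main_memory = [1, 2, 3, 4]
store = LocalStore(size=4)
store.get(main_memory, 0, 4)

main_memory[0] = 99            # memory changes after the copy...
stale = store.data[0]          # ...but the local copy still holds the old value

store.data[1] = 20             # a local change is invisible to main memory
unseen = main_memory[1]        # main memory still holds the old value

store.put(main_memory, 0, 4)   # explicit write-back, which also overwrites the 99
print(stale, unseen, main_memory)
```

Note that the final write-back silently clobbers the change made to main memory in the meantime. That is exactly the sort of thing the developer or compiler has to manage by hand.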

The Cell does have some coherence requirements however as the Cell’s Power Processor Element (PPE) needs to remain coherent with any interactions with memory by other parts of the system. This may partially explain the high latency the PPE’s caches have.

The Cell is also designed for a high clock rate which will help in avoiding the problems described by Amdahl’s law. That said, the current Cell implementation was designed primarily for the PlayStation 3, so the PPE isn’t as powerful as it potentially could be and the clock is being kept relatively low.

The Cell’s design does not, on the other hand, attempt to avoid memory latency. The memory controller subsystem is instead designed to stream memory, getting maximum bandwidth. The SPEs can’t even access memory directly as they only work from the local store. While this may give developers a potential headache, it also means they are not likely to use algorithms or data structures which are latency sensitive. When this is unavoidable multithreading will be used; unlike other processors this is not done in hardware but rather through the compiler.

Cell has been criticised for its unusual architecture but it makes a lot of sense when you look into the problems that large scale multicore processors will face. The Cell’s architecture has been designed to avoid many of the problems other multicore processors are going to face.
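The general idea of switching threads in software can be sketched with generators: each task gives up the processor at the point where it would otherwise sit waiting for data, and a simple scheduler picks the next one. This is a hand-written illustration of the concept with made-up names, not a description of what the Cell compilers actually generate.

```python
from collections import deque


def task(name, chunks, log):
    """A task that yields wherever it would otherwise wait for a transfer."""
    for chunk in chunks:
        log.append(f"{name} requests {chunk}")
        yield                      # transfer in flight: let another task run
        log.append(f"{name} processes {chunk}")


def run_round_robin(tasks):
    """Resume each task in turn until all of them have finished."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run until the task's next yield
            ready.append(current)  # still has work: back of the queue
        except StopIteration:
            pass                   # task finished, drop it


log = []
run_round_robin([task("A", ["a0", "a1"], log), task("B", ["b0"], log)])
for line in log:
    print(line)
```

While task A waits for its data, task B gets to issue its own request, so the waiting time of one overlaps with the work of the other.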

A higher clocked Cell with the PPE replaced by something more sophisticated such as a POWER6 core would offer a combination of high parallel processing capability and high serial processing capability. This would be a processor which addresses all the problems that will occur in large scale multicore processors.

Future Processors

Cell is an example of a hybrid architecture - the silicon is shared by two distinct kinds of cores each of which has its own purpose (PPE for control, SPEs for computation).

I think we can expect to see this kind of design becoming common in the future as conventional cores hit a brick wall. Intel’s ten year plan “Platform 2015” [2015] document explicitly talks about having different types of cores on a single die.

AMD have also hinted at this sort of arrangement for a future processor but in the meantime their “AMD accelerator” program allows hardware accelerators to be directly plugged into a PC motherboard.

Also expect to see “Local Store” type memories becoming more common. They are already common on existing large scale multicore chips such as the PhysX physics accelerator and the Clearspeed maths accelerators. They require more work on the part of the developer / compiler but are an obvious solution to the problems caches will suffer.

I think Sun’s Rock processor gives us a glimpse of what is coming and the sorts of compromises necessary to keep latencies down to manageable levels. It will also have hardware acceleration functions. It may appear a little odd today but it’s probably a good indication of where desktop processors are headed.

Further Future Processors

The types of processors discussed thus far will be appearing in the next few years. After this, processors with hundreds of cores will arrive. Some processors already exist with hundreds or even thousands of cores but these are specialist processors usually used for a very limited set of problems. Building a general purpose processor with hundreds of cores is going to be a challenge to say the least, but vendors are already researching these types of devices.

Quite how you program a processor with hundreds of cores is another matter. These will require a completely different method of programming, even from large scale multicores. Today’s serial languages are problematic enough for multicore processors [languages] but they will be completely useless for such processors.

Parallel stream processing (stream processing breaks an operation into stages) sounds like a viable programming model but this is a completely alien form of processing for current desktop developers. Currently the only widespread use of this technique is in graphics chips.
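For anyone who has not met the model before, here is a minimal sketch of the shape of a stream pipeline: each stage runs in its own thread and passes items to the next stage through a queue. The function names are my own, and in CPython the threads will not actually compute in parallel, so treat this as an illustration of the structure rather than of the performance.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply `func` to every item from `inbox` and pass the result on."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)      # tell the next stage the stream has ended
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Push `items` through one thread per function, in order."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for worker in workers:
        worker.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for worker in workers:
        worker.join()
    return results


# Two hypothetical stages: add one, then double.
result = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2])
print(result)
```

The point to notice is that no stage ever sees the whole data set. Each one works on whatever item arrives next, which is why the model maps well onto hardware with many small cores.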

Conclusion

For the last five decades, if your software was running too slow, upgrading to a new processor would rapidly fix the problem without any disadvantages to the end user. Those days are now coming to an end. The performance advantages of large scale multicore processors will only come at the cost of rewriting today’s software.

From now on if you want your software to run faster it will have to be through multiple threads with modified or completely different algorithms. If you do not do this your software will have a hard time scaling or it may even run slower. Throwing hardware at a problem will not be an answer for much longer.

Unfortunately software is very rarely rewritten; it usually only gets maintained over time. You can expect to see a lot of the very same single threaded code still in use when 8+ core processors appear on the market (around 2008 / 2009 for desktop systems). Subsequent generations of processors are going to have ever more problems with the very legacy code that has kept the PC industry going for the last three decades.

These issues will affect all existing general purpose architectures, anything in fact which uses cache. Expect new approaches and new kinds of processors to appear as the industry grapples with these issues.

The negative effects of multicores are not just going to happen soon; in some cases they are already appearing. Some games have already been found to run slower with just 2 cores [Slower].

While hardware will continue to get faster, getting software to take advantage will take a lot more work. The free ride on the back of the hardware people is over. It’s hard work from now on.

References and Further Reading

[Core] Intel’s Core Architecture

http://arstechnica.com/articles/paedia/cpu/core.ars

[8P] Intel shows 8 core system (actually 4 dual core chips)

http://translate.google.com/translate?u=http%3A%2F%2Fcpu.zol.com.cn%2F27%2F275562.html&langpair=zh-CN%7Cen&hl=en&safe=off&ie=UTF-8&oe=UTF-8&prev=%2Flanguage_tools

[VLIW86] Badly timed but I still think they’re up to something

http://www.theinquirer.net/?article=25496

[Rock] Early details on Sun’s “Rock” Processor

http://www.theregister.com/2006/03/14/sun_rock_deets/

[Intel] Intel’s Multithreaded future

http://www.reed-electronics.com/electronicnews/index.asp?layout=articlePrint&articleID=CA6329160

[2015] Intel’s Platform 2015