The Big Crunch: The Downside Of Multicore

Part 1 : Problems

Can you imagine getting a new PC and finding your software runs no faster than before?

You probably can’t imagine it running slower.  For some types of software, however, that is exactly what is going to happen in the not-too-distant future.

Faster processors running software slower may sound bizarre, but if you’re using certain types of data structures or code on large scale (8+ core) multicore processors it can actually happen.  In fact it might already happen today if you were to run legacy software on one of the current processors which feature a large number of cores.

Over the years CPUs have become faster and faster but memory speed has hardly moved, and memory latency (the time it takes to access memory) is actually rising.  The traditional way to solve this is to use a cache, a block of fast on-chip memory which pretends to be main memory.  Your program gets its data from the fast, low latency cache but it “thinks” it’s getting it from main memory.

As processor speeds have soared upwards the relative gap has widened: going to memory once took only a few cycles, now it takes hundreds.  Processor architects have been able to hide this rising memory latency by using ever more cache.  Cache sizes are still increasing and are likely to increase further as multiple cores become more common.

As cache size increases, the latency of accessing the cache also increases.  This has been hidden in the past by techniques such as pre-fetching and Out-of-Order (OOO) execution.  Unfortunately this approach is running out of steam, as processor architects have other problems to deal with nowadays.  With the scaling problems faced since 90nm, making high-clocked OOO engines more powerful is becoming exceedingly difficult: complex circuitry gets hot, and high-clocked complex circuitry gets even hotter.

There is clearly some room to optimise OOO engines without generating huge amounts of heat, as Intel have shown with their next generation “Core architecture” processors [Core], but there is only so far this process can go.

This has never been a problem before, but without the means of hiding memory or cache latency it will become noticeable in some types of code.  That would be enough of a problem on its own, but it gets worse: with all processors going multicore, the added complexity means cache and memory latency is now going to start increasing faster.

The Problem With Cache

A cache is a representation of memory but is much faster than memory.  Its purpose is to act as a buffer between main memory and the processor: the idea is that when the processor wants something from memory it is already in the cache, so to the processor the memory appears to be as fast as the cache.  If the required data is not in the cache it has to be fetched from main memory, which can take hundreds of cycles; in that case the processor will stall and get no work done.
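To make the hit/miss behaviour concrete, here is a minimal sketch in C of a direct-mapped cache lookup.  The sizes, structure and names are purely illustrative and do not correspond to any particular processor:

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative direct-mapped cache: 256 lines of 64 bytes each.
       All sizes and field names are made up for the example. */
    #define LINE_SIZE 64
    #define NUM_LINES 256

    struct cache_line {
        bool     valid;
        uint64_t tag;
        uint8_t  data[LINE_SIZE];
    };

    static struct cache_line cache[NUM_LINES];

    /* Returns true on a hit (the fast path, a few cycles).  On a miss the
       line must be fetched from main memory, which can cost hundreds of
       cycles and may stall the core. */
    bool cache_lookup(uint64_t address)
    {
        uint64_t line_addr = address / LINE_SIZE;
        uint64_t index     = line_addr % NUM_LINES;  /* which line to check   */
        uint64_t tag       = line_addr / NUM_LINES;  /* identifies the block  */

        return cache[index].valid && cache[index].tag == tag;
    }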

Because the cache represents memory, in order to keep that representation accurate it has to keep the data in the cache coherent (i.e. in sync) with the data in main memory.  If another part of the system writes to main memory (e.g. data coming in from disc) and that part of RAM is cached, the copy in the cache is no longer accurate and has to be removed.  In order to remove stale data the cache needs to be told what to delete, and this coherence traffic consumes interconnect bandwidth.

In a multicore system each core may have its own cache, and these not only have to be kept coherent with main memory but also with each other.  This increases the latency of the cache, since each additional port requires additional logic and it takes time for data to get through.  One way to avoid the coherence traffic between caches is to use a shared cache, as IBM do in POWER and Intel do in their new Core processors.

The additional latency is not completely removed however, as the cache now needs to interface with two cores instead of one and additional latency creeps in there; any clever logic for allocating more or less cache capacity to each core also imposes a penalty.  L1 is very latency sensitive, so while some dual-core devices use shared caches, normally the L1 is per core and only the L2 is shared.

In the computer world things normally work by doubling, so you would assume the complexity associated with cache coherence would do the same.  Unfortunately, in the case of interconnections it doesn’t; it’s much worse:

Links between caches required in multicore processors:

2 cores (2 caches) - 1 link

4 cores (4 caches) - 6 links

8 cores (8 caches) - 28 links
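These figures follow from the fact that fully connecting n caches needs one link per pair of caches, i.e. n(n-1)/2 links.  A trivial C sketch of the calculation:

    #include <stdio.h>

    /* Fully connecting n caches needs one link per pair: n * (n - 1) / 2. */
    int main(void)
    {
        for (int caches = 2; caches <= 16; caches *= 2) {
            int links = caches * (caches - 1) / 2;
            printf("%2d caches -> %3d links\n", caches, links);
        }
        return 0;
    }

At 16 caches you would already need 120 links, so the problem rapidly gets out of hand.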

In an 8-core processor, giving each core its own cache will prove very complex indeed, unless some form of shared bus is used.  Even then, traffic between cores is likely to be heavy and using a shared bus will also increase latency.

However, this is a simplification: I have not included the links required for I/O or for hooking up multi-level caches (L1 to L2); including these makes things even more complex.

The number of links can be reduced by making the L1 and L2 caches inclusive.  Typically caches in modern processors are exclusive, that is, they do not hold the same data.  Inclusive caches hold copies of the same data and so are in effect slightly smaller.  Inclusive caches reduce the coherence traffic necessary, since data held in both caches can be tagged and the L1 only needs updating when tagged data changes in the L2.

Needless to say, I think we can expect 8+ core processors to have shared caches; expect L2 cache latency figures to start rising sharply.

Cache coherency causes huge scaling problems because the traffic it generates gobbles up system bandwidth.  There are a number of methods to decrease this traffic, but they all add complexity and thus latency.

One solution is to use a directory: a list of what is in each cache is kept in the directory and looked up before any coherence traffic is sent around.  Doing a directory lookup before you do a cache lookup will sharply reduce cache coherence traffic, but it is not exactly going to help latency.
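A common way to organise such a directory (though not the only one, and not necessarily what any given vendor does) is to keep, for each block of memory, a bit vector recording which cores hold a copy.  A rough C sketch of the idea, with invented field names:

    #include <stdint.h>
    #include <stdbool.h>

    /* Rough sketch of a coherence directory entry: for each memory block the
       directory records which cores have the block cached.  Layout and names
       are illustrative only. */
    #define MAX_CORES 8

    struct dir_entry {
        uint8_t sharers;  /* one bit per core (up to MAX_CORES) holding a copy */
        bool    dirty;    /* true if some core holds a modified copy           */
    };

    /* Before broadcasting an invalidate, consult the directory and send
       messages only to the cores whose bit is set.  This cuts coherence
       traffic, but adds a lookup to every miss, i.e. extra latency. */
    static inline bool core_has_copy(const struct dir_entry *e, int core)
    {
        return (e->sharers >> core) & 1;
    }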

There appears to be no way to avoid increasing cache latency in a multicore system.

The Problem With Memory

Cache latency will rise, and you can expect the same to happen with memory.  Memory latency does not fall as processors get faster; in fact it tends to rise over time with increasing capacity.  New types of memory such as DDR3 or FBDIMMs both increase latency compared to previous memory types.  Going multicore will increase memory latency further and effectively decrease memory bandwidth.

Having a large number of processor cores means that memory will have to be shared between them.  To ensure the memory requests are serviced efficiently they will be reordered.  This sharing and reordering hardware takes time to do its job, so any memory request going through it will be subject to additional latency.  The more cores there are, the more complex it becomes and the greater the latency.

Sharing memory also has the effect of lowering bandwidth.  If you have a memory interface bandwidth of 10 Gigabytes per second, adding a second core means the bandwidth per core is reduced to 5 Gigabytes per second.  Going to 4 cores means you have only 2.5 Gigabytes per second per core.  For memory bandwidth to keep up with processors with large numbers of cores, they are going to need more, higher-speed memory controllers.  Adding more memory controllers means more logic, and that of course means yet more memory access latency...

Another problem with memory is related to caches.  Let’s say one core wishes to read from a certain address in memory.  Another core could have written data to that address, but the data may be sitting in its cache having not yet made it to RAM.  The first core thus needs to ask the caches of the other cores if they hold the data it wants.  As this happens for every memory read, the process generates an enormous amount of inter-core traffic, slowing down both memory and cache access.  Again, the more cores, the more complex it gets.

There are no easy solutions to the hardware problems that large scale multicore processors will bring.  Every potential solution brings its own set of trade-offs, so you can expect microprocessor architects to be investigating all sorts of different options.

The Problem With Software

Not all the problems are going to be at the hardware level; large scale multicore processors will also bring challenges in the software domain.  The physical limitations now hitting processor designers mean that in a few short years plugging in a new processor may actually have the opposite effect; if you want faster software you are going to have to rewrite it.

Some types of algorithms common today are going to just plain suck on future large scale multicore processors.  Developers are going to need to learn what to use and what not to.  The days of coding anything you like and expecting the hardware to just handle it are over; if you want high performance code on an 8+ core processor you are going to have to do a lot of work.

With all the complexities large scale multicore processing will bring, software practices are going to need to change in order to extract maximum performance from these processors.  Predictable, linear data structures and predictable code are the sorts of things you are going to need.  Like it or not, today’s single-threaded, branchy code will be going out the window.
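As a concrete illustration of the difference, compare a linear array traversal with pointer chasing through a linked list.  The C below is a deliberately simple sketch, not taken from any real codebase:

    #include <stddef.h>

    /* Linear traversal: addresses are sequential, so the hardware prefetcher
       can stream data in ahead of the loop and hide memory latency. */
    long sum_array(const long *values, size_t count)
    {
        long total = 0;
        for (size_t i = 0; i < count; i++)
            total += values[i];
        return total;
    }

    /* Pointer chasing: the address of each node is only known once the
       previous node has been loaded, so every cache miss is paid in full. */
    struct node { long value; struct node *next; };

    long sum_list(const struct node *n)
    {
        long total = 0;
        for (; n != NULL; n = n->next)
            total += n->value;
        return total;
    }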

In order to take advantage of massively parallel hardware you need software which can be broken up.

Not only will algorithms need to change but software will need to become heavily multithreaded.  This process is relatively easy for certain classes of “embarrassingly parallel” problems, but these are limited.  For most existing software the transition is going to be a long, difficult process.  However, even when parallelised there are other problems to consider...

Meet Gene Amdahl

Multithreading involves splitting the application into threads and co-ordinating them.  Not everything can be broken into threads however, and of the applications that can, there are still likely to be parts which execute serially on a single core.  You may not consider this a big deal, but that serial component imposes a limit on how much faster a parallel application can go.

As an example, let’s assume you have an application which is 90% parallel and 10% serial.  If it takes 100 minutes to execute on a single core we can work out how long the task will take to complete on various numbers of cores.

As you can see from the figures below, going to 2 cores roughly doubles performance.  Going to 4 cores, however, only triples performance, and going to 8 cores doesn’t even increase performance 5 times.

Execution time (minutes) on different numbers of cores for an application that is 10% serial, 90% parallel:

1 core: 100 (1x)

2 cores: 55 (1.8x)

4 cores: 32.5 (3.1x)

8 cores: 21.3 (4.7x)

16 cores: 15.6 (6.4x)

It doesn’t matter how many cores you keep adding: if 10% of the code is serial the speedup will never get beyond 10x.

This effect is not merely theoretical; it is known as “Amdahl’s law” after mainframe designer Dr Gene Amdahl, who first described it in the 1960s.  Unlike “Moore’s law”, which is just an observation, Amdahl’s law is a fundamental software scaling law.
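The figures above come straight from Amdahl’s formula: with a serial fraction s, the time on n cores is T(1) x (s + (1 - s)/n), so the speedup can never exceed 1/s.  A small C program computing the same figures:

    #include <stdio.h>

    /* Amdahl's law: with serial fraction s, time on n cores is
       T(n) = T(1) * (s + (1 - s) / n), so speedup = 1 / (s + (1 - s) / n). */
    int main(void)
    {
        const double serial = 0.10;   /* 10% serial, as in the example above */
        const double t1     = 100.0;  /* minutes on a single core            */

        for (int cores = 1; cores <= 16; cores *= 2) {
            double t = t1 * (serial + (1.0 - serial) / cores);
            printf("%2d cores: %6.2f minutes  (%.1fx speedup)\n",
                   cores, t, t1 / t);
        }
        return 0;
    }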

Between Amdahl’s law and a limited front side bus, a future system from Intel looks like it is going to ably demonstrate the problems I’m describing [8P]:

Cinebench 9.5 on Cloverton 2GHz

1 core: 362

8 cores: 1723 (4.7 times faster)

It should be noted that these chips are quite some time from release, so the figures will likely rise (not least the clock speed).  That said, a lack of scaling is clearly demonstrated here, caused by everything discussed thus far.

The above figures involve four physical chips with shared buses, and these will limit scaling; you can expect better figures once all the cores are on a single chip.  Do not expect perfect scaling though, and don’t expect scaling problems to be limited to one manufacturer or architecture; these problems are going to hit every vendor in the industry.

The Problem With Power

The inability to raise processor speed without raising power consumption is the reason vendors switched to building multicore processors in the first place.  Two lower clocked cores can offer more processing capability at lower power than a higher clocked single core.

Unfortunately going multicore only reduces the power problem, it doesn’t solve it.  In order to keep power usage at the same level, every time you double the number of cores you need to halve their power consumption, and that is not going to be easy.  Sun and STI have both managed to get the power of their cores down by using a simple architecture suited to their intended workloads; other processors will not be able to do this without impacting performance.

I once predicted Intel would move to a Transmeta-style VLIW based architecture for their x86 processors [VLIW86].  While I got that prediction wrong, I still expect vendors of complex cores to switch to simpler designs simply because of power concerns.  Due to the low number of architectural registers (the registers visible to the programmer), x86 processors in particular are dependent on out-of-order (OOO) execution; removing it would have a highly negative effect on performance.  A VLIW design moves the OOO scheduling into the compiler, cutting power drastically.  Interestingly, more reliance on the compiler has a side benefit that will help fight the latency problems: memory reads likely to incur latency can be moved forward by the compiler to reduce their impact.  This won’t help in all cases but should be useful nonetheless.
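As a rough illustration of the kind of transformation meant here (hand-written for clarity, not the output of any real compiler), the second version of this C loop issues its loads an iteration early so the memory access can overlap with the arithmetic:

    /* Naive order: the loads of a[i] and b[i] sit right next to their use,
       so a cache miss stalls the core for the full memory latency. */
    long dot_naive(const long *a, const long *b, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* Hoisted order: the loads for the next iteration are issued early, so
       memory access overlaps with the multiply-add of the current one. */
    long dot_hoisted(const long *a, const long *b, int n)
    {
        if (n == 0)
            return 0;
        long sum = 0;
        long a_cur = a[0], b_cur = b[0];
        for (int i = 0; i < n - 1; i++) {
            long a_next = a[i + 1];   /* issued well before they are needed */
            long b_next = b[i + 1];
            sum += a_cur * b_cur;
            a_cur = a_next;
            b_cur = b_next;
        }
        sum += a_cur * b_cur;
        return sum;
    }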

--


© Nicholas Blachford 2006