This article has been updated, see Version 2

Rebuttal to Ars article.

Any program has bugs. We have software testers to find them so they can be reported. A big article is like this as well: there are bugs in it too. In an article, however, we use human languages which, unlike computer languages, can be interpreted in different ways. This means that even if there are no specific errors, what readers take out of an article may still be incorrect.

Professional authors have their equivalent of software testers: these are called editors. I am not a professional author, and my article did not get any editing apart from my own.

Editing yourself alone is not always a good idea, as you know what you've written and it's all too easy to miss bits while reading. Another problem is that you know what you mean, so you're not going to misinterpret yourself. A third problem is that if you're not perfect at spelling or grammar (I'm not) you're not going to notice these errors.

I've been reading the various comments on the article around the web and, as you may expect, some things have indeed been misinterpreted. I've also had quite a bit of correspondence on the article and some of it has pointed out errors. If you've e-mailed me I'd like to take the opportunity to thank you.

In the above cases I have gone through the article and corrected the mistakes. Where I've seen that points have been misinterpreted I've checked them over to make the relevant point clearer. For people who have already read the article I added an extra section pointing out these clarifications.

For reasons unknown the author "Hannibal" at tech site Ars Technica decided to do a short write-up critiquing my article. He seems somewhat unimpressed and talks about a number of points which he considers flawed.

My article is based on a reading of the Sony patent and other sources of information. The aim was to help people understand what the Cell architecture would look like and what its potential is. It is speculative by nature; it is not a scientific paper.

Looking at it in a pedantic manner, as Hannibal seems to have done, is completely pointless. He manages to take small misunderstandings and blow them up into major points.

Why he decided to do this in public is beyond me, but as such I feel I should answer these points.

Two of the points he makes were already corrected before his article was published; the other two are at best "differences of interpretation".

 

SETI time Estimation

"For instance, the author, Nicholas Blachford, starts off with a fantastic and completely made-up benchmark estimate for how fast Cell will complete a SETI@Home work unit (i.e. 5 mins)."

The SETI figure of 5 minutes for 4 Cells to complete a unit is a "calculated guess". It relies on a number of assumptions which may or may not be wrong. It also relies on the maximum theoretical performance of the Cell. The same goes for the Opteron comparison.

Given that we don't have the chip yet to test and there is no hard data to look at, it is safe to assume any figure I give should be taken with a pinch of salt. I would have thought this was obvious to anyone reading it; evidently not.

The Cell may not even get close to this level of performance in real life, but its theoretical performance is so high that even at 25% it's still going to blow everything else clean out of the water.

GPUs already exhibit performance massively beyond any desktop processor; they're just not in wide use for general purpose apps. I'm really not saying anything that spectacular here. Get SETI running on an Nvidia or ATI GPU if you don't believe me.

Before Hannibal's article was published I had already made it clear in my article that the figure was a bit of a guess, and made a note of it in the clarifications section. This point was thus redundant.

 

Compilers

"Blachford also declares that the longstanding problems inherent in code parallelism and multithreaded programming are now solved, because the Cell will just miraculously do all this stuff for you via fancy compiler and process scheduling tricks."

I did not say anything of the sort. One sentence, if read in a pedantic manner and taken out of context, could potentially be seen to say this, but it's certainly not what I meant. What I was really saying was that after code has been split into software Cells, the infrastructure handles how they are distributed. That is, once a program is broken into software Cells you don't need to worry about the number of hardware Cells they are computed on.
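
The point about distribution can be illustrated with a minimal Python sketch. It is only an analogy and not how the Cell infrastructure is actually implemented: the names SoftwareCell, split_into_cells and run_on are hypothetical, and a thread pool merely stands in for a set of hardware Cells. Note that the splitting is still done by the programmer; only the distribution is automatic.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SoftwareCell:
    """Hypothetical unit of work: a routine plus the data it operates on."""
    routine: Callable[[List[int]], int]
    data: List[int]


def split_into_cells(values: List[int], chunk: int) -> List[SoftwareCell]:
    # The programmer (not the infrastructure) decides how to split the job.
    return [SoftwareCell(sum, values[i:i + chunk])
            for i in range(0, len(values), chunk)]


def run_on(hardware_cells: int, cells: List[SoftwareCell]) -> int:
    # The pool stands in for the infrastructure: it hands each software
    # cell to whichever worker is free, however many workers there are.
    with ThreadPoolExecutor(max_workers=hardware_cells) as pool:
        partials = pool.map(lambda c: c.routine(c.data), cells)
        return sum(partials)


cells = split_into_cells(list(range(1000)), chunk=100)
# Same program, same answer, whether 1 or 4 workers are available.
print(run_on(1, cells), run_on(4, cells))
```

The calling code never changes when the worker count does, which is all the original sentence was meant to convey.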

That said, this technology does indeed exist. Try the following exact lines in Google:

"auto-parallelizing" compiler
"auto vectorizing" compiler

I have read on several occasions that IBM have been involved in auto-vectorising efforts over the last year; perhaps this is why.
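
For readers unfamiliar with the term, the sketch below shows by hand the kind of loop rewrite an auto-vectorizing compiler performs automatically. It is written in Python purely for illustration: Python gains no speed from this, and the function names and the width of 4 are assumptions, not anything taken from IBM's compilers.

```python
from typing import List


def add_scalar(a: List[float], b: List[float]) -> List[float]:
    # What the programmer writes: one element per loop iteration.
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])
    return out


def add_vector_style(a: List[float], b: List[float], width: int = 4) -> List[float]:
    # What the compiler effectively turns it into: `width` elements per step
    # (a single vector instruction on real hardware), followed by a scalar
    # loop for any leftover elements.
    out = []
    main = len(a) - len(a) % width
    for i in range(0, main, width):
        out.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
    for i in range(main, len(a)):
        out.append(a[i] + b[i])
    return out
```

Both functions return the same result; the difference is only in how many elements are handled per step.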

This too had already been corrected, so the point was again redundant.

 

Local Memory Vs Cache

"In another part of the article, Blachford claims that the cell processing units have no 'cache.' Instead, they each have a 'local memory' that fetches data from main memory in 1024-bit blocks. Well, that's sort of like saying that an iMac doesn't have a 'monitor,' but it does have a surface on which visual output is displayed."

In the case of the analogy given, both the "surface" and "monitor" would perform the same function. The local memory and cache do a similar job but there are distinct differences between the two.
It's true to say they solve the same problem, but they do it in very different ways and there are trade-offs in both approaches.

A CRT and LCD may perform the same function but that's not the same as saying they are the same thing.

A cache divides a system's memory map into blocks and can hold a portion of each of those blocks. This portion limits how much you can load from any given area: if you are working on a 50K block of RAM the cache will only hold a small part of it at any one time. While modern caches can be controlled to a degree, they are not addressable in the same way as memory.

A local memory is of a fixed size and is directly addressable. There are no portions to worry about other than the maximum size of the memory. If you want to load 50K into an APU's local memory you just load it.

If your application involves iterating over a block of this size many times, which approach is going to be faster? One has to keep going to RAM, the other does not.
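
To put rough numbers on that question, here is a small Python simulation that counts trips to RAM. It is a sketch under stated assumptions, not a model of any real chip: the cache is a simple direct-mapped one, its 32K capacity and the 10 passes are hypothetical values picked for illustration, the 128-byte transfer size is borrowed from the 1024-bit figure quoted above, and the local memory is assumed to be large enough to hold the whole 50K block.

```python
LINE = 128                 # bytes per transfer (1024 bits), assumed for both
BLOCK = 50 * 1024          # the 50K working set from the text
CACHE = 32 * 1024          # hypothetical cache capacity, smaller than BLOCK
PASSES = 10                # hypothetical number of iterations over the block


def ram_fetches_with_cache(block: int, cache: int, passes: int) -> int:
    """Count line fetches from RAM for a simple direct-mapped cache."""
    slots = [None] * (cache // LINE)      # which line each slot holds
    fetches = 0
    for _ in range(passes):
        for line in range(block // LINE):
            slot = line % len(slots)
            if slots[slot] != line:       # miss: go to RAM, evict old line
                slots[slot] = line
                fetches += 1
    return fetches


def ram_fetches_with_local_memory(block: int, passes: int) -> int:
    """The block is loaded once; every pass then runs out of local memory,
    so the count does not depend on the number of passes."""
    return block // LINE


print(ram_fetches_with_cache(BLOCK, CACHE, PASSES))        # 2992
print(ram_fetches_with_local_memory(BLOCK, PASSES))        # 400
```

The gap comes entirely from the cache being smaller than the working set. With a cache at least as large as the block (64K in this model, say) the two counts come out the same, which is why this is a trade-off and not a clear win for either approach.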

On the other hand, if you are multitasking, the cache approach makes a lot more sense, since when you switch tasks at least part of the data you want is in memory already. This makes less sense if you have many cores and spread applications over them.

 

AltiVec Vs APUs

In reference to my points on the APUs most likely not being AltiVec, this is written:

"The author appears to be confusing an instruction set with an implementation."

I state 3 reasons why I consider the APUs to be using a different instruction set from AltiVec.

1) The number of registers is different.
2) APUs use local memory.
3) AltiVec is part of the PowerPC instruction set and operates as part of it.

The important points are 2 and 3.

"The fact that the individual processing units have a local cache has little to do with whether or not the PUs themselves implement some hypothetical AltiVec derivative."

In AltiVec, data is moved from memory into registers and then processed; results are then written back to memory from the registers. If you try that on an APU you'll not get very far, primarily because the instructions to do it do not exist.

An APU can move data between main memory and local memory, or between local memory and registers. It cannot move data directly between registers and main memory.
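
The restriction can be pictured with a toy Python model. Everything in it is hypothetical and chosen only for illustration (the class, the method names dma_get, dma_put, load and store, and the sizes); it is not the APU instruction set. The point is simply that there are two separate hops and no shortcut between registers and main memory.

```python
class APU:
    """Toy model: registers can only be loaded from or stored to local memory."""

    def __init__(self, main_memory: list, local_size: int = 8, registers: int = 4):
        self.main = main_memory            # shared main memory
        self.local = [0] * local_size      # private local memory
        self.regs = [0] * registers

    # Hop 1: main memory <-> local memory
    def dma_get(self, main_addr: int, local_addr: int, count: int) -> None:
        self.local[local_addr:local_addr + count] = self.main[main_addr:main_addr + count]

    def dma_put(self, local_addr: int, main_addr: int, count: int) -> None:
        self.main[main_addr:main_addr + count] = self.local[local_addr:local_addr + count]

    # Hop 2: local memory <-> registers
    def load(self, reg: int, local_addr: int) -> None:
        self.regs[reg] = self.local[local_addr]

    def store(self, reg: int, local_addr: int) -> None:
        self.local[local_addr] = self.regs[reg]

    # Deliberately no method that moves data between registers and main memory.


main = [1, 2, 3, 4, 0, 0, 0, 0]
apu = APU(main)
apu.dma_get(0, 0, 2)                         # main[0:2] -> local[0:2]
apu.load(0, 0)                               # local[0]  -> r0
apu.load(1, 1)                               # local[1]  -> r1
apu.regs[2] = apu.regs[0] + apu.regs[1]      # compute in registers
apu.store(2, 2)                              # r2        -> local[2]
apu.dma_put(2, 4, 1)                         # local[2]  -> main[4]
print(main[4])                               # 3
```

An AltiVec-style program would do the equivalent of loading r0 straight from main memory, and that is exactly the operation this model leaves out.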

"Finally, the statement, 'Altivec is an add-on to the existing PowerPC instruction set,' is correct, but the rest of that sentence--'and operates as part of a PowerPC processor'--doesn't make a whole lot of sense to me in this context."

APUs will have to control the flow of instructions so you will probably find there are some extra instruction units and registers to handle this.

AltiVec uses the PowerPC general purpose registers and instruction units to handle this. If the APUs did this they would in effect be full PowerPC cores. I see no indication of this whatsoever and I think it would go against the entire "Cray on a chip" philosophy of the Cell.

To repeat what I said in the article:

"There will no doubt be a great similarity between the two but don't expect any direct compatibility. It should however be relatively simple to convert between the two."

 

To Conclude

Hannibal usually writes very good articles which I enjoy reading. Quite why he decided to write this pedantic rant is beyond me. It is something of a disappointment, especially his "ivory tower" tone.

Of course if he thought the piece had flaws he could have just sent me an email pointing them out. That's what I'd do and that's what others have done.

 

If I am being enthusiastic, I believe I have justification for being so. Follow the references in part 5, especially "Stream" and anything on GPUs. Similar technology already exists and is already delivering incredible performance.

 


Introduction and Index
Part 1: Inside The Cell
Part 2: Again Inside The Cell
Part 3: Cellular Computing
Part 4: Cell Vs the PC
Part 5: Conclusion and References
Part 6: Updates, Clarifications and Missing Bits

 

Nicholas Blachford 25/01/05