Thursday, October 18, 2007

Programming the PS3's Cell Processor


I found that excellent article I mentioned in an earlier post about how to program the new Cell processor in the PS3.

Dr. Dobbs Portal, Programming the Cell Processor



"In this article, we present strategies we've used to make a Breadth-First Search on graphs as fast as possible on the Cell, reaching a performance that's 22 times higher than Intel's Woodcrest, comparable to a 256-processor BlueGene/L supercomputer, and all this with just a single Cell processor! Some techniques (loop unrolling, function inlining, SIMDization) are familiar; others (bulk synchronous parallelization, DMA traffic scheduling, overlapping of computation and transfers) are less so."

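If you haven't run into breadth-first search before, here is a rough sketch of the level-by-level version in plain Python. This is not the article's Cell code, just my own illustration; the function name `bfs_levels` and the toy graph are made up for the example. The point is that each level of the graph is finished before the next one starts, and as I understand it that is the shape of work a bulk synchronous approach splits across cores.

```python
def bfs_levels(adjacency, source):
    """Return {vertex: hop count from source} for every reachable vertex.

    adjacency maps each vertex to a list of its neighbors (hypothetical
    input format chosen for this sketch).
    """
    distance = {source: 0}
    frontier = [source]          # vertices discovered at the current level
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # Everything in the current frontier is expanded before moving on.
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in distance:   # first visit only
                    distance[neighbor] = level
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return distance


# Toy graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 3 -> 4; vertex 5 is unreachable.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: [], 5: [0]}
print(bfs_levels(graph, 0))
```

On a real Cell the inner loops are where the unrolling, SIMD, and DMA scheduling tricks from the article would come in; none of that is shown here.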


Here's another article I just found that is less intensely technical than the above, but also very good.


Cell Architecture Explained Version 2



"It is when the SPEs are working on compute-heavy streaming applications that the Cell will be working hardest. It's in these applications that the Cell may get close to its theoretical maximum performance and perform an order of magnitude more calculations per second than any desktop processor currently available.

On the other hand, if the stream uses large amounts of bandwidth and the data blocks can fit into the local stores, the performance difference might actually be bigger. Even if conventional CPUs are capable of processing the data at the same rate, the transfers between the CPUs will be held up while they wait for chip-to-chip transfers. The Cell’s internal interconnect system allows transfers running into hundreds of gigabytes per second, while chip-to-chip interconnects allow transfers in the low tens of gigabytes per second.

While conventional processors have vector units on board (SSE or VMX / AltiVec), they are not dedicated vector processors. The vector processing capability is an add-on to the existing instruction sets and has to share the CPU's resources. The SPEs are dedicated high-speed vector processors and, with their own memory, don't need to share anything other than the memory (and not even this much if the data can fit in the local stores). Add to this the fact there are 8 of them and you can see why their potential computational capacity is so large."
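To put a rough number on that "theoretical maximum", here is a back-of-the-envelope calculation. Only the count of 8 SPEs comes from the quote above; the 3.2 GHz clock, the 4 single-precision lanes per 128-bit register, and the 2 operations per lane per cycle (a fused multiply-add) are my own assumptions about the chip, and the function name `peak_gflops` is invented for the sketch.

```python
SPE_COUNT = 8        # from the quoted article
CLOCK_GHZ = 3.2      # assumption: clock speed in GHz
LANES = 4            # assumption: 128-bit register / 32-bit floats
FLOPS_PER_LANE = 2   # assumption: one fused multiply-add per lane per cycle


def peak_gflops(spes, clock_ghz, lanes, flops_per_lane):
    """Theoretical single-precision peak in GFLOPS for the SPEs alone."""
    return spes * clock_ghz * lanes * flops_per_lane


per_spe = peak_gflops(1, CLOCK_GHZ, LANES, FLOPS_PER_LANE)
total = peak_gflops(SPE_COUNT, CLOCK_GHZ, LANES, FLOPS_PER_LANE)
print(f"per SPE: {per_spe:.1f} GFLOPS, all SPEs: {total:.1f} GFLOPS")
```

This is a ceiling, not a benchmark: it assumes every SPE issues a multiply-add on every cycle, which real code will rarely manage.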


