I'd guess everyone has heard something of the large, public, flame war that erupted between Intel and Nvidia about whose product is or will be superior: Intel Larrabee, or Nvidia's CUDA platforms. There have been many detailed analyses posted about details of these, such as who has (or will have) how many FLOPS, how much bandwidth per cycle, and how many nanoseconds latency when everything lines up right. Of course, all this is “peak values,” which still means “the values the vendor guarantees you cannot exceed” (Jack Dongarra’s definition), and one can argue forever about how much programming genius or which miraculous compiler is needed to get what fraction of those values.
Such discussion, it seems to me, ignores the elephant in the room. I think a key point, if not the key point, is that this is an issue of MIMD (Intel Larrabee) vs. SIMD (Nvidia CUDA).
If you question this, please see the update at the end of this post. Yes, Nvidia is SIMD, not SPMD.
I’d like to point to a Wikipedia article on those terms, from Flynn’s taxonomy, but their article on SIMD has been corrupted by Intel and others’ redefinition of SIMD to “vector.” I mean the original. So this post becomes much longer.
MIMD (Multiple Instruction, Multiple Data) refers to a parallel computer that runs an independent separate program – that’s the “multiple instruction” part – on each of its simultaneously-executing parallel units. SMPs and clusters are MIMD systems. You have multiple, independent programs barging along, doing things that may have nothing to do with each other, or may not. When they are related, they barge into each other at least occasionally, hopefully as intended by the programmer, to exchange data or to synchronize their otherwise totally separate operation. Quite regularly the barging is unintended, leading to a wide variety of insidious data- and time-dependent bugs.
SIMD (Single Instruction, Multiple Data) refers to a parallel computer that runs the EXACT SAME program – that’s the “single instruction” part – on each of its simultaneously-executing parallel units. When ILIAC IV, the original 1960s canonical SIMD system, basically a rectangular array of ALUs, was originally explained to me late in grad school (I think possibly by Bob Metcalfe) it was put this way:
Some guy sits in the middle of the room, shouts ADD!, and everybody adds.
I was a programming language hacker at the time (LISP derivatives), and I was horrified. How could anybody conceivably use such a thing? Well, first, it helps that when you say ADD! you really say something like “ADD Register 3 to Register 4 and put the result in Register 5,” and everybody has their own set of registers. That at least lets everybody have a different answer, which helps. Then you have to bend your head so all the world is linear algebra: Add matrix 1 to matrix 2, with each matrix element in a different parallel unit. Aha. Makes sense. For that. I guess. (Later I wrote about 150 KLOC of APL, which bent my head adequately.)
Unfortunately, the pure version doesn’t make quite enough sense, so Burroughs, Cray, and others developed a close relative called vector processing: You have a couple of lists of values, and say ADD ALL THOSE, producing another list whose elements are the pair wise sums of the originals. The lists can be in memory, but dedicated registers (“vector registers”) are more common. Rather than pure parallel execution, vectors lend themselves to pipelining of the operations done. That doesn’t do it all in the same amount of time – longer vectors take longer – but it’s a lot more parsimonious of hardware. Vectors also provide a lot more programming flexibility, since rows, columns, diagonals, and other structures can all be vectors. However, you still spend a lot of thought lining up all those operations so you can do them in large batches. Notice, however, that it’s a lot harder (but not impossible) for one parallel unit (or pipelined unit) to unintentionally barge into another’s business. SIMD and vector, when you can use them, are a whole lot easier to debug than MIMD because SIMD simply can’t exhibit a whole range of behaviors (bugs) possible with MIMD.
Intel’s SSE and variants, as well as AMD and IBM’s equivalents, are vector operations. But the marketers apparently decided “SIMD” was a cooler name, so this is what is now often called SIMD.
Bah, humbug. This exercises one of my pet peeves: Polluting the language for perceived gain, or just from ignorance, by needlessly redefining words. It damages our ability to communicate, causing people to have arguments about nothing.
Anyway, ILLIAC IV, the CM-1 Connection Machine (which, bizarrely, worked on lists – elements distributed among the units), and a variety of image processing and hard-wired graphics processors have been rather pure SIMD. Clearspeed’s accelerator products for HPC are a current example.
Graphics, by the way, is flat-out crazy mad for linear algebra. Graphics multiplies matrices multiple times for each endpoint of each of thousands or millions of triangles; then, in rasterizing, for each scanline across each triangle it interpolates a texture or color value, with additional illumination calculations involving normals to the approximated surface, doing the same operations for each pixel. There’s an utterly astonishing amount of repetitive arithmetic going on.
Now that we’ve got SIMD and MIMD terms defined, let’s get back to Larrabee and CUDA, or, strictly speaking, the Larrabee architecture and CUDA. (I’m strictly speaking in a state of sin when I say “Larrabee or CUDA,” since one’s an implementation and the other’s an architecture. What the heck, I’ll do penance later.)
Larrabee is a traditional cache-coherent SMP, programmed as a shared-memory MIMD system. Each independent processor does have its own vector unit (SSE stuff), but all 8, 16, 24, 32, or however many cores it has are independent executors of programs. As are each of the threads in those cores. You program it like MIMD, working in each program to batch together operations for each program’s vector (SIMD) unit.
CUDA, on the other hand, is basically SIMD at its top level: You issue an instruction, and many units execute that same instruction. There is an ability to partition those units into separate collections, each of which runs its own instruction stream, but there aren’t a lot of those (4, 8, or so). Nvidia calls that SIMT, where the “T” stands for “thread” and I refuse to look up the rest because this has a perfectly good term already existing: MSIMD, for Multiple SIMD. (See pet peeve above.) The instructions it can do are organized around a graphics pipeline, which adds its own set of issues that I won’t get into here.
Which is better? Here are basic arguments:
For a given technology, SIMD always has the advantage in raw peak operations per second. After all, it mainly consists of as many adders, floating-point units, shaders, or what have you, as you can pack into a given area. There’s little other overhead. All the instruction fetching, decoding, sequencing, etc., are done once, and shouted out, um, I mean broadcast. The silicon is mainly used for function, the business end of what you want to do. If Nvidia doesn’t have gobs of peak performance over Larrabee, they’re doing something really wrong. Engineers who have never programmed don’t understand why SIMD isn’t absolutely the cat’s pajamas.
On the other hand, there’s the problem of batching all those operations. If you really have only one ADD to do, on just two values, and you really have to do it before you do a batch (like, it’s testing for whether you should do the whole batch), then you’re slowed to the speed of one single unit. This is not good. Average speeds get really screwed up when you average with a zero. Also not good is the basic need to batch everything. My own experience in writing a ton of APL, a language where everything is a vector or matrix, is that a whole lot of APL code is written that is basically serial: One thing is done at a time.
So Larrabee should have a big advantage in flexibility, and also familiarity. You can write code for it just like SMP code, in C++ or whatever your favorite language is. You are potentially subject to a pile of nasty bugs that aren’t there, but if you stick to religiously using only parallel primitives pre-programmed by some genius chained in the basement, you’ll be OK.
[Here’s some free advice. Do not ever even program a simple lock for yourself. You’ll regret it. Case in point: A friend of mine is CTO of an Austin company that writes multithreaded parallel device drivers. He’s told me that they regularly hire people who are really good, highly experienced programmers, only to let them go because they can’t handle that kind of work. Granted, device drivers are probably a worst-case scenario among worst cases, but nevertheless this shows that doing it right takes a very special skill set. That’s why they can bill about $190 per hour.]
But what about the experience with these architectures in HPC? We should be able to say something useful about that, since MIMD vs. SIMD has been a topic forever in HPC, where forever really means back to ILLIAC days in the late 60s.
It seems to me that the way Intel's headed corresponds to how that topic finally shook out: A MIMD system with, effectively, vectors. This is reminiscent of the original, much beloved, Cray SMPs. (Well, probably except for Cray’s even more beloved memory bandwidth.) So by the lesson of history, Larrabee wins.
However, that history played out over a time when Moore’s Law was producing a 45% CAGR in performance. So if you start from basically serial code, which is the usual place, you just wait. It will go faster than the current best SIMD/vector/offload/whatever thingy in a short time and all you have to do is sit there on your dumb butt. Under those circumstances, the very large peak advantage of SIMD just dissipates, and doing the work to exploit it simply isn’t worth the effort.
Yo ho. Excuse me. We’re not in that world any more. Clock rates aren’t improving like that any more; they’re virtually flat. But density improvement is still going strong, so those SIMD guys can keep packing more and more units onto chips.
Ha, right back at ‘cha: MIMD can pack more of their stuff onto chips, too, using the same density. But… It’s not sit on your butt time any more. Making 100s of processors scale up performance is not like making 4 or 8 or even 64 scale up. Providing the same old SMP model can be done, but will be expensive and add ever-increasing overhead, so it won’t be done. Things will trend towards the same kinds of synch done in SIMD.
Furthermore, I've seen game developer interviews where they strongly state that Larrabee is not what they want; they like GPUs. They said the same when IBM had a meeting telling them about Cell, but then they just wanted higher clock rates; presumably everybody's beyond that now.
Pure graphics processing isn’t the end point of all of this, though. For game physics, well, maybe my head just isn't build for SIMD; I don't understand how it can possibly work well. But that may just be me.
If either doesn't win in that game market, the volumes won't exist, and how well it does elsewhere won't matter very much. I'm not at all certain Intel's market position matters; see Itanium. And, of course, execution matters. There Intel at least has a (potential?) process advantage.
I doubt Intel gives two hoots about this issue, since a major part of their motivation is to ensure than the X86 architecture rules the world everywhere.
But, on the gripping hand, does this all really matter in the long run? Can Nvidia survive as an independent graphics and HPC vendor? More density inevitably will lead to really significant graphics hardware integrated onto silicon with the processors, so it will be “free,” in the same sense that Microsoft made Internet Explorer free, which killed Netscape. AMD sucked up ATI for exactly this reason. Intel has decided to build the expertise in house, instead, hoping to rise above their prior less-than-stellar graphics heritage.
My take for now is that CUDA will at least not get wiped out by Larrabee for the foreseeable future, just because Intel no longer has Moore’s 45% CAGR on its side. Whether it will survive as a company depends on many things not relevant here, and on how soon embedded graphics becomes “good enough” for nearly everybody, and "good enough" for HPC.
There was some discussion on Reddit of this post; it seems to have aged off now – Reddit search doesn’t find it. I thought I'd comment on it anyway, since this is still one of the most-referenced posts I've made, even after all this time.
Part of what was said there was that I was off-base: Nvidia wasn’t SIMD, it was SPMD (separate instructions for each core). Unfortunately, some of the folks there appear to have been confused by Nvidia-speak. But you don’t have to take my word for that. See this excellent tutorial from SIGGRAPH 09 by Kayvon Fatahalian of Stanford. On pp. 49-53, he explains that in “generic-speak” (as opposed to “Nvidia-speak”) the Nvidia GeForce GTX 285 does have 30 independent MIMD cores, but each of those cores is effectively 1024-way SIMD: It has with groups of 32 “fragments” running as I described above, multiplied by 32 contexts also sharing the same instruction stream for memory stall overlap. So, to get performance you have to think SIMD out to 1024 if you want to get the parallel performance that is theoretically possible. Yes, then you have to use MIMD (SPMD) 30-way on top of that, but if you don’t have a lot of SIMD you just won’t exploit the hardware.(source