This is the title of a thread I found on www.techreport.net (thread link here). The laurels go to UberGerbil, who brainstormed all of this. Good job, mate.
Originally Posted by UberGerbil
There seem to have been a lot of questions lately about multiple cores and multithreaded code, and it just so happens I was answering a question for someone else recently that caused me to write a bit of background on the topic, so I thought I’d modify it a bit and post it here. (Note: I’m going to try to keep this high-level and not get too hung up on the specifics – that naturally means I’m going to be generalizing and simplifying things… perhaps oversimplifying. I know there are several of you here on TR who do this for a living and probably know as much about this topic as I do, or more; keep in mind that I’m trying to write this for an average TR reader who keeps up on the tech but might never have written code or know much about the details of threading, so try not to nitpick me to death.)
It’s true that multiple cores can improve the responsiveness of a system: as we run more apps simultaneously and add more and more background processes (virus-checkers, audio players, chat programs, etc), having another core (or more) to offload that work can do a lot to maintain the “creamy smoothness™” everyone desires – and that’s a benefit even if none of those apps are explicitly multithreaded. And while several common applications (Microsoft Office, for example) are multithreaded, they’re generally not CPU-intensive enough for it to matter. Word’s background spellchecking and repagination run on another thread, but putting that ~3% load onto another core doesn’t make any real difference. On the other hand, many of the most CPU-intensive apps that most people here run – games – are not yet multithreaded. Of course, the various game studios have said they are working on multithreaded engines, and everyone is eagerly awaiting them, multicore chips in hand. Unfortunately, it appears some people have wildly unrealistic expectations for what this will mean when they actually arrive: the benefits may not be as great as some people expect.
Part of this is intrinsic to the problem of finding TLP (Thread Level Parallelism) in any software design. TLP is distinct from ILP (Instruction Level Parallelism), which involves the CPU executing multiple instructions from a single thread simultaneously. ILP can involve the compiler (in the case of SSE instructions, for example, as well as code generation strategies that seek to maximize the opportunities for ILP), but it happens at runtime and is largely out of the programmer's hands: the programmer can make choices that tend to optimize it, but the upper limit is determined by the superscalar design of the processor. TLP, on the other hand, is a software design decision that has to be made very early in the development process. This may involve coding threads explicitly, or using libraries that do so on the programmer's behalf; the networking portion of DirectX, for example, is multithreaded. It’s sometimes possible to go back into a single-threaded design and find some work that can get split out and run on another thread, but the benefit of that kind of retrofit is generally limited (this is what Carmack did when putting SMP into the Quake 3 code, for example, and the results were underwhelming). To really take advantage of multithreading, you have to design for it from the start. Among other things, you seek to minimize writes to shared data (because this can create dependencies: situations where one or more threads are stalled while they wait for an update from another thread, not to mention outright bugs where they didn't wait and got inconsistent data, or where every thread is deadlocked waiting on one of the others). Just coding multiple threads correctly is hard; getting them to be efficient is harder still.
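To make the "minimize writes to shared data" advice concrete, here's a minimal, hypothetical Python sketch (the function names are mine, not from any real engine): each worker accumulates into a private local variable and touches the shared result exactly once, in one brief critical section, instead of grabbing the lock on every update and creating exactly the kind of dependency described above.

```python
import threading

def sum_range(start, end, shared, lock):
    local = 0                      # private to this thread: no contention
    for i in range(start, end):
        local += i
    with lock:                     # one brief critical section per thread
        shared[0] += local

def parallel_sum(n, workers=4):
    shared = [0]                   # the only shared, written state
    lock = threading.Lock()
    step = n // workers
    threads = [
        threading.Thread(
            target=sum_range,
            args=(w * step,
                  n if w == workers - 1 else (w + 1) * step,
                  shared, lock))
        for w in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]
```

The design choice is the point, not the arithmetic: the less often threads write to shared state, the less often they stall waiting on each other.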
Complicating this is the diversity of hardware in the installed base, and it’s only getting worse: with multithreaded games coming Real Soon, many hardcore gamers have been buying dual core chips this year, but some of them will start adopting quadcores next year, and meanwhile the fat mainstream market still consists largely of uniprocessor boxes. Keep in mind there’s always some overhead with threading, because each thread has state information that has to be saved and restored every time the thread is switched off of or onto a core; this overhead is acceptable if there are generally enough cores to go around so that each thread is spending most of its time running, not waiting its turn. For this reason the general rule of thumb is that you should have no more threads than you have cores. Finding an optimal multithreaded design for consoles, where the hardware is a fixed target, is one thing; designing for a user base that ranges from one to four cores (and perhaps eventually more within the lifetime of a game engine) is something else. Naïve approaches, like assigning a thread to every entity requiring AI, tend to perform poorly (particularly at the low end) because they add a lot of overhead without increasing performance; smarter approaches tend to be more closely tailored to the number of expected cores, which will likely leave a couple of cores largely unused when quadcores start arriving. And designing something that scales smoothly as more cores are available is very, very hard.
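The "no more threads than cores" rule of thumb can be sketched in a few lines of Python (a hypothetical illustration; `process` is a stand-in for real per-task work): size the worker pool from the machine's core count rather than from the number of tasks, so a naïve thread-per-entity design doesn't drown the low end in context-switch overhead.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process(task):
    return task * task             # stand-in for real per-task work

def run_tasks(tasks):
    cores = os.cpu_count() or 1    # fall back to 1 if undetectable
    # Pool size follows the hardware, not the task count: on a
    # uniprocessor box this degrades to one worker; on a quadcore
    # it scales up to four, with no code changes.
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(process, tasks))
```

The same pattern adapts from one core to four (or more) without rewriting the design, which is exactly the scaling problem described above.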
Now, there are cases where you get almost perfect scaling with cores. The classic example might be ray-trace rendering, where every pixel is independent of every other pixel, so while there's shared data (the model being rendered) it is entirely read-only. In effect the calculation for each pixel can be an independent thread. Thus you can keep adding processors (at least until you have as many processors as pixels) and get pretty much a linear increase in speed (2 cores = 2x as fast, 4 cores = 4x as fast, etc). There are other cases where there is no shared state between threads or processes, but they tend to be server apps (web servers, for example, where there’s no interdependence between an instance serving one browser/user and an instance serving another, which is why Sun’s highly-threaded Niagara design makes sense in that domain).
However, in the real world of end user applications, away from special cases like that, things aren't quite so neat. In general, there is always some serialization in the code: some routine that cannot be parallelized and that thus acts as a bottleneck for everything else. It may be the code that divvies up work to the other threads; it may be communication between the threads as they do their tasks; it may be updating shared data when they are done. But there is going to be some code like that, and it has a significant impact on the overall gains you get from threading. In fact, it turns out to be the critical factor.
Games are the application where most people (here at least) are looking for improvements in the future. So let's set aside some inconvenient realities (that many games are still GPU-, not CPU-limited, for example; that most developers have to consider an installed base that is still largely uniprocessor; and that testing all the permutations of multicore hardware may become a gating factor on development schedules) and talk about theoretical improvements we might expect to see from multithreaded games running on multicore processors. Now, you might say, these programmers are smart, and good, and are going to get this right. So -- given enough time -- they should be able to wring a lot of TLP out of their code. Let's assume they do. In fact, let's be rather wildly optimistic and assume that they can get 90% parallelization. In other words, let’s imagine they can multithread almost all of their code so that 90% of the time it is running on multiple cores, and only 10% of the time those threads are serialized (waiting on one thread, etc). That’s almost certainly a better case than will actually happen, but let’s use that number. Moreover, let's assume these guys are really good, and they can scale this indefinitely: if you have four cores, they can spin off four threads; if you have eight cores, they can spin off eight, all without having more than 10% serialization. Even better, right? There may be some pain now, but once everybody is over the multithreading hurdle we’re golden. We’ll be getting 90% utilization no matter how many cores we have, and we’ll see immediate benefits whenever we upgrade to a chip with more cores. The future is bright, and as Intel and AMD give us more and more cores our games will get faster and faster. Right?
Uh, not quite.
Remember that 10% of the code that is serial? It turns out it isn’t just a small drag on the overall app; it’s the anchor that sinks the ship. Ten percent may not seem like much, but adding more cores and threads makes that worse. When you have a second thread running on a second core, it too is wasting 10% of its time waiting on that serial code; and each additional core wastes 10% of its time as well. But wait, you say, we’re improving the throughput on the other 90% of the code. Surely that 90% must outweigh the 10%. So how fast will the code run overall?
Well, with two cores we have 10% of the code that doesn’t go faster, and we can double the speed of the other 90%, so the speedup looks like this:
0.1 + ( 0.9 / 2) = 0.55; if we want that in terms of a multiplier of our baseline, it is
( 1 / .55 ) = 1.8x as fast as a single thread / core.
Ok, that’s not so bad. How do four cores look? Same math:
0.1 + ( 0.9 / 4 ) = 0.325; again, as a multiple of the baseline it is
( 1 / .325 ) = 3.08x
So note: even though 90% of our code can be spread across all four cores, the net speedup is only a factor of three. And it gets worse; for eight cores we get:
0.1 + ( 0.9 / 8 ) = 0.2125; inverted to get the speedup:
( 1 / 0.2125 ) = 4.7x
Yes, that math is right. Doubling the number of cores from four to eight speeds our hypothetical code up by just 53%; put another way, we added four more cores to the mix and received less than two cores' worth of added benefit. Eight cores give us little more than four cores' worth of performance. I’ll let you do the math for 16 and beyond, but believe me it only gets uglier (we don’t see an 8x speedup until we get to 36 threads/cores, for example). Can you say "diminishing returns"?
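All of the arithmetic above is one formula: with a serial fraction s and n cores, the speedup is 1 / (s + (1 - s) / n). A few lines of Python reproduce every number in this thread (s = 0.1 here, matching the 90%-parallel assumption):

```python
def amdahl_speedup(cores, serial=0.1):
    # Amdahl's Law: the serial fraction runs at full cost no matter
    # how many cores there are; only the rest divides by the core count.
    return 1 / (serial + (1 - serial) / cores)

for n in (2, 4, 8, 36):
    print(n, "cores:", round(amdahl_speedup(n), 2), "x")
```

Running it gives the 1.8x, 3.08x, and 4.7x figures worked out above, and confirms that 36 cores are needed before the speedup finally reaches 8x.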
This, ladies and gentlemen, is known as Amdahl's Law, and it’s not new – in fact, it’ll be celebrating its 40th birthday next year… and it has never been more relevant.