Although memory access latency will be much lower in Nehalem than it was in Core 2, that isn't going to arbitrarily increase performance across the board. What it does is very specific: it cuts down on the relatively huge latency of having to access main memory through a FSB. The Core 2 architecture already successfully masked this latency by keeping relatively large amounts of cache available and aggressively prefetching data using idle memory bandwidth. Assuming the prefetcher is fairly accurate, the chance of a cache miss is fairly low, as is the number of situations where main memory has to be accessed directly. Remember that an IMC only becomes more advantageous the more often main memory is accessed. So if you want to see where the IMC will benefit Nehalem, you have to look for where the most cache misses occur. Obviously that doesn't happen all that often in single-threaded applications, because that's where Core 2 really shines. However, once you start multithreading, the amount of cache available per thread drops dramatically. Less cache means more cache misses and the performance penalty that ensues. That is when the IMC will make the difference for Nehalem: with multithreading.
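To make the cache-miss case concrete, here's a minimal pointer-chasing sketch in C (the function names and the ring setup are mine, purely for illustration). Each load depends on the result of the previous one, so the prefetcher can't run ahead; once the working set is much larger than the caches, nearly every step is a miss that pays the full trip to main memory, which is exactly the cost an IMC reduces:

```c
#include <stdlib.h>

/* Pointer-chasing walk: next[i] holds the index of the element to
   visit after i. Each load depends on the previous one (a dependent
   load chain), so the hardware prefetcher can't hide the latency.
   With an array much larger than the last-level cache, nearly every
   step becomes a main-memory access. */
static size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t i = start;
    for (size_t s = 0; s < steps; s++)
        i = next[i];
    return i;
}

/* Build a simple ring 0 -> 1 -> ... -> n-1 -> 0. (A real latency
   benchmark would shuffle this into a random cycle so the stride
   prefetcher can't detect the pattern; this setup is just a demo.) */
static size_t *make_ring(size_t n) {
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++)
        next[i] = (i + 1) % n;
    return next;
}
```

Timing `chase` over a small ring versus a huge shuffled one is the standard way latency benchmarks expose the difference between a cache hit and a full FSB round trip.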
Besides the on-die memory controller, K8 also widened and deepened the pipeline from K7, in addition to minor tweaks. But yes, there was obviously a pretty huge performance difference in AMD CPUs with the addition of the IMC. However, the circumstances under which AMD integrated it with K8 are different from what Intel faced with Nehalem. Personally, I don't think AMD's decision to implement an IMC was purely a design decision. I don't doubt that if AMD had instead developed a smart prefetcher and used a larger amount of cache, they could have achieved similar results. The problem is that cache takes up a very significant amount of die space. For a company with limited production capacity like AMD, coming off the success of K7, an IMC was a good business decision: it meant a smaller die, so more dies fit on a single wafer and yields improved.
I'm not saying that AMD's inclusion of the IMC was purely business; it obviously turned out to be a good design decision as well. I'm trying to point out how the circumstances differed when Intel chose to include it. Namely, AMD used it as an alternative to putting more cache on the CPU. Obviously Intel wasn't in the same position. They already had cache and prefetching pretty much mastered with Core 2, and they could afford the extra die space for it. However, their cache/prefetch model doesn't work nearly as well when you start to increase the number of simultaneous threads and CPU cores, and that's why Intel is going with an IMC now. I don't think it's reasonable to expect the same kind of performance gain from Intel's IMC that AMD saw, because it's meant to solve a different problem.
The thing is, though, that as far as single-threaded performance goes, Nehalem really is just a Core 2 revision. Nothing major that has been revealed addresses single-threaded performance. By far the biggest changes are QuickPath, the IMC, and HyperThreading. All three are clearly geared towards multithreading/multicore performance, and really all this does is bring the Core architecture to parity with Barcelona in terms of scaling. All AMD would need to do with Shanghai to compete is improve single-threaded performance through some balance of IPC tweaks and the increased clock speeds that come with the move to 45nm.
You bring up some valid points, and I would love to see the kind of performance out of the AMD chips that you are claiming. But one of your most significant arguments is that Shanghai will introduce significant IPC improvements. This claim has absolutely NO evidence behind it.
The second part of your argument is that Nehalem's improvements are simply focused on scaling performance. This is absolutely NOT the case. With just a little research on Wikipedia I found that:
- Nehalem is a modular architecture supporting integrated graphics and I/O chips
- 33% more in-flight micro-ops than Core. What does this mean:
Nehalem allows for 33% more micro-ops in flight compared to Penryn (128 micro-ops vs. 96 in Penryn). This increase was achieved by simply increasing the size of the re-order window and other such buffers throughout the pipeline.
With more micro-ops in flight, Nehalem can extract greater instruction level parallelism (ILP) as well as support an increase in micro-ops thanks to each core now handling micro-ops from two threads at once.
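For a sense of what "extracting ILP" means in code, here's a rough C sketch (the function name is mine, and this is an illustration, not a benchmark). The loop body splits the work into two independent accumulator chains; a core with a larger out-of-order window can keep adds from both chains in flight at once instead of waiting on one serial chain:

```c
#include <stddef.h>

/* Sum an array using two independent accumulator chains. Because
   s0 and s1 never depend on each other, an out-of-order core with
   enough in-flight micro-ops can overlap the two add chains --
   the instruction-level parallelism a bigger re-order window is
   built to exploit. */
static long sum_two_chains(const int *a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];     /* chain 0 */
        s1 += a[i + 1]; /* chain 1, independent of chain 0 */
    }
    if (i < n)          /* handle an odd-length tail element */
        s0 += a[i];
    return s0 + s1;
}
```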
- Improvements in unaligned cache access performance. What does this mean:
|In SSE there are two types of instructions: one if your data is aligned to a 16-byte cache boundary, and one if your data is unaligned. In current Core 2 based processors, the aligned instructions could execute faster than the unaligned instructions. Every now and then a compiler would produce code that used an unaligned instruction on data that was aligned with a cache boundary, resulting in a performance penalty. Nehalem fixes this case (through some circuit tricks) where unaligned instructions running on aligned data are now fast.|
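The aligned/unaligned distinction in that quote comes down to whether an address sits on a 16-byte boundary. A tiny C sketch of that check (the helper name is mine, for illustration only):

```c
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary. This is the property
   that decides whether a compiler can safely emit the aligned SSE
   load (movaps) instead of the unaligned one (movups). On Core 2,
   using the unaligned instruction on data that happened to be
   aligned still paid a penalty; Nehalem removes that penalty. */
static int is_sse_aligned(const void *p) {
    return ((uintptr_t)p & 0xF) == 0;
}
```

Compilers that can't prove alignment at compile time have to fall back to the unaligned instruction, which is exactly the case Nehalem makes cheap.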
- New second-level branch predictor per core. What does this mean:
|Nehalem also introduces a second level branch predictor per core. This new branch predictor augments the normal one that sits in the processor pipeline and aids it much like a L2 cache works with a L1 cache. The second level predictor has a much larger set of history data it can use to predict branches, but since its branch history table is much larger, this predictor is much slower. The first level predictor works as it always has, predicting branches as best as it can, but simultaneously the new second level predictor will also be evaluating branches. There may be cases where the first level predictor makes a prediction based on the type of branch but doesn't really have the historical data to make a highly accurate prediction, but the second level predictor can. Since it (the 2nd level predictor) has a larger history window to predict from, it has higher accuracy and can, on the fly, help catch mispredicts and correct them before a significant penalty is incurred.|
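To show the kind of history-based mechanism that quote is talking about, here's a minimal 2-bit saturating-counter predictor in C, the textbook building block of branch prediction (real first- and second-level predictors are far more elaborate; this sketch and its names are mine):

```c
/* 2-bit saturating-counter branch predictor. States 0-1 predict
   not-taken, states 2-3 predict taken. Training nudges the counter
   toward the observed outcome but saturates at 0 and 3. */
typedef struct { int state; /* 0..3 */ } bpred;

static int bpred_predict(const bpred *p) {
    return p->state >= 2; /* 1 = predict taken */
}

static void bpred_train(bpred *p, int taken) {
    if (taken && p->state < 3)
        p->state++;
    else if (!taken && p->state > 0)
        p->state--;
}
```

The 2-bit hysteresis means one surprising branch doesn't immediately flip a well-established prediction. A larger history table, like the second-level predictor's, pushes that same accuracy-versus-storage trade-off further at the cost of a slower lookup.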
- New L2 and L3 memory system (doesn't just help scaling!)
- 1.1x to 1.25x the single-threaded performance or 1.2x to 2x the multithreaded performance at the same power level
- 30% lower power usage for the same performance
- According to a preview from AnandTech "expect a 20 - 30% overall advantage over Penryn with only a 10% increase in power usage. It looks like Intel is on track to delivering just that in Q4."
So while you are correct that Nehalem looks to vastly improve its scalability, something that has been desperately needed, you assume these are the only changes and that they only affect multi-CPU environments.
Another critical mistake is believing AMD's numbers about 20-30% IPC improvements. Wait until AMD demos some units before you use that in an argument.