Overclock.net › Forums › Industry News › Hardware News › [TechReport] Intel gives peek into Nehalem bag of tricks

[TechReport] Intel gives peek into Nehalem bag of tricks - Page 6

post #51 of 55
Quote:
Originally Posted by darkcloud89
Although memory access latency is going to be much lower in Nehalem than it was in Core 2, this isn't going to somehow arbitrarily increase performance across the board. What it does do is very specific: it cuts down on the relatively huge amount of latency from having to access main memory through an FSB. The Core2 architecture already successfully managed to mask this latency by having relatively large amounts of cache available and aggressively prefetching data using idle memory bandwidth. Assuming that the prefetcher is fairly accurate, the chance of a cache miss will be fairly low, as will the number of situations where main memory has to be accessed directly. Remember that an IMC only becomes more advantageous the more often main memory is accessed. So if you want to see where the IMC will benefit Nehalem, you have to look for where the most cache misses are occurring. Obviously it isn't happening all that often in single threaded applications because that's where Core2 really shines. However, when you start multithreading it cuts down on the amount of cache available per thread dramatically. Less cache available means more cache misses and the performance penalty that ensues. This is when the IMC will make the difference for Nehalem, with multithreading.
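The point that an IMC only pays off in proportion to how often main memory is hit can be sketched with the usual average-access-time formula (hit time plus miss rate times miss penalty). This is only a rough illustration: the latencies (100 ns through the FSB, 60 ns through the IMC, 1 ns for a cache hit) and the miss rates are made-up placeholder numbers, not measurements of Core 2 or Nehalem, and the function names are hypothetical.

```python
def average_access_time_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time plus the expected cost of a miss."""
    return hit_time_ns + miss_rate * miss_penalty_ns


def imc_saving_ns(miss_rate, fsb_penalty_ns, imc_penalty_ns, hit_time_ns=1.0):
    """Average time saved per access when only the miss penalty gets smaller."""
    before = average_access_time_ns(hit_time_ns, miss_rate, fsb_penalty_ns)
    after = average_access_time_ns(hit_time_ns, miss_rate, imc_penalty_ns)
    return before - after


if __name__ == "__main__":
    # Hypothetical miss penalties: 100 ns via the FSB, 60 ns via the IMC.
    for miss_rate in (0.01, 0.05, 0.20):
        saved = imc_saving_ns(miss_rate, 100.0, 60.0)
        print(f"miss rate {miss_rate:.0%}: about {saved:.1f} ns saved per access")
```

With the same latency cut, the saving grows linearly with the miss rate, which is the argument being made here: a workload that rarely misses in cache barely notices the IMC, while one that misses often (for example, many threads sharing the cache) gains much more.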



Besides the on-die memory controller, K8 also widened and deepened the pipeline from K7 in addition to the minor tweaks. But, yes, there was obviously a pretty huge performance difference in AMD CPUs with the addition of the IMC. However, the circumstances under which AMD integrated it with K8 are different than what Intel faced with Nehalem. Personally, I don't think that AMD's decision to implement an IMC was purely a design decision. I don't doubt that if AMD had instead developed a smart prefetcher and used a larger amount of cache, they could have achieved similar results. The problem with this, though, is that cache takes up a very significant amount of die space. For a company with a limited production capacity like AMD coming off the success of K7, an IMC was a good business decision because it meant a reduced die size, so more dies could fit on a single wafer and yields would improve.

I'm not saying that AMD's inclusion of the IMC was purely business, because it obviously turned out to be a good design decision as well. I'm trying to point out the differences in the circumstances under which Intel chose to include it. Namely, AMD used it as an alternative to putting more cache on the CPU. Obviously this isn't the same circumstance that Intel was in. They already had the cache and prefetching pretty much mastered with Core2 and they could afford the extra die space for it. However, their cache/prefetch model doesn't work nearly as well when you start to increase the number of simultaneous threads and CPU cores, and that's why Intel is going with it now. I don't think it's reasonable to expect the same kind of performance gain with the IMC from Intel that AMD had because it's meant to solve a different problem.



The thing is, though, that as far as single threading goes, Nehalem really is just a Core2 revision. Nothing major has been introduced to address any single threaded performance that has been revealed. By far the biggest changes are QuickPath, IMC, and HyperThreading. All three of those are clearly geared towards multithreading/multicore performance, and really all this is doing is bringing the Core architecture to parity with Barcelona in terms of scaling. The only thing AMD would need to do with Shanghai to compete with this is improve single threaded performance with some balance of IPC tweaking and increased clock speeds with the move to 45nm.
Although you spice it up with superior communication skills, there is little difference between what you are claiming now and what many people were claiming leading up to the release of Phenom. (Remember the AMD folks that benchmarked a secret Phenom @ 3GHz before launch.)

You bring up some valid points and I would love to see the performance out of the AMD chips that you are claiming. But one of the most significant arguments you have is that Shanghai will introduce significant IPC improvement. This claim has absolutely NO evidence.

The second part of your argument is that Nehalem's improvements are simply focused on scaling performance. This is absolutely NOT the case. With just a little research from Wikipedia I found that:

  • Nehalem is a modular architecture supporting integrated graphics and I/O chips
  • 33% more in-flight micro-ops than Core. What does this mean:
Quote:
Nehalem allows for 33% more micro-ops in flight compared to Penryn (128 micro-ops vs. 96 in Penryn), this increase was achieved by simply increasing the size of the re-order window and other such buffers throughout the pipeline.
With more micro-ops in flight, Nehalem can extract greater instruction level parallelism (ILP) as well as support an increase in micro-ops thanks to each core now handling micro-ops from two threads at once.
  • Improvements in unaligned cache access performance. What does this mean:
Quote:
In SSE there are two types of instructions: one if your data is aligned to a 16-byte cache boundary, and one if your data is unaligned. In current Core 2 based processors, the aligned instructions could execute faster than the unaligned instructions. Every now and then a compiler would produce code that used an unaligned instruction on data that was aligned with a cache boundary, resulting in a performance penalty. Nehalem fixes this case (through some circuit tricks) where unaligned instructions running on aligned data are now fast.
  • New second level branch predictor per core. What does this mean:
Quote:
Nehalem also introduces a second level branch predictor per core. This new branch predictor augments the normal one that sits in the processor pipeline and aids it much like a L2 cache works with a L1 cache. The second level predictor has a much larger set of history data it can use to predict branches, but since its branch history table is much larger, this predictor is much slower. The first level predictor works as it always has, predicting branches as best as it can, but simultaneously the new second level predictor will also be evaluating branches. There may be cases where the first level predictor makes a prediction based on the type of branch but doesn't really have the historical data to make a highly accurate prediction, but the second level predictor can. Since it (the 2nd level predictor) has a larger history window to predict from, it has higher accuracy and can, on the fly, help catch mispredicts and correct them before a significant penalty is incurred.
  • New L2 and L3 memory system (doesn't just help scaling!)
According to Wikipedia:

  • 1.1x to 1.25x the single-threaded performance or 1.2x to 2x the multithreaded performance at the same power level
  • 30% lower power usage for the same performance
  • According to a preview from AnandTech "expect a 20 - 30% overall advantage over Penryn with only a 10% increase in power usage. It looks like Intel is on track to delivering just that in Q4."
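For a rough sense of what the quoted figures imply for performance per watt, here is a small back-of-the-envelope sketch. It only divides the ratios quoted above (20 to 30% more performance for 10% more power, and 30% lower power at equal performance); the function name is hypothetical and nothing here is a measured result.

```python
def perf_per_watt_gain(perf_ratio, power_ratio):
    """Relative performance-per-watt change given performance and power ratios."""
    return perf_ratio / power_ratio


if __name__ == "__main__":
    # AnandTech's expectation: 20-30% faster than Penryn at 10% more power.
    low = perf_per_watt_gain(1.20, 1.10)
    high = perf_per_watt_gain(1.30, 1.10)
    print(f"perf/watt gain: {low:.2f}x to {high:.2f}x")

    # Wikipedia's claim: same performance at 30% lower power.
    iso_perf = perf_per_watt_gain(1.00, 0.70)
    print(f"perf/watt gain at equal performance: {iso_perf:.2f}x")
```

Taken at face value, the two sets of claims describe different operating points (more speed at slightly more power versus the same speed at much less power), so they are not directly comparable.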
Sources:

http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
http://www.anandtech.com/cpuchipsets...oc.aspx?i=3264

So you are very correct that Nehalem looks to vastly improve on its scalability, something that has been needed desperately, but you assume that these are the only changes and that they only affect multi-CPU environments.

Another critical mistake is believing AMD's numbers about 20-30% IPC improvements. Wait until AMD demos some units before you use that in an argument.
System (13 items)
CPU: Core i7 2500k | Motherboard: ASRock P67 Extreme4 Gen 3 | Graphics: AMD 7970 | RAM: 16GB DDR3
Hard Drive: Intel 520 256GB | Optical Drive: SATA DVD Burner | OS: Windows 7 64 bit | Monitor: Dell U2410
Keyboard: Adesso Mechanical | Power: Silverstone OP650 | Mouse: Logitech G700
post #52 of 55
Quote:
Originally Posted by pauldovi
  • New second level branch predictor per core. What does this mean:

Quote:
Nehalem also introduces a second level branch predictor per core. This new branch predictor augments the normal one that sits in the processor pipeline and aids it much like a L2 cache works with a L1 cache. The second level predictor has a much larger set of history data it can use to predict branches, but since its branch history table is much larger, this predictor is much slower. The first level predictor works as it always has, predicting branches as best as it can, but simultaneously the new second level predictor will also be evaluating branches. There may be cases where the first level predictor makes a prediction based on the type of branch but doesn't really have the historical data to make a highly accurate prediction, but the second level predictor can. Since it (the 2nd level predictor) has a larger history window to predict from, it has higher accuracy and can, on the fly, help catch mispredicts and correct them before a significant penalty is incurred.
So you are very correct that Nehalem looks to vastly improve on its scalability, something that has been needed desperately, but you assume that these are the only changes and that they only affect multi-CPU environments.

Another critical mistake is believing AMD's numbers about 20-30% IPC improvements. Wait until AMD demos some units before you use that in an argument.
Very interesting - I wonder how much the second level branch prediction really helps performance. I was under the impression that branch prediction was already in the upper nineties in terms of hit percentage. However, with the heavy multithreading the penalties for a miss are much higher.
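One way to reason about how much a couple of extra points of prediction accuracy could matter is to estimate the stall cycles that mispredictions add per instruction. This is a minimal sketch with assumed numbers: a 20% branch frequency and a 15-cycle misprediction penalty are placeholders, not published figures for Core 2 or Nehalem, and the function name is hypothetical.

```python
def mispredict_cycles_per_instruction(branch_fraction, accuracy, penalty_cycles):
    """Average stall cycles per instruction caused by branch mispredictions."""
    return branch_fraction * (1.0 - accuracy) * penalty_cycles


if __name__ == "__main__":
    # Hypothetical workload: 1 in 5 instructions is a branch, 15-cycle penalty.
    for accuracy in (0.95, 0.97, 0.99):
        cost = mispredict_cycles_per_instruction(0.20, accuracy, 15)
        print(f"accuracy {accuracy:.0%}: {cost:.3f} stall cycles per instruction")
```

Under these assumptions, going from 95% to 97% accuracy removes 40% of the misprediction stalls, so even when accuracy is already in the upper nineties, the remaining misses are where the gains are.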

I'm with you on AMD's IPC claims. Since the first Conroe rumors started to break, Intel has been consistently living up to its promises in terms of performance and future tech. AMD, not so much.
It goes to eleven (13 items)
CPU: E6300 | Motherboard: DS3 | Graphics: EVGA 8600GTS | RAM: 2GB XMS2 DDR2-800
Hard Drive: 1.294 TB | OS: Arch Linux/XP | Monitor: Samsung 226bw | Keyboard: Eclipse II
Power: Corsair 520HX | Case: Lian-Li v1000B Plus | Mouse: G7
post #53 of 55
Quote:
Originally Posted by pauldovi
Although you spice it up with superior communication skills, there is little difference between what you are claiming now and what many people were claiming leading up to the release of Phenom. (Remember the AMD folks that benchmarked a secret Phenom @ 3GHz before launch.)
How? If anything here is similar to the launch of the Phenom, it's Nehalem. When AMD said that Barcelona would be 40% faster than Clovertown, for some reason everyone assumed that they were also saying Phenom would be 40% faster than C2D. And then, when Phenom didn't live up to an expectation that AMD never set, accusations were made that AMD somehow lied. In reality the problem was with the interpretation of AMD's claims. I can see the same thing happening now: Intel is saying there will be a 30-40% improvement, but they're talking about multithreading and scaling. Again, people see the number and believe it will just be 40% faster.
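To illustrate why a headline multithreaded figure does not carry over to every workload, here is a rough sketch that blends two speedups by the share of run time spent in each kind of code. The 1.10x single-threaded and 1.40x multithreaded speedups are hypothetical stand-ins chosen for illustration, not vendor numbers, and the function name is an assumption.

```python
def blended_speedup(multithreaded_share, single_speedup, multi_speedup):
    """Overall speedup when a share of the original run time gets one speedup
    and the remainder gets another (time-weighted, Amdahl-style)."""
    single_share = 1.0 - multithreaded_share
    new_time = single_share / single_speedup + multithreaded_share / multi_speedup
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical: 1.10x on single-threaded code, 1.40x on multithreaded code.
    for share in (0.0, 0.5, 1.0):
        overall = blended_speedup(share, 1.10, 1.40)
        print(f"{share:.0%} multithreaded: {overall:.2f}x overall")
```

Only a workload that is entirely multithreaded sees the full headline number; a purely single-threaded one sees just the smaller figure, which is the misreading being warned about.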

Quote:
You bring up some valid points and I would love to see the performance out of the AMD chips that you are claiming. But one of the most significant arguments you have is that Shanghai will introduce significant IPC improvement. This claim has absolutely NO evidence.
No evidence? If the link I provided wasn't enough, have another:
Quote:
Originally Posted by xbitlabs
The vice president of the world’s second largest maker of x86 central processing units (CPUs) also said that Shanghai microprocessors will be able to offer higher instructions per clock (IPC) throughput compared to Barcelona, which should transform into higher overall performance per clock. Thanks to higher IPC and larger level-three cache (6MB instead of 2MB), the new processors are likely to offer considerably higher speed than existing quad-core chips by AMD.
Quote:
Originally Posted by pauldovi
The second part of your argument is that Nehalem's improvements are simply focused on scaling performance. This is absolutely NOT the case.

Nehalem is a modular architecture supporting integrated graphics and I/O chips
So are you telling me that Intel wasn't focused on performance scaling and multithreading? But then you bring up the fact that it's a modular design? You can't be serious...

Quote:
33% more in-flight micro-ops than Core. What does this mean:
Yes, the Re-Order Buffer (ROB) is bigger and can track more micro-ops in flight, and this change was made for HyperThreading. But what it doesn't mention is that the ROB is statically split between two threads. What this means is that a single thread only has access to 64 ROB entries, which is fewer than the 96 it would have in Penryn.
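The arithmetic behind this argument can be laid out in a few lines. The ROB sizes (96 for Penryn, 128 for Nehalem) come from the quote earlier in the thread; the even two-way split is the assumption this post makes, and whether that split also applies when only one thread is running is exactly what the argument hinges on, so both cases are shown. The helper name is hypothetical.

```python
PENRYN_ROB_ENTRIES = 96    # from the quoted preview
NEHALEM_ROB_ENTRIES = 128  # from the quoted preview


def entries_per_thread(total_entries, threads_sharing):
    """ROB entries each thread gets if the buffer is split evenly (assumed)."""
    return total_entries // threads_sharing


if __name__ == "__main__":
    growth = (NEHALEM_ROB_ENTRIES - PENRYN_ROB_ENTRIES) / PENRYN_ROB_ENTRIES
    print(f"total ROB growth: {growth:.0%}")

    # The post's assumption: the buffer is always split between two threads.
    split = entries_per_thread(NEHALEM_ROB_ENTRIES, 2)
    print(f"per thread when split two ways: {split} (Penryn: {PENRYN_ROB_ENTRIES})")

    # Alternative assumption: a lone thread is not subject to the split.
    alone = entries_per_thread(NEHALEM_ROB_ENTRIES, 1)
    print(f"per thread when running alone: {alone}")
```

So the conclusion (64 versus 96) only follows if the split is in force for a single thread; if a lone thread gets the whole buffer, it has 128 entries to Penryn's 96.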

Quote:
Improvements in unaligned cache access performance. What does this mean:
This only becomes a benefit in a situation where there is a lot of unaligned cache access. Since compilers already tended to avoid doing this as much as possible, the impact won't be large.
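For anyone unsure what "aligned" means in this context, here is a small sketch that classifies a 16-byte SSE access by its address. It assumes 64-byte cache lines and uses made-up example addresses; the helper names are hypothetical and the code says nothing about how fast either case runs on a given CPU.

```python
SSE_WIDTH_BYTES = 16   # size of one SSE load/store
CACHE_LINE_BYTES = 64  # assumed cache line size


def is_aligned(address, alignment=SSE_WIDTH_BYTES):
    """True if the address sits on a multiple of the alignment."""
    return address % alignment == 0


def crosses_cache_line(address, size=SSE_WIDTH_BYTES, line=CACHE_LINE_BYTES):
    """True if an access of `size` bytes starting at `address` spans two lines."""
    first_line = address // line
    last_line = (address + size - 1) // line
    return first_line != last_line


if __name__ == "__main__":
    # Example addresses only; any integers would do.
    for address in (0x1000, 0x1008, 0x1038):
        print(
            f"{address:#x}: aligned={is_aligned(address)}, "
            f"crosses line={crosses_cache_line(address)}"
        )
```

The case the quoted preview describes is the first one: data that is in fact 16-byte aligned but is accessed with the unaligned form of the instruction. A 16-byte access that is aligned can never straddle a 64-byte line, while an unaligned one sometimes does.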

Quote:
New second level branch predictor per core. What does this mean:
In situations where the first-level branch predictor is already accurate, this means almost nothing because the new "fallback" predictor is slower. The only time this will be advantageous is when the first-level predictor fails and the second-level predictor's additional history storage is enough to produce a more accurate prediction. If I had to guess when the extra data would pay off, I'd say it's in situations with relatively large instructions and situations involving multiple threads.

Quote:
New L2 and L3 memory system (doesn't just help scaling!)
This is basically a by-product of integrating the memory controller. The amount of cache that Penryn had available just isn't needed when you have fast memory access. I already explained how much of a difference I think this will actually make in my previous post.

Quote:
1.1x to 1.25x the single-threaded performance or 1.2x to 2x the multithreaded performance at the same power level
So after calling me out for having "no evidence" of Shanghai IPC improvements, you go ahead and quote this? Hint: There's no source for this information in the Wikipedia article.

Quote:
According to a preview from AnandTech "expect a 20 - 30% overall advantage over Penryn with only a 10% increase in power usage. It looks like Intel is on track to delivering just that in Q4."
Yes, I've read that review and he does say "20-30% overall" but that's because he only really tested multithreaded performance. There's a reason there wasn't more than one single threaded benchmark.

Quote:
So you are very correct that Nehalem looks to vastly improve on its scalability, something that has been needed desperately, you assume that this is the only changes and that these changes only affect multi-CPU environments.
It's not that the changes only affect multithreaded/multisocket situations, but that the improvement for single-threaded situations is just so minuscule in comparison, unless there's still something major that's yet to be revealed. Seeing the single-thread Cinebench results pretty much confirmed what I was thinking.

Quote:
Another critical mistake is believing AMD's numbers about 20-30% IPC improvements. Wait until AMD demos some units before you use that in an argument.
I wouldn't call listening to what AMD says a mistake. When taken in the correct context, the performance advantages claimed for Barcelona turned out to be true.


That 14.4% average improvement (minus outliers) is close enough for me to the 15% AMD claimed, given the tests they did.
Edited by darkcloud89 - 6/21/08 at 8:48am
post #54 of 55
dang... this is some hectic and informative conversation...

nice it's not turning into a flame war...
Herschel (14 items)
CPU: Intel Core i7 4770K | Motherboard: ASRock Z87M Extreme4 | Graphics: eVGA GTX 680 2GB | RAM: 12GB G.Skill Ripjaws 1600
Hard Drive: 1x 60GB SSD 1x 500GB, 1x 640GB, 1x 1TB | Optical Drive: Asus something or other | OS: Windows 7 Ultimate x64 | Monitor: Acer H236HLbid (23" 1920x1080)
Monitor: Asus VE198 (19". 1440x900) | Keyboard: Microsoft Sidewinder X4 | Power: Seasonic X650 | Case: Antec P180 Mini White
Mouse: Logitech G500
post #55 of 55
Quote:
Originally Posted by -iceblade^
dang... this is some hectic and informative conversation...

nice it's not turning into a flame war...
Yeah, I love seeing this on OCN; it always winds up with me learning a lot more.
Lee XT (17 items)
CPU: AMD FX-6300 | Motherboard: Asus M5A97 | Graphics: SAPPHIRE Radeon HD 7850 | RAM: AMD 4GB DDR3 1333MHZ
RAM: AMD 4GB DDR3 1333MHZ | RAM: AMD 4GB DDR3 1333MHZ | RAM: AMD 4GB DDR3 1333MHZ | Hard Drive: OCZ Vertex 4 256GB
Cooling: Corsair H80 | OS: Windows 8.1 Pro MCE | Monitor: Dell P2414H WHXV7 | Keyboard: Microsoft Generic
Power: Ultra 600W Limited Edition | Case: NZXT Black Steel | Mouse: Razer Deathadder | Mouse Pad: Razer Goliath
Audio: Realtek HD Audio