Originally Posted by JF-AMD
OK, daddy is going to do some math, everyone follow along please.
First: There is only ONE performance number that has been legally cleared: 16-core Interlagos will give 50% more throughput than 12-core Opteron 6100. This is a statement about throughput and about server workloads only. You CANNOT make any client performance assumptions based on that statement.
Now, let's get started.
First, everything that I am about to say below is about THROUGHPUT and throughput is different than speed. If you do not understand that, then please stop reading here.
Second, ALL comparisons are against the same cores. These are not comparisons between different generations, nor are they comparisons between different architectures.
Assume that a processor core has 100% throughput.
Adding a second core to an architecture is typically going to give ~95% greater throughput. There is obviously some overhead because the threads will stall, the threads will wait for each other and the threads may share data. So, two completely independent cores would equal 195% (100% for the first core, 95% for the second core.)
Looking at SPEC int and SPEC FP, Hyperthreading gives you 14% greater throughput for integer and 22% greater throughput for FP. Let's just average the two together: (14% + 22%) / 2 = 18%.
One core is 100%. One core with HT running two threads is 118%. Everyone following so far? We have 195% for 2 threads on 2 cores and we have 118% for 2 threads on 1 core.
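For anyone who wants to check the arithmetic so far, here is a minimal sketch in Python. It only uses the figures quoted above; the variable names are made up for illustration and are not AMD terminology.

```python
# Figures quoted in the post (percent of one core's throughput).
# All names below are illustrative, not official terms.
HT_GAIN_INT = 14        # HT gain on SPEC int
HT_GAIN_FP = 22         # HT gain on SPEC FP
SECOND_CORE_GAIN = 95   # what a second independent core adds

# Average the two HT gains, as the post does.
ht_gain_avg = (HT_GAIN_INT + HT_GAIN_FP) / 2

# 2 threads on 1 core with HT vs. 2 threads on 2 independent cores.
one_core_with_ht = 100 + ht_gain_avg
two_cores = 100 + SECOND_CORE_GAIN

print(f"Average HT gain:           {ht_gain_avg:.0f}%")
print(f"2 threads, 1 core with HT: {one_core_with_ht:.0f}%")
print(f"2 threads, 2 cores:        {two_cores}%")
```

These are the same 18%, 118% and 195% figures used in the rest of the post.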
Now, one Bulldozer core is 100%. Running 2 threads on 2 separate modules would lead to ~195%, which is consistent with running on two independent cores.
Running 2 threads on the same module is ~180%.
You can see why the strategy is more appealing than HT when it comes to threaded workloads. And, yes, the world is becoming more threaded.
Now, where does the 90% come from? What is 180% /2? 90%.
People have argued that there is a 10% overhead for sharing because you are not getting 200%. But, as we saw before, 2 cores actually only equals 195%, so the net per core if you divide the workload is actually 97.5%, so it is roughly a 7-8% delta from just having cores.
Now, before anyone starts complaining about this overhead and saying that AMD is compromising single thread performance (because the fanboys will), keep in mind that a processor with HT equals ~118% for 2 threads, so per thread that equals 59%, so there is a ~36% hit for HT. This is specifically why I think that people need to stay away from talking about it. If you want to pick on AMD for the 7-8%, you have to acknowledge the ~36% hit from HT. But ultimately that is not how people judge these things. Having 5 people in a car consumes more gas than driving alone, but nobody talks about the increase in gas consumption because it is so much less than 5 individual cars driving to the same place.
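The per-thread numbers can be reproduced with a short sketch, again using only the figures from the post (the function and variable names are illustrative). One assumption to flag: the ~36% figure appears to correspond to measuring the HT per-thread number against the 95% second-core figure; measured against the 97.5% per-core average, the same gap works out to 38.5 points. Either way the comparison with the 7-8% module delta points the same direction.

```python
# Throughput for 2 threads in each arrangement (percent of one core),
# taken from the post. Names are illustrative only.
TWO_CORES = 195      # 2 threads on 2 independent cores
ONE_MODULE = 180     # 2 threads on 1 Bulldozer module
ONE_CORE_HT = 118    # 2 threads on 1 core with HT
SECOND_CORE = 95     # what a second independent core adds


def per_thread(total, threads=2):
    """Split a total throughput figure evenly across threads."""
    return total / threads


independent = per_thread(TWO_CORES)
module = per_thread(ONE_MODULE)
ht = per_thread(ONE_CORE_HT)

# Delta between sharing a module and having two independent cores.
module_delta = independent - module

# HT delta against two different baselines (assumption: the post's
# ~36% uses the 95% second-core figure as its baseline).
ht_delta_vs_average = independent - ht
ht_delta_vs_second_core = SECOND_CORE - ht

print(f"Per thread, independent cores: {independent}%")
print(f"Per thread, one module:        {module}%")
print(f"Per thread, one core with HT:  {ht}%")
print(f"Module delta:                  {module_delta} points")
print(f"HT delta vs 97.5% average:     {ht_delta_vs_average} points")
print(f"HT delta vs 95% second core:   {ht_delta_vs_second_core} points")
```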
So, now you know the approximate metrics about how the numbers work out. But what does that mean to a processor? Well, let's do some rough math to show where the architecture shines.
An Orochi die has 8 cores. Let's say, for sake of argument, that if we blew up the design and said not modules, only independent cores, we'd end up with about 6 cores.
Now let's compare the two with the assumption that all of the cores are independent on one and in modules on the other. For sake of argument we will assume that all cores scale identically and that all modules scale identically. The fact that incremental cores scale to something less than 100% is already comprehended in the 180% number, so don't fixate on that. In reality the 3rd core would not be at 95%, but we are holding that constant for the example.
Mythical 6-core Bulldozer:
100% + 95% + 95% + 95% + 95% + 95% = 575%
Orochi die with 4 modules:
180% + 180% + 180% + 180% = 720%
What if we had just done a 4 core and added HT (keeping in the same die space):
100% + 95% + 95% + 95% + 18% + 18% + 18% + 18% = 457%
What about a 6 core with HT (has to assume more die space):
100% + 95% + 95% + 95% + 95% + 95% + 18% + 18% + 18% + 18% + 18% + 18% = 683%
(Spoiler alert - this is a comparison using the same cores, do NOT start saying that there is a 25% performance gain over a 6-core Thuban, which I am sure someone is already starting to type.)
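The four totals above can be checked with a small sketch. It follows the simplifying assumptions stated in the post (every extra core adds 95%, every module is 180%, HT adds 18% per core); the helper names are invented for illustration, and this is a toy model of the arithmetic, not a performance model.

```python
# Per-unit figures from the post (percent of one core's throughput).
FIRST_CORE = 100   # the first core
EXTRA_CORE = 95    # each additional independent core (held constant)
MODULE = 180       # one 2-core module running 2 threads
HT_BONUS = 18      # extra throughput HT adds per core


def independent_cores(n):
    """Total for n independent cores, no HT."""
    return FIRST_CORE + EXTRA_CORE * (n - 1)


def modules(n):
    """Total for n modules, per the post's simplification."""
    return MODULE * n


def cores_with_ht(n):
    """Total for n independent cores, each with HT."""
    return independent_cores(n) + HT_BONUS * n


totals = {
    "Mythical 6-core, independent cores": independent_cores(6),
    "Orochi die with 4 modules": modules(4),
    "4 cores with HT": cores_with_ht(4),
    "6 cores with HT": cores_with_ht(6),
}

for name, total in totals.items():
    print(f"{name}: {total}%")

# Ratio of the 4-module die to the mythical 6-core design. This is a
# same-core comparison only, not a comparison against any real product.
ratio = modules(4) / independent_cores(6)
print(f"4 modules vs 6 independent cores: {ratio:.2f}x")
```

The ratio at the end is where the "25%" in the spoiler alert comes from, and the same warning applies to it.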
The reality is that by making the architecture modular and by sharing some resources you are able to squeeze more throughput out of the design than if you tried to use independent cores or tried to use HT. In the last example I did not take into consideration that the HT circuitry would have added an extra 5% circuitry overhead.
Every design has some degree of tradeoff involved, there is no free lunch. The goal behind BD was to increase core count and get more throughput. Because cores scale better than HT, it's the most predictable way to get there.
When you do the math on die space vs. throughput, you find that adding more cores is the best way to get to higher throughput. Taking a small hit on overall performance but having the extra space for additional cores is a much better tradeoff in my mind.
Nothing I have provided above would allow anyone to make a performance estimate of BD vs. either our current architecture or our competition, so, everyone please use this as a learning experience and do not try to make a performance estimate, OK?