Hello OCN.
The following "cheat sheet" can be used to quickly approximate the performance difference between popular desktop CPUs depending on how many threads the workload presents.
Execution Throughput is synonymous with Instruction Rate, IPS (Instructions Per Second), or Compute Performance in general. Software executes as a series of instructions/computations; the faster those instructions can be executed, the better the software performs. Not all software produces workloads that scale across an unlimited number of cores. As such, comparing CPU performance can be difficult when most benchmarks test either 1 core or all of the cores of a CPU, but nothing in-between. This "chart" is an attempt to bridge that gap and provide a quick-n-dirty reference for comparing CPU performance, with the primary variable being the number of threads the computing workload presents.
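If it helps to make "instruction rate" concrete, here is a minimal Python sketch of the relationship; the IPC and clock values are invented placeholders, not measurements of any CPU in the chart:

```python
# Minimal sketch: sustained instruction rate is average instructions-per-cycle
# times clock speed. Both numbers below are invented for illustration.
def instructions_per_second(avg_ipc, clock_hz):
    return avg_ipc * clock_hz

# Hypothetical core averaging 1.5 IPC at 3.0 GHz:
print(f"{instructions_per_second(1.5, 3.0e9):.2e} instructions/second")  # 4.50e+09
```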
The relative performance is a rough average derived from my own testing of some CPUs (K10, K15, Ivy/Haswell) and critical analysis of many 3rd-party sources (Passmark, OpenBenchmarking, and CPU reviews). Considerations of actual CPU architecture are also taken into account ( http://www.agner.org/optimize/microarchitecture.pdf ). The chart does not compare a specific workload; rather, it is an approximate average of many possible workloads, with the key variable being the number of threads the workload presents. The "scores" in the chart are only relevant and relative to other scores within this same chart. In some cases I have made good-faith estimates where I had access to little or no relative data to work with. The chart is not derived from raw synthetics, but from execution performance in actual software workloads like compiling, compressing, decompressing, transcoding/encoding, physics, rendering, science/math, CAD, image manipulation, gaming, etc. YMMV. Passmark and OpenBenchmarking have been heavily influential here in producing baselines upon which to build the chart.
The chart has been color coded as follows:
The green region is centered over the range most relevant to gaming and other real-time workloads, most CAD/engineering software (like Sketchup, AutoCAD, Solidworks), and popular photo manipulation software (GIMP/Photoshop).
The light blue, grey, and teal regions are centered over the range most relevant to non-real-time workloads with minimal scaling penalties: 3D rendering, transcoding, video toasting, compiling, compressing, decompressing, encryption, scientific research, distributed computing projects, mathematical research, etc.
The dark blue row is a theoretical estimate of relative performance in an enterprise-class workload encompassing very high numbers of active threads. This can give some indication of how well the CPU and associated platform will tolerate insane multi-tasking and/or server-class workloads (lots of active VMs, or high-traffic hosting), and assumes gobs of system memory to accommodate this theoretical workload. This particular row of entries needs to be considered with lots of room for error, as it would vary dramatically depending on the specifics of that brutal workload.
The execution throughput of the 3.0 GHz Pentium G3220 w/1 thread = 10. This is used as the baseline standard to generate the chart. SMT/CMT/Turbo scaling is included where applicable, and the chart assumes the host OS schedules properly to scale into SMT/CMT.
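For clarity, the normalization works like this (a toy sketch; the baseline value is arbitrary since only the ratios matter):

```python
# Toy sketch of how chart scores are normalized. BASELINE stands in for the
# measured throughput of the 3.0 GHz G3220 running 1 thread (arbitrary units).
BASELINE = 1.0

def chart_score(measured_throughput):
    """Scale a throughput so the G3220 single-thread result comes out as 10."""
    return 10.0 * measured_throughput / BASELINE

print(chart_score(1.0))  # the baseline CPU itself -> 10.0
print(chart_score(2.2))  # a CPU 2.2x faster at a given thread count -> 22.0
```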
The chart can be viewed in post #3 of this thread below, or by opening this gif image in another tab/window, or by downloading and viewing the attached pdf or spreadsheet.

AproximateRelativeExecutionThroughput.pdf 102k .pdf file
AproximateRelativeExecutionThroughput.xlsx 10k .xlsx file
Myths:
1. "New games are optimized for 8 core CPUs."
Software that has been written and compiled to scale into many threads is not "optimized" for many cores; it is simply able to scale into many cores. There is no significant penalty to execution throughput when running several threads on a core. The most important characteristic of a CPU's performance is the availability of execution throughput to the workload. Whether that workload is 2 threads or 20 threads running on a 1-core or 10-core CPU, the resulting performance will be based on the availability of execution throughput to that workload. Being able to use more cores, and being "optimized" for more cores, are two very different things. The latter is based on a fallacy.
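To illustrate the point, here is a toy Python model (my own invented numbers in chart-style score units, not measurements): what the workload sees is the throughput the CPU can make available to its thread count, regardless of how the cores are arranged.

```python
# Toy model: a workload with N threads can use at most N x the per-thread
# ceiling, capped by the chip's total throughput. All scores are invented.
def available_throughput(threads, per_thread, total):
    return min(threads * per_thread, total)

# Hypothetical 4-core with strong cores vs hypothetical 8-core with weak
# cores, both offering the same total throughput of 40:
for n in (1, 2, 4, 8, 16):
    strong = available_throughput(n, per_thread=10, total=40)
    weak = available_throughput(n, per_thread=5, total=40)
    print(f"{n:>2} threads: strong-4-core={strong:>2}  weak-8-core={weak:>2}")
```

Both chips tie once the workload saturates them; below that point, the stronger cores win, and the core count itself never enters into it.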
The advantage of many-core CPUs comes into play when the number of active threads is far beyond what any one or several pieces of desktop software would ever benefit from spawning. Any attempt to create a workload that pushes more and more into that "blue" area of the chart is only counterproductive to the performance of the software. No game will ever be engineered to spawn 200-300+ active threads; it would be an enormous waste of execution overhead. Having strong CPU performance in that "dark blue" part of the chart is great for workloads that present themselves that way because there is no alternative (enterprise conditions).
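A toy model of why oversubscription backfires (all constants invented): once the thread count passes the core count, extra threads add scheduling/synchronization overhead without adding any execution throughput.

```python
# Toy model: parallel speedup saturates at the core count, while per-thread
# overhead keeps growing. work and overhead_per_thread are invented constants.
def runtime(threads, cores=4, work=100.0, overhead_per_thread=0.05):
    compute = work / min(threads, cores)
    return compute + threads * overhead_per_thread

for n in (1, 2, 4, 8, 32, 128, 300):
    print(f"{n:>3} threads -> {runtime(n):6.2f} time units")
```

On this hypothetical 4-core, runtime bottoms out at 4 threads and only gets worse from there; 300 threads is markedly slower than 4.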
2. "Consoles have 8 cores now, therefore I should build an 8-core gaming computer to be future proof."
The 8 cores in a console would have a combined score of about "22" in the chart above (on the "8 threaded workload" line). Scheduling optimizations for console hardware will be stripped from software when it is ported. The only remaining consideration will be whether or not the CPU in the desktop has enough combined execution throughput for the game engine after scheduling optimizations are removed and overhead from the loss of HSA is factored in. Whether that is provided via 2 cores or 10 cores is going to be largely irrelevant. As you can see, most desktop CPUs provide more execution throughput than the "8 core" found in these consoles.
The CPU in the consoles is not technically an 8-core CPU; it's a dual quad-core implementation, much like an old Core 2 Quad, where two dual cores were strapped into the same package. This creates performance scaling limitations that require scheduling optimizations to work around. A Jaguar module also suffers MP scaling penalties because it shares an L2 cache across all 4 cores of the module; best performance for a given thread occurs when the thread is allowed to run solo on the module. Due to these scaling limitations of the platform, it is highly unlikely that console games are actually leveraging all 8 cores in a heavily saturated manner, as attempting to do so would come with too many penalties and overheads to be worthwhile. It's more likely that they are using scheduler optimizations to throw the toughest 1-2 threads on a Jaguar module by themselves, and using the other module for the remainder of the less demanding jobs. Actual saturation probably rarely exceeds 80% of available execution throughput.
By my best estimates, if you build a desktop with at least double the single-threaded performance (score of "8+" on the 1-thread workload line) and 50% more combined execution performance (score of 33+ on or below the 8-thread workload line) than the console, it will play console ports just fine. Modern i3's and A8/A10 or better CPUs will prove effective for most console ports. The requirement of at least "double" the single-threaded performance is to make up for scheduler optimization losses and overhead introduced by the removal of the HSA environment.
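Expressed as a quick sanity check in the chart's score units (the console reference scores of ~4 at 1 thread and ~22 at 8 threads come from my estimates above):

```python
# Rule-of-thumb checker in the chart's score units. The console reference
# numbers come from my estimates above (~4 at 1 thread, ~22 at 8 threads).
CONSOLE_1T = 4.0
CONSOLE_8T = 22.0

def should_handle_ports(desktop_1t, desktop_8t):
    """Double the console's single-thread score and 1.5x its 8-thread score."""
    return desktop_1t >= 2.0 * CONSOLE_1T and desktop_8t >= 1.5 * CONSOLE_8T

print(should_handle_ports(10.0, 35.0))  # i3-class example -> True
print(should_handle_ports(6.0, 40.0))   # weak single-thread -> False
```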
3. "Games don't use Hyper-threading / Hyperthreading sucks"
Hyper-threading is very misunderstood. Nearly any workload that does not bottleneck a single thread on the instruction decoders or execution resources of a core can benefit from hyper-threading if it scales into multiple threads, games included. The reason we do not observe performance scaling from hyper-threading in many desktop workloads (especially games) is simply that the workload is not simultaneously parallel and demanding enough to scale into the additional available execution throughput. Most observations are made between the i5 and i7, where the i5 is already providing as much execution throughput via parallelism as the software can leverage anyway. Remember, a real-time workload like a game engine is always technically being "throttled" by a timeline of events that must adhere to real-time. Even though gaming isn't considered by many to be a "true" real-time workload (like running a feedback loop on a precision machine or robot), it still must adhere to the same principles: a timeline and an order of operations and events. This makes "useful scaling" very difficult beyond a certain number of threads.
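A toy Amdahl-style sketch of that real-time ceiling (the serial/parallel split is invented): the serial portion of each frame caps the frame rate no matter how many threads you add.

```python
# Toy model: each frame has a serial part that can't be split across threads
# and a parallel part that can. The 8 ms / 12 ms split is invented.
def frames_per_second(serial_ms, parallel_ms, threads):
    frame_time_ms = serial_ms + parallel_ms / threads
    return 1000.0 / frame_time_ms

for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} threads -> {frames_per_second(8.0, 12.0, n):5.1f} fps")
```

Gains shrink fast, and no thread count can push this hypothetical engine past 125 fps.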
Observe the difference in performance between the i3 and Pentium in gaming workloads. HT scaling is very effective in gaming workloads when the availability of execution resources is cut in HALF (2 cores instead of 4 cores on the i5/i7). In fact, almost ALL games scale nicely into hyper-threading when the availability of execution resources forces the issue.
4. "CMT scaling is better than SMT scaling."
The available execution resources (pipelines) on any one cycle of a Haswell core are basically the same as the available execution resources on any one cycle of a Piledriver module. The number of "pipelines" that can have work scheduled on them simultaneously is effectively the same. (The IPC differences stem primarily from cache latency, cache bandwidth, different instruction penalties, average pipeline length, decoder/scheduling performance, instruction/vector size support, and instruction latency.)
The entirety of the execution resources in a Haswell core is available to a single thread, while only half of those execution resources are available to a single thread on a Piledriver module on any given cycle. When ONLY the scaling is observed as thread count increases or decreases, scaling appears better on CMT. This perspective requires a purposeful ignorance of the starting point of execution throughput to have merit. A far more useful way to observe SMT vs CMT scaling is how much performance is LOST when dropping from 2 to 1 threads on a core or module, as this is a far more relevant consideration for the actual performance of real desktop workloads. CMT has a lot more to lose when the thread count is reduced, because the availability of execution resources is also reduced. SMT does not suffer any loss in available execution resources when the thread count is reduced. From my perspective, SMT scaling is actually better than CMT scaling, because it provides a higher availability of execution throughput to a wider variety of workloads (highly threaded or not). CMT suffers far greater penalties in poorly threaded workloads.
In most cases where CMT shows large scaling and SMT shows minimal scaling, it is because SMT has already achieved decoder or execution resource saturation without needing another thread to get there. We "observe" scaling from CMT in these conditions because we are effectively "activating" unused resources that SMT was already leveraging before the second thread was thrown into the mix. In cases where workloads do not scale up on SMT, it is because the ideal saturation of execution, decoder, or scheduling resources has already occurred. It is not possible for a single thread to saturate the available execution, decoder, or scheduler resources of a Piledriver module; it is effectively "throttled" by the fact that half of its execution resources are forced to lie dormant until another thread is spun up on the module.
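Here is the whole argument as a toy model (pool size and utilization factor are invented, not Haswell/Piledriver measurements): give both designs the same per-cycle resource pool, let a lone thread keep only a fraction of whatever it can see busy, and compare.

```python
# Toy model: both designs expose the same per-cycle resource pool (1.0).
# A lone thread keeps only UTIL of the resources visible to it busy.
# Both constants are invented for illustration.
POOL = 1.0
UTIL = 0.7

def smt_throughput(threads):
    # one thread sees the whole pool; two threads together approach saturation
    return min(threads * UTIL * POOL, POOL)

def cmt_throughput(threads):
    # each thread is hard-partitioned onto half the module's pool
    return min(threads, 2) * UTIL * (POOL / 2)

for n in (1, 2):
    print(f"{n} thread(s): SMT={smt_throughput(n):.2f}  CMT={cmt_throughput(n):.2f}")
```

In this sketch CMT "scales" 2.0x versus SMT's 1.43x, yet SMT delivers more absolute throughput at both thread counts; the bigger CMT ratio only reflects the lower single-thread starting point.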
Trying to characterize the premise of CMT "scaling" as an upward benefit is not much different from trying to characterize cylinder deactivation as a horsepower advantage. It isn't one. The problem is that so many of these concepts are very cerebral and simultaneously disconnected (we can't get our hands on them or visualize them as part of a working mechanical system), so it's very easy to get lost down a rabbit hole of broken relativity and useless perspectives.
The best way to "see" why I prefer the perspective of CMT as a penalty, and not an upward scaling benefit, is to study the execution throughput chart. When comparing an equal number of SMT cores vs CMT modules, the SMT solution offers the highest availability of execution throughput to the widest variety of workloads, highly threaded or not.
5. "You're an Intel fanboy, obviously! Quit bashing AMD, you hater!"
One conclusion you might draw from reading this post and studying the chart is that I am here to "prove" that Intel is better than AMD. The chart certainly shows that, at this time, Intel is offering a lot of CPUs at competitive prices that deliver higher execution performance in the "sweet spot" for most mainstream desktop builders.
Consider the inverse: if I "knew" this "stuff" about CPUs and chose NOT to share it, then by any standard that makes me an Intel fanboy for sharing it, I would be an AMD fanboy for choosing to sweep it under the rug and keep it to myself. The chart, and my "mythbusters," would be the same whether under a rug or here on the forum.
My concern is CPU vs CPU: performance, features, cost. There's more to a CPU than the chart in this thread; don't forget about support for specialized instructions, or platform capabilities like IOMMU. At this time, Intel solutions are better for 80-90% of desktop builds due to the intended use of those builds and the performance characteristics of the Intel solutions available to fit their budgets. This doesn't mean that Intel is "better"; it means that Intel CPUs are better for most applications for the money right now. Intel could have run over my dog and I would still have to admit that their CPUs are the better solution for most builds right now.
Enjoy and happy tuning/building!
Eric