
Premium Member · 4,265 Posts · Discussion Starter · #1
Hello OCN.

The following "cheat sheet" can be used to quickly approximate the performance difference between popular desktop CPUs depending on the number of threads the workload presents as.

Execution throughput is synonymous with instruction rate, or IPS (instructions per second), or compute performance in general. Software executes as a series of instructions/computations; the faster those instructions can be executed, the better the software performs. Not all software produces workloads that scale into an unlimited number of cores. As such, comparing CPU performance can be difficult when most benchmarks test either 1 core or all of the cores of a CPU, but nothing in-between. This "chart" is an attempt to bridge that gap and provide a quick-n-dirty reference for comparing CPU performance with the primary variable being the number of threads the computing workload presents as.

The relative performance is a rough average derived from my own testing of some CPUs (K10, K15, Ivy/Haswell) and critical analysis of many 3rd-party sources (PassMark, OpenBenchmarking, and CPU reviews). Considerations of actual CPU architecture are also taken into account ( http://www.agner.org/optimize/microarchitecture.pdf ). The chart does not compare a specific workload; rather, it is an approximate average of many possible workloads, with the key variable being the number of threads the workload presents. The "scores" in the chart are only relevant and relative to other scores within this same chart. In some cases I have made good-faith estimates where I had access to little or no relative data to work with. The chart is not derived from raw synthetics, but from execution performance in actual software workloads like compiling, compressing, decompressing, transcoding/encoding, physics, rendering, science/math, CAD, image manipulation, gaming, etc. YMMV. PassMark and OpenBenchmarking have been heavily influential here in producing baselines upon which to build the chart.

The chart has been color coded as follows:

The green region is centered over the most relevant region for consideration in gaming and other real-time workloads, most CAD/engineering software (like Sketchup, AutoCAD, Solidworks), and popular photo-manipulation software (GIMP/Photoshop).

The light blue, grey, and teal regions are centered over the most relevant region for consideration in non-real-time workloads with minimal scaling penalties: 3D rendering, transcoding, video toasting, compiling, compressing, decompressing, encryption, scientific research, distributed computing projects, mathematical research, etc.

The dark blue row is a theoretical estimate of relative performance in an enterprise-class workload encompassing very high numbers of active threads. This can give some indication of how well the CPU and associated platform will tolerate insane multi-tasking and/or server-class workloads (lots of active VMs, or high-traffic hosting), and it assumes gobs of system memory to accommodate this theoretical workload. This particular row of entries needs to be considered with lots of room for error, as results would vary dramatically depending on the specifics of that brutal workload.

The execution throughput of the 3.0GHZ Pentium G3220 w/1 thread = 10. This is the baseline standard used to generate the chart. SMT/CMT/Turbo scaling is included where applicable, and the chart assumes the host OS schedules threads properly to scale into SMT/CMT.
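
To make the scores concrete, here is how I read two columns from the Haswell table in post #3 (the snippet itself is mine and purely illustrative):

Code:
# Relative execution throughput by workload thread count, copied from
# the G3220 and i7-4790K columns of the Haswell table (G3220 1T = 10).
g3220    = {1: 10, 2: 20, 4: 20, 8: 18, 16: 16}
i7_4790k = {1: 16, 2: 32, 4: 65, 8: 81, 16: 80}

for threads in (1, 4, 16):
    ratio = i7_4790k[threads] / g3220[threads]
    print(f"{threads:2d}-thread workload: i7-4790K is ~{ratio:.1f}x the G3220")
# -> ~1.6x single-threaded, ~3.2x at 4 threads, ~5.0x at 16 threads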

The chart can be viewed in post #3 of this thread below, or by opening this gif image in another tab/window, or by downloading and viewing the attached pdf or spreadsheet.


AproximateRelativeExecutionThroughput.pdf 102k .pdf file


AproximateRelativeExecutionThroughput.xlsx 10k .xlsx file


Myths:

1. "New games are optimized for 8 core CPUs."

Software that has been written and compiled to scale into many threads is not "optimized" for many cores; it is simply able to scale into many cores. There is no significant penalty to execution throughput when running several threads on a core. The most important characteristic of a CPU's performance is the availability of execution throughput to the workload. Whether that workload is 2 threads or 20 threads running on a 1-core or 10-core CPU, the resulting performance will be based on the availability of execution throughput to that workload. Being able to use more cores, and being "optimized" for more cores, are two very different things. The latter is based on a fallacy.

The advantage of many-core CPUs comes into play when the number of active threads is far beyond what any one or several pieces of desktop software would ever benefit from spawning. Any attempt to create a workload that pushes more and more into the "blue" area of the chart is only counterproductive to the performance of the software. No game will ever be engineered to spawn 200-300+ active threads; it would be an enormous waste of execution overhead. Having strong CPU performance in that "dark blue" part of the chart is great for workloads that present themselves that way because there is no alternative (enterprise conditions).
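
As a quick sanity check of the "availability of execution throughput" idea, here is a sketch you can run yourself (mine, not part of the chart methodology): a fixed amount of work split across more and more worker processes stops getting faster once the CPU's throughput is saturated; extra threads beyond that point add nothing.

Code:
import time
from multiprocessing import Pool

def spin(n):
    # burn CPU: a simple integer loop standing in for "work"
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    TOTAL = 40_000_000
    for workers in (1, 2, 4, 8, 32):
        start = time.perf_counter()
        with Pool(workers) as pool:
            pool.map(spin, [TOTAL // workers] * workers)
        print(f"{workers:3d} workers: {time.perf_counter() - start:.2f}s")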

2. "Consoles have 8 cores now, therefor I should build an 8 core gaming compute to be future proof."

The 8 cores in a console would have a combined score of about "22" in the chart above (on the 8-thread workload line). Scheduling optimizations for console hardware will be stripped from software when it is ported. The only remaining consideration will be whether or not the CPU in the desktop has enough combined execution throughput for the game engine after scheduling optimizations are removed and overhead from the loss of HSA is factored in. Whether that is provided via 2 cores or 10 cores is going to be largely irrelevant. As you can see, most desktop CPUs provide more execution throughput than the "8-core" found in these consoles.

The CPU in the consoles is not technically an 8-core CPU; it's a dual-quad implementation, much like an old Core 2 Quad, where two dual-core dies were strapped into the same package. This creates performance scaling limitations that require scheduling optimizations to work around. A Jaguar module also suffers MP scaling penalties because it shares an L2 cache across all 4 cores of the module; best performance for a given thread occurs when the thread is allowed to run solo on the module. Due to these scaling limitations of the platform, it is highly unlikely that console games are actually leveraging all 8 cores in a heavily saturated manner, as attempting to do so would come with too many penalties and overheads to be worthwhile. It's more likely that scheduler optimizations throw the toughest 1-2 threads onto a Jaguar module by themselves and use the other module for the remainder of the less demanding jobs. Actual saturation probably rarely exceeds 80% of available execution throughput.

By my best estimates, if you build a desktop with at least double the single-threaded performance of the console (a score of "8+" on the 1-thread workload line) and 50% more combined execution performance (a score of 33+ on or below the 8-thread workload line), it will play console ports just fine. Modern i3s and A8/A10 or better CPUs will prove effective for most console ports. The reason for requiring at least "double" the single-threaded performance is to make up for scheduler-optimization losses and the overhead introduced by the removal of the HSA environment.
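
That rule of thumb reduces to a one-line check against the chart (the thresholds come straight from this post; the function name is mine, for illustration):

Code:
CONSOLE_1T = 4    # approximate console score on the 1-thread line
CONSOLE_8T = 22   # approximate combined console score on the 8-thread line

def ok_for_console_ports(score_1t, score_8t):
    # double the single-threaded score, 1.5x the combined score
    return score_1t >= 2 * CONSOLE_1T and score_8t >= 1.5 * CONSOLE_8T

# e.g. the i5-4440S column: 11 on the 1-thread line, 37 on the 8-thread line
print(ok_for_console_ports(11, 37))  # True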

3. "Games don't use Hyper-threading / Hyperthreading sucks"

Hyper-threading is very misunderstood. Nearly any workload that does not bottleneck a single thread on the instruction decoders or execution resources of a core can benefit from hyper-threading if it scales into multiple threads, games included. The reason we do not observe performance scaling from hyper-threading in many desktop workloads (especially games) is simply that the workload is not simultaneously parallel and demanding enough to scale into the additional available execution throughput. Most observations are made between the i5 and i7, where the i5 is already providing as much execution throughput via parallelism as the software can leverage anyway. Remember, a real-time workload like a game engine is always technically being "throttled" by a timeline of events that must adhere to real-time. Even though gaming isn't considered by many to be a "true" real-time workload (like running a feedback loop on a precision machine or robot), it still must adhere to the same principles: a timeline and an order of operations and events. This makes "useful scaling" very difficult beyond a certain number of threads.

Observe the difference in performance between the i3 and the Pentium in gaming workloads. HT scaling is very effective in gaming workloads when the availability of execution resources is cut in HALF (2 cores instead of the 4 cores on the i5/i7). In fact, almost ALL games scale nicely into hyper-threading when the availability of execution resources forces the issue.
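
If you want to force the issue on your own machine, CPU affinity is one way to do it. A Linux-only sketch (mine): it ASSUMES logical CPUs 0 and 4 are the two hardware threads of one physical core, which you must verify in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list before trusting the numbers.

Code:
import os, time
from multiprocessing import Pool

def spin(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_two_jobs(cpus, work=20_000_000):
    os.sched_setaffinity(0, cpus)  # worker processes inherit this mask
    start = time.perf_counter()
    with Pool(2) as pool:
        pool.map(spin, [work, work])
    return time.perf_counter() - start

if __name__ == "__main__":
    print("one core, 2 threads via HT:", run_two_jobs({0, 4}))
    print("two full cores:            ", run_two_jobs({0, 1}))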

4. "CMT scaling is better than SMT scaling."

The available execution resources (pipelines) on any one cycle of a Haswell core are basically the same as the available execution resources on any one cycle of a Piledriver module. The number of "pipelines" that can have work scheduled on them simultaneously is effectively the same. (The IPC differences stem primarily from cache latency, cache bandwidth, different instruction penalties, average pipeline length, decoder/scheduling performance, instruction/vector size support, and instruction latency.)
The entirety of the execution resources in a Haswell core is available to a single thread, while only half of the execution resources of a Piledriver module are available to a single thread on any given cycle. When ONLY the scaling is observed as thread count increases or decreases, scaling appears better on CMT. This perspective requires a purposeful ignorance of the starting point of execution throughput to have merit.

A far more useful way to observe SMT vs CMT scaling is how much performance is LOST when dropping from 2 threads to 1 on a core or module, as this is a far more relevant consideration for the actual performance of real desktop workloads. CMT has a lot more to lose when the thread count is reduced, because the availability of execution resources is also reduced. SMT does not suffer any loss in available execution resources when the thread count is reduced. From my perspective, SMT scaling is actually better than CMT scaling, because it provides a higher availability of execution throughput to a wider variety of workloads (highly threaded or not). CMT suffers far greater penalties on poorly threaded workloads.

In most cases where CMT shows large scaling and SMT shows minimal scaling, it is because SMT has already achieved decoder or execution-resource saturation without the need of another thread to get there. We "observe" scaling from CMT in these conditions because we are effectively "activating" unused resources that SMT was already leveraging before the second thread was thrown into the mix. In cases where workloads do not scale up on SMT, it is because the ideal saturation of execution, decoder, or scheduling resources has already occurred. It is not possible for a single thread to saturate the available execution, decoder, or scheduler resources of a Piledriver module; it is effectively "throttled" by the fact that half of its execution resources are forced to lie dormant until another thread is spun up on the module.

Trying to characterize the premise of CMT "scaling" as an upward benefit is not much different than trying to characterize cylinder deactivation as having horsepower advantages. It doesn't. The problem is that so many of these concepts are cerebral and simultaneously disconnected (we can't get our hands on them or visualize them as part of a working mechanical system), so it's very easy to get lost down a rabbit hole of broken relativity and useless perspectives.

The best way to "see" why I prefer the perspective of CMT as a penalty, and not an upward scaling benefit, is to study the execution throughput chart. When comparing an equal number of SMT cores vs CMT modules, the SMT solution offers the highest availability of execution throughput to the widest variety of workloads, highly threaded or not.
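
Putting numbers on the "what is lost dropping from 2 threads to 1" framing, using the ideal scaling constants given later in this thread (+25% for a second thread on an SMT core, +80% on a CMT module), each relative to the design's own single-thread throughput:

Code:
smt_1t, smt_2t = 1.00, 1.25  # SMT core: second thread adds ~25%
cmt_1t, cmt_2t = 1.00, 1.80  # CMT module: second thread adds ~80%

print(f"SMT loses {1 - smt_1t / smt_2t:.0%} going from 2 threads to 1")  # 20%
print(f"CMT loses {1 - cmt_1t / cmt_2t:.0%} going from 2 threads to 1")  # 44%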

5. "You're an Intel fanboy, obviously! Quit bashing AMD, you hater!"

One might conclude from reading this post and studying the chart that I am here to "prove" that Intel is better than AMD. The chart certainly shows that, at this time, Intel is offering a lot of CPUs at competitive prices that deliver higher execution performance in the "sweet spot" for most mainstream desktop builders.

Consider the inverse: if I "knew" this "stuff" about CPUs and chose NOT to share it, then by any standard that makes me an Intel fanboy for sharing it, I would be an AMD fanboy for choosing to sweep it under the rug and keep it to myself. The chart, and my "mythbusting," would be the same whether under a rug or here on the forum.

My concern is CPU vs CPU: performance, features, cost. There's more to a CPU than the chart in this thread. Don't forget about support for specialized instructions, or platform capabilities like IOMMU. At this time, Intel solutions are better for 80-90% of desktop builds, due to the intended use of those builds and the performance characteristics of the Intel solutions available within their budgets. This doesn't mean that Intel is "better"; it means that Intel CPUs are better for most applications for the money right now. Intel could have run over my dog and I would still have to admit that their CPUs are the better solution for most builds right now.

Enjoy and happy tuning/building!
Eric
 


Overclock Failed... · 13,565 Posts
Wow. Great work.

Would you post the spreadsheet?
I like to look at stuff graphically.
 

Premium Member · 4,265 Posts · Discussion Starter · #3
Workload (threads) | G3220 3.0GHZ | G3258 4.8GHZ | i3-4150 3.5GHZ | i5-4440S 2.8/3.3GHZ | i5-4590 3.3/3.7GHZ | i5-4670K 4.4GHZ | i5-4690K 4.8GHZ | E3-1271V3 3.6/4.0GHZ | i7-4770K 4.4GHZ | i7-4790K 4.8GHZ
256 | 10 | 16 | 19 | 28 | 33 | 44 | 48 | 52 | 63 | 69
16 | 16 | 26 | 26 | 36 | 43 | 57 | 63 | 60 | 74 | 80
12 | 17 | 27 | 27 | 37 | 43 | 58 | 63 | 60 | 74 | 81
8 | 18 | 29 | 28 | 37 | 43 | 58 | 63 | 61 | 74 | 81
7 | 19 | 30 | 28 | 37 | 44 | 58 | 63 | 58 | 70 | 77
6 | 19 | 30 | 29 | 37 | 44 | 58 | 63 | 55 | 67 | 73
5 | 19 | 31 | 29 | 37 | 44 | 58 | 64 | 52 | 63 | 69
4 | 20 | 31 | 29 | 37 | 44 | 59 | 65 | 48 | 60 | 65
3 | 20 | 32 | 26 | 29 | 35 | 44 | 48 | 38 | 45 | 49
2 | 20 | 32 | 23 | 21 | 24 | 30 | 32 | 26 | 30 | 32
1 | 10 | 16 | 12 | 11 | 12 | 15 | 16 | 14 | 15 | 16

Workload (threads) | G2020 2.9GHZ | i3-3220 3.3GHZ | i5-3470 3.2/3.6GHZ | i5-3570K 4.4GHZ | i7-3770 3.4/3.9GHZ | i7-3770K 4.4GHZ | E5-1650V2 3.5/3.9GHZ | i7-4930K 4.8GHZ | E5-1650 3.2/3.8GHZ | i7-3930K 4.8GHZ
256 | 9 | 16 | 29 | 39 | 44 | 57 | 75 | 103 | 65 | 98
16 | 14 | 22 | 37 | 52 | 51 | 66 | 79 | 109 | 69 | 103
12 | 15 | 23 | 38 | 52 | 51 | 67 | 79 | 109 | 69 | 104
8 | 16 | 23 | 38 | 52 | 51 | 67 | 69 | 94 | 60 | 90
7 | 16 | 24 | 38 | 52 | 49 | 63 | 66 | 91 | 58 | 86
6 | 17 | 24 | 38 | 53 | 46 | 60 | 64 | 87 | 57 | 83
5 | 17 | 24 | 38 | 53 | 44 | 57 | 55 | 73 | 49 | 69
4 | 17 | 25 | 38 | 53 | 41 | 53 | 45 | 58 | 40 | 55
3 | 17 | 22 | 30 | 40 | 34 | 40 | 34 | 44 | 31 | 41
2 | 17 | 20 | 21 | 26 | 23 | 27 | 24 | 29 | 21 | 28
1 | 9 | 10 | 11 | 13 | 12 | 13 | 12 | 15 | 11 | 14

Workload (threads) | G640 2.8GHZ | i3-2100 3.1GHZ | i5-2400 3.1/3.4GHZ | 2500K 4.8GHZ | i7-2600 3.4/3.8GHZ | i7-2700K 4.8GHZ | i7-970 3.2/3.46GHZ | 970 4.0GHZ | i5-750 2.66/3.2GHZ | C2Q 8400 2.66GHZ
256 | 8 | 15 | 27 | 41 | 42 | 59 | 49 | 61 | 17 | 13
16 | 13 | 20 | 35 | 54 | 48 | 68 | 54 | 67 | 22 | 19
12 | 14 | 21 | 35 | 54 | 49 | 69 | 55 | 67 | 23 | 20
8 | 15 | 21 | 35 | 55 | 49 | 69 | 47 | 58 | 23 | 21
7 | 15 | 22 | 35 | 55 | 46 | 66 | 45 | 56 | 23 | 21
6 | 15 | 22 | 36 | 55 | 44 | 62 | 43 | 54 | 23 | 22
5 | 15 | 22 | 36 | 55 | 41 | 59 | 36 | 45 | 23 | 22
4 | 16 | 22 | 36 | 55 | 39 | 55 | 30 | 36 | 23 | 22
3 | 16 | 20 | 28 | 41 | 30 | 41 | 23 | 27 | 18 | 18
2 | 16 | 18 | 19 | 28 | 21 | 28 | 16 | 18 | 13 | 13
1 | 8 | 9 | 10 | 14 | 11 | 14 | 8 | 9 | 7 | 7

Workload (threads) | A10-7700K 3.4/3.8GHZ | A10-7850K 4.4GHZ | Sempron 3850 1.3GHZ | Athlon 5350 2.05GHZ | A4-6300 3.7/3.9GHZ | A6-6400K 4.8GHZ | A8-6500 3.5/4.1GHZ | A10-6800K 4.8GHZ | FX-4300 3.8/4.0GHZ | FX-4350 4.8GHZ
256 | 22 | 28 | 3 | 4 | 3 | 4 | 15 | 21 | 21 | 26
16 | 29 | 38 | 7 | 9 | 8 | 10 | 21 | 29 | 25 | 32
12 | 30 | 39 | 8 | 11 | 9 | 11 | 22 | 30 | 25 | 32
8 | 30 | 39 | 9 | 13 | 10 | 12 | 23 | 31 | 25 | 32
7 | 30 | 39 | 10 | 13 | 10 | 13 | 23 | 32 | 26 | 32
6 | 31 | 40 | 10 | 14 | 10 | 13 | 23 | 32 | 26 | 33
5 | 31 | 40 | 10 | 14 | 11 | 14 | 23 | 32 | 26 | 33
4 | 31 | 40 | 10 | 14 | 11 | 14 | 23 | 32 | 26 | 33
3 | 24 | 30 | 8 | 11 | 11 | 14 | 18 | 25 | 20 | 26
2 | 17 | 20 | 6 | 8 | 11 | 15 | 15 | 18 | 15 | 18
1 | 9 | 10 | 3 | 5 | 7 | 9 | 8 | 9 | 8 | 9

Workload (threads) | FX-6300 3.5/4.1GHZ | FX-6350 4.8GHZ | FX-8320 3.5/4.0GHZ | FX-8320 4.4GHZ | FX-8350 4.0/4.2GHZ | FX-8350 4.8GHZ | FX-9590 4.7/5.0GHZ | FX-9590 5.2GHZ | FX-4100 3.6/3.8GHZ | FX-4170 4.8GHZ
256 | 30 | 42 | 43 | 54 | 49 | 59 | 58 | 64 | 17 | 23
16 | 35 | 48 | 47 | 59 | 54 | 65 | 64 | 71 | 21 | 28
12 | 35 | 49 | 47 | 59 | 55 | 65 | 64 | 71 | 21 | 28
8 | 36 | 49 | 48 | 60 | 55 | 66 | 64 | 71 | 22 | 29
7 | 36 | 49 | 42 | 53 | 49 | 58 | 57 | 63 | 22 | 29
6 | 36 | 49 | 37 | 47 | 43 | 51 | 50 | 55 | 22 | 29
5 | 31 | 42 | 32 | 40 | 37 | 44 | 43 | 48 | 22 | 29
4 | 25 | 34 | 27 | 34 | 31 | 36 | 36 | 40 | 22 | 29
3 | 20 | 27 | 22 | 25 | 25 | 27 | 27 | 30 | 17 | 23
2 | 15 | 18 | 15 | 17 | 16 | 18 | 19 | 20 | 13 | 16
1 | 8 | 9 | 8 | 8 | 8 | 9 | 10 | 10 | 6 | 8

Workload (threads) | FX-6100 3.3/3.9GHZ | FX-6200 4.4GHZ | FX-8120 3.1/4.0GHZ | FX-8150 4.8GHZ | P II X4 955 3.2GHZ | P II X4 970 4.0GHZ | P II 1055 2.8/3.3GHZ | P II 1090T 4.0GHZ | A II X4 640 3.0GHZ | A10-3870K 3.6GHZ
256 | 26 | 34 | 34 | 53 | 20 | 24 | 29 | 41 | 15 | 19
16 | 30 | 39 | 38 | 58 | 26 | 32 | 34 | 48 | 21 | 25
12 | 30 | 39 | 38 | 58 | 26 | 32 | 34 | 48 | 22 | 26
8 | 30 | 40 | 38 | 58 | 26 | 32 | 34 | 48 | 22 | 27
7 | 30 | 40 | 34 | 52 | 26 | 32 | 34 | 49 | 23 | 27
6 | 30 | 40 | 31 | 45 | 26 | 32 | 34 | 49 | 23 | 28
5 | 26 | 34 | 28 | 39 | 26 | 32 | 28 | 40 | 23 | 28
4 | 21 | 28 | 25 | 32 | 26 | 32 | 23 | 32 | 24 | 28
3 | 18 | 22 | 19 | 24 | 20 | 24 | 18 | 24 | 18 | 21
2 | 13 | 15 | 13 | 16 | 13 | 16 | 13 | 16 | 12 | 14
1 | 7 | 7 | 7 | 8 | 7 | 8 | 7 | 8 | 6 | 7
 

Overclock Failed... · 13,565 Posts
No, I have the graphic, but I'm just too lazy to go through OneNote > Word > Excel to pull the numbers out of it.
Please post the Excel file as an attachment, if you would.

Thanks.

 

Premium Member · 4,265 Posts · Discussion Starter · #5
Excel, right
wink.gif


See link in first post for the updated spreadsheet.

I understand now what you mean by "graphically."

You mean you want to CHART/GRAPH it. I was like, waaaa? How is a PNG not graphical?

I hadn't even thought of that. Good idea.

Well now it's on here as a PNG, an HTML spreadsheet, AND an xlsx. Sweet.
 

Premium Member · 65,162 Posts
Quote:
SMT/CMT/Turbo scaling is included where applicable and the chart assumes proper scheduling of the host OS to scale into SMT/CMT properly.
What's the workload you are using? This has a major impact on the benefits of SMT/CMT.
 

Overclock Failed... · 13,565 Posts
Quote:
Originally Posted by mdocod

Excel, right
wink.gif


Well now it's on here as a PNG, an HTML spreadsheet, AND an xlsx. Sweet.
U Da Man!
 

Premium Member · 4,265 Posts · Discussion Starter · #8
Hi DuckieHo,

The relative performance is determined using a constant assumed ideal scaling for SMT/CMT. The chart is built using +25% and +80% scaling for SMT and CMT, respectively, when going from 1 thread per core/module to 2 threads per core/module. My intention was to give each technology a decent representation. There will be cases where the constants used to build this chart are far from accurate if specific workloads are considered.
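
For anyone curious how constants like these turn into chart rows, here is a minimal sketch of that kind of model (my reconstruction for illustration, not the actual spreadsheet formulas):

Code:
def approx_score(threads, units, unit_1t, second_thread_gain):
    """units = cores (SMT) or modules (CMT); unit_1t = relative score of
    one unit running one thread; gain = 0.25 (SMT) or 0.80 (CMT)."""
    first = min(threads, units) * unit_1t            # one thread per unit
    extra = max(0, min(threads, 2 * units) - units)  # second threads, if any
    return first + extra * second_thread_gain * unit_1t

# e.g. a 4-core SMT CPU whose single-core score is 15:
for n in (1, 2, 4, 8):
    print(n, approx_score(n, units=4, unit_1t=15, second_thread_gain=0.25))
# -> 15, 30, 60, 75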

Regards,
Eric
 

Premium Member · 65,162 Posts
Quote:
Originally Posted by mdocod

Hi DuckieHo,

The relative performance is determined using a constant assumed ideal scaling for SMT/CMT . The chart is built using +30% and +80% scaling for SMT and CMT respectively when going from 1 thread per core/module to 2 threads per core/module. My intention was to give each technology a decent representation. There will be cases where the constant used to build this chart will be far from accurate if specific workloads are considered.

Regards,
Eric
So what is the workload? Does it stress FP32, FP64, INT, L1/L2 cache, etc?

You are using an arbitrary scaling factor for SMT and CMT? It would be vastly more useful to provide the raw values and note the potential of scaling.
 

Premium Member · 4,604 Posts
Quote:
Originally Posted by mdocod

Software that has been written and compiled to scale into many threads is not "optimized" for many cores, it is simply able to scale into many cores. There is no significant penalty to execution throughput when running several threads on a core. The only significant important characteristic of a CPU's performance is the availability of execution throughput to the workload. Whether that workload is 2 threads or 20 threads running on a 1 core or 10 core CPU, the resulting performance will be based on the availability of execution throughput to that workload.
You're probably right, but I think the nature of the software is also relevant and influences the result. I am a bit familiar with x264, for instance. I did my own experiments with thread count in the past, and it's known that it has an impact on performance.

Examples:

http://forum.doom9.org/showthread.php?t=146667

http://forum.doom9.org/showthread.php?t=166729

Even a 0.5 fps difference is considerable in a time-consuming process like HD video encoding. I am too lazy to repeat it again, but I had seen differences myself on a 1090T using this (which is obsolete now, but is the easiest GUI-enabled software for changing the thread count manually):

http://sourceforge.net/projects/asxgui/

See also here how the FX performs better in some multithreaded software than in others:

http://www.anandtech.com/bench/product/287?vs=698

(x264 is widely accepted as one of the real-world (non-benchmark) programs that scales as perfectly as possible with cores and has no favourable coding bias toward AMD. The dev himself uses Intel.) At some point, I remember clearly, x264 had a maximum thread count of 128, but the "auto" setting never uses anything near that.

Observe also how Dragon Age: Origins (which on my FX-6300 hits 100% load on all 6 cores) shows the FX lagging behind the Intel, while the FX beats the Intel in 7-Zip and x264. Scaling differs from program to program, according to the nature of the software and the dev. It seems game developers don't want to sweat much to make their games scale well... and Dragon Age is actually an example of laudable effort. Most games simply don't care to do a good job in programming, so they scale poorly. What AMD fans have been complaining about is exactly the fact that games don't seem to be coded to scale well (as x264 or 7-Zip are), with the result that even in most games AMD performs much worse than Intel.

I think that in general, you are right. However, there are certain limitations inherent to each piece of software that may cause different behaviour.

We are probably saying the same thing, but my English isn't of a high enough level to follow your wording properly.
 

Premium Member · 4,265 Posts · Discussion Starter · #11
Hi DuckieHo,

There isn't a specific workload. The constants used to produce the chart are average ideals observed from the study of many workloads/benches, etc. Think "cheat sheet." The chart answers a "different" question than the one you are proposing. We could build a chart like it for every conceivable workload, but then it would no longer be a useful fast reference for articulating a point. The point I intend to articulate and illustrate here is that CPU performance cannot always be compared based on an "ideal" where all available threads/cores are saturated. Real-world software doesn't work like that. We see benchmarks for single-threaded workloads and unlimited-thread workloads, but rarely does anyone take the time to show how that pans out in-between. Just because a workload can scale to 2 threads doesn't always mean it scales to 4, or 8, or 12 just as well. The chart above allows one to get a rough idea of the relative performance of a CPU based on the number of threads, not based on what work those threads are doing. The chart is not intended to be the answer to "all" questions; it is intended to be an answer to one question.

Example scenario:

The FX-9590 scores about the same in PassMark as the i7-4790; thus, it's just as good. Or is it?

In actual fact, the i7 offers more execution throughput to a wider assortment of real-world workloads than the FX-9590. This can be difficult to articulate. The chart above can be used to do this.
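
The chart columns make the point numerically (i7-4790K column standing in for the 4790, FX-9590 5.2GHZ column; the snippet is mine, purely illustrative):

Code:
# Chart scores by workload thread count, copied from post #3.
fx_9590  = {1: 10, 2: 20, 4: 40, 8: 71, 16: 71}
i7_4790k = {1: 16, 2: 32, 4: 65, 8: 81, 16: 80}

for t in (1, 2, 4, 8, 16):
    print(f"{t:2d} threads: i7 = {i7_4790k[t] / fx_9590[t]:.2f}x the FX")
# -> ~1.6x at 1-4 threads, but only ~1.1x at 8-16 threads, which is why
#    all-cores-saturated synthetics make the two look comparable.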

Hi undervolter,

I believe we are saying the same thing. The purpose of the chart is to help illustrate that in the real world, since lots of real software doesn't scale to many threads (especially things like game engines), comparing CPUs based on their performance in the "blue/grey" part of the chart (like comparing raw benchmarks of the total combined IPS of all cores/threads on a CPU) can be highly misleading if the comparison is not well understood. The green region on the chart highlights the reason we don't see i7s outperforming i5s in most games. The fact that the i7 shows no advantage over the i5 in games is often mistakenly used to conclude that hyper-threading doesn't work for gaming workloads, when in actual fact the reason the i7 doesn't produce better results than the i5 is that there isn't enough work on enough threads for any scaling to be observed. Meanwhile, the i3 consistently outperforms Pentiums in many of the same games that show no scaling from i5 to i7. The i3 derives almost ALL of its scaling from hyper-threading in these conditions.
 

Premium Member · 4,265 Posts · Discussion Starter · #13
Thanks Internet Swag,

So many myths, which to choose
wink.gif
CPU/GPU relationships? CPU core design? Hyper-threading? TDP ratings? Bottlenecks? These seem to be popular areas of confusion, and they could all potentially find some relevance here. Perhaps I'll work on some more "myth busting" as it pertains to the goals of this thread. I can get carried away (long-winded), but I'd like to maintain a sort of simplicity if possible.
 

Registered · 254 Posts
This is probably an odd question, but how important is GHz for CPU speed, and does overclocking provide a great speed increase? Or is it more about the smaller die sizes? How small can die sizes get? I don't even know why things like 22nm are important.

I should google some of this too heh.
 

Premium Member · 4,265 Posts · Discussion Starter · #15
Execution throughput is directly proportional to clock speed, but not all CPU architectures achieve the same number of instructions per cycle. Overclocking the CPU speeds up CPU-bound tasks proportionally to the overclock.

Die size has actually remained relatively similar over the years, but the fabrication process has changed to pack in more and more transistors. A smaller process (22nm vs., for example, 32nm) can pack more transistors into a given area of silicon.
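
As a back-of-envelope formula: throughput is roughly IPC x clock, so an overclock raises CPU-bound performance linearly, and a higher-IPC architecture can match it at a lower clock (the IPC numbers below are invented for illustration):

Code:
def rel_throughput(ipc, clock_ghz):
    return ipc * clock_ghz  # relative instructions per second

base = rel_throughput(2.0, 3.5)
print(rel_throughput(2.0, 4.5) / base)  # ~1.29x from a 3.5 -> 4.5GHz overclock
print(rel_throughput(2.6, 3.5) / base)  # 1.3x from +30% IPC at the same clock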
 

Registered · 3,326 Posts
Quote:
Originally Posted by EpIcSnIpErZ23

This is great dude. Now I can actually have some reputable proof when I say the 4670K is better than the 8350 for gaming
thumb.gif
Yes, I agree. mdocod's proof is irrefutable because he actually researches, runs tests, etc. He is no fanboy spouting off.
 

Premium Member · 4,265 Posts · Discussion Starter · #20
Quote:
Originally Posted by Darklyric

Yes yes but what hit do your fps take while 7zipping.... lol amd wins!

Jp...
Software prioritization actually winds up defeating this theoretical advantage. If a game runs better on the i5 without the background task running, it still winds up running better with the background task running. The FX chip winds up running the background task faster, but not the game.
 