(Review) Explanation of multithread performance with benchmarks: Impact of windows scheduler and programmed CPU core allocation - Overclock.net - An Overclocking Community

post #1 of 3 (permalink) Old 07-15-2019, 03:52 PM - Thread Starter
New to Overclock.net
PhillyB's Avatar
Join Date: Sep 2014
Location: Kincheloe, MI
Posts: 485
Rep: 26 (Unique: 25)

(Not sure of the best place to put this, so General Processors seemed a good match.)

I have long been bothered by online reviewers' lack of understanding of how modern CPUs function. The result is that CPUs get singled out as the weak point when the real problem is the benchmark and the OS being used. So I thought I would conduct an experiment in C++ with programmatically locked threads.

Windows scheduling, and poor thread allocation by the programmers of benchmarks, affect results far more than most enthusiasts realize.
Marketers exploit consumer misunderstanding of the technology to profit from selling CPUs beyond the needs of the user. (Not directly tested here, but it is a bias that may affect my interpretation.)

There is a general lack of understanding among online reviewers in the areas of CPU threads, cores, hyperthreaded (logical) cores, and the impact of Windows thread scheduling. For some reason they expect a significant increase based on logical core count across all tasks, regardless of how each task actually uses the individual cores. I have not seen a single review that takes the programming into account. This is often discussed as bias toward one GPU manufacturer over another (HairWorks, Nvidia-only tech, etc.), but it has affected the CPU market at times as well. This experiment should help clarify the gap. At best, reviewers run a variety of seemingly random programs and then try (poorly) to explain why each one was significant enough to warrant selection.

C++ with manual thread creation and core allocation through the Win32 API.
Simple array of 5000 64-bit integer values (roughly 40 KB per thread) stored locally inside the thread function. This limits interaction between threads and keeps the data local for processing. The size was chosen to keep the working set out of RAM and inside the CPU cache.
Two random locations are selected from the array on each cycle and multiplied, then the result is stored into a random location in the array. This is repeated 4 times per cycle.
Cycle counts are kept local to each thread and sent to a secondary reporting thread at 250 ms intervals.
Totals are averaged over each 250 ms interval to minimize the effect of outlying CPU cores.
CPU: 6700K set to a static 4.5 GHz to remove any turbo effects.
3-second run time per test.
The program was compiled with the Visual Studio 2017 C++ default compiler in 64-bit mode. It may contain optimizations and performance enhancements that modify my program's results.
Note: This program and its source code will be made available to anyone who asks.
(My experience: I have two Master's degrees in computer science, teach computer science at the undergraduate level, and am CEO of a small game studio. I have been doing multithreaded programming since 2010, including on dual-CPU systems. I am biased in favor of AMD, but do my best to limit this in my discussion.)

4 threads, each locked to a physical core (0, 2, 4, 6)
4 threads, each locked to sequential logical cores (4, 5, 6, 7)
4 threads, each locked to a physical core together with its hyperthread partner (0/1, 2/3, 4/5, 6/7)
4 threads, each locked to 2 logical cores that are not a physical pair (0/4, 1/5, 2/6, 3/7)
4 threads with no thread lock, allowing Windows 10 1903 to assign logical cores with the default scheduler
Future test: AMD 3900X and 2400G (not collocated with these systems for testing at this time)

mcps = million cycles per second
Windows scheduled: note the inconsistent use of all 8 logical cores and lower than 100% usage on each. Inconsistent 250 ms results, ranging between 27 mcps and 40 mcps; average 35 mcps. (attachment: CPUThread_Unlocked)

Locked cores 0, 2, 4, 6: note 100% usage of the specified cores. Consistent values are evident in iterations 2 through 10, all near the average of 36.5 mcps. (attachment: CPUThread_Locked_1perPC)

Locked cores with hyperthread counterpart: the processing average increases to 37.7 mcps, with similar consistency in iterations 2 through 10. (The second CPU spike was this test process. Note the less-than-100% usage.) (attachment: CPUThread_Locked_2perPC)

Locked to non-physically-paired logical cores: a return to destabilized iteration values, ranging from 27.9 to 39.8 mcps with an average of 34.7 mcps. Note CPU usage behavior similar to the Windows scheduler run, and a similar drop in performance. (attachment: CPUThread_Locked_1per2PC)

Locked to sequential logical cores: this CPU uses hyperthreading, so this configuration uses only two physical cores. Note that the random spike on the 3rd core is from the compile process; only the bottom rows were used for the processing. Per-thread throughput drops to 22.9 mcps, as expected, because logical cores are not physical cores. This means each physical core is producing about 45 mcps combined, compared with the 37.7 mcps we saw with a single thread per physical core. (attachment: CPUThread_Locked_sequencial)

What does this actually mean? Same benchmark, same thread count, allocated to cores in different ways.
1: Windows 10 thread scheduling is terrible.
Letting Windows move threads anywhere it thought useful barely outperformed locking threads to pairs of different physical cores. Remember, locking a thread to 2 logical cores simply gives the Windows scheduler the freedom to move that thread between the allocated logical cores. So even when it can choose between two physically different cores, it will move the thread, to the detriment of performance. Cache movement alone is enough to justify keeping a heavily used thread on a single core.

2: A hyperthreaded logical core is not equal in performance to a full physical core.
Other benchmarks show an increase in processing when multithreading is involved, and games do as well. But highly optimized code sees a reduced improvement. Why? Because a processing core divides its available time into small slices and the OS allocates each slice to a thread. Generally this is a static time allotment, and it is up to the program to use it to its fullest extent. Almost all code does not: some of that time goes unused because of things like memory stalls or waiting for a user to interact with the process. There is a lot of wasted time.

Hyperthreading, in a simplified overview, simply takes a second thread and tries to fill that empty time on the physical core: 2 logical cores + 1 physical core = 1 instruction stream at a time. This is why, in this model, it is essentially impossible to see more than about a 50% improvement (do the math). A perfectly written program that uses all of its time actually prevents another thread from filling any extra time, which equates to a 0% speedup. This program saw only a 7.1 mcps improvement (about 18.8%), and it is not even a heavily optimized process; there is waste in it. It would not be unreasonable to shrink that improvement further with a few hours of work on the processing loop.

So: doubling the logical cores does not double the performance, nor does it guarantee even a 50% improvement. Online reviewers simply need to account for this in their discussion. The program itself determines whether a speedup is possible, not the CPU. Any reviewer who claims otherwise, or seems shocked that hyperthreading is not a huge boon in specific software, simply does not understand the technology they are reviewing. (This includes some pretty popular names in tech reviews, not just the YouTube ones.)

3: Moving threads between cores carries a performance tax, both from the overhead of the move itself and from moving the data cached on one core to another.
Moving a thread is work performed by the OS, and that work itself consumes CPU time, likely a time slice comparable to the one given to the benchmark. The other problem is that any data the thread was using must now be re-fetched into the L1 and L2 caches attached to the new physical core. A CPU core can only access its own attached cache; it cannot read directly from another core's cache. It has to go up a level (usually to L3), which must check whether its copy of the data is current, fetch the up-to-date version if it is not, and then pass it down to the new core's cache. This happens every time the thread moves.

This is why there was so much variation in the benchmark's performance whenever a thread was decoupled from a specific core.

So why did performance improve when the scheduler was allowed to move threads only between the two logical cores of a single physical core? Simple: there is one cache path, and the core's time slices were being used more effectively. Likely Windows was also prevented from placing background tasks onto that physical core because both of its logical cores were too busy.

Bottom line version:
Modern CPUs are scheduled by the OS and affected by how the task is programmed. A task that drives a core to '100%' is not necessarily doing useful work 100% of the time. I can peg a CPU at 100% with a loop that just spins checking the clock: it accomplishes nothing, but still shows 100% usage.

I challenge online reviewers to actually discuss how the CPU itself is performing, rather than how the OS and the programmer happened to allocate it. I also challenge benchmark makers to open-source their code so that we as consumers can review it for bias and flaws that favor one architecture over another. This may be unintentional, but many benchmarks, games especially, are, well, questionable at best.

Please note that I did not even get into specialized instruction sets that are available on only one CPU brand, or that run natively on one while the other has to emulate them.
Attached Thumbnails: CPUThread_Locked_sequencial.png, CPUThread_Locked_2perPC.png, CPUThread_Locked_1perPC.png, CPUThread_Locked_1per2PC.png, CPUThread_Unlocked.png

post #2 of 3 (permalink) Old 07-15-2019, 03:59 PM
⤷ αC
AlphaC's Avatar
Join Date: Sep 2012
Posts: 11,175
Rep: 904 (Unique: 590)
Yes, Windows scheduling is terrible; yes, a hyperthread/SMT sibling only gives 30-40% of a real core's performance; yes, core scaling is poor when parallelization is low (Amdahl's law). That's why Linux is a better OS for Ryzen for non-gaming tasks.

Ryzen 3rd gen has double-wide AVX2 execution (unlike Zen+), so the only place I can see a major deficit is the seldom-used AVX-512.

For power users, core affinity is generally used to lock processes to a single CCX on Ryzen.

P.S. You might find Agner's coverage of Ryzen applicable to your work: https://www.agner.org/optimize/blog/read.php?i=838


post #3 of 3 (permalink) Old 07-15-2019, 05:15 PM - Thread Starter
New to Overclock.net
PhillyB's Avatar
Join Date: Sep 2014
Location: Kincheloe, MI
Posts: 485
Rep: 26 (Unique: 25)
Quote: Originally Posted by AlphaC View Post
P.S. you might find agner's coverage on Ryzen applicable to your work: https://www.agner.org/optimize/blog/read.php?i=838
Thanks. I will take a look tomorrow.

My goal at this point is to determine the impact of chiplets and the new architecture, but I needed a starting point. I am just getting tired of all these poor reviews and guesses at new tech by people claiming to be experts who know little about the parts that make up a complete system. Too many just point a finger at one component and blame or reward it blindly instead of examining why. "This one test I know little about gives me this number, therefore it must be correct and apply to all cases." Too many 'reviewers' are profiting off this kind of information.
