(Review) Explanation of multithreaded performance with benchmarks: impact of the Windows scheduler and programmed CPU core allocation
(Not sure the best place to put this... so General Processors seemed a good match.)
I have long been bothered by online reviewers' lack of understanding of how modern CPUs function; the result is that CPUs get blamed as the weak point when the benchmark and the OS in use are the real problem. So I thought I would conduct an experiment in C++ with programmatically locked threads.
Windows thread scheduling, and poor allocation of threads by the programmers who write benchmarks, affect results far beyond what most enthusiasts realize.
Marketers exploit consumers' misunderstanding of the technology to profit from selling CPUs beyond the needs of the user. (Not something I formally reviewed here, but it is a bias which may color my results.)
There is a general lack of understanding among online reviewers of CPU threads, cores, hyperthreaded cores, and the impact of Windows thread scheduling. For some reason they expect a significant increase based on logical core count across all tasks, regardless of how each task actually uses the individual cores. I have not seen a single review that takes the programming into account. This is often framed as bias toward one GPU manufacturer over another (as with HairWorks, Nvidia-only tech, etc.), but it has affected the CPU market at times as well. This should help clarify their lack of insight. At best, reviewers run a variety of seemingly random programs and then try to explain (poorly) why a particular one was significant enough to warrant selection.
C++ with manual thread creation and core allocation via the Win32 API (SetThreadAffinityMask, exported by kernel32.dll)
Each thread stores a simple array of 5000 integer values locally inside the thread function (compiled in 64-bit: 5000 × 64 bits = 320 kilobits, about 40 KB per core). This limits interaction between threads and keeps storage local for processing. The size was selected so the working set stays in CPU cache rather than spilling to RAM.
On each cycle, two random locations are selected from the array and multiplied, and the result is stored into a random location in the array. This is repeated 4 times per cycle.
Cycle counts are stored local to each thread and sent to a secondary reporting thread at 250 ms intervals.
Totals are averaged at each 250 ms interval to minimize the effect of outlier CPU cores.
CPU: Intel Core i7-6700K set at a static 4.5 GHz to remove any turbo effects.
3-second total processing time per run.
The program was compiled with the default Visual Studio 2017 C++ compiler in 64-bit mode. It may apply optimizations and performance enhancements that modify my program's results.
Note: This program and its source code will be made available to anyone who asks.
(My experience: I have two Master's degrees in computer science, teach computer science at the undergraduate level, and am CEO of a small game studio. I have been doing multithreaded programming since 2010, including on dual-CPU systems. I am biased in favor of AMD, but do my best to limit this in my discussion.)
4 threads, each locked to a separate physical core (0, 2, 4, 6)
4 threads, each locked to sequential logical cores (4, 5, 6, 7)
4 threads, each locked to a physical core plus its hyperthread partner (0/1, 2/3, 4/5, 6/7)
4 threads, each locked to 2 logical cores that are not a physical pair (0/4, 1/5, 2/6, 3/7)
4 threads with no thread lock, allowing Windows 10 1903 to assign logical cores with the default scheduler
Future test: AMD 3900x and 2400g (not collocated with these systems for testing at this time)
mcps = million cycles per second
Windows scheduled: note the inconsistent use of all 8 logical cores and lower than 100% usage on each. Inconsistent 250 ms results between 27 mcps and 40 mcps; average 35 mcps. (attachment: CPUThread_Unlocked)
Locked cores 0, 2, 4, 6: note 100% usage of the specified cores; consistent values are evident in iterations 2 through 10, all near the average of 36.5 mcps. (attachment: CPUThread_Locked_1perPC)
Locked cores with hyperthread counterpart: processing average increases to 37.7 mcps with similar consistency in iterations 2 through 10. (The second CPU spike was this test process. Note less than 100% usage.) (attachment: CPUThread_Locked_2perPC)
Locked to non-physically-paired logical cores: a return to destabilized iteration values, ranging from 27.9 to 39.8 mcps with an average of 34.7 mcps. Note the CPU usage behaves much like the Windows-scheduled run, with a similar drop in performance. (attachment: CPUThread_Locked_1per2PC)
Locked to sequential logical cores: this CPU uses hyperthreading, which means only two physical cores are used. (The random spike on the 3rd core is from the compile process; only the bottom rows were used for the processing.) Throughput drops to 22.9 mcps, as expected, since logical cores are not physical cores. That means each physical core delivers about 45 mcps instead of the 37.7 we saw with a single thread per physical core. (attachment: CPUThread_Locked_sequencial)
What does this actually mean? Same benchmark, same thread count, allocated to cores in different ways.
1: Windows 10 thread scheduling is terrible.
Letting Windows move threads anywhere it thought useful barely outperformed locking threads to pairs of physically different cores. Remember, locking a thread to 2 logical cores just narrows the Windows scheduler's choices: it can still move the thread between the allocated logical cores. So even when it can choose between two physically different cores, it will move the thread between them to the detriment of performance. The cost of cache movement alone is enough to justify keeping a heavily used thread on a single core.
2: Hyperthread cores are not equal in performance to a normal core
Other benchmarks show an increase in processing when multithreading is involved; games do too. But high-performance code sees a reduced improvement. Why? Because a processing core divides its available time into small chunks, and the OS allocates each chunk to a thread. Generally this is a static time allotment, and it is up to the program to use it to its fullest extent. Almost all code does not: some of that time goes unused because of things like memory stalls or waiting for a user to interact with the process. There is a lot of wasted time.
Hyperthreading, in a simplified overview, takes a second thread and tries to fill the empty time on the physical core. 2 logical cores + 1 physical core = 1 instruction stream at a time (in this simplified view). This is why hyperthreading gains are typically well under 50%: the second thread can only use whatever time the first leaves idle. A perfectly written program that uses all of its time will actually prevent the other logical core from filling any gaps, which equates to a 0% speedup. This program saw only a 7.1 mcps improvement (about 18.8%), and it isn't even a heavily optimized process; there is waste in it. It would not be unreasonable to shrink that improvement further with a few hours of work on the processing loop.
So... doubling the logical cores does not double performance, nor does it guarantee even a 50% improvement. Online reviewers simply need to acknowledge this in their discussion. The program itself determines whether a speedup is possible, not the CPU. Any reviewer who claims otherwise, or seems shocked that hyperthreads are not a huge boon in specific software, simply does not understand the technology they are reviewing. (This includes some pretty popular names in tech reviews, not just the YouTube ones.)
3: Threads moving between cores pay a performance tax: processing overhead to move the thread, plus moving the memory stored in cache from one core to another.
Moving a thread is work the OS itself must do, and that scheduler bookkeeping competes for the same CPU time slots as the benchmark. The other problem is that any memory the thread was working on must be re-fetched into the L1 and L2 caches of the new physical core. A CPU core only has its own connected cache; it cannot grab data directly from another core's cache. It has to go up a layer (usually L3), the cache subsystem has to check whether its version of the data is current, fetch the correct version if it is not, and then pass it down to the new core's cache. This must occur every time the thread moves.
This is why the benchmark's performance changed so much when a thread was not locked to a specific core.
So, why did performance improve when the scheduler was only allowed to move a thread between the two logical cores of a single physical core? Simple: there is one cache path, and the core's time slots were being used more effectively. Likely Windows was also prevented from placing background tasks onto that physical core because both of its logical cores were too busy.
Bottom line version:
Modern CPUs are scheduled by the OS and limited by how the task is programmed. Making a task that drives a core to '100%' doesn't mean it is doing 100% useful work. I can peg a CPU at 100% by spin-waiting for 1 ms and then repeating (an actual sleep call would yield the core instead): it accomplishes nothing for that 1 ms, but still shows 100% usage.
I challenge the online reviewers to actually discuss how the CPU is performing rather than how the OS and the programmer allocated it. I also challenge benchmark makers to open-source their code so we as consumers can review it for bias and flaws that favor one architecture over another. The bias may be unintentional, but many benchmarks, games especially, are, well... questionable at best.
Please note that I did not even get into specialized instruction sets that are available on only one CPU brand, or that run natively on one while the other has to emulate them.