Originally Posted by miklkit
What? Is this for real?
The code itself is purely demonstrative; however, I do have some code online that demonstrates the point:
There is a lock-free implementation in the "moody" branch, and the highest performing, but experimental, version is in "new2."
This code is specifically designed to use C++11 features and to provide shockingly better performance than C++11's std::async (which, itself, is barely able to handle two cores and, often enough, is slower than doing the work in-line). It also contains a considerable amount of performance monitoring code in the thread loop (FutureThread::_Thread).
The scaling is very nearly 100% in the new2 branch, with one of my test load scenarios specifically designed to extract the maximum scaling. In fact, due to timing idiosyncrasies, it will sometimes come out as slightly greater than 100% effective.
Which is to say that eight cores will give you effectively eight times the performance of std::async on eight cores (provided the running code has no lock contention of its own).
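For anyone wanting to check a scaling figure like this on their own workload, here is a minimal sketch of the arithmetic. It assumes the percentage is measured against a one-core run of the same workload; the helper names speedup and scaling_efficiency are hypothetical and are not taken from the repository.

```cpp
#include <cstddef>

// Hypothetical helper (not from the repository): how many times faster
// the measured run is than the baseline run.
double speedup(double baseline_seconds, double measured_seconds) {
    return baseline_seconds / measured_seconds;
}

// Scaling efficiency: 1.0 (100%) means N cores delivered N times the
// one-core throughput. Assumes both timings cover the same workload.
double scaling_efficiency(double one_core_seconds,
                          double n_core_seconds,
                          std::size_t cores) {
    return speedup(one_core_seconds, n_core_seconds) /
           static_cast<double>(cores);
}
```

With made-up timings of 8.0 s on one core and 0.98 s on eight cores, scaling_efficiency(8.0, 0.98, 8) comes out just above 1.0. That is the kind of "slightly greater than 100%" result that timer jitter can produce, and it does not indicate a real superlinear gain.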
For a long time, I've put a lot of thought into using every processing capability the CPUs have available.