Originally Posted by The Stilt
Point being, the design as it is cannot go any faster no matter what you do
There's not much room for improvement without doubling down on the design and duplicating so many internal components that you end up with two cores sitting side by side, with two independent execution pathways.
That said, suppose the execution units were all jumbled into one pile (4x ALU, 4x AGU, and the full FPU), and any thread could use any execution unit at any time. The only places you would care about which thread owns which operation are atomic operations, the branch predictor (which is not in-lined), and, just barely, a retirement ordering unit. Then all execution units are available to every thread, and you don't have anywhere near as many penalties.
Of course, we call that SMT.
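To put rough numbers on the "one pile" argument, here is a toy Python sketch. It is a back-of-the-envelope model and not a description of any real core: it assumes every operation is independent, takes one cycle, and can run on any ALU. The function names and the 2+2 versus 4 split are made up for illustration.

```python
def ceil_div(n, d):
    """Integer ceiling division."""
    return -(-n // d)


def cycles_partitioned(work_a, work_b, units_per_thread=2):
    """Each thread may only issue to its own cluster of units.

    Assumption: independent single-cycle ops, so a thread finishes in
    ceil(work / units) cycles; the slower thread sets the total.
    """
    return max(ceil_div(work_a, units_per_thread),
               ceil_div(work_b, units_per_thread))


def cycles_shared(work_a, work_b, total_units=4):
    """Any thread may issue to any unit (the SMT-style 'one pile')."""
    return ceil_div(work_a + work_b, total_units)


if __name__ == "__main__":
    # Imbalanced load: thread A has 12 ops, thread B has 4.
    print(cycles_partitioned(12, 4))  # 6
    print(cycles_shared(12, 4))       # 4
    # Balanced load: both layouts tie.
    print(cycles_partitioned(8, 8))   # 4
    print(cycles_shared(8, 8))        # 4
```

With balanced threads the two layouts tie; the shared pool only wins when one thread would otherwise leave its units idle, which is the penalty being described above.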
In theory, this can be done with only one extra stage in the entire pipeline - and that would be in the reordering retirement unit (yes, I like to use lots of different names for the same thing, so sue me).
Zen will undoubtedly be going this way, and AMD has a lot of IP that could give them this superior arrangement with just a couple of years of design work (mostly on the logic front). Making the core wider will help with single threads and SMT scaling, provided AMD does some serious work on the FastPath code. They will need the ability to assign instructions to groups of units in a manner consistent with software priorities (which instruction types can be executed at once, vs which ones can wait for the others more often than not). This is exactly what Intel does with their architecture.
Indeed, AMD got a HUGE freebie from Intel on this: which instructions take precedence, and which can wait. That's because Intel published their findings from billions of dollars worth of research and even published a detailed list of what instructions go to what unified reservation station port.
AMD undoubtedly used a very similar arrangement.
The graphic doesn't show ports 2, 3, and 4, because they are dedicated to Load, Store Address, and Store Data, respectively.
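As a rough illustration of what "which instructions go to which port" means, here is a minimal Python sketch. The Load / Store Address / Store Data assignments to ports 2, 3, and 4 come from the paragraph above. Lumping everything else into one "compute" class on ports 0, 1, and 5 is my simplification, and the least-loaded-port rule is a made-up placeholder policy, not Intel's or AMD's actual scheduler.

```python
from collections import Counter

# Which ports each class of operation is allowed to use.
PORTS_FOR = {
    "load": (2,),           # from the post: port 2 is dedicated to Load
    "store_addr": (3,),     # from the post: port 3 is Store Address
    "store_data": (4,),     # from the post: port 4 is Store Data
    "compute": (0, 1, 5),   # ASSUMPTION: all other work lumped together
}


def bind_ports(ops):
    """Bind each op to the least-loaded port it is allowed to use.

    Hypothetical policy for illustration only; ties go to the
    lowest-numbered port.
    """
    load = Counter()
    bound = []
    for op in ops:
        port = min(PORTS_FOR[op], key=lambda p: (load[p], p))
        load[port] += 1
        bound.append((op, port))
    return bound


if __name__ == "__main__":
    stream = ["compute", "load", "compute", "store_addr",
              "store_data", "compute", "compute"]
    for op, port in bind_ports(stream):
        print(f"{op:>10} -> port {port}")
```

The point of the sketch is only that memory operations have a fixed home while compute work gets spread across whatever general ports exist. Adding ports to the "compute" tuple is the kind of knob a design could turn to favor one class of work.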
If AMD just copies the design straight-up, I'd be surprised. I'd expect them to try to differentiate some, maybe with more ports to help emphasize floating point performance, or to help mitigate a weak point in their design.
Edited by looncraz - 9/27/15 at 11:31pm