All of this is derived from the GCC patches:
Haswell is 8-wide; Zen is 10-wide. But that really doesn't tell you much, except that Zen apparently uses more, simpler pipelines to reduce complexity (to help ensure higher clock speeds, simplify development, and also retain some of the performance dynamic found in the Stars architecture, enabled by independently addressable floating point pipes).
Haswell uses a unified scheduler, so all instructions are fed into a single unit post-decode, just prior to finding their way into an available execution unit. Haswell cannot initiate execution of certain floating point instructions while also initiating execution of certain integer instructions on the same port (0, 1, 5, or 6). This is really only a loss of a cycle or two, but for instructions that might only need one to five cycles, it becomes a more meaningful limitation. There are many advantages to this arrangement, however, such as a simple, fast results bus that allows results to be fed back into the unified scheduler to more quickly satisfy the data needs of result-dependent instructions. Often enough, the dependent instructions can be fed into the proper execution unit within a cycle or two of the result being calculated - a massive speedup over the old system.
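To make the port-conflict cost concrete, here's a toy model (names, port assignments, and the greedy issue policy are all my own illustration, not Haswell's actual logic): when two uops can only use the same port, one of them loses a cycle, while a uop with a fallback port issues immediately.

```python
# Toy model of a unified scheduler issuing uops to shared ports.
# Each uop lists the ports it can use; each port accepts one uop per cycle.
# NOT an accurate Haswell model - just illustrates why port conflicts cost cycles.

def issue(uops):
    """Greedy issue: each cycle, assign each pending uop the first free port it accepts.
    Returns (uop_name, issue_cycle) pairs."""
    cycle = 0
    issued = []
    pending = list(uops)
    while pending:
        used_ports = set()
        remaining = []
        for name, ports in pending:
            free = next((p for p in ports if p not in used_ports), None)
            if free is not None:
                used_ports.add(free)          # port claimed for this cycle
                issued.append((name, cycle))
            else:
                remaining.append((name, ports))  # all its ports busy; wait a cycle
        pending = remaining
        cycle += 1
    return issued

# Hypothetical mix: an FP multiply and an integer multiply both bound to port 0,
# plus an add that can fall back to port 1.
print(issue([("fmul", [0]), ("imul", [0]), ("add", [0, 1])]))
# -> [('fmul', 0), ('add', 0), ('imul', 1)]  : "imul" pays a cycle for the conflict.
```

The "add" never stalls because it has a second port to fall back on - which is exactly why the conflict only bites on instructions restricted to the shared ports.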
Phenom II used a unified reorder buffer, then divided the instructions by floating point or integer type and sent them to independent schedulers. This allowed the greatest possible degree of decoupling. Results went to the load-store queue, which would allow dependent instructions to pull in the results directly, but often the result would end up in a target register or in the cache before the next instruction was available, leading to less of a speedup than a unified scheduler - though it was nearly as good about 90% of the time. However, integer scheduling was more complicated. Phenom II had three AGU/ALU clusters, each fed by a dedicated scheduler, which was fed by an integer control buffer, which got its data right from the reorder buffer. Integer instructions included all dependent memory operations as well, as each cluster was a nearly fully capable execution unit (you could pretty much use one cluster to act as a full core). This had the upside that the results would immediately be locally available (for the cluster) - and the downside that an ALU would often sit idle waiting for memory while the scheduler was backlogged for that cluster.
For Zen, it appears there will be three schedulers: one for integer, one for memory, and one for floating point - probably also fed by a reorder buffer. The upside is that you don't (usually) have instructions blocking up progress while waiting for data. You have maximum decoupling between memory operations, integer operations, and floating point operations. It is more likely to reach the peak performance possible from the core assets this way... as long as you can get the results back to the dependent instructions fast enough (often an issue with multiple schedulers). AMD could not have just copied the Phenom II solution, as that was rather dependent on the locality of data and instructions provided by the cluster design. It will be interesting to see what they use (they will probably use a variation on something they already have). The downside to independent schedulers is that you will, more often, be stalling for the results. This is where SMT and out-of-order execution come to the rescue.
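The cross-scheduler forwarding cost can be sketched with a toy dependency chain - the domains and latencies below are made up for illustration, and the one-cycle cross-domain penalty is an assumption, not a measured Zen figure:

```python
# Toy model: each op is (domain, execution_latency). With split schedulers,
# every time a result crosses domains (e.g. memory -> integer -> FP), the
# dependent uop pays an extra forwarding penalty before it can issue.
# All numbers are illustrative, not real Haswell/Zen latencies.

def chain_latency(ops, cross_penalty):
    """Total cycles for a serial dependency chain, adding cross_penalty
    whenever consecutive ops live in different scheduler domains."""
    total = 0
    prev_domain = None
    for domain, latency in ops:
        total += latency
        if prev_domain is not None and domain != prev_domain:
            total += cross_penalty  # result must cross to another scheduler
        prev_domain = domain
    return total

# Hypothetical chain: load feeds two integer ops, which feed an FP op.
chain = [("mem", 4), ("int", 1), ("int", 1), ("fp", 3)]
print(chain_latency(chain, cross_penalty=0))  # unified-style forwarding: 9
print(chain_latency(chain, cross_penalty=1))  # split schedulers: 11
```

Notice the penalty only shows up at domain crossings, which is why code with long same-domain runs loses little, while code that ping-pongs between memory, integer, and FP results is where the split design stalls more often.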
In the end, it looks like Haswell and Zen will be averaging out to the same throughput, except in situations where Haswell will be delaying port executions due to instruction conflicts - where Zen will pull ahead (10%+), or in situations where dependent data is less localized - where Haswell will pull ahead (possibly by a great deal - 30%+). It then comes down to the cache performance and the specific instruction mix.

Edited by looncraz - 1/30/16 at 12:04pm