From the AMD-supplied GCC "machine description" file:
;; AMD bdver3 Scheduling
;; The bdver3 contains three pipelined FP units and two integer units.
;; Fetching and decoding logic is different from previous fam15 processors.
;; Fetching is done every two cycles rather than every cycle and
;; two decode units are available. The decode units therefore decode
;; four instructions in two cycles.
;; Three DirectPath instruction decoders and only one VectorPath decoder
;; are available. They can decode three DirectPath instructions or one
;; VectorPath instruction per cycle.
;; The load/store queue unit is not attached to the schedulers but
;; communicates with all the execution units separately instead.
;; bdver3 belongs to the fam15 processors. We use the same insn attribute
;; that was used for the bdver1 decoding scheme.
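The decode arithmetic in the comment block above can be sanity-checked in a few lines. This is just a sketch restating the numbers quoted in the machine description, not anything measured:

```python
# Rough arithmetic model of the bdver3 decode scheme described above.
# All figures come straight from the quoted comments; none are measured.

FETCH_INTERVAL_CYCLES = 2   # fetching is done every two cycles
INSNS_PER_FETCH = 4         # two decode units -> four instructions per window
DIRECTPATH_PER_CYCLE = 3    # three DirectPath decoders
VECTORPATH_PER_CYCLE = 1    # one VectorPath decoder

# Sustained decode rate: four instructions every two cycles = 2 per cycle.
sustained_decode_ipc = INSNS_PER_FETCH / FETCH_INTERVAL_CYCLES
print(sustained_decode_ipc)  # 2.0

# The DirectPath decoders alone could peak higher in a single cycle,
# so fetch, not decode, is the limiter in this scheme.
assert DIRECTPATH_PER_CYCLE > sustained_decode_ipc
```

Note that the sustained rate (2 per cycle) is below the DirectPath peak (3 per cycle), which is consistent with the fetch-every-other-cycle design being the limiting stage.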
I've previously heard that Bulldozer didn't just alternate between cores, so it could potentially decode A/A/B/A/A/B/B ... and so on. Maybe this setup guarantees 4 instructions per core every other cycle. Since x86 processors rarely sustain more than 2 IPC anyway, this is a reasonable trade-off. Also consider that running the decoders every cycle could greatly increase power consumption.
Originally Posted by Opcode
Each integer core has its own decoder with Steamroller, which means all integer-related tasks should be a lot faster. Though the FPU is still shared, so floats are still going to be quite slow. The addition of the extra decoder should greatly increase the operations handled per cycle, as the single decoder in Bulldozer has to dig through the instruction stream and offload the instructions one at a time to each core. There is no such sequencing with Steamroller, as each core has its own decoder. Steamroller is close to being a true quad core again. The only things notably shared now are the memory interface and the FPU.
Edit: Here's a basic diagram for comparison between the two.
How many games use 256-bit AVX code? The shared FPU is more than enough for two threads for the foreseeable future.
The FPU itself is bottlenecked by the fact that it can only do one 128-bit store per cycle. The FPU can compute two 128-bit FP ops per cycle with fully loaded pipes, but it can only write back one 128-bit result per cycle. With two threads, that is 0.5 128-bit stores per thread per cycle - not good.
In Steamroller this is doubled (two 128-bit stores per cycle), so throughput can be higher in many cases.
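The store-bandwidth claim above works out to a simple back-of-envelope calculation. A sketch using only the figures quoted in this thread (not measured data):

```python
# Back-of-envelope check of the FP store-bandwidth argument above.
# All figures are the ones quoted in the thread, not measurements.

THREADS_PER_MODULE = 2           # two integer cores share one FPU

FP_OPS_PER_CYCLE = 2             # two 128-bit FP ops/cycle, pipes fully loaded
bulldozer_stores_per_cycle = 1   # one 128-bit store retired per cycle
steamroller_stores_per_cycle = 2 # doubled in Steamroller

# Bulldozer is writeback-limited: it can compute twice as fast as it can store.
assert bulldozer_stores_per_cycle < FP_OPS_PER_CYCLE

# Per-thread store bandwidth when both threads are active.
bd_per_thread = bulldozer_stores_per_cycle / THREADS_PER_MODULE
sr_per_thread = steamroller_stores_per_cycle / THREADS_PER_MODULE

print(bd_per_thread)  # 0.5 128-bit stores per thread per cycle
print(sr_per_thread)  # 1.0 128-bit stores per thread per cycle
```

The point the calculation makes explicit: on Bulldozer a store-heavy FP workload with two active threads gets half a 128-bit store per thread per cycle, while Steamroller's doubled store port restores a full store per thread per cycle.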