Originally Posted by Tsumi
The only thing doubled is the decoders. There is still the single prefetch, shared L2, etc. The doubled decoder removes the bottleneck that makes 2 cores perform at about 1.6-1.8x the performance of a single core instead of 2x.
Edit: Well, it's not the only thing doubled, but it is the biggest change.
Decoder layouts:
Single - Single - Single - Single (SSSS)
Double - Single - Single (DSS)

Double - 256-bit vector instructions
Single - 128-bit vector instructions
The decoders in Steamroller were not doubled because they were a bottleneck for Bulldozer. They were doubled because each Steamroller module will have 8 ALUs/8 AGUs/8 VALUs, whereas a Bulldozer module has 4 ALUs/4 AGUs/4 VALUs.
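To make the ratio argument concrete, here is a minimal Python sketch that counts decode slots per ALU for a module. It assumes each Single (S) decoder counts as one decode slot, so Bulldozer's one shared SSSS block gives 4 slots and Steamroller's SSSS/SSSS gives 8. It uses the ALU counts claimed in this post, not official AMD figures, and the `Module` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Hypothetical per-module resource counts, using the figures claimed in this thread."""
    name: str
    decode_slots: int  # assumption: each Single (S) decoder counts as one decode slot
    alus: int

    def decode_per_alu(self) -> float:
        """Decode slots available per integer ALU in the module."""
        return self.decode_slots / self.alus


# Bulldozer: one shared SSSS decoder block, 4 ALUs per module.
bulldozer = Module("Bulldozer", decode_slots=4, alus=4)
# Steamroller (as claimed above): SSSS/SSSS decoders, 8 ALUs per module.
steamroller = Module("Steamroller", decode_slots=8, alus=8)

for m in (bulldozer, steamroller):
    print(f"{m.name}: {m.decode_slots} decode slots / {m.alus} ALUs "
          f"= {m.decode_per_alu():.1f} per ALU")
```

Both come out to 1.0 decode slot per ALU, which is the point being made: doubling the decoders keeps the front end in proportion to the doubled execution units, rather than fixing a Bulldozer decode bottleneck.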
What else has been doubled:
Instruction Fetch/Prefetch (two separate, duplicate engines)
Instruction L1 Cache (2x64 KB or 1x128 KB)
Instruction Decoders (SSSS/SSSS or DSS/DSS)
General Purpose x86-64 Logic
Data L1 Cache (16 KB -> 32 KB)
Vector x86-64 Logic (FPU)
Also, here are the pre-launch Cluster Multithreading numbers for Bulldozer (-> actual benchmarks):
Module, single core (maxed): Core 0 at 100% performance, Core 1 at 0% performance (off) -> only 80% of all resources are used.
Module, dual core (both maxed): Core 0 at 90% performance, Core 1 at 90% performance -> more than 95% of all resources are used.
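The CMT figures above can be turned into a module-level throughput number. The minimal Python sketch below sums per-core performance (relative to one core running alone) and divides by the number of active cores; the function names are hypothetical and the inputs are the pre-launch numbers quoted in the post, not measurements. With both cores at 90%, the module delivers 1.8x a single core, the upper end of the 1.6-1.8x range quoted at the top of the thread, at 90% per-core efficiency.

```python
def module_throughput(core_perf: list[float]) -> float:
    """Sum of per-core performance, relative to one core running alone (= 1.0)."""
    return sum(core_perf)


def scaling_efficiency(core_perf: list[float]) -> float:
    """Module throughput divided by the number of active cores (ideal = 1.0)."""
    active = [p for p in core_perf if p > 0]
    if not active:
        return 0.0
    return module_throughput(core_perf) / len(active)


# Pre-launch CMT numbers quoted in the post (not measured here).
single = [1.00, 0.00]  # Core 0 maxed, Core 1 off
dual = [0.90, 0.90]    # both cores maxed, each losing 10%

for label, perf in (("single", single), ("dual", dual)):
    print(f"{label}: throughput {module_throughput(perf):.2f}x, "
          f"efficiency {scaling_efficiency(perf):.0%}")
```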
Also note that the 10% loss in each core is offset by both cores having access to the same pool of memory (L1i/L2).
Edited by Seronx - 6/12/13 at 3:24am