Originally Posted by maarten12100
Part is still shared in Steamroller. In the end, the BD derivatives should give a larger speedup than hyperthreading while still being weaker than true cores: instead of a 10-15% speedup, it would run at 80-90% or so, which is pretty good for the space you would be saving.
Does that answer his question?
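To put rough numbers on the quoted estimate, here's a quick Python sketch. It assumes the percentages mean the throughput a second thread adds on top of one thread running alone on a full core. That is my reading of the quote, not an AMD figure, and the function name is made up for illustration.

```python
def two_thread_throughput(second_thread_gain):
    """Throughput of two threads, relative to one thread on one full core (1.0)."""
    return 1.0 + second_thread_gain

# Ranges taken from the quoted post. Treating them as "gain from the second
# thread relative to a single full core" is an assumption.
smt_gain = (0.10, 0.15)      # hyperthreading-style sharing
module_gain = (0.80, 0.90)   # second core in a BD-style module

smt = [two_thread_throughput(g) for g in smt_gain]
module = [two_thread_throughput(g) for g in module_gain]

# Worst case for the module vs. best case for SMT, and the reverse.
ratio_low = module[0] / smt[1]
ratio_high = module[1] / smt[0]

print(f"SMT:    {smt[0]:.2f}x to {smt[1]:.2f}x")
print(f"Module: {module[0]:.2f}x to {module[1]:.2f}x")
print(f"Module vs SMT: {ratio_low:.2f}x to {ratio_high:.2f}x")
```

Under that reading, two threads on a module land somewhere around 1.8x to 1.9x of a single core, versus 1.10x to 1.15x for hyperthreading.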
- Branch Predictor => the old BP was pretty good and this one's mispredict rate is 20% lower
- Instruction Cache => 96KB 3-way is a good compromise for power/hit rate/latency
- Prefetch => Doubled, and the additional entries can be used as a low-power loop buffer
- Fetch => 16B (four instructions) per core per cycle - no bottleneck
- FPU => No decrease in throughput vs. BD/PD, many load/store instructions have drastically lower latency. Bring on 256-bit FMACs, Excavator!
- Write Coalescing Cache => No idea. I'm guessing there have been a lot of changes around this to increase write bandwidth.
- L2 Cache/Bus Unit => Ditto (improved bandwidth, latency, outstanding misses, etc.). Resizable! Very cool feature.
- Decode => can sustain decode/dispatch of 4 IPC
- Micro-op Queue => small feature, but may be critical to IPC and perf/watt in tight loops
- Register Files => larger, same latency as before
- Scheduler => bigger scheduling window, smarter scheduling
- 2 ALU + 2 AGLU => AGLUs can execute reg-to-reg MOVs, allows 4 MOVs per cycle
- Load/Store Unit => Added ability to execute 2 stores/cycle (up from 1 load + 1 store per cycle or 2 loads per cycle)
- Data Cache => Performance really depends on this. I don't have any specifics but I think this is drastically improved (but no size/associativity increase)
- L1 Prefetcher => Also improved, nice to hear.
- Thread Retire => no change, can retire 4 IPC
- Write Combining Cache => I'm guessing there is more write combining resource. With a large enough buffer, write-through becomes a non-issue for store bandwidth.
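
To illustrate the write-combining guess in the last bullet, here's a toy Python model. It is not how Steamroller's WCC actually works: the 64-byte line size, the oldest-first eviction, and the `l2_writes` helper are all assumptions for illustration. It just counts how many line writes reach L2 when stores to the same line are merged in a small buffer first.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size


def l2_writes(store_addresses, buffer_entries):
    """Count line writes sent to L2 for a stream of store addresses.

    buffer_entries == 0 models plain write-through: every store goes to L2.
    Otherwise a store to a line already pending in the buffer is merged,
    and a line is written out only when it is evicted (oldest first) or
    when the buffer is flushed at the end.
    """
    if buffer_entries == 0:
        return len(store_addresses)

    pending = OrderedDict()  # line number -> None, oldest entry first
    writes = 0
    for addr in store_addresses:
        line = addr // LINE_BYTES
        if line in pending:
            continue  # merged into a line that is already pending
        if len(pending) == buffer_entries:
            pending.popitem(last=False)  # evict the oldest line to L2
            writes += 1
        pending[line] = None
    return writes + len(pending)  # final flush of whatever is left


# 512 sequential 8-byte stores covering 4 KB
sequential = list(range(0, 4096, 8))
print(l2_writes(sequential, 0))  # no combining
print(l2_writes(sequential, 4))  # 4-entry combining buffer
```

For the sequential stream, this model sends 512 writes to L2 with no buffer and 64 with a 4-entry buffer, one per line. Streams that bounce between more lines than the buffer holds lose most of that benefit, which is why the buffer size matters.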
Overall, the front-end should cause few problems for throughput. FPU seems to be latency-optimized. AMD has focused on optimizing loops.
My only concern right now is L1 write/copy bandwidth, which could be a bottleneck. AMD says "major improvements to store handling", so I'm thinking this was an area of focus as well.
Originally Posted by Redwoodz
Dual socket Kaveri desktop board with quad-channel DDR4!
Quad-core Kaveri Athlon for $80? I'm down.