Originally Posted by EniGma1987
You sound mad. U mad bro?
If you knew a lot about the architecture, then you would also know that there is a double-sized FP unit made of two 128-bit units: it can run as two independent 128-bit pipes, or fuse into a single 256-bit unit either when fewer operations need to be done or when it actually has to perform a 256-bit operation. The Phenom II processors had a single integer core and a single 128-bit floating point unit per core; the new architecture has two integer cores and two 128-bit floating point units per module, with shared caches and other resources. Therefore your logic is ridiculous.
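To make that sharing behavior concrete, here is a toy issue model (my own sketch, not AMD's actual scheduler): two 128-bit pipes per module, where a 128-bit op occupies one pipe for a cycle and a 256-bit op fuses both pipes for that cycle.

```python
# Toy model of a Bulldozer-style module FPU (illustrative only):
# two 128-bit pipes shared by two integer cores. A 128-bit op uses
# one pipe for a cycle; a 256-bit op fuses both pipes for a cycle.

def cycles_needed(ops):
    """ops: list of operand widths in bits (128 or 256).
    Returns cycles to issue them all with two 128-bit pipes."""
    cycles = 0
    free_pipes = 0
    for width in ops:
        need = 1 if width == 128 else 2
        if free_pipes < need:
            cycles += 1        # start a new cycle with both pipes free
            free_pipes = 2
        free_pipes -= need
    return cycles

# Eight 128-bit ops (e.g. two cores issuing four each) co-issue in pairs:
print(cycles_needed([128] * 8))   # 4 cycles
# The same amount of data as four fused 256-bit ops:
print(cycles_needed([256] * 4))   # 4 cycles, one fused op per cycle
```

In this toy the two workloads take the same number of cycles, which is the point of the flexible design: the shared unit only becomes a fight when both cores want wide ops in the same cycle.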
The three main bottlenecks of the Orochi architecture currently are:
L1 data has to be written through to L2, which gives L1 data writes very low bandwidth
one decoder is shared between two integer cores
two integer cores contend for too little cache space.
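The first bottleneck can be sketched with a toy cache model (assumed line size, LRU policy, and a plain write-through, nothing like the real chip's write-coalescing machinery): a write-through L1 sends every store down to L2, while a write-back L1 only writes a dirty line out when it is evicted.

```python
# Toy L1 store-traffic model (assumed parameters, not Bulldozer's
# actual cache): count how many writes reach L2 under each policy.

def l2_write_traffic(stores, policy, line_size=64, l1_lines=4):
    """stores: list of byte addresses written. Returns L2 write count."""
    l1 = []            # LRU list of resident line tags, newest last
    dirty = set()
    l2_writes = 0
    for addr in stores:
        tag = addr // line_size
        if tag in l1:
            l1.remove(tag)                 # refresh LRU position
        elif len(l1) == l1_lines:
            victim = l1.pop(0)             # evict least-recently-used line
            if policy == "write-back" and victim in dirty:
                dirty.discard(victim)
                l2_writes += 1             # dirty eviction writes to L2
        l1.append(tag)
        if policy == "write-through":
            l2_writes += 1                 # every store goes to L2
        else:
            dirty.add(tag)
    return l2_writes

stores = [0, 8, 16, 24, 32, 40, 48, 56]   # 8 stores into one 64 B line
print(l2_write_traffic(stores, "write-through"))  # 8 L2 writes
print(l2_write_traffic(stores, "write-back"))     # 0, line never evicted
```

Eight stores into the same line cost eight L2 writes under write-through but zero under write-back, which is why the write path is called out as a bandwidth bottleneck.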
The Steamroller core design will alleviate one of these problems by doubling the number of decoders as well as doubling how much data is fetched per fetch. The fetch unit only operates every other cycle, though, so technically it can only pull in the same amount of data overall; the new fetch design, I guess, is supposed to be a more efficient fit for the architecture.
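The bandwidth claim above is simple arithmetic, shown here with illustrative fetch widths (not confirmed Steamroller numbers): doubling the fetch width while fetching only every other cycle leaves the average unchanged.

```python
# Back-of-envelope check: average fetch bandwidth in bytes per cycle.
# The 16 B / 32 B widths are assumptions for illustration only.

def avg_fetch_bw(bytes_per_fetch, fetch_every_n_cycles):
    return bytes_per_fetch / fetch_every_n_cycles

old = avg_fetch_bw(16, 1)   # hypothetical: 16 B fetched every cycle
new = avg_fetch_bw(32, 2)   # hypothetical: 32 B fetched every 2nd cycle
print(old, new)             # both 16.0 B/cycle on average
```

The win, then, isn't raw bandwidth but burstiness: a double-width fetch can hand both decoders a full chunk at once.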
We also know that the cache design has been improved slightly, so it should work somewhat better. I don't think the architecture was (or can be?) changed so that L1 data is not copied to L2, though, so that issue will only be partly alleviated. Cache size was also increased, so hopefully issue 3 is fixed as well.
A: I don't know why you were trying to taunt me? I didn't taunt anyone else? Okay...
B: If we go by that rule (your paragraph about the FPU), then we may as well call a single Intel core a dual core, seeing as it has twice the number of integer execution units that AMD has. That is your logic, no? Don't mind the number of clusters, just mind the makeup of the clusters and the resources attached to them? I know that isn't how you meant it to be interpreted, but the point still stands: the number of clusters is what's important; the bit width doesn't really make a difference.
Also, your explanation doesn't show how my logic is ridiculous. Bit width certainly has no effect on how many cores a design is perceived to have; if it did, we'd all have 32-core CPUs when compared against 16-bit CPUs. I personally think that logic is ridiculous, but that's just my opinion.
C: The architecture could be changed so that L1 data isn't copied to L2, but that would cost cache coherency, which is quite important in a modern CPU running modern operating systems. Again, though, I never said anything about the bottlenecks of Bulldozer/Piledriver, so I don't know why everyone keeps bringing them up.
Originally Posted by MrJava
It was a compromise: they wanted to get 80% of the performance of two full cores in much less die area. They at least achieved this goal when we look at heavily threaded INT workloads. FPU throughput is actually pretty good as well; the 4 FPUs in Zambezi can outperform the 6 FPUs of Thuban on occasion.
They certainly did achieve this goal in most integer workloads that didn't deal with very large pieces of data or complicated instructions.
I think the extra performance of the PD FPUs comes from the frequency bump, seeing as the pipeline is quite a bit longer and can therefore be clocked quite a bit higher. That still doesn't change my view on the design; I would certainly have preferred two floating point clusters over a single shared one.

Edited by elemein - 10/28/13 at 3:25pm