
[Various] AMD's Zen To Have 10 Pipelines Per Core - Details Leaked In Patch (Updated) - Page 15

post #141 of 758
Quote:
Originally Posted by Jimbags View Post

All done at stock clocks? Some overclock better than others is all :-D

Pure IPC comparisons, normalized clocks generation over generation.

The CPU's ability to clock is a secondary discussion. I don't expect Zen to have 4GHz base clocks; probably around 3.5, with turbo to 4, and reliable overclocking barely exceeding the turbo. And that's at the high end, with four cores.

Zen could, however, be a decent upgrade for anyone who doesn't overclock and is on a Sandy Bridge or older system. Considering that moving from any i7 to any AMD CPU has been a general downgrade since time immemorial, I'd say that is a drastic improvement.

Zen will, by no means, defeat Intel.

I expect them to use SMT to even the playing field, price-wise, with Intel's i5s, to add a couple more cores to compete with i7s, and then to have some eight-core parts that will compete with Intel's six-core CPUs.
post #142 of 758
Quote:
Originally Posted by spurdomantbh View Post

It said FPU, not SIMD though. That being said the thing you were quoting is still incorrect.
As for Zen, the same fp pipelines seem to be used in both fp and mmx operations. Shared pipeline between units perhaps, similar to intel?

If you'd read the bulldozer part I marked in bold;
Quote:
Bulldozer module:

Core1: 2 ALU + 2 AGU
Core2: 2 ALU + 2 AGU
FPU: 2 128bit FMAC + 2 MMX
Instruction Decode: 4 Wide
L1 Cache: 64 KB Instruction per module + 16 KB data per core
L2 Cache: 2 MB

The MMX units are SIMD integer units, not floating point units. They're including the SIMD integer units for Bulldozer, except they forgot to include the ones from Haswell, and Zen doesn't seem to have any of them, which is illogical.

"The other half of the floating point cluster’s execution units actually have little to do with floating point data at all. Bulldozer has a pair of largely symmetric 128-bit integer SIMD ALUs (P2 and P3) that execute arithmetic and logical operations."

http://www.realworldtech.com/bulldozer/7/

For the last decade, we've never seen an architecture with a single SIMD unit doing both floats & integers, afaik. There might be some, but they'd be really old. If you haven't noticed, Bulldozer uses a symmetric design, e.g. just 2x 128-bit MMX / 2x FMAC, unlike any other architecture (see my Haswell SIMD list), which is why you would classify BD's SIMD as a FlexFPU, which failed miserably. Whoever sketched this got the SIMD side completely wrong.

Zen is not supposed to have such a flawed SIMD design
post #143 of 758
Quote:
Originally Posted by Faithh View Post

If you'd read the bulldozer part I marked in bold;
The MMX units are SIMD integer units, not floating point units. They're including the SIMD integer units for Bulldozer, except they forgot to include the ones from Haswell, and Zen doesn't seem to have any of them, which is illogical.

"The other half of the floating point cluster’s execution units actually have little to do with floating point data at all. Bulldozer has a pair of largely symmetric 128-bit integer SIMD ALUs (P2 and P3) that execute arithmetic and logical operations."

http://www.realworldtech.com/bulldozer/7/

For the last decade, we've never seen an architecture with a single SIMD unit doing both floats & integers, afaik. There might be some, but they'd be really old. If you haven't noticed, Bulldozer uses a symmetric design, e.g. just 2x 128-bit MMX / 2x FMAC, unlike any other architecture (see my Haswell SIMD list), which is why you would classify BD's SIMD as a FlexFPU, which failed miserably. Whoever sketched this got the SIMD side completely wrong.

Zen is not supposed to have such a flawed SIMD design

From the patch:


Integer SIMD (MMX)
Code:
+;; Currently blocking all decoders for vector path instructions as
+;; they are dispatched separetely as microcode sequence.
+;; Fix me: Need to revisit this.
+(define_reservation "znver1-vector" "znver1-decode0+znver1-decode1+znver1-decode2+znver1-decode3")

Combined with other entries in the gcc znver1.md file it would seem to suggest that integer SIMD instructions are not necessarily handled by a dedicated unit at all, but are instead translated into microcode instructions and assigned to ganged execution units by ganging together ALL of the decoders (which is beyond strange). However, the comment above it suggests this is just place-holder code that works, but isn't necessarily representative of the processor internals... which would be my guess.

For floating point vector instructions:
Code:
+(define_reservation "znver1-fvector" "znver1-fp0+znver1-fp1
+                                     +znver1-fp2+znver1-fp3
+                                     +znver1-agu0+znver1-agu1")


This, however, is exactly what we've suspected all along - that two of the pipes for the FPU merge together for vector loads. This also suggests that we are talking about a 4x64bit FPU much like that in Excavator. A few other entries suggest that the FPU may have its own load/store capabilities, as well as one or more of the ALUs having such a capability. However, this is almost certainly just a matter of logical access to the LSU from these units. Each such entry shows units being ganged together unless an AGU is used:
Code:
+(define_insn_reservation "znver1_fp_mov_direct_load" 5
+                        (and (eq_attr "cpu" "znver1")
+                             (and (eq_attr "znver1_decode" "direct")
+                                  (and (eq_attr "type" "fmov")
+                                       (eq_attr "memory" "load"))))
+                        "znver1-direct,znver1-load,znver1-fp3|znver1-fp1")

+(define_insn_reservation "znver1_fp_mov_direct_store" 5
+                        (and (eq_attr "cpu" "znver1")
+                             (and (eq_attr "znver1_decode" "direct")
+                                  (and (eq_attr "type" "fmov")
+                                       (eq_attr "memory" "store"))))
+                        "znver1-direct,znver1-fp2|znver1-fp3,znver1-store")

Or I could be misreading what this means. My first foray into gcc's internals.
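
For anyone else reading these strings for the first time, here is a minimal Python sketch of how a reservation string breaks down. It assumes the operator meanings given in the GCC internals documentation for pipeline descriptions: "," starts the next cycle, "|" separates alternatives, and "+" means units reserved in the same cycle. The helper name parse_reservation is made up for illustration, and the sketch ignores parentheses, "*" repetition and "nothing", so it is not a full parser.

```python
def parse_reservation(regexp):
    """Split a reservation string into cycles -> alternatives -> units.

    Simplified: handles only ',', '|' and '+' (no parentheses, no '*').
    """
    cycles = []
    for cycle in regexp.split(","):          # ',' advances to the next cycle
        alternatives = []
        for alt in cycle.split("|"):         # '|' separates alternative choices
            units = [u.strip() for u in alt.split("+")]  # '+' = same cycle
            alternatives.append(units)
        cycles.append(alternatives)
    return cycles


# Reservation strings copied from the patch excerpts above.
load = parse_reservation("znver1-direct,znver1-load,znver1-fp3|znver1-fp1")
store = parse_reservation("znver1-direct,znver1-fp2|znver1-fp3,znver1-store")
fvector = parse_reservation(
    "znver1-fp0+znver1-fp1+znver1-fp2+znver1-fp3+znver1-agu0+znver1-agu1"
)

for name, parsed in (("load", load), ("store", store), ("fvector", fvector)):
    print(name)
    for i, alternatives in enumerate(parsed):
        print("  cycle", i, "->", " OR ".join(" + ".join(a) for a in alternatives))
```

Read this way, the fmov load takes three steps (decode, then the load reservation, then either fp3 or fp1), while "znver1-fvector" reserves all four FP pipes and both AGUs in a single cycle. That only describes the scheduler model; it does not by itself prove how the hardware is wired.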
post #144 of 758
Did four tests on four different designs.

BD Gen. 2 (Piledriver) - A10-6800K
2CU / 2T (no CMT penalty), 3.0GHz, 1.8GHz NCLK, 1600MHz DRAM 9-10-9-24-2T

BD Gen. 3 (Steamroller) - A10-7870K
2CU / 2T (no CMT penalty), 3.0GHz, 1.8GHz NCLK, 1600MHz DRAM 9-10-9-24-2T

BD Gen. 4 (Excavator) - FX-8800P
2CU / 2T (no CMT penalty), 3.0GHz, 1.3GHz NCLK, 1600MHz DRAM 9-9-9-27-2T

Haswell - i5-4430
2C / 2T (no SMT penalty), 3.0GHz, 3.0GHz Cache / UCCLK, 1600MHz DRAM 9-10-9-24-2T

C-Ray V1.1 (Raytracer)
Compiler GCC 5.20 x86-64, CFlags = O3 & static
1600x1200 with 8 rays per pixel (15360000)

PD = 187155ms (82071.0pps) - 100.0%
SR = 184502ms (83251.1pps) - 101.438%
XV = 170368ms (90157.8pps) - 109.853%
HW = 116907ms (131386.5pps) - 160.089%

Euler3D (CFD)
Compiler GCC 5.20 x86-64, CFlags = O3 & static
NACA0012.097K air foil

PD = 177.076s (11.2946 IPS) - 100.0%
SR = 151.380s (13.2102 IPS) - 116.960%
XV = 135.674s (14.7412 IPS) - 130.515%
HW = 94.521s (21.1593 IPS) - 187.340%

X265 (Encoder)
Compiler GCC 5.20 x86-64 / YASM 1.30 (default flags)
Version 1.7+512

PD = 225.25s (1.57 fps) - 100.0%
SR = 213.97s (1.65 fps) - 105.1%
XV = 204.81s (1.72 fps) - 109.554%
HW = 117.12s (3.01 fps) - 191.720%

Cinebench R15

PD = 71pts - 100.0%
SR = 72pts - 101.408%
XV = 75pts - 105.634%
HW = 119pts - 167.606%

These are all, naturally, single-threaded results.
All of the systems had an additional core enabled in order to offload the operating system overhead.

EDIT: Fixed the messed-up results.
Edited by The Stilt - 10/8/15 at 10:16am
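
The percentages in this post appear to be the Piledriver time divided by each chip's time for the same workload. A minimal Python sketch of that calculation for the C-Ray run is below; the dictionary and function names are made up for illustration, and the formula is inferred from the figures rather than stated in the post.

```python
# C-Ray render times in milliseconds, copied from the post above.
C_RAY_MS = {"PD": 187155, "SR": 184502, "XV": 170368, "HW": 116907}


def relative_percent(times, baseline="PD"):
    """Performance relative to the baseline: lower time means higher percent."""
    base = times[baseline]
    return {name: 100.0 * base / t for name, t in times.items()}


c_ray = relative_percent(C_RAY_MS)
for name, pct in c_ray.items():
    print(f"{name}: {pct:.3f}%")
```

The values this prints should line up with the percentages listed for C-Ray, which suggests the table is a plain time ratio at matched 3.0GHz clocks.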
post #145 of 758
Quote:
Originally Posted by The Stilt View Post

Did four tests on four different designs.

BD Gen. 2 (Piledriver) - A10-6800K
2CU / 2T (no CMT penalty), 3.0GHz, 1.8GHz NCLK, 1600MHz DRAM 9-10-9-24-2T

BD Gen. 3 (Steamroller) - A10-7870K
2CU / 2T (no CMT penalty), 3.0GHz, 1.8GHz NCLK, 1600MHz DRAM 9-10-9-24-2T

BD Gen. 4 (Excavator) - FX-8800P
2CU / 2T (no CMT penalty), 3.0GHz, 1.3GHz NCLK, 1600MHz DRAM 9-9-9-27-2T

Haswell - i5-4430
2C / 2T (no SMT penalty), 3.0GHz, 3.0GHz Cache / UCCLK, 1600MHz DRAM 9-10-9-24-2T

C-Ray V1.1 (Raytracer)
Compiler GCC 5.20 x86-64, CFlags = O3 & static
1600x1200 with 8 rays per pixel (15360000)

PD = 187155ms (82071.0pps) - 100.0%
SR = 184502ms (83251.1pps) - 101.438%
XV = 151462ms (101411.6pps) - 123.566%
HW = 116907ms (131386.5pps) - 160.089%

Euler3D (CFD)
Compiler GCC 5.20 x86-64, CFlags = O3 & static
NACA0012.097K air foil

PD = 177.076s (11.2946 IPS) - 100.0%
SR = 151.380s (13.2102 IPS) - 116.960%
XV = 122.726s (16.2965 IPS) - 144.286%
HW = 94.521s (21.1593 IPS) - 187.340%

X265 (Encoder)
Compiler GCC 5.20 x86-64 / YASM 1.30 (default flags)
Version 1.7+512

PD = 225.25s (1.57 fps) - 100.0%
SR = 213.97s (1.65 fps) - 105.1%
XV = 187.47s (1.88 fps) - 119.745%
HW = 117.12s (3.01 fps) - 191.720%

Cinebench R15

PD = 71pts - 100.0%
SR = 72pts - 101.408%
XV = 75pts - 105.634%
HW = 119pts - 167.606%

These are all, naturally, single-threaded results.
All of the systems had an additional core enabled in order to offload the operating system overhead.

That raises questions: the FX-8800P is locked to its TDP, no? Although it totally slaughters the previous APUs.
post #146 of 758
Quote:
Originally Posted by SpeedyVT View Post

That raises questions: the FX-8800P is locked to its TDP, no? Although it totally slaughters the previous APUs.

I think the 75W TDP limit I used for the FX-8800P covers the two active units, with only one core under load ;-)

Sufficient L1D and lower latency L2 cache works pretty well.
Edited by The Stilt - 10/7/15 at 10:35pm
post #147 of 758
Quote:
Originally Posted by Faithh View Post

If you'd read the bulldozer part I marked in bold;
The MMX units are SIMD integer units, not floating point units. They're including the SIMD integer units for Bulldozer, except they forgot to include the ones from Haswell, and Zen doesn't seem to have any of them, which is illogical.

Ahh, my bad, missed the MMX units in bulldozer part.
Quote:
Originally Posted by Faithh View Post

Zen is not supposed to have such a flawed SIMD design

Hopefully
Quote:
Originally Posted by looncraz View Post

Code:
+;; Fix me: Need to revisit this.
However, the comment above it suggests this is just place-holder code that works, but isn't necessarily representative of the processor internals... which would be my guess.

Indeed, seems like it's too early to say how that will work.
Quote:
Originally Posted by looncraz View Post

This also suggests that we are talking about a 4x64bit FPU much like that in Excavator.

I believe it just suggests that all pipelines are used? Unless I'm missing something?
Quote:
Originally Posted by looncraz View Post


A few other entries suggest that the FPU may have its own load/store capabilities, as well as one or more of the ALUs having such a capability. However, this is almost certainly just a matter of logical access to the LSU from these units. Each such entry shows units being ganged together unless an AGU is used:

Or I could be misreading what this means. My first foray into gcc's internals.

It's still using the 2 AGUs. "znver1-load" and "znver1-store" definitions:
Code:
+;; 2 AGU pipes.
+(define_cpu_unit "znver1-agu0" "znver1_agu")
+(define_cpu_unit "znver1-agu1" "znver1_agu")
+(define_reservation "znver1-agu-reserve" "znver1-agu0|znver1-agu1")
+
+(define_reservation "znver1-load" "znver1-agu-reserve")
+(define_reservation "znver1-store" "znver1-agu-reserve")
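
To make the aliasing explicit, here is a small Python sketch that follows named reservations down to the units they end in. The table is transcribed from the three define_reservation lines above; the names RESERVATIONS and expand are made up for illustration, and the sketch only collects unit names, it does not preserve the difference between "|", "+" and ",".

```python
import re

# Named reservations transcribed from the patch lines quoted above.
RESERVATIONS = {
    "znver1-agu-reserve": "znver1-agu0|znver1-agu1",
    "znver1-load": "znver1-agu-reserve",
    "znver1-store": "znver1-agu-reserve",
}


def expand(name, table=RESERVATIONS):
    """Return the set of underlying units a reservation name can resolve to."""
    if name not in table:
        return {name}  # not an alias, so treat it as an actual unit
    units = set()
    for part in re.split(r"[|+,]", table[name]):
        units |= expand(part.strip(), table)
    return units


print(sorted(expand("znver1-load")))
print(sorted(expand("znver1-store")))
```

Both names resolve to the same pair, znver1-agu0 and znver1-agu1, so in this model a load or store in the fmov entries still occupies one of the two AGU pipes.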
post #148 of 758
For those interested, Hardware.fr published a 4GHz comparison of SB/IB/HW/BW/SL across multiple apps here:

http://www.hardware.fr/articles/940-6/cpu-sandy-bridge-vs-ivy-bridge-vs-haswell-vs-skylake-4-g.html
post #149 of 758
Quote:
Originally Posted by The Stilt View Post

Did four tests on four different designs.

BD Gen. 2 (Piledriver) - A10-6800K
2CU / 2T (no CMT penalty), 3.0GHz, 1.8GHz NCLK, 1600MHz DRAM 9-10-9-24-2T

BD Gen. 3 (Steamroller) - A10-7870K
2CU / 2T (no CMT penalty), 3.0GHz, 1.8GHz NCLK, 1600MHz DRAM 9-10-9-24-2T

BD Gen. 4 (Excavator) - FX-8800P
2CU / 2T (no CMT penalty), 3.0GHz, 1.3GHz NCLK, 1600MHz DRAM 9-9-9-27-2T

Haswell - i5-4430
2C / 2T (no SMT penalty), 3.0GHz, 3.0GHz Cache / UCCLK, 1600MHz DRAM 9-10-9-24-2T

C-Ray V1.1 (Raytracer)
Compiler GCC 5.20 x86-64, CFlags = O3 & static
1600x1200 with 8 rays per pixel (15360000)

PD = 187155ms (82071.0pps) - 100.0%
SR = 184502ms (83251.1pps) - 101.438%
XV = 151462ms (101411.6pps) - 123.566%
HW = 116907ms (131386.5pps) - 160.089%

Euler3D (CFD)
Compiler GCC 5.20 x86-64, CFlags = O3 & static
NACA0012.097K air foil

PD = 177.076s (11.2946 IPS) - 100.0%
SR = 151.380s (13.2102 IPS) - 116.960%
XV = 122.726s (16.2965 IPS) - 144.286%
HW = 94.521s (21.1593 IPS) - 187.340%

X265 (Encoder)
Compiler GCC 5.20 x86-64 / YASM 1.30 (default flags)
Version 1.7+512

PD = 225.25s (1.57 fps) - 100.0%
SR = 213.97s (1.65 fps) - 105.1%
XV = 187.47s (1.88 fps) - 119.745%
HW = 117.12s (3.01 fps) - 191.720%

Cinebench R15

PD = 71pts - 100.0%
SR = 72pts - 101.408%
XV = 75pts - 105.634%
HW = 119pts - 167.606%

These are all, naturally, single-threaded results.
All of the systems had an additional core enabled in order to offload the operating system overhead.

That's 6.22% for Steamroller vs my 6.7% (because you used fewer benchmarks).
You are showing a 15.79% improvement for Excavator, far more than my average 9.85%. For the same reason, of course.

So, your numbers would be:

PileDriver: 100%
Steamroller: 106%
Excavator: 123%
Zen: 172%
Haswell: 176%

Which is higher than what I stated for Zen.

Averages are beautiful, are they not? And once you expand to more tests, your numbers should converge with mine. And we have no reason to expect a uniform improvement from Zen. I expect its Cinebench performance to increase notably more than its Euler3D numbers, for example.
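
For reference, a minimal Python sketch of the averaging, assuming a plain arithmetic mean over the four relative scores in the quoted post (C-Ray, Euler3D, x265, Cinebench R15). The variable names are made up for illustration. The Zen figure is a projection and cannot be derived from these measurements, so it is left out.

```python
from statistics import mean

# Relative scores (Piledriver = 100%) from the quoted post, in the order
# C-Ray, Euler3D, x265, Cinebench R15.
RELATIVE = {
    "Steamroller": [101.438, 116.960, 105.1, 101.408],
    "Excavator": [123.566, 144.286, 119.745, 105.634],
    "Haswell": [160.089, 187.340, 191.720, 167.606],
}

averages = {chip: mean(scores) for chip, scores in RELATIVE.items()}
for chip, avg in averages.items():
    print(f"{chip}: {avg:.2f}%")
```

Truncated to whole percentages, these means give the 106 / 123 / 176 figures in the list above. With only four workloads the mean is sensitive to any single outlier, which is the point about expanding to more tests.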

In any event, even your numbers put Zen within scratching distance of Haswell; mine put it only a little closer (because Haswell doesn't perform as well in other benchmarks, such as ST 3DPM, ST WebXPRT, and ST Google Octane v2).

I would like to know how you set a high TDP for the FX-8800P... and why you have all this hardware available to you ;-). My numbers are mostly based on numbers I can find on the internet (and they sometimes even disagree, which is annoying). I had an FX-8350, and I've built systems using Haswell, and I got my numbers for those directly (left one core enabled, disabled turbo, set it to 4GHz, and went to town on a 4690k :-P). In fact, I will be building another system with a 4690k in the next week or two (waiting for the rest of the money to show...), so I'll run some more tests and compare them against the Phenom II 955 my wife's computer still uses and my 2600k, all at 3GHz. More data, I love me some more data!
post #150 of 758
Quote:
Originally Posted by Olivon View Post

For those interested, Hardware.fr published a 4GHz comparison on SB/IB/HW/BW/SL and multiple apps here :

http://www.hardware.fr/articles/940-6/cpu-sandy-bridge-vs-ivy-bridge-vs-haswell-vs-skylake-4-g.html

Wow, look at that Broadwell performance. Seems like Skylake is actually worse than the previous generation in general-purpose applications. What's up with that? A bigger focus on HPC than GP, perhaps?