Overclock.net › Forums › Industry News › Hardware News › [OC3D] AMD's Zen will have a "greater than 40%" IPC improvement over Excavator, says Lisa Su

[OC3D] AMD's Zen will have a "greater than 40%" IPC improvement over Excavator, says Lisa Su - Page 43

post #421 of 841
Quote:
Originally Posted by Themisseble View Post

AMD really needs to pull it off this time.

Jaguar works great. If they can do something like Jaguar for the high end, then I will buy it.
That seems to be the plan. Take the best parts of the things they have and mash them together. As long as the Zen cores can be kept fed with two AGUs and retain the high bandwidth and really low latency they seem to have, AMD should be golden.
post #422 of 841
Quote:
Originally Posted by Tojara View Post

That seems to be the plan. Take the best parts of the things they have and mash them together. As long as the Zen cores can be kept fed with two AGUs and retain the high bandwidth and really low latency they seem to have, AMD should be golden.

You are right...
Some rumors say, as we discussed before, that Zen has 10 pipelines.
http://dresdenboy.blogspot.si/2015/10/amds-zen-core-family-17h-to-have-ten.html

There is no source on whether it uses a 128-bit or a 256-bit FPU. Actually, there is so little info that it's hard to predict performance. How many pipelines does Haswell have?
post #423 of 841
Quote:
Originally Posted by Themisseble View Post

You are right...
Some rumors say, as we discussed before, that Zen has 10 pipelines.
http://dresdenboy.blogspot.si/2015/10/amds-zen-core-family-17h-to-have-ten.html

There is no source on whether it uses a 128-bit or a 256-bit FPU. Actually, there is so little info that it's hard to predict performance. How many pipelines does Haswell have?
Only six or seven if you use the same counting method as for Zen, but that doesn't mean it's slower, just that the operations are combined onto the same ALUs instead of being kept separate. That's theoretically a negative, but you're not really going to be able to feed more than four units for any significant length of time.

Haswell has:
Four ALUs (Four INT or max 2x 256b FP)
Two LD/ST AGUs + One ST AGU
~4/12/21 cycle cache latency

Sandy Bridge has:
Three ALUs (Three INT or 1x 256b FP)
Two LD/ST AGUs
~5/14/22 cycle cache latency

Zen supposedly has:
Eight ALUs (Four INT and four FP (for combined max 2x 256b?), if with 256b basically equal to HW)
Two LD/ST AGUs (lacking one ST AGU vs HW, but only loads are critical. Not storing doesn't hurt performance until the queue is full. Basically same peak with slightly less consistency)
2/10/24 cycle cache latency at best (I'd expect something along ~3/12/26)

But again, pipelines are only part of the equation and only give a general indication of how well the architecture can perform at peak. Latency is also pretty meaningless without knowing the bandwidth, though too much latency would hurt performance either way, so the estimates on that front are also quite positive.
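To put the latency figures above into one number, here is a minimal sketch that weights each cache level's latency by how often a load hits there. The latencies are the ones listed in this post; the hit fractions are made-up assumptions for illustration only (not measurements), and misses that go to main memory are ignored entirely.

```python
# Cache latencies in cycles (L1, L2, L3), as listed in the post above.
LATENCIES = {
    "Haswell": (4, 12, 21),
    "Sandy Bridge": (5, 14, 22),
    "Zen (rumored best case)": (2, 10, 24),
    "Zen (estimate in the post)": (3, 12, 26),
}

# ASSUMPTION: hypothetical share of loads served by L1, L2 and L3.
# These are illustrative values only; real workloads vary widely.
HIT_FRACTIONS = (0.90, 0.08, 0.02)


def average_hit_latency(latencies, fractions=HIT_FRACTIONS):
    """Weighted average latency in cycles over the three cache levels."""
    return sum(lat * frac for lat, frac in zip(latencies, fractions))


for name, lats in LATENCIES.items():
    print(f"{name}: {average_hit_latency(lats):.2f} cycles on average")
```

With hit fractions skewed this heavily toward L1, the L1 latency dominates the result, which is why a slightly slower L3 matters little in this toy model. It says nothing about bandwidth, which is the other half of the picture.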
Edited by Tojara - 1/30/16 at 6:45am
post #424 of 841
Quote:
Originally Posted by Themisseble View Post

You are right...
Some rumors say, as we discussed before, that Zen has 10 pipelines.
http://dresdenboy.blogspot.si/2015/10/amds-zen-core-family-17h-to-have-ten.html

There is no source on whether it uses a 128-bit or a 256-bit FPU. Actually, there is so little info that it's hard to predict performance. How many pipelines does Haswell have?

All of this is derived from the GCC patches:

http://looncraz.net/ZenAssignments.html

http://looncraz.net/HswAssignments.htm

Haswell is 8-wide, Zen is 10-wide. But that really doesn't tell you much, except that Zen is apparently using simpler pipelines to reduce complexity (to help ensure higher clock speeds, simplify development, and also retain some of the performance dynamic found in the Stars architecture, enabled by independently addressable floating point pipes).

Haswell uses a unified scheduler, so all instructions are fed into a single unit post-decode, just prior to finding their way into an available execution unit. Haswell cannot initiate execution of certain floating point instructions while also initiating execution of certain integer instructions on the same port (0, 1, 5, 6). This is really only a loss of a cycle or two, but for instructions that might only need one to five cycles, it becomes a more meaningful limitation. There are many advantages to this arrangement, however, such as a simple, fast results bus that allows results to be fed back into the unified scheduler to more quickly satisfy the data needs of result-dependent instructions. Often enough, the dependent instructions can be fed into the proper execution unit within a cycle or two of the result being calculated - a massive speedup from the old system.
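The port-conflict point can be illustrated with a toy issue model. This is only a sketch under simplifying assumptions: each port starts at most one instruction per cycle, instructions are picked greedily in program order, and the port names and bindings below are hypothetical, not the real Haswell or Zen assignments.

```python
def cycles_to_issue(instructions, ports):
    """Count cycles needed to start every instruction.

    instructions: list of (name, allowed_ports) tuples, in program order.
    ports: iterable of port names; each port starts one instruction per cycle.
    Greedy model: an instruction takes the first free port it is allowed to use.
    """
    pending = list(instructions)
    cycles = 0
    while pending:
        free = set(ports)
        still_waiting = []
        for name, allowed in pending:
            usable = [p for p in allowed if p in free]
            if usable:
                free.remove(usable[0])  # port is busy for the rest of this cycle
            else:
                still_waiting.append((name, allowed))
        if len(still_waiting) == len(pending):
            raise ValueError("an instruction has no port it can ever use")
        pending = still_waiting
        cycles += 1
    return cycles


# Hypothetical case: an FP op and an integer op that can both only start on p0.
shared_port = [("fp_op", ["p0"]), ("int_op", ["p0"])]
print(cycles_to_issue(shared_port, ["p0", "p1", "p5", "p6"]))

# Same two ops with separate FP and integer pipes (again, made-up names).
split_pipes = [("fp_op", ["fp0"]), ("int_op", ["int0"])]
print(cycles_to_issue(split_pipes, ["fp0", "int0"]))
```

In the shared-port case the second op waits a cycle; with separate pipes both start together. The model ignores latencies, dependencies, and result forwarding, which is exactly where the unified scheduler earns its keep.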

Phenom II used a unified reorder buffer, then divided the instructions by floating point or integer type and sent them to independent schedulers. This allowed the greatest possible degree of decoupling. Results went to the load-store queue, which would allow dependent instructions to pull in the results directly, but often the result would end up in a target register or in the cache before the next instruction was available, leading to less of a speedup than a unified scheduler, though it was nearly as good 90% of the time. However, integer scheduling was more complicated. Phenom II had three AGU/ALU clusters, each fed by a dedicated scheduler, which was fed by an integer control buffer, which got its data right from the reorder buffer. Integer instructions included all dependent memory operations as well, as each cluster was a nearly fully capable execution unit (you could pretty much use one cluster to act as a full core). This had the upside that the results would be immediately available locally (for the cluster) - and the downside that an ALU would often sit idle waiting for the memory, while the scheduler was backlogged for that cluster.

For Zen, it appears there will be three schedulers: one for integer, one for memory, and one for floating point, probably also fed by a reorder buffer. The upside is that you don't (usually) have instructions blocking progress while waiting for data. You have maximum decoupling between memory operations, integer operations, and floating point operations. It is more likely to reach the peak performance possible from the core assets this way... as long as you can get the results back to the dependent instructions fast enough (often an issue with multiple schedulers). AMD could not have just copied the Phenom II solution, as that was rather dependent on the locality of data and instructions provided by the cluster design. It will be interesting to see what they use (they will probably use a variation on something they already have). The downside to independent schedulers is that you will, more often, be stalling for the results. This is where SMT and out-of-order execution come to the rescue.

In the end, it looks like Haswell and Zen will be averaging out to the same throughput, except in situations where Haswell will be delaying port executions due to instruction conflicts - where Zen will pull ahead (10%+), or in situations where dependent data is less localized - where Haswell will pull ahead (possibly by a great deal - 30%+). It then comes down to the cache performance and the specific instruction mix.
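As a rough way to see how those two cases might net out, here is a back-of-the-envelope sketch. It takes the 10% and 30% figures above at face value (they are guesses, and both are stated as lower bounds), and the workload fractions are purely hypothetical inputs.

```python
def zen_relative_throughput(frac_port_conflict, frac_nonlocal,
                            zen_gain=1.10, haswell_gain=1.30):
    """Zen throughput relative to Haswell (1.0 means parity).

    frac_port_conflict: share of Haswell's runtime where port conflicts bite
                        (Zen assumed zen_gain times faster there).
    frac_nonlocal:      share of Haswell's runtime where dependent data is less
                        localized (Haswell assumed haswell_gain times faster).
    The remainder of the runtime is assumed to run at equal speed.
    """
    if (frac_port_conflict < 0 or frac_nonlocal < 0
            or frac_port_conflict + frac_nonlocal > 1):
        raise ValueError("fractions must be non-negative and sum to at most 1")
    frac_equal = 1.0 - frac_port_conflict - frac_nonlocal
    zen_time = (frac_equal
                + frac_port_conflict / zen_gain   # Zen finishes this part sooner
                + frac_nonlocal * haswell_gain)   # Zen takes longer on this part
    return 1.0 / zen_time


# Hypothetical mix: 20% port-conflict-bound, 10% bound by non-local data.
print(f"{zen_relative_throughput(0.20, 0.10):.3f}")
```

The point of the sketch is only that a small share of the "less localized" case can cancel a larger share of the port-conflict case, because the assumed penalty is bigger than the assumed gain.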
Edited by looncraz - 1/30/16 at 12:04pm
post #425 of 841
Quote:
Originally Posted by looncraz View Post

snip.

This is probably the most reasonable, well thought out, and informative post I have ever seen in one of these threads. Kudos.
post #426 of 841
Quote:
Originally Posted by Fyrwulf View Post

This is probably the most reasonable, well thought out, and informative post I have ever seen in one of these threads. Kudos.
QFT
Looncraz is one of the few commenters worth reading about ZEN speculation
post #427 of 841
Quote:
Originally Posted by Tojara View Post

Haswell has:
Four ALUs (Four INT or max 2x 256b FP)
Two LD/ST AGUs + One ST AGU
~4/12/21 cycle cache latency

Sandy Bridge has:
Three ALUs (Three INT or 1x 256b FP)
Two LD/ST AGUs
~5/14/22 cycle cache latency

Nah, it's 3x 256-bit SIMD FP for both SB/HW and 3x SIMD int at 128/256 bits for SB/HW. Source: http://www.realworldtech.com/haswell-cpu/4/

As far as I've seen, Zen is using the same symmetric design Bulldozer uses. It isn't really comparable to HW/SB, imo.
post #428 of 841
Quote:
Originally Posted by Faithh View Post

Nah, it's 3x 256-bit SIMD FP for both SB/HW and 3x SIMD int at 128/256 bits for SB/HW. Source: http://www.realworldtech.com/haswell-cpu/4/

As far as I've seen, Zen is using the same symmetric design Bulldozer uses. It isn't really comparable to HW/SB, imo.

As the article shows, things aren't always as clear-cut as they seem. Haswell has four execution ports, three of which have some sort of floating point hardware attached.

Ports 0, 1, & 5 are the most heavily adorned according to the link you posted and according to the GCC data (http://looncraz.net/HswAssignments.htm), but things aren't quite so simple. In the end, what matters is what can be done at once:

Haswell:

1x fdiv/ssediv
2x sse_add (+ mmx_add and sse_iadd)
1x sse_mul (not during fdiv/ssediv or mmx_shift).
1x sse_cvt
2x ssemulsadd

Zen is as follows: (http://looncraz.net/ZenAssignments.html)

2x fdiv OR 1x ssediv + 1x fdiv
2x sse_add (3x mmx_add and 3x sse_iadd)
2x sse_mul (one shared with one fdiv pipe)
1x sse_cvt
1x ssemulsadd (binding different units depending on the type of instruction)

It seems clear that Zen is aiming to match / beat Haswell on the FPU front... for traditional software. Of course, instruction latencies and other factors come into play, but Zen should, in theory, have a beast of an FPU.
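For anyone who wants to compare the two lists side by side, here is a small sketch that encodes the per-cycle counts above and prints the difference per instruction type. It flattens the fine print (the fdiv/ssediv sharing, the "OR" on Zen's dividers, the shared mul/div pipe), so treat the numbers as the simplified peaks from this post, not as scheduling rules.

```python
# Peak ops that can start per cycle, as read off the lists in this post.
# ASSUMPTION: "1x fdiv/ssediv" is recorded as 1 for each type, and Zen's
# "2x fdiv OR 1x ssediv + 1x fdiv" as fdiv=2, ssediv=1; the sharing
# constraints between those units are not modeled.
HASWELL = {"fdiv": 1, "ssediv": 1, "sse_add": 2, "sse_mul": 1,
           "sse_cvt": 1, "ssemulsadd": 2}
ZEN = {"fdiv": 2, "ssediv": 1, "sse_add": 2, "sse_mul": 2,
       "sse_cvt": 1, "ssemulsadd": 1}


def compare(baseline, other):
    """Return {instruction_type: other - baseline} for every type in baseline."""
    return {kind: other[kind] - baseline[kind] for kind in baseline}


for kind, delta in compare(HASWELL, ZEN).items():
    if delta > 0:
        verdict = f"Zen +{delta}"
    elif delta < 0:
        verdict = f"Haswell +{-delta}"
    else:
        verdict = "tie"
    print(f"{kind:<11} {verdict}")
```

On these counts Zen leads on fdiv and sse_mul, Haswell leads on ssemulsadd, and the rest tie.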
post #429 of 841
Quote:
Originally Posted by 7850K View Post

QFT
Looncraz is one of the few commenters worth reading about ZEN speculation

post #430 of 841
Quote:
Originally Posted by looncraz View Post

As the article shows, things aren't always as clear-cut as they seem. Haswell has four execution ports, three of which have some sort of floating point hardware attached.

Ports 0, 1, & 5 are the most heavily adorned according to the link you posted and according to the GCC data (http://looncraz.net/HswAssignments.htm), but things aren't quite so simple. In the end, what matters is what can be done at once:

Haswell:

1x fdiv/ssediv
2x sse_add (+ mmx_add and sse_iadd)
1x sse_mul (not during fdiv/ssediv or mmx_shift).
1x sse_cvt
2x ssemulsadd

Zen is as follows: (http://looncraz.net/ZenAssignments.html)

2x fdiv OR 1x ssediv + 1x fdiv
2x sse_add (3x mmx_add and 3x sse_iadd)
2x sse_mul (one shared with one fdiv pipe)
1x sse_cvt
1x ssemulsadd (binding different units depending on the type of instruction)

It seems clear that Zen is aiming to match / beat Haswell on the FPU front... for traditional software. Of course, instruction latencies and other factors come into play, but Zen should, in theory, have a beast of an FPU.

Hmm, what about Skylake's FPU?