
The Zen+ thread

#1 ·
Since we're likely a long way from the release of Zen+, it may not be too late to tell AMD what we want from it. Perhaps a bit of lobbying might have an influence. Probably not, but it's still interesting to discuss.

What do you want to see from Zen+?

Write your wish list here, discuss news about it, whatever. We had a massive Zen thread so this one can be the Zen+ thread.

I'll start by saying:

• An L4 cache of some type is big on my wish list. eDRAM. HBM 2. Plain transistors inside the main die. Something.

• A much larger die size to make room for more CPU power.

• A high power process (i.e. SHP).

• A performance-oriented cell library.

• A good quantity of new instructions. Intel shouldn't be the only company that gets to innovate here.

• AVX 1024. By the time Zen+ hits, Intel will have something past AVX 512 out, even if it's mainly for marketing purposes.

• The fabric, if it's used, needs to be faster, obviously.

• Continued use of solder. Yay for Zen having solder instead of cruddy TIM.

• A design that doesn't have rules restricting performance vs. power usage so much in favor of the latter. Regular Zen is designed to scale down to tiny low-power forms. Let Zen+ have bigger wings.

• Hardware-accelerated HEVC encoding with all the high-quality options enabled. GPU video encoding is typically hampered by only lower-quality settings being available.

• Have the design make RAM speed less important (e.g. L4 caching). Enterprise users don't buy overclocked RAM nor do they overclock it, eh? JEDEC standards are always hyper-conservative. And, some reviewers will use slow RAM.

Regular Zen can have some incremental improvements and continue to be manufactured. This enables Zen+ to target a higher-performance section of the market instead of trying to be everything.

I'm no CPU design expert so my list, I'm sure, can use some tweaking.
 
#2 ·
AVX use is super rare unless you're running benchmarks; SSE2/3 is what's important (see the sketch below).
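To illustrate why (a minimal sketch, my own, assuming GCC or Clang on x86-64; the kernel names are made up): shipped binaries have to run on every CPU out there, so they probe features at runtime and fall back to the SSE2 baseline that every x86-64 chip guarantees. That keeps AVX paths rare outside of tuned benchmarks.

Code:
/* Minimal runtime-dispatch sketch (GCC/Clang builtins, x86-64).
   Shipped software can't assume AVX, so it probes at runtime and
   keeps an SSE2 baseline path -- one reason AVX use stays rare. */
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();                 /* populate the feature flags */
    if (__builtin_cpu_supports("avx2"))
        puts("dispatch: AVX2 kernel");
    else if (__builtin_cpu_supports("avx"))
        puts("dispatch: AVX kernel");
    else
        puts("dispatch: SSE2 baseline");  /* guaranteed on any x86-64 CPU */
    return 0;
}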

L4/HBM: for a powerful APU, maybe a must-have.

This one is mostly for (locked) servers: more aggressive turbo, with more steps.
My Opteron is on turbo all the time, even under full load, so I think there is some room for higher clocks.


Better BIOSes. As I found with this CPU, it somehow runs memory at 1333 even though the DIMMs and CPU are capable of 1600, so I'm digging into the datasheet and will edit the BIOS settings manually. Thanks, Supermicro.

Some BIOS default memory settings make huge differences in throughput; you have to verify or disable them yourself.
If a reviewer is not careful, this will yield worse scores than the competition. Many reviews are run at defaults.
 
#3 ·
Quote:
Originally Posted by superstition222 View Post

• An L4 cache of some type is big on my wish list. eDRAM. HBM 2. Plain transistors inside the main die. Something.

• A much larger die size to make room for more CPU power.

• A high power process (i.e. SHP).
All of these are likely to be counterproductive features that would harm AMD's bottom line.
Quote:
Originally Posted by superstition222 View Post

• AVX 1024. By the time Zen+ hits, Intel will have something past AVX 512 out, even if it's mainly for marketing purposes.
AVX does have provisions to scale to 1024 bits, but this could take a while.
Quote:
Originally Posted by superstition222 View Post

• The fabric, if it's used, needs to be faster, obviously.
Yes.
Quote:
Originally Posted by superstition222 View Post

• Have the design make RAM speed less important (e.g. L4 caching). Enterprise users don't buy overclocked RAM nor do they overclock it, eh? JEDEC standards are always hyper-conservative. And, some reviewers will use slow RAM.
You don't need an L4 cache; you just need to decouple the data fabric and memory clocks.
 
#4 ·
Quote:
Originally Posted by Blameless View Post

All of these are likely to be counterproductive features that would harm AMD's bottom line.
I don't see why an L4 in particular would be counterproductive. A higher-power process would also help single-thread competitiveness against the higher-clocked Intel CPUs. Die size will have to increase to achieve parity with Intel in terms of AVX, and especially to leapfrog Intel.
 
#5 ·
Quote:
Originally Posted by superstition222 View Post

I don't see why an L4 in particular would be counterproductive.
Large cost in die area, even if it's just an HBM stack (which is easily the size of a CCX), likely without commensurate gains in performance.

Ryzen could certainly use some more refinement in its cache/memory subsystem, but one only needs to look at Broadwell to see that an L4 is an inefficient way to improve CPU (as opposed to IGP) performance outside of a handful of niche scenarios.
Quote:
Originally Posted by superstition222 View Post

A higher-power process would also help single-thread competitiveness against the higher-clocked Intel CPUs.
It would, but at the cost of die area and overall performance per watt.
 
#6 ·
Quote:
Originally Posted by Blameless View Post

one only needs to look at Broadwell to see that an L4 is an inefficient way to improve CPU (as opposed to IGP) performance outside of a handful of niche scenarios.
Peter Bright of Ars disagrees.
Quote:
Originally Posted by Blameless View Post

It would, but at the cost of die area and overall performance per watt.
Which is why regular Zen would still be around also, with basic refinements.
 
#7 ·
Quote:
Originally Posted by superstition222 View Post

Peter Bright of Ars disagrees.
I'm not so sure about that...
Quote:
The effect was far from universal. The 5775C gives up a lot in clock speed (and power consumption) to the 6700K, and with that advantage, the Skylake part often wins. But in memory-intensive workloads, such as some games and scientific applications, the cache is better than 21 percent more clock speed and 40 percent more power. That's the kind of gain that doesn't come along very often in our dismal post-Moore's law world.

Those 5775C results tantalized us with the prospect of a comparable Skylake part. Pair that ginormous cache with Intel's latest-and-greatest core and raise the speed limit on the clock speed by giving it a 90-odd W power envelope, and one can't help but imagine that the result would be a fine processor for gaming and workstations alike.

But imagine is all we can do because Intel isn't releasing such a chip. There won't be socketed, desktop-oriented eDRAM parts because, well, who knows why.
The overwhelming majority of applications are not ones that will benefit from the presence of a big L4. Intel likely hasn't pushed out further socketed Iris Pro parts because they have low demand and inferior margins.

Anyway, I never said an L4 wouldn't make Ryzen perform better, just that it would be a waste of transistors. You could slap an entire extra CCX onto a Ryzen part for the die-area/transistor cost of an HBM stack or eDRAM die.

Performance per watt and performance per transistor are better served via other means. Iris Pro Broadwells, despite being very fast in a few areas, are great examples of this.
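To put the "handful of niche scenarios" in concrete terms, here's a crude pointer-chase sketch (purely illustrative, my own, assuming GCC/Clang on a POSIX system): dependent-load latency jumps each time the working set overflows a cache level, so an L4 only pays off for the slice of workloads that lands between L3 capacity and DRAM.

Code:
/* Crude working-set latency probe. Builds a randomly shuffled circular
   pointer chain of a given size and times dependent loads; ns/load jumps
   as the working set crosses each cache level. An L4 would only show up
   for sizes between L3 capacity and DRAM. Build: gcc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t n) {                  /* n = number of pointers */
    void **ring = malloc(n * sizeof *ring);
    size_t *idx = malloc(n * sizeof *idx);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {         /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)               /* link into one big cycle */
        ring[idx[i]] = &ring[idx[(i + 1) % n]];

    void **p = ring;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < 10 * n; i++) p = *p;  /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    if (!p) puts("unreachable");                 /* keep the chain live */
    free(ring); free(idx);
    return ns / (10.0 * n);                      /* average ns per load */
}

int main(void) {
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        printf("%6zu KiB: %6.1f ns/load\n", kb, chase(kb * 1024 / sizeof(void *)));
    return 0;
}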
 
#8 ·
Double (or multiply by more than that) the Infinity Fabric interconnect speed. Reduce reliance on memory clocks (add straps).

DDR4 3200 MHz support out of the box with major brands such as Corsair, G.Skill, Kingston, Crucial, Team, etc.

SMT optimizations

Higher attainable clock speeds (4.4-4.8 GHz would be OK; that's a ~10-20% improvement in clocks).

Higher XFR bins on the top-end part, especially for low core counts at low temperatures. +100 MHz is a joke (though that's 4 bins at a 25 MHz step, so that might be the reasoning).


Make the AMD CBS menu that adjusts p-states mandatory on all X370 boards, while keeping the multiplier unlocked on B350 as it is now. If someone pays $200+ for an X370 board, it should have p-state adjustment.

Examples:

http://www.tweaktown.com/reviews/8099/asrock-x370-taichi-amd-motherboard-review/index5.html


https://nl.hardware.info/product/386634/asus-prime-x370-pro/fotos


https://nl.hardware.info/product/386386/asus-crosshair-vi-hero/fotos

Encourage the motherboard manufacturers to actually provide USB 3.1 Gen 2 instead of marketing USB 3.1 Gen 1 (USB 3.0 speed) as USB 3.1. It's part of the B350 chipset already, let alone the X370.
 
#9 ·
All of those things are incremental. Zen+ implies some type of significant design change.
Quote:
Originally Posted by Blameless
I'm not so sure about that...
I'm not, since the thrust of the article, the entire point of it, is that Intel is preventing the public from being able to buy a Skylake with an L4. Selective quoting there. If you're not sure, I suggest reading the article again, starting with its headline.
 
#10 ·
Quote:
Originally Posted by superstition222 View Post

All of those things are incremental. Zen+ implies some type of significant design change.
I think Zen+ will be Zen, but with single-cycle AVX-256, a faster and decoupled Infinity Fabric clock, and a handful of other evolutionary enhancements. I might wish for more, but this is the bulk of what's needed to address current shortcomings and play to existing strengths.

Zen to Zen+ is probably going to look like Nehalem to Sandy or Sandy to Haswell.
Quote:
Originally Posted by superstition222 View Post

I'm not, since the thrust of the article, the entire point of it, is that Intel is preventing the public from being able to buy a Skylake with an L4.
Which contradicts my statements how?
Quote:
Originally Posted by superstition222 View Post

Selective quoting there. If you're not sure, I suggest reading the article again, starting with its headline.
The author may not fully realize that giant L4 caches are a waste on desktop parts, from Intel's perspective, but they are.

AMD likely realizes this too and I will bet against more cache levels on any consumer part that lacks an IGP. Cost/benefit ratio is not justifiable.
 
#11 ·
Quote:
Originally Posted by Blameless View Post

The author may not fully realize that giant L4 caches are a waste on desktop parts, from Intel's perspective, but they are.
Intel's perspective isn't necessarily the right one. I'll cite lousy polymer TIM, and useless integrated graphics taking up half the die or more on ostensibly enthusiast-aimed parts. The fact is that an L4 would be a better use of half a die than integrated graphics for any enthusiast.

It's also possible that you aren't appreciating the evidence he provided to justify his argument.
Quote:
Originally Posted by Blameless View Post

AMD likely realizes this too and I will bet against more cache levels on any consumer part that lacks an IGP. Cost/benefit ratio is not justifiable.
Or, the goal isn't to make the best product possible, one that will still be profitable, but instead to cut corners to make a wee bit more (e.g. polymer TIM and integrated graphics).
 
#12 ·
I'm using solely linux, so here's the list from my perspective.

My mychines fall into three categories.

1. Workstations. Our use dictates multimonitor setup, we do need roomey RAM but nothing over-the top. GPU is needed ( graphic op acceleration) but it's not essential that it is top of the line. Energy consumption is important, since workstations tend to be on most of the time.

2. Small machines in special roles ( chip programmers, various tools control etc)

3. File server/firewall/router/etc. No special CPU muscle neccesary, but low power is important. It has to have as many RAM slots as possible and ECC is essential.

4. CPU muscle. Special box for tasks that need carpet bombing with CPU cores. Typically batch compiling jobs, some simulations, maybe video conversions ezc.

So, here it is:

- APU with HBM2. There is no APU today that would offer decent 2D/3D performance and replace dGPU, especially when connectoed to 3 4k monitors. Bandwidth drain is just too much. Since there is simply no alternative, this is by far largest point on my list.

- if possible, latest GPU generation on APU, with most advanced HSA integration.

- high-end APU version on bigger socket that could host 8C/16T Ryzen, really beefy GPU and HBM2. Something for TDP 200-250W.

- several decent video outputs, preferably at least 3 x DP1.4. When you pay nice lump of cash for APU with HBM, you want for it to be able even more demanding scenarios. What's the use of great graphics if you can see it on just one monitor ?

- for fileserver it would be nice if one could have ultra-low power model with TDP 45W or lower, be it for AM4 or SP3_v2 whatever that half-naples is supposed to occupy.

- higher DDR4-speeds supported( for CPU muscle ), since it affects intermodule communication. If 4000Mhz or more is attainable, it would be great. If not, 3200-3600 MHz seems fine.

-more flexibility in memory controller so that one could get to higher memory speeds without sacrificing PCIe3

- change to high-speed 14 nm process, if that can yield significantly higher speeds within same or not much higher TDP. If not, scratch that.
it's not worth having big costs of process transfer just for a couple 100 of MHz.
 
#13 ·
Quote:
Originally Posted by superstition222 View Post

The fact is that an L4 would be a better use of half a die than integrated graphics for any enthusiast.
That's never been in dispute.

Enthusiasts are still a tiny market, and even we would benefit more from transistors spent elsewhere than on an L4, in the majority of cases.

A Broadwell Iris Pro part has at least as much total die area as Broadwell-E LCC. The 5775C probably cost more to make than any of the 6000 series. I'd far rather have a 6900K or 6950X than a 5775C. With Ryzen, I'd rather have an extra CCX than an L4.

Of course, since enthusiasts are a tiny market, no business in its right mind would engineer a CPU solely with the enthusiast in mind. If they can cater to the enthusiast market with a design that would otherwise still exist, that's one thing, but no one has ever built a CPU just for enthusiasts.
Quote:
Originally Posted by superstition222 View Post

It's also possible that you aren't appreciating the evidence he provided to justify his argument.
We are using the same evidence, none of which I was ignorant of before I read the article.
Quote:
Originally Posted by superstition222 View Post

Or, the goal isn't to make the best product possible, one that will still be profitable, but instead to cut corners to make a wee bit more (e.g. polymer TIM and integrated graphics).
The best product possible, from a manufacturer's standpoint, is the one that makes them the most money: satisfying the needs of the relevant markets while spending as little as possible, and avoiding cannibalizing their own lineup.

If AMD does otherwise, they will continue to struggle.
 
#14 ·
In my book, Zen+ needs:
1) Better overclocking
2) An improved Infinity Fabric
3) Small refinements to improve IPC.
 
#15 ·
There are pretty big bandwidth limits standing in the way of AVX-1024.

I'm not sure the L1 cache on Ryzen (which lags behind Intel's) even has the bandwidth to perform a full-width store in one cycle.

I recall reading that on Knights Landing the L1 cache barely has the bandwidth to keep its functional units fed; that meant roughly 40:1 hits between L1 and L2, and forget about L3 and DRAM if you want to avoid stalls. Compounding the problem, getting less specialized code to even use AVX-512 a portion of the time is really, really hard. Sure, the FP performance with AVX-512 is amazing, but getting it to work takes very specialized code (at least for now).

My understanding is that the L1 bandwidth needs to be higher. The other issue is memory bandwidth, which is why Skylake did not support AVX-512 (it was hoped DDR4 would improve on this). Edit: There is also the matter that AVX-512 carries a penalty in die space; on consumer CPUs, Intel decided against it.

AVX-1024 would be very, very difficult, to put it mildly; the difficulties would be multiplied.

Right now I'd be happy with just AVX-512 on Ryzen+ and AVX that matches Intel's. That would be a major accomplishment.
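If anyone wants to see the wall for themselves, here's a toy microbenchmark (my own sketch, assuming GCC/Clang with -O2 -mavx2 on x86-64; the TSC ticks at base clock, so treat bytes/cycle as approximate). It streams a buffer that fits in L1D using 32-byte AVX2 loads; grow BUF_BYTES past the L1/L2/L3 sizes and watch the effective bandwidth fall off.

Code:
/* Toy L1D load-bandwidth probe (x86-64, GCC/Clang, build with -O2 -mavx2).
   Streams a 16 KiB buffer (fits in a 32 KiB L1D) with 32-byte AVX2 loads.
   Four accumulators keep the loads from serializing on FP-add latency. */
#include <immintrin.h>
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */
#include <stdint.h>
#include <stdio.h>

#define BUF_BYTES (16 * 1024)
#define FLOATS    (BUF_BYTES / sizeof(float))
#define ITERS     100000

static float buf[FLOATS] __attribute__((aligned(32)));

int main(void) {
    __m256 a0 = _mm256_setzero_ps(), a1 = a0, a2 = a0, a3 = a0;

    uint64_t t0 = __rdtsc();
    for (int it = 0; it < ITERS; it++)
        for (size_t i = 0; i < FLOATS; i += 32) {       /* 4 x 32 B per step */
            a0 = _mm256_add_ps(a0, _mm256_load_ps(&buf[i]));
            a1 = _mm256_add_ps(a1, _mm256_load_ps(&buf[i + 8]));
            a2 = _mm256_add_ps(a2, _mm256_load_ps(&buf[i + 16]));
            a3 = _mm256_add_ps(a3, _mm256_load_ps(&buf[i + 24]));
        }
    uint64_t t1 = __rdtsc();

    float sink[8];
    _mm256_storeu_ps(sink, _mm256_add_ps(_mm256_add_ps(a0, a1),
                                         _mm256_add_ps(a2, a3)));
    double bytes = (double)BUF_BYTES * ITERS;
    printf("~%.1f bytes/cycle (sink=%f)\n", bytes / (double)(t1 - t0), sink[0]);
    return 0;
}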

Edit:
If anyone is interested on what Intel has been doing and some more information about AVX 512, here's Intel's programming reference:
https://software.intel.com/en-us/intel-architecture-instruction-set-extensions-programming-reference

Quote:
Originally Posted by Blameless View Post

All of these are likely to be counterproductive features that would harm AMD's bottom line.
The higher-performance node is not as crazy as it sounds. Interestingly, Samsung is already offering 14LPU.

https://www.extremetech.com/extreme/238896-samsung-announces-new-high-performance-high-power-14nm-node-plans-10nm-improvements-shows-off-7nm-euv-wafer

Quote:
First, there's interesting news concerning a fourth-generation 14nm product, 14LPU. For those keeping score at home, Samsung released 14nm Low Power Early (14LPE) first, followed by 14nm Low Power Plus (14LPP), which was a broader ramp with more customers and up to 10% improved performance. Earlier this year, the company announced it would build a lower-cost variant of 14nm that didn't sacrifice on power or performance, 14LPC. This fourth-generation 14LPU is meant explicitly for customers who are building "high performance, compute-intensive" applications. 14LPU is said to offer better performance than 14LPC, but Samsung hasn't published details on how all four of its processes compare with one another; only 14LPP and 14LPE are listed on its website.

...
Samsung seems to be implying that 14LPU is a higher-performance node, while 10LPU is a cost-optimized node. Meanwhile, the company also showed off a 7nm EUV wafer and provided an "update" on its 7nm EUV progress, but neglected to tell us anything about what that update was. This is one area where there's a notable difference between the various foundries - Intel says it intends to push to 7nm without EUV, but will deploy that tech at 5nm. TSMC has said something similar, but Samsung remains resolute that it can introduce EUV at the 7nm node.
I'd hazard a guess that 14LPU might be ideal for Zen+ and Navi. Importantly, if we can get a few hundred MHz without too much of a power-consumption penalty, that would be huge. Maybe some further optimizations with high-density libraries could squeeze out more clocks.

Quote:
Originally Posted by Blameless View Post

A Broadwell Iris Pro part has at least as much total die area as Broadwell-E LCC. The 5775C probably cost more to make than any of the 6000 series. I'd far rather have a 6900K or 6950X than a 5775C. With Ryzen, I'd rather have an extra CCX than an L4.
Interestingly, you could get a 5775C for about 366 USD in 2015:
http://www.anandtech.com/show/9320/intel-broadwell-review-i7-5775c-i5-5675c/10

Agreed, though, that at that price a 5820K was in most cases a better buy in 2015 (they were very similar in price). Granted, an X99 board would have been more expensive than a Z97 one, and in 2015 a premium would have applied for DDR4, but these days DDR4 no longer carries a premium, so that's no longer applicable.

I'd say that even a fairly small L4 might help; in particular, it might help inter-CCX communication. That's a hypothesis, though.

But the big thing that it needs is to improve the bandwidth and latency of the Infinity Fabric.

Quote:
Originally Posted by superstition222 View Post

Intel's perspective isn't necessarily the right one. I'll cite lousy polymer TIM, and useless integrated graphics taking up half the die or more on ostensibly enthusiast-aimed parts. The fact is that an L4 would be a better use of half a die than integrated graphics for any enthusiast.

It's also possible that you aren't appreciating the evidence he provided to justify his argument.
Or, the goal isn't to make the best product possible, one that will still be profitable, but instead to cut corners to make a wee bit more (e.g. polymer TIM and integrated graphics).
Unfortunately for us, Intel is in the business of maximizing its profits, not delivering the best product possible.

They are using the same dies across the board, on laptops, desktops, and all segments. It's just that they ruthlessly segment features.

There's nothing, for example, stopping Intel from giving us an unlocked CPU with ECC and the other business/workstation features you find on their high-end CPUs. No technical reason, just profits. Major props to AMD for giving us ECC on Ryzen for those who want it.
 
#17 ·
Quote:
Originally Posted by MBugaria View Post

The PCIe 4.0 specification will be approved in 2017, but it seems more likely that's not for Zen+, maybe for Zen 3.

From Zen+, then, I would at least like to see more PCIe 3.0 lanes from the CPU. 20 lanes now is just a shame.
Trust me, you do not need more. By the time you need more, PCIe 4.0 will be out. A single GPU (x16) plus a x4 NVMe drive covers 95% of users.
 
#18 ·
WRT the AVX2 stuff, forget it.

Such extra-wide vector units put great demands on other parts of the chip, which means various compromises elsewhere, and all of that for little gain.

Nowadays AVX looks like a joke compared to a GPU's units, so why bother with it?
It's fine to have something where old code demands SSEx, but for anything high-performance it's just a waste of time.
 
#19 ·
Quote:
Originally Posted by Brane2 View Post

Quote:
Originally Posted by MBugaria View Post

The PCIe 4.0 specification will be approved in 2017, but it seems more likely that's not for Zen+, maybe for Zen 3.

From Zen+, then, I would at least like to see more PCIe 3.0 lanes from the CPU. 20 lanes now is just a shame.
Why the f**k do you need more?
Are you aware that EACH PCIe transceiver eats nontrivial power and lowers the power budget for actual computing?
Do you know that transceivers that are twice as fast and more complicated eat at least that much more power?

And all that for what, a 2% difference in performance?

With special storage and networking cards it's another matter, but for the gaming crowd, why would anyone really push in that direction? What is there to gain?
You could have said the same thing when 2.0 and 3.0 were introduced. Today, 1.0 would certainly carry a performance hit.
 
#20 ·
Quote:
Originally Posted by brucethemoose View Post

You could have said the same thing when 2.0 and 3.0 were introduced.
Back then, I did.

Quote:
Today, 1.0 would certainly carry a performance hit.
So that's why using a higher version makes sense now.

When gains from 4.0 exceed losses on transceivers, it will make sense in home gear, too.

Why would you put it on the chip now? What's there to gain?
It's kind of like with cars: almost anything can do 100 mph, 200 mph is supercar territory, and 300 mph is still unreached even with the newest Chiron (1,500 horsepower). It will probably take a 2,000 hp engine to get there in a car.

Same with logic cells. It's debatable whether the existing 14 nm process can even do PCIe 4.0 speeds, and what sacrifices would have to be made for it.
In the future, at 10 or 7 nm with a special high-performance process, that might be much easier to reach.
 
#21 ·
Quote:
Originally Posted by Brane2 View Post

Quote:
Originally Posted by brucethemoose View Post

You could have said the same thing when 2.0 and 3.0 were introduced.
Back then, I did.

Quote:
Today, 1.0 would certainly carry a performance hit.
So that's why using a higher version makes sense now.

When gains from 4.0 exceed losses on transceivers, it will make sense in home gear, too.

Why would you put it on the chip now? What's there to gain?
It's kind of like with cars: almost anything can do 100 mph, 200 mph is supercar territory, and 300 mph is still unreached even with the newest Chiron (1,500 horsepower). It will probably take a 2,000 hp engine to get there in a car.

Same with logic cells. It's debatable whether the existing 14 nm process can even do PCIe 4.0 speeds, and what sacrifices would have to be made for it.
In the future, at 10 or 7 nm with a special high-performance process, that might be much easier to reach.
It's a chicken-and-egg problem. Wait too long to introduce the next PCIe standard, and no one will want to make hardware for it, as the vast majority of existing systems will be PCIe 3.0.

Then, by the time it's actually widely adopted, performance would be a problem. It's therefore better to push it out early if the technology allows, even at a little extra expense.
 
#22 ·
Quote:
Originally Posted by brucethemoose View Post

Then, by the time it's actually widely adopted, performance would be a problem. Therefore, it's better to shove it out early if the technology allows it, even with a little extra expense.
1. Not true. A new generation is implemented where it makes sense. Specialized server chips will probably see it first, along with special networking gear and the like.

2. Since AMD makes both CPUs and GPUs, this is even less of a problem. When it makes sense on both sides, they'll implement it.

You seem to think that adopting a new version is a matter of openness and worldview. It isn't; a new standard gets implemented when the engineering and the economics line up, not when someone decides to be more open to "new possibilities."
 
#23 ·
Quote:
Originally Posted by CrazyElf View Post

My understanding is that the L1 bandwidth needs to be faster.
Yeah, Intel doubled L1 bandwidth and L2-to-L1 bandwidth to get acceptable AVX-256 performance when it added AVX2 in Haswell.

AVX-1024 would need at least two more doublings over the Haswell level, or about eight times the per-cycle bandwidth of Ryzen.
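To spell out the arithmetic (the Haswell and Zen 1 figures are published per-cycle L1D numbers; the wider rows are my extrapolation, nothing announced):

- Haswell (AVX2): 2 × 32 B loads + 1 × 32 B store = 96 B/cycle
- Hypothetical AVX-512 core at the same ratios: 2 × 64 B + 1 × 64 B = 192 B/cycle
- Hypothetical AVX-1024 core: 2 × 128 B + 1 × 128 B = 384 B/cycle
- Zen 1 L1D: 2 × 16 B loads + 1 × 16 B store = 48 B/cycle

384 / 48 = 8, hence the "about eight times" figure.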
Quote:
Originally Posted by CrazyElf View Post

The higher-performance node is not as crazy as it sounds. Interestingly, Samsung is already offering 14LPU.
I'd certainly like them to take advantage of a higher performance node, but I'm not sure AMD will find justification for it...unless Intel pushes the envelope further and AMD needs something to counter an increased lightly-threaded performance deficit.
Quote:
Originally Posted by CrazyElf View Post

Interestingly, you could get a 5775C for about 366 USD in 2015
Yep, which is probably why Intel wasn't eager to repeat the experience with later mainstream parts: HEDT production costs but mainstream revenue for an Iris Pro CPU on the desktop. There's no real way to die-harvest either, since the L4 is on a separate die; if it, or the CPU it's paired with, is defective, you just save the eDRAM for a working part.

At least with the mobile parts they can justify charging an arm and a leg; they could never get the same margins out of an Iris Pro desktop part.
Quote:
Originally Posted by CrazyElf View Post

Agreed, though, that at that price a 5820K was in most cases a better buy in 2015 (they were very similar in price).
I've been running the lowest-end HEDT parts in my first- and second-string systems since 2010, because that's where the value has been for the mix of work I do.

Ryzen, or its HEDT equivalent, is looking damn good for my next system.
 
#25 ·
Quote:
Originally Posted by Brane2 View Post

WRT the AVX2 stuff, forget it.

Such extra-wide vector units put great demands on other parts of the chip, which means various compromises elsewhere, and all of that for little gain.

Nowadays AVX looks like a joke compared to a GPU's units, so why bother with it?
It's fine to have something where old code demands SSEx, but for anything high-performance it's just a waste of time.
+1
They said the future is Fusion.
The problem with big caches is that they deliver zero GFLOPS/mm² and consume a lot of power. Using a GPU (an APU) instead fixes that: better-used silicon.
 
#26 ·
Quote:
Originally Posted by geoxile View Post

Has AMD said whether Zen+ (Zen 2? 3?) will be compatible with AM4 and current chipsets like X370?
I don't think there has been a clear statement, but based on what I have seen, Zen 2 will be on AM4 and Zen 3 on AM4+.
 