

Registered · 1,796 Posts · Discussion Starter #1
Quote:
Why? So far we have not been able to get a real answer out of Nvidia. Naturally, we wanted to take advantage of GDC to try to learn more, so we put our questions to Nvidia at a meeting arranged with Rev Lebaredian, Senior Director of GameWorks. Unfortunately for us, this engineer, who is part of the technical support group for game developers, was very well prepared for questions touching on the specifics of multi-engine support. His answers were initially a verbatim repetition of the brief official statement Nvidia has given the technical press over recent months, namely: "Maxwell GeForces can support concurrent execution at the SM level (groups of processing units)", "it is not yet enabled in the driver", and "Ashes of the Singularity is only one (not very important) game among others."
Quote:
Unusually guarded boilerplate, which shows, if it still needed showing, that the issue bothers Nvidia. So rather than keep running into the same wall, we changed tack and approached the subject from a different angle: is Async Compute actually important (for Maxwell GPUs)? That relaxed Rev Lebaredian and opened the way to a much more interesting discussion, in which Nvidia developed two arguments.

First, while Async Compute is one way to increase performance, what matters in the end is overall performance. If GeForce GPUs are already more efficient at the baseline than Radeon GPUs, then using multi-engine to try to boost their performance further is not a top priority.

Second, if the utilization of the various blocks of a GeForce GPU is already relatively high to begin with, the potential gain from Async Compute is smaller. Nvidia claims that, overall, there are far fewer holes ("bubbles" in GPU parlance) in the activity of its execution units than in its competitor's, and the whole point of concurrent execution is to exploit synergies between different tasks in order to fill those holes.
Quote:
When developing a GPU architecture, much of the work consists of anticipating the profile of the workloads that will be run by the time the new chips reach the market. Balancing the architecture between its different types of units, between compute power and memory bandwidth, between triangle rate and pixel throughput, and so on, is a crucial point that requires good visibility, a lot of pragmatism and a strategic vision. It is fair to say that Nvidia has managed this rather well over several generations of GPUs.
Quote:
While our own analysis leads us broadly to agree with Nvidia on these arguments, there is another point that matters to gamers, and it is probably what makes the number-one GPU vendor pay only lip service to the topic: Async Compute provides a free gain for Radeon owners. Although the capability has been built into AMD GPUs for more than four years, AMD has never been able to turn it into commercial profit; the cards were not sold at a premium because of it. That is changing somewhat with AMD's latest line-up, which leans heavily on this point, but in terms of perception, gamers love getting a little boost for free, even if only in a handful of games. Conversely, the generally higher performance of Nvidia's GPUs delivers an immediate benefit in current games, and can be priced directly into a GeForce. From the perspective of a company whose goal is not to post losses, it is clear which approach makes more sense.
Quote:
Still, it is 2016, and the use of Async Compute should gradually spread, particularly thanks to the similarity between the architecture of the console GPUs and that of the Radeons. Nvidia cannot totally ignore something that could reduce or eliminate its performance lead. Without going into detail, Rev Lebaredian therefore wanted to reiterate that there are indeed opportunities at the driver level to implement support that would, in some cases, deliver a performance gain with Async Compute, opportunities that Nvidia constantly re-evaluates, without forgetting that its future GPUs could change the situation at this level.
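For anyone unfamiliar with the terminology, "multi engine" in DirectX 12 simply means that an application can create separate graphics and compute command queues and submit work to both; whether the GPU actually overlaps that work to fill the "bubbles" described above is entirely up to the hardware and driver. A minimal C++ sketch of the queue setup, assuming a device has already been created (the helper name is just for illustration):

Code:

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a graphics (DIRECT) queue and a separate COMPUTE queue on the same
// device. Work submitted to the two queues *may* be executed concurrently;
// nothing in the API guarantees overlap -- that is exactly the hardware/driver
// question the article is about. Error handling omitted.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;        // graphics + compute + copy
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&graphicsQueue));

    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // compute + copy only
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
}
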
Excellent read. Highly recommended.

Source

Translated
 

Sunday League Jibber · 4,254 Posts
Interesting stuff, thanks for the find. A-level French finally pays off!
 

What should be here ? · 5,735 Posts
Mind giving a TL;DR version that would douse part of the forum fire? Appreciated.
 

Sunday League Jibber · 4,254 Posts
Quote:
Originally Posted by huzzug View Post

Mind giving a TL;DR version that would douse part of the forum fire? Appreciated.
Re-affirmation that Nvidia products at this time are not hardware-equipped for Async in the same way as GCN cards. No specific indication given as to whether this changes with Pascal/Volta. Reiteration of some kind of gain possibly realised from drivers but nothing concrete given at this time. Nvidia cards perform well in DX11 and that's where games are now. Also something linking back to what @Mahigan has said regarding Nvidia/AMD and Async in that Nvidia's basically saying that because they're already operating at negligible overhead under DX11 and Async relieves some of AMD's overhead issues, that it may not realise the same gains for Maxwell and thus may be less of a priority generally speaking.

I think that's about it.
 

Registered · 1,157 Posts
Quote:
Originally Posted by SuperZan View Post

Re-affirmation that Nvidia products at this time are not hardware-equipped for Async in the same way as GCN cards. No specific indication given as to whether this changes with Pascal/Volta. Reiteration of some kind of gain possibly realised from drivers but nothing concrete given at this time. Nvidia cards perform well in DX11 and that's where games are now. Also something linking back to what @Mahigan
has said regarding Nvidia/AMD and Async in that Nvidia's basically saying that because they're already operating at negligible overhead under DX11 and Async relieves some of AMD's overhead issues, that it may not realise the same gains for Maxwell and thus may be less of a priority generally speaking.

I think that's about it.
He also says that if we consider Async Compute a method of gaining performance, what matters in the end is the overall performance of the architecture. To that end, if Nvidia's architecture does not utilize Async Compute but still has better overall performance than AMD's, which does, then is Async Compute really that important?

It's an interesting take on how a company designs its architecture.
 

Sunday League Jibber · 4,254 Posts
Quote:
Originally Posted by Firann View Post

He also says that if we consider Async Compute a method of gaining performance, what matters in the end is the overall performance of the architecture. To that end, if Nvidia's architecture does not utilize Async Compute but still has better overall performance than AMD's, which does, then is Async Compute really that important?

It's an interesting take on how a company designs its architecture.
It is. What I took away is that it's more of what we don't know. We don't have enough to draw from in terms of examples and data to really know the importance of Async with regards to performance. That said it does seem to be an efficient way to make certain things possible in some games, as @Kollock has told us when he popped in to the AotS thread. Ultimately I think that both companies will capitalise on that potential with hardware support going forward.
 

Registered · 1,713 Posts
If nV hardware isn't async then how does it run PhysX?

I think nV is having trouble developing an async implementation that doesn't run into AMD's patents.
 

Registered · 22 Posts
Quote:
Originally Posted by SuperZan View Post

Re-affirmation that Nvidia products at this time are not hardware-equipped for Async in the same way as GCN cards. No specific indication given as to whether this changes with Pascal/Volta. Reiteration of some kind of gain possibly realised from drivers but nothing concrete given at this time. Nvidia cards perform well in DX11 and that's where games are now.

I think that's about it.
I read the original in French and I didn't see where it affirms that GeForces are not hardware-equipped for async. It did, in fact, mention that theoretically, GeForces are capable of 32 parallel compute queues, which are currently not functional.

What I did find more interesting is that the Nvidia engineer claims that async's ability to gain "free" performance on AMD GPUs stems from the latter's holes or "bubbles" in their execution queues; in other words, they aren't able to saturate their cores with as much work as they are capable of at the lowest level, because their architecture can't feed them enough work from gaming loads. Apparently, this has been the case for the better part of 4 years. Async is one way to fill those holes and make better use of the otherwise wasted horsepower, promising potentially better performance in the future as async is rapidly adopted.

He also claims that GeForces natively have a more "balanced" design that is better able to saturate their full theoretical throughput. There nonetheless remains potential for gains from implementing async in GeForce drivers, which wouldn't make much sense if the hardware weren't capable of it.
 

Simpleton · 1,894 Posts
Quote:
Originally Posted by prjindigo View Post

If nV hardware isn't async then how does it run PhysX?

I think nV is having trouble developing an async implementation that doesn't run into AMD's patents.
different things.
 

Sunday League Jibber · 4,254 Posts
Quote:
Originally Posted by fruits View Post

I read the original in French and I didn't see where it affirms that GeForces are not hardware-equipped for async. It did, in fact, mention that theoretically, GeForces are capable of 32 parallel compute queues, which are currently not functional.

What I did find more interesting is that the Nvidia engineer claims that async's ability to gain "free" performance on AMD GPUs stems from the latter's holes or "bubbles" in their execution queues; in other words, they aren't able to saturate their cores with as much work as they are capable of at the lowest level, because their architecture can't feed them enough work from gaming loads. Apparently, this has been the case for the better part of 4 years. Async is one way to fill those holes and make better use of the otherwise wasted horsepower, promising potentially better performance in the future as async is rapidly adopted.

He also claims that GeForces natively have a more "balanced" design that is better able to saturate their full theoretical throughput. There nonetheless remains potential for gains from implementing async in GeForce drivers, which wouldn't make much sense if the hardware weren't capable of it.
Pardon, it's capable of executing parallel workloads, but that's not quite the same thing; the two companies have different definitions, and that's crucial to the argument.

"When AMD and Nvidia talk about supporting asynchronous compute, they aren't talking about the same hardware capability. The Asynchronous Command Engines in AMD's GPUs (between 2-8 depending on which card you own) are capable of executing new workloads at latencies as low as a single cycle. A high-end AMD card has eight ACEs and each ACE has eight queues. Maxwell, in contrast, has two pipelines, one of which is a high-priority graphics pipeline. The other has a a queue depth of 31 - but Nvidia can't switch contexts anywhere near as quickly as AMD can."

"According to a talk given at GDC 2015, there are restrictions on Nvidia's preeemption capabilities. Additional text below the slide explains that "the GPU can only switch contexts at draw call boundaries" and "On future GPUs, we're working to enable finer-grained preemption, but that's still a long way off." To explore the various capabilities of Maxwell and GCN, users at Beyond3D and Overclock.net have used an asynchronous compute tests that evaluated the capability on both AMD and Nvidia hardware. The benchmark has been revised multiple times over the week, so early results aren't comparable to the data we've seen in later runs."

Stating that the software has to allow execution of more parallel workloads that more closely align with AMD's definition is essentially saying that the Nvidia card is not equipped with the same sort of innate context-switching priority at a hardware level. There was nothing to 'activate' beyond funnelling code to the multiple "lanes" on a GCN card, as it were. Besides bringing up the interesting point of Maxwell already being so efficient under DX11 that there weren't really any gains to realise, the article doesn't tell us anything that we hadn't surmised.

In other words I'm clearly not saying that Nvidia was 'wrong' or AMD was 'right' with their different architectural approaches, I'm simply saying that as async compute is defined in practice by DirectX 12, one architecture is compatible with that feature while the other requires some software implementation. I'm not saying that there is any gain to be realised by Nvidia here, and as their products already perform at a very high level it's not even clear that any gain is necessary. I agree with the article that Nvidia probably has a way to interact with async compute as utilised under DirectX 12 but it's clearly not the same way as AMD's.
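To make the distinction concrete: at the API level, DX12 async compute is nothing more than a second command queue plus fence synchronisation, and how aggressively the GPU interleaves the two queues is precisely where GCN and Maxwell differ. A rough C++ sketch under that assumption, with all objects assumed to have been created elsewhere and the names made up for illustration:

Code:

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Submit recorded graphics work, then let the compute queue pick up a dependent
// compute workload without blocking the CPU. Whether the GPU then overlaps the
// compute work with later graphics submissions is hardware/driver dependent.
void SubmitAsync(ID3D12Device* device,
                 ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12GraphicsCommandList* gfxList,
                 ID3D12GraphicsCommandList* computeList)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    ID3D12CommandList* gfx[] = { gfxList };
    graphicsQueue->ExecuteCommandLists(1, gfx);
    graphicsQueue->Signal(fence.Get(), 1);      // mark the point the compute work depends on

    computeQueue->Wait(fence.Get(), 1);         // GPU-side wait; the CPU is not stalled
    ID3D12CommandList* comp[] = { computeList };
    computeQueue->ExecuteCommandLists(1, comp); // free to overlap with whatever the graphics queue does next
}
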
 

Registered · 22 Posts
Quote:
Originally Posted by SuperZan View Post

Pardon, it's capable of executing parallel workloads, but that's not quite the same thing; the two companies have different definitions, and that's crucial to the argument.
"When AMD and Nvidia talk about supporting asynchronous compute, they aren't talking about the same hardware capability. The Asynchronous Command Engines in AMD's GPUs (between 2-8 depending on which card you own) are capable of executing new workloads at latencies as low as a single cycle. A high-end AMD card has eight ACEs and each ACE has eight queues. Maxwell, in contrast, has two pipelines, one of which is a high-priority graphics pipeline. The other has a a queue depth of 31 - but Nvidia can't switch contexts anywhere near as quickly as AMD can."

"According to a talk given at GDC 2015, there are restrictions on Nvidia's preeemption capabilities. Additional text below the slide explains that "the GPU can only switch contexts at draw call boundaries" and "On future GPUs, we're working to enable finer-grained preemption, but that's still a long way off." To explore the various capabilities of Maxwell and GCN, users at Beyond3D and Overclock.net have used an asynchronous compute tests that evaluated the capability on both AMD and Nvidia hardware. The benchmark has been revised multiple times over the week, so early results aren't comparable to the data we've seen in later runs."

Stating that the software has to allow execution of more parallel workloads that more closely align with AMD's definition is essentially saying that the Nvidia card is not equipped with the same sort of innate context-switching priority at a hardware level. There was nothing to 'activate' beyond funnelling code to the multiple "lanes" on a GCN card, as it were. Besides bringing up the interesting point of Maxwell already being so efficient under DX11 that there weren't really any gains to realise, the article doesn't tell us anything that we hadn't surmised.
Yes, I've read that before and don't dispute the differences between those implementations at all.

I just didn't see anything in the article that "re-affirms nvidia products are not hardware-equipped for Async." In fact, it sounded like the nvidia guy insinuated that there were gains to be made with nvidia's architecture that they were developing for when their performance lead dwindles.
 

Performance is the bible · 7,134 Posts
The article doesn't really say much we did not already know, except Nvidia saying that overall performance is more important than any single feature. AMD is concentrating on async compute as its one trick pony for DX12, while Nvidia is looking at overall performance; as long as that comes out ahead, they see no need to concentrate on just one thing, even if it means less async compute.
 

Sunday League Jibber · 4,254 Posts
Quote:
Originally Posted by fruits View Post

Yes, I've read that before and don't dispute the differences between those implementations at all.

I just didn't see anything in the article that "re-affirms nvidia products are not hardware-equipped for Async." In fact, it sounded like the nvidia guy insinuated that there were gains to be made with nvidia's architecture that they were developing for when their performance lead dwindles.
Fair enough. I'll agree that nowhere in black and white did anyone stipulate that Nvidia is not hardware-equipped for Async. But I read it as the Nvidia fellow saying that software implementation at the driver level could enhance Maxwell's existing parallel capabilities, not that Maxwell cards have secret ACEs hidden within the architecture. In other words, I read it as confirming (as much as Nvidia ever will on proprietary tech) that any gains from Async Compute in DX12 will be software improvements to that existing parallel graphics structure. It's just that this structure wasn't designed specifically as an ACE. That's the distinction and that's what some people have been arguing about. I don't deny that Nvidia can realise some gain from Async Compute through optimisation, but that does not a specific ACE make.
 

Registered · 466 Posts
They can't do DX12 async compute, that's pretty clear now; guess we'll see when AotS drops whether Nvidia suddenly flips the on switch. Trying to do it in software when, as he said, they are already at peak hardware utilization isn't going to work if it means taking compute resources from elsewhere, plus whatever additional latency that might incur. The Nvidia guy in the article is just saying the things someone who is considering switching brands over this might like to hear, just keeping them sweet.
 

Down and out, for now. · 943 Posts
Seems to suggest that software async compute will be a game-specific thing. Usually there aren't enough compute resources left over to bother with it, but there might be games where it will help in the future. I don't expect my 970 to be helped by async. It is running fine now, well, once Nvidia gets the current driver for The Division fixed.
 

Banned · 6,565 Posts
The difficulty is in proving that DirectX 11 is not as free from performance bubbles as Nvidia claims, imo.
TL;DR: it's fine that Hawaii catches up with the 980 Ti, but the key to DirectX 12's success will be Hawaii surpassing the 980 Ti's DirectX 11 performance. So far, that has not been established.
 

Registered · 1,077 Posts
Quote:
Originally Posted by fruits View Post

Yes, I've read that before and don't dispute the differences between those implementations at all.

I just didn't see anything in the article that "re-affirms nvidia products are not hardware-equipped for Async." In fact, it sounded like the nvidia guy insinuated that there were gains to be made with nvidia's architecture that they were developing for when their performance lead dwindles.
They probably don't want to confirm that there is no hardware support for async, because it's a DX12 feature and they sold cards advertised as having full hardware support for DX12 before the whole async discussion started.

People who upgraded to be DX12 hardware ready and expected a performance gain once DX12 was implemented are not happy, and the question arises whether they have been misled by false advertising. Tbh it's an even bigger issue than the 3.5GB thing.
 

Politically incorrect · 9,292 Posts
Quote:
Originally Posted by jezzer View Post

They probably don't want to confirm that there is no hardware support for async, because it's a DX12 feature and they sold cards advertised as having full hardware support for DX12 before the whole async discussion started.

People who upgraded to be DX12 hardware ready and expected a performance gain once DX12 was implemented are not happy, and the question arises whether they have been misled by false advertising. Tbh it's an even bigger issue than the 3.5GB thing.
This is basically a summary of the article, which is nothing more than damage control.
 

Registered · 706 Posts
I guess what they're saying is that as with most things in life, there are multiple ways of achieving an objective or tackling a problem as long as the end result is the same as the competition's or better.
 

Registered · 9 Posts
Mahigan wrote some nice posts on a different forum that are valid (I'd say) here as well:

http://forums.anandtech.com/showpost.php?p=38117833&postcount=74
Quote:
As for Kepler, GCN and Maxwell...

It has to do with compute utilization...

Just like Kepler's SMX, each one of Maxwell's SMMs has four warp schedulers, but what's changed between SMX and SMM is that each SMM's CUDA cores are assigned to a particular scheduler. So there are fewer shared units. This simplifies scheduling, as each of SMM's warp schedulers issues to a dedicated set of CUDA cores equal to the warp width (warps are 32 threads wide and each scheduler issues its warps to 32 CUDA cores). You can still dual issue, like with Kepler, but a single issue would result in full CUDA core utilisation. This means that you have fewer idling CUDA cores. (There's also the dedicated 64KB per SM of Maxwell over Kepler's 16KB + 48KB design.)

Why is this important? Because console titles are being optimized for GCN. Optimizing for GCN means using wavefronts (not warps). Wavefronts are 64 threads wide (mapping directly to two warps). Since a Maxwell SMM is composed of 4x32 CUDA core partitions, that means a wavefront would occupy two 32-CUDA-core partitions (half an SMM). With Kepler, you had 192 CUDA cores per SMX; try mapping wavefronts to that and you need 3 wavefronts. If you only have a single wavefront then you're utilizing 50% of a Maxwell SMM while only utilizing 33.3% of an SMX. That's a lot of unused compute resources.

With NVIDIAs architecture, only Kernels belonging to the same program can be executed on the same SM. So with SMX, that's 66.6% of compute resources not being utilized. That's a huge loss.

So what has happened is that:
1. The ratio of render:compute is tilting higher towards compute than a few years ago when Kepler was introduced.
2. The console effect is pushing developers into using more compute resources in order to extract as much performance as possible from the consoles GCN APUs.
3. Console titles are being optimized for GCN.


This has pushed GCN performance upwards, as ALL GCN-based GPUs (GCN1/2/3) utilize Compute Units which map directly to a wavefront (4 x 16 SIMD = 64).

The end result is higher compute utilization on GCN, less wasted resources on GCN, good utilization on Maxwell and very crappy utilisation on Kepler.

NVIDIA is evolving towards a more GCN-like architecture while AMD are refining GCN. GCN is more advanced than any NVIDIA architecture. People who claim that GCN is "old" simply don't understand GPU architectures.

It's the console effect.
http://forums.anandtech.com/showpost.php?p=38117913&postcount=79
Quote:
In games, compute shaders are used in order to run small programs (work items) on the GPU. When a GPU begins work on a work item, it does so by executing kernels (data-parallel programs); kernels are further broken down into work groups, and work groups are broken down into wavefronts (or warps).
...

So the work groups are segmented into wavefronts (GCN) or Warps (Kepler/Maxwell).

The programmer decides the size of the work group and how that work group is split up into smaller segments is up to the hardware to decide.

If a program is optimized for GCN then the work groups will be divisible in increments of 64 (matching a wavefront).

If a program is optimized for Kepler/Maxwell then the work groups will be divisible in increments of 32 (matching a warp).

Prior to the arrival of GCN-based consoles, developers would map their work groups in increments of 32. This left GCN compute units partially idle, with lanes going unused in every CU.

Your Octane renderer is probably a relic of that past. It is no longer relevant. Games are now arriving with GCN centric optimizations.

Under those scenarios, Kepler is under-utilized to a large degree. This is due to the way the CUDA cores in the SMXs were organized (192 CUDA cores per SMX). NVIDIA took notice of this and reduced the number of CUDA cores in each SM to 128 for Maxwell's SMM, segmenting those 128 CUDA cores into four groups of 32 CUDA cores (mapping directly to a warp).

So yes, how an application is optimized, written, largely determines performance.
http://forums.anandtech.com/showpost.php?p=38117972&postcount=84
Quote:
It's not relevant for measuring DirectCompute performance, which is what we're discussing when we discuss DX11/DX12 gaming.

When we discuss gaming, we're not discussing peak compute throughput or theoretical compute throughput. We're discussing GPU compute utilization, based on game-specific optimizations. You're right to say that Kepler can, if developers optimize their code for it, perform admirably. Sadly that's not the case due to the console effect.

Recent DX11 games have been starting to favor GCN due to the console effect. That being said GCN is API bound under DX11. DX12 allows for the removal of API overhead issues relating to GCN. This further boosts GCNs performance. Throw in Async Compute and you have another boost to GCN compute utilization.

The topic was discussing why Kepler has seemingly regressed relative to Maxwell and GCN in newer titles. That can be explained by compute utilization caused by GCN specific optimizations in the console arena making their way onto PC ports.
nVIDIA talking about already performing as well as the hardware allows in these circumstances seems a bit off, to say the least. If Kepler is underutilized by the way new games are programmed, and GCN was in the same position with older games, wouldn't a feature such as asynchronous compute allow all cards to be used efficiently even when the code is not an exact match for their best-case scenario? Wouldn't async, with just some minor work (as Dan Baker said about his engine), allow flagging the required code here and there so that all cards perform as well as their hardware allows (assuming, of course, that the async hardware is actually implemented in the GPU)? I'm just asking this because async seems to nullify (or at least minimize) possible bottlenecks not only in the API itself but also in the way engines are written for each platform or game; see the Kepler vs. Maxwell vs. GCN discussion above.
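To put rough numbers on the utilization argument in the posts quoted above, here is a small, self-contained C++ sketch that simply re-does that arithmetic (it deliberately ignores the fact that a real SM keeps many warps resident at once; the core counts and lane widths are the ones quoted, not measurements):

Code:

#include <cstdio>

// Lane widths and per-SM core counts as quoted in the posts above.
constexpr int kWavefront = 64;   // GCN wavefront width
constexpr int kWarp      = 32;   // NVIDIA warp width
constexpr int kSmxCores  = 192;  // Kepler SMX
constexpr int kSmmCores  = 128;  // Maxwell SMM (4 x 32-core partitions)
constexpr int kCuLanes   = 64;   // GCN compute unit (4 x 16-wide SIMDs)

// Fraction of one SM/CU occupied by a single workgroup, in percent.
double occupancy(int workgroupSize, int lanesPerSm) {
    return 100.0 * workgroupSize / lanesPerSm;
}

int main() {
    // A console-style 64-thread workgroup (one wavefront):
    std::printf("GCN CU      : %.1f%%\n", occupancy(kWavefront, kCuLanes));   // 100.0
    std::printf("Maxwell SMM : %.1f%%\n", occupancy(kWavefront, kSmmCores));  //  50.0
    std::printf("Kepler SMX  : %.1f%%\n", occupancy(kWavefront, kSmxCores));  //  33.3

    // The workgroup-size point: sizes divisible by 64 map cleanly onto both a
    // GCN wavefront and a whole number of 32-wide warps.
    const int sizes[] = {32, 48, 64};
    for (int wg : sizes) {
        std::printf("workgroup %2d -> %d warp(s), %s\n",
                    wg, (wg + kWarp - 1) / kWarp,
                    wg % kWavefront == 0 ? "exactly one wavefront"
                                         : "a partially filled wavefront");
    }
    return 0;
}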
 