
[computerbase.de] DOOM + Vulkan Benchmarked. - Page 39

post #381 of 632
Quote:
Originally Posted by CrazyElf View Post

Q. Mahigan, any other ideas of what else they've done for Vega to the CUs themselves?
A. I think that they may have implemented the power gating, which is why Vega appears to have higher perf/watt than Polaris, although that could simply be due to the inclusion of HBM2 memory.
Quote:
Originally Posted by CrazyElf View Post

Q. Do you think that they are still split into units of 2, 4, 8, and 16 then power-gated or something else?
A. AMD did patent this new approach, but for all we know it may only arrive with Navi (hence the scalability nomenclature).
Quote:
Originally Posted by CrazyElf View Post

Q. Any other ways to resolve the occupancy issues?
A. Not with the current GCN uarch as it stands... no.
Quote:
Originally Posted by CrazyElf View Post

Q. Hmm, depending on how Volta pans out, wouldn't Nvidia end up with similar challenges?
A. Yes. I also think that Volta will likely include hardware-side scheduling.
Quote:
Originally Posted by CrazyElf View Post

Q. They would have to make something like a Command Processor + ACE equivalent to get a truly "parallel" GPU on their end. Is there a way they could combine the best of both worlds?
A. Yes... by increasing the size of the instruction buffers found in each SM... but then again, most of the industry will have moved to DX12/Vulkan by then, so there would be no reason to combine the best of both worlds.
Quote:
Originally Posted by CrazyElf View Post

Q. Unless there is another way to do "parallel" that we don't know about?
A. Possible.
post #382 of 632
Even if you optimize everything for the specific target GPUs, each will have strengths and weaknesses depending on how it handles the tasks given to it.

We have seen this argument before with the now-ancient 8800 GTX vs. HD 2900 XT debate (SIMD vs. VLIW, respectively). You can't change the fact that the two approaches process data in intrinsically different ways.
post #383 of 632
Quote:
Originally Posted by Mahigan View Post

Umm...

It does not matter if Async is not enabled in the driver. That is not the argument. The argument is how Time Spy sends work to the GPU. You see, Time Spy has to "mark" every parallel task (Compute and Graphics) using fences (synchronization points). These fences remain in place even if the driver does not support Asynchronous Compute. They show up as a performance loss because the GPU resources handling a Graphics task will still "wait" on a long-running compute task. That wait means part of the GPU goes idle (idle as in not doing anything) while it waits for another task to finish.

For Maxwell... this leads to a performance loss. So with 3DMark Time Spy... where is this performance loss on Maxwell?

I mean, there are ways you can avert this performance loss...

1. You specifically optimize the code for nVIDIA hardware by ensuring that there are no long-running compute tasks.
2. You keep the level of Async Compute low (which results in fewer fences), which is perfectly suited to nVIDIA hardware.
3. You do not run Asynchronous Compute + Graphics but instead just Asynchronous Compute (concurrent executions filling gaps in the pipeline).

Either way... you just optimized everything for nVIDIA hardware, yet at the same time claim that creating separate per-vendor paths would be tantamount to an optimization war. So AMD hardware received no optimizations while the entire code is optimized for nVIDIA hardware.

That does not sound like an objective way of building a benchmark.

This is precisely why AMD, nVIDIA, and Microsoft stated that optimized paths are required for every GPU architecture under DX12. It was an understanding between all three parties, which also held that if you do not have the development time to afford such optimizations, then sticking to DX11 would be better.

I think that 3DMark should re-work this benchmark so that it properly represents the best-case scenario for both AMD and nVIDIA.

"The application doesn't know what the driver does with the queues. Switching async compute on/off in 3DMark Time Spy doesn't really change anything because Maxwell driver does the exact same thing in both cases, so any difference is within the error margins of the test."

From FM_Jarnis.
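To make the claimed cost concrete, here is a toy timeline model of my own (the millisecond figures a caller plugs in are invented, not Time Spy measurements): a fence that forces graphics to wait on a long-running compute task serializes the two passes, and the difference versus concurrent execution is pure added frame latency.

```cpp
#include <algorithm>
#include <cassert>

// Toy model of one frame containing a graphics pass and a compute pass.
// Purely illustrative arithmetic for fence-induced serialization.

// A fence that forces graphics to wait on compute runs the passes back to back.
double frameTimeSerialized(double graphicsMs, double computeMs) {
    return graphicsMs + computeMs;
}

// Hardware that executes both queues concurrently pays only the longer pass.
double frameTimeOverlapped(double graphicsMs, double computeMs) {
    return std::max(graphicsMs, computeMs);
}

// Extra per-frame latency introduced by serializing at the fence.
double extraLatencyFromFence(double graphicsMs, double computeMs) {
    return frameTimeSerialized(graphicsMs, computeMs) -
           frameTimeOverlapped(graphicsMs, computeMs);
}
```

Plugging in a hypothetical 5 ms graphics pass fenced against an 8 ms compute pass: 13 ms serialized versus 8 ms overlapped, i.e. 5 ms of added latency per frame. That added latency is exactly the Maxwell performance hit being asked about.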
post #384 of 632

Question: with Async ON for Maxwell, is it possible you get a massive performance hit if the driver says "No, I can't do it"?

post #385 of 632
Quote:
Originally Posted by Greenland View Post

"The application doesn't know what the driver does with the queues. Switching async compute on/off in 3DMark Time Spy doesn't really change anything because Maxwell driver does the exact same thing in both cases, so any difference is within the error margins of the test."

From FM_Jarnis.

So even if it's being asked to do something different, it does the same thing?

Sounds like BS to me.
post #386 of 632
Quote:
Originally Posted by Greenland View Post

"The application doesn't know what the driver does with the queues. Switching async compute on/off in 3DMark Time Spy doesn't really change anything because Maxwell driver does the exact same thing in both cases, so any difference is within the error margins of the test."

From FM_Jarnis.

I am not sure if people have asked about fences or not. The fences are the basis of my argument. It does not matter if the driver switches Async Compute on or off. 3DMark has two paths: one with Async Compute and one without. When you toggle Async on, the Async path is used. When you toggle Async off, the non-Async path is used.

The argument has to do with the fences in the Async path. Those fences would still incur a performance penalty regardless of what the nVIDIA Maxwell driver is doing. They cause tiny delays... the more fences, the more delays are introduced. The more delays, the lower your FPS due to the introduced latency.

What I do not see is the performance hit associated with those delays when running Time Spy on an nVIDIA Maxwell-based GPU. Those fences should be negatively affecting the performance of the Maxwell GPU, as they synchronize Graphics and Compute tasks. I have already mentioned ways in which this would not be the case, and those ways amount to coding the application to favor nVIDIA hardware (utilizing short-running shaders and very few Asynchronous Compute + Graphics workloads).

Since 3DMark apparently uses only a single path for both AMD and nVIDIA hardware, then if that path was coded to favor nVIDIA hardware, the benchmark is not objective. Why not? Because AMD hardware can take on even more Asynchronous Compute + Graphics workloads and gain even more performance than is currently being shown. If a separate path were coded for AMD, more tasks could be "marked" to run in parallel (more Graphics + Compute jobs), reducing per-frame latency (and thus raising FPS).

The main reason both AMD and nVIDIA (as well as Microsoft) stated that separate paths are the way to go for DX12 applications is precisely this issue. If a developer codes in favor of nVIDIA, then AMD will suffer bad performance, and vice versa. To avert this, both architectures ought to have their own optimized paths.

This should be mentioned to that 3DMark rep on steam.
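The "more fences, more delays" point can be sketched numerically. This is my own toy cost model — the per-fence overhead and frame times below are invented for illustration, not measured from Time Spy:

```cpp
#include <cassert>

// Hypothetical per-frame cost model: a base frame time plus a fixed
// synchronization overhead for every fence crossed. Both inputs are
// invented illustration numbers, not measurements of any real GPU.
double frameTimeMs(double baseMs, int fenceCount, double perFenceOverheadMs) {
    return baseMs + fenceCount * perFenceOverheadMs;
}

// Frames per second for a given frame time in milliseconds.
double fps(double frameMs) {
    return 1000.0 / frameMs;
}
```

With made-up numbers — say the non-Async path takes 16 ms, while the Async path cuts the base to 14 ms but crosses 20 fences at 0.05 ms each — the Async path still wins at 15 ms. Pile on fences without gaining overlap, though, and the relationship inverts; a fence-heavy path with little actual overlap is the penalty the post argues should be visible on Maxwell.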
post #387 of 632
Quote:
Originally Posted by Xuper View Post

Question: with Async ON for Maxwell, is it possible you get a massive performance hit if the driver says "No, I can't do it"?

Yes... because the "fences" in the game code are still there. The driver cannot magically remove them. Those fences are part of the API specification under DX12.
post #388 of 632
So where can I find these fences? Any documentation?
post #389 of 632
Quote:
Originally Posted by Mahigan View Post


A. I think that they may have implemented the power gating, which is why Vega appears to have higher perf/watt than Polaris, although that could simply be due to the inclusion of HBM2 memory.

A. AMD did patent this new approach, but for all we know it may only arrive with Navi (hence the scalability nomenclature).

A. Not with the current GCN uarch as it stands... no.

A. Yes. I also think that Volta will likely include hardware-side scheduling.

A. Yes... by increasing the size of the instruction buffers found in each SM... but then again, most of the industry will have moved to DX12/Vulkan by then, so there would be no reason to combine the best of both worlds.

A. Possible.
What power gating? Did I miss something?
post #390 of 632
Quote:
Originally Posted by Greenland View Post

So where can I find these fences, any documentation ?

https://msdn.microsoft.com/en-us/library/windows/desktop/dn899217(v=vs.85).aspx

Straight from Microsoft. I have supplied Microsoft code samples for Asynchronous Compute + Graphics as well as quotes from Kollock (Oxide developer) in the previous pages. Have a gander.
Quote:
void AsyncPipelinedComputeGraphics()
{
    const UINT CpuLatency = 3;
    const UINT ComputeGraphicsLatency = 2;

    // Compute is 0, graphics is 1
    ID3D12Fence *rgpFences[] = { pComputeFence, pGraphicsFence };
    HANDLE handles[2];
    handles[0] = CreateEvent(nullptr, FALSE, TRUE, nullptr);
    handles[1] = CreateEvent(nullptr, FALSE, TRUE, nullptr);
    UINT FrameNumbers[] = { 0, 0 };

    ID3D12GraphicsCommandList *rgpGraphicsCommandLists[CpuLatency];
    CreateGraphicsCommandLists(ARRAYSIZE(rgpGraphicsCommandLists),
        rgpGraphicsCommandLists);

    // Graphics needs to wait for the first compute frame to complete; this is
    // the only wait that the graphics queue will perform.
    pGraphicsQueue->Wait(pComputeFence, 1);

    while (1)
    {
        for (auto i = 0; i < 2; ++i)
        {
            if (FrameNumbers[i] > CpuLatency)
            {
                rgpFences[i]->SetEventOnCompletion(
                    FrameNumbers[i] - CpuLatency,
                    handles[i]);
            }
            else
            {
                SetEvent(handles[i]);
            }
        }

        auto WaitResult = WaitForMultipleObjects(2, handles, FALSE, INFINITE);
        auto Stage = WaitResult - WAIT_OBJECT_0;
        ++FrameNumbers[Stage];

        switch (Stage)
        {
        case 0:
        {
            if (FrameNumbers[Stage] > ComputeGraphicsLatency)
            {
                pComputeQueue->Wait(pGraphicsComputeFence,
                    FrameNumbers[Stage] - ComputeGraphicsLatency);
            }
            pComputeQueue->ExecuteCommandLists(1, &pComputeCommandList);
            pComputeQueue->Signal(pComputeFence, FrameNumbers[Stage]);
            break;
        }
        case 1:
        {
            // Recall that the GPU queue started with a wait for pComputeFence, 1
            UINT64 CompletedComputeFrames = min(1,
                pComputeFence->GetCompletedValue());
            UINT64 PipeBufferIndex =
                (CompletedComputeFrames - 1) % ComputeGraphicsLatency;
            UINT64 CommandListIndex = (FrameNumbers[Stage] - 1) % CpuLatency;
            // Update graphics command list based on CPU input and using the
            // appropriate buffer index for data produced by compute.
            UpdateGraphicsCommandList(PipeBufferIndex,
                rgpGraphicsCommandLists[CommandListIndex]);

            // Signal *before* new rendering to indicate what compute work
            // the graphics queue is DONE with
            pGraphicsQueue->Signal(pGraphicsComputeFence, CompletedComputeFrames - 1);
            pGraphicsQueue->ExecuteCommandLists(1,
                rgpGraphicsCommandLists + PipeBufferIndex);
            pGraphicsQueue->Signal(pGraphicsFence, FrameNumbers[Stage]);
            break;
        }
        }
    }
}

Basically, the Graphics GPCs in Maxwell would go idle until the Compute work is done and the fence signals that it is time to move on to other work.

This process introduces latency, and that latency applies to ALL hardware. AMD and nVIDIA hardware both incur the hit, but the benefits from Asynchronous Compute + Graphics more than make up for it.
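To see why the driver cannot make those waits disappear, it helps to reduce a fence to its counter semantics. This is a rough CPU-side sketch of the idea — my simplification, not the actual ID3D12Fence interface:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Simplified stand-in for the fence counter concept: a queue Signals a
// monotonically increasing value as work completes, and another queue's
// Wait(v) is satisfied once the completed value reaches v. Real fences
// advance on the GPU timeline; this toy only mirrors the counter logic.
struct ToyFence {
    uint64_t completed = 0;

    // Mark all work up to `value` as finished.
    void Signal(uint64_t value) {
        completed = std::max<uint64_t>(completed, value);
    }
    // Would a Wait(value) placed on another queue unblock yet?
    bool WaitSatisfied(uint64_t value) const {
        return completed >= value;
    }
};
```

The `pGraphicsQueue->Wait(pComputeFence, 1)` at the top of the sample above is exactly this: until the compute queue signals a value of at least 1, the graphics queue has nothing legal to do, and that is the idle time under discussion.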
Edited by Mahigan - 7/17/16 at 11:18am