DirectX 12: Asynchronous Compute (An exercise in Crowd-sourcing) - Page 22

post #211 of 252
Quote:
Originally Posted by Mahigan View Post

I think AMD went a step further than splitting each CU into two groups of 32 ALUs. I think that AMD did one of two things (or both) for Polaris...

1. Each CU retains 4 groups of 16 ALUs but each ALU can be individually power gated. Meaning that unused ALUs are powered down and used ALUs are boosted. So while a Polaris GPU may have a clock speed of, say, 1 GHz, individual ALUs will have the capability of boosting to, say, 1.8 GHz. The power savings from the shut-down ALUs allow for the higher clock speed of the active ALUs.

2. AMD will split each CU into 4 groups of ALUs of differing sizes. The first group may have 2 ALUs, the second 4 and the third 8. Each group can support concurrent wavefronts (like hyperthreading), basically executing multiple workloads at once. The power gating remains as described in point 1.

That's what I figure AMD have done based on recent patent filings.
And more cache on the LDS?
post #212 of 252
Thread Starter 
I don't think the local CU caches will change, and neither will the LDS or GDS, but the L2 cache will get a healthy increase. Increasing the cache sizes takes die area away from implementing more ROPs, CUs, etc., so there's always a delicate balance that must be maintained.
post #213 of 252
Quote:
Originally Posted by Mahigan View Post

I think AMD went a step further than splitting each CU into two groups of 32 ALUs. I think that AMD did one of two things (or both) for Polaris...

1. Each CU retains 4 groups of 16 ALUs but each ALU can be individually power gated. Meaning that unused ALUs are powered down and used ALUs are boosted. So while a Polaris GPU may have a clock speed of, say, 1 GHz, individual ALUs will have the capability of boosting to, say, 1.8 GHz. The power savings from the shut-down ALUs allow for the higher clock speed of the active ALUs.

2. AMD will split each CU into 4 groups of ALUs of differing sizes. The first group may have 2 ALUs, the second 4 and the third 8. Each group can support concurrent wavefronts (like hyperthreading), basically executing multiple workloads at once. The power gating remains as described in point 1.

That's what I figure AMD have done based on recent patent filings.

Reading the patent suggests that it's #1, and with there being a secondary scalar unit in the diagram it's possible that when only one of the 16 ALUs in a cluster is active it can run at 4x speed (i.e. ~4 GHz).

Interesting stuff.
Edited by Paul17041993 - 4/25/16 at 4:16pm
   
post #214 of 252
I think AMD has laid the groundwork to destroy NVIDIA. Taking over the consoles has established their low-level API with game developers. DX12 is just a low-level API. DX12 should also support multi-GPU natively. I see the future top-end cards being small-die multi-chip cards. Small dies keep their cost low, and if they push CrossFire technology they should lose no performance on multi-chip cards.

Thanks for the great read in this thread.
post #215 of 252
Quote:
Originally Posted by Paul17041993 View Post

Reading the patent suggests that it's #1, and with there being a secondary scalar unit in the diagram it's possible that when only one of the 16 ALUs in a cluster is active it can run at 4x speed (i.e. ~4 GHz).

Interesting stuff.
How would a second scalar unit allow a higher frequency in the vector SIMDs?
post #216 of 252
Quote:
Originally Posted by PontiacGTX View Post

How would a second scalar unit allow a higher frequency in the vector SIMDs?

The diagram indicates that the first ALU cluster was allocated as a scalar unit, hence a second scalar unit and only 3 remaining ALU clusters (set to 2, 4 and 8 respectively).

http://www.freepatentsonline.com/20160085551.pdf

I also just noticed that page 10, [0030] mentions the CU structure could be either fixed in the physical design or dynamically allocated. So the next GCN could possibly follow a hybrid-style shader structure (i.e. not just a simple shader count) that varies between models...
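To make the 2/4/8 idea a bit more concrete, here is a rough Python sketch of how a scheduler might pick a narrower ALU group for a wavefront with only a few active lanes. Everything in it is an assumption for illustration: the group widths come from the description above, while the selection rule (narrowest group that still finishes within the 4 cycles a 16-wide SIMD needs for a 64-wide wavefront) and the names `cycles` and `pick_unit` are hypothetical and not taken from the patent.

```python
import math

WAVEFRONT_SIZE = 64     # GCN wavefront width
BASELINE_WIDTH = 16     # width of a current GCN SIMD
UNIT_WIDTHS = (2, 4, 8) # group widths as described in the post above

# A 16-wide SIMD issues a 64-wide wavefront over 4 cycles; use that as the budget.
CYCLE_BUDGET = WAVEFRONT_SIZE // BASELINE_WIDTH


def cycles(active_lanes, width):
    """Cycles per instruction, ASSUMING only active lanes occupy ALU slots."""
    return max(1, math.ceil(active_lanes / width))


def pick_unit(active_lanes, budget=CYCLE_BUDGET):
    """Hypothetical policy: narrowest group that stays within the cycle budget."""
    for width in UNIT_WIDTHS:
        if cycles(active_lanes, width) <= budget:
            return width
    return UNIT_WIDTHS[-1]  # nothing fits the budget: fall back to the widest group


for lanes in (1, 5, 12, 30, 64):
    width = pick_unit(lanes)
    print(f"{lanes:2d} active lanes -> {width}-wide group, {cycles(lanes, width)} cycle(s)")
```

Under these assumptions a mostly idle wavefront lands on a small group, so the wider groups could be power gated or given to another wavefront, which is the kind of saving being speculated about here.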
   
post #217 of 252
http://patents.justia.com/patent/9317296 I'm not that well versed in this, but if they can interrupt a mask, run an override mask momentarily and then resume where they were interrupted, I believe this would be a good thing for branch prediction, CPU or compiler culling of triangles, etc.
Quote:
The provided method and storage medium have several beneficial attributes that promote increased performance of single program multiple thread code on SIMD hardware. For example, higher utilization of the SIMD hardware may be achieved. Furthermore, string comparison and other Standard Template Library (STL) like services within branchy code are improved and software prefetching performance in branchy code is improved. Furthermore, the impact of memory divergence on performance is reduced because workgroups are able to coordinate accesses instead of operating in separate logical execution streams. Additionally, permitting programmers to write more convergent code may improve power efficiency.
Quote:
The execution mask of the SIMD array 121 is overridden at block 216. For example, software code may be generated to override the execution mask. Overriding the execution mask enables certain lanes 123 of the SIMD array 121. For example, an instruction may be included to set or clear a bit of the execution mask that indicates whether the lane associated with the bit will execute the current instruction. When the override portion of the code has completed, the execution mask may revert back to the status of the execution mask when the override portion was entered. Accordingly, a programmer may effectively take control of all of the execution resources of the machine when the programmer knows that the parallel nature of the hardware would improve execution of the software.

In some embodiments, a compiler inserts code to perform the operations 200. In some embodiments, the high level application programmer may insert _override_exec_mask_(0xFFFF) to override the execution mask for the lower 16 work items of the workgroup. In some embodiments, the high level language programmer may alter the code of Table 1 to resemble the code presented in Table 2.

TABLE 2
void kernel_begin(int N, char* str1, char* str2) {
    if ( threadIdx < N ) {
        _override_exec_mask_ {
            do_string_compare(str1, str2);
        }
    }
}
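For anyone who wants to see the save/override/restore behaviour from the quote in action, here is a small Python simulation. It is only a toy model under stated assumptions: the `SimdArray` class, its `branch` and `override_exec_mask` helpers and the 16-lane width are hypothetical names and choices made for this sketch, not anything from the patent or from real GCN hardware.

```python
from contextlib import contextmanager

WAVE_WIDTH = 16  # assumption: matches the "lower 16 work items" in the quote


class SimdArray:
    """Toy SIMD array whose lanes are enabled by the bits of an execution mask."""

    def __init__(self, width=WAVE_WIDTH):
        self.width = width
        self.exec_mask = (1 << width) - 1  # all lanes enabled at start

    def active_lanes(self):
        return [lane for lane in range(self.width) if (self.exec_mask >> lane) & 1]

    @contextmanager
    def branch(self, predicate):
        """Divergent 'if': lanes failing the predicate are masked off in the body."""
        saved = self.exec_mask
        for lane in range(self.width):
            if not predicate(lane):
                self.exec_mask &= ~(1 << lane)
        try:
            yield
        finally:
            self.exec_mask = saved  # reconverge after the branch

    @contextmanager
    def override_exec_mask(self, mask):
        """Force the given lanes on, then revert to the mask seen on entry."""
        saved = self.exec_mask
        self.exec_mask = mask & ((1 << self.width) - 1)
        try:
            yield
        finally:
            self.exec_mask = saved


def kernel_begin(simd, n):
    """Mirrors the shape of Table 2 and records which lanes run at each point."""
    trace = {}
    with simd.branch(lambda lane: lane < n):       # if (threadIdx < N)
        trace["in_branch"] = simd.active_lanes()
        with simd.override_exec_mask(0xFFFF):      # _override_exec_mask_ { ... }
            trace["in_override"] = simd.active_lanes()
        trace["after_override"] = simd.active_lanes()
    trace["after_branch"] = simd.active_lanes()
    return trace


trace = kernel_begin(SimdArray(), n=3)
for step, lanes in trace.items():
    print(step, lanes)
```

With n=3 only lanes 0 to 2 take the branch, all 16 lanes run inside the override block (so the string compare could use the whole SIMD), and the mask drops back to lanes 0 to 2 once the override ends.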
post #218 of 252
Maybe dynamic allocation is an indicator of an FPGA? So they adapt to every different workload in milliseconds?
post #219 of 252


Around 24 minutes in.
post #220 of 252
Hi everyone.
I'm a long-time reader here, and because of this async stuff I decided about 6 months ago to buy a Fury X instead of a 980 Ti, which I thought would be best for my 2560x1080 screen. I'm not the typical enthusiast gamer, but I like tech, and in my opinion AMD has the more advanced technology. That's the reason I went for a GCN GPU: the tech inside the hardware is what fascinates me, not necessarily the output on the screen. Not complaining about performance though.

One question I do have, which is the reason I finally signed up, even though it doesn't have much to do with DX12: it's widely assumed that GCN, especially the Fury series, is bottlenecked by ROPs and/or geometry. But why does the gap to the big Maxwell cards close as the resolution goes up? Doesn't resolution stress pixel throughput? If not, what does (other than tessellation)? Can someone explain pixel throughput, geometry and rasterization, and how they relate to each other? What part does memory bandwidth play?

I read somewhere an assumption that GCN might be memory-bandwidth bottlenecked, and so is big Maxwell, which would be why performance evens out at high resolutions. Might that be true?

Thanks in advance for a more in-depth reply; even though I'm not a tech expert, I like reading these topics to understand (my) hardware better.

Sorry if my English isn't perfect, it's not my native language.

Cheers,
Fungamer
Edited by FunGamer1 - 6/8/16 at 12:25am
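As a rough back-of-the-envelope for the question above, the Python sketch below compares how the pixel count grows with resolution against the theoretical ROP fill rate and memory bandwidth of the two cards. The spec figures (64 ROPs at 1050 MHz and 512 GB/s for the Fury X; 96 ROPs at a 1000 MHz base clock and roughly 336 GB/s for the 980 Ti) are quoted from memory of the public spec sheets and should be treated as assumptions, and the model ignores colour compression, caches, overdraw and boost clocks entirely.

```python
def pixel_count(width, height):
    """Pixels shaded and written per frame, ignoring overdraw."""
    return width * height


def peak_fill_rate_gpix(rops, clock_mhz):
    """Theoretical peak in Gpixels/s, assuming one pixel per ROP per clock."""
    return rops * clock_mhz / 1000.0


def bytes_per_pixel_at_peak(bandwidth_gbs, fill_gpix):
    """Memory bytes available per pixel if the ROPs ran flat out."""
    return bandwidth_gbs / fill_gpix


# Assumed spec-sheet figures; check them before relying on the output.
CARDS = {
    "Fury X": {"rops": 64, "clock_mhz": 1050, "bandwidth_gbs": 512},
    "980 Ti": {"rops": 96, "clock_mhz": 1000, "bandwidth_gbs": 336},
}

# Geometry work per frame stays roughly constant; pixel work scales with this ratio.
ratio = pixel_count(3840, 2160) / pixel_count(2560, 1080)
print(f"3840x2160 has {ratio:.1f}x the pixels of 2560x1080")

for name, spec in CARDS.items():
    fill = peak_fill_rate_gpix(spec["rops"], spec["clock_mhz"])
    per_pixel = bytes_per_pixel_at_peak(spec["bandwidth_gbs"], fill)
    print(f"{name}: {fill:.1f} Gpix/s peak fill, {per_pixel:.2f} bytes of bandwidth per pixel")
```

On these numbers the 980 Ti has the higher peak fill rate but fewer bytes of bandwidth behind each pixel, which would at least be consistent with the idea that the balance shifts toward bandwidth and shader throughput as resolution rises. It is a sketch, not a measurement.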