[PCGamesN]Patents show AMD’s post-Navi GPU could use an Nvidia-like architecture

post #21 of 28 (permalink) Old 02-10-2019, 09:15 AM
Null
 
geoxile's Avatar
 
Join Date: Jul 2010
Posts: 6,314
Rep: 159 (Unique: 129)
Must be Raja's obsession with VLIW

geoxile is offline  
post #22 of 28 (permalink) Old 02-10-2019, 10:46 AM
New to Overclock.net
 
AmericanLoco's Avatar
 
Join Date: Mar 2015
Posts: 982
Rep: 112 (Unique: 70)
Quote: Originally Posted by Defoler View Post
If you can't beat them, copy them?

If this is post-Navi, I expect we won't see it before 2021. Navi (and most likely its refresh) will dominate AMD's sales from this summer through next year.
If anything, it sounds like it's going back to AMD's roots. AMD's VLIW4 and VLIW5 architectures were extremely efficient, small-die parts that punched well above their weight (once they got past the early teething issues with the HD 2000 and 3000 series). VLIW requires your drivers to be on point to perform well, however.

AMD/ATi have never really done well with large die parts. Hawaii was probably the largest die they pulled off, and it took several months of driver tweaks before it was really performing well.
AmericanLoco is offline  
post #23 of 28 (permalink) Old 02-10-2019, 08:48 PM
New to Overclock.net
 
white owl's Avatar
 
Join Date: Apr 2015
Location: The land of Nod
Posts: 5,374
Rep: 136 (Unique: 103)
Quote: Originally Posted by rluker5 View Post
I agree.
The use of the word "cores" in reference to a GPU is misleading when you compare it to a CPU. Cores in a GPU are more like instructions per cycle in a CPU. If a game is made in a way that its frames can't be split up and run well, then it can't be used in a "dual-core CPU" fashion; it then needs a powerful single-threaded GPU with more instructions per cycle.
Just like some poorly made games are single-threaded on the CPU while the good ones are multi-threaded, many newer games are intentionally made with post-processing stuff like TAA that only works in a single-threaded GPU scenario.
Running Skyrim on Threadripper won't make it a multithreaded game any more than running AC Odyssey with SLI 1080 Tis will make it mGPU-threaded, and changing the physical arrangement of the GPU parts won't change that. Any "smart" chiplet would more or less act like an independent GPU. You could make the chiplet responsible for less, but what if a game doesn't work with that segmentation?
Unless you are thinking of just having "dumb" cores sitting out there on separate dies, completely controlled by a main scheduler. But the latency that would add is big; I imagine a 2 GHz GPU needs instructions pretty frequently. A much smaller latency change was made going from Kepler to Maxwell, where the L2 cache was used more and VRAM less, and per-core, per-clock performance went up by 40%. A similar scale of latency change is when the GPU runs out of VRAM and has to use system RAM.

Like you said, the games have to be made compatible to the chiplet method, just like the mgpu method or the multicore cpu method, not the other way around.
I don't think there would be a large issue with current games running on chiplet GPUs; the driver is how the OS/app communicates with the GPU, so the main work would be in the driver and firmware. The goal of a chiplet design is for both pieces of silicon to be treated as one component. Latency between the dies is an issue, but that's a different conversation.
New games would likely benefit from having the hardware to work with so developers can optimize for the architecture/design, but I don't think there's a problem with older games running on them. Take Threadripper, for example: right off the bat, any application that could benefit from multi-threading could take advantage of the extra dies. There was optimization afterward, but it's on the GPU/CPU maker to ship a driver that communicates with the game/app in such a way that the instructions go where they should. Software has to come later, since you can't optimize for something that doesn't exist yet lol.
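
To make the "treated as one component" idea a bit more concrete, here's a purely conceptual toy sketch in Python (not any real driver API or vendor code): a single submit() call fans the work out across two dies behind the scenes, which is roughly the abstraction the driver/firmware would have to provide.

```python
# Conceptual toy only: shows the *abstraction* of a driver presenting two
# chiplet dies as one logical GPU. Names and the splitting policy are made up.

class Die:
    """Stand-in for one chiplet die executing shader workgroups."""
    def __init__(self, name):
        self.name = name

    def execute(self, workgroups):
        # Pretend to run the workgroups and return their results.
        return [f"{self.name} ran {wg}" for wg in workgroups]


class ChipletGPUDriver:
    """What the OS/app sees: a single device with one submit() entry point."""
    def __init__(self, dies):
        self.dies = dies

    def submit(self, workgroups):
        # Naive static round-robin split; a real driver/firmware would balance
        # by load and keep caches/memory coherent across the dies.
        results = []
        for i, die in enumerate(self.dies):
            results.extend(die.execute(workgroups[i::len(self.dies)]))
        return results


if __name__ == "__main__":
    gpu = ChipletGPUDriver([Die("die0"), Die("die1")])
    for line in gpu.submit([f"workgroup{i}" for i in range(6)]):
        print(line)
```

The hard part, of course, is everything that comment hand-waves away: load balancing and keeping the dies coherent without the latency and power costs discussed further down the thread.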

And yes, cores on CPUs aren't comparable with those on a GPU, because what GPU makers market as "cores" aren't really cores. I think they're closer to FPUs than cores, because to be a core (by definition) it needs to have its own cache, among other things. It's a marketing thing IMO; calling them floating point units doesn't sound as good.

Quote: Originally Posted by SpeedyVT
If you're not doing extreme things to parts for the sake of extreme things regardless of the part you're not a real overclocker.
Quote: Originally Posted by doyll View Post
The key is generally not which brands are good but which specific products are. Motherboards and GPUs are perfect examples of companies having everything from golden to garbage function/quality.
Hot n Bothered
(12 items)
CPU
4790k 4.7Ghz
Motherboard
Asus Sabertooth Z97 MkII 2
GPU
EVGA GTX 1080 SC
RAM
16gb G.Skill Sniper 2400Mhz
Hard Drive
2x Kingston v300 120gb RAID 0
Hard Drive
WD Blue
Power Supply
Seasonic 620w M12 II EVO
Cooling
Cooler Master 212 Evo
Case
Corsair 450D
Operating System
Windows 10
Monitor
Nixeus EDG27
Other
I have pretty lights.
white owl is offline  
post #24 of 28 (permalink) Old 02-11-2019, 05:51 PM
Not a linux lobbyist
 
rluker5's Avatar
 
Join Date: Feb 2014
Location: Wisconsin
Posts: 1,729
Rep: 44 (Unique: 35)
Quote: Originally Posted by white owl View Post
I don't think there would be a large issue with current games running on chiplet GPUs; the driver is how the OS/app communicates with the GPU, so the main work would be in the driver and firmware. The goal of a chiplet design is for both pieces of silicon to be treated as one component. Latency between the dies is an issue, but that's a different conversation.
New games would likely benefit from having the hardware to work with so developers can optimize for the architecture/design, but I don't think there's a problem with older games running on them. Take Threadripper, for example: right off the bat, any application that could benefit from multi-threading could take advantage of the extra dies. There was optimization afterward, but it's on the GPU/CPU maker to ship a driver that communicates with the game/app in such a way that the instructions go where they should. Software has to come later, since you can't optimize for something that doesn't exist yet lol.

And yes, cores on CPUs aren't comparable with those on a GPU, because what GPU makers market as "cores" aren't really cores. I think they're closer to FPUs than cores, because to be a core (by definition) it needs to have its own cache, among other things. It's a marketing thing IMO; calling them floating point units doesn't sound as good.
You are squirming around on what exactly a chiplet is. If you are using Threadripper as an example, that is multiple complete dies with some communication method, much like multiple GPUs on a card connected with a PLX bridge. We already have that, and we know it takes a lot more than an OS/app communication method to get them to scale well. You need games to be written in a compatible fashion to use them.

If you are talking about straight "dumb" chiplet components controlled remotely by a master controller, that could look like one GPU. But GPUs do many, many simple things, and they need many, many instructions. The latency would kill that, if the power consumption from all the remote communication didn't melt it first. See the attached AnandTech chart.

Are all games written so that they can do without substantial amounts of information for any given frame for extended periods of time? Remember PhysX? That needed games to be written in a special way for it to work; it didn't just work across the board. The games would again have to be compatible, and that is what some hybrid of the previous two approaches would need.

If a future DX13 has mGPU support the way DX12 has multicore support, then chiplets may work. But I don't see a push for that yet; most people are happier than ever to completely abandon and disdain the idea of mGPU. Man, I wish I could get SLI working well in AC Odyssey.
Attached Thumbnail: IF Power EPYC.png (the AnandTech chart referenced above)
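
To put rough numbers on the latency point above: the 2 GHz clock comes from the quoted post, but the hop latencies below are assumed ballpark figures for illustration, not measurements of any real part.

```python
# Back-of-envelope: how many GPU cycles a single round trip costs at 2 GHz.
# The clock comes from the quoted post; the latencies are assumptions.

GPU_CLOCK_HZ = 2.0e9                 # "a 2 GHz GPU"
CYCLE_NS = 1e9 / GPU_CLOCK_HZ        # 0.5 ns per cycle

assumed_latency_ns = {
    "on-die cache hit (assumed)":     20,
    "cross-chiplet hop (assumed)":   100,
    "spill to system RAM (assumed)": 300,
}

for label, ns in assumed_latency_ns.items():
    print(f"{label:32s} ~{ns:3d} ns  = ~{ns / CYCLE_NS:4.0f} cycles stalled")
```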


L5
(17 items)
Lea2
(11 items)
L7
(11 items)
CPU
5775c
Motherboard
Maximus VII Hero
GPU
Aorus 1080ti Waterforce
RAM
16 Gb Gskill Trident @ 2400,cas10,1.575v
RAM
8 Gb Gskill Trident @ 2400,cas10,1.575v
Hard Drive
1Tb Team ssd
Hard Drive
seagate barracuda 3T
Hard Drive
Optane 900p 480G OS
Optical Drive
Asus BW-16D1HT
Power Supply
EVGA Supernova 1300 G2
Cooling
Cooler Master MasterLiquid Pro 120 (cpu)
Cooling
2 140mm case fans, 2 120mm
Case
Fractal Design R4 (no window)
Operating System
W10 64 pro
Monitor
panasonic TC-58AX800U
Audio
Focal Elear, Nova 40, 598se, HE4xx, DT990pro w b.boost earpads
Audio
SoundbasterX AE-5, onboard
CPU
4770k
Motherboard
Asus Z87 Deluxe
GPU
Fury Nitro
RAM
8Gb klevv urbane 2133
Hard Drive
ROG Raidr 240Gb pcie
Hard Drive
1Tb WD blue
Power Supply
Pc Power&Cooling silencer Mk2 950w
Cooling
Deepcool Lucifer V2
Case
DIYPC P48-W
Operating System
W10 64 pro
Monitor
40"tv
CPU
4980hq
Motherboard
Asus H81T/CSM
RAM
8Gb 1600 samsung
Hard Drive
Samsung 850 evo 120gb
Power Supply
Skyvast 90w brick for hp pavilion something
Cooling
SilverStone Tek Super Slim
Case
SilverStone Tek PT13B
Operating System
W10 64 pro
Monitor
24" samsung 1080p
Keyboard
Logitech K400+
Other
Intel wifi ac card and noname antennas
rluker5 is offline  
post #25 of 28 (permalink) Old 02-12-2019, 02:15 AM
Frequency is Megabytes
 
Seronx's Avatar
 
Join Date: Jun 2010
Posts: 2,996
Rep: 231 (Unique: 112)
Quote: Originally Posted by geoxile View Post
Must be Raja's obsession with VLIW
Super-SIMD is more EPIC or superscalar. Essentially, two non-dependent regular instructions fill a single VLIW2 bundle. The pairing can be done by the compiler (EPIC-like) or by the scheduler (OoO-like) if need be.

It is still Graphics Core Next. It is not going back to VLIW5/VLIW4.
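
As a toy illustration of that pairing idea (this is not AMD's compiler or the patent's exact scheme, just the general principle): greedily pack two adjacent, non-dependent ops into one two-wide bundle, and fall back to issuing alone when the second op depends on the first.

```python
# Toy VLIW2 packer: pairs two adjacent ops when the second neither reads the
# first's result (RAW) nor writes the same destination (WAW). Illustrative only.

from dataclasses import dataclass

@dataclass
class Op:
    name: str
    dst: str            # register written
    srcs: tuple         # registers read

def can_pair(a: Op, b: Op) -> bool:
    return a.dst not in b.srcs and a.dst != b.dst

def pack_vliw2(ops):
    bundles, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and can_pair(ops[i], ops[i + 1]):
            bundles.append((ops[i], ops[i + 1]))   # dual issue
            i += 2
        else:
            bundles.append((ops[i],))              # issue alone
            i += 1
    return bundles

if __name__ == "__main__":
    stream = [
        Op("v_mul", "v0", ("v1", "v2")),
        Op("v_add", "v3", ("v4", "v5")),   # independent of the mul -> pairs
        Op("v_add", "v6", ("v0", "v3")),   # needs both results -> issues alone
    ]
    for bundle in pack_vliw2(stream):
        print(" | ".join(op.name for op in bundle))
```

Doing that at compile time is the EPIC-style option; doing it in hardware as the instructions arrive is the scheduler-style one.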


Last edited by Seronx; 02-12-2019 at 02:24 AM.
Seronx is offline  
post #26 of 28 (permalink) Old 02-12-2019, 08:35 AM
New to Overclock.net
 
EniGma1987's Avatar
 
Join Date: Sep 2011
Posts: 6,303
Rep: 338 (Unique: 248)
Quote: Originally Posted by Hydroplane View Post
A new architecture would be great to see. GCN was great in 2012 but had really run out of steam by about 2015. Hopefully a chiplet design will be coming for the GPUs; it should not be too difficult considering GPUs are already massively parallel. This would increase yield (especially versus the massive-die Titan RTX / 2080 Ti) and allow for chip standardization across the lineup, similar to Ryzen, for better economies of scale. Chiplets would also help spread the heat from the small 7nm dies across a greater area.

I believe the main issue is bandwidth. Right now there are no real numbers for internal die bandwidth because they aren't needed, but die to VRAM we get 400+ gigabytes per second, and the GPU needs all of that bandwidth. Meanwhile, the absolute fastest, most insane interconnects, using every available lane, only reach as high as 200 GB/s; typically they are closer to 50 GB/s per die with normal lane counts.
If they use chiplets to create a big GPU with 2-4 "core dies" of 2-3k shader cores each, each die would need an interconnect taking up half the die area by itself just to reach the necessary 400-500 GB/s per die. Then they would need a "front-end die" that is essentially just the scheduler, some small cache, and a Polaris-sized section of nothing but interconnects to the core dies. The energy cost would be massive, and so would the actual die cost. Realistically, 7nm is the first node where we can even hope for such a design, with 3-5nm being where it is truly feasible. Then they have to work out whether each core die gets its own render back-end, and how to sync them up on the displayed image, or whether they include a "back-end die", which means adding all the interconnects for it and doubling the number of interconnects in the core dies again. Depending on what is decided, you would need memory controllers in the core dies, of course, but you might also need additional memory chips and controllers on the back-end die. It's quite complicated, and a huge chunk of the area would be wasted on just the interconnects needed to make it possible. They might even have to go to double-PCB cards, where the back card is just the power input and VRMs with a smallish heatsink, and the front PCB has the display outputs and all the dies and memory chips, since a design like this would need far more board space.
As it stands right now, we just don't have the necessary bandwidth in an interconnect. Nvidia is closest, with NVLink2 offering more bandwidth than PCI-E 5.0, but they would still have to include a full 16-lane NVLink2 in each core die and a 32-64 lane setup on the front-end section. The hardware exists, in the form of the NVSwitch they just introduced, but the cost of these parts is huge. If they actually integrated it into a gaming GPU, I would bet the price would be $2000-2500, and they really wouldn't even have that much profit on the sale. And just for reference, the NVSwitch chip, which is nothing but a front-end scheduler (to make each GPU look like one to the OS) and the interconnects, has a TDP of 100W by itself. It has enough bandwidth to feed two big dies' worth of internal chips, giving 450 GB/s to each die; maybe we could even get away with 300 GB/s each to three dies for gaming use. If we shrink that to 7nm, we could get the TDP down to *maybe* 75W, but that's just the front-end power draw. We may also need to double that for a back-end chip, plus the draw from each core die. So think about that power draw just from interconnects. Yeah...
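
For anyone who wants to sanity-check that, here is the back-of-envelope using the numbers above (100W NVSwitch TDP, 450 GB/s to each of two dies); everything derived from them is a rough estimate, not vendor data.

```python
# Rough arithmetic from the NVSwitch figures cited above (100 W TDP feeding
# two dies at 450 GB/s each). Derived numbers are ballpark estimates only.

switch_tdp_w = 100            # NVSwitch TDP cited above
dies = 2
per_die_gb_s = 450            # GB/s delivered to each die

total_gb_s = dies * per_die_gb_s                   # 900 GB/s through the switch
joules_per_gb = switch_tdp_w / total_gb_s          # ~0.11 J per GB moved
pj_per_bit = joules_per_gb * 1e12 / (1e9 * 8)      # ~14 pJ per bit

print(f"Implied link energy: ~{pj_per_bit:.0f} pJ/bit")

# Hypothetical 3-die gaming split at 300 GB/s each (same 900 GB/s aggregate),
# so roughly the same switch power before any process-node savings:
gaming_total_gb_s = 3 * 300
print(f"3 x 300 GB/s config: ~{gaming_total_gb_s * joules_per_gb:.0f} W of switch power")
```

Which is why, even with the 75W-at-7nm guess, interconnect power stays the elephant in the room once you add a back-end chip and the per-die links.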


Last edited by EniGma1987; 02-12-2019 at 10:47 AM.
EniGma1987 is online now  
post #27 of 28 (permalink) Old 02-12-2019, 07:00 PM
New to Overclock.net
 
guttheslayer's Avatar
 
Join Date: Apr 2015
Posts: 3,718
Rep: 110 (Unique: 65)
Quote: Originally Posted by EniGma1987 View Post
I believe the main issue is bandwidth. Right now there are no real numbers for internal die bandwidth because they aren't needed, but die to VRAM we get 400+ gigabytes per second, and the GPU needs all of that bandwidth. Meanwhile, the absolute fastest, most insane interconnects, using every available lane, only reach as high as 200 GB/s; typically they are closer to 50 GB/s per die with normal lane counts.
If they use chiplets to create a big GPU with 2-4 "core dies" of 2-3k shader cores each, each die would need an interconnect taking up half the die area by itself just to reach the necessary 400-500 GB/s per die. Then they would need a "front-end die" that is essentially just the scheduler, some small cache, and a Polaris-sized section of nothing but interconnects to the core dies. The energy cost would be massive, and so would the actual die cost. Realistically, 7nm is the first node where we can even hope for such a design, with 3-5nm being where it is truly feasible. Then they have to work out whether each core die gets its own render back-end, and how to sync them up on the displayed image, or whether they include a "back-end die", which means adding all the interconnects for it and doubling the number of interconnects in the core dies again. Depending on what is decided, you would need memory controllers in the core dies, of course, but you might also need additional memory chips and controllers on the back-end die. It's quite complicated, and a huge chunk of the area would be wasted on just the interconnects needed to make it possible. They might even have to go to double-PCB cards, where the back card is just the power input and VRMs with a smallish heatsink, and the front PCB has the display outputs and all the dies and memory chips, since a design like this would need far more board space.
As it stands right now, we just don't have the necessary bandwidth in an interconnect. Nvidia is closest, with NVLink2 offering more bandwidth than PCI-E 5.0, but they would still have to include a full 16-lane NVLink2 in each core die and a 32-64 lane setup on the front-end section. The hardware exists, in the form of the NVSwitch they just introduced, but the cost of these parts is huge. If they actually integrated it into a gaming GPU, I would bet the price would be $2000-2500, and they really wouldn't even have that much profit on the sale. And just for reference, the NVSwitch chip, which is nothing but a front-end scheduler (to make each GPU look like one to the OS) and the interconnects, has a TDP of 100W by itself. It has enough bandwidth to feed two big dies' worth of internal chips, giving 450 GB/s to each die; maybe we could even get away with 300 GB/s each to three dies for gaming use. If we shrink that to 7nm, we could get the TDP down to *maybe* 75W, but that's just the front-end power draw. We may also need to double that for a back-end chip, plus the draw from each core die. So think about that power draw just from interconnects. Yeah...

Long story short, you mean that a chiplet design is only possible with a post-7nm fabrication process?

guttheslayer is offline  
post #28 of 28 (permalink) Old 02-13-2019, 07:29 AM
New to Overclock.net
 
EniGma1987's Avatar
 
Join Date: Sep 2011
Posts: 6,303
Rep: 338 (Unique: 248)
Quote: Originally Posted by guttheslayer View Post
Long story short, you mean that a chiplet design is only possible with a post-7nm fabrication process?

Yep, pretty much. We know Nvidia was working on this sort of approach, but I suspect they realized how unfeasible it is, and that's why the product of their research was instead branded as its own product and is only used in deep learning servers. It simply wasn't feasible to stick one in each GPU just yet. I don't think it will be feasible until they can come up with an "NVLink3" with twice as much bandwidth per set of lanes and a total power draw for the front-end chip of 25W or less, so it probably won't happen until 5nm at the earliest.

EniGma1987 is online now  