Originally Posted by KyadCK
Right, so how do you plan to solve the PCI-e latency issue that you completely ignored?
Oh, and PCI-e can just barely provide enough bandwidth to even tie with RAM, let alone be faster. GDDR5's bandwidth does not apply over the 8 or 16 bit bus that is the GPU's connection to the system.
Not to mention overhead that has to be accounted for, the extra jumps it takes to get there (VRAM -> GPU -> NB -> HT -> CPU/NB -> Cache -> CPU vs RAM -> IMC -> Cache -> CPU), the speed of HyperTransport which cuts down the bandwidth even more, the fact that going over HT means waiting in line with everything else that needs to be talked to, and so on.
There is less VRAM than there is system RAM, even today, and GDDR5 already has worse latency even without having to jump through hoops to get there. The speed of GDDR5 VRAM is completely negated by the protocols needed to get it where it needs to go. If you need more speed, VRAM hurts.
Oh, and I was wrong, it's "Heter... Memory Access". One memory source for all things, not unified address space. We've had that for years now. Let's take a look: SOURCE
HUMA: Combined memory for the CPU and GPU.
HSA: Programming to make using the GPU half easier.
So, who still thinks trying to get the CPU to use a GPU's VRAM is a good idea, and can actually back up their statement with facts showing it wouldn't be a worse alternative to just using system RAM?
You either didn't read or didn't understand my post. I would encourage you to completely read the comments you choose to reply to in the future. If you don't understand what the other person is posting, I would encourage you to ask for clarification instead of writing some rant arguing against points the other person didn't even promote.
For example: I didn't ignore the PCI-E latency or bandwidth issues. If you had bothered to read the part where I wrote "memory managers could certainly be made smart enough to intelligently allocate space in a way that makes sense (i.e., a CPU thread allocates a chunk of available memory and that memory is chosen to be in system memory instead of GPU memory if available but bleeds over if not)," you would have understood that. Clearly, a program running on the CPU would allocate system memory if available. Likewise, a thread running on the GPU is going to want to allocate memory on the GPU. I almost didn't bother writing that sentence because it's so obvious, but I did just in case someone who wasn't very familiar with the technology happened to read it. Yet somehow, you still managed to miss the point and decided to rail against me when the point you think I missed happens to be the very point I made.
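To make the "allocate locally, bleed over if full" idea concrete, here is a rough Python sketch. The `Pool` and `LocalFirstAllocator` names and the tiny byte sizes are made up for illustration; this is a toy model of the policy, not how any real driver or OS memory manager is implemented.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical memory pool, e.g. system RAM or a GPU's VRAM."""
    name: str
    capacity: int  # bytes
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used


class LocalFirstAllocator:
    """Toy policy: prefer the requesting device's local pool, spill over if it is full."""

    def __init__(self, pools, local_pool_of):
        self.pools = pools                  # pool name -> Pool
        self.local_pool_of = local_pool_of  # device name -> name of its local pool

    def allocate(self, device: str, size: int) -> str:
        local = self.local_pool_of[device]
        # Try the local pool first, then the remaining pools in declaration order.
        order = [local] + [name for name in self.pools if name != local]
        for name in order:
            pool = self.pools[name]
            if pool.free() >= size:
                pool.used += size
                return name
        raise MemoryError(f"no pool can satisfy {size} bytes for {device}")


# Assumed sizes, chosen only so the third request has to spill over.
pools = {"system": Pool("system", 8), "vram": Pool("vram", 8)}
alloc = LocalFirstAllocator(pools, {"cpu": "system", "gpu": "vram"})

placements = [
    alloc.allocate("cpu", 6),  # fits in system RAM
    alloc.allocate("gpu", 3),  # fits in VRAM
    alloc.allocate("cpu", 3),  # only 2 bytes left in system RAM, so it bleeds into VRAM
]
print(placements)
```

The CPU only ends up in VRAM when system memory has run out, which is the whole point: the slow path is a fallback, not the default.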
Secondly: Like it or not, uniform memory access refers to a memory model for accessing the memory pools of multiple devices. UMA is uniform: all devices access any memory location with the same speed and latency (such as older SMP architectures where the memory controller was still in the Northbridge). NUMA is non-uniform: latency and bandwidth vary depending on which memory address you access (such as modern SMP architectures from AMD, where each CPU has its own memory bank and accesses that go over hops are slower). A unified address space is a big part of making these various pools of memory easy to access, and with HUMA the CPU and GPU can access each other's memory without any special commands. For instance, the CPU may read a texture from the GPU's memory and modify it, or the GPU may read a block of data directly from system memory, without the program necessarily even being aware of the underlying memory hardware. The same commands work for both. This is a big bonus for developers.
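A small Python model may help separate the two ideas (one address space versus uniform access cost). Everything here is hypothetical: the `UnifiedAddressSpace` class, the region sizes, and the relative cost numbers are invented for the sketch and do not describe real hardware.

```python
class UnifiedAddressSpace:
    """Toy flat address space backed by several pools with per-device access costs."""

    def __init__(self, regions):
        # regions: list of (pool name, size in bytes, {device: relative access cost})
        self.regions = []
        base = 0
        for name, size, cost in regions:
            self.regions.append((base, base + size, name, cost, bytearray(size)))
            base += size

    def _locate(self, addr):
        for start, end, name, cost, data in self.regions:
            if start <= addr < end:
                return cost, data, addr - start
        raise IndexError(f"address {addr} is not mapped")

    def read(self, device, addr):
        """Return (byte value, relative cost) using the same call for every device."""
        cost, data, offset = self._locate(addr)
        return data[offset], cost[device]

    def write(self, device, addr, value):
        """Store one byte and return the relative cost paid by this device."""
        cost, data, offset = self._locate(addr)
        data[offset] = value
        return cost[device]


# Made-up costs: local access costs 1, crossing to the other device's pool costs 10.
space = UnifiedAddressSpace([
    ("system", 16, {"cpu": 1, "gpu": 10}),   # addresses 0..15
    ("vram",   16, {"cpu": 10, "gpu": 1}),   # addresses 16..31
])

gpu_write_cost = space.write("gpu", 20, 0xAB)  # GPU writes into its own memory
value, cpu_read_cost = space.read("cpu", 20)   # CPU reads the same address directly
print(hex(value), gpu_write_cost, cpu_read_cost)
```

Both devices use the same `read` and `write` calls on the same addresses; only the cost differs depending on where the address lives, which is the non-uniform part.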
Your graphic serves to reinforce the point I originally implied. I will now be more verbose. The APU model shows a NUMA architecture where the GPU's memory is still logically segmented away from regular system memory even though physically it's the same pool. The HSA model using a HUMA architecture completely unifies access to the entire pool between both classes of devices--CPU and GPU alike. You could extend the memory addressing benefits to discrete cards as well, and if the memory management logic were updated accordingly it could still be of real benefit even if there are underlying physical differences among the various devices adding to the memory pool.
I think a big part of the mistake you're making is to assume anyone who wants to access GPU memory from the CPU wants to do so for speed reasons. It would certainly be folly to think that accessing a fast pool of memory will still be fast if you have to go through a slow link to get there. There would be little benefit to the CPU using GPU memory "just because", but there are many valid use cases where it can be faster to access GPU memory directly. The reason I would want it, as a developer, is the ease of working on the same data with different devices. If you're going to chew on a chunk of data a lot, you'd copy it over (say a piece of a texture from a GPU's memory to the CPU's memory), work on it (some CPU-based transform perhaps), and copy it back to where it needs to go (back to the GPU, since ultimately it's what is going to need it). If you're only going to do a very quick process on a chunk of data (like flipping a few bits inside a large chunk of memory), you'd be better served by simply doing the process directly if copying it twice would be slower. That is a clear use case that satisfies your request's conditions.
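The copy-versus-direct tradeoff can be put into a back-of-the-envelope cost model. This is only a sketch: the function names are mine, and the bandwidth and latency figures (8 GB/s link, 20 GB/s local memory, 1 microsecond per remote access) are assumed round numbers, not measurements of any real PCI-E or HyperTransport setup.

```python
def copy_cost(block_bytes, touched_bytes, link_bw, local_bw, link_latency):
    """Seconds to copy the block over the link, work on it locally, and copy it back."""
    transfer = 2 * (link_latency + block_bytes / link_bw)
    return transfer + touched_bytes / local_bw


def in_place_cost(touched_bytes, accesses, link_bw, link_latency):
    """Seconds to touch the remote bytes directly, paying link latency per access."""
    return accesses * link_latency + touched_bytes / link_bw


def cheaper(block_bytes, touched_bytes, accesses,
            link_bw=8e9, local_bw=20e9, link_latency=1e-6):
    """Return which strategy the model predicts is faster: 'in_place' or 'copy'."""
    copied = copy_cost(block_bytes, touched_bytes, link_bw, local_bw, link_latency)
    direct = in_place_cost(touched_bytes, accesses, link_bw, link_latency)
    return "in_place" if direct < copied else "copy"


BLOCK = 64 * 1024 * 1024  # a 64 MiB chunk sitting in the GPU's memory

# Flip a few bits: 64 bytes touched in 8 separate remote accesses.
quick_fix = cheaper(BLOCK, touched_bytes=64, accesses=8)

# Chew on the data a lot: 100 passes over the block, one remote access per 64 bytes.
heavy_work = cheaper(BLOCK, touched_bytes=100 * BLOCK, accesses=100 * BLOCK // 64)

print(quick_fix, heavy_work)
```

With these assumed numbers the small edit is better done directly and the heavy transform is better done on a local copy. Real crossover points depend on the actual link, access pattern, and caching, so treat this as an illustration of the reasoning only.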