Choosing between DDR4 and HBM in memory-intensive applications
It has always been a battle to balance the performance of processors and the memory systems that provide their raw data and digest their results. As advanced semiconductor process technologies further concentrate computing power on individual die, the issue is becoming acute, especially in applications such as high-end graphics, high-performance computing (HPC), and some areas of networking.
The heart of the problem is that processor performance is growing at a rate that far outstrips memory performance, and the gap between the two widens every year.
Memory makers have been bridging this gap with successive generations of double data rate (DDR) memories, but their performance is limited by the signal integrity of the DDR parallel interface and the lack of an embedded clock as used in high-speed SerDes interfaces.
This leaves system designers with multiple memory issues to solve: bandwidth, latency, power consumption, capacity, and cost. The solutions for higher memory bandwidth are simple: go faster, go wider, or do both.
If you can improve latency and cut the energy consumed per bit transferred, that is a bonus.
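To put rough numbers on "faster" versus "wider", here is a back-of-the-envelope sketch. The configurations are illustrative assumptions (a single DDR4-3200 channel versus a single HBM2 stack running at 2 GT/s per pin), and the results are theoretical peak figures, not measured bandwidth:

```python
# Back-of-the-envelope peak bandwidth: bandwidth = bus width x transfer rate.
# Configurations below are illustrative, not vendor specifications.

def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_gtps: float) -> float:
    """Peak bandwidth in GB/s for a given bus width and per-pin transfer rate."""
    return bus_width_bits * transfer_rate_gtps / 8  # convert bits to bytes

# "Go faster": a single DDR4-3200 channel, 64 bits wide at 3.2 GT/s per pin.
ddr4_channel = peak_bandwidth_gbs(bus_width_bits=64, transfer_rate_gtps=3.2)

# "Go wider": one HBM2 stack, 1024 bits wide at a modest 2.0 GT/s per pin.
hbm2_stack = peak_bandwidth_gbs(bus_width_bits=1024, transfer_rate_gtps=2.0)

print(f"DDR4-3200 channel: {ddr4_channel:.1f} GB/s")  # ~25.6 GB/s
print(f"HBM2 stack:        {hbm2_stack:.1f} GB/s")    # ~256.0 GB/s
```

The wide-but-slower HBM2 interface delivers roughly ten times the bandwidth of the narrow-but-faster DDR4 channel, which is the essence of the "go wider" approach.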
HBM has enabled designers to concentrate large amounts of memory close to processors, with enough bandwidth between the two to redress the growing imbalance between processor and memory performance. This is proving attractive in HPC, parallel computing, data center accelerators, digital image and video processing, scientific computing, computer vision, and deep learning applications.
How does the connectivity of DDR and HBM compare? With DDR4 and DDR5, the DRAM die are packaged and mounted on small PCBs to form dual in-line memory modules (DIMMs), which are then connected to the motherboard through an edge connector. For HBM2 memory, the hierarchy begins with DRAM die, which are stacked and interconnected using through-silicon vias (TSVs), then connected to a base logic die, which in turn is connected to a 2.5D interposer, which is finally packaged and mounted on the motherboard.
The shorter paths between memory and CPU on HBM2 systems mean that they can run without termination and consume much less energy per bit transmitted than terminated DDR systems.
It can be useful to think of this in terms of picojoules per bit transferred, or equivalently milliwatts per gigabit per second. The same metrics can be used to express the power efficiency of the DDR or HBM PHY on the host SoC. The lower the picojoules-per-bit figure, the better. Energy efficiency is arguably just as important as latency, if not more so.
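As a sanity check on those units: 1 pJ per bit moved at 1 Gb/s works out to exactly 1 mW, so the two figures of merit are numerically interchangeable. The sketch below uses assumed, purely illustrative energy-per-bit values for a terminated DDR4 interface and an unterminated HBM2 PHY, not measured data:

```python
# Unit check: 1 pJ/bit * 1 Gb/s = 1e-12 J/bit * 1e9 bit/s = 1e-3 W = 1 mW.
# So pJ/bit and mW per Gb/s are numerically the same figure.

def interface_power_mw(energy_pj_per_bit: float, data_rate_gbps: float) -> float:
    """Interface power in mW for a given energy-per-bit and data rate (Gb/s)."""
    return energy_pj_per_bit * data_rate_gbps

# Assumed energy-per-bit figures for illustration only, each moving 2 Tb/s:
for name, pj_per_bit in [("DDR4, assumed 15 pJ/bit", 15.0),
                         ("HBM2, assumed 4 pJ/bit", 4.0)]:
    watts = interface_power_mw(pj_per_bit, 2000) / 1000
    print(f"{name}: {watts:.1f} W at 2 Tb/s")  # 30.0 W vs 8.0 W
```

At data-center bandwidths, a few picojoules per bit one way or the other quickly turns into tens of watts of interface power, which is why the unterminated, short-reach HBM2 links matter.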
To sum up this comparison: DDR4 memory subsystems are useful for creating large capacities with modest bandwidth, and the approach has room for improvement, since capacity can be increased by using 3D-stacked DRAMs and RDIMMs or LRDIMMs. HBM2, on the other hand, offers large bandwidth with low capacity. Both the capacity and bandwidth can be improved by adding channels, but there is no option for moving to a DIMM-style approach, and the approach already uses 3D-stacked die. A comparison of present (DDR4, HBM2, GDDR5 and LPDDR4) and future (DDR5, LPDDR5) DRAM features is presented in Figure 6.
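To illustrate the capacity-versus-bandwidth trade summarized above, here is a rough scaling sketch. The per-DIMM and per-stack figures are assumptions chosen for illustration (32 GB RDIMMs, 8 GB HBM2 stacks), not limits imposed by the respective standards:

```python
# Rough scaling sketch: DDR4 grows capacity by adding DIMMs per channel,
# while HBM2 grows bandwidth and capacity together by adding stacks.
# All per-unit figures below are illustrative assumptions.

DDR4_DIMM_GB = 32          # assumed RDIMM capacity
DDR4_CHANNEL_GBS = 25.6    # DDR4-3200, 64-bit channel
HBM2_STACK_GB = 8          # assumed 8-high HBM2 stack capacity
HBM2_STACK_GBS = 256       # 1024-bit stack at 2 GT/s per pin

def ddr4_system(channels: int, dimms_per_channel: int) -> tuple[float, float]:
    """Return (capacity in GB, peak bandwidth in GB/s) for a DDR4 configuration."""
    return (channels * dimms_per_channel * DDR4_DIMM_GB,
            channels * DDR4_CHANNEL_GBS)

def hbm2_system(stacks: int) -> tuple[float, float]:
    """Return (capacity in GB, peak bandwidth in GB/s) for an HBM2 configuration."""
    return (stacks * HBM2_STACK_GB, stacks * HBM2_STACK_GBS)

print(ddr4_system(channels=4, dimms_per_channel=2))  # (256 GB, ~102 GB/s)
print(hbm2_system(stacks=4))                          # (32 GB, 1024 GB/s)
```

Under these assumptions, a four-channel DDR4 system offers roughly eight times the capacity, while four HBM2 stacks offer roughly ten times the bandwidth, which is the trade-off Figure 6 lays out across the DRAM families.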