Itanium was originally conceived in the early 1990s by the architects and engineers who had worked on HP’s PA-RISC. Many of them were convinced that dynamic instruction scheduling and out-of-order execution would ultimately prove to be too complex and power hungry. They believed that single threaded performance would not scale in the future. It is certainly true that many of the circuits in out-of-order designs can be power hungry - the re-order buffer, schedulers and renaming logic are fairly complicated and do not scale well to very large sizes. Instead of relying on extensive scheduling and renaming logic, the architects from HP and Intel took a different approach – embracing a VLIW (Very Long Instruction Word) philosophy. Itanium pushed the instruction scheduling burden onto the compiler and designed a number of ISA features that would assist software scheduling. The hardware was intended to be extremely simple, with totally static scheduling. In theory, removing all the complicated scheduling and out-of-order logic would reduce power and scale better to smaller process nodes.
However, these gloomy predictions about out-of-order execution were not entirely accurate. The scheduling windows of modern CPU cores like Bulldozer or Sandy Bridge are 3-4X larger than early aggressive x86 designs like the Pentium Pro (40-entry ROB), and larger still relative to the K5 (16 entries). The execution width of out-of-order designs has grown more slowly. Early microarchitectures were 2- and 3-issue wide and have grown to 4-issue, but each uop in a modern core is much more powerful than before. Considering these factors, the execution width has probably grown by a factor of 2 – and more if a workload can be vectorized. In terms of single threaded performance, dynamic scheduling and out-of-order designs have significantly improved over the last decade, contrary to the expectations of the early Itanium architects.
Poulson is a radical departure from the initial Itanium philosophy, and takes into account years of experience, along with technology and market changes. Poulson abandons the idea of simple hardware controlled by the compiler and is the first dynamically scheduled Itanium design, with modest out-of-order execution. The microarchitecture was rebalanced to favor server workloads, rather than HPC and workstations. Poulson has a more sophisticated multi-threading and multi-core architecture, recognizing the need for tolerating memory latency and the technical changes in the industry that have occurred since the first Itaniums debuted on 180nm in 2000. For all the changes though, some things remain the same. Poulson focuses on wide execution and instructions-per-cycle (IPC) rather than frequency, and has excellent reliability features. The die size is a substantial 544mm², accommodating massive on-die caches and scalability features for large servers.
Poulson has already taped out, which is a requirement for ISSCC papers. However, products are slated for release in 2012 (most likely in the first half), reflecting the extremely long test and validation process for mission-critical systems...
The changes to Poulson’s microarchitecture are comprehensive and encompass every part of the pipeline, but instruction fetch is perhaps the least impacted. Fine grained multi-threading is the biggest change for the fetch part of the front-end. Previously, fetching was essentially single-threaded, while for Poulson, it must be shared between two threads dynamically. In all likelihood, the two threads alternate cycles based on priority counters, with the goal of keeping the later stages of the pipeline full.
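The exact arbitration scheme is not disclosed; the priority-counter behavior speculated above can be sketched as follows. This is purely illustrative - the class and its policy are assumptions, not Intel's actual design.

```python
# Hypothetical two-thread fetch arbiter using priority counters.
# Illustrative only; the policy is an assumption, not Intel's design.

class FetchArbiter:
    """Pick which of two threads fetches each cycle."""

    def __init__(self):
        # One urgency counter per thread; higher means more starved.
        self.priority = [0, 0]

    def select(self, stalled):
        """stalled[t] is True if thread t cannot accept instructions."""
        ready = [t for t in (0, 1) if not stalled[t]]
        if not ready:
            return None  # nothing to fetch this cycle
        # Choose the ready thread with the highest urgency counter.
        winner = max(ready, key=lambda t: self.priority[t])
        # The winner's urgency resets; the other thread's grows, so
        # neither thread can be starved indefinitely.
        self.priority[winner] = 0
        for t in (0, 1):
            if t != winner:
                self.priority[t] += 1
        return winner
```

With both threads ready, the arbiter alternates between them every cycle; when one thread stalls (e.g. on a cache miss), the other gets every fetch slot, which is exactly the utilization benefit fine-grained multi-threading aims for.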
The Itanium instruction set was influenced by the RISC philosophy, with an emphasis on simple instructions and reliance on the compiler for complex operations. The ISA is a strict load-store model and was specifically designed to avoid any complex instructions that would have to be decoded into multiple uops – unlike x86, zArch and even the ostensibly simple Power and ARM. Itanium also has no microcode, and instead stole a page from Alpha. The firmware uses Processor and System Abstraction Layers (PAL/SAL) to create a standard software interface to the outside world and handle tasks like booting, power management and machine check error handling. Lack of virtualization was an oversight in the original ISA, but it was later added through hardware and PAL code.
Decoding takes two stages and is where Poulson begins to significantly deviate from Tukwila and resemble a more conventional in-order pipeline. Rather than preserve Itanium’s VLIW semantics, Poulson actually breaks bundles apart into constituent instructions. These individual instructions, instead of bundles, form the basis of further execution.
Tukwila and all earlier Itanium designs were VLIW microarchitectures; compiled bundles formed the basis of execution and instructions were statically scheduled. Any dependencies were resolved by global stalls. The global stall microarchitecture would halt the entire pipeline until the problem had been resolved.
Poulson is fundamentally different and much more akin to traditional RISC or CISC microprocessors. Instructions, rather than explicitly parallel bundles, are dynamically scheduled and executed. Dependencies are resolved by flushing bad results and replaying instructions; no more global stalls. There is even a minimal degree of out-of-order execution – a profound repudiation of some of the underlying assumptions behind Itanium.
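To make the bundle-splitting concrete: an Itanium bundle is 128 bits, comprising a 5-bit template (which encodes the unit types and stop bits) and three 41-bit instruction slots. A minimal sketch of the decomposition Poulson's decoders perform - the function name and representation are illustrative:

```python
# Sketch of how a 128-bit Itanium bundle decomposes into a 5-bit
# template plus three 41-bit instruction slots, which Poulson's
# decoders then treat as individual instructions. The function is
# illustrative; real hardware does this with wiring, not arithmetic.

def split_bundle(bundle):
    """bundle: a 128-bit integer. Returns (template, [slot0, slot1, slot2])."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & 0x1F  # bits 0-4: unit types and stop bits
    slots = [(bundle >> (5 + 41 * i)) & ((1 << 41) - 1) for i in range(3)]
    return template, slots
```

Earlier Itaniums scheduled these bundles as indivisible units; Poulson instead feeds the three constituent instructions into its dynamic scheduler independently.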
Poulson has 3 branch units, 2 simple ALUs, 2 integer units, 2 FPUs and 2 memory pipelines. Tukwila had 3 branch units and 2 FPUs, but no pipelines for simple ALU instructions, which could execute on any of the 4 memory pipelines or 2 integer units. While Poulson’s FPU latency is unknown, most integer operations have single-cycle latency for dependent operations. In addition, there is a new 4-cycle, 64-bit integer multiplier on at least one of the two integer pipelines, used for both multiply and multiply-add instructions.
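These latencies make back-of-the-envelope estimates for dependent chains straightforward. A small sketch, using only the figures given above (1 cycle for simple integer ops, 4 cycles for the multiplier); the table and function are illustrative:

```python
# Dependent-chain latency using only the latencies stated above:
# 1 cycle for simple integer ops, 4 cycles for the 64-bit multiplier
# (multiply and multiply-add alike). Illustrative, not exhaustive.

LATENCY = {"add": 1, "shift": 1, "mul": 4, "madd": 4}

def chain_latency(ops):
    """Total cycles for a fully dependent sequence of operations."""
    return sum(LATENCY[op] for op in ops)
```

For example, a multiply feeding two dependent adds takes 4 + 1 + 1 = 6 cycles, whereas a fused multiply-add followed by one add takes 5.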
Tukwila has an incredible 4 load/store pipelines tightly integrated with the cache and TLB hierarchy to achieve low latency and high bandwidth. The L1D cache and L1 DTLB are only used for integer load instructions, while all stores and floating point loads rely on the L2 D-cache. This is a great example of microarchitecture and circuit co-design with impressive results. The overall cache system is quad-ported, with single cycle latency for integer loads and high bandwidth for floating point data accesses. Only the first two of the memory pipelines can access the L1 D-cache, although they can also issue FP loads to the L2D. The second set of memory pipelines is specialized for integer stores and any FP memory accesses; they generally interface with the L2D.
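The routing rule described above - integer loads hit the L1D, everything else goes to the L2D - can be condensed into a few lines. The function and labels are illustrative, not Intel nomenclature:

```python
# Sketch of Tukwila's memory-pipe specialization as described above:
# integer loads use the L1D, while stores and all FP accesses go
# straight to the L2D. Names are illustrative.

def target_cache(op, data_type):
    """op: 'load' or 'store'; data_type: 'int' or 'fp'."""
    if op == "load" and data_type == "int":
        return "L1D"  # single-cycle integer loads
    return "L2D"      # stores and all FP traffic bypass the L1D
```

Keeping FP data out of the L1D is a deliberate trade: FP code tends to stream large working sets that would thrash a small L1, while integer code benefits most from the single-cycle latency.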
Poulson’s cache hierarchy was glossed over at ISSCC and remains somewhat of a mystery...
From its conception, the goal of Itanium was to address the entire server and workstation market – from HPC to mainframes. In contrast, notebooks and desktops make up the overwhelming majority of x86 microprocessors from AMD and Intel. While x86 designs have grown up and can tackle most of the workloads meant for Itanium, they have stayed true to their roots. There is a limit to how much additional hardware Intel and AMD can put into mainstream x86 designs, without compromising the volume economics. The system architecture for Itanium has a much greater focus on system scalability and reliability. As Figure 7 shows, both Tukwila and Poulson have more QPI links than Westmere-EX for scalability.
Poulson is socket compatible with Tukwila and relies on a similar system architecture. Both processors use a variant of the QuickPath Interconnect found in x86 designs, which is tuned for scalability and reliability. All x86 microprocessors rely on snoop-based cache coherency; whenever a core misses in the last level cache and reads from memory, it must also send a request to the caches in all other sockets to check for copies of the cache line. Snooping is very low latency for 1-4 sockets, but is inefficient for larger systems.
In contrast, Tukwila and Poulson have a directory-based coherency protocol that scales much better. For every cache line, the directory lists which cores have a copy. When a memory access misses in the L3, it first checks the directory to determine which other cores have the cache line and whether it should get the data from memory or another cache. Either way, only a single request and response are sent, compared to N requests and N responses in a snooping system. Checking the directory adds a bit of latency, but for 4- or 16-socket systems, the bandwidth savings are huge. To accelerate the whole process, Tukwila and Poulson also include specialized caches for the directory.
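The bandwidth argument is easy to quantify. A minimal sketch of the message counts implied above - a broadcast snoop touches every other socket, while a directory probe targets only the line's holder:

```python
# Message-count comparison between broadcast snooping and a directory,
# following the reasoning above. Counts are per coherence transaction
# and deliberately simplified (no forwarding optimizations).

def snoop_messages(sockets):
    # The requester broadcasts to every other socket and collects a
    # reply from each: (N-1) requests plus (N-1) responses.
    return 2 * (sockets - 1)

def directory_messages():
    # The directory lookup identifies the holder, so only one probe
    # and one response cross the fabric, regardless of system size.
    return 2
```

At 4 sockets, snooping costs 6 messages per miss versus 2 with a directory; at 16 sockets the gap widens to 30 versus 2, which is why directories dominate in large systems despite the extra lookup latency.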
Poulson's performance has not been discussed, but there are enough clues to put together some intelligent estimates. Given the scope of the changes, performance per core could improve by 25-40%, through a combination of higher frequency and IPC. On top of that, the core count has doubled, so the net gain could be as high as 2.8X. For workloads that are memory and I/O bandwidth limited, the gains will be substantially smaller, but still significant.
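The arithmetic behind the "as high as 2.8X" figure is simply the per-core gain compounded with the doubled core count. A quick check, using the 25-40% range from above:

```python
# Arithmetic behind the performance estimate above: per-core gains of
# 25-40% (frequency plus IPC) compounded with a doubling of core count.

def throughput_gain(per_core_gain, core_ratio):
    """Net throughput multiplier for scalable workloads."""
    return (1 + per_core_gain) * core_ratio

low  = throughput_gain(0.25, 2.0)  # lower bound of the estimate
high = throughput_gain(0.40, 2.0)  # upper bound: the 2.8X cited above
```

This of course assumes the workload scales perfectly across the additional cores; as noted, memory- and I/O-bound workloads will land well below this range.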
Poulson's microarchitecture (Figure 8) should increase instructions per cycle by 10-15%. Dynamic scheduling will boost IPC, although to a lesser extent than full-blown out-of-order execution; and removing the NOPs is also fairly helpful. The 12-wide back-end can swiftly clear all the stalled instructions when a cache miss is resolved, helping average IPC, even if the core is only 6-wide due to fetch and decode constraints. Poulson's better multithreading and replicated DTLBs will raise utilization of the execution pipelines and data caches significantly and help hide low latency events (e.g. L1 or L2 cache misses). The only loss of IPC in the core should come from scaling back to 2 memory pipelines - but for most software, this is a small factor.