At first glance, the only drawback of Intel’s new processor family seems to be its high price, especially since excellent performance is not the only advantage of the Core 2 Duo processors: they also boast comparatively low heat dissipation and power consumption as well as significant overclocking potential. However, in the choir praising the newcomer there are a few voices trying to pinpoint drawbacks that could, in theory, slightly spoil its triumph in the market. One of the most insistent claims is that the new Intel processors, which have totally defeated all their competitors in today’s most widespread benchmarks, will not be able to repeat this success in 64-bit mode.
Note that from the micro-architectural standpoint, it is not that hard to implement the 64-bit extensions of the classical x86 architecture. x86-64 requires more general-purpose registers (16) of greater width (64 bits), more 128-bit SSE registers (16) and linear 64-bit addressing. Of course, CPU developers have to put some effort into implementing x86-64 support properly. However, they do not need to change the architecture radically, which is an indisputable advantage of x86-64 over, for instance, IA-64, which was introduced with the Intel Itanium processors.
All the claims of relatively low Core 2 Duo performance in 64-bit mode are based on two facts. According to information confirmed by Intel representatives, there are two limitations imposed on EM64T support in the Core microarchitecture. Firstly, Core 2 Duo processors do not support Macrofusion technology in 64-bit mode. Secondly, instruction decoding may slow down because of instructions that work with the additional registers available only with EM64T enabled. Let’s try to get to the root of these two problems.
Thanks to Intel’s marketing people, Macrofusion is known as one of the key features of the new Core microarchitecture. This technology serves to increase the number of instructions processed per clock cycle: the processor recognizes certain pairs of consecutive x86 instructions and turns each pair into a single micro-operation. A good example of such a pair is a comparison followed by a conditional branch. The scheduler and the execution units see this micro-operation as a single command and process it accordingly. This way the code is processed faster, allowing the CPU to execute up to 5 instructions per clock cycle at best.
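As a rough illustration of the idea, the sketch below models Macrofusion on a list of instruction mnemonics. It is a toy model, not Intel's actual decoder logic: the function name fuse_pairs and the set of fusible pairs (a cmp or test immediately followed by a conditional jump) are simplifying assumptions made for this example.

```python
# Toy model of macrofusion: a compare/test followed directly by a
# conditional jump is counted as one fused micro-operation.
# The mnemonic sets below are illustrative assumptions, not a full list.
COMPARES = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge", "ja", "jae", "jb", "jbe"}


def fuse_pairs(instructions):
    """Return the micro-op stream after fusing adjacent compare+branch pairs."""
    fused = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if cur in COMPARES and nxt in COND_JUMPS:
            fused.append(cur + "+" + nxt)  # the pair occupies one slot
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


code = ["mov", "add", "cmp", "jne", "mov"]
micro_ops = fuse_pairs(code)
print(len(code), "instructions ->", len(micro_ops), "micro-ops:", micro_ops)
```

In this example five x86 instructions collapse into four micro-operations, which is how a 4-wide pipeline can, at best, get through five instructions in one clock cycle.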
However, the absence of Macrofusion in 64-bit mode can hardly affect CPU performance that dramatically. In the ideal case, when there is one branch for every five x86 instructions and all five instructions fall into the 16-byte block processed within a single clock cycle, the theoretical speedup is 25% (five instructions per clock instead of four). In reality, however, this technology delivers a steady performance improvement only if a whole set of conditions is met, not least because the branch frequency described above is not realistic at all. Moreover, Macrofusion is really efficient only if the average instruction length is less than 4 bytes. As a result, the engineers estimate the possible improvement at 3%-5% at most. In other words, the absence of Macrofusion support in EM64T is no reason for panic, because it doesn’t affect performance that much.
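The 25% ceiling and the 3%-5% estimate can be related with a back-of-the-envelope model. The sketch below assumes a 4-wide decoder and that some fraction of decode cycles contains one fusible pair, which lets a fifth instruction through in that cycle. Both the function name macrofusion_speedup and this single-parameter model are assumptions made for illustration, not Intel's methodology.

```python
def macrofusion_speedup(fused_fraction, decode_width=4):
    """Relative gain in instructions per clock when the given fraction of
    decode cycles handles one extra instruction thanks to a fused pair.

    Simplified model: ignores instruction length, alignment and stalls.
    """
    if not 0.0 <= fused_fraction <= 1.0:
        raise ValueError("fused_fraction must be between 0 and 1")
    baseline_ipc = decode_width                # no fusion: 4 per clock
    fused_ipc = decode_width + fused_fraction  # up to 5 per clock
    return fused_ipc / baseline_ipc - 1.0


# Ideal case from the text: every cycle contains a fusible pair.
print(f"ideal: {macrofusion_speedup(1.0):.0%}")

# Fractions that correspond to the 3%-5% range under this model.
for fraction in (0.12, 0.20):
    print(f"fraction {fraction:.2f}: {macrofusion_speedup(fraction):.0%}")
```

Under these assumptions, a 3%-5% gain corresponds to a fusible pair showing up in only about one decode cycle out of eight to one out of five, far from the ideal of one per cycle.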
As for the slowdown caused by instructions working with the additional registers, it results from the single-byte REX prefix that is added to instructions using 64-bit operands or the new registers. This prefix increases the average length of the instructions processed by the CPU in 64-bit mode. As a result, fewer instructions may fit into the 16-byte block of code that is fetched from the L1 cache and decoded in a single clock cycle. The average instruction length in 32-bit x86 code is about 2.5-3.5 bytes, while in 64-bit mode it increases because of the REX prefix. When the average instruction length exceeds 4 bytes, the CPU may lose its ability to process 4 instructions per clock.
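To make the size cost concrete, the sketch below assembles the REX prefix byte by hand and compares the encodings of a register-to-register move in 32-bit and 64-bit form. The REX layout (binary 0100WRXB) and the 89 /r opcode for a register move follow the x86-64 instruction encoding; the helper names rex_prefix and encode_mov_reg_reg are hypothetical and cover only this one instruction form.

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte: fixed high nibble 0100, then W, R, X, B bits."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


def encode_mov_reg_reg(dst, src, wide):
    """Encode `mov dst, src` (opcode 89 /r) for register numbers 0..15.

    wide=True selects 64-bit operands (REX.W). Registers 8..15 need the
    REX.R / REX.B extension bits, so they force a prefix as well.
    """
    if not (0 <= dst <= 15 and 0 <= src <= 15):
        raise ValueError("register numbers must be in 0..15")
    w = 1 if wide else 0
    r = src >> 3  # high bit of the ModRM.reg field
    b = dst >> 3  # high bit of the ModRM.rm field
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)  # mod=11: register direct
    body = bytes([0x89, modrm])
    if w or r or b:
        return bytes([rex_prefix(w=w, r=r, b=b)]) + body
    return body


RAX, RBX, R8, R9 = 0, 3, 8, 9
for name, dst, src, wide in [
    ("mov eax, ebx", RAX, RBX, False),
    ("mov rax, rbx", RAX, RBX, True),
    ("mov r8, r9", R8, R9, True),
]:
    enc = encode_mov_reg_reg(dst, src, wide)
    print(f"{name:14s} {enc.hex(' ')}  ({len(enc)} bytes)")
```

The same move grows from two bytes to three as soon as it uses 64-bit operands or one of the registers r8-r15, which is the per-instruction overhead discussed above.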
To be fair, we should say that the increase in instruction length caused by the REX prefix is typical not only of Intel CPUs based on the new Core microarchitecture, but also of the competitor’s K8 processors. The only difference is that K8 needs at most 3 instructions from this 16-byte block to load its execution units to the full extent, while Intel’s Core 2 Duo can process 4 instructions per clock cycle thanks to Intel Wide Dynamic Execution technology.
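This comparison can be sketched with a simple bound: the number of instructions available per clock is limited both by the decoder width and by how many average-length instructions fit into the 16-byte block. The function name decode_rate and the assumption of a uniform instruction length are simplifications for illustration; real decoders also depend on alignment and instruction mix.

```python
FETCH_WINDOW_BYTES = 16  # block of code decoded per clock, as in the text


def decode_rate(avg_len_bytes, decode_width):
    """Upper bound on instructions decoded per clock.

    Limited by the decoder width and by how many instructions of the
    given average length fit into the fetch window.
    """
    if avg_len_bytes <= 0:
        raise ValueError("average instruction length must be positive")
    return min(decode_width, FETCH_WINDOW_BYTES / avg_len_bytes)


for avg_len in (3.0, 4.0, 4.5, 5.5):
    core = decode_rate(avg_len, decode_width=4)  # Core 2 Duo: 4 per clock
    k8 = decode_rate(avg_len, decode_width=3)    # K8: 3 per clock
    print(f"avg {avg_len} bytes: Core 2 Duo {core:.2f}, K8 {k8:.2f} instr/clock")
```

In this model the 4-wide design starts to fall short of its peak once the average length exceeds 4 bytes, while a 3-wide design stays fully fed up to 16/3, or about 5.3 bytes. Even then the 4-wide decoder is never behind the 3-wide one; it merely loses part of its advantage.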
Therefore, we don’t think that the EM64T implementation issues discussed above are all that serious for Core-based Intel processors. 64-bit code that is fully analogous to regular 32-bit code is processed just a little slower on Core 2 Duo processors because Macrofusion does not work in this mode. As for the performance drop caused by the 64-bit operations, the CPU’s ability to work with more registers of greater width will definitely make up for the slowdown.