x86-64 Machine-Level Programming
Randal E. Bryant
David R. O’Hallaron
September 9, 2005
Intel’s IA32 instruction set architecture (ISA), colloquially known as “x86”, is the dominant instruction
format for the world’s computers. IA32 is the platform of choice for most Windows and Linux machines.
The ISA we use today was defined in 1985 with the introduction of the i386 microprocessor, extending the
16-bit instruction set defined by the original 8086 to 32 bits. Even though subsequent processor generations
have introduced new instruction types and formats, many compilers, including GCC, have avoided using
these features in the interest of maintaining backward compatibility.
A shift is underway to a 64-bit version of the Intel instruction set. Originally developed by Advanced Micro
Devices (AMD) and named x86-64, it is now supported by high-end processors from AMD (who now call it
AMD64) and by Intel, who refer to it as EM64T. Most people still refer to it as “x86-64,” and we follow this
convention. Newer versions of Linux and GCC support this extension. In making this switch, the developers
of GCC saw an opportunity to also make use of some of the instruction-set features that had been added in
more recent generations of IA32 processors.
This combination of new hardware and revised compiler makes x86-64 code substantially different in form
and in performance from IA32 code. In creating the 64-bit extension, the AMD engineers also adopted some
of the features found in reduced instruction set computers (RISC) [7] that made them the favored targets for
optimizing compilers. For example, there are now 16 general-purpose registers, rather than the performance-
limiting eight of the original 8086. The developers of GCC were able to exploit these features, as well as
those of more recent generations of the IA32 architecture, to obtain substantial performance improvements.
For example, procedure parameters are now passed via registers rather than on the stack, greatly reducing
the number of memory read and write operations.
This document serves as a supplement to Chapter 3 of Computer Systems: A Programmer’s Perspective
(CS:APP), describing some of the differences. We start with a brief history of how AMD and Intel arrived
at x86-64, followed by a summary of the main features that distinguish x86-64 code from IA32 code, and
then work our way through the individual features.
Copyright © 2005, R. E. Bryant, D. R. O’Hallaron. All rights reserved.
1 History and Motivation for x86-64
Over the twenty years since the introduction of the i386, the capabilities of microprocessors have changed
dramatically. In 1985, a fully configured, high-end personal computer had around 1 megabyte of random-
access memory (RAM) and 50 megabytes of disk storage. Microprocessor-based “workstation” systems
were just becoming the machines of choice for computing and engineering professionals. A typical micro-
processor had a 5-megahertz clock and ran around one million instructions per second. Nowadays, a typical
high-end system has 1 gigabyte of RAM, 500 gigabytes of disk storage, and a 4-gigahertz clock, running
around 5 billion instructions per second. Microprocessor-based systems have become pervasive. Even to-
day’s supercomputers are based on harnessing the power of many microprocessors computing in parallel.
Given these large quantitative improvements, it is remarkable that the world’s computing base mostly runs
code that is binary compatible with machines that existed 20 years ago.
The 32-bit word size of the IA32 has become a major limitation in growing the capacity of microprocessors.
Most significantly, the word size of a machine defines the range of virtual addresses that programs can use,
giving a 4-gigabyte virtual address space in the case of 32 bits. It is now feasible to buy more than this
amount of RAM for a machine, but the system cannot make effective use of it. For applications that involve
manipulating large data sets, such as scientific computing, databases, and data mining, the 32-bit word size
makes life difficult for programmers. They must write code using out-of-core algorithms¹, where the data
reside on disk and are explicitly read into memory for processing.
Further progress in computing technology requires a shift to a larger word size. Following the tradition of
growing word sizes by doubling, the next logical step is 64 bits. In fact, 64-bit machines have been available
for some time. Digital Equipment Corporation introduced its Alpha processor in 1992, and it became
a popular choice for high-end computing. Sun Microsystems introduced a 64-bit version of its SPARC
architecture in 1995. At the time, however, Intel was not a serious contender for high-end computers, and
so the company was under less pressure to switch to 64 bits.
Intel’s first foray into 64-bit computers was the Itanium line of processors, based on the IA64 instruction set.
Unlike Intel’s historic strategy of maintaining backward compatibility as it introduced each new generation
of microprocessor, IA64 is based on a radically new approach jointly developed with Hewlett-Packard.
Its Very Long Instruction Word (VLIW) format packs multiple instructions into bundles, allowing higher
degrees of parallel execution. Implementing IA64 proved to be very difficult, and so the first Itanium chips
did not appear until 2001, and these did not achieve the expected level of performance on real applications.
Although the performance of Itanium-based systems has improved, they have not captured a significant
share of the computer market. Itanium machines can execute IA32 code in a compatibility mode but not
with very good performance. Most users have preferred to make do with less expensive, and often faster,
IA32-based systems.
Meanwhile, Intel’s archrival, Advanced Micro Devices (AMD) saw an opportunity to exploit Intel’s misstep
with IA64. For years AMD had lagged just behind Intel in technology, and so they were relegated to
competing with Intel on the basis of price. Typically, Intel would introduce a new microprocessor at a
price premium. AMD would come along 6 to 12 months later and have to undercut Intel significantly to
get any sales, a strategy that worked but yielded very low profits. In 2002, AMD introduced a 64-bit
microprocessor based on its “x86-64” instruction set. As the name implies, x86-64 is an evolution of the
Intel instruction set to 64 bits. It maintains full backward compatibility with IA32, but it adds new data
formats, as well as other features that enable higher capacity and higher performance. With x86-64, AMD
has sought to capture some of the high-end market that had historically belonged to Intel. AMD’s recent
generations of Opteron and Athlon 64 processors have indeed proved very successful as high-performance
machines. Most recently, AMD has renamed this instruction set AMD64, but “x86-64” persists as the
favored name.
¹ The physical memory of a machine is often referred to as core memory, dating to an era when each bit of a
random-access memory was implemented with a magnetized ferrite core.
Intel realized that its strategy of a complete shift from IA32 to IA64 was not working, and so began sup-
porting their own variant of x86-64 in 2004 with processors in the Pentium 4 Xeon line. Since they had
already used the name “IA64” to refer to Itanium, they then faced a difficulty in finding their own name for
this 64-bit extension. In the end, they decided to describe x86-64 as an enhancement to IA32, and so they
refer to it as IA32-EM64T for “Extended Memory 64 Technology.”
The developers of GCC steadfastly maintained binary compatibility with the i386, even though useful fea-
tures had been added to the IA32 instruction set. The PentiumPro introduced a set of conditional move
instructions that could greatly improve the performance of code involving conditional operations. More
recent generations of Pentium processors introduced new floating point operations that could replace the
rather awkward and quirky approach dating back to the 8087, the floating point coprocessor that accompanied
the 8086 and is now incorporated within the main microprocessor chips. Switching to x86-64 as a
target provided an opportunity for GCC to give up backward compatibility and instead exploit these newer
features.
In this document, we use “IA32” to refer to the combination of hardware and GCC code found in traditional,
32-bit versions of Linux running on Intel-based machines. We use “x86-64” to refer to the hardware and
code combination running on the newer 64-bit machines from AMD and Intel. In the Linux world, these
two platforms are referred to as “i386” and “x86_64,” respectively.
2 Finding Documentation
Both Intel and AMD provide extensive documentation on their processors. This includes general overviews
of the assembly language programmer’s view of the hardware [2, 4], as well as detailed references about
the individual instructions [3, 5, 6]. The organization amd64.org has been responsible for defining the
Application Binary Interface (ABI) for x86-64 code running on Linux systems [8]. This interface describes
details for procedure linkages, binary code files, and a number of other features that are required for object
code programs to execute properly.
Warning: Both the Intel and the AMD documentation use the Intel assembly code notation. This differs
from the notation used by the GNU assembler, GAS. Most significantly, it lists operands in the opposite order.
3 An Overview of x86-64
The combination of the new hardware supplied by Intel and AMD and the new version of GCC
targeting these machines makes x86-64 code substantially different from that generated for IA32 machines.
C declaration    Intel data type       GAS suffix    x86-64 size (bytes)
char             Byte                  b             1
short            Word                  w             2
int              Double word           l             4
unsigned         Double word           l             4
long int         Quad word             q             8
unsigned long    Quad word             q             8
char *           Quad word             q             8
float            Single precision      s             4
double           Double precision      d             8
long double      Extended precision    t             16
Figure 1: Sizes of standard data types with x86-64. Both long integers and pointers require 8 bytes, as
compared to 4 for IA32.
The main features include:
• Pointers and long integers are 64 bits long. Integer arithmetic operations support 8, 16, 32, and 64-bit
  data types.
• The set of general-purpose registers is expanded from 8 to 16.
• Much of the program state is held in registers rather than on the stack. Integer and pointer procedure
  arguments (up to 6) are passed via registers. Some procedures do not need to access the stack at all.
• Conditional operations are implemented using conditional move instructions when possible, yielding
  better performance than traditional branching code.
• Floating-point operations are implemented using a register-oriented instruction set, rather than the
  stack-based approach supported by IA32.
3.1 Data Types
Figure 1 shows the sizes of different C data types for x86-64. Comparing these to the IA32 sizes (CS:APP
Figure 3.1), we see that pointers (shown here as data type char *) require 8 bytes rather than 4. In
principle, this gives programs the ability to access 16 exabytes of memory (around 18.4 × 10^18 bytes).
That seems like an astonishing amount of memory, but keep in mind that 4 gigabytes seemed astonishing
when the first 32-bit machines appeared in the late 1970s. In practice, most machines do not really support
the full address range (the current generations of AMD and Intel x86-64 machines support 256 terabytes,
or 2^48 bytes, of virtual memory), but allocating this much memory for pointers is a good idea for long-term
compatibility.
We also see that the prefix “long” changes integers to 64 bits, allowing a considerably larger range of
values. Whereas a 32-bit unsigned value can range up to 4,294,967,295 (CS:APP Figure 2.8), increasing
the word size to 64 bits gives a maximum value of 18,446,744,073,709,551,615.
As with IA32, the long prefix also changes a floating point double to use the 80-bit format supported
by IA32 (CS:APP Section 2.4.6). These are stored in memory with an allocation of 16 bytes for x86-64,
compared to 12 bytes for IA32. This improves the performance of memory read and write operations, which
typically fetch 8 or 16 bytes at a time. Whether 12 or 16 bytes are allocated, only the low-order 10 bytes are
actually used.
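The sizes and ranges discussed in this section can be checked with a short C sketch. The function names max_unsigned and print_type_sizes are ours, introduced only for illustration; the sizes printed depend on the compiler and target used, so the values in Figure 1 should be expected only when building for x86-64 Linux with GCC.

```c
#include <stdint.h>
#include <stdio.h>

/* Largest unsigned value that fits in `bits` bits, for 1 <= bits <= 64.
 * The 64-bit case is handled separately because shifting a 64-bit
 * value by 64 positions is undefined in C. */
uint64_t max_unsigned(unsigned bits)
{
    return (bits >= 64) ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/* Print the size in bytes of each C type listed in Figure 1, as seen by
 * whatever compiler and target this file is built for. */
void print_type_sizes(void)
{
    printf("char        %zu\n", sizeof(char));
    printf("short       %zu\n", sizeof(short));
    printf("int         %zu\n", sizeof(int));
    printf("long int    %zu\n", sizeof(long int));
    printf("char *      %zu\n", sizeof(char *));
    printf("float       %zu\n", sizeof(float));
    printf("double      %zu\n", sizeof(double));
    printf("long double %zu\n", sizeof(long double));
}
```

Calling max_unsigned(32) and max_unsigned(64) reproduces the two maximum values quoted above, and max_unsigned(48) + 1 gives the 2^48-byte virtual address limit. Calling print_type_sizes from a main function compiled once with -m64 and once with -m32 should show the differences between the two platforms for long int, char *, and long double.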
3.2 Assembly Code Example
Section 3.2.3 of CS:APP illustrated the IA32 assembly code generated by GCC for a function simple.
Below is the C code for simple_l, similar to simple, except that it uses long integers:
long int simple_l(long int *xp, long int y)
{
    long int t = *xp + y;
    *xp = t;
    return t;
}
When GCC is run on an x86-64 machine with the command line
unix> gcc -O2 -S -m32 code.c
it generates code that is compatible with any IA32 machine:
IA32 version of function simple_l.
Arguments in stack locations 8(%ebp) (xp) and 12(%ebp) (y)
1 simple_l:
2   pushl %ebp              Save frame pointer
3   movl %esp, %ebp         Create new frame pointer
4   movl 8(%ebp), %edx      Get xp
5   movl (%edx), %eax       Retrieve *xp
6   addl 12(%ebp), %eax     Add y to get t (and return value)
7   movl %eax, (%edx)       Store t at *xp
8   leave                   Restore stack and frame pointers
9   ret                     Return
This code is almost identical to that shown in CS:APP, except that it uses the single leave instruction
(CS:APP Section 3.7.2), rather than the sequence movl %ebp, %esp and popl %ebp to deallocate the
stack frame.
When we instruct GCC to generate x86-64 code
unix> gcc -O2 -S -m64 code.c
(on most machines, the flag -m64 is not required), we get very different code:
x86-64 version of function simple_l.
Arguments in registers %rdi (xp) and %rsi (y)
1 simple_l:
2   addq (%rdi), %rsi       Add *xp to y to get t
3   movq %rsi, %rax         Set t as return value
4   movq %rsi, (%rdi)       Store t at *xp
5   ret                     Return
Some of the key differences include:
• Instead of movl and addl instructions, we see movq and addq. The pointers and variables declared
  as long integers are now 64 bits (quad words) rather than 32 bits (long words).
• We see the 64-bit versions of the registers, e.g., %rsi, %rdi. The procedure returns a value by
  storing it in register %rax.
• No stack frame gets generated in the x86-64 version. This eliminates the instructions that set up (lines
  2–3) and remove (line 8) the stack frame in the IA32 code.
• Arguments xp and y are passed in registers %rdi and %rsi, rather than on the stack. These registers
  are the 64-bit versions of registers %edi and %esi. This eliminates the need to fetch the arguments
  from memory. As a consequence, the two instructions on lines 2 and 3 can retrieve *xp, add it to y,
  and set it as the return value, whereas the IA32 code required three lines of code: 4–6.
The net effect of these changes is that the IA32 code consists of 8 instructions making 7 memory refer-
ences, while the x86-64 code consists of 4 instructions making 3 memory references. Running on an Intel
Pentium 4 Xeon, our experiments show that the IA32 code requires around 17 clock cycles per call, while
the x86-64 code requires 12 cycles per call. Running on an AMD Opteron, we get 9 and 7 cycles per call,
respectively. Getting a performance increase of 1.3–1.4X on the same machine with the same C code is a
significant achievement. Clearly x86-64 represents an important step forward.
4 Accessing Information
Figure 2 shows the set of general-purpose registers under x86-64. Compared to the registers for IA32
(CS:APP Figure 3.2), we see a number of differences:
• The number of registers has been doubled to 16. The new registers are numbered 8–15.
• All registers are 64 bits long. The 64-bit extensions of the IA32 registers are named %rax, %rcx,
  %rdx, %rbx, %rsi, %rdi, %rsp, and %rbp. The new registers are named %r8–%r15.
• The low-order 32 bits of each register can be accessed directly. This gives us the familiar registers
  from IA32: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, and %ebp, as well as eight new 32-bit
  registers: %r8d–%r15d.
• The low-order 16 bits of each register can be accessed directly, as is the case for IA32. The word-size
  versions of the new registers are named %r8w–%r15w.
Use                Bits 63–0    Bits 31–0    Bits 15–0    Bits 15–8    Bits 7–0
Return value       %rax         %eax         %ax          %ah          %al
Callee saved       %rbx         %ebx         %bx          %bh          %bl
4th argument       %rcx         %ecx         %cx          %ch          %cl
3rd argument       %rdx         %edx         %dx          %dh          %dl
2nd argument       %rsi         %esi         %si                       %sil
1st argument       %rdi         %edi         %di                       %dil
Callee saved       %rbp         %ebp         %bp                       %bpl
Stack pointer      %rsp         %esp         %sp                       %spl
5th argument       %r8          %r8d         %r8w                      %r8b
6th argument       %r9          %r9d         %r9w                      %r9b
Callee saved       %r10         %r10d        %r10w                     %r10b
Used for linking   %r11         %r11d        %r11w                     %r11b
Unused for C       %r12         %r12d        %r12w                     %r12b
Callee saved       %r13         %r13d        %r13w                     %r13b
Callee saved       %r14         %r14d        %r14w                     %r14b
Callee saved       %r15         %r15d        %r15w                     %r15b
Figure 2: Integer registers. The existing eight registers are extended to 64-bit versions, and eight new
registers are added. Each register can be accessed as either 8 bits (byte), 16 bits (word), 32 bits (double
word), or 64 bits (quad word).
• The low-order 8 bits of each register can be accessed directly. This is true in IA32 only for the first 4
  registers (%al, %cl, %dl, %bl). The byte-size versions of the other IA32 registers are named %sil,
  %dil, %spl, and %bpl. The byte-size versions of the new registers are named %r8b–%r15b.
• For backward compatibility, the second byte of registers %rax, %rcx, %rdx, and %rbx can be
  directly accessed by instructions having single-byte operands.
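The relationship between a 64-bit register and its narrower views can be modeled in C with truncating casts and a shift. This is only a sketch of how the values relate; the helper names (low32, low16, low8, second_byte) are hypothetical and do not correspond to any instruction or library function.

```c
#include <stdint.h>

/* Hypothetical helpers: each returns the value seen through one of the
 * narrower views of a 64-bit register whose full contents are `r`. */

/* Low-order 32 bits, e.g. %eax or %r8d. */
uint32_t low32(uint64_t r) { return (uint32_t)r; }

/* Low-order 16 bits, e.g. %ax or %r8w. */
uint16_t low16(uint64_t r) { return (uint16_t)r; }

/* Low-order 8 bits, e.g. %al, %sil, or %r8b. */
uint8_t low8(uint64_t r) { return (uint8_t)r; }

/* Second byte (bits 15-8), e.g. %ah. Only %rax, %rcx, %rdx, and %rbx
 * offer this view. */
uint8_t second_byte(uint64_t r) { return (uint8_t)(r >> 8); }
```

For example, if %rax held 0x1122334455667788, then %eax would read as 0x55667788, %ax as 0x7788, %al as 0x88, and %ah as 0x77.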
As with IA32, most of the registers can be used interchangeably, but there are some special cases. Register
%rsp has special status, in that it holds a pointer to the top stack element. Unlike in IA32, however, there
is no frame pointer register; register %rbp is available for use as a general-purpose register. Particular
conventions are used for passing procedure arguments via registers and for how registers are to be saved
and restored during procedure calls, as is discussed in Section 6. In addition, some arithmetic
instructions make special use of registers %rax and %rdx.
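The argument-passing labels in Figure 2 can be summarized as a small lookup table. The sketch below uses a hypothetical function, arg_register, that maps the position of an integer or pointer argument to the register listed for it in the figure. It deliberately says nothing about arguments beyond the sixth, which are outside what the figure shows.

```c
#include <stddef.h>

/* Register that carries the n-th integer or pointer argument (n counted
 * from 1), following the labels in Figure 2. Returns NULL when n is
 * outside 1..6; how such arguments are passed is not covered here. */
const char *arg_register(int n)
{
    static const char *const regs[] = {
        "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"
    };
    if (n < 1 || n > 6)
        return NULL;
    return regs[n - 1];
}
```

For the function simple_l, this gives %rdi for xp (argument 1) and %rsi for y (argument 2), matching the x86-64 code shown in Section 3.2.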
For the most part, the operand specifiers of x86-64 are just the same as those in IA32 (see CS:APP Fig-
ure 3.3). One minor difference is that some forms of PC-relative operand addressing are supported. With
IA32, this form of addressing is only supported for jump and other control transfer instructions (see CS:APP
Section 3.6.3). This mode is provided to compensate for the fact that the offsets (shown in CS:APP
Figure 3.3 as Imm) are only 32 bits long. By viewing this field as a 32-bit, two’s complement number,
instructions can access data within a window of around ±2.15 × 10^9 bytes relative to the program counter.
With x86-64, the program counter is named %rip.
As an example of PC-relative data addressing, consider the following procedure, which calls the function
simple_l examined earlier:
long int gval1 = 567;
long int gval2 = 763;

long int call_simple_l()
{
    long int z = simple_l(&gval1, 12L);
    return z + gval2;
}
This code references global variables gval1 and gval2. When this function is compiled, assembled, and
linked, we get the following executable code (as generated by the disassembler objdump)
1 0000000000400500 <call_simple_l>:
2   400500: be 0c 00 00 00          mov   $0xc,%esi            Load 12 as 2nd argument
3   400505: bf 08 12 50 00          mov   $0x501208,%edi       Load &gval1 as 1st argument
4   40050a: e8 b1 ff ff ff          callq 4004c0 <simple_l>    Call simple_l
5   40050f: 48 03 05 ea 0c 10 00    add   1051882(%rip),%rax   Add gval2 to result
6   400516: c3                      retq
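The address arithmetic behind the PC-relative operands in this listing can be sketched in C. On x86-64 the 32-bit displacement is sign-extended and added to the address of the following instruction, which is the value %rip holds while the current instruction executes. The function name rip_relative_target is ours, and the instruction addresses and lengths are read off the listing above.

```c
#include <stdint.h>

/* Address referenced by a PC-relative operand: the signed 32-bit
 * displacement is added to the address of the *next* instruction. */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len, int32_t disp)
{
    uint64_t next_insn = insn_addr + insn_len;   /* value of %rip during execution */
    return next_insn + (uint64_t)(int64_t)disp;  /* sign-extend, then add (wraps mod 2^64) */
}
```

For the add instruction on line 5 (address 0x40050f, 7 bytes long, displacement 1051882 = 0x100cea), this gives 0x400516 + 0x100cea = 0x501200, which would place gval2 eight bytes below gval1 at 0x501208. The same computation applied to the callq on line 4 (address 0x40050a, 5 bytes long, displacement bytes b1 ff ff ff, i.e. -79) gives 0x4004c0, the address of simple_l shown by the disassembler.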
The instruction on line 3 stores the address of global variable gval1 in register %rdi. It does this by simply
copying the constant value 0x501208 into register %edi. The upper 32 bits of %rdi are then automat-
ically set to zero. The instruction on line 5 retrieves the value of gval2 and adds it to the value returned
Instruction      Effect                 Description
movq S, D        D ← S                  Move quad word
movabsq I, R     R ← I                  Move quad word
movslq S, R      R ← SignExtend(S)      Move sign-extended double word
movsbq S, R      R ← SignExtend(S)      Move sign-extended byte
movzbq S, R      R ← ZeroExtend(S)      Move zero-extended byte
pushq S R[%rsp] R[%rsp]