
[Repost] x86-64 Machine-Level Programming

Posted on 2011-09-29 18:26 by 光铭

x86-64 Machine-Level Programming

Randal E. Bryant

David R. O’Hallaron

September 9, 2005

Intel’s IA32 instruction set architecture (ISA), colloquially known as “x86”, is the dominant instruction

format for the world’s computers. IA32 is the platform of choice for most Windows and Linux machines.

The ISA we use today was defined in 1985 with the introduction of the i386 microprocessor, extending the

16-bit instruction set defined by the original 8086 to 32 bits. Even though subsequent processor generations

have introduced new instruction types and formats, many compilers, including GCC, have avoided using

these features in the interest of maintaining backward compatibility.

A shift is underway to a 64-bit version of the Intel instruction set. Originally developed by Advanced Micro

Devices (AMD) and named x86-64, it is now supported by high-end processors from AMD (who now call it

AMD64) and by Intel, who refer to it as EM64T. Most people still refer to it as “x86-64,” and we follow this

convention. Newer versions of Linux and GCC support this extension. In making this switch, the developers

of GCC saw an opportunity to also make use of some of the instruction-set features that had been added in

more recent generations of IA32 processors.

This combination of new hardware and revised compiler makes x86-64 code substantially different in form

and in performance from IA32 code. In creating the 64-bit extension, the AMD engineers also adopted some

of the features found in reduced-instruction set computers (RISC) [7] that made them the favored targets for

optimizing compilers. For example, there are now 16 general-purpose registers, rather than the performance-

limiting eight of the original 8086. The developers of GCC were able to exploit these features, as well as

those of more recent generations of the IA32 architecture, to obtain substantial performance improvements.

For example, procedure parameters are now passed via registers rather than on the stack, greatly reducing

the number of memory read and write operations.

This document serves as a supplement to Chapter 3 of Computer Systems: A Programmer’s Perspective

(CS:APP), describing some of the differences. We start with a brief history of how AMD and Intel arrived

at x86-64, followed by a summary of the main features that distinguish x86-64 code from IA32 code, and

then work our way through the individual features.

Copyright © 2005, R. E. Bryant, D. R. O’Hallaron. All rights reserved.

1 History and Motivation for x86-64

Over the twenty years since the introduction of the i386, the capabilities of microprocessors have changed

dramatically. In 1985, a fully configured, high-end personal computer had around 1 megabyte of random-

access memory (RAM) and 50 megabytes of disk storage. Microprocessor-based “workstation” systems

were just becoming the machines of choice for computing and engineering professionals. A typical

microprocessor had a 5-megahertz clock and ran around one million instructions per second. Nowadays, a typical

high-end system has 1 gigabyte of RAM, 500 gigabytes of disk storage, and a 4-gigahertz clock, running

around 5 billion instructions per second. Microprocessor-based systems have become pervasive. Even

today’s supercomputers are based on harnessing the power of many microprocessors computing in parallel.

Given these large quantitative improvements, it is remarkable that the world’s computing base mostly runs

code that is binary compatible with machines that existed 20 years ago.

The 32-bit word size of the IA32 has become a major limitation in growing the capacity of microprocessors.

Most significantly, the word size of a machine defines the range of virtual addresses that programs can use,

giving a 4-gigabyte virtual address space in the case of 32 bits. It is now feasible to buy more than this

amount of RAM for a machine, but the system cannot make effective use of it. For applications that involve

manipulating large data sets, such as scientific computing, databases, and data mining, the 32-bit word size

makes life difficult for programmers. They must write code using out-of-core algorithms (see footnote 1), where the data

reside on disk and are explicitly read into memory for processing.

Further progress in computing technology requires a shift to a larger word size. Following the tradition of

growing word sizes by doubling, the next logical step is 64 bits. In fact, 64-bit machines have been available

for some time. Digital Equipment Corporation introduced its Alpha processor in 1992, and it became

a popular choice for high-end computing. Sun Microsystems introduced a 64-bit version of its SPARC

architecture in 1995. At the time, however, Intel was not a serious contender for high-end computers, and

so the company was under less pressure to switch to 64 bits.

Intel’s first foray into 64-bit computers was the Itanium line of processors, based on the IA64 instruction set.

Unlike Intel’s historic strategy of maintaining backward compatibility as it introduced each new generation

of microprocessor, IA64 is based on a radically new approach jointly developed with Hewlett-Packard.

Its Very Long Instruction Word (VLIW) format packs multiple instructions into bundles, allowing higher

degrees of parallel execution. Implementing IA64 proved to be very difficult, and so the first Itanium chips

did not appear until 2001, and these did not achieve the expected level of performance on real applications.

Although the performance of Itanium-based systems has improved, they have not captured a significant

share of the computer market. Itanium machines can execute IA32 code in a compatibility mode but not

with very good performance. Most users have preferred to make do with less expensive, and often faster,

IA32-based systems.

Meanwhile, Intel’s archrival, Advanced Micro Devices (AMD), saw an opportunity to exploit Intel’s misstep

with IA64. For years AMD had lagged just behind Intel in technology, and so they were relegated to

competing with Intel on the basis of price. Typically, Intel would introduce a new microprocessor at a

price premium. AMD would come along 6 to 12 months later and have to undercut Intel significantly to

get any sales, a strategy that worked but yielded very low profits. In 2002, AMD introduced a 64-bit

microprocessor based on its “x86-64” instruction set. As the name implies, x86-64 is an evolution of the

Intel instruction set to 64 bits. It maintains full backward compatibility with IA32, but it adds new data

formats, as well as other features that enable higher capacity and higher performance. With x86-64, AMD

has sought to capture some of the high-end market that had historically belonged to Intel. AMD’s recent

generations of Opteron and Athlon 64 processors have indeed proved very successful as high-performance

machines. Most recently, AMD has renamed this instruction set AMD64, but “x86-64” persists as the

favored name.

Footnote 1: The physical memory of a machine is often referred to as core memory, dating to an era when each bit of a random-access

memory was implemented with a magnetized ferrite core.

Intel realized that its strategy of a complete shift from IA32 to IA64 was not working, and so began

supporting its own variant of x86-64 in 2004 with processors in the Pentium 4 Xeon line. Since they had

already used the name “IA64” to refer to Itanium, they then faced a difficulty in finding their own name for

this 64-bit extension. In the end, they decided to describe x86-64 as an enhancement to IA32, and so they

refer to it as IA32-EM64T for “Extended Memory 64 Technology.”

The developers of GCC steadfastly maintained binary compatibility with the i386, even though useful

features had been added to the IA32 instruction set. The PentiumPro introduced a set of conditional move

instructions that could greatly improve the performance of code involving conditional operations. More

recent generations of Pentium processors introduced new floating point operations that could replace the

rather awkward and quirky approach dating back to the 8087, the floating point coprocessor that

accompanied the 8086 and is now incorporated within the main microprocessor chips. Switching to x86-64 as a

target provided an opportunity for GCC to give up backward compatibility and instead exploit these newer

features.

In this document, we use “IA32” to refer to the combination of hardware and GCC code found in traditional,

32-bit versions of Linux running on Intel-based machines. We use “x86-64” to refer to the hardware and

code combination running on the newer 64-bit machines from AMD and Intel. In the Linux world, these

two platforms are referred to as “i386” and “x86_64,” respectively.

2 Finding Documentation

Both Intel and AMD provide extensive documentation on their processors. This includes general overviews

of the assembly language programmer’s view of the hardware [2, 4], as well as detailed references about

the individual instructions [3, 5, 6]. The organization amd64.org has been responsible for defining the

Application Binary Interface (ABI) for x86-64 code running on Linux systems [8]. This interface describes

details for procedure linkages, binary code files, and a number of other features that are required for object

code programs to execute properly.

Warning: Both the Intel and the AMD documentation use the Intel assembly code notation. This differs

from the notation used by the GNU assembler, GAS. Most significantly, it lists operands in the opposite order.

3 An Overview of x86-64

The combination of the new hardware supplied by Intel and AMD, as well as the new version of GCC

targeting these machines, makes x86-64 code substantially different from that generated for IA32 machines.

C declaration    Intel data type      GAS suffix   x86-64 size (bytes)
char             Byte                 b            1
short            Word                 w            2
int              Double word          l            4
unsigned         Double word          l            4
long int         Quad word            q            8
unsigned long    Quad word            q            8
char *           Quad word            q            8
float            Single precision     s            4
double           Double precision     d            8
long double      Extended precision   t            16

Figure 1: Sizes of standard data types with x86-64. Both long integers and pointers require 8 bytes, as

compared to 4 for IA32.

The main features include:

• Pointers and long integers are 64 bits long. Integer arithmetic operations support 8, 16, 32, and 64-bit

data types.

• The set of general-purpose registers is expanded from 8 to 16.

• Much of the program state is held in registers rather than on the stack. Integer and pointer procedure

arguments (up to 6) are passed via registers. Some procedures do not need to access the stack at all.

• Conditional operations are implemented using conditional move instructions when possible, yielding

better performance than traditional branching code.

• Floating-point operations are implemented using a register-oriented instruction set, rather than the

stack-based approach supported by IA32.
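
As a rough illustration of the conditional-move feature listed above, the following sketch shows the kind of C source that an optimizing compiler may translate without a branch. The names max_l and check_max_l are hypothetical and do not come from the original text, and whether a conditional move is actually emitted depends on the compiler version and flags.

```c
#include <assert.h>

/* Hypothetical helper (not from the original text): returns the larger
 * of two long values. With optimization enabled, a compiler targeting
 * x86-64 may translate the conditional expression into a compare
 * followed by a conditional move rather than a branch. That is an
 * assumption about typical output, not a guarantee. */
long max_l(long x, long y)
{
    return (x > y) ? x : y;
}

/* Small self-check that can be called from main() or a test harness. */
void check_max_l(void)
{
    assert(max_l(3, 7) == 7);
    assert(max_l(-1, -5) == -1);
}
```

Compiling this with the command line used later in this document (gcc -O2 -S) and reading the generated assembly is one way to see whether a cmov instruction was chosen on a given machine.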

3.1 Data Types

Figure 1 shows the sizes of different C data types for x86-64. Comparing these to the IA32 sizes (CS:APP

Figure 3.1), we see that pointers (shown here as data type char *) require 8 bytes rather than 4. In

principle, this gives programs the ability to access 16 exabytes of memory (around 18.4 × 10^18 bytes).

That seems like an astonishing amount of memory, but keep in mind that 4 gigabytes seemed astonishing

when the first 32-bit machines appeared in the late 1970s. In practice, most machines do not really support

the full address range (the current generations of AMD and Intel x86-64 machines support 256 terabytes,

or 2^48 bytes, of virtual memory), but allocating this much memory for pointers is a good idea for long-term

compatibility.

We also see that the prefix “long” changes integers to 64 bits, allowing a considerably larger range of

values. Whereas a 32-bit unsigned value can range up to 4,294,967,295 (CS:APP Figure 2.8), increasing

the word size to 64 bits gives a maximum value of 18,446,744,073,709,551,615.

As with IA32, the long prefix also changes a floating point double to use the 80-bit format supported

by IA32 (CS:APP Section 2.4.6). These are stored in memory with an allocation of 16 bytes for x86-64,

compared to 12 bytes for IA32. This improves the performance of memory read and write operations, which

typically fetch 8 or 16 bytes at a time. Whether 12 or 16 bytes are allocated, only the low-order 10 bytes are

actually used.
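
The sizes in Figure 1 can be checked on a given machine with a short sketch like the one below. The helper names are hypothetical, and the asserted values assume a 64-bit Linux target of the kind described in this document; compiling the same file with -m32 would give 4 bytes for long and for pointers, and the checks would fail.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helpers (not from the original text) that report the
 * sizes listed in Figure 1 for the machine the code is compiled on. */
size_t size_of_int(void)         { return sizeof(int); }
size_t size_of_long(void)        { return sizeof(long int); }
size_t size_of_pointer(void)     { return sizeof(char *); }
size_t size_of_long_double(void) { return sizeof(long double); }

/* Largest value an unsigned long can hold on this machine. */
unsigned long max_unsigned_long(void) { return ULONG_MAX; }

/* Checks that hold only under the assumption of an x86-64 Linux
 * target, where long and pointers are 8 bytes. */
void check_x86_64_sizes(void)
{
    assert(size_of_int() == 4);
    assert(size_of_long() == 8);
    assert(size_of_pointer() == 8);
    assert(max_unsigned_long() == 18446744073709551615UL);
}
```

The long double size is reported but deliberately not asserted, since the text notes that its allocation differs between IA32 (12 bytes) and x86-64 (16 bytes).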

3.2 Assembly Code Example

Section 3.2.3 of CS:APP illustrated the IA32 assembly code generated by GCC for a function simple.

Below is the C code for simple_l, similar to simple, except that it uses long integers:

    long int simple_l(long int *xp, long int y)
    {
        long int t = *xp + y;
        *xp = t;
        return t;
    }

When GCC is run on an x86-64 machine with the command line

unix> gcc -O2 -S -m32 code.c

it generates code that is compatible with any IA32 machine:

IA32 version of function simple_l.
Arguments in stack locations 8(%ebp) (xp) and 12(%ebp) (y)

1  simple_l:
2    pushl %ebp              Save frame pointer
3    movl  %esp, %ebp        Create new frame pointer
4    movl  8(%ebp), %edx     Get xp
5    movl  (%edx), %eax      Retrieve *xp
6    addl  12(%ebp), %eax    Add y to get t (and return value)
7    movl  %eax, (%edx)      Store t at *xp
8    leave                   Restore stack and frame pointers
9    ret                     Return

This code is almost identical to that shown in CS:APP, except that it uses the single leave instruction

(CS:APP Section 3.7.2), rather than the sequence movl %ebp, %esp and popl %ebp to deallocate the

stack frame.

When we instruct GCC to generate x86-64 code

unix> gcc -O2 -S -m64 code.c

(on most machines, the flag -m64 is not required), we get very different code:

x86-64 version of function simple_l.
Arguments in registers %rdi (xp) and %rsi (y)

1  simple_l:
2    addq  (%rdi), %rsi      Add *xp to y to get t
3    movq  %rsi, %rax        Set t as return value
4    movq  %rsi, (%rdi)      Store t at *xp
5    ret                     Return

Some of the key differences include

• Instead of movl and addl instructions, we see movq and addq. The pointers and variables declared

as long integers are now 64 bits (quad words) rather than 32 bits (long words).

• We see the 64-bit versions of the registers, e.g., %rsi, %rdi. The procedure returns a value by

storing it in register %rax.

• No stack frame gets generated in the x86-64 version. This eliminates the instructions that set up (lines

2–3) and remove (line 8) the stack frame in the IA32 code.

• Arguments xp and y are passed in registers %rdi and %rsi, rather than on the stack. These registers

are the 64-bit versions of registers %edi and %esi. This eliminates the need to fetch the arguments

from memory. As a consequence, the two instructions on lines 2 and 3 can retrieve *xp, add it to y,

and set it as the return value, whereas the IA32 code required three lines of code: 4–6.

The net effect of these changes is that the IA32 code consists of 8 instructions making 7 memory

references, while the x86-64 code consists of 4 instructions making 3 memory references. Running on an Intel

Pentium 4 Xeon, our experiments show that the IA32 code requires around 17 clock cycles per call, while

the x86-64 code requires 12 cycles per call. Running on an AMD Opteron, we get 9 and 7 cycles per call,

respectively. Getting a performance increase of 1.3–1.4X on the same machine with the same C code is a

significant achievement. Clearly x86-64 represents an important step forward.
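
To experiment with the two code versions discussed above, a small driver such as the following sketch can be compiled alongside simple_l. The function demo_simple_l is hypothetical and not part of the original text, and the constant it uses assumes that long int is 64 bits wide, as it is when compiling for x86-64.

```c
#include <assert.h>

/* Function from the text: adds y to *xp, stores the sum back through
 * the pointer, and returns it. */
long int simple_l(long int *xp, long int y)
{
    long int t = *xp + y;
    *xp = t;
    return t;
}

/* Hypothetical driver (not from the original text). The starting value
 * does not fit in 32 bits, so it assumes a 64-bit long int. */
long int demo_simple_l(void)
{
    long int x = 5000000000L;
    long int r = simple_l(&x, 12L);
    assert(r == x);   /* the returned value is also the stored value */
    return r;
}
```

Under the 64-bit assumption, demo_simple_l returns 5000000012. Generating assembly for this file once with -m32 and once with -m64 should show the same contrast in argument passing that the two listings above illustrate, although the exact instructions depend on the GCC version.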

4 Accessing Information

Figure 2 shows the set of general-purpose registers under x86-64. Compared to the registers for IA32

(CS:APP Figure 3.2), we see a number of differences:

• The number of registers has been doubled to 16. The new registers are numbered 8–15.

• All registers are 64 bits long. The 64-bit extensions of the IA32 registers are named %rax, %rcx,

%rdx, %rbx, %rsi, %rdi, %rsp, and %rbp. The new registers are named %r8–%r15.

• The low-order 32 bits of each register can be accessed directly. This gives us the familiar registers

from IA32: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, and %ebp, as well as eight new 32-bit

registers: %r8d–%r15d.

• The low-order 16 bits of each register can be accessed directly, as is the case for IA32. The word-size

versions of the new registers are named %r8w–%r15w.

Usage              Bits 63-0   Bits 31-0   Bits 15-0   Bits 15-8   Bits 7-0
Return value       %rax        %eax        %ax         %ah         %al
Callee saved       %rbx        %ebx        %bx         %bh         %bl
4th argument       %rcx        %ecx        %cx         %ch         %cl
3rd argument       %rdx        %edx        %dx         %dh         %dl
2nd argument       %rsi        %esi        %si                     %sil
1st argument       %rdi        %edi        %di                     %dil
Callee saved       %rbp        %ebp        %bp                     %bpl
Stack pointer      %rsp        %esp        %sp                     %spl
5th argument       %r8         %r8d        %r8w                    %r8b
6th argument       %r9         %r9d        %r9w                    %r9b
Callee saved       %r10        %r10d       %r10w                   %r10b
Used for linking   %r11        %r11d       %r11w                   %r11b
Unused for C       %r12        %r12d       %r12w                   %r12b
Callee saved       %r13        %r13d       %r13w                   %r13b
Callee saved       %r14        %r14d       %r14w                   %r14b
Callee saved       %r15        %r15d       %r15w                   %r15b

Figure 2: Integer registers. The existing eight registers are extended to 64-bit versions, and eight new

registers are added. Each register can be accessed as either 8 bits (byte), 16 bits (word), 32 bits (double

word), or 64 bits (quad word).

• The low-order 8 bits of each register can be accessed directly. This is true in IA32 only for the first 4

registers (%al, %cl, %dl, %bl). The byte-size versions of the other IA32 registers are named %sil,

%dil, %spl, and %bpl. The byte-size versions of the new registers are named %r8b–%r15b.

• For backward compatibility, the second byte of registers %rax, %rcx, %rdx, and %rbx can be

directly accessed by instructions having single-byte operands.
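
The relationship between a 64-bit register and its narrower names can be modeled in C with casts and shifts. The sketch below is only an analogy: the function names are hypothetical, and the code manipulates an ordinary 64-bit integer rather than an actual register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model (not from the original text): treat a uint64_t as
 * the contents of a register such as %rax and extract the sub-fields
 * that the narrower register names refer to. */
uint32_t low32(uint64_t r)       { return (uint32_t)r; }        /* like %eax */
uint16_t low16(uint64_t r)       { return (uint16_t)r; }        /* like %ax  */
uint8_t  low8(uint64_t r)        { return (uint8_t)r; }         /* like %al  */
uint8_t  second_byte(uint64_t r) { return (uint8_t)(r >> 8); }  /* like %ah  */

/* Self-check using an arbitrary bit pattern. */
void check_register_views(void)
{
    uint64_t r = 0x1122334455667788ULL;
    assert(low32(r) == 0x55667788U);
    assert(low16(r) == 0x7788U);
    assert(low8(r) == 0x88U);
    assert(second_byte(r) == 0x77U);
}
```

Each view keeps only the low-order bits named in the list above; second_byte corresponds to bits 15-8, which only %rax, %rcx, %rdx, and %rbx expose under a separate name.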

As with IA32, most of the registers can be used interchangeably, but there are some special cases. Register

%rsp has special status, in that it holds a pointer to the top stack element. Unlike in IA32, however, there

is no frame pointer register; register %rbp is available for use as a general-purpose register. Particular

conventions are used for passing procedure arguments via registers and for how registers are to be saved

and restored during procedure calls, as is discussed in Section 6. In addition, some arithmetic

instructions make special use of registers %rax and %rdx.

For the most part, the operand specifiers of x86-64 are just the same as those in IA32 (see CS:APP

Figure 3.3). One minor difference is that some forms of PC-relative operand addressing are supported. With

IA32, this form of addressing is only supported for jump and other control transfer instructions (see CS:APP

Section 3.6.3). This mode is provided to compensate for the fact that the offsets (shown in CS:APP

Figure 3.3 as Imm) are only 32 bits long. By viewing this field as a 32-bit, two’s complement number,

instructions can access data within a window of around ±2.15 × 10^9 relative to the program counter. With x86-64,

the program counter is named %rip.

As an example of PC-relative data addressing, consider the following procedure, which calls the function

simple_l examined earlier:

    long int gval1 = 567;
    long int gval2 = 763;

    long int call_simple_l()
    {
        long int z = simple_l(&gval1, 12L);
        return z + gval2;
    }

This code references global variables gval1 and gval2. When this function is compiled, assembled, and

linked, we get the following executable code (as generated by the disassembler objdump)

1  0000000000400500 <call_simple_l>:
2    400500: be 0c 00 00 00          mov    $0xc,%esi             Load 12 as 2nd argument
3    400505: bf 08 12 50 00          mov    $0x501208,%edi        Load &gval1 as 1st argument
4    40050a: e8 b1 ff ff ff          callq  4004c0 <simple_l>     Call simple_l
5    40050f: 48 03 05 ea 0c 10 00    add    1051882(%rip),%rax    Add gval2 to result
6    400516: c3                      retq
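
The PC-relative arithmetic in this listing can be reproduced by hand. The sketch below assumes the usual x86-64 rule that the displacement is added to the address of the instruction following the one being executed. The helper names are hypothetical, and the addresses are simply those printed in the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original text): computes the
 * address named by a PC-relative operand, given the address of the
 * next instruction and the signed 32-bit displacement. */
uint64_t rip_relative_target(uint64_t next_insn_addr, int32_t disp)
{
    return next_insn_addr + (uint64_t)(int64_t)disp;
}

/* Models a 32-bit register write: the value is zero-extended to
 * 64 bits, so the upper 32 bits of the result are zero. */
uint64_t write_low32(uint32_t value)
{
    return (uint64_t)value;
}

/* Checks against the numbers in the disassembly above. */
void check_listing_arithmetic(void)
{
    /* Line 5: the add ends at 0x400516; displacement is 1051882. */
    assert(rip_relative_target(0x400516ULL, 1051882) == 0x501200ULL);
    /* Line 4: displacement bytes b1 ff ff ff encode -79. */
    assert(rip_relative_target(0x40050fULL, -79) == 0x4004c0ULL);
    /* Line 3: the constant fits in 32 bits and is zero-extended. */
    assert(write_low32(0x501208U) == 0x0000000000501208ULL);
}
```

Under that assumption, the operand on line 5 resolves to address 0x501200, and the call on line 4 resolves to 0x4004c0, which matches the target that objdump prints for simple_l.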

The instruction on line 3 stores the address of global variable gval1 in register %rdi. It does this by simply

copying the constant value 0x501208 into register %edi. The upper 32 bits of %rdi are then automatically

set to zero. The instruction on line 5 retrieves the value of gval2 and adds it to the value returned

Instruction     Effect                 Description
movq S, D       D ← S                  Move quad word
movabsq I, R    R ← I                  Move quad word
movslq S, R     R ← SignExtend(S)      Move sign-extended double word
movsbq S, R     R ← SignExtend(S)      Move sign-extended byte
movzbq S, R     R ← ZeroExtend(S)      Move zero-extended byte
pushq S         R[%rsp] ← R[%rsp]