bootparams from the bootloader to the kernel
The header in setup
header.S defines, in assembly, an hdr structure that the bootloader and the kernel's setup code agree upon through the boot protocol:
which field sits at which offset, and what each field means, is a consensus between the bootloader and the kernel. You could even think of this protocol as something like a TCP/IP packet format: a particular field at a particular offset carries a particular meaning.
Some fields the bootloader reads from here (for example, the number of setup sectors and the size of the kernel in bytes); others the bootloader writes into here (for example, the bootloader type).
Two points to note:
- The hdr variable definition
The assembly code defines a symbol hdr, which is also the definition of the hdr variable referenced by the main.c code shown later.
- cmd_line_ptr
This field holds the pointer to the command line (the parameter list) that the bootloader passes to the kernel.
///@file: linux-3.12.6\arch\x86\boot\header.S
.globl hdr
hdr:
setup_sects: .byte 0 /* Filled in by build.c */
root_flags: .word ROOT_RDONLY
syssize: .long 0 /* Filled in by build.c */
ram_size: .word 0 /* Obsolete */
vid_mode: .word SVGA_MODE
root_dev: .word 0 /* Filled in by build.c */
boot_flag: .word 0xAA55
# offset 512, entry point
.globl _start
_start:
# Explicitly enter this as bytes, or the assembler
# tries to generate a 3-byte jump here, which causes
# everything else to push off to the wrong offset.
.byte 0xeb # short (2-byte) jump
.byte start_of_setup-1f
1:
# Part 2 of the header, from the old setup.S
.ascii "HdrS" # header signature
.word 0x020c # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
.globl realmode_swtch
realmode_swtch: .word 0, 0 # default_switch, SETUPSEG
start_sys_seg: .word SYSSEG # obsolete and meaningless, but just
# in case something decided to "use" it
.word kernel_version-512 # pointing to kernel version string
# above section of header is compatible
# with loadlin-1.5 (header v1.5). Don't
# change it.
type_of_loader: .byte 0 # 0 means ancient bootloader, newer
# bootloaders know to change this.
# See Documentation/x86/boot.txt for
# assigned ids
# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
.byte LOADED_HIGH # The kernel is to be loaded high
setup_move_size: .word 0x8000 # size to move, when setup is not
# loaded at 0x90000. We will move setup
# to 0x90000 then just before jumping
# into the kernel. However, only the
# loader knows how much data behind
# us also needs to be loaded.
code32_start: # here loaders can put a different
# start address for 32-bit code.
.long 0x100000 # 0x100000 = default for big kernel
ramdisk_image: .long 0 # address of loaded ramdisk image
# Here the loader puts the 32-bit
# address where it loaded the image.
# This only will be read by the kernel.
ramdisk_size: .long 0 # its size in bytes
bootsect_kludge:
.long 0 # obsolete
heap_end_ptr: .word _end+STACK_SIZE-512
# (Header version 0x0201 or later)
# space from here (exclusive) down to
# end of setup code can be used by setup
# for local heap purposes.
ext_loader_ver:
.byte 0 # Extended boot loader version
ext_loader_type:
.byte 0 # Extended boot loader type
cmd_line_ptr: .long 0 # (Header version 0x0202 or later)
# If nonzero, a 32-bit pointer
# to the kernel command line.
# The command line should be
# located between the start of
# setup and the end of low
# memory (0xa0000), or it may
# get overwritten before it
# gets read. If this field is
# used, there is no longer
# anything magical about the
# 0x90000 segment; the setup
# can be located anywhere in
# low memory 0x10000 or higher.
ramdisk_max: .long 0x7fffffff
# (Header version 0x0203 or later)
# The highest safe address for
# the contents of an initrd
# The current kernel allows up to 4 GB,
# but leave it at 2 GB to avoid
# possible bootloader bugs.
kernel_alignment: .long CONFIG_PHYSICAL_ALIGN #physical addr alignment
#required for protected mode
#kernel
#ifdef CONFIG_RELOCATABLE
relocatable_kernel: .byte 1
#else
relocatable_kernel: .byte 0
#endif
min_alignment: .byte MIN_KERNEL_ALIGN_LG2 # minimum alignment
xloadflags:
#ifdef CONFIG_X86_64
# define XLF0 XLF_KERNEL_64 /* 64-bit kernel */
#else
# define XLF0 0
#endif
#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_X86_64)
/* kernel/boot_param/ramdisk could be loaded above 4g */
# define XLF1 XLF_CAN_BE_LOADED_ABOVE_4G
#else
# define XLF1 0
#endif
#ifdef CONFIG_EFI_STUB
# ifdef CONFIG_X86_64
# define XLF23 XLF_EFI_HANDOVER_64 /* 64-bit EFI handover ok */
# else
# define XLF23 XLF_EFI_HANDOVER_32 /* 32-bit EFI handover ok */
# endif
#else
# define XLF23 0
#endif
.word XLF0 | XLF1 | XLF23
cmdline_size: .long COMMAND_LINE_SIZE-1 #length of the command line,
#added with boot protocol
#version 2.06
hardware_subarch: .long 0 # subarchitecture, added with 2.07
# default to 0 for normal x86 PC
hardware_subarch_data: .quad 0
payload_offset: .long ZO_input_data
payload_length: .long ZO_z_input_len
setup_data: .quad 0 # 64-bit physical pointer to
# single linked list of
# struct setup_data
pref_address: .quad LOAD_PHYSICAL_ADDR # preferred load addr
#define ZO_INIT_SIZE (ZO__end - ZO_startup_32 + ZO_z_extract_offset)
#define VO_INIT_SIZE (VO__end - VO__text)
#if ZO_INIT_SIZE > VO_INIT_SIZE
#define INIT_SIZE ZO_INIT_SIZE
#else
#define INIT_SIZE VO_INIT_SIZE
#endif
init_size: .long INIT_SIZE # kernel initialization size
handover_offset:
#ifdef CONFIG_EFI_STUB
.long 0x30 # offset to the handover
# protocol entry point
#else
.long 0
#endif
# End of setup header #####################################################
boot_params in setup
In setup's copy_boot_params function, the parameter block written by the bootloader is copied wholesale into the hdr member of setup's boot_params variable.
As the comment at the top of the file says, the CPU is still in real mode at this point, and the C code can be mixed seamlessly with the assembly code.
Note also that this boot_params variable lives in the setup part of the bzImage, not in the kernel proper (vmlinux).
The code below copies the hdr structure defined in header.S into boot_params.hdr.
///@file: linux-3.12.6\arch\x86\boot\main.c
/*
* Main module for the real-mode kernel code
*/
#include "boot.h"
struct boot_params boot_params __attribute__((aligned(16)));
char *HEAP = _end;
char *heap_end = _end; /* Default end of heap = no heap */
/*
* Copy the header into the boot parameter block. Since this
* screws up the old-style command line protocol, adjust by
* filling in the new-style command line pointer instead.
*/
static void copy_boot_params(void)
{
struct old_cmdline {
u16 cl_magic;
u16 cl_offset;
};
const struct old_cmdline * const oldcmd =
(const struct old_cmdline *)OLD_CL_ADDRESS;
BUILD_BUG_ON(sizeof boot_params != 4096);
memcpy(&boot_params.hdr, &hdr, sizeof hdr);
if (!boot_params.hdr.cmd_line_ptr &&
oldcmd->cl_magic == OLD_CL_MAGIC) {
/* Old-style command line protocol. */
u16 cmdline_seg;
/* Figure out if the command line falls in the region
of memory that an old kernel would have copied up
to 0x90000... */
if (oldcmd->cl_offset < boot_params.hdr.setup_move_size)
cmdline_seg = ds();
else
cmdline_seg = 0x9000;
boot_params.hdr.cmd_line_ptr =
(cmdline_seg << 4) + oldcmd->cl_offset;
}
}
Below is the C-language description of the structure defined in the assembly; you can compare it field by field against the assembly definition.
///@file: linux-3.12.6\arch\x86\include\uapi\asm\bootparam.h
struct setup_header {
__u8 setup_sects;
__u16 root_flags;
__u32 syssize;
__u16 ram_size;
__u16 vid_mode;
__u16 root_dev;
__u16 boot_flag;
__u16 jump;
__u32 header;
__u16 version;
__u32 realmode_swtch;
__u16 start_sys;
__u16 kernel_version;
__u8 type_of_loader;
__u8 loadflags;
__u16 setup_move_size;
__u32 code32_start;
__u32 ramdisk_image;
__u32 ramdisk_size;
__u32 bootsect_kludge;
__u16 heap_end_ptr;
__u8 ext_loader_ver;
__u8 ext_loader_type;
__u32 cmd_line_ptr;
__u32 initrd_addr_max;
__u32 kernel_alignment;
__u8 relocatable_kernel;
__u8 min_alignment;
__u16 xloadflags;
__u32 cmdline_size;
__u32 hardware_subarch;
__u64 hardware_subarch_data;
__u32 payload_offset;
__u32 payload_length;
__u64 setup_data;
__u64 pref_address;
__u32 init_size;
__u32 handover_offset;
} __attribute__((packed));
protected-mode
At the end of setup, the code switches into protected mode. Two parameters matter most at this point: the address of the 32-bit entry code and the address of boot_params.
Note that these are still real-mode physical addresses, and both belong to the setup part of the bzImage.
///@file: linux-3.12.6\arch\x86\boot\pm.c
/*
* Actual invocation sequence
*/
void go_to_protected_mode(void)
{
/* Hook before leaving real mode, also disables interrupts */
realmode_switch_hook();
/* Enable the A20 gate */
if (enable_a20()) {
puts("A20 gate not responding, unable to boot...\n");
die();
}
/* Reset coprocessor (IGNNE#) */
reset_coprocessor();
/* Mask all interrupts in the PIC */
mask_all_interrupts();
/* Actual transition to protected mode... */
setup_idt();
setup_gdt();
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
}
As an aside: parameter passing in 386 mode differs from x86-64. The setup code is compiled with -mregparm=3, so arguments are passed in the registers eax, edx, ecx in that order; when protected_mode_jump is called, the first argument (code32_start) is therefore in eax and the second (the 32-bit linear address of boot_params) in edx.
tsecer@harry: gdb -quiet arch/x86/boot/pm.o
Reading symbols from arch/x86/boot/pm.o...
(gdb) set architecture i
i386 i386:intel i386:x64-32 i386:x64-32:intel i386:x86-64 i386:x86-64:intel i8086
(gdb) set architecture i8086
warning: A handler for the OS ABI "GNU/Linux" is not built into this configuration
of GDB. Attempting to continue with the default i8086 settings.
The target architecture is set to "i8086".
(gdb) disas go_to_protected_mode
Dump of assembler code for function go_to_protected_mode:
0x00000000 <+0>: push %ebx
0x00000002 <+2>: cmpl $0x0,0x208
0x00000008 <+8>: je 0x10 <go_to_protected_mode+16>
0x0000000a <+10>: lcall *0x208
0x0000000e <+14>: jmp 0x17 <go_to_protected_mode+23>
0x00000010 <+16>: cli
0x00000011 <+17>: mov $0x80,%al
0x00000013 <+19>: out %al,$0x70
0x00000015 <+21>: out %al,$0x80
0x00000017 <+23>: calll 0x19 <go_to_protected_mode+25>
0x0000001d <+29>: test %eax,%eax
0x00000020 <+32>: je 0x34 <go_to_protected_mode+52>
0x00000022 <+34>: mov $0x0,%eax
0x00000028 <+40>: calll 0x2a <go_to_protected_mode+42>
0x0000002e <+46>: calll 0x30 <go_to_protected_mode+48>
0x00000034 <+52>: xor %eax,%eax
0x00000037 <+55>: out %al,$0xf0
0x00000039 <+57>: out %al,$0x80
0x0000003b <+59>: out %al,$0xf1
0x0000003d <+61>: out %al,$0x80
0x0000003f <+63>: mov $0xff,%al
0x00000041 <+65>: out %al,$0xa1
0x00000043 <+67>: out %al,$0x80
0x00000045 <+69>: mov $0xfb,%al
0x00000047 <+71>: out %al,$0x21
0x00000049 <+73>: out %al,$0x80
0x0000004b <+75>: lidtl 0x28
0x00000051 <+81>: movw $0x27,0x0
0x00000057 <+87>: mov %ds,%dx
0x00000059 <+89>: movzwl %dx,%edx
0x0000005d <+93>: shl $0x4,%edx
0x00000061 <+97>: lea 0x0(%edx),%eax
0x00000069 <+105>: mov %eax,0x2
0x0000006d <+109>: lgdtl 0x0
0x00000073 <+115>: add $0x0,%edx
0x0000007a <+122>: mov 0x214,%eax
0x0000007e <+126>: calll 0x80 <go_to_protected_mode+128>
End of assembler dump.
(gdb)
Copying the data in the kernel startup code
As the comment says, rsi holds the pointer to the real-mode boot_params structure. To match the x86-64 function-calling ABI, where the first argument goes in rdi, the code copies rsi into rdi before calling x86_64_start_kernel.
///@file: linux-3.12.6\arch\x86\kernel\head_64.S
/* rsi is pointer to real mode structure with interesting info.
pass it to C */
movq %rsi, %rdi
///...
movq initial_code(%rip),%rax
///...
GLOBAL(initial_code)
.quad x86_64_start_kernel
Copying from real mode at kernel startup
In x86_64_start_kernel, the real-mode address is converted to a virtual address via __va, and copy_bootdata then copies the data into the kernel's own boot_params variable.
///@file: linux-3.12.6\arch\x86\kernel\head64.c
asmlinkage void __init x86_64_start_kernel(char * real_mode_data)
{
int i;
/*
* Build-time sanity checks on the kernel image and module
* area mappings. (these are purely build-time and produce no code)
*/
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
/* Kill off the identity-map trampoline */
reset_early_page_tables();
/* clear bss before set_intr_gate with early_idt_handler */
clear_bss();
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, &early_idt_handlers[i]);
load_idt((const struct desc_ptr *)&idt_descr);
copy_bootdata(__va(real_mode_data));
/*
* Load microcode early on BSP.
*/
load_ucode_bsp();
if (console_loglevel == 10)
early_printk("Kernel alive\n");
clear_page(init_level4_pgt);
/* set init_level4_pgt kernel high mapping*/
init_level4_pgt[511] = early_level4_pgt[511];
x86_64_start_reservations(real_mode_data);
}
Virtual/physical address conversion
The macros involved:
///@file: linux-3.12.6\arch\x86\include\asm\page_64_types.h
/*
* Set __PAGE_OFFSET to the most negative possible address +
* PGDIR_SIZE*16 (pgd slot 272). The gap is to allow a space for a
* hypervisor to fit. Choosing 16 slots here is arbitrary, but it's
* what Xen requires.
*/
#define __PAGE_OFFSET _AC(0xffff880000000000, UL)
#define __START_KERNEL_map _AC(0xffffffff80000000, UL)
/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
#define __PHYSICAL_MASK_SHIFT 46
#define __VIRTUAL_MASK_SHIFT 47
/*
* Kernel image size is limited to 512 MB (see level2_kernel_pgt in
* arch/x86/kernel/head_64.S), and it is mapped here:
*/
#define KERNEL_IMAGE_SIZE (512 * 1024 * 1024)
#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
This way, the kernel can determine precisely, from an address alone, whether it belongs to the kernel image (vmlinux) mapping:
///@file: linux-3.12.6\arch\x86\include\asm\page_64.h
static inline unsigned long __phys_addr_nodebug(unsigned long x)
{
unsigned long y = x - __START_KERNEL_map;
/* use the carry flag to determine if x was < __START_KERNEL_map */
x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));
return x;
}
The kernel's logical addresses
When the kernel image is built, it is linked as if loaded at the logical address (__START_KERNEL_map + __PHYSICAL_START). In other words, the kernel image has a dedicated logical mapping region of its own.
The __PAGE_OFFSET mapping carries no special meaning: it simply maps all of physical memory as resource pages. The __START_KERNEL_map mapping, by contrast, can be regarded as the kernel image's private mapping. Both mappings reach the same physical pages (those used by the kernel), but an address in the __START_KERNEL_map range makes it immediately clear that it belongs to the kernel image. One benefit I can think of is that when a page fault or exception occurs, you know at once that the address is in the kernel-image region; presumably this is useful for debugging and similar scenarios.
///@file: linux-3.12.6\arch\x86\kernel\vmlinux.lds.S
#ifdef CONFIG_X86_32
#define LOAD_OFFSET __PAGE_OFFSET
#else
#define LOAD_OFFSET __START_KERNEL_map
#endif
///
SECTIONS
{
#ifdef CONFIG_X86_32
. = LOAD_OFFSET + LOAD_PHYSICAL_ADDR;
phys_startup_32 = startup_32 - LOAD_OFFSET;
#else
. = __START_KERNEL;
phys_startup_64 = startup_64 - LOAD_OFFSET;
#endif
The kernel documentation
In 32-bit (4 GB) mode, the classic split gives the kernel 1 GB and user space 3 GB. In x86-64 mode, only 48 of the 64 address bits are used; the upper 16 bits are not independent, being a sign extension of bit 47 (all zeroes for user addresses, all ones for kernel addresses). That yields 128 TB of user-space addresses and another 128 TB for the kernel.
///@file: linux-3.12.6\Documentation\x86\x86_64\mm.txt
<previous description obsolete, deleted>
Virtual memory map with 4 level page tables:
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
hole caused by [48:63] sign extension
ffff800000000000 - ffff80ffffffffff (=40 bits) guard hole
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
... unused hole ...
ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
The direct mapping covers all memory in the system up to the highest
memory address (this means in some cases it can also include PCI memory
holes).
vmalloc space is lazily synchronized into the different PML4 pages of
the processes using the page fault handler, with init_level4_pgt as
reference.
Current X86-64 implementations only support 40 bits of address space,
but we support up to 46 bits. This expands into MBZ space in the page tables.
-Andi Kleen, Jul 2004
Why addresses use only 48 bits rather than all 64
The answer quoted below gives an explanation. Roughly: the number of address bits grows as an arithmetic progression (a constant number of bits per generation), not a geometric one. Going from 16 bits to 32 bits added 16 bits, so the next step up from 32 bits is likewise an extra 16 bits, to 48.
No 64-bit processor that I know of fully supports 64-bit addresses. The registers are 64 bits wide, and 8 bytes are used for storing a pointer, but the pointer values are typically constrained to effective 48 bits by forcing the most significant bits to be all zeroes or all ones.
The reason for this is that a full 64-bit address space is not (yet) needed, and it would be a waste of silicon to support something that is not needed. Supporting the full 64-bit address space would complicate the virtual to physical mapping for no good.
Exponential growth in memory address space means adding a constant number of address bits in a given time, not doubling the number of bits. So, if one evolutionary step was going from 16 address bits to 32 bits, then the next step up from 32 bits is 32+16 = 48 bits, not 64 bits. Expanding the address registers to 64 bits makes sense, because 48 bits would be somewhat awkward to handle, and provides an architecture that is ready for "real" 64-bit addresses when the time for them has come.
Q2: why not? A bigger physical address space enables you to have multiple processes each with a virtual address space up to the 2^48 limit. Most 32-bit x86 processors in the last decade (two decades?) have supported the Physical Address Extension (PAE), which supports 64 GB physical memory, although the virtual address space is limited to 4 GB as defined by the original 80386 instruction set architecture.
Q3: yes, the page tables are managed by the kernel, and are stored in kernel memory.