bootparams: from the bootloader to the kernel

The header in setup

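header.S defines, in assembly, a structure named hdr. Its layout is fixed by the boot protocol agreed between the bootloader and the kernel's setup code:
which field sits at which offset, and what each field means, is common knowledge on both sides. The protocol can even be thought of as something like a TCP/IP packet format: a particular variable at a particular offset carries a particular meaning.
Some fields are read from here by the bootloader (for example setup_sects, the number of setup sectors, and syssize, the size of the protected-mode kernel counted in 16-byte units); others are written here by the bootloader (for example type_of_loader, the bootloader type).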
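To make the "fixed offset, fixed meaning" idea concrete, here is a minimal sketch of how a loader might read a few of these fields from the first bytes of a bzImage. The offsets 0x1f1, 0x1fe and 0x206 come from the boot protocol; the helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets from the start of the image, as given by the boot protocol. */
#define OFF_SETUP_SECTS 0x1f1
#define OFF_BOOT_FLAG   0x1fe
#define OFF_VERSION     0x206

/* Read a little-endian 16-bit value from a byte buffer. */
uint16_t read_le16(const uint8_t *buf, size_t off)
{
	return (uint16_t)(buf[off] | (buf[off + 1] << 8));
}

/* Number of 512-byte setup sectors; the protocol treats 0 as 4. */
unsigned setup_sectors(const uint8_t *image)
{
	assert(image != NULL);
	return image[OFF_SETUP_SECTS] ? image[OFF_SETUP_SECTS] : 4;
}

/* Nonzero if the 0xAA55 boot flag is present. */
int has_boot_flag(const uint8_t *image)
{
	assert(image != NULL);
	return read_le16(image, OFF_BOOT_FLAG) == 0xAA55;
}

/* Boot protocol version, for example 0x020c for the header in this article. */
uint16_t protocol_version(const uint8_t *image)
{
	assert(image != NULL);
	return read_le16(image, OFF_VERSION);
}
```

A loader would typically call these on the first 1 KB of the image before deciding how many sectors of setup code to load.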

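Two things to note here:

  • The definition of hdr

The assembly code defines the symbol hdr; this is the very definition that the hdr variable referenced later in main.c resolves to.

  • cmd_line_ptr

This field holds a 32-bit pointer to the kernel command line that the bootloader passes to the kernel.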

///@file: //linux-3.12.6\arch\x86\boot\header.S
	.globl	hdr
hdr:
setup_sects:	.byte 0			/* Filled in by build.c */
root_flags:	.word ROOT_RDONLY
syssize:	.long 0			/* Filled in by build.c */
ram_size:	.word 0			/* Obsolete */
vid_mode:	.word SVGA_MODE
root_dev:	.word 0			/* Filled in by build.c */
boot_flag:	.word 0xAA55

	# offset 512, entry point

	.globl	_start
_start:
		# Explicitly enter this as bytes, or the assembler
		# tries to generate a 3-byte jump here, which causes
		# everything else to push off to the wrong offset.
		.byte	0xeb		# short (2-byte) jump
		.byte	start_of_setup-1f
1:

	# Part 2 of the header, from the old setup.S

		.ascii	"HdrS"		# header signature
		.word	0x020c		# header version number (>= 0x0105)
					# or else old loadlin-1.5 will fail)
		.globl realmode_swtch
realmode_swtch:	.word	0, 0		# default_switch, SETUPSEG
start_sys_seg:	.word	SYSSEG		# obsolete and meaningless, but just
					# in case something decided to "use" it
		.word	kernel_version-512 # pointing to kernel version string
					# above section of header is compatible
					# with loadlin-1.5 (header v1.5). Don't
					# change it.

type_of_loader:	.byte	0		# 0 means ancient bootloader, newer
					# bootloaders know to change this.
					# See Documentation/x86/boot.txt for
					# assigned ids

# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
		.byte	LOADED_HIGH	# The kernel is to be loaded high

setup_move_size: .word  0x8000		# size to move, when setup is not
					# loaded at 0x90000. We will move setup
					# to 0x90000 then just before jumping
					# into the kernel. However, only the
					# loader knows how much data behind
					# us also needs to be loaded.

code32_start:				# here loaders can put a different
					# start address for 32-bit code.
		.long	0x100000	# 0x100000 = default for big kernel

ramdisk_image:	.long	0		# address of loaded ramdisk image
					# Here the loader puts the 32-bit
					# address where it loaded the image.
					# This only will be read by the kernel.

ramdisk_size:	.long	0		# its size in bytes

bootsect_kludge:
		.long	0		# obsolete

heap_end_ptr:	.word	_end+STACK_SIZE-512
					# (Header version 0x0201 or later)
					# space from here (exclusive) down to
					# end of setup code can be used by setup
					# for local heap purposes.

ext_loader_ver:
		.byte	0		# Extended boot loader version
ext_loader_type:
		.byte	0		# Extended boot loader type

cmd_line_ptr:	.long	0		# (Header version 0x0202 or later)
					# If nonzero, a 32-bit pointer
					# to the kernel command line.
					# The command line should be
					# located between the start of
					# setup and the end of low
					# memory (0xa0000), or it may
					# get overwritten before it
					# gets read.  If this field is
					# used, there is no longer
					# anything magical about the
					# 0x90000 segment; the setup
					# can be located anywhere in
					# low memory 0x10000 or higher.

ramdisk_max:	.long 0x7fffffff
					# (Header version 0x0203 or later)
					# The highest safe address for
					# the contents of an initrd
					# The current kernel allows up to 4 GB,
					# but leave it at 2 GB to avoid
					# possible bootloader bugs.

kernel_alignment:  .long CONFIG_PHYSICAL_ALIGN	#physical addr alignment
						#required for protected mode
						#kernel
#ifdef CONFIG_RELOCATABLE
relocatable_kernel:    .byte 1
#else
relocatable_kernel:    .byte 0
#endif
min_alignment:		.byte MIN_KERNEL_ALIGN_LG2	# minimum alignment

xloadflags:
#ifdef CONFIG_X86_64
# define XLF0 XLF_KERNEL_64			/* 64-bit kernel */
#else
# define XLF0 0
#endif

#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_X86_64)
   /* kernel/boot_param/ramdisk could be loaded above 4g */
# define XLF1 XLF_CAN_BE_LOADED_ABOVE_4G
#else
# define XLF1 0
#endif

#ifdef CONFIG_EFI_STUB
# ifdef CONFIG_X86_64
#  define XLF23 XLF_EFI_HANDOVER_64		/* 64-bit EFI handover ok */
# else
#  define XLF23 XLF_EFI_HANDOVER_32		/* 32-bit EFI handover ok */
# endif
#else
# define XLF23 0
#endif
			.word XLF0 | XLF1 | XLF23

cmdline_size:   .long   COMMAND_LINE_SIZE-1     #length of the command line,
                                                #added with boot protocol
                                                #version 2.06

hardware_subarch:	.long 0			# subarchitecture, added with 2.07
						# default to 0 for normal x86 PC

hardware_subarch_data:	.quad 0

payload_offset:		.long ZO_input_data
payload_length:		.long ZO_z_input_len

setup_data:		.quad 0			# 64-bit physical pointer to
						# single linked list of
						# struct setup_data

pref_address:		.quad LOAD_PHYSICAL_ADDR	# preferred load addr

#define ZO_INIT_SIZE	(ZO__end - ZO_startup_32 + ZO_z_extract_offset)
#define VO_INIT_SIZE	(VO__end - VO__text)
#if ZO_INIT_SIZE > VO_INIT_SIZE
#define INIT_SIZE ZO_INIT_SIZE
#else
#define INIT_SIZE VO_INIT_SIZE
#endif
init_size:		.long INIT_SIZE		# kernel initialization size
handover_offset:
#ifdef CONFIG_EFI_STUB
  			.long 0x30		# offset to the handover
						# protocol entry point
#else
			.long 0
#endif

# End of setup header #####################################################
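Seen from the bootloader side, the fields it fills in are plain stores at fixed offsets inside the setup image it has loaded. The sketch below assumes the offsets 0x210 (type_of_loader) and 0x228 (cmd_line_ptr) from the boot protocol; the function names and the loader id used in the self-check are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFF_TYPE_OF_LOADER 0x210
#define OFF_CMD_LINE_PTR   0x228

/* Store a 32-bit value in little-endian byte order. */
void write_le32(uint8_t *buf, size_t off, uint32_t v)
{
	buf[off]     = (uint8_t)v;
	buf[off + 1] = (uint8_t)(v >> 8);
	buf[off + 2] = (uint8_t)(v >> 16);
	buf[off + 3] = (uint8_t)(v >> 24);
}

/* What a hypothetical loader might do once setup is in memory:
 * identify itself and tell the kernel where the command line is. */
void loader_fill_header(uint8_t *setup, uint8_t loader_id, uint32_t cmdline_addr)
{
	assert(setup != NULL);
	setup[OFF_TYPE_OF_LOADER] = loader_id;
	write_le32(setup, OFF_CMD_LINE_PTR, cmdline_addr);
}
```

The kernel side never parses these bytes explicitly: because struct setup_header is packed, the same stores simply show up as boot_params.hdr.type_of_loader and boot_params.hdr.cmd_line_ptr after the copy described in the next section.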


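boot_params in setup

The copy_boot_params function in setup copies the whole header, including the fields written by the bootloader, into the hdr member of setup's boot_params variable.
As the comment at the top of the file says, the CPU is still in real mode at this point, and the C code links seamlessly against the assembly code: the hdr symbol copied here is the one defined in header.S.
Note also that this boot_params variable lives in the setup part, not in the kernel proper (vmlinux).

The code below copies the hdr structure defined in header.S into boot_params.hdr.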

//linux-3.12.6\arch\x86\boot\main.c

/*
 * Main module for the real-mode kernel code
 */

#include "boot.h"

struct boot_params boot_params __attribute__((aligned(16)));

char *HEAP = _end;
char *heap_end = _end;		/* Default end of heap = no heap */

/*
 * Copy the header into the boot parameter block.  Since this
 * screws up the old-style command line protocol, adjust by
 * filling in the new-style command line pointer instead.
 */

static void copy_boot_params(void)
{
	struct old_cmdline {
		u16 cl_magic;
		u16 cl_offset;
	};
	const struct old_cmdline * const oldcmd =
		(const struct old_cmdline *)OLD_CL_ADDRESS;

	BUILD_BUG_ON(sizeof boot_params != 4096);
	memcpy(&boot_params.hdr, &hdr, sizeof hdr);

	if (!boot_params.hdr.cmd_line_ptr &&
	    oldcmd->cl_magic == OLD_CL_MAGIC) {
		/* Old-style command line protocol. */
		u16 cmdline_seg;

		/* Figure out if the command line falls in the region
		   of memory that an old kernel would have copied up
		   to 0x90000... */
		if (oldcmd->cl_offset < boot_params.hdr.setup_move_size)
			cmdline_seg = ds();
		else
			cmdline_seg = 0x9000;

		boot_params.hdr.cmd_line_ptr =
			(cmdline_seg << 4) + oldcmd->cl_offset;
	}
}
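The old-style branch above boils down to one real-mode address computation: pick a segment, shift it left by 4, and add the offset. The following sketch isolates that arithmetic so it can be tried outside the kernel; the function names are hypothetical and ds is passed in as a parameter instead of being read from the segment register.

```c
#include <assert.h>
#include <stdint.h>

/* Linear address of an old-style command line.
 * If the offset falls inside the region an old kernel would have moved
 * to 0x90000 (below setup_move_size), the data is still relative to the
 * current data segment; otherwise it sits in segment 0x9000. */
uint32_t old_cmdline_ptr(uint16_t ds, uint16_t cl_offset, uint16_t setup_move_size)
{
	uint16_t seg = (cl_offset < setup_move_size) ? ds : 0x9000;

	return ((uint32_t)seg << 4) + cl_offset;
}

/* Two worked cases with setup loaded at segment 0x1000. */
void old_cmdline_selfcheck(void)
{
	assert(old_cmdline_ptr(0x1000, 0x7000, 0x8000) == 0x17000);
	assert(old_cmdline_ptr(0x1000, 0x9000, 0x8000) == 0x99000);
}
```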

Below is the C description of what the assembly defines; it can be compared field by field with the assembly variable definitions.

///@file: linux-3.12.6\arch\x86\include\uapi\asm\bootparam.h
struct setup_header {
	__u8	setup_sects;
	__u16	root_flags;
	__u32	syssize;
	__u16	ram_size;
	__u16	vid_mode;
	__u16	root_dev;
	__u16	boot_flag;
	__u16	jump;
	__u32	header;
	__u16	version;
	__u32	realmode_swtch;
	__u16	start_sys;
	__u16	kernel_version;
	__u8	type_of_loader;
	__u8	loadflags;
	__u16	setup_move_size;
	__u32	code32_start;
	__u32	ramdisk_image;
	__u32	ramdisk_size;
	__u32	bootsect_kludge;
	__u16	heap_end_ptr;
	__u8	ext_loader_ver;
	__u8	ext_loader_type;
	__u32	cmd_line_ptr;
	__u32	initrd_addr_max;
	__u32	kernel_alignment;
	__u8	relocatable_kernel;
	__u8	min_alignment;
	__u16	xloadflags;
	__u32	cmdline_size;
	__u32	hardware_subarch;
	__u64	hardware_subarch_data;
	__u32	payload_offset;
	__u32	payload_length;
	__u64	setup_data;
	__u64	pref_address;
	__u32	init_size;
	__u32	handover_offset;
} __attribute__((packed));
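One way to convince yourself that the C structure and the assembly agree is to check field offsets. The sketch below re-declares the structure with stdint types under a hypothetical name and adds 0x1f1, the position of hdr within the boot sector, to each offsetof result. It assumes a compiler that supports __attribute__((packed)), such as GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_IMAGE_OFFSET 0x1f1	/* where hdr starts inside the image */

struct setup_header_sketch {
	uint8_t  setup_sects;
	uint16_t root_flags;
	uint32_t syssize;
	uint16_t ram_size, vid_mode, root_dev, boot_flag;
	uint16_t jump;
	uint32_t header;
	uint16_t version;
	uint32_t realmode_swtch;
	uint16_t start_sys, kernel_version;
	uint8_t  type_of_loader, loadflags;
	uint16_t setup_move_size;
	uint32_t code32_start, ramdisk_image, ramdisk_size, bootsect_kludge;
	uint16_t heap_end_ptr;
	uint8_t  ext_loader_ver, ext_loader_type;
	uint32_t cmd_line_ptr, initrd_addr_max, kernel_alignment;
	uint8_t  relocatable_kernel, min_alignment;
	uint16_t xloadflags;
	uint32_t cmdline_size, hardware_subarch;
	uint64_t hardware_subarch_data;
	uint32_t payload_offset, payload_length;
	uint64_t setup_data, pref_address;
	uint32_t init_size, handover_offset;
} __attribute__((packed));

/* Translate an offset within hdr into an offset within the image. */
size_t hdr_image_offset(size_t offset_in_hdr)
{
	return HDR_IMAGE_OFFSET + offset_in_hdr;
}

/* The results match the offsets listed in the boot protocol. */
void hdr_layout_selfcheck(void)
{
	assert(hdr_image_offset(offsetof(struct setup_header_sketch, boot_flag)) == 0x1fe);
	assert(hdr_image_offset(offsetof(struct setup_header_sketch, jump)) == 0x200);
	assert(hdr_image_offset(offsetof(struct setup_header_sketch, code32_start)) == 0x214);
	assert(hdr_image_offset(offsetof(struct setup_header_sketch, cmd_line_ptr)) == 0x228);
}
```

Without the packed attribute the compiler would insert padding after setup_sects and every offset after it would shift, which is why the kernel declaration carries it.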

protected-mode

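At the end of setup the CPU enters protected mode. Two key arguments are passed along at that point: the address of the 32-bit entry code (code32_start) and the address of boot_params.
Note that both are physical (linear) addresses computed while still in real mode: &boot_params is only an offset within the data segment, so ds() << 4 is added to it, and boot_params itself still belongs to the setup part of the bzImage.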

/*
 * Actual invocation sequence
 */
void go_to_protected_mode(void)
{
	/* Hook before leaving real mode, also disables interrupts */
	realmode_switch_hook();

	/* Enable the A20 gate */
	if (enable_a20()) {
		puts("A20 gate not responding, unable to boot...\n");
		die();
	}

	/* Reset coprocessor (IGNNE#) */
	reset_coprocessor();

	/* Mask all interrupts in the PIC */
	mask_all_interrupts();

	/* Actual transition to protected mode... */
	setup_idt();
	setup_gdt();
	protected_mode_jump(boot_params.hdr.code32_start,
			    (u32)&boot_params + (ds() << 4));
}

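As an aside, arguments are passed differently here than on x86-64. The setup code is built as 32-bit code with -mregparm=3, so the first three arguments go in eax, edx and ecx, in that order. In the call to protected_mode_jump the first argument is therefore in eax and the second in edx. In the disassembly below, mov 0x214,%eax loads code32_start (0x214 is its offset within boot_params) just before the final call.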

tsecer@harry: gdb -quiet arch/x86/boot/pm.o 
Reading symbols from arch/x86/boot/pm.o...
(gdb) set architecture i
i386               i386:intel         i386:x64-32        i386:x64-32:intel  i386:x86-64        i386:x86-64:intel  i8086              
(gdb) set architecture i8086 
warning: A handler for the OS ABI "GNU/Linux" is not built into this configuration
of GDB.  Attempting to continue with the default i8086 settings.

The target architecture is set to "i8086".
(gdb) disas go_to_protected_mode
Dump of assembler code for function go_to_protected_mode:
   0x00000000 <+0>:     push   %ebx
   0x00000002 <+2>:     cmpl   $0x0,0x208
   0x00000008 <+8>:     je     0x10 <go_to_protected_mode+16>
   0x0000000a <+10>:    lcall  *0x208
   0x0000000e <+14>:    jmp    0x17 <go_to_protected_mode+23>
   0x00000010 <+16>:    cli    
   0x00000011 <+17>:    mov    $0x80,%al
   0x00000013 <+19>:    out    %al,$0x70
   0x00000015 <+21>:    out    %al,$0x80
   0x00000017 <+23>:    calll  0x19 <go_to_protected_mode+25>
   0x0000001d <+29>:    test   %eax,%eax
   0x00000020 <+32>:    je     0x34 <go_to_protected_mode+52>
   0x00000022 <+34>:    mov    $0x0,%eax
   0x00000028 <+40>:    calll  0x2a <go_to_protected_mode+42>
   0x0000002e <+46>:    calll  0x30 <go_to_protected_mode+48>
   0x00000034 <+52>:    xor    %eax,%eax
   0x00000037 <+55>:    out    %al,$0xf0
   0x00000039 <+57>:    out    %al,$0x80
   0x0000003b <+59>:    out    %al,$0xf1
   0x0000003d <+61>:    out    %al,$0x80
   0x0000003f <+63>:    mov    $0xff,%al
   0x00000041 <+65>:    out    %al,$0xa1
   0x00000043 <+67>:    out    %al,$0x80
   0x00000045 <+69>:    mov    $0xfb,%al
   0x00000047 <+71>:    out    %al,$0x21
   0x00000049 <+73>:    out    %al,$0x80
   0x0000004b <+75>:    lidtl  0x28
   0x00000051 <+81>:    movw   $0x27,0x0
   0x00000057 <+87>:    mov    %ds,%dx
   0x00000059 <+89>:    movzwl %dx,%edx
   0x0000005d <+93>:    shl    $0x4,%edx
   0x00000061 <+97>:    lea    0x0(%edx),%eax
   0x00000069 <+105>:   mov    %eax,0x2
   0x0000006d <+109>:   lgdtl  0x0
   0x00000073 <+115>:   add    $0x0,%edx
   0x0000007a <+122>:   mov    0x214,%eax
   0x0000007e <+126>:   calll  0x80 <go_to_protected_mode+128>
End of assembler dump.
(gdb) 

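How the kernel startup code hands the data over

As the comment in the code says, rsi holds the pointer to the real-mode boot_params structure. To match the x86-64 function calling ABI, where the first argument is passed in rdi, the code copies rsi into rdi and then calls x86_64_start_kernel.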

//linux-3.12.6\arch\x86\kernel\head_64.S
	/* rsi is pointer to real mode structure with interesting info.
	   pass it to C */
	movq	%rsi, %rdi
///...
	movq	initial_code(%rip),%rax
///...
	GLOBAL(initial_code)
	.quad	x86_64_start_kernel

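Copying the real-mode data at kernel startup

x86_64_start_kernel uses __va to turn the real-mode (physical) address into a virtual address, and copy_bootdata then copies the data into the kernel's own variables.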

asmlinkage void __init x86_64_start_kernel(char * real_mode_data)
{
	int i;

	/*
	 * Build-time sanity checks on the kernel image and module
	 * area mappings. (these are purely build-time and produce no code)
	 */
	BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
	BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
	BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
	BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
	BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
	BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
	BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
				(__START_KERNEL & PGDIR_MASK)));
	BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

	/* Kill off the identity-map trampoline */
	reset_early_page_tables();

	/* clear bss before set_intr_gate with early_idt_handler */
	clear_bss();

	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
		set_intr_gate(i, &early_idt_handlers[i]);
	load_idt((const struct desc_ptr *)&idt_descr);

	copy_bootdata(__va(real_mode_data));

	/*
	 * Load microcode early on BSP.
	 */
	load_ucode_bsp();

	if (console_loglevel == 10)
		early_printk("Kernel alive\n");

	clear_page(init_level4_pgt);
	/* set init_level4_pgt kernel high mapping*/
	init_level4_pgt[511] = early_level4_pgt[511];

	x86_64_start_reservations(real_mode_data);
}
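For an address in the direct mapping, __va is just an addition of the direct-map base. The sketch below reproduces that arithmetic in user space with the 3.12 value of __PAGE_OFFSET; the names are hypothetical, and the 64 TB bound is taken from the size of the direct-mapping region in mm.txt.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_SKETCH 0xffff880000000000ULL	/* __PAGE_OFFSET in 3.12 */
#define DIRECT_MAP_SIZE    (64ULL << 40)		/* 64 TB direct mapping */

/* Direct-map virtual address of a physical address. */
uint64_t va_sketch(uint64_t phys)
{
	assert(phys < DIRECT_MAP_SIZE);
	return phys + PAGE_OFFSET_SKETCH;
}

/* A boot_params block placed at physical 0x90000 by the loader. */
void va_selfcheck(void)
{
	assert(va_sketch(0x90000) == 0xffff880000090000ULL);
}
```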

Virtual-to-physical address translation

Macros used

///\linux-3.12.6\arch\x86\include\asm\page_64_types.h
/*
 * Set __PAGE_OFFSET to the most negative possible address +
 * PGDIR_SIZE*16 (pgd slot 272).  The gap is to allow a space for a
 * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
 * what Xen requires.
 */
#define __PAGE_OFFSET           _AC(0xffff880000000000, UL)

#define __START_KERNEL_map	_AC(0xffffffff80000000, UL)

/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
#define __PHYSICAL_MASK_SHIFT	46
#define __VIRTUAL_MASK_SHIFT	47

/*
 * Kernel image size is limited to 512 MB (see level2_kernel_pgt in
 * arch/x86/kernel/head_64.S), and it is mapped here:
 */
#define KERNEL_IMAGE_SIZE	(512 * 1024 * 1024)

#define __START_KERNEL		(__START_KERNEL_map + __PHYSICAL_START)

With this layout, the address alone is enough to tell precisely whether it belongs to the kernel image (vmlinux).

static inline unsigned long __phys_addr_nodebug(unsigned long x)
{
	unsigned long y = x - __START_KERNEL_map;

	/* use the carry flag to determine if x was < __START_KERNEL_map */
	x = y + ((x > y) ? phys_base : (__START_KERNEL_map - PAGE_OFFSET));

	return x;
}
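The "carry flag" comment is easier to follow with concrete numbers. Below is a user-space sketch of the same computation using 64-bit unsigned arithmetic; phys_base is passed as a parameter, and the function name is hypothetical. If x lies below __START_KERNEL_map, the subtraction wraps around and y ends up larger than x, which is exactly what the comparison detects.

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP   0xffffffff80000000ULL
#define PAGE_OFFSET_SKETCH 0xffff880000000000ULL

/* Virtual to physical, handling both kernel mappings. */
uint64_t phys_addr_sketch(uint64_t x, uint64_t phys_base)
{
	uint64_t y = x - START_KERNEL_MAP;	/* wraps if x is below the kernel map */

	/* x > y holds exactly when the subtraction did not wrap. */
	return y + ((x > y) ? phys_base : (START_KERNEL_MAP - PAGE_OFFSET_SKETCH));
}

void phys_addr_selfcheck(void)
{
	/* Kernel-image address: offset from the map base plus phys_base. */
	assert(phys_addr_sketch(0xffffffff81000000ULL, 0) == 0x1000000);
	/* Direct-map address: the correction term undoes the wraparound. */
	assert(phys_addr_sketch(0xffff880000090000ULL, 0) == 0x90000);
}
```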

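The kernel's logical address

When the kernel is built, it is linked on the assumption that it will be loaded at the logical address (__START_KERNEL_map + __PHYSICAL_START). In other words, the kernel image has a mapping region of its own.
The __PAGE_OFFSET offset carries no special meaning: it simply maps all memory as plain page resources. The __START_KERNEL_map mapping, by contrast, is specific to the kernel image. Both mappings reach the same physical pages (those used by the kernel), but an address in the __START_KERNEL_map range is unmistakably part of the kernel's address space. One benefit I can think of is that when a page fault or exception occurs, it is immediately clear that the address lies in the kernel region. My guess is that this is meant for debugging and similar scenarios.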

//linux-3.12.6\arch\x86\kernel\vmlinux.lds.S
#ifdef CONFIG_X86_32
#define LOAD_OFFSET __PAGE_OFFSET
#else
#define LOAD_OFFSET __START_KERNEL_map
#endif
///
SECTIONS
{
#ifdef CONFIG_X86_32
        . = LOAD_OFFSET + LOAD_PHYSICAL_ADDR;
        phys_startup_32 = startup_32 - LOAD_OFFSET;
#else
        . = __START_KERNEL;
        phys_startup_64 = startup_64 - LOAD_OFFSET;
#endif
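The idea that an address can be classified by its range alone can be written as a small check. This is only a sketch using the constants quoted earlier (__START_KERNEL_map and KERNEL_IMAGE_SIZE); the function name is hypothetical and it is not how the kernel itself performs the test.

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP         0xffffffff80000000ULL
#define KERNEL_IMAGE_SIZE_SKETCH (512ULL * 1024 * 1024)

/* Nonzero if va lies in the 512 MB kernel text mapping. */
int in_kernel_image_map(uint64_t va)
{
	return va >= START_KERNEL_MAP &&
	       va - START_KERNEL_MAP < KERNEL_IMAGE_SIZE_SKETCH;
}

void kernel_map_selfcheck(void)
{
	assert(in_kernel_image_map(0xffffffff81000000ULL));	/* kernel text */
	assert(!in_kernel_image_map(0xffff880001000000ULL));	/* direct map */
}
```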

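What the kernel documentation says

In 32-bit (4 GB) mode, the kernel takes 1 GB and user space takes 3 GB. In x86-64 mode only 48 of the 64 address bits are used; the upper 16 bits are not independent but must repeat bit 47 (all zeros for user-space addresses, all ones for kernel addresses). That gives 128 TB of user-space addresses and another 128 TB of kernel address space.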
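The 48-bit rule and the 128 TB figure can both be checked with a few lines of arithmetic. The sketch below treats an address as canonical when bits 63 through 47 are all equal; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if bits 63..47 of va are all zeros or all ones. */
int is_canonical48(uint64_t va)
{
	uint64_t top = va >> 47;	/* the 17 bits that must agree */

	return top == 0 || top == 0x1ffff;
}

/* Size of one half of the address space, in terabytes. */
uint64_t half_space_tb(void)
{
	return (1ULL << 47) >> 40;
}

void canonical_selfcheck(void)
{
	assert(is_canonical48(0x00007fffffffffffULL));	/* top of user space */
	assert(is_canonical48(0xffff800000000000ULL));	/* bottom of kernel space */
	assert(!is_canonical48(0x0000800000000000ULL));	/* inside the hole */
	assert(half_space_tb() == 128);
}
```

The two boundary addresses in the self-check are the same ones that appear in the memory map quoted from mm.txt.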

//linux-3.12.6\Documentation\x86\x86_64\mm.txt

<previous description obsolete, deleted>

Virtual memory map with 4 level page tables:

0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
hole caused by [48:63] sign extension
ffff800000000000 - ffff80ffffffffff (=40 bits) guard hole
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
... unused hole ...
ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole

The direct mapping covers all memory in the system up to the highest
memory address (this means in some cases it can also include PCI memory
holes).

vmalloc space is lazily synchronized into the different PML4 pages of
the processes using the page fault handler, with init_level4_pgt as
reference.

Current X86-64 implementations only support 40 bits of address space,
but we support up to 46 bits. This expands into MBZ space in the page tables.

-Andi Kleen, Jul 2004

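Why addresses use only 48 bits instead of all 64

The answer quoted below gives an explanation. Roughly: the number of address bits grows as an arithmetic progression, not a geometric one. Going from 16 bits to 32 bits added 16 bits, so extending 32 bits adds another 16 bits, giving 48.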

No 64-bit processor that I know of fully supports 64-bit addresses. The registers are 64 bits wide, and 8 bytes are used for storing a pointer, but the pointer values are typically constrained to effective 48 bits by forcing the most significant bits to be all zeroes or all ones.

The reason for this is that a full 64-bit address space is not (yet) needed, and it would be a waste of silicon to support something that is not needed. Supporting the full 64-bit address space would complicate the virtual to physical mapping for no good.

Exponential growth in memory address space means adding a constant number of address bits in a given time, not doubling the number of bits. So, if one evolutionary step was going from 16 address bits to 32 bits, then the next step up from 32 bits is 32+16 = 48 bits, not 64 bits. Expanding the address registers to 64 bits makes sense, because 48 bits would be somewhat awkward to handle, and provides an architecture that is ready for "real" 64-bit addresses when the time for them has come.

Q2: why not? A bigger physical address space enables you to have multiple processes each with a virtual address space up to the 2^48 limit. Most 32-bit x86 processors in the last decade (two decades?) have supported the Physical Address Extension (PAE), which supports 64 GB physical memory, although the virtual address space is limited to 4 GB as defined by the original 80386 instruction set architecture.

Q3: yes, the page tables are managed by the kernel, and are stored in kernel memory.

posted on 2023-07-29 12:17  tsecer