The Kernel Boot Process | Many But Finite
The Kernel Boot Process
The previous post explained how computers boot up right up to the point where the boot loader, after stuffing the kernel image into memory, is about to jump into the kernel entry point. This last post about booting takes a look at the guts of the kernel to see how an operating system starts life. Since I have an empirical bent I'll link heavily to the sources for Linux kernel 2.6.25.6 at the Linux Cross Reference. The sources are very readable if you are familiar with C-like syntax; even if you miss some details you can get the gist of what's happening. The main obstacle is the lack of context around some of the code, such as when or why it runs or the underlying features of the machine. I hope to provide a bit of that context. Due to brevity (hah!) a lot of fun stuff - like interrupts and memory - gets only a nod for now. The post ends with the highlights for the Windows boot.
At this point in the Intel x86 boot story the processor is running in real-mode, is able to address 1 MB of memory, and RAM looks like this for a modern Linux system:
RAM contents after boot loader is done
The kernel image has been loaded to memory by the boot loader using the BIOS disk I/O services. This image is an exact copy of the file on your hard drive that contains the kernel, e.g. /boot/vmlinuz-2.6.22-14-server. The image is split into two pieces: a small part containing the real-mode kernel code is loaded below the 640K barrier; the bulk of the kernel, which runs in protected mode, is loaded after the first megabyte of memory.
The action starts in the real-mode kernel header pictured above. This region of memory is used to implement the Linux boot protocol between the boot loader and the kernel. Some of the values there are read by the boot loader while doing its work. These include amenities such as a human-readable string containing the kernel version, but also crucial information like the size of the real-mode kernel piece. The boot loader also writes values to this region, such as the memory address for the command-line parameters given by the user in the boot menu. Once the boot loader is finished it has filled in all of the parameters required by the kernel header. It's then time to jump into the kernel entry point. The diagram below shows the code sequence for the kernel initialization, along with source directories, files, and line numbers:
Architecture-specific Linux Kernel Initialization
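To make the header discussion concrete, here is a minimal sketch of how a few of those fields could be read from a kernel image. The offsets (setup_sects at 0x1f1, the 0xAA55 boot flag at 0x1fe, the "HdrS" magic at 0x202, the kernel_version pointer at 0x20e) come from the Linux x86 boot protocol; the function names and the synthetic image are made up for illustration, and the protocol version stored in the fake image is an arbitrary example value.

```python
import struct

SECTOR = 512

def parse_setup_header(image: bytes) -> dict:
    """Read a handful of boot-protocol fields from a kernel image."""
    setup_sects = image[0x1F1]
    if setup_sects == 0:
        setup_sects = 4  # the protocol says 0 must be treated as 4
    (boot_flag,) = struct.unpack_from("<H", image, 0x1FE)
    magic = image[0x202:0x206]
    (protocol,) = struct.unpack_from("<H", image, 0x206)
    (version_ptr,) = struct.unpack_from("<H", image, 0x20E)

    kernel_version = ""
    if version_ptr:
        # The pointer is relative to the start of the setup code (0x200).
        start = version_ptr + 0x200
        end = image.index(b"\0", start)
        kernel_version = image[start:end].decode("ascii", "replace")

    return {
        # Boot sector plus setup sectors: the real-mode piece of the image.
        "real_mode_bytes": (setup_sects + 1) * SECTOR,
        "boot_flag": boot_flag,
        "magic": magic,
        "protocol": protocol,
        "kernel_version": kernel_version,
    }

def make_fake_image() -> bytes:
    """Build a tiny synthetic image so the parser can be tried without a kernel."""
    img = bytearray(2 * SECTOR)
    img[0x1F1] = 1                                # one setup sector
    struct.pack_into("<H", img, 0x1FE, 0xAA55)    # boot sector signature
    img[0x200:0x202] = b"\xeb\x3a"                # the 2-byte jump
    img[0x202:0x206] = b"HdrS"                    # header magic
    struct.pack_into("<H", img, 0x206, 0x0206)    # example protocol version
    struct.pack_into("<H", img, 0x20E, 0x100)     # version string at 0x300
    text = b"2.6.25.6 (example)\0"
    img[0x300:0x300 + len(text)] = text
    return bytes(img)

header = parse_setup_header(make_fake_image())
print(header)
```

To try it on a real image, read the file in binary mode and pass the bytes to parse_setup_header(). Fields that the boot loader writes, such as the command-line pointer, only hold meaningful values in the in-memory copy, not in the file on disk.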
The early kernel start-up for the Intel architecture is in file arch/x86/boot/header.S. It's in assembly language, which is rare for the kernel at large but common for boot code. The start of this file actually contains boot sector code, a leftover from the days when Linux could work without a boot loader. Nowadays this boot sector, if executed, only prints a "bugger_off_msg" to the user and reboots. Modern boot loaders ignore this legacy code. After the boot sector code we have the first 15 bytes of the real-mode kernel header, which starts at offset 0x1f1; these two pieces together add up to 512 bytes, the size of a typical disk sector on Intel hardware.
After these 512 bytes, at offset 0x200, we find the very first instruction that runs as part of the Linux kernel: the real-mode entry point. It's in header.S:110 and it is a 2-byte jump written directly in machine code as 0x3aeb (the bytes eb 3a in the file, since x86 is little-endian). You can verify this by running hexdump on your kernel image and seeing the bytes at that offset - just a sanity check to make sure it's not all a dream. The boot loader jumps into this location when it is finished, which in turn jumps to header.S:229 where we have a regular assembly routine called start_of_setup. This short routine sets up a stack for the real-mode kernel, zeroes its bss segment (the area that contains static variables, so they start with zero values) and then jumps to good old C code at arch/x86/boot/main.c:122.
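The 2-byte jump is easy to decode by hand, and a small sketch shows the arithmetic. Opcode 0xeb is the x86 short jump, and the byte after it is a signed displacement counted from the end of the instruction. The function name is hypothetical, and the resulting offset only holds for an image whose bytes at 0x200 really are eb 3a.

```python
def decode_short_jump(code: bytes, offset: int) -> int:
    """Return the target offset of an x86 short jump (opcode 0xEB) at `offset`."""
    opcode, disp = code[offset], code[offset + 1]
    if opcode != 0xEB:
        raise ValueError(f"not a short jump: opcode {opcode:#04x}")
    if disp >= 0x80:
        disp -= 0x100  # the displacement is a signed 8-bit value
    # The displacement is relative to the byte after the 2-byte instruction.
    return offset + 2 + disp

entry_bytes = bytes.fromhex("eb3a")
# Read as a little-endian 16-bit word, the two bytes give the value in the text.
as_word = int.from_bytes(entry_bytes, "little")

image = b"\0" * 0x200 + entry_bytes   # stand-in for a real kernel image
target = decode_short_jump(image, 0x200)
print(hex(as_word), hex(target))
```

With those bytes the jump lands 0x3a bytes past the end of the instruction, which skips over the rest of the kernel header to reach the setup code.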
main() does some housekeeping like detecting memory layout, setting a video mode, etc. It then calls go_to_protected_mode(). Before the CPU can be set to protected mode, however, a few tasks must be done. There are two main issues: interrupts and memory. In real-mode the interrupt vector table for the processor is always at memory address 0, whereas in protected mode the location of the interrupt vector table is stored in a CPU register called IDTR. Meanwhile, the translation of logical memory addresses (the ones programs manipulate) to linear memory addresses (a raw number from 0 to the top of the memory) is different between real-mode and protected mode. Protected mode requires a register called GDTR to be loaded with the address of a Global Descriptor Table for memory. So go_to_protected_mode() calls setup_idt() and setup_gdt() to install a temporary interrupt descriptor table and global descriptor table.
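The difference in address translation can be sketched in a few lines. In real mode the linear address is the segment value shifted left by four bits plus the offset; in protected mode the segment register selects a descriptor in the GDT, and the descriptor's base is added to the offset. The helper names below are hypothetical, and the flag values 0xc09b and 0xc093 are the conventional encodings for flat 4 GB code and data segments, used here as an assumption about what a temporary GDT would contain.

```python
def real_mode_linear(segment: int, offset: int) -> int:
    """Real mode: linear = segment * 16 + offset (no 20-bit wraparound modeled)."""
    return (segment << 4) + offset

def gdt_entry(flags: int, base: int, limit: int) -> int:
    """Pack an 8-byte segment descriptor from flags, 32-bit base, 20-bit limit."""
    return (
        (limit & 0x0000FFFF)
        | ((base & 0x00FFFFFF) << 16)
        | ((flags & 0xF0FF) << 40)
        | ((limit & 0x000F0000) << 32)
        | ((base & 0xFF000000) << 32)
    )

def descriptor_base(desc: int) -> int:
    """Recover the 32-bit base that is scattered across the descriptor."""
    return ((desc >> 16) & 0xFFFFFF) | (((desc >> 56) & 0xFF) << 24)

def protected_mode_linear(desc: int, offset: int) -> int:
    """Protected mode without paging: linear = descriptor base + offset."""
    return (descriptor_base(desc) + offset) & 0xFFFFFFFF

FLAT_CODE = gdt_entry(0xC09B, 0, 0xFFFFF)   # base 0, 4 GB limit, code
FLAT_DATA = gdt_entry(0xC093, 0, 0xFFFFF)   # base 0, 4 GB limit, data

print(hex(real_mode_linear(0x9000, 0x0200)))
print(hex(FLAT_CODE), hex(FLAT_DATA))
print(hex(protected_mode_linear(FLAT_CODE, 0x100000)))
```

With a flat descriptor whose base is 0, the offset and the linear address are the same number, which is why a flat layout makes memory look like one big array from 0 to 4 GB.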
We're now ready for the plunge into protected mode, which is done by protected_mode_jump, another assembly routine. This routine enables protected mode by setting the PE bit in the CR0 CPU register. At this point we're running with paging disabled; paging is an optional feature of the processor, even in protected mode, and there's no need for it yet. What's important is that we're no longer confined to the 1 MB limit of real mode and can now address up to 4 GB of RAM. The routine then calls the 32-bit kernel entry point, which is startup_32 for compressed kernels. This routine does some basic register initializations and calls decompress_kernel(), a C function to do the actual decompression.
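As a rough illustration of the two CR0 bits involved, the sketch below models the register as a plain integer. PE is bit 0 and PG is bit 31 of CR0; everything else (the function name, the mode strings) is invented for this example and has nothing to do with how the kernel itself is written.

```python
CR0_PE = 1 << 0    # Protection Enable
CR0_PG = 1 << 31   # Paging

REAL_MODE_LIMIT = 1 << 20        # 1 MB
PROTECTED_MODE_LIMIT = 1 << 32   # 4 GB of linear address space

def cpu_mode(cr0: int) -> str:
    """Classify a CR0 value by its PE and PG bits."""
    if cr0 & CR0_PG and not cr0 & CR0_PE:
        # The processor rejects paging without protection.
        raise ValueError("PG set without PE is not a valid combination")
    if not cr0 & CR0_PE:
        return "real mode"
    if cr0 & CR0_PG:
        return "protected mode, paging on"
    return "protected mode, paging off"

cr0 = 0
print(cpu_mode(cr0))   # where the boot loader leaves us
cr0 |= CR0_PE          # the step protected_mode_jump performs
print(cpu_mode(cr0))
cr0 |= CR0_PG          # happens later, once page tables exist
print(cpu_mode(cr0))
```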
decompress_kernel() prints the familiar "Decompressing Linux..." message. Decompression happens in-place and once it's finished the uncompressed kernel image has overwritten the compressed one pictured in the first diagram. Hence the uncompressed contents also start at 1MB. decompress_kernel() then prints "done." and the comforting "Booting the kernel." By "Booting" it means a jump to the final entry point in this whole story, given to Linus by God himself atop Mountain Halti, which is the protected-mode kernel entry point at the start of the second megabyte of RAM (0x100000). That sacred location contains a routine called, uh, startup_32. But this one is in a different directory, you see.
The second incarnation of startup_32 is also an assembly routine, but it contains 32-bit mode initializations. It clears the bss segment for the protected-mode kernel (which is the true kernel that will now run until the machine reboots or shuts down), sets up the final global descriptor table for memory, builds page tables so that paging can be turned on, enables paging, initializes a stack, creates the final interrupt descriptor table, and finally jumps to the architecture-independent kernel start-up, start_kernel(). The diagram below shows the code flow for the last leg of the boot:
Architecture-independent Linux Kernel Initialization
start_kernel() looks more like typical kernel code, which is nearly all C and machine independent. The function is a long list of calls to initializations of the various kernel subsystems and data structures. These include the scheduler, memory zones, time keeping, and so on. start_kernel() then calls rest_init(), at which point things are almost all working. rest_init() creates a kernel thread, passing another function, kernel_init(), as the entry point. rest_init() then calls schedule() to kickstart task scheduling and goes to sleep by calling cpu_idle(), which is the idle thread for the Linux kernel. cpu_idle() runs forever and so does process zero, which hosts it. Whenever there is work to do - a runnable process - process zero gets booted out of the CPU, only to return when no runnable processes are available.
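The relationship between process zero and everything else can be shown with a toy model. This is not how the Linux scheduler works internally; it is a deliberately simplified sketch in which every task runs for one tick and then exits, and the idle task runs only when the run queue is empty. All names are made up for the illustration.

```python
from collections import deque

IDLE = "idle (process 0)"

def simulate(arrivals):
    """Return which task gets the CPU on each tick.

    `arrivals` is a list with one entry per tick; each entry lists the
    tasks that become runnable at that tick.
    """
    run_queue = deque()
    trace = []
    for new_tasks in arrivals:
        run_queue.extend(new_tasks)
        # The idle task is chosen only when nothing else is runnable.
        trace.append(run_queue.popleft() if run_queue else IDLE)
    return trace

trace = simulate([[], ["kernel_init"], [], ["a", "b"], [], []])
for tick, task in enumerate(trace):
    print(tick, task)
```

The idle task never appears in the run queue here; it is simply what the CPU falls back to, which mirrors the description of process zero being displaced whenever real work shows up.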
But here's the kicker for us. This idle loop is the end of the long thread we followed since boot; it's the final descendant of the very first jump executed by the processor after power up. All of this mess, from reset vector to BIOS to MBR to boot loader to real-mode kernel to protected-mode kernel, leads right here: jump by jump by jump, it ends in the idle loop for the boot processor, cpu_idle(). Which is really kind of cool. However, this can't be the whole story, otherwise the computer would do no work.
At this point, the kernel thread started previously is ready to kick in, displacing process 0 and its idle thread. And so it does, at which point kernel_init() starts running since it was given as the thread entry point. kernel_init() is responsible for initializing the remaining CPUs in the system, which have been halted since boot. All of the code we've seen so far has been executed in a single CPU, called the boot processor. As the other CPUs, called application processors, are started they come up in real-mode and must run through several initializations as well. Many of the code paths are common, as you can see in the code for startup_32, but there are slight forks taken by the late-coming application processors. Finally, kernel_init() calls init_post(), which tries to execute a user-mode process in the following order: /sbin/init, /etc/init, /bin/init, and /bin/sh. If all fail, the kernel will panic. Luckily init is usually there, and starts running as PID 1. It checks its configuration file to figure out which processes to launch, which might include the X Window System, programs for logging in on the console, network daemons, and so on. Thus ends the boot process as yet another Linux box starts running somewhere. May your uptime be long and untroubled.
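The fallback order for the first user-mode process is simple enough to sketch. In the kernel each candidate is tried with an exec-style call that never returns on success; the model below replaces that with a predicate saying whether a path can be run. KernelPanic, pick_init() and the predicate are hypothetical names for this illustration only.

```python
INIT_CANDIDATES = ("/sbin/init", "/etc/init", "/bin/init", "/bin/sh")

class KernelPanic(Exception):
    """Stand-in for the kernel giving up when no init can be started."""

def pick_init(can_run, candidates=INIT_CANDIDATES) -> str:
    """Return the first candidate that can be run, trying them in order."""
    for path in candidates:
        if can_run(path):
            return path
    raise KernelPanic("no init found")

# A typical system: /sbin/init exists, so the search stops immediately.
print(pick_init(lambda path: path in {"/sbin/init", "/bin/sh"}))

# A broken system with only a shell left still gets a PID 1.
print(pick_init(lambda path: path == "/bin/sh"))
```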
The process for Windows is similar in many ways, given the common architecture. Many of the same problems are faced and similar initializations must be done. When it comes to booting, one of the biggest differences is that Windows packs all of the real-mode kernel code, and some of the initial protected mode code, into the boot loader itself (C:\NTLDR). So instead of having two regions in the same kernel image, Windows uses different binary images. Plus Linux completely separates boot loader and kernel; in a way this automatically falls out of the open source process. The diagram below shows the main bits for the Windows kernel:
Windows Kernel Initialization
The Windows user-mode start-up is naturally very different. There's no /sbin/init, but rather Csrss.exe and Winlogon.exe. Winlogon spawns Services.exe, which starts all of the Windows Services, and Lsass.exe, the Local Security Authority Subsystem Service. The classic Windows login dialog runs in the context of Winlogon.
This is the end of this boot series. Thanks everyone for reading and for feedback. I'm sorry some things got superficial treatment; I've gotta start somewhere and only so much fits into blog-sized bites. But nothing like a day after the next; my plan is to do regular "Software Illustrated" posts like this series along with other topics. Meanwhile, here are some resources:
- The best, most important resource, is source code for real kernels, either Linux or one of the BSDs.
- Intel publishes excellent Software Developer's Manuals, which you can download for free.
- Understanding the Linux Kernel is a good book and walks through a lot of the Linux Kernel sources. It's getting outdated and it's dry, but I'd still recommend it to anyone who wants to grok the kernel. Linux Device Drivers is more fun, teaches well, but is limited in scope. Finally, Patrick Moroney suggested Linux Kernel Development by Robert Love in the comments for this post. I've heard other positive reviews for that book, so it sounds worth checking out.
- For Windows, the best reference by far is Windows Internals by David Solomon and Mark Russinovich, the latter of Sysinternals fame. This is a great book, well-written and thorough. The main downside is the lack of source code.
[Update: In a comment below, Nix covered a lot of ground on the initial root file system that I glossed over. Thanks to Marius Barbu for catching a mistake where I wrote “CR3” instead of GDTR]