kernel base

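Basics

Kernel: also called the core

Wikipedia: In computer science, the kernel is a computer program that manages data I/O (input and output) requests from software, translating them into data-processing instructions for the central processing unit (CPU) and the other electronic components of the computer. It is the most fundamental part of a modern operating system. It is the piece of software that gives applications secure access to the computer hardware; that access is limited, and the kernel decides when a program may operate on a given piece of hardware and for how long. Operating hardware directly is very complex, so the kernel usually provides a hardware abstraction to carry out these operations. With that abstraction, and through inter-process communication mechanisms and system calls, application processes can indirectly control the hardware resources they need (in particular the processor and I/O devices).

Strictly speaking, a kernel is not a necessary component of a computer system. Some programs can be loaded directly into the computer and executed; such a design indicates that the designer does not want any hardware abstraction or operating system support, and it was common in early computer systems. As computer technology developed, auxiliary programs such as program loaders and debuggers were eventually built into the machine's core or written into read-only memory. As these changes took place, the concept of an operating system kernel gradually became clear.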

The kernel has two main functions:

Control and interact with the hardware
Provide an environment in which applications can run

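Privilege levels

Intel CPUs divide CPU privilege into four levels: Ring 0, Ring 1, Ring 2, and Ring 3.

Ring 0 is reserved for the OS, Ring 3 is available to all programs, and an inner ring can freely use the resources of the outer rings.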
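
On x86 the ring a piece of code runs in is visible in the low two bits of the selector loaded in CS. The helper below is a minimal sketch of that rule; the name ring_of_selector is made up for illustration, and the sample values 0x33 (user code) and 0x10 (kernel code) assume the usual x86-64 Linux selectors.

```c
#include <stdint.h>

/* The low two bits of a segment selector hold a privilege level.
 * For the selector currently loaded in CS they give the current ring. */
unsigned ring_of_selector(uint16_t selector)
{
    return selector & 0x3u;
}
```

With the assumed values, ring_of_selector(0x33) gives 3 and ring_of_selector(0x10) gives 0.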

Loadable Kernel Modules(LKMs)

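LKMs use the same file format as user-mode executables: ELF on Linux, PE on Windows (the format of exe/dll files), and Mach-O on macOS. This means kernel modules can be analyzed with tools such as IDA.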

syscall

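A system call is a request from a user-space program to the operating system kernel for a service that needs higher privilege, such as I/O or inter-process communication. System calls are the interface between user programs and the operating system; some library functions (I/O functions such as scanf and puts) are in fact wrappers around system calls (read and write).

The 64-bit and 32-bit system call numbers can be found in /usr/include/x86_64-linux-gnu/asm/unistd_64.h and /usr/include/x86_64-linux-gnu/asm/unistd_32.h respectively.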
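
To make the relationship between library functions and system calls concrete, here is a small sketch that skips the stdio layer and issues write through the generic syscall() wrapper. It assumes Linux with glibc; raw_write is a hypothetical helper name, and SYS_write comes from the headers mentioned above.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_write and the other system call numbers */
#include <unistd.h>        /* syscall() */

/* Write len bytes from buf to fd by invoking the write system call directly.
 * Returns the number of bytes written, or -1 on error. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_write(1, "hi\n", 3) prints to standard output without going through puts or printf.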

ioctl

#include <sys/ioctl.h>

       int ioctl(int fd, unsigned long request, ...);

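ioctl is also a system call, used to communicate with devices.

The first argument of int ioctl(int fd, unsigned long request, ...) is the file descriptor returned by opening the device (open), the second is the control command the user program sends to the device, and any further arguments are supplementary and device-specific.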
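
The three parts of the call (descriptor, request, extra argument) can be seen with a request that needs no custom driver. The sketch below uses the standard FIONREAD request on a pipe, assuming Linux; in a CTF challenge the descriptor would instead come from opening the vulnerable device and the request codes would be the ones the module defines. pending_bytes and demo_pending are hypothetical names.

```c
#include <sys/ioctl.h>   /* ioctl(), FIONREAD */
#include <unistd.h>      /* pipe(), write(), close() */

/* Ask the kernel how many bytes are waiting to be read on fd.
 * Returns the count, or -1 if the ioctl fails. */
int pending_bytes(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) < 0)   /* third argument: where to store the result */
        return -1;
    return n;
}

/* Usage: put five bytes into a pipe, then query the read end. */
int demo_pending(void)
{
    int p[2];
    int n;

    if (pipe(p) < 0)
        return -1;
    if (write(p[1], "hello", 5) != 5)
        n = -1;
    else
        n = pending_bytes(p[0]);
    close(p[0]);
    close(p[1]);
    return n;
}
```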

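Mode switching

user space to kernel space

A switch from user mode to kernel mode happens on events such as a system call, an exception, or an interrupt from a peripheral. The process is:

Switch GS with swapgs: the instruction exchanges the current GS base with a value kept in a specific location (the IA32_KERNEL_GS_BASE MSR), which saves the user GS value and makes the stored value the GS base used while the kernel runs.
Record the current stack top (the user-space stack top) in the per-CPU variable area, and load the kernel stack top recorded there into rsp/esp.
Save the register values with push. The code is: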

 ENTRY(entry_SYSCALL_64)
 /* SWAPGS_UNSAFE_STACK is a macro; on x86 it is defined directly as the swapgs instruction */
 SWAPGS_UNSAFE_STACK

 /* Save the user stack pointer and switch to the kernel stack */
 movq %rsp, PER_CPU_VAR(rsp_scratch)
 movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

/* Save the register values with push, constructing a struct pt_regs on the stack */
pushq  $__USER_DS      /* pt_regs->ss */
pushq  PER_CPU_VAR(rsp_scratch)  /* pt_regs->sp */
pushq  %r11             /* pt_regs->flags */
pushq  $__USER_CS      /* pt_regs->cs */
pushq  %rcx             /* pt_regs->ip */
pushq  %rax             /* pt_regs->orig_ax */
pushq  %rdi             /* pt_regs->di */
pushq  %rsi             /* pt_regs->si */
pushq  %rdx             /* pt_regs->dx */
pushq  %rcx             /* pt_regs->cx */
pushq  $-ENOSYS        /* pt_regs->ax */
pushq  %r8              /* pt_regs->r8 */
pushq  %r9              /* pt_regs->r9 */
pushq  %r10             /* pt_regs->r10 */
pushq  %r11             /* pt_regs->r11 */
sub $(6*8), %rsp      /* pt_regs->bp, bx, r12-15 not saved */

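Check with assembly instructions whether the call uses the x32 ABI.
Use the system call number to jump to the corresponding entry in the global sys_call_table and continue executing the system call.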
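
The pushes in the entry code build the frame from its highest address downwards, so the resulting memory layout is easier to read as a C structure. The declaration below is a user-space sketch of that layout on x86-64, not the kernel's own header; pt_regs_sketch and syscall_number are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout of the saved frame, lowest address first.
 * The entry code fills it from the last field (ss) back to r11,
 * then reserves the first six slots with "sub $(6*8), %rsp". */
struct pt_regs_sketch {
    uint64_t r15, r14, r13, r12, bp, bx;   /* reserved, not saved on this path */
    uint64_t r11, r10, r9, r8;
    uint64_t ax;                           /* initialised to -ENOSYS */
    uint64_t cx, dx, si, di;
    uint64_t orig_ax;                      /* system call number from rax */
    uint64_t ip, cs, flags, sp, ss;        /* user return state, pushed first */
};

/* The dispatcher uses this value to index sys_call_table. */
uint64_t syscall_number(const struct pt_regs_sketch *regs)
{
    return regs->orig_ax;
}
```

The structure has 21 eight-byte slots, which matches the 15 pushes plus the 6 reserved slots in the listing.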

kernel space to user space

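On exit, the flow is:

Restore the GS value with swapgs
Return to user space with sysretq or iretq and continue execution. When iretq is used, some user-space information must also be supplied (CS, eflags/rflags, esp/rsp, and so on)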
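
In 64-bit mode iretq pops five values from the stack in a fixed order: RIP, CS, RFLAGS, RSP, SS. An exploit that returns to user space by hand therefore has to lay these out itself, using values it saved while still in user mode. The helper below is only a sketch of that layout; build_iret_frame is a hypothetical name, and the caller is assumed to have obtained the five values elsewhere.

```c
#include <stdint.h>

/* Fill frame[0..4] with the values iretq pops, in pop order. */
void build_iret_frame(uint64_t frame[5], uint64_t rip, uint64_t cs,
                      uint64_t rflags, uint64_t rsp, uint64_t ss)
{
    frame[0] = rip;      /* where user-space execution resumes */
    frame[1] = cs;       /* user code segment selector */
    frame[2] = rflags;   /* saved flags */
    frame[3] = rsp;      /* user stack pointer */
    frame[4] = ss;       /* user stack segment selector */
}
```

In a ROP chain the same five words simply follow the swapgs and iretq gadgets on the kernel stack.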

struct cred

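The kernel uses struct cred to record a process's privileges. Every process has a cred structure that holds its privilege information (uid, gid, and so on), so modifying a process's cred changes that process's privileges.

The source is: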

struct cred {
	atomic_t	usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
	atomic_t	subscribers;	/* number of processes subscribed */
	void		*put_addr;
	unsigned	magic;
#define CRED_MAGIC	0x43736564
#define CRED_MAGIC_DEAD	0x44656144
#endif
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	unsigned	securebits;	/* SUID-less security management */
	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
	kernel_cap_t	cap_permitted;	/* caps we're permitted */
	kernel_cap_t	cap_effective;	/* caps we can actually use */
	kernel_cap_t	cap_bset;	/* capability bounding set */
	kernel_cap_t	cap_ambient;	/* Ambient capability set */
#ifdef CONFIG_KEYS
	unsigned char	jit_keyring;	/* default keyring to attach requested
					 * keys to */
	struct key __rcu *session_keyring; /* keyring inherited over fork */
	struct key	*process_keyring; /* keyring private to this process */
	struct key	*thread_keyring; /* keyring private to this thread */
	struct key	*request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
	void		*security;	/* subjective LSM security */
#endif
	struct user_struct *user;	/* real user ID subscription */
	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
	struct rcu_head	rcu;		/* RCU deletion hook */
} __randomize_layout;

The main privilege-escalation technique:

commit_creds(prepare_kernel_cred(0));

prepare_kernel_cred is defined as follows:

/**
 * prepare_kernel_cred - Prepare a set of credentials for a kernel service
 * @daemon: A userspace daemon to be used as a reference
 *
 * Prepare a set of credentials for a kernel service.  This can then be used to
 * override a task's own credentials so that work can be done on behalf of that
 * task that requires a different subjective context.
 *
 * @daemon is used to provide a base for the security record, but can be NULL.
 * If @daemon is supplied, then the security data will be derived from that;
 * otherwise they'll be set to 0 and no groups, full capabilities and no keys.
 *
 * The caller may change these controls afterwards if desired.
 *
 * Returns the new credentials or NULL if out of memory.
 *
 * Does not take, and does not return holding current->cred_replace_mutex.
 */
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
	const struct cred *old;
	struct cred *new;

	new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
	if (!new)
		return NULL;

	kdebug("prepare_kernel_cred() alloc %p", new);

	if (daemon)
		old = get_task_cred(daemon);
	else
		old = get_cred(&init_cred);

	validate_creds(old);

	*new = *old;
	atomic_set(&new->usage, 1);
	set_cred_subscribers(new, 0);
	get_uid(new->user);
	get_user_ns(new->user_ns);
	get_group_info(new->group_info);

#ifdef CONFIG_KEYS
	new->session_keyring = NULL;
	new->process_keyring = NULL;
	new->thread_keyring = NULL;
	new->request_key_auth = NULL;
	new->jit_keyring = KEY_REQKEY_DEFL_THREAD_KEYRING;
#endif

#ifdef CONFIG_SECURITY
	new->security = NULL;
#endif
	if (security_prepare_creds(new, old, GFP_KERNEL) < 0)
		goto error;

	put_cred(old);
	validate_creds(new);
	return new;

error:
	put_cred(new);
	put_cred(old);
	return NULL;
}

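The comment already describes the function in detail. In short, it creates a cred structure: based on the struct task_struct *daemon argument it prepares credentials for a kernel service, so that the current task can be given the privileges to run in a particular context.

When the argument is NULL, the function copies init_cred, which can be understood as using process 0 as the reference, and the returned cred structure carries full root privileges.

commit_creds is defined as: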

/**
 * commit_creds - Install new credentials upon the current task
 * @new: The credentials to be assigned
 *
 * Install a new set of credentials to the current task, using RCU to replace
 * the old set.  Both the objective and the subjective credentials pointers are
 * updated.  This function may not be called if the subjective credentials are
 * in an overridden state.
 *
 * This function eats the caller's reference to the new credentials.
 *
 * Always returns 0 thus allowing this function to be tail-called at the end
 * of, say, sys_setgid().
 */
int commit_creds(struct cred *new)
{
	struct task_struct *task = current;
	const struct cred *old = task->real_cred;

	kdebug("commit_creds(%p{%d,%d})", new,
	       atomic_read(&new->usage),
	       read_cred_subscribers(new));

	BUG_ON(task->cred != old);
#ifdef CONFIG_DEBUG_CREDENTIALS
	BUG_ON(read_cred_subscribers(old) < 2);
	validate_creds(old);
	validate_creds(new);
#endif
	BUG_ON(atomic_read(&new->usage) < 1);

	get_cred(new); /* we will require a ref for the subj creds too */

	/* dumpability changes */
	if (!uid_eq(old->euid, new->euid) ||
	    !gid_eq(old->egid, new->egid) ||
	    !uid_eq(old->fsuid, new->fsuid) ||
	    !gid_eq(old->fsgid, new->fsgid) ||
	    !cred_cap_issubset(old, new)) {
		if (task->mm)
			set_dumpable(task->mm, suid_dumpable);
		task->pdeath_signal = 0;
		smp_wmb();
	}

	/* alter the thread keyring */
	if (!uid_eq(new->fsuid, old->fsuid))
		key_fsuid_changed(task);
	if (!gid_eq(new->fsgid, old->fsgid))
		key_fsgid_changed(task);

	/* do it
	 * RLIMIT_NPROC limits on user->processes have already been checked
	 * in set_user().
	 */
	alter_cred_subscribers(new, 2);
	if (new->user != old->user)
		atomic_inc(&new->user->processes);
	rcu_assign_pointer(task->real_cred, new);
	rcu_assign_pointer(task->cred, new);
	if (new->user != old->user)
		atomic_dec(&old->user->processes);
	alter_cred_subscribers(old, -2);

	/* send notifications */
	if (!uid_eq(new->uid,   old->uid)  ||
	    !uid_eq(new->euid,  old->euid) ||
	    !uid_eq(new->suid,  old->suid) ||
	    !uid_eq(new->fsuid, old->fsuid))
		proc_id_connector(task, PROC_EVENT_UID);

	if (!gid_eq(new->gid,   old->gid)  ||
	    !gid_eq(new->egid,  old->egid) ||
	    !gid_eq(new->sgid,  old->sgid) ||
	    !gid_eq(new->fsgid, old->fsgid))
		proc_id_connector(task, PROC_EVENT_GID);

	/* release the old obj and subj refs both */
	put_cred(old);
	put_cred(old);
	return 0;
}

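As the comment also shows, this function installs a new cred structure on the current task, which changes the current task's privileges.

Combined with the root cred structure obtained from prepare_kernel_cred(0), it gives the current task the same root privileges, and the privilege escalation is complete.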
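
The sketch below shows how this call is often wired up in a ret2usr-style payload. It assumes a kernel where prepare_kernel_cred(NULL) falls back to init_cred, as in the source quoted above, and that the two kernel addresses are already known. set_cred_helpers and escalate are hypothetical names; escalate only makes sense when execution has been redirected to it in ring 0 and would fault if called from an ordinary user-mode program.

```c
#include <stdint.h>

struct cred;          /* opaque here; the real definition lives in the kernel */
struct task_struct;   /* opaque here; the real definition lives in the kernel */

typedef int (*commit_creds_fn)(struct cred *);
typedef struct cred *(*prepare_kernel_cred_fn)(struct task_struct *);

static commit_creds_fn commit_creds_ptr;
static prepare_kernel_cred_fn prepare_kernel_cred_ptr;

/* Record the kernel addresses of the two functions. */
void set_cred_helpers(uintptr_t commit_addr, uintptr_t prepare_addr)
{
    commit_creds_ptr = (commit_creds_fn)commit_addr;
    prepare_kernel_cred_ptr = (prepare_kernel_cred_fn)prepare_addr;
}

/* Payload body: must only run in kernel context. */
void escalate(void)
{
    commit_creds_ptr(prepare_kernel_cred_ptr(0));
}
```

Keeping the addresses in variables rather than hard-coding them makes it easy to add a KASLR offset before the payload runs.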

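Kernel-mode functions

Compared with user-mode library functions, kernel-mode functions differ in a few ways:

printf() -> printk(). Note that printk() output does not necessarily appear on the terminal, but it always goes to the kernel log buffer and can be viewed with dmesg
memcpy() -> copy_from_user()/copy_to_user()
- copy_from_user() copies data from user space into kernel space
- copy_to_user() copies data from kernel space into user space
copy_to_user(to, from, n)

to: destination address in user space (may be an array)

from: source address in kernel space (may be an array)

n: number of bytes to copy from kernel space to user space

copy_from_user(to, from, n)

to: destination address in kernel space (may be an array)

from: source address in user space (may be an array)

n: number of bytes to copy from user space to kernel space

Both return 0 on success; on failure they return the number of bytes that could not be copied.
malloc() -> kmalloc(), the kernel-mode memory allocation function; similar to malloc(), but backed by the slab/slub allocator
free() -> kfree(), the counterpart of kmalloc()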

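Also note that the kernel manages processes, so it also records each process's privileges. Two kernel functions make it easy to change them:

int commit_creds(struct cred *new)
struct cred* prepare_kernel_cred(struct task_struct* daemon)

As the function names suggest, executing commit_creds(prepare_kernel_cred(0)) yields root privileges; the 0 means the new credentials are prepared with process 0 as the reference.

See the source for more information on prepare_kernel_cred.

Executing commit_creds(prepare_kernel_cred(0)) is also the most common privilege-escalation technique. The addresses of both functions can be looked up in /proc/kallsyms (/proc/ksyms on older kernels).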

$ sudo grep commit_creds /proc/kallsyms
[sudo] password for m4x:
ffffffffbb6af9e0 T commit_creds
ffffffffbc7cb3d0 r __ksymtab_commit_creds
ffffffffbc7f06fe r __kstrtab_commit_creds
$ sudo grep prepare_kernel_cred /proc/kallsyms
ffffffffbb6afd90 T prepare_kernel_cred
ffffffffbc7d4f20 r __ksymtab_prepare_kernel_cred
ffffffffbc7f06b7 r __kstrtab_prepare_kernel_cred

Normally, reading the contents of /proc/kallsyms requires root privileges.
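
The same lookup can be done from an exploit program instead of grep. The sketch below parses lines in the format shown above (address, type letter, name); addr_if_symbol and lookup_symbol are hypothetical helper names. The name has to match exactly, so entries such as __ksymtab_commit_creds are skipped.

```c
#include <stdio.h>
#include <string.h>

/* Parse one kallsyms-style line: "<hex address> <type> <name>".
 * Returns the address if the name matches symbol exactly, otherwise 0. */
unsigned long long addr_if_symbol(const char *line, const char *symbol)
{
    unsigned long long addr;
    char type;
    char name[256];

    if (sscanf(line, "%llx %c %255s", &addr, &type, name) != 3)
        return 0;
    return strcmp(name, symbol) == 0 ? addr : 0;
}

/* Scan a kallsyms-style file and return the first matching address, or 0. */
unsigned long long lookup_symbol(const char *path, const char *symbol)
{
    char line[512];
    unsigned long long addr = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return 0;
    while (addr == 0 && fgets(line, sizeof line, fp) != NULL)
        addr = addr_if_symbol(line, symbol);
    fclose(fp);
    return addr;
}
```

A result of 0 from lookup_symbol("/proc/kallsyms", "commit_creds") means the symbol was not found or the addresses are hidden from the current user.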

Mitigation

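Protections such as canary, DEP, PIE, and RELRO work the same way and serve the same purpose as in user mode.

smep: Supervisor Mode Execution Prevention. When the processor is in ring 0, executing user-space code triggers a page fault. (On ARM this protection is called PXN.)
smap: Supervisor Mode Access Prevention. Similar to smep, but it applies to data accesses to user-space memory.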
mmap_min_addr:

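CTF kernel pwn

Provided files

Usually three or four files are provided:

boot.sh: a shell script that starts the kernel, usually with qemu; the protections in effect depend on the qemu startup parameters
bzImage: the kernel binary (the packed kernel code, which can be searched for gadgets)
rootfs.cpio: the file system image
file.ko: the buggy module, which can be opened in IDA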

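qemu startup parameters:

-initrd rootfs.cpio uses rootfs.cpio as the file system the kernel boots with

-kernel bzImage uses bzImage as the kernel image

-cpu kvm64,+smep sets the CPU security options; here smep is enabled

-m 64M sets the virtual RAM to 64M (the default is 128M). Other options can be listed with --help.

-initrd sets the root file system

-vnc :2 opens a VNC server, so the VM console can be viewed by pointing vncviewer at localhost:2

-S freezes the VM at startup and waits for gdb to connect

-s is shorthand for -gdb tcp::1234, that is, it opens gdb debugging port 1234

-net nic creates a virtual network card for the VM

-net user makes QEMU use user-mode networking

-append passes boot parameters to the kernel; it is best to wrap them in quotes (").

root=/dev/sda tells the kernel which device holds the root file system (the path is the one seen inside the QEMU guest OS)

console=ttyS0 tells the kernel (here running on a vexpress board) which tty serial device to use as the console. This value can be found from the CONFIG_CONSOLE macro in the generated .config file.

nokaslr is a parameter passed to the kernel. It disables KASLR (Kernel Address Space Layout Randomization, which randomizes the address at which the kernel is loaded into memory).

rw sets the file system permissions, here readable and writable. The other possible value is ro.

oops: the Linux kernel has behaved incorrectly and produced an error log about it

panic: the action the operating system takes when it detects an internal fatal error that it cannot handle safely.

-nographic disables the graphical interface and uses only the serial port

-netdev user configures the internal user network, which is isolated from every other VM and from the external network; it is a channel between the host and the QEMU guest.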

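Notes on Linux kernel modules

1. The fop structure

A kernel module's code includes tables of callbacks. The corresponding functions are stored in a file_operations (fop) structure, which is also the most important structure for pwn players. Callbacks that the module implements are statically initialized with the function's address, while unimplemented ones are NULL.

2. Creating a file with proc_create

3. Data communication

Tips

List loaded drivers

lsmod

Check which protections are enabled

cat /proc/cpuinfo

KASLR: kernel address space layout randomization

SMEP: kernel mode cannot execute user-mode code

SMAP: kernel mode cannot access user-mode memory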
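
SMEP and SMAP show up as the words smep and smap on the flags line of /proc/cpuinfo. A plain substring search could be fooled by a longer flag that happens to contain the same letters, so the sketch below matches whole words only. has_cpu_flag is a hypothetical helper name, and reading the flags line out of the file is left to the caller.

```c
#include <string.h>

/* Return 1 if flag appears as a whole whitespace-separated word in flags,
 * 0 otherwise. flags is one "flags : ..." line from /proc/cpuinfo. */
int has_cpu_flag(const char *flags, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = flags;

    if (len == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts && ends)
            return 1;
        p += 1;   /* keep searching after a partial match */
    }
    return 0;
}
```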

View kernel heap (slab) caches

cat /proc/slabinfo

Find the addresses of prepare_kernel_cred and commit_creds

grep prepare_kernel_cred  /proc/kallsyms 
grep commit_creds  /proc/kallsyms 

cat /proc/kallsyms | grep prepare_kernel_cred
cat /proc/kallsyms | grep commit_creds