Tracing the code execution path of the RDMSR and CPUID instructions in a guest
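Kernel version: 5.3.0
QEMU version: 4.2.0
The RDMSR instruction
Purpose
RDMSR reads a model-specific register (MSR). The MSR is selected by the contents of ECX (RCX), and the value read is returned in EDX(RDX):EAX(RAX).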
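VMX behavior
When a guest executes rdmsr, a VM exit is triggered if any one of the following conditions holds:
- The "use MSR bitmaps" control is 0.
- RCX is neither in 00000000H-00001FFFH nor in C0000000H-C0001FFFH.
- RCX is in 00000000H-00001FFFH and bit RCX of the read bitmap for low MSRs is 1.
- RCX is in C0000000H-C0001FFFH and bit n of the read bitmap for high MSRs is 1, where n = RCX & 00001FFFH.
The MSR bitmap address points to the MSR bitmaps, a 4 KB region made up of four 1 KB bitmaps: read bitmap for low MSRs, read bitmap for high MSRs, write bitmap for low MSRs, and write bitmap for high MSRs. The MSR bitmap address itself is a field of the VMCS, while the bitmaps it points to live in ordinary memory and can be read or modified with normal memory accesses.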
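The exit conditions above can be sketched as a small C function. This is only an illustration of the rules as listed, not code from KVM or from the CPU; the name rdmsr_causes_vmexit() and the offset macros are made up for this example, and the bitmap is assumed to be the 4 KB region described above with the two read bitmaps in its first 2 KB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets inside the 4 KB MSR bitmap region (read bitmaps only). */
#define READ_LOW_OFFSET   0u     /* MSRs 00000000H-00001FFFH */
#define READ_HIGH_OFFSET  1024u  /* MSRs C0000000H-C0001FFFH */

/* Hypothetical helper: returns true if RDMSR of `msr` would cause a VM exit. */
static bool rdmsr_causes_vmexit(bool use_msr_bitmaps,
                                const uint8_t *bitmap, uint32_t msr)
{
    uint32_t offset, n;

    if (!use_msr_bitmaps)
        return true;                 /* control is 0: every RDMSR exits */
    assert(bitmap != NULL);

    if (msr <= 0x00001fffu) {
        offset = READ_LOW_OFFSET;    /* low range: bit index is the MSR number */
        n = msr;
    } else if (msr >= 0xc0000000u && msr <= 0xc0001fffu) {
        offset = READ_HIGH_OFFSET;   /* high range: bit index is msr & 1FFFH */
        n = msr & 0x00001fffu;
    } else {
        return true;                 /* outside both ranges: always exits */
    }

    /* Exit only if the selected bit is 1. */
    return ((bitmap[offset + n / 8] >> (n % 8)) & 1u) != 0;
}
```

With an all-zero bitmap, a read of MSR 10H (the TSC) does not exit; setting bit 10H in the first 1 KB makes the same read exit again.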
Code analysis (MSR bitmap)
KVM code
- Initializing the MSR bitmap field in the VMCS
qemu=> kvm_vm_ioctl(KVM_CREATE_VCPU) => kvm_vm_ioctl_create_vcpu() => kvm_arch_vcpu_create() => vmx_create_vcpu() => vmx_vcpu_setup()
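static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
{
    // If the CPU supports "use MSR bitmaps", write the address of the
    // already-allocated MSR bitmap into the MSR bitmap address field of the VMCS
    if (cpu_has_vmx_msr_bitmap())
        vmcs_write64(MSR_BITMAP, __pa(vmx->vmcs01.msr_bitmap));
}
When QEMU requests the creation of a vCPU, the address of the MSR bitmap is written into the VMCS.
- Allocating the space for the MSR bitmap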
qemu=> kvm_vm_ioctl(KVM_CREATE_VCPU) => kvm_vm_ioctl_create_vcpu() => kvm_arch_vcpu_create() => vmx_create_vcpu() => alloc_loaded_vmcs()
static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
{
    err = alloc_loaded_vmcs(&vmx->vmcs01);
}
// Allocate one page (4 KB) for the MSR bitmap and initialize it to all ones
int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
{
    if (cpu_has_vmx_msr_bitmap()) {
        loaded_vmcs->msr_bitmap = (unsigned long *)
                __get_free_page(GFP_KERNEL_ACCOUNT);
        if (!loaded_vmcs->msr_bitmap)
            goto out_vmcs;
        memset(loaded_vmcs->msr_bitmap, 0xff, PAGE_SIZE);
    }
    ...
}
When QEMU requests the creation of a vCPU, 4 KB is allocated for the MSR bitmap and initialized to all ones, so by default every MSR access is intercepted.
- Initializing specific bits (corresponding to specific MSRs) in the MSR bitmap
qemu=> kvm_vm_ioctl(KVM_CREATE_VCPU) => kvm_vm_ioctl_create_vcpu() => kvm_arch_vcpu_create() => vmx_create_vcpu()
static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
{
    msr_bitmap = vmx->vmcs01.msr_bitmap;
    // Clear specific bits in the MSR bitmap; later accesses to these MSRs no longer cause a VM exit
    vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_TSC, MSR_TYPE_R);
    vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
    vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
    vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
    vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW);
    vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW);
    vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW);
    if (kvm_cstate_in_guest(kvm)) {
        vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C1_RES, MSR_TYPE_R);
        vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C3_RESIDENCY, MSR_TYPE_R);
        vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
        vmx_disable_intercept_for_msr(msr_bitmap, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
    }
    vmx->msr_bitmap_mode = 0;
    vmx->loaded_vmcs = &vmx->vmcs01;
}
Later, at run time, KVM also updates the intercept settings of some APIC-related MSRs. Every other MSR that has not been explicitly configured keeps its bit set, so accessing it causes a VM exit.
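The effect of "fill with ones, then clear selected bits" can be sketched in user-space C. This is a simplified model, not the kernel's vmx_disable_intercept_for_msr(); the names toy_init_msr_bitmap() and toy_disable_intercept() are hypothetical, and the layout assumed is the four 1 KB bitmaps at offsets 0x000 (read low), 0x400 (read high), 0x800 (write low), and 0xc00 (write high).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MSR_TYPE_R   1
#define TOY_MSR_TYPE_W   2
#define TOY_MSR_TYPE_RW  3

/* Start with every bit set: all MSR reads and writes are intercepted. */
static void toy_init_msr_bitmap(uint8_t *bitmap)
{
    assert(bitmap != NULL);
    memset(bitmap, 0xff, 4096);
}

static void toy_clear_bit(uint8_t *base, uint32_t n)
{
    base[n / 8] &= (uint8_t)~(1u << (n % 8));
}

/* Clear the read and/or write intercept bit of one MSR. */
static void toy_disable_intercept(uint8_t *bitmap, uint32_t msr, int type)
{
    uint32_t high = 0;

    if (msr >= 0xc0000000u && msr <= 0xc0001fffu) {
        high = 0x400;              /* high-MSR bitmaps sit 1 KB further in */
        msr &= 0x1fffu;
    } else if (msr > 0x1fffu) {
        return;                    /* not covered by the bitmaps at all */
    }

    if (type & TOY_MSR_TYPE_R)
        toy_clear_bit(bitmap + 0x000 + high, msr);
    if (type & TOY_MSR_TYPE_W)
        toy_clear_bit(bitmap + 0x800 + high, msr);
}
```

Calling toy_disable_intercept(bitmap, 0x10, TOY_MSR_TYPE_R) clears only the read bit of the TSC MSR, which matches the MSR_TYPE_R call for MSR_IA32_TSC in the excerpt: guest reads pass through while writes still exit.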
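QEMU code
The following code obtains MSR-bitmap-related information (strictly speaking, it reads the IA32_VMX_BASIC capability MSR rather than the bitmap page itself). The information could also be modified, but QEMU does not modify the MSR bitmap at run time.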
MSR_IA32_VMX_BASIC_REGISTER Msr;
Msr.Uint64 = AsmReadMsr64 (MSR_IA32_VMX_BASIC);
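Code analysis (read_msr)
In a guest, reading an MSR whose bit in the MSR bitmap is 1 causes a VM exit.
A read_msr in the guest then goes through the following call chain:
guest reads MSR => handle_rdmsr() => vmx_get_msr() => kvm_get_msr_common()
vmx_get_msr() handles reads of some special (VMX-specific) MSRs, and kvm_get_msr_common() handles reads of the ordinary ones.
Take MSR_IA32_ARCH_CAPABILITIES as an example:
MSR_IA32_ARCH_CAPABILITIES is an ordinary MSR, so it is handed to kvm_get_msr_common().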
int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
    ...
    case MSR_IA32_ARCH_CAPABILITIES:
        if (!msr_info->host_initiated &&
            !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
            return 1;
        msr_info->data = vcpu->arch.arch_capabilities;
        break;
    ...
}
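msr_info->host_initiated distinguishes whether this MSR read was initiated by QEMU or by the guest itself. It is true when QEMU initiated the read and false when the guest did. So when the guest reads MSR_IA32_ARCH_CAPABILITIES, msr_info->host_initiated is false.
guest_cpuid_has() checks whether the guest has the CPUID feature X86_FEATURE_ARCH_CAPABILITIES. CPUID is analyzed in a later section; for now it is enough to know that guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES) returns true if the guest has the feature and false otherwise.
If the guest has the feature, the contents of vcpu->arch.arch_capabilities are copied into msr_info->data and the MSR read is complete.
If the guest does not have the feature, the function returns 1 to indicate that the read failed, and handle_rdmsr() injects a #GP into the guest. (A guest usually initializes the result to 0 before reading an MSR through a fault-tolerant helper, so after a failed read the result is still 0. This design keeps the program from getting stuck when an MSR read fails.)
Assume the guest has the feature (otherwise the analysis would end here). The value read is then the contents of arch_capabilities.
vcpu->arch.arch_capabilities is assigned in two places in the kernel:
- In kvm_arch_vcpu_setup(), that is, when the vCPU is initialized: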
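The gating logic of this case can be condensed into a few lines of C. This is a toy model for illustration only: struct toy_vcpu and toy_get_arch_capabilities() are hypothetical names, and the guest_cpuid_has() check is reduced to a boolean field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the pieces of struct kvm_vcpu used here. */
struct toy_vcpu {
    bool     cpuid_has_arch_caps;  /* models guest_cpuid_has(...) */
    uint64_t arch_capabilities;    /* models vcpu->arch.arch_capabilities */
};

/* Returns 0 on success and 1 on failure, like the excerpt above. */
static int toy_get_arch_capabilities(const struct toy_vcpu *vcpu,
                                     bool host_initiated, uint64_t *data)
{
    assert(vcpu != NULL && data != NULL);

    /* A guest-initiated read fails if the guest CPUID lacks the feature. */
    if (!host_initiated && !vcpu->cpuid_has_arch_caps)
        return 1;

    *data = vcpu->arch_capabilities;
    return 0;
}
```

Note that a host-initiated read (QEMU via ioctl) succeeds even when the guest CPUID does not advertise the feature; only the guest's own read is gated, and on failure *data is left untouched.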
qemu=> kvm_vm_ioctl(KVM_CREATE_VCPU) => kvm_vm_ioctl_create_vcpu() => kvm_arch_vcpu_setup()
int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
{
    ...
    vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
    ...
}
// arch/x86/kvm/x86.c
static u64 kvm_get_arch_capabilities(void)
{
    u64 data = 0;

    // Executed on the host, so this rdmsr reads the physical CPU's MSR
    if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
        rdmsrl(MSR_IA32_ARCH_CAPABILITIES, data);
    ...
    return data;
}
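Assume again that the boot CPU has the feature X86_FEATURE_ARCH_CAPABILITIES (in practice many current CPUs do), that is, the physical CPU carries this flag. The data is then again obtained by reading MSR_IA32_ARCH_CAPABILITIES with rdmsr. This time, however, the rdmsr does not cause a VM exit: it runs in host kernel space and reads the physical CPU's MSR_IA32_ARCH_CAPABILITIES.
In other words, when QEMU requests the creation of a vCPU, vcpu->arch.arch_capabilities is set to the value read from the corresponding MSR of the pCPU (physical CPU).
- In kvm_set_msr_common(), which assigns arch_capabilities when QEMU sets the value through a vCPU ioctl.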
[ kvm_arch_put_registers(cpu, KVM_PUT_RESET_STATE),
kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE),
kvm_arch_put_registers(cpu, KVM_PUT_RUNTIME_STATE)] => kvm_arch_put_registers() => kvm_put_msrs()
// QEMU code: build the MSR entries and write their contents into the guest
static int kvm_put_msrs(X86CPU *cpu, int level)
{
    /* At the start of kvm_put_msrs(), entries for a large number of MSRs are added and stored in cpu->kvm_msr_buf */
    ...
    /* If host supports feature MSR, write down. */
    if (has_msr_arch_capabs) {
        kvm_msr_entry_add(cpu, MSR_IA32_ARCH_CAPABILITIES,
                          env->features[FEAT_ARCH_CAPABILITIES]);
    }
    ...
    ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, cpu->kvm_msr_buf); // write the MSR data into the guest
}
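While a vCPU is running, when it is reset, and when it is initialized, kvm_put_msrs() is called to set the MSRs the vCPU supports and their contents. The values are finally written into the guest through kvm_vcpu_ioctl(KVM_SET_MSRS).
The has_msr_arch_capabs flag is set when QEMU creates the vCPU thread: QEMU calls kvm_ioctl(s, KVM_GET_MSR_INDEX_LIST, &msr_list) to obtain the guest's MSR list and then checks whether MSR_IA32_ARCH_CAPABILITIES is in that list. On receiving KVM_GET_MSR_INDEX_LIST, KVM returns the list of MSRs supported for the guest together with the MSRs that KVM can emulate.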
// KVM code: on receiving the KVM_SET_MSRS ioctl, call do_set_msr
long kvm_arch_vcpu_ioctl(struct file *filp,
                         unsigned int ioctl, unsigned long arg)
{
    ...
    case KVM_SET_MSRS: {
        int idx = srcu_read_lock(&vcpu->kvm->srcu);
        r = msr_io(vcpu, argp, do_set_msr, 0);
        srcu_read_unlock(&vcpu->kvm->srcu, idx);
        break;
    }
    ...
}
// The code that finally implements KVM_SET_MSRS, using MSR_IA32_ARCH_CAPABILITIES as the example
kvm_set_msr_common()
{
    ...
    case MSR_IA32_ARCH_CAPABILITIES:
        if (!msr_info->host_initiated) // if the guest itself tries to write this MSR, return 1 to signal that the write failed
            return 1;
        vcpu->arch.arch_capabilities = data;
        break;
    ...
}
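After KVM receives the KVM_SET_MSRS ioctl, it calls do_set_msr:
do_set_msr() => kvm_set_msr => kvm_x86_ops->set_msr => vmx_set_msr => kvm_set_msr_common
kvm_set_msr_common() finally performs the assignment to arch_capabilities. The data here is first fetched by QEMU from KVM and then written back to KVM by QEMU, so in the end it still comes from KVM, that is, from the host.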
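The write side is the mirror image of the read side: only the host may set the value. A minimal sketch under the same assumptions as before (struct toy_vcpu and toy_set_arch_capabilities() are hypothetical names, not KVM symbols):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for vcpu->arch.arch_capabilities. */
struct toy_vcpu {
    uint64_t arch_capabilities;
};

/* Returns 0 on success and 1 on failure, like kvm_set_msr_common(). */
static int toy_set_arch_capabilities(struct toy_vcpu *vcpu,
                                     bool host_initiated, uint64_t data)
{
    assert(vcpu != NULL);

    /* A WRMSR executed by the guest itself is rejected. */
    if (!host_initiated)
        return 1;

    /* Only a host-initiated write (QEMU via KVM_SET_MSRS) is stored. */
    vcpu->arch_capabilities = data;
    return 0;
}
```

A guest-initiated call leaves the stored value unchanged, so the guest always sees whatever QEMU last wrote, or the value set at vCPU creation if QEMU never wrote one.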
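The CPUID instruction
Write the query into EAX (and into ECX for leaves that take a sub-leaf), execute CPUID, and the result appears in EAX, EBX, ECX, and EDX.
Code analysis
Executing CPUID in a guest always causes a VM exit. KVM then handles the CPUID.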
handle_cpuid() => kvm_emulate_cpuid()
int kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
{
    u32 eax, ebx, ecx, edx;
    if (cpuid_fault_enabled(vcpu) && !kvm_require_cpl(vcpu, 0))
        return 1;
    eax = kvm_rax_read(vcpu); // read the vCPU's RAX
    ecx = kvm_rcx_read(vcpu); // read the vCPU's RCX
    kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, true);
    kvm_rax_write(vcpu, eax);
    kvm_rbx_write(vcpu, ebx);
    kvm_rcx_write(vcpu, ecx);
    kvm_rdx_write(vcpu, edx);
    return kvm_skip_emulated_instruction(vcpu);
}
bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
               u32 *ecx, u32 *edx, bool check_limit)
{
    u32 function = *eax, index = *ecx;
    struct kvm_cpuid_entry2 *best;
    bool entry_found = true;
    best = kvm_find_cpuid_entry(vcpu, function, index);
    if (!best) {
        entry_found = false;
        if (!check_limit)
            goto out;
        best = check_cpuid_limit(vcpu, function, index);
    }
out:
    if (best) {
        *eax = best->eax;
        *ebx = best->ebx;
        *ecx = best->ecx;
        *edx = best->edx;
    } else
        *eax = *ebx = *ecx = *edx = 0;
    trace_kvm_cpuid(function, *eax, *ebx, *ecx, *edx, entry_found);
    return entry_found;
}
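The key function is kvm_find_cpuid_entry(), which looks for the CPUID entry that QEMU wrote into KVM. If the entry exists, it is returned as the CPUID result. If it does not exist and check_limit is true, KVM checks whether the value passed in EAX exceeds the largest leaf the vCPU accepts; if it does, KVM returns the CPUID values of the highest leaf the vCPU supports.
What matters most, then, is this "entry", which is written by QEMU.
The rough process is:
- QEMU reads the list of CPUID entries supported by the host through ioctl(KVM_GET_SUPPORTED_CPUID).
- QEMU removes the CPUID features it does not support by ANDing the values with its own masks.
- Finally, QEMU writes the CPUID data into KVM through ioctl(KVM_SET_CPUID2) for the guest to use.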
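The lookup with its out-of-range fallback can be modeled in a few lines of C. This is a simplified sketch, not the kernel code: struct toy_entry and the toy_* functions are hypothetical, only the basic leaf range is modeled (the real check_cpuid_limit() also deals with the 80000000H range), and the index_significant field is an assumed stand-in for the per-entry flag that decides whether ECX takes part in the match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_entry {
    uint32_t function, index;
    bool     index_significant;   /* does ECX select a sub-leaf? */
    uint32_t eax, ebx, ecx, edx;
};

static const struct toy_entry *toy_find(const struct toy_entry *t, size_t n,
                                        uint32_t function, uint32_t index)
{
    for (size_t i = 0; i < n; i++)
        if (t[i].function == function &&
            (!t[i].index_significant || t[i].index == index))
            return &t[i];
    return NULL;
}

/* Fallback: a leaf above the maximum basic leaf maps to the maximum leaf. */
static const struct toy_entry *toy_check_limit(const struct toy_entry *t,
                                               size_t n, uint32_t function,
                                               uint32_t index)
{
    const struct toy_entry *max = toy_find(t, n, 0, 0); /* leaf 0: EAX = max leaf */

    if (!max || max->eax >= function)
        return NULL;               /* in range but missing: report zeros */
    return toy_find(t, n, max->eax, index);
}

/* Returns true only if the requested leaf itself was found. */
static bool toy_cpuid(const struct toy_entry *t, size_t n, uint32_t *eax,
                      uint32_t *ebx, uint32_t *ecx, uint32_t *edx,
                      bool check_limit)
{
    uint32_t function = *eax, index = *ecx;
    const struct toy_entry *best = toy_find(t, n, function, index);
    bool found = best != NULL;

    if (!best && check_limit)
        best = toy_check_limit(t, n, function, index);

    if (best) {
        *eax = best->eax; *ebx = best->ebx;
        *ecx = best->ecx; *edx = best->edx;
    } else {
        *eax = *ebx = *ecx = *edx = 0;
    }
    return found;
}
```

As in kvm_cpuid(), the return value reports whether the requested leaf was found, independently of whether the fallback supplied register values.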
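The middle step, the AND, can be illustrated with a toy model. This is not QEMU code: struct toy_cpuid_entry and toy_filter_features() are hypothetical, and the sketch assumes a leaf whose ECX and EDX are pure feature bitmaps (as in leaf 1), leaving the other registers as reported by the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_cpuid_entry {
    uint32_t function, index;
    uint32_t eax, ebx, ecx, edx;
};

/*
 * Keep only the feature bits that both the host (as reported by KVM)
 * and the CPU model requested from QEMU allow.
 */
static struct toy_cpuid_entry
toy_filter_features(const struct toy_cpuid_entry *host,
                    uint32_t wanted_ecx, uint32_t wanted_edx)
{
    struct toy_cpuid_entry guest;

    assert(host != NULL);
    guest = *host;                 /* start from what the host supports */
    guest.ecx &= wanted_ecx;       /* drop bits the CPU model does not want */
    guest.edx &= wanted_edx;
    return guest;
}
```

A bit survives only if it is set on both sides, so the guest can never be shown a feature the host cannot back, and a feature QEMU does not request stays hidden even when the host has it.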