由于uvc驱动函数缺少return语句而导致内核oops的一例
一、实验环境
1、软件
a) Vmware版本:Vmware Workstation 12.5.7
b) Ubuntu版本:9.10
c) 内核版本:2.6.31.14
d) gcc版本:4.4.1
e) gdb版本:7.0
2、摄像头硬件
百问网自制uvc摄像头
3、排查过程中,使用到的工具
a) printk
b) objdump
c) strace
d)gdb
二、前言
用C语言写程序时,如果定义一个带返回值的函数,但在函数体最后却缺少了return 语句, 程序编译并运行起来后,有时会产生意想不到的严重后果!这事以前只在教科书里看到过,纸上得来终觉浅,所以一直没当回事。但这次在学习韦东山嵌入式培训视频(3期项目实战之USB摄像头监控)时,真切地接受了一次教训。兹记录下整个入坑和出坑的经过,希望对自己和大家都有所助益。
三、现象描述
仿照视频教程,自己写了一个简化版的uvc摄像头驱动。在insmod my_uvc.ko,然后运行xawtv时,不幸发生了内核Oops,详细信息如下:
[ 657.966482] BUG: unable to handle kernel paging request at fffffff4
[ 657.966486] IP: [<c0127ba8>] __ticket_spin_lock+0x8/0x20
[ 657.966491] *pde = 0081d067 *pte = 00000000
[ 657.966493] Oops: 0002 [#1] SMP
[ 657.966495] last sysfs file: /sys/devices/virtual/video4linux/video0/dev
[ 657.966498] Modules linked in: my_uvc nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc snd_ens1371 gameport
... 此处省略无关信息若干行
[ 657.966519]
[ 657.966522] Pid: 5059, comm: xawtv.bin Not tainted (2.6.31-14-generic #48-Ubuntu) VMware Virtual Platform
[ 657.966523] EIP: 0060:[<c0127ba8>] EFLAGS: 00210082 CPU: 0
[ 657.966525] EIP is at __ticket_spin_lock+0x8/0x20
[ 657.966527] EAX: fffffff4 EBX: 00200282 ECX: fffffff4 EDX: 00000100
[ 657.966528] ESI: deb69bf4 EDI: defcba00 EBP: deb69ad0 ESP: deb69ad0
[ 657.966529] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 657.966531] Process xawtv.bin (pid: 5059, ti=deb68000 task=df319920 task.ti=deb68000)
[ 657.966532] Stack:
[ 657.966533] deb69ad8 c0127c38 deb69aec c05707da fffffff4 deb69bf4 defcba00 deb69b00
[ 657.966536] <0> c015c58d 00000000 defcba00 defcba00 deb69b24 c01f64e0 fffffff4 00012f50
[ 657.966539] <0> deb47b80 deb69b24 deb69bd0 ffffffa8 defcba00 deb69b34 e0a8b234 de979a00
[ 657.966543] Call Trace:
[ 657.966545] [<c0127c38>] ? default_spin_lock_flags+0x8/0x10
[ 657.966548] [<c05707da>] ? _spin_lock_irqsave+0x2a/0x40
[ 657.966550] [<c015c58d>] ? add_wait_queue+0x1d/0x50
[ 657.966553] [<c01f64e0>] ? __pollwait+0x60/0xd0
[ 657.966556] [<e0a8b234>] ? myuvc_poll+0x34/0x50 [my_uvc]
[ 657.966559] [<e092d14a>] ? v4l2_poll+0x3a/0x50 [videodev]
[ 657.966562] [<c01f618b>] ? do_select+0x36b/0x660
... 此处省略无关信息若干行
[ 657.966650] [<c01f6ebc>] ? sys_select+0x2c/0xb0
[ 657.966652] [<c010336c>] ? syscall_call+0x7/0xb
[ 657.966653] Code: ff ff 90 b9 5a 7a 12 c0 b8 5d 7a 12 c0 e9 59 ff ff ff 90 b9 60 7a 12 c0 b8 63 7a 12 c0 e9 49 ff ff ff 90 55 ba 00 01 00 00 89 e5 <3e> 66 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 5d c3 8d b4 26 00
[ 657.966672] EIP: [<c0127ba8>] __ticket_spin_lock+0x8/0x20 SS:ESP 0068:deb69ad0
[ 657.966676] CR2: 00000000fffffff4
[ 657.966682] ---[ end trace 672c8069f4e9d743 ]---
四、排查过程
1、回忆起之前曾经成功用这个my_uvc驱动过xawtv,而这次发生kernel oops的代码,唯一的改动就是在myuvc_vidioc_try_fmt_vid_cap里加了一句:
printk(KERN_CRIT"frames[frame_idx].width:%d, frames[frame_idx].height:%d\n",frames[frame_idx].width, frames[frame_idx].height);
但为什么一个printk会造成kernel oops呢?一头雾水。。。
2、线索1:用objdump查看反汇编代码
a) 由于缺少return语句,从而导致kernel oops的代码
0000036d <myuvc_vidioc_try_fmt_vid_cap>: 36d: 55 push %ebp 36e: 89 e5 mov %esp,%ebp 370: 53 push %ebx 371: 83 ec 0c sub $0xc,%esp 374: 89 cb mov %ecx,%ebx 376: 83 39 01 cmpl $0x1,(%ecx) 379: 75 51 jne 3cc <myuvc_vidioc_try_fmt_vid_cap+0x5f> ;if(f->type != V4L2_BUF_TYPE_VIDEO_CAPTURE) 37b: 81 79 0c 4d 4a 50 47 cmpl $0x47504a4d,0xc(%ecx) 382: 75 48 jne 3cc <myuvc_vidioc_try_fmt_vid_cap+0x5f> ;if(f->fmt.pix.pixelformat != V4L2_PIX_FMT_MJPEG) 384: c7 41 04 40 01 00 00 movl $0x140,0x4(%ecx) ;f->fmt.pix.width = frames[frame_idx].width; 38b: c7 41 08 f0 00 00 00 movl $0xf0,0x8(%ecx) ;f->fmt.pix.height = frames[frame_idx].height; 392: c7 44 24 08 f0 00 00 movl $0xf0,0x8(%esp) 399: 00 39a: c7 44 24 04 40 01 00 movl $0x140,0x4(%esp) 3a1: 00 3a2: c7 04 24 00 02 00 00 movl $0x200,(%esp) 3a9: e8 fc ff ff ff call 3aa <myuvc_vidioc_try_fmt_vid_cap+0x3d> ;printk(KERN_CRIT"frames[frame_idx].width...); 3ae: c7 43 14 00 00 00 00 movl $0x0,0x14(%ebx) ;f->fmt.pix.bytesperline = 0; 3b5: c7 43 18 00 2e 01 00 movl $0x12e00,0x18(%ebx) ;f->fmt.pix.sizeimage = dwMaxVideoFrameSize; 3bc: c7 43 10 01 00 00 00 movl $0x1,0x10(%ebx) ;f->fmt.pix.field = V4L2_FIELD_NONE; 3c3: c7 43 1c 08 00 00 00 movl $0x8,0x1c(%ebx) ;f->fmt.pix.colorspace = V4L2_COLORSPACE_SRGB; 3ca: eb 05 jmp 3d1 <myuvc_vidioc_try_fmt_vid_cap+0x64> ;由于C代码缺少return语句,导致汇编代码直接旁路掉了下面这句mov $0x0,%eax! 3cc: b8 00 00 00 00 mov $0x0,%eax ;return 0; 3d1: 83 c4 0c add $0xc,%esp 3d4: 5b pop %ebx 3d5: 5d pop %ebp 3d6: c3 ret
b) 而加上了return语句,正常运行的代码
0000036d <myuvc_vidioc_try_fmt_vid_cap>: 36d: 55 push %ebp 36e: 89 e5 mov %esp,%ebp 370: 53 push %ebx 371: 83 ec 0c sub $0xc,%esp 374: 89 cb mov %ecx,%ebx 376: 83 39 01 cmpl $0x1,(%ecx) 379: 75 4f jne 3ca <myuvc_vidioc_try_fmt_vid_cap+0x5d> ;if(f->type != V4L2_BUF_TYPE_VIDEO_CAPTURE) 37b: 81 79 0c 4d 4a 50 47 cmpl $0x47504a4d,0xc(%ecx) 382: 75 46 jne 3ca <myuvc_vidioc_try_fmt_vid_cap+0x5d> ;if(f->fmt.pix.pixelformat != V4L2_PIX_FMT_MJPEG) 384: c7 41 04 40 01 00 00 movl $0x140,0x4(%ecx) ;f->fmt.pix.width = frames[frame_idx].width; 38b: c7 41 08 f0 00 00 00 movl $0xf0,0x8(%ecx) ;f->fmt.pix.height = frames[frame_idx].height; 392: c7 44 24 08 f0 00 00 movl $0xf0,0x8(%esp) 399: 00 39a: c7 44 24 04 40 01 00 movl $0x140,0x4(%esp) 3a1: 00 3a2: c7 04 24 00 02 00 00 movl $0x200,(%esp) 3a9: e8 fc ff ff ff call 3aa <myuvc_vidioc_try_fmt_vid_cap+0x3d> ;printk(KERN_CRIT"frames[frame_idx].width...); 3ae: c7 43 14 00 00 00 00 movl $0x0,0x14(%ebx) ;f->fmt.pix.bytesperline = 0; 3b5: c7 43 18 00 2e 01 00 movl $0x12e00,0x18(%ebx) ;f->fmt.pix.sizeimage = dwMaxVideoFrameSize; 3bc: c7 43 10 01 00 00 00 movl $0x1,0x10(%ebx) ;f->fmt.pix.field = V4L2_FIELD_NONE; 3c3: c7 43 1c 08 00 00 00 movl $0x8,0x1c(%ebx) ;f->fmt.pix.colorspace = V4L2_COLORSPACE_SRGB; 3ca: b8 00 00 00 00 mov $0x0,%eax ;由于C代码加了return语句,导致汇编代码没有旁路掉下面这句mov $0x0,%eax 3cf: 83 c4 0c add $0xc,%esp 3d2: 5b pop %ebx 3d3: 5d pop %ebp 3d4: c3 ret
经过对比,可以看出:由于C代码缺少return 语句,导致汇编代码里在函数返回前,没有正确的给eax赋0值,从而myuvc_vidioc_try_fmt_vid_cap的调用者实际得到了一个错误的返回值。
那么,是谁调用了myuvc_vidioc_try_fmt_vid_cap呢?经查,有两处:
i) my_uvc驱动里的myuvc_vidioc_s_fmt_vid_cap
ii)xawtv通过系统调用ioctl( VIDIOC_TRY_FMT)间接调用了驱动的函数
通过strace –o /dev/ttyS1 xawtv记录的日志,发现:确实是xawtv通过ioctl( VIDIOC_TRY_FMT)调用了myuvc_vidioc_try_fmt_vid_cap,并且确实得到了一个错误的返回值,相关日志信息如下:
ioctl(4, VIDIOC_TRY_FMT, 0xbfff95b0) = 79 //myuvc_vidioc_try_fmt_vid_cap由于最后缺少return语句,导致返回了非0值(至于为什么是79,且看下文)
从以上日志信息可以看出:myuvc_vidioc_try_fmt_vid_cap由于最后缺少return语句,导致返回了非0值(79)。
注:至于79的来源,经查,是由于:
printk
==>vprintk
{
… return printed_len; //这句代码会在汇编层面上把printed_len存到eax里作为返回值 }
至此可以得出初步结论:其实myuvc_vidioc_try_fmt_vid_cap起初就因缺少return语句而埋下了一颗定时炸弹,而当初运行正常只不过是因为运气好,所以这颗炸弹没被引爆而已。但这次机缘凑巧,由于printk的加入,引爆了这颗定时炸弹!
那么,接下来的问题就是:为什么xawtv调用ioctl( VIDIOC_TRY_FMT)获得了非0值后,会导致内核oops呢?再次一头雾水。。。
3、线索2:
3.1、从Oops信息并结合内核源代码,可以看出,发生内核Oops时的调用栈是:
sys_select
==>do_select
==>v4l2_poll
myuvc_poll(struct file *file, struct poll_table_struct *wait)
{
...
buf = list_first_entry(&myuvc_queue.mainqueue, struct myuvc_buffer, stream);
poll_wait(file, &buf->wait, wait);
==>__pollwait(file, &buf->wait, wait)
==>add_wait_queue(&buf->wait, &wait);
==>_spin_lock_irqsave(&buf->wait.lock, flags);
//可见:段异常的根源就是&buf->wait.lock==fffffff4,而因lock 是 buf->wait的第一个成员,所以段异常的根源亦即&buf->wait==fffffff4
...
}
3.2、根据3.1,反推,段异常应该是由于myuvc_queue.buffer成员未正确初始化。而myuvc_queue.buffer的初始化是在myuvc_vidioc_reqbufs里,
(注释中标出了程序的实际执行流)
static int myuvc_vidioc_reqbufs(struct file *file, void *priv,struct v4l2_requestbuffers *p) { int nbuffers = p->count; /* request the max size image buffer */ unsigned int bufsize = PAGE_ALIGN(myuvc_format.fmt.pix.sizeimage); //myuvc_format.fmt.pix.sizeimage居然==0! unsigned int i; void *mem = NULL; int ret; ... for (; nbuffers > 0; --nbuffers) { mem = vmalloc_32(nbuffers * bufsize); //由于bufsize==0,所以vmalloc_32分配内存失败! if (mem != NULL) break; } if (mem == NULL) { ret = -ENOMEM; goto done; //由于mem分配内存失败,所以直接跳到done标签,从而旁路掉了init_waitqueue_head(&myuvc_queue.buffer[i].wait); } ... for (i = 0; i < nbuffers; ++i) { ... init_waitqueue_head(&myuvc_queue.buffer[i].wait); }
...
myuvc_queue.count = nbuffers;
...
done:
return 0;
}
3.3、那么,myuvc_format.fmt.pix.sizeimage是在哪里赋值的呢?是在myuvc_vidioc_s_fmt_vid_cap里:
(注释中标出了程序的实际执行流)
static int myuvc_vidioc_s_fmt_vid_cap(struct file *file, void *priv, struct v4l2_format *f) { int ret = myuvc_vidioc_try_fmt_vid_cap(file, NULL, f); if (ret < 0) return ret; memcpy(&myuvc_format, f, sizeof(myuvc_format));//由于f->fmt.pix.sizeimage并没有正确赋值,所以myuvc_format.pix.sizeimage也未正确赋值 return 0; }
3.4、根据3.3,进一步分析myuvc_vidioc_try_fmt_vid_cap为什么没有给f->fmt.pix.sizeimage正确赋值
(注释中标出了程序的实际执行流)
static int myuvc_vidioc_try_fmt_vid_cap(struct file *file, void *priv, struct v4l2_format *f) { __u16 rw, rh; unsigned int d, maxd; unsigned int i; struct uvc_frame *frame = NULL; if(f->type != V4L2_BUF_TYPE_VIDEO_CAPTURE) return 0; if(f->fmt.pix.pixelformat != V4L2_PIX_FMT_MJPEG) return 0;//由于f->fmt.pix.pixelformat != V4L2_PIX_FMT_MJPEG,所以这里直接return了,从而旁路了后面的f->fmt.pix.sizeimage = dwMaxVideoFrameSize; /* 调整format的width, height, * 计算bytesperline, sizeimage */ /* 人工查看描述符, 确定支持哪几种分辨率 */ f->fmt.pix.width = frames[frame_idx].width; f->fmt.pix.height = frames[frame_idx].height; printk(KERN_CRIT"frames[frame_idx].width:%d, frames[frame_idx].height:%d\n",frames[frame_idx].width, frames[frame_idx].height); f->fmt.pix.bytesperline = 0; f->fmt.pix.sizeimage = dwMaxVideoFrameSize;
f->fmt.pix.field = V4L2_FIELD_NONE;
f->fmt.pix.colorspace = V4L2_COLORSPACE_SRGB;
//return 0;
}
3.5、那为什么传给myuvc_vidioc_try_fmt_vid_cap的参数f->fmt.pix.pixelformat 会不等于 V4L2_PIX_FMT_MJPEG呢?
通过对比strace –o /dev/ttyS1 xawtv记录下的日志,推测此次kernel oops的整个案情经过应该是:
i) 发生kernel oops时的strace日志
(注释中标出了大致的案情流程)
ioctl(4, VIDIOC_TRY_FMT, 0xbfff95b0) = 79 //xawtv第一次调用myuvc_vidioc_try_fmt_vid_cap。这次传入f->fmt.pix.pixelformat == V4L2_PIX_FMT_MJPEG。但由于后者最后缺少return语句导致返回非0值,从而导致xawtv认为这次调用失败了 ioctl(4, VIDIOC_TRY_FMT, 0x91daca4) = 0 //xawtv第二次调用myuvc_vidioc_try_fmt_vid_cap。但因这次传入f->fmt.pix.pixelformat != V4L2_PIX_FMT_MJPEG而导致后者过早return 0;从而导致f->fmt.pix.sizeimage没有被正确赋值 munmap(0xffffffff, 0) = -1 EINVAL (Invalid argument) ioctl(4, VIDIOC_S_FMT or VT_RELDISP, 0xbfff9a74) = 0 //xawtv调用myuvc_vidioc_s_fmt_vid_cap进而调用myuvc_vidioc_try_fmt_vid_cap。因传入 f->fmt.pix.pixelformat != V4L2_PIX_FMT_MJPEG而导致后者过早return 0;从而导致myuvc_vidioc_s_fmt_vid_cap里 myuvc_format.pix.sizeimage也未被正确赋值 ioctl(4, VIDIOC_TRY_FMT, 0xbfff95b0) = 79 //推理同上
ioctl(4, VIDIOC_TRY_FMT, 0x91daca4) = 0 //推理同上
... ioctl(4, VIDIOC_REQBUFS or VT_DISALLOCATE, 0x91dad80) = 0 //myuvc_vidioc_reqbufs,由于先前myuvc_vidioc_s_fmt_vid_cap里 myuvc_format.fmt.pix.sizeimage=0,从而导致这里 mem = vmalloc_32(nbuffers * bufsize) == NULL; 而过早的goto done;从而旁路了后面的init_waitqueue_head(&myuvc_queue.buffer[i].wait);和myuvc_queue.count = nbuffers;
ioctl(4, VIDIOC_QUERYBUF or VT_RESIZE, 0x91dad94) = -1 EINVAL (Invalid argument)
/*这里返回-1是因为
myuvc_vidioc_querybuf(struct file *file, void *priv, struct v4l2_buffer *v4l2_buf)
{
int ret = 0;
if (v4l2_buf->index >= myuvc_queue.count) { //由于myuvc_queue.count = nbuffers这句话在myuv c_vidioc_reqbufs里被过早的goto给旁路了
ret = -EINVAL;
goto done;
}
…
done:
return ret;
}
*/
...
write(2, "libv4l2: error querying buffer 0"..., 51) = 51 //于是,xawtv调ioctl(VIDIOC_QUERYBUF)失败了!
...
select(5, [4], NULL, NULL, {5, 0}[ 946.196020] BUG: unable to handle kernel paging request at fffffff4 //终于kernel oops了!
ii) 作为对比,贴出myuvc_vidioc_try_fmt_vid_cap加了return语句后,xawtv正常运行时的strace日志
ioctl(4, VIDIOC_TRY_FMT, 0xbf936760) = 0 //xawtv第一次调用myuvc_vidioc_try_fmt_vid_cap。这次传入f->fmt.pix.pixelformat == V4L2_PIX_FMT_MJPEG。所以f->fmt.pix.sizeimage被正确赋值。并且由于myuvc_vidioc_try_fmt_vid_cap加上了return语句所以返回0值,从而xawtv认为这次调用成功了 ioctl(4, VIDIOC_TRY_FMT, 0xbf936760) = 0 //推理同上
ioctl(4, VIDIOC_S_FMT or VT_RELDISP, 0xbfbfdc74) = 0//xawtv调用myuvc_vidioc_s_fmt_vid_cap进而调用myuvc_vidioc_try_fmt_vid_cap。因f->fmt.pix.sizeimage被正确赋值,所以myuvc_vidioc_s_fmt_vid_cap里 myuvc_format.pix.sizeimage也被正确赋值了
ioctl(4, VIDIOC_REQBUFS or VT_DISALLOCATE, 0xbf936e1c) = 0//myuvc_vidioc_reqbufs,由于先前myuvc_vidioc_s_fmt_vid_cap里 myuvc_format.fmt.pix.sizeimage被正确赋值,所以这里 mem = vmalloc_32(nbuffers * bufsize) 成功分配到了内存; 从而流程正常走到后面的init_waitqueue_head(&myuvc_queue.buffer[i].wait);和myuvc_queue.count = nbuffers; ioctl(4, VIDIOC_QUERYBUF or VT_RESIZE, 0xbf936d2c) = 0 //这次返回了0,而不是-1。说明xawtv调用ioctl(VIDIOC_QUERYBUF)成功了 mmap2(NULL, 77312, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0x39) = 0xb6dd4000
//以下开始获取摄像头图像 ioctl(4, VIDIOC_QBUF, 0xbf936d2c) = 0 //video_buf入队 ioctl(4, VIDIOC_STREAMON, 0xbf936d6c) = 0 //开启摄像头的数据流,开始采集摄像头图像 ioctl(4, VIDIOC_DQBUF, 0xbf936dd8) = -1 EINVAL (Invalid argument) //缓冲区出队失败(可能因为摄像头数据暂时还没生成) write(2, "libv4l2: error dequeuing buf: In"..., 47) = 47 write(2, "v4l2: read: Invalid argument\n", 29) = 29 poll([{fd=3, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) writev(3, [{"\2\0\4\0\374\0\0\0\0\10\0\0\0\0\0\0=\1\4\0b\0\300\3\0\0\0\0\0\0\0\0"..., 36}, {NULL, 0}, {"", 0}], 3) = 36 poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
select(5, [4], NULL, NULL, {5, 0}) = 1 (in [4], left {4, 999996}) //摄像头数据准备就绪
ioctl(4, VIDIOC_DQBUF, 0xbfbfe220) = 0 //video_buf出队
...
五、总结
仅仅是因为在函数最后缺少了return 语句,竟然引发了这么多的麻烦,实在是没有想到!总结下来的教训就是:以后在写代码的时候,一定要注意不能遗漏return语句。并且在编译代码的时候,对编译器抛出的”control reaches end of non-void function”警告,不能忽视,必须去修正代码,直到这类警告消失为止。这次的教训非常深刻,不过从另一方面来说,可能也是件好事,一则增长了分析问题的能力,另外也增加了使用各种工具调试bug的实战经验:-)