OOM killer ----- tracking down the culprit behind "Out of Memory, process killed"
1. Root cause
man malloc
By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL
there is no guarantee that the memory really is available. This is a really bad bug. In case it turns out that the
system is out of memory, one or more processes will be killed by the infamous OOM killer. In case Linux is employed
under circumstances where it would be less desirable to suddenly lose some randomly picked processes, and moreover
the kernel version is sufficiently recent, one can switch off this overcommitting behavior using a command like:
# echo 2 > /proc/sys/vm/overcommit_memory
By default, Linux follows an optimistic memory-allocation strategy: even when malloc() returns a non-NULL pointer, there is no guarantee that the memory is actually available, because programs do not necessarily touch the memory they request right away (the man page bluntly calls this a bug in malloc). When processes later use the overcommitted memory and the system runs out, the infamous OOM killer kills processes to keep the system alive.
/proc/sys/vm/overcommit_memory 0: heuristic policy. Blatant overcommit is refused, while mild overcommit is allowed. (The exact heuristic can behave differently when a security module such as SELinux or SMACK is enabled.)
/proc/sys/vm/overcommit_memory 1: always allow overcommit.
/proc/sys/vm/overcommit_memory 2: never overcommit. The system will not commit more memory than swap + RAM * ratio (/proc/sys/vm/overcommit_ratio, 50% by default, tunable). Once this limit is exhausted, further allocation requests return an error, which can mean no new program can be started.
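The difference between these modes is easy to see with a small test program (a hypothetical sketch, not from the original text): under modes 0 and 1 the mallocs usually succeed long before any memory is actually committed, and the OOM killer strikes only when memset() faults the pages in; under mode 2 malloc itself fails once the commit limit is reached.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        size_t chunk = 1UL << 30;               /* 1 GiB per request */
        int i;

        for (i = 0; i < 64; i++) {
                char *p = malloc(chunk);        /* optimistic: may "succeed" with no backing */
                if (p == NULL) {                /* under overcommit_memory=2 this fails early */
                        printf("malloc failed after %d GiB\n", i);
                        return 1;
                }
                memset(p, 0, chunk);            /* fault pages in: here the OOM killer can strike */
                printf("touched %d GiB\n", i + 1);
        }
        return 0;
}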
The OOM killer picks its victim by a score computed from memory consumption, CPU time (utime + stime), run time (uptime - start time), and oom_adj. oom_adj is the OOM weight, exposed in /proc/<pid>/oom_adj; it ranges from -17 to +15, and the higher the value, the more likely the process is to be killed.
The more memory a process consumes, the higher its score; the longer it has lived, the lower its score. The overall policy: lose the minimum amount of work done, recover a large amount of memory, avoid killing innocent processes that merely happen to use a lot of memory, and kill as few processes as possible.
In addition, when Linux computes a process's memory consumption, half of the memory used by each child process is added to the parent. Processes that fork many children should therefore beware.
See the badness() function in oom_killer.c (mm/oom_kill.c in the kernel tree):
/**
* badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
* @uptime: current uptime in seconds
* @mem: target memory controller
*
* The formula used is relatively simple and documented inline in the
* function. The main rationale is that we want to select a good task
* to kill when we run out of memory.
*
* Good in this context means that:
* 1) we lose the minimum amount of work done
* 2) we recover a large amount of memory
* 3) we don't kill anything innocent of eating tons of memory
* 4) we want to kill the minimum amount of processes (one)
* 5) we try to kill the process the user expects us to kill, this
* algorithm has been meticulously tuned to meet the principle
* of least surprise ... (be careful when you change it)
*/
unsigned long badness(struct task_struct *p, unsigned long uptime)
{
        unsigned long points, cpu_time, run_time, s;
        struct mm_struct *mm;
        struct task_struct *child;

        task_lock(p);
        mm = p->mm;
        if (!mm) {
                task_unlock(p);
                return 0;
        }

        /*
         * The memory size of the process is the basis for the badness.
         */
        points = mm->total_vm;

        /*
         * After this unlock we can no longer dereference local variable `mm'.
         */
        task_unlock(p);

        /*
         * swapoff can easily use up all memory, so kill those first.
         */
        if (p->flags & PF_SWAPOFF)
                return ULONG_MAX;

        /*
         * Processes which fork a lot of child processes are likely
         * a good choice. We add half the vmsize of the children if they
         * have an own mm. This prevents forking servers to flood the
         * machine with an endless amount of children. In case a single
         * child is eating the vast majority of memory, adding only half
         * to the parents will make the child our kill candidate of choice.
         */
        list_for_each_entry(child, &p->children, sibling) {
                task_lock(child);
                if (child->mm != mm && child->mm)
                        points += child->mm->total_vm/2 + 1;
                task_unlock(child);
        }

        /*
         * CPU time is in tens of seconds and run time is in thousands
         * of seconds. There is no particular reason for this other than
         * that it turned out to work very well in practice.
         */
        cpu_time = (cputime_to_jiffies(p->utime) + cputime_to_jiffies(p->stime))
                >> (SHIFT_HZ + 3);

        if (uptime >= p->start_time.tv_sec)
                run_time = (uptime - p->start_time.tv_sec) >> 10;
        else
                run_time = 0;

        s = int_sqrt(cpu_time);
        if (s)
                points /= s;
        s = int_sqrt(int_sqrt(run_time));
        if (s)
                points /= s;

        /*
         * Niced processes are most likely less important, so double
         * their badness points.
         */
        if (task_nice(p) > 0)
                points *= 2;

        /*
         * Superuser processes are usually more important, so we make it
         * less likely that we kill those.
         */
        if (__capable(p, CAP_SYS_ADMIN) || __capable(p, CAP_SYS_RESOURCE))
                points /= 4;

        /*
         * We don't want to kill a process with direct hardware access.
         * Not only could that mess up the hardware, but usually users
         * tend to only have this flag set on applications they think
         * of as important.
         */
        if (__capable(p, CAP_SYS_RAWIO))
                points /= 4;

        /*
         * If p's nodes don't overlap ours, it may still help to kill p
         * because p may have allocated or otherwise mapped memory on
         * this node before. However it will be less likely.
         */
        if (!cpuset_mems_allowed_intersects(current, p))
                points /= 8;

        /*
         * Adjust the score by oomkilladj.
         */
        if (p->oomkilladj) {
                if (p->oomkilladj > 0) {
                        if (!points)
                                points = 1;
                        points <<= p->oomkilladj;
                } else
                        points >>= -(p->oomkilladj);
        }

        return points;
}
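To make the scoring concrete, here is a worked example with hypothetical numbers (all values invented for illustration). Take a process whose total_vm is 262144 pages (1 GiB of address space with 4 KiB pages), with cpu_time = 100 and run_time = 10000 after the scaling shifts above:

        points  = 262144
        points /= int_sqrt(100)              -> 262144 / 10 = 26214
        points /= int_sqrt(int_sqrt(10000))  -> 26214 / 10  = 2621
        nice > 0:       points *= 2          -> 5242
        oomkilladj = 2: points <<= 2         -> 20968

A niced process with a positive oom_adj thus scores roughly an order of magnitude higher than an otherwise identical long-running root daemon, which would instead have its score divided by 4.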
2. Possible causes on the application side
1) The kernel has genuinely run out of memory. If SwapFree and MemFree in /proc/meminfo are both very low (each below 1%), excessive load is the cause.
2) If LowFree is very low while HighFree is much higher, the 32-bit architecture is to blame; a 64-bit kernel or platform fares much better.
3) A kernel data structure is growing, or memory is leaking. Questions to ask:
Which objects occupy the most space in /proc/slabinfo? awk '{printf "%5d MB %s\n", $3*$4/(1024*1024), $1}' < /proc/slabinfo | sort -n
If a single object type accounts for most of system memory, it is probably the culprit; inspect the subsystem that owns that object (see the slabinfo sketch after this list).
What are the values of SwapFree and MemFree?
How many task_struct objects are there? Has the system simply forked too many processes?
4) The kernel cannot use the swap partition properly.
If an application uses mlock() or hugetlbfs pages, it may be unable to use swap space; in that case SwapFree can still be high when the OOM occurs. Both mlock and hugetlbfs pages prevent the memory they cover from being swapped out, so overusing them can exhaust RAM and leave the system with no other resources (see the mlock sketch after this list).
It is also possible for the system to deadlock on memory: writing data out to disk itself requires allocating memory for all sorts of I/O data structures, so if the system cannot find even that memory, the very functions used to produce free memory will fail. Minor tuning can make paging start earlier, but if the system cannot write dirty pages out fast enough to free memory, one can only conclude that the workload is mis-sized for the installed memory and there is little to be done. Raising /proc/sys/vm/min_free_kbytes makes the system start reclaiming memory earlier and makes it harder to fall into this deadlock; it is a good value to tune if your OOMs are deadlock-related.
5) The kernel made a wrong decision (bad accounting or bad data) and declared OOM while plenty of RAM was still free.
6) Something pathological happened: the kernel spent an extremely long time scanning memory trying to free pages. In the 2.6.19 kernel, "extremely long" means the VM has scanned six times the number of active + inactive pages in the same zone.
If the kernel is scanning pages quickly but your I/O devices (swap, filesystem, network fs) are too slow, the kernel can conclude that no progress is being made and trigger an OOM even though swap space is still free.
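For cause 3), here is a minimal C sketch (a hypothetical helper, equivalent to the awk one-liner above) that reads /proc/slabinfo and prints how much memory each cache uses. It assumes slabinfo version 2.x, where the columns after the cache name are active_objs, num_objs, and objsize in bytes:

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[512];
        FILE *f = fopen("/proc/slabinfo", "r");

        if (!f) {
                perror("fopen /proc/slabinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                char name[64];
                unsigned long active, num, objsize;

                /* skip the two header lines; real entries start with a cache name */
                if (line[0] == '#' || strncmp(line, "slabinfo", 8) == 0)
                        continue;
                if (sscanf(line, "%63s %lu %lu %lu", name, &active, &num, &objsize) == 4)
                        printf("%8lu KB  %s\n", num * objsize / 1024, name);
        }
        fclose(f);
        return 0;
}

Pipe its output through sort -n to rank the caches by memory consumption.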
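For cause 4), a minimal sketch (hypothetical test program) showing how mlock() pins memory so it can never be swapped out; run enough copies of it and the system will OOM while SwapFree is still high:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 256UL << 20;       /* 256 MiB, an arbitrary size */
        char *p = malloc(len);

        if (p == NULL)
                return 1;
        if (mlock(p, len) != 0) {       /* pin: these pages become unswappable */
                perror("mlock");        /* needs root or a high RLIMIT_MEMLOCK */
                return 1;
        }
        memset(p, 1, len);              /* fault the pages in */
        puts("256 MiB locked in RAM; the kernel cannot reclaim it to swap");
        pause();                        /* keep holding the locked memory */
        return 0;
}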
3. Solutions
On 32-bit x86 the kernel divides physical memory into zones:
DMA      0 ~ 16M
LowMem   16M ~ 896M
HighMem  896M ~
LowMem, also called the Normal zone, is fixed in size and cannot grow. If LowMem becomes badly fragmented, the OOM killer can fire even when plenty of HighMem remains free.
To inspect the current LowMem usage and fragmentation (in /proc/buddyinfo, column n counts free blocks of 2^n contiguous pages, so counts crowded into the leftmost columns of the Normal zone indicate fragmentation):
~ # egrep 'Low|High' /proc/meminfo
HighTotal: 6422400 kB
HighFree: 4712 kB
LowTotal: 1887416 kB
LowFree: 307404 kB
~ # cat /proc/buddyinfo
Node 0, zone DMA 3 4 4 3 2 2 2 1 0 1 3
Node 0, zone Normal 625 586 2210 1331 953 584 123 17 1 0 2
Node 0, zone HighMem 664 15 5 2 4 2 1 2 0 0 0
~ # free -lm
total used free shared buffers cached
Mem: 8115 7849 265 0 59 5132
Low: 1843 1582 260
High: 6271 6267 4
-/+ buffers/cache: 2657 5457
Swap: 0 0 0
Possible solutions:
a. Upgrade to a 64-bit system.
b. If you must stay on 32-bit, use a hugemem kernel. It splits low/high memory differently and in most cases provides enough low memory to map high memory.
c. Set /proc/sys/vm/lower_zone_protection to 250 or higher, or set vm.lower_zone_protection in /etc/sysctl.conf. This makes the kernel avoid carving allocations out of the low zone whenever they can be satisfied from high memory.
d. Turn off oom-kill: echo "0" > /proc/sys/vm/oom-kill, or set vm.oom-kill = 0 in /etc/sysctl.conf (this knob exists only on some vendor kernels).
e. Protect a single process: echo -17 > /proc/[pid]/oom_adj (see the sketch below).
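And a minimal sketch (hypothetical) of a daemon protecting itself at startup by writing -17 to its own oom_adj, equivalent to the echo above:

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/self/oom_adj", "w");     /* lowering the value requires root */

        if (f == NULL) {
                perror("fopen /proc/self/oom_adj");
                return 1;
        }
        fprintf(f, "-17\n");    /* -17 = never pick this process as the OOM victim */
        fclose(f);
        /* ... the rest of the daemon now runs OOM-exempt ... */
        return 0;
}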