postgresql 发生 oom 的分析之一
os:centos 7.4
postgresql:10.4
第一个窗口
创建表,向表中插入大量数据
$ psql
psql (10.4)
Type "help" for help.
postgres=# create table test01(id integer, val char(1000));
CREATE TABLE
postgres=# insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));
第二个窗口,观察os层面postgres用户的进程内存占用
# watch "ps -aux |head -1;ps -aux|grep -i postgres;free -m"
Every 2.0s: ps -aux |grep -i postgres Mon Aug 20 23:20:43 2018
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1570 0.0 0.0 220660 96 pts/0 S 22:53 0:00 su - postgres
postgres 1571 0.0 0.0 116720 68 pts/0 S 22:53 0:00 -bash
postgres 1629 0.0 1.3 729832 27040 pts/0 S 22:53 0:00 /usr/pgsql-10/bin/postgres -D /var/lib/pgsql/10/data
postgres 1630 0.0 0.0 170624 212 ? Ss 22:53 0:00 postgres: peiyb: logger process
postgres 1632 0.0 18.2 730124 372980 ? Ss 22:53 0:00 postgres: peiyb: checkpointer process
postgres 1633 0.0 0.1 729968 2604 ? Ss 22:53 0:00 postgres: peiyb: writer process
postgres 1634 0.0 0.8 729832 16592 ? Ds 22:53 0:00 postgres: peiyb: wal writer process
postgres 1635 0.0 0.0 730396 540 ? Ss 22:53 0:00 postgres: peiyb: autovacuum launcher process
postgres 1636 0.0 0.0 172884 272 ? Ss 22:53 0:00 postgres: peiyb: stats collector process
postgres 1637 0.0 0.0 730252 240 ? Ss 22:53 0:00 postgres: peiyb: bgworker: logical replication launcher
postgres 1649 0.0 0.0 156728 136 pts/0 S+ 22:54 0:00 ./bin/psql
postgres 1650 0.4 90.5 2051040 1854772 ? Ss 22:54 0:07 postgres: peiyb: postgres postgres [local] INSERT
root 2178 0.0 0.0 126428 480 pts/1 S+ 23:04 0:00 vi postgresql.conf
root 3169 0.0 0.0 153272 1276 pts/2 S+ 23:17 0:00 watch ps -aux |grep -i postgres
root 3718 0.0 0.0 153268 752 pts/2 S+ 23:20 0:00 watch ps -aux |grep -i postgres
root 3720 0.0 0.0 113176 1212 pts/2 S+ 23:20 0:00 sh -c ps -aux |grep -i postgres
root 3722 0.0 0.0 112708 952 pts/2 R+ 23:20 0:00 grep -i postgres
total used free shared buff/cache available
Mem: 2000 1401 15 506 583 42
Swap: 2047 2047 0
注意 1650 进程。
当swap用尽时,第一个窗口就出现错误
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
查看postgresql 日志
2018-08-20 23:19:47.362 CST,"postgres","postgres",1650,"[local]",5b7ad60a.672,28,"idle",2018-08-20 22:54:02 CST,3/15,0,LOG,00000,"statement: create table test01(id integer, val char(1000));",,,,,,,,,"psql"
2018-08-20 23:19:49.074 CST,"postgres","postgres",1650,"[local]",5b7ad60a.672,29,"CREATE TABLE",2018-08-20 22:54:02 CST,3/0,0,LOG,00000,"duration: 1711.734 ms",,,,,,,,,"psql"
2018-08-20 23:20:00.404 CST,"postgres","postgres",1650,"[local]",5b7ad60a.672,30,"idle",2018-08-20 22:54:02 CST,3/16,0,LOG,00000,"statement: insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));",,,,,,,,,"psql"
2018-08-20 23:20:18.603 CST,,,1632,,5b7ad5fd.660,5,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2018-08-20 23:20:34.211 CST,,,1632,,5b7ad5fd.660,6,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint complete: wrote 23014 buffers (70.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=13.311 s, sync=1.077 s, total=17.344 s; sync files=22, longest=0.250 s, average=0.029 s; distance=366925 kB, estimate=366925 kB",,,,,,,,,""
2018-08-20 23:20:50.308 CST,,,1632,,5b7ad5fd.660,7,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2018-08-20 23:21:31.226 CST,,,1632,,5b7ad5fd.660,8,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint complete: wrote 24394 buffers (74.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=38.683 s, sync=0.932 s, total=40.819 s; sync files=2, longest=0.894 s, average=0.466 s; distance=389308 kB, estimate=389308 kB",,,,,,,,,""
2018-08-20 23:21:57.251 CST,,,1632,,5b7ad5fd.660,9,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2018-08-20 23:22:57.569 CST,,,1632,,5b7ad5fd.660,10,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint complete: wrote 23210 buffers (70.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=57.202 s, sync=0.265 s, total=62.142 s; sync files=2, longest=0.208 s, average=0.132 s; distance=396971 kB, estimate=396971 kB",,,,,,,,,""
2018-08-20 23:23:30.360 CST,,,1632,,5b7ad5fd.660,11,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2018-08-20 23:23:55.686 CST,,,1629,,5b7ad5fd.65d,3,,2018-08-20 22:53:49 CST,,0,LOG,00000,"server process (PID 1650) was terminated by signal 9: Killed","Failed process was running: insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));",,,,,,,,""
2018-08-20 23:23:55.752 CST,,,1629,,5b7ad5fd.65d,4,,2018-08-20 22:53:49 CST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2018-08-20 23:23:56.236 CST,,,1635,,5b7ad5fd.663,1,,2018-08-20 22:53:49 CST,2/0,0,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
2018-08-20 23:23:56.969 CST,,,4102,,5b7add0b.1006,1,,2018-08-20 23:23:55 CST,,0,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
2018-08-20 23:23:56.970 CST,,,4105,"[local]",5b7add0c.1009,1,"",2018-08-20 23:23:56 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-20 23:23:56.978 CST,"postgres","postgres",4105,"[local]",5b7add0c.1009,2,"",2018-08-20 23:23:56 CST,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,""
2018-08-20 23:23:56.978 CST,,,1629,,5b7ad5fd.65d,5,,2018-08-20 22:53:49 CST,,0,LOG,00000,"all server processes terminated; reinitializing",,,,,,,,,""
2018-08-20 23:23:57.120 CST,,,4106,,5b7add0d.100a,1,,2018-08-20 23:23:57 CST,,0,LOG,00000,"database system was interrupted; last known up at 2018-08-20 23:22:57 CST",,,,,,,,,""
2018-08-20 23:23:58.183 CST,,,4106,,5b7add0d.100a,2,,2018-08-20 23:23:57 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
2018-08-20 23:23:58.202 CST,,,4106,,5b7add0d.100a,3,,2018-08-20 23:23:57 CST,,0,LOG,00000,"redo starts at 0/A51158C0",,,,,,,,,""
2018-08-20 23:24:08.049 CST,,,4106,,5b7add0d.100a,4,,2018-08-20 23:23:57 CST,,0,LOG,00000,"redo done at 0/C67FB9F8",,,,,,,,,""
2018-08-20 23:24:08.062 CST,,,4106,,5b7add0d.100a,5,,2018-08-20 23:23:57 CST,,0,LOG,00000,"checkpoint starting: end-of-recovery immediate",,,,,,,,,""
2018-08-20 23:24:17.215 CST,,,4106,,5b7add0d.100a,6,,2018-08-20 23:23:57 CST,,0,LOG,00000,"checkpoint complete: wrote 32768 buffers (100.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=6.935 s, sync=2.161 s, total=9.154 s; sync files=2, longest=2.160 s, average=1.080 s; distance=547737 kB, estimate=547737 kB",,,,,,,,,""
2018-08-20 23:24:17.246 CST,,,1629,,5b7ad5fd.65d,6,,2018-08-20 22:53:49 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
注意这条日志
2018-08-20 23:23:55.686 CST,,,1629,,5b7ad5fd.65d,3,,2018-08-20 22:53:49 CST,,0,LOG,00000,”server process (PID 1650) was terminated by signal 9: Killed“,”Failed process was running: insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));”,,,,,,,,”“
查看os日志
# dmesg
[ 1930.514760] Out of memory: Kill process 1650 (postgres) score 920 or sacrifice child
[ 1930.514762] Killed process 1650 (postgres) total-vm:3993492kB, anon-rss:1323268kB, file-rss:588kB, shmem-rss:545252kB
1650 进程确实是被os 的OOM killer干掉了。
查看这几个参数
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.swappiness = 30
试着调整为
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.swappiness = 1
插入数据时就提示了错误,没有被os oom killer:
postgres=# insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));
ERROR: out of memory
DETAIL: Failed on request of size 1028.
对应的log
2018-08-21 22:52:41.093 CST,"postgres","postgres",2256,"[local]",5b7c26ed.8d0,8,"INSERT",2018-08-21 22:51:25 CST,3/4,647,ERROR,53200,"out of memory","Failed on request of size 1028.",,,,,"insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));",,,"psql"
申请的报错的 1028 受这两个参数的影响:
# cat /proc/meminfo |grep -i commit
CommitLimit: 5334540 kB
Committed_AS: 3366504 kB
CommitLimit:最大能分配的内存(个人理解仅仅在vm.overcommit_memory=2时候生效),具体的值是
SWAP内存大小 + 物理内存 * overcommit_ratio / 100
Committed_AS:当前已经分配的内存大小
调整 vm.overcommit_ratio = 90
# cat /proc/meminfo |grep -i commit
CommitLimit: 5739216 kB
Committed_AS: 3397904 kB
从这个上面来看,貌似 postgresql 如果碰到大事务的数据库,session 级别发生 oom 是必然。
于是上网搜了搜,貌似可以使用 cgroups来控制内存使用。下一篇blog试下。