postgresql 发生 oom 的分析之一

os:centos 7.4
postgresql:10.4

第一个窗口
创建表,向表中插入大量数据

$ psql
psql (10.4)
Type "help" for help.

postgres=# create table test01(id integer, val char(1000)); 
CREATE TABLE
postgres=# insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));

第二个窗口,观察os层面postgres用户的进程内存占用

# watch "ps -aux |head -1;ps -aux|grep -i postgres;free -m"

Every 2.0s: ps -aux |grep -i postgres                                                                                        Mon Aug 20 23:20:43 2018

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1570  0.0  0.0 220660    96 pts/0    S    22:53   0:00 su - postgres
postgres  1571  0.0  0.0 116720    68 pts/0    S    22:53   0:00 -bash
postgres  1629  0.0  1.3 729832 27040 pts/0    S    22:53   0:00 /usr/pgsql-10/bin/postgres -D /var/lib/pgsql/10/data
postgres  1630  0.0  0.0 170624   212 ?        Ss   22:53   0:00 postgres: peiyb: logger process
postgres  1632  0.0 18.2 730124 372980 ?       Ss   22:53   0:00 postgres: peiyb: checkpointer process
postgres  1633  0.0  0.1 729968  2604 ?        Ss   22:53   0:00 postgres: peiyb: writer process
postgres  1634  0.0  0.8 729832 16592 ?        Ds   22:53   0:00 postgres: peiyb: wal writer process
postgres  1635  0.0  0.0 730396   540 ?        Ss   22:53   0:00 postgres: peiyb: autovacuum launcher process
postgres  1636  0.0  0.0 172884   272 ?        Ss   22:53   0:00 postgres: peiyb: stats collector process
postgres  1637  0.0  0.0 730252   240 ?        Ss   22:53   0:00 postgres: peiyb: bgworker: logical replication launcher
postgres  1649  0.0  0.0 156728   136 pts/0    S+   22:54   0:00 ./bin/psql
postgres  1650  0.4 90.5 2051040 1854772 ?     Ss   22:54   0:07 postgres: peiyb: postgres postgres [local] INSERT
root      2178  0.0  0.0 126428   480 pts/1    S+   23:04   0:00 vi postgresql.conf
root      3169  0.0  0.0 153272  1276 pts/2    S+   23:17   0:00 watch ps -aux |grep -i postgres
root      3718  0.0  0.0 153268   752 pts/2    S+   23:20   0:00 watch ps -aux |grep -i postgres
root      3720  0.0  0.0 113176  1212 pts/2    S+   23:20   0:00 sh -c ps -aux |grep -i postgres
root      3722  0.0  0.0 112708   952 pts/2    R+   23:20   0:00 grep -i postgres
              total        used        free  shared  buff/cache   available
Mem:           2000        1401          15         506         583          42
Swap:          2047        2047           0

注意 1650 进程。

当swap用尽时,第一个窗口就出现错误

server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 

查看postgresql 日志

2018-08-20 23:19:47.362 CST,"postgres","postgres",1650,"[local]",5b7ad60a.672,28,"idle",2018-08-20 22:54:02 CST,3/15,0,LOG,00000,"statement: create table test01(id integer, val char(1000));",,,,,,,,,"psql"
2018-08-20 23:19:49.074 CST,"postgres","postgres",1650,"[local]",5b7ad60a.672,29,"CREATE TABLE",2018-08-20 22:54:02 CST,3/0,0,LOG,00000,"duration: 1711.734 ms",,,,,,,,,"psql"
2018-08-20 23:20:00.404 CST,"postgres","postgres",1650,"[local]",5b7ad60a.672,30,"idle",2018-08-20 22:54:02 CST,3/16,0,LOG,00000,"statement: insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));",,,,,,,,,"psql"
2018-08-20 23:20:18.603 CST,,,1632,,5b7ad5fd.660,5,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2018-08-20 23:20:34.211 CST,,,1632,,5b7ad5fd.660,6,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint complete: wrote 23014 buffers (70.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=13.311 s, sync=1.077 s, total=17.344 s; sync files=22, longest=0.250 s, average=0.029 s; distance=366925 kB, estimate=366925 kB",,,,,,,,,""
2018-08-20 23:20:50.308 CST,,,1632,,5b7ad5fd.660,7,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2018-08-20 23:21:31.226 CST,,,1632,,5b7ad5fd.660,8,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint complete: wrote 24394 buffers (74.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=38.683 s, sync=0.932 s, total=40.819 s; sync files=2, longest=0.894 s, average=0.466 s; distance=389308 kB, estimate=389308 kB",,,,,,,,,""
2018-08-20 23:21:57.251 CST,,,1632,,5b7ad5fd.660,9,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2018-08-20 23:22:57.569 CST,,,1632,,5b7ad5fd.660,10,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint complete: wrote 23210 buffers (70.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=57.202 s, sync=0.265 s, total=62.142 s; sync files=2, longest=0.208 s, average=0.132 s; distance=396971 kB, estimate=396971 kB",,,,,,,,,""
2018-08-20 23:23:30.360 CST,,,1632,,5b7ad5fd.660,11,,2018-08-20 22:53:49 CST,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2018-08-20 23:23:55.686 CST,,,1629,,5b7ad5fd.65d,3,,2018-08-20 22:53:49 CST,,0,LOG,00000,"server process (PID 1650) was terminated by signal 9: Killed","Failed process was running: insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));",,,,,,,,""
2018-08-20 23:23:55.752 CST,,,1629,,5b7ad5fd.65d,4,,2018-08-20 22:53:49 CST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2018-08-20 23:23:56.236 CST,,,1635,,5b7ad5fd.663,1,,2018-08-20 22:53:49 CST,2/0,0,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
2018-08-20 23:23:56.969 CST,,,4102,,5b7add0b.1006,1,,2018-08-20 23:23:55 CST,,0,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
2018-08-20 23:23:56.970 CST,,,4105,"[local]",5b7add0c.1009,1,"",2018-08-20 23:23:56 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-20 23:23:56.978 CST,"postgres","postgres",4105,"[local]",5b7add0c.1009,2,"",2018-08-20 23:23:56 CST,,0,FATAL,57P03,"the database system is in recovery mode",,,,,,,,,""
2018-08-20 23:23:56.978 CST,,,1629,,5b7ad5fd.65d,5,,2018-08-20 22:53:49 CST,,0,LOG,00000,"all server processes terminated; reinitializing",,,,,,,,,""
2018-08-20 23:23:57.120 CST,,,4106,,5b7add0d.100a,1,,2018-08-20 23:23:57 CST,,0,LOG,00000,"database system was interrupted; last known up at 2018-08-20 23:22:57 CST",,,,,,,,,""
2018-08-20 23:23:58.183 CST,,,4106,,5b7add0d.100a,2,,2018-08-20 23:23:57 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
2018-08-20 23:23:58.202 CST,,,4106,,5b7add0d.100a,3,,2018-08-20 23:23:57 CST,,0,LOG,00000,"redo starts at 0/A51158C0",,,,,,,,,""
2018-08-20 23:24:08.049 CST,,,4106,,5b7add0d.100a,4,,2018-08-20 23:23:57 CST,,0,LOG,00000,"redo done at 0/C67FB9F8",,,,,,,,,""
2018-08-20 23:24:08.062 CST,,,4106,,5b7add0d.100a,5,,2018-08-20 23:23:57 CST,,0,LOG,00000,"checkpoint starting: end-of-recovery immediate",,,,,,,,,""
2018-08-20 23:24:17.215 CST,,,4106,,5b7add0d.100a,6,,2018-08-20 23:23:57 CST,,0,LOG,00000,"checkpoint complete: wrote 32768 buffers (100.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=6.935 s, sync=2.161 s, total=9.154 s; sync files=2, longest=2.160 s, average=1.080 s; distance=547737 kB, estimate=547737 kB",,,,,,,,,""
2018-08-20 23:24:17.246 CST,,,1629,,5b7ad5fd.65d,6,,2018-08-20 22:53:49 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""

注意这条日志
2018-08-20 23:23:55.686 CST,,,1629,,5b7ad5fd.65d,3,,2018-08-20 22:53:49 CST,,0,LOG,00000,”server process (PID 1650) was terminated by signal 9: Killed“,”Failed process was running: insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));”,,,,,,,,”“

查看os日志

# dmesg

[ 1930.514760] Out of memory: Kill process 1650 (postgres) score 920 or sacrifice child
[ 1930.514762] Killed process 1650 (postgres) total-vm:3993492kB, anon-rss:1323268kB, file-rss:588kB, shmem-rss:545252kB

1650 进程确实是被os 的OOM killer干掉了。
查看这几个参数

vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.swappiness = 30

试着调整为

vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.swappiness = 1

插入数据时就提示了错误,没有被os oom killer:

postgres=# insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));
ERROR:  out of memory
DETAIL:  Failed on request of size 1028.

对应的log

2018-08-21 22:52:41.093 CST,"postgres","postgres",2256,"[local]",5b7c26ed.8d0,8,"INSERT",2018-08-21 22:51:25 CST,3/4,647,ERROR,53200,"out of memory","Failed on request of size 1028.",,,,,"insert into test01 values(generate_series(1,5000000),repeat( chr(int4(random()*26)+65),1000));",,,"psql"

申请的报错的 1028 受这两个参数的影响:

# cat /proc/meminfo |grep -i commit

CommitLimit:     5334540 kB
Committed_AS:    3366504 kB

CommitLimit:最大能分配的内存(个人理解仅仅在vm.overcommit_memory=2时候生效),具体的值是
SWAP内存大小 + 物理内存 * overcommit_ratio / 100

Committed_AS:当前已经分配的内存大小

调整 vm.overcommit_ratio = 90

# cat /proc/meminfo |grep -i commit
CommitLimit:     5739216 kB
Committed_AS:    3397904 kB

从这个上面来看,貌似 postgresql 如果碰到大事务的数据库,session 级别发生 oom 是必然。

于是上网搜了搜,貌似可以使用 cgroups来控制内存使用。下一篇blog试下。

posted @ 2018-08-21 23:37  peiybpeiyb  阅读(768)  评论(0编辑  收藏  举报