PostgreSQL checkpoint_completion_target and the dirty-data flushing process
1) Checkpoint-related parameters
checkpoint_timeout (integer): Maximum time between automatic WAL checkpoints. If this value is specified without units, it is taken as seconds. The valid range is between 30 seconds and one day. The default is five minutes (5min). Increasing this parameter can increase the amount of time needed for crash recovery. This parameter can only be set in the postgresql.conf file or on the server command line.
checkpoint_completion_target (floating point): Specifies the target of checkpoint completion, as a fraction of total time between checkpoints. The default is 0.5. This parameter can only be set in the postgresql.conf file or on the server command line.
checkpoint_flush_after (integer): Whenever more than this amount of data has been written while performing a checkpoint, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of the checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. If this value is specified without units, it is taken as blocks, that is BLCKSZ bytes, typically 8kB. The valid range is between 0, which disables forced writeback, and 2MB. The default is 256kB on Linux, 0 elsewhere. (If BLCKSZ is not 8kB, the default and maximum values scale proportionally to it.) This parameter can only be set in the postgresql.conf file or on the server command line.
checkpoint_warning (integer): Write a message to the server log if checkpoints caused by the filling of WAL segment files happen closer together than this amount of time (which suggests that max_wal_size ought to be raised). If this value is specified without units, it is taken as seconds. The default is 30 seconds (30s). Zero disables the warning. No warnings will be generated if checkpoint_timeout is less than checkpoint_warning. This parameter can only be set in the postgresql.conf file or on the server command line.
max_wal_size (integer): Maximum size to let the WAL grow to between automatic WAL checkpoints. This is a soft limit; WAL size can exceed max_wal_size under special circumstances, such as heavy load, a failing archive_command, or a high wal_keep_segments setting. If this value is specified without units, it is taken as megabytes. The default is 1 GB. Increasing this parameter can increase the amount of time needed for crash recovery. This parameter can only be set in the postgresql.conf file or on the server command line.
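To make the interplay of these settings concrete, here is a small arithmetic sketch in Python. The helper names (checkpoint_write_window, required_write_rate, flush_after_blocks) and the figure of 100,000 dirty buffers are hypothetical and chosen only for illustration; nothing here is PostgreSQL code.

```python
def checkpoint_write_window(checkpoint_timeout_s: float, completion_target: float) -> float:
    """Seconds over which a timed checkpoint tries to spread its writes."""
    if not 0.0 <= completion_target <= 1.0:
        raise ValueError("completion_target must be between 0 and 1")
    return checkpoint_timeout_s * completion_target


def required_write_rate(dirty_buffers: int, window_s: float, block_size: int = 8192) -> float:
    """Average bytes per second needed to write all dirty buffers within the window."""
    return dirty_buffers * block_size / window_s


def flush_after_blocks(flush_after_bytes: int, block_size: int = 8192) -> int:
    """Convert a checkpoint_flush_after value given in bytes into blocks."""
    return flush_after_bytes // block_size


if __name__ == "__main__":
    # Defaults quoted above: checkpoint_timeout = 5min, completion target = 0.5
    window = checkpoint_write_window(300, 0.5)
    print(window)                                # 150.0 seconds
    # Hypothetical workload: 100,000 dirty 8kB buffers at checkpoint start
    print(required_write_rate(100_000, window))  # average bytes/second
    print(flush_after_blocks(256 * 1024))        # 32 blocks for the Linux default
```

With the defaults, the checkpointer aims to finish its writes in about half of the five-minute interval, and the Linux default for checkpoint_flush_after corresponds to 32 blocks of 8kB.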
2) Summary of the checkpoint_completion_target parameter:
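Roughly speaking, the larger checkpoint_completion_target is, the more opportunities the checkpointer process has to sleep between writes, which is how it paces the flushing of dirty blocks.
During a checkpoint, once the amount of dirty data written exceeds a threshold (checkpoint_flush_after), the checkpointer asks the OS to start writing that data back from the page cache to storage. On Linux this is a writeback hint (sync_file_range) rather than a full fsync.
So the more the checkpointer sleeps, the further apart these forced writebacks are, and the lower the I/O pressure at any one moment.
Once the checkpoint has written all of its buffers, it calls fsync on every file it touched, so that whatever is still in the page cache reaches disk.
The more the writes were spread out beforehand, the less data is left for those final fsync calls, and the smoother the I/O is at fsync time.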
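The pacing idea can be sketched with a toy simulation. This is a simplified model, assuming only the time-based part of the schedule check (PostgreSQL's own check in IsCheckpointOnSchedule also considers WAL consumption, which is left out here). The function names, the fake clock, and the write_cost_s and nap_s values are all hypothetical.

```python
def is_on_schedule(progress: float, elapsed_fraction: float, completion_target: float) -> bool:
    """Return True if the checkpoint is ahead of schedule and may sleep.

    progress: fraction of dirty buffers already written (0..1)
    elapsed_fraction: fraction of checkpoint_timeout that has elapsed
    """
    return progress * completion_target >= elapsed_fraction


def simulate_checkpoint(num_buffers: int, timeout_s: float, completion_target: float,
                        write_cost_s: float = 0.001, nap_s: float = 0.1):
    """Simulate a throttled checkpoint on a fake clock; return (total_seconds, naps)."""
    clock = 0.0
    naps = 0
    for written in range(1, num_buffers + 1):
        clock += write_cost_s                  # pretend one write() takes this long
        progress = written / num_buffers
        # Sleep for as long as we are ahead of the target schedule
        while is_on_schedule(progress, clock / timeout_s, completion_target):
            clock += nap_s
            naps += 1
    return clock, naps


if __name__ == "__main__":
    for target in (0.0, 0.5, 0.9):
        total, naps = simulate_checkpoint(1000, 300, target)
        print(f"target={target}: finished after {total:.1f}s with {naps} naps")
```

In this model a target of 0 never sleeps (the full-speed case), while larger targets stretch the same number of writes over a longer part of the interval.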
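3) A similar mechanism is used in pg_start_backup:
A checkpoint runs in one of two modes: a scheduled (throttled) checkpoint, or a full-speed checkpoint with no sleeping. The second argument of pg_start_backup selects whether to use the fast checkpoint mode; it defaults to false.
For the related logic, see: http://blog.itpub.net/6906/viewspace-2652315/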
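4) Dirty pages are handled in several steps:
First, the background writer takes pages in shared buffers that have been modified (dirty pages) and writes them to the operating system's page cache by calling write. As the function BgBufferSync shows, PostgreSQL's background writer process scans shared buffers following the LRU list (in practice only a portion per round) and issues the write system call for each dirty page it finds. The bgwriter_delay parameter controls the interval between scan rounds. After calling write on a page, the background writer records the file that page belongs to (strictly, a segment of the table; a table may have several segments, each a separate physical file) in the shared-memory array CheckpointerShmem->requests. The call sequence is: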
BackgroundWriterMain -> BgBufferSync -> SyncOneBuffer -> FlushBuffer -> smgrwrite -> mdwrite -> register_dirty_segment -> ForwardFsyncRequest
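The checkpointer process eventually reads these requests from the CheckpointerShmem->requests array and puts them into pendingOpsTable. The actual flushing of dirty pages to disk is done by the checkpointer. At each checkpoint it also calls smgrwrite to write all remaining dirty pages in shared buffers (those the background writer has not yet cleaned) into the OS page cache, and records their files in pendingOpsTable. As a result, pendingOpsTable covers every dirty page that has been written, including those handled earlier by the background writer. The checkpointer then walks pendingOpsTable and flushes the data to disk. Note that each fsync call syncs one data file of a table, and all dirty pages in that file are written to disk. The call sequence is: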
CheckPointGuts->CheckPointBuffers->->mdsync->pg_fsync->fsync
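The split between "write now, fsync later" can be sketched in a few lines of Python. This is only an analogy under stated assumptions: PendingFsyncTable is a hypothetical stand-in for pendingOpsTable, write_page stands in for a buffer write by either process, and the file names 16384 and 16384.1 are made-up segment names.

```python
import os
import tempfile


class PendingFsyncTable:
    """Toy stand-in for the checkpointer's table of files awaiting fsync."""

    def __init__(self):
        self._paths = set()

    def register(self, path):
        # Many page writes to the same file collapse into one pending fsync
        self._paths.add(path)

    def sync_all(self):
        """fsync each registered file once and return the list of synced paths."""
        synced = []
        for path in sorted(self._paths):
            fd = os.open(path, os.O_RDWR)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
            synced.append(path)
        self._paths.clear()
        return synced


def write_page(path, page_no, data, pending, page_size=8192):
    """Write one page into the OS page cache (no fsync) and register the file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.lseek(fd, page_no * page_size, os.SEEK_SET)
        os.write(fd, data.ljust(page_size, b"\0"))
    finally:
        os.close(fd)
    pending.register(path)


def demo():
    """Three page writes across two files lead to two fsync calls."""
    pending = PendingFsyncTable()
    with tempfile.TemporaryDirectory() as d:
        seg_a = os.path.join(d, "16384")
        seg_b = os.path.join(d, "16384.1")
        write_page(seg_a, 0, b"written by bgwriter", pending)
        write_page(seg_a, 1, b"written by bgwriter", pending)
        write_page(seg_b, 0, b"written by checkpointer", pending)
        return len(pending.sync_all())


if __name__ == "__main__":
    print(demo())
```

Keeping a set of files rather than a list of pages is the point of the sketch: one fsync per file covers every page written to it since the last sync.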
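If the checkpointer writes to disk too often, each round may write only a small amount of data. Disks handle large sequential writes far more efficiently than random writes, and many small writes amount to a lot of random I/O; if checkpoints are spaced further apart, several scattered pages have a chance to be merged into one sequential batch write, which is much more efficient. In addition, a checkpoint performs fsync, and a large number of fsync calls can stall system I/O and hurt stability, so checkpoints must not be too frequent. The interval cannot grow without bound either: after a crash, recovery has to replay from the last checkpoint, so an overly long interval makes recovery slow and reduces availability.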
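5) Summary:
1) The bgwriter periodically writes out some dirty data by calling write directly, then records the affected files in shared memory. fsync is not called at this point.
2) At checkpoint time, the checkpointer learns which files bgwriter has already written to, and itself only needs to write the dirty blocks that have not been written yet.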
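3) During the checkpoint, whenever more than a certain amount of dirty data has been written (checkpoint_flush_after), the OS is asked to write it back to storage. At the end, fsync ensures that all of the data is on disk.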
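4) The checkpoint flushes shared_buffers in only one way, buffered I/O, that is, write followed by fsync. The WAL writer has two families of methods for flushing wal_buffers (see wal_sync_method): the open_* methods open the WAL file with O_SYNC or O_DSYNC so that each write is synchronous, and may add O_DIRECT where the platform supports it; the f* methods (fsync, fdatasync) write to the page cache and call fsync or a similar function at commit.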
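The difference between the two families of WAL flush methods can be illustrated at the system-call level. This is a minimal sketch, assuming a POSIX-like platform: both function names are hypothetical, the O_DIRECT variant is not shown, and the fallback to a flag value of 0 on platforms without O_DSYNC or O_SYNC means the second function degrades to a plain buffered write there.

```python
import os
import tempfile


def write_then_fsync(path: str, record: bytes) -> None:
    """f* style: write into the page cache, then force it to disk with fsync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)
        os.fsync(fd)  # durability comes from this explicit call
    finally:
        os.close(fd)


def write_with_sync_flag(path: str, record: bytes) -> None:
    """open_* style: open with a sync flag so each write is itself synchronous."""
    # Assumption: fall back to O_SYNC, then to 0, where O_DSYNC is unavailable
    sync_flag = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | sync_flag, 0o600)
    try:
        os.write(fd, record)  # returns only after the data has been synced
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        wal = os.path.join(d, "wal_segment")  # hypothetical file name
        write_then_fsync(wal, b"record-1\n")
        write_with_sync_flag(wal, b"record-2\n")
        with open(wal, "rb") as f:
            print(f.read())
```

Both paths end with the record on stable storage; they differ in whether the sync is a separate call after the write or a property of the file descriptor.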