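PostgreSQL: Checking the synchronous_commit Parameter, and Some Thoughts on Streaming Replication
When reading the official manual, we often notice that the description of synchronous_commit introduces on before remote_apply, and conclude that on is a stricter level than remote_apply. It is not:
The official documentation says: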
synchronous_commit (enum) Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a “success” indication to the client. Valid values are on, remote_apply, remote_write, local, and off. The default, and safe, setting is on. When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction. For more discussion see Section 29.3. If synchronous_standby_names is non-empty, this parameter also controls whether or not transaction commits will wait for their WAL records to be replicated to the standby server(s).
When set to on, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and flushed it to disk. This ensures the transaction will not be lost unless both the primary and all synchronous standbys suffer corruption of their database storage.
When set to remote_apply, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and applied it, so that it has become visible to queries on the standby(s).
When set to remote_write, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and written it out to their operating system. This setting is sufficient to ensure data preservation even if a standby instance of PostgreSQL were to crash, but not if the standby suffers an operating-system-level crash, since the data has not necessarily reached stable storage on the standby.
Finally, the setting local causes commits to wait for local flush to disk, but not for replication. This is not usually desirable when synchronous replication is in use, but is provided for completeness. If synchronous_standby_names is empty, the settings on, remote_apply, remote_write and local all provide the same synchronization level: transaction commits only wait for local flush to disk. This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to have some transactions commit synchronously and others asynchronously. For example, to make a single multistatement transaction commit asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction.
But the order in the code is as follows:
src\backend\replication\syncrep.c SyncRepReleaseWaiters
    /*
     * Set the lsn first so that when we wake backends they will release up to
     * this location.
     */
    if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
    {
        walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
        numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
    }
    if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
    {
        walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
        numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
    }
    if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
    {
        walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
        numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
    }

    LWLockRelease(SyncRepLock);
src\include\access\xact.h
    typedef enum
    {
        SYNCHRONOUS_COMMIT_OFF,             /* asynchronous commit */
        SYNCHRONOUS_COMMIT_LOCAL_FLUSH,     /* wait for local flush only */
        SYNCHRONOUS_COMMIT_REMOTE_WRITE,    /* wait for local flush and remote
                                             * write */
        SYNCHRONOUS_COMMIT_REMOTE_FLUSH,    /* wait for local and remote flush */
        SYNCHRONOUS_COMMIT_REMOTE_APPLY     /* wait for local flush and remote apply */
    } SyncCommitLevel;

    /* Define the default setting for synchronous_commit */
    #define SYNCHRONOUS_COMMIT_ON SYNCHRONOUS_COMMIT_REMOTE_FLUSH

    /* Synchronous commit level */
    extern int synchronous_commit;
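The level that on maps to is SYNCHRONOUS_COMMIT_REMOTE_FLUSH. Judging by the order of the if statements and of the enum values, remote_apply is the last level, and therefore the strictest one.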
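To make the ordering concrete, here is a small standalone C model of how a commit level could select one of the three wait queues. The SyncCommitLevel enum is copied from the xact.h excerpt above; the WaitMode enum and the functions wait_mode_for_level and waiter_released are illustrative names invented for this sketch (loosely modeled on the SYNC_REP_WAIT_* indexes seen in syncrep.c), not PostgreSQL's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same order as in xact.h: every level is stricter than the one before it. */
typedef enum
{
    SYNCHRONOUS_COMMIT_OFF,
    SYNCHRONOUS_COMMIT_LOCAL_FLUSH,
    SYNCHRONOUS_COMMIT_REMOTE_WRITE,
    SYNCHRONOUS_COMMIT_REMOTE_FLUSH,
    SYNCHRONOUS_COMMIT_REMOTE_APPLY
} SyncCommitLevel;

#define SYNCHRONOUS_COMMIT_ON SYNCHRONOUS_COMMIT_REMOTE_FLUSH

/* Hypothetical queue indexes for this sketch. */
typedef enum
{
    WAIT_NONE = -1,   /* off / local: no waiting on the standby */
    WAIT_WRITE = 0,
    WAIT_FLUSH = 1,
    WAIT_APPLY = 2
} WaitMode;

/* Map a synchronous_commit level to the queue a committing backend waits in. */
WaitMode wait_mode_for_level(SyncCommitLevel level)
{
    switch (level)
    {
        case SYNCHRONOUS_COMMIT_REMOTE_WRITE: return WAIT_WRITE;
        case SYNCHRONOUS_COMMIT_REMOTE_FLUSH: return WAIT_FLUSH;
        case SYNCHRONOUS_COMMIT_REMOTE_APPLY: return WAIT_APPLY;
        default:                              return WAIT_NONE;
    }
}

/*
 * A waiter is released once the standby's reported position for its queue
 * (write, flush or apply) has reached the LSN of its commit record.
 */
bool waiter_released(WaitMode mode, const uint64_t reported_lsn[3],
                     uint64_t commit_lsn)
{
    if (mode == WAIT_NONE)
        return true;
    assert(mode >= WAIT_WRITE && mode <= WAIT_APPLY);
    return reported_lsn[mode] >= commit_lsn;
}
```

Since a standby writes WAL before flushing it and flushes before replaying it, the apply position normally trails the other two, so a backend waiting in the apply queue is the last to be released for a given commit LSN.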
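A few more questions still need to be confirmed:
1. Does the WAL receiver get WAL changes in units of WAL records or in units of pages?
-- Probably in units of records. Each operation generates WAL records (every insert produces a record, rather than one record per transaction; a single transaction can produce a great many records, and they are shipped to the standby even if the transaction never commits; aborting a transaction produces a record too). Each record has an LSN, and the current position can be checked with pg_current_wal_lsn() (named pg_current_xlog_location() before PostgreSQL 10). Records are shipped to the standby as they are produced, rather than waiting for the transaction to commit and then shipping all of its records at once.
2. What triggers sending a new WAL record to the standby?
-- Inserting a row produces a record, which first sits in the WAL buffers. Only after it has been written out to the WAL file (XLogBackgroundFlush in xlog.c wakes up the WAL senders) is the new WAL read back from the file and added to the TCP queue.
3. Why does the standby replay from the WAL files rather than from the WAL buffers? The WAL data is almost certainly still in memory, so this should not cause disk reads, but there is still some overhead.
-- Presumably because replaying records straight from the WAL buffers would be hard to coordinate, whereas replaying from WAL files is much simpler?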
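So a record has to be flushed to disk before it can be sent to the standby. When does it get flushed?
WAL is also flushed periodically, not only at commit time. The operations that trigger a flush of the WAL buffers are:
1) A commit.
2) A checkpoint, which makes sure that the WAL of committed transactions has reached disk.
3) The WAL writer, which flushes once wal_writer_delay has elapsed since the last flush, or earlier if at least wal_writer_flush_after worth of WAL has been produced since then.
Parameters related to the background WAL flushing process: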
wal_buffers (integer)
Default -1, which means 1/32 of shared_buffers, but no more than one WAL segment, typically 16MB.
The amount of shared memory used for WAL data that has not yet been written to disk. The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB nor more than the size of one WAL segment, typically 16MB. This value can be set manually if the automatic choice is too large or too small, but any positive value less than 32kB will be treated as 32kB. If this value is specified without units, it is taken as WAL blocks, that is XLOG_BLCKSZ bytes, typically 8kB. This parameter can only be set at server start.
The contents of the WAL buffers are written out to disk at every transaction commit, so extremely large values are unlikely to provide a significant benefit. However, setting this value to at least a few megabytes can improve write performance on a busy server where many clients are committing at once. The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.
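As a quick check of the auto-tuning rule quoted above, the following C sketch computes the size that -1 would select. It works in bytes for readability; the function name auto_wal_buffers_bytes and the KB/MB macros are made up for this example, and the 16MB segment size in the usage note is the typical value from the documentation, not a fixed constant.

```c
#include <assert.h>
#include <stdint.h>

#define KB ((uint64_t) 1024)
#define MB (1024 * KB)

/*
 * Size chosen by wal_buffers = -1 according to the documented rule:
 * shared_buffers / 32, at least 64kB, at most one WAL segment.
 */
uint64_t auto_wal_buffers_bytes(uint64_t shared_buffers_bytes,
                                uint64_t wal_segment_bytes)
{
    uint64_t size;

    /* A segment smaller than the lower bound would make the clamp ambiguous. */
    assert(wal_segment_bytes >= 64 * KB);

    size = shared_buffers_bytes / 32;
    if (size > wal_segment_bytes)
        size = wal_segment_bytes;   /* upper bound: one WAL segment */
    if (size < 64 * KB)
        size = 64 * KB;             /* lower bound: 64kB */
    return size;
}
```

With 16MB segments, shared_buffers of 128MB gives 4MB of WAL buffers, while anything from 512MB upward is capped at 16MB.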
wal_writer_delay (integer)
How often the background WAL writer flushes the WAL buffers to disk; it works together with wal_writer_flush_after. For example, if the WAL writer wakes up less than 200 ms after the last flush and finds that less than the 1MB configured by wal_writer_flush_after has been produced since then, it only writes the WAL to the operating system without flushing it. Once 200 ms have passed since the last flush, or the volume threshold has been reached, it flushes.
On a highly concurrent system, set this parameter lower; 10ms is recommended.
Specifies how often the WAL writer flushes WAL, in time terms. After flushing WAL the writer sleeps for the length of time given by wal_writer_delay, unless woken up sooner by an asynchronously committing transaction. If the last flush happened less than wal_writer_delay ago and less than wal_writer_flush_after worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. If this value is specified without units, it is taken as milliseconds. The default value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting wal_writer_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.
wal_writer_flush_after (integer)
How much new WAL must have been produced since the last flush for the WAL writer to flush again before wal_writer_delay has elapsed.
Specifies how often the WAL writer flushes WAL, in volume terms. If the last flush happened less than wal_writer_delay ago and less than wal_writer_flush_after worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. If wal_writer_flush_after is set to 0 then WAL data is always flushed immediately. If this value is specified without units, it is taken as WAL blocks, that is XLOG_BLCKSZ bytes, typically 8kB. The default is 1MB. This parameter can only be set in the postgresql.conf file or on the server command line.
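The interaction between wal_writer_delay and wal_writer_flush_after described in the documentation can be written down as a small decision function. This is only a sketch of the documented rule, not the code in xlog.c; the names WalWriterAction and walwriter_action are invented for the example, and the volume is expressed in WAL blocks (the default of 1MB is 128 blocks of 8kB).

```c
#include <assert.h>
#include <stdint.h>

typedef enum
{
    WAL_WRITE_ONLY,       /* hand WAL to the operating system, no fsync */
    WAL_WRITE_AND_FLUSH   /* write and flush to disk */
} WalWriterAction;

/*
 * Decide what the WAL writer does on one wake-up, following the documented
 * rule: write only if the last flush was less than wal_writer_delay ago AND
 * less than wal_writer_flush_after worth of WAL was produced since.
 */
WalWriterAction walwriter_action(uint64_t ms_since_last_flush,
                                 uint64_t blocks_since_last_flush,
                                 uint64_t wal_writer_delay_ms,
                                 uint64_t wal_writer_flush_after_blocks)
{
    /* wal_writer_delay has a documented unit of milliseconds; 0 makes no sense here. */
    assert(wal_writer_delay_ms > 0);

    if (wal_writer_flush_after_blocks == 0)
        return WAL_WRITE_AND_FLUSH;          /* 0 means always flush */
    if (ms_since_last_flush >= wal_writer_delay_ms)
        return WAL_WRITE_AND_FLUSH;          /* time limit reached */
    if (blocks_since_last_flush >= wal_writer_flush_after_blocks)
        return WAL_WRITE_AND_FLUSH;          /* volume limit reached */
    return WAL_WRITE_ONLY;
}
```

The point to notice is that the two limits are alternatives: either one being reached is enough to cause a flush, and the write-only path is taken only when neither has been reached.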
commit_delay (integer)
Default 0. On a highly concurrent system every commit flushes the WAL buffers to disk, and flushing that frequently does not necessarily perform well; waiting a short while lets several commits share a single flush. Setting a delay does not guarantee that it takes effect: it also depends on how many transactions are currently in progress, and the wait happens only when at least commit_siblings other transactions are active.
Setting commit_delay adds a time delay before a WAL flush is initiated. This can improve group commit throughput by allowing a larger number of transactions to commit via a single WAL flush, if system load is high enough that additional transactions become ready to commit within the given interval. However, it also increases latency by up to the commit_delay for each WAL flush. Because the delay is just wasted if no other transactions become ready to commit, a delay is only performed if at least commit_siblings other transactions are active when a flush is about to be initiated. Also, no delays are performed if fsync is disabled. If this value is specified without units, it is taken as microseconds. The default commit_delay is zero (no delay). Only superusers can change this setting.
In PostgreSQL releases prior to 9.3, commit_delay behaved differently and was much less effective: it affected only commits, rather than all WAL flushes, and waited for the entire configured delay even if the WAL flush was completed sooner. Beginning in PostgreSQL 9.3, the first process that becomes ready to flush waits for the configured interval, while subsequent processes wait only until the leader completes the flush operation.
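commit_siblings (integer)
Usually set somewhat higher; think about how many concurrent transactions a high-concurrency system runs. Whether the commit delay takes effect depends on whether the number of transactions in the system that have not yet committed reaches this value.
Minimum number of concurrent open transactions to require before performing the commit_delay delay. A larger value makes it more probable that at least one other transaction will become ready to commit during the delay interval. The default is five transactions.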
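The conditions under which commit_delay actually takes effect can be summarized in one predicate. The following C sketch encodes only what the documentation above states; the function name should_delay_before_flush and its parameters are hypothetical, and other_active_xacts stands for the number of other transactions that are active at the moment the flush is about to start.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if the process about to flush WAL should first sleep for
 * commit_delay microseconds, per the documented conditions:
 *   - commit_delay is non-zero,
 *   - fsync is enabled,
 *   - at least commit_siblings other transactions are active.
 */
bool should_delay_before_flush(int commit_delay_us,
                               int commit_siblings,
                               bool fsync_enabled,
                               int other_active_xacts)
{
    assert(commit_delay_us >= 0);
    assert(commit_siblings >= 0);
    assert(other_active_xacts >= 0);

    return commit_delay_us > 0
        && fsync_enabled
        && other_active_xacts >= commit_siblings;
}
```

With the default commit_siblings of five, a commit_delay of 100 microseconds is skipped while only four other transactions are active and applied once there are five or more.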
Do not confuse these with max_wal_size:
max_wal_size (integer)
Roughly: once this much WAL has been produced without a checkpoint in between, a checkpoint is triggered.
Maximum size to let the WAL grow to between automatic WAL checkpoints. This is a soft limit; WAL size can exceed max_wal_size under special circumstances, such as heavy load, a failing archive_command, or a high wal_keep_segments setting. If this value is specified without units, it is taken as megabytes. The default is 1 GB. Increasing this parameter can increase the amount of time needed for crash recovery. This parameter can only be set in the postgresql.conf file or on the server command line.
So when exactly are the generated records sent to the standby over the TCP connection, and what triggers it? We will look into that next time.