Improve the documentation about commit_delay.
Clarify the docs explaining what commit_delay does, and add a recommendation about a useful value for it, namely half of the single-page fsync time reported by pg_test_fsync. This is informed by testing of the new-in-9.3 implementation of commit_delay; in prior versions it was far harder to arrive at a useful setting.

In passing, do some wordsmithing and markup-fixing in the same general area.

Also, change pg_test_fsync's default time-per-test from 2 seconds to 5. The old value was about the minimum at which the results could be taken seriously at all, and so seems a tad optimistic as a default.

Peter Geoghegan, reviewed by Noah Misch; some additional editing by me
parent dcafdbcde1
commit 70ec2f8f43
@@ -60,7 +60,7 @@ do { \
 
 static const char *progname;
 
-static int secs_per_test = 2;
+static int secs_per_test = 5;
 static int needs_unlink = 0;
 static char full_buf[XLOG_SEG_SIZE],
 *buf,

@@ -1603,8 +1603,8 @@ include 'filename'
 <title>Write Ahead Log</title>
 
 <para>
-See also <xref linkend="wal-configuration"> for details on WAL
-and checkpoint tuning.
+For additional information on tuning these settings,
+see <xref linkend="wal-configuration">.
 </para>
 
 <sect2 id="runtime-config-wal-settings">
@@ -1957,7 +1957,7 @@ include 'filename'
 given interval. However, it also increases latency by up to
 <varname>commit_delay</varname> microseconds for each WAL
 flush. Because the delay is just wasted if no other transactions
-become ready to commit, it is only performed if at least
+become ready to commit, a delay is only performed if at least
 <varname>commit_siblings</varname> other transactions are active
 immediately before a flush would otherwise have been initiated.
 In <productname>PostgreSQL</> releases prior to 9.3,
@@ -1968,7 +1968,8 @@ include 'filename'
 the first process that becomes ready to flush waits for the configured
 interval, while subsequent processes wait only until the leader
 completes the flush. The default <varname>commit_delay</> is zero
-(no delay), and only honored if <varname>fsync</varname> is enabled.
+(no delay). No delays are performed unless <varname>fsync</varname>
+is enabled.
 </para>
 </listitem>
 </varlistentry>

@@ -36,8 +36,8 @@
 difference in real database throughput, especially since many database servers
 are not speed-limited by their transaction logs.
 <application>pg_test_fsync</application> reports average file sync operation
-time in microseconds for each wal_sync_method, which can be used to inform
-efforts to optimize the value of <varname>commit_delay</varname>.
+time in microseconds for each wal_sync_method, which can also be used to
+inform efforts to optimize the value of <xref linkend="guc-commit-delay">.
 </para>
 </refsect1>
 
@@ -72,8 +72,8 @@
 <para>
 Specifies the number of seconds for each test. The more time
 per test, the greater the test's accuracy, but the longer it takes
-to run. The default is 2 seconds, which allows the program to
-complete in about 30 seconds.
+to run. The default is 5 seconds, which allows the program to
+complete in under 2 minutes.
 </para>
 </listitem>
 </varlistentry>

@@ -133,7 +133,7 @@
 (<acronym>BBU</>) disk controllers. In such setups, the synchronize
 command forces all data from the controller cache to the disks,
 eliminating much of the benefit of the BBU. You can run the
-<xref linkend="pgtestfsync"> module to see
+<xref linkend="pgtestfsync"> program to see
 if you are affected. If you are affected, the performance benefits
 of the BBU can be regained by turning off write barriers in
 the file system or reconfiguring the disk controller, if that is
@@ -372,11 +372,12 @@
 asynchronous commit, but it is actually a synchronous commit method
 (in fact, <varname>commit_delay</varname> is ignored during an
 asynchronous commit). <varname>commit_delay</varname> causes a delay
-just before a synchronous commit attempts to flush
-<acronym>WAL</acronym> to disk, in the hope that a single flush
-executed by one such transaction can also serve other transactions
-committing at about the same time. Setting <varname>commit_delay</varname>
-can only help when there are many concurrently committing transactions.
+just before a transaction flushes <acronym>WAL</acronym> to disk, in
+the hope that a single flush executed by one such transaction can also
+serve other transactions committing at about the same time. The
+setting can be thought of as a way of increasing the time window in
+which transactions can join a group about to participate in a single
+flush, to amortize the cost of the flush among multiple transactions.
 </para>
 
 </sect1>
@@ -394,15 +395,16 @@
 <para>
 <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
 are points in the sequence of transactions at which it is guaranteed
-that the heap and index data files have been updated with all information written before
-the checkpoint. At checkpoint time, all dirty data pages are flushed to
-disk and a special checkpoint record is written to the log file.
-(The changes were previously flushed to the <acronym>WAL</acronym> files.)
+that the heap and index data files have been updated with all
+information written before that checkpoint. At checkpoint time, all
+dirty data pages are flushed to disk and a special checkpoint record is
+written to the log file. (The change records were previously flushed
+to the <acronym>WAL</acronym> files.)
 In the event of a crash, the crash recovery procedure looks at the latest
 checkpoint record to determine the point in the log (known as the redo
 record) from which it should start the REDO operation. Any changes made to
-data files before that point are guaranteed to be already on disk. Hence, after
-a checkpoint, log segments preceding the one containing
+data files before that point are guaranteed to be already on disk.
+Hence, after a checkpoint, log segments preceding the one containing
 the redo record are no longer needed and can be recycled or removed. (When
 <acronym>WAL</acronym> archiving is being done, the log segments must be
 archived before being recycled or removed.)
@@ -411,31 +413,32 @@
 <para>
 The checkpoint requirement of flushing all dirty data pages to disk
 can cause a significant I/O load. For this reason, checkpoint
-activity is throttled so I/O begins at checkpoint start and completes
-before the next checkpoint starts; this minimizes performance
+activity is throttled so that I/O begins at checkpoint start and completes
+before the next checkpoint is due to start; this minimizes performance
 degradation during checkpoints.
 </para>
 
 <para>
 The server's checkpointer process automatically performs
-a checkpoint every so often. A checkpoint is created every <xref
+a checkpoint every so often. A checkpoint is begun every <xref
 linkend="guc-checkpoint-segments"> log segments, or every <xref
 linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
 The default settings are 3 segments and 300 seconds (5 minutes), respectively.
-In cases where no WAL has been written since the previous checkpoint, new
-checkpoints will be skipped even if checkpoint_timeout has passed.
-If WAL archiving is being used and you want to put a lower limit on
-how often files are archived in order to bound potential data
-loss, you should adjust archive_timeout parameter rather than the checkpoint
-parameters. It is also possible to force a checkpoint by using the SQL
+If no WAL has been written since the previous checkpoint, new checkpoints
+will be skipped even if <varname>checkpoint_timeout</> has passed.
+(If WAL archiving is being used and you want to put a lower limit on how
+often files are archived in order to bound potential data loss, you should
+adjust the <xref linkend="guc-archive-timeout"> parameter rather than the
+checkpoint parameters.)
+It is also possible to force a checkpoint by using the SQL
 command <command>CHECKPOINT</command>.
 </para>
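Review note: the trigger rule in the paragraph above ("every checkpoint_segments log segments, or every checkpoint_timeout seconds, whichever comes first", skipped when no WAL has been written) can be sketched as a small predicate. This is only an illustration of the documented rule, not the checkpointer's actual code; the function name and its parameters are hypothetical, and a manually issued CHECKPOINT command is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not taken from the PostgreSQL sources: decide
 * whether an automatic checkpoint is due under the rule described in
 * the documentation text.
 */
bool
checkpoint_due(int segments_since_last, int seconds_since_last,
               int checkpoint_segments, int checkpoint_timeout,
               bool wal_written_since_last)
{
    assert(checkpoint_segments > 0);
    assert(checkpoint_timeout > 0);

    /* With no new WAL, the checkpoint is skipped even after the timeout. */
    if (!wal_written_since_last)
        return false;

    /* Otherwise, whichever limit is reached first starts a checkpoint. */
    return segments_since_last >= checkpoint_segments ||
           seconds_since_last >= checkpoint_timeout;
}
```

With the defaults quoted in the text (3 segments, 300 seconds), `checkpoint_due(1, 300, 3, 300, true)` is true because the timeout has elapsed, while the same call with `wal_written_since_last` set to false is not.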
 
 <para>
 Reducing <varname>checkpoint_segments</varname> and/or
 <varname>checkpoint_timeout</varname> causes checkpoints to occur
-more often. This allows faster after-crash recovery (since less work
-will need to be redone). However, one must balance this against the
+more often. This allows faster after-crash recovery, since less work
+will need to be redone. However, one must balance this against the
 increased cost of flushing dirty data pages more often. If
 <xref linkend="guc-full-page-writes"> is set (as is the default), there is
 another factor to consider. To ensure data page consistency,
@@ -450,7 +453,7 @@
 Checkpoints are fairly expensive, first because they require writing
 out all currently dirty buffers, and second because they result in
 extra subsequent WAL traffic as discussed above. It is therefore
-wise to set the checkpointing parameters high enough that checkpoints
+wise to set the checkpointing parameters high enough so that checkpoints
 don't happen too often. As a simple sanity check on your checkpointing
 parameters, you can set the <xref linkend="guc-checkpoint-warning">
 parameter. If checkpoints happen closer together than
@@ -498,7 +501,7 @@
 altered when building the server). You can use this to estimate space
 requirements for <acronym>WAL</acronym>.
 Ordinarily, when old log segment files are no longer needed, they
-are recycled (renamed to become the next segments in the numbered
+are recycled (that is, renamed to become future segments in the numbered
 sequence). If, due to a short-term peak of log output rate, there
 are more than 3 * <varname>checkpoint_segments</varname> + 1
 segment files, the unneeded segment files will be deleted instead
@@ -507,64 +510,108 @@
 
 <para>
 In archive recovery or standby mode, the server periodically performs
-<firstterm>restartpoints</><indexterm><primary>restartpoint</></>
+<firstterm>restartpoints</>,<indexterm><primary>restartpoint</></>
 which are similar to checkpoints in normal operation: the server forces
 all its state to disk, updates the <filename>pg_control</> file to
 indicate that the already-processed WAL data need not be scanned again,
-and then recycles any old log segment files in <filename>pg_xlog</>
-directory. A restartpoint is triggered if at least one checkpoint record
-has been replayed and <varname>checkpoint_timeout</> seconds have passed
-since last restartpoint. In standby mode, a restartpoint is also triggered
-if <varname>checkpoint_segments</> log segments have been replayed since
-last restartpoint and at least one checkpoint record has been replayed.
+and then recycles any old log segment files in the <filename>pg_xlog</>
+directory.
 Restartpoints can't be performed more frequently than checkpoints in the
 master because restartpoints can only be performed at checkpoint records.
+A restartpoint is triggered when a checkpoint record is reached if at
+least <varname>checkpoint_timeout</> seconds have passed since the last
+restartpoint. In standby mode, a restartpoint is also triggered if at
+least <varname>checkpoint_segments</> log segments have been replayed
+since the last restartpoint.
 </para>
 
 <para>
 There are two commonly used internal <acronym>WAL</acronym> functions:
-<function>LogInsert</function> and <function>LogFlush</function>.
-<function>LogInsert</function> is used to place a new record into
+<function>XLogInsert</function> and <function>XLogFlush</function>.
+<function>XLogInsert</function> is used to place a new record into
 the <acronym>WAL</acronym> buffers in shared memory. If there is no
-space for the new record, <function>LogInsert</function> will have
+space for the new record, <function>XLogInsert</function> will have
 to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-buffers. This is undesirable because <function>LogInsert</function>
+buffers. This is undesirable because <function>XLogInsert</function>
 is used on every database low level modification (for example, row
 insertion) at a time when an exclusive lock is held on affected
 data pages, so the operation needs to be as fast as possible. What
 is worse, writing <acronym>WAL</acronym> buffers might also force the
 creation of a new log segment, which takes even more
 time. Normally, <acronym>WAL</acronym> buffers should be written
-and flushed by a <function>LogFlush</function> request, which is
+and flushed by an <function>XLogFlush</function> request, which is
 made, for the most part, at transaction commit time to ensure that
 transaction records are flushed to permanent storage. On systems
-with high log output, <function>LogFlush</function> requests might
-not occur often enough to prevent <function>LogInsert</function>
+with high log output, <function>XLogFlush</function> requests might
+not occur often enough to prevent <function>XLogInsert</function>
 from having to do writes. On such systems
 one should increase the number of <acronym>WAL</acronym> buffers by
-modifying the configuration parameter <xref
-linkend="guc-wal-buffers">. When
+modifying the <xref linkend="guc-wal-buffers"> parameter. When
 <xref linkend="guc-full-page-writes"> is set and the system is very busy,
-setting this value higher will help smooth response times during the
-period immediately following each checkpoint.
+setting <varname>wal_buffers</> higher will help smooth response times
+during the period immediately following each checkpoint.
 </para>
 
 <para>
 The <xref linkend="guc-commit-delay"> parameter defines for how many
-microseconds the server process will sleep after writing a commit
-record to the log with <function>LogInsert</function> but before
-performing a <function>LogFlush</function>. This delay allows other
-server processes to add their commit records to the log so as to have all
-of them flushed with a single log sync. No sleep will occur if
-<xref linkend="guc-fsync">
-is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
-other sessions are currently in active transactions; this avoids
-sleeping when it's unlikely that any other session will commit soon.
-Note that on most platforms, the resolution of a sleep request is
-ten milliseconds, so that any nonzero <varname>commit_delay</varname>
-setting between 1 and 10000 microseconds would have the same effect.
-Good values for these parameters are not yet clear; experimentation
-is encouraged.
+microseconds a group commit leader process will sleep after acquiring a
+lock within <function>XLogFlush</function>, while group commit
+followers queue up behind the leader. This delay allows other server
+processes to add their commit records to the WAL buffers so that all of
+them will be flushed by the leader's eventual sync operation. No sleep
+will occur if <xref linkend="guc-fsync"> is not enabled, or if fewer
+than <xref linkend="guc-commit-siblings"> other sessions are currently
+in active transactions; this avoids sleeping when it's unlikely that
+any other session will commit soon. Note that on some platforms, the
+resolution of a sleep request is ten milliseconds, so that any nonzero
+<varname>commit_delay</varname> setting between 1 and 10000
+microseconds would have the same effect. Note also that on some
+platforms, sleep operations may take slightly longer than requested by
+the parameter.
+</para>
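Review note: the remark about sleep resolution is easier to see with numbers. The sketch below assumes a simple model in which the platform rounds every nonzero sleep request up to the next multiple of its timer resolution; that rounding model and the function name are assumptions made for illustration, not a description of any particular operating system or of PostgreSQL's code.

```c
#include <assert.h>

/*
 * Hypothetical helper: estimate how long a sleep request lasts when
 * sleeps are rounded up to a multiple of the timer resolution.
 * All values are in microseconds.
 */
long
effective_sleep_us(long requested_us, long resolution_us)
{
    assert(requested_us >= 0);
    assert(resolution_us > 0);

    /* commit_delay = 0 means no sleep is requested at all. */
    if (requested_us == 0)
        return 0;

    /* Round up to the next multiple of the resolution. */
    return ((requested_us + resolution_us - 1) / resolution_us) * resolution_us;
}
```

Under this model with a ten-millisecond resolution, `effective_sleep_us(1, 10000)` and `effective_sleep_us(10000, 10000)` both give 10000, which matches the statement that any setting between 1 and 10000 microseconds has the same effect on such platforms.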
+
+<para>
+Since the purpose of <varname>commit_delay</varname> is to allow the
+cost of each flush operation to be amortized across concurrently
+committing transactions (potentially at the expense of transaction
+latency), it is necessary to quantify that cost before the setting can
+be chosen intelligently. The higher that cost is, the more effective
+<varname>commit_delay</varname> is expected to be in increasing
+transaction throughput, up to a point. The <xref
+linkend="pgtestfsync"> program can be used to measure the average time
+in microseconds that a single WAL flush operation takes. A value of
+half of the average time the program reports it takes to flush after a
+single 8kB write operation is often the most effective setting for
+<varname>commit_delay</varname>, so this value is recommended as the
+starting point to use when optimizing for a particular workload. While
+tuning <varname>commit_delay</varname> is particularly useful when the
+WAL log is stored on high-latency rotating disks, benefits can be
+significant even on storage media with very fast sync times, such as
+solid-state drives or RAID arrays with a battery-backed write cache;
+but this should definitely be tested against a representative workload.
+Higher values of <varname>commit_siblings</varname> should be used in
+such cases, whereas smaller <varname>commit_siblings</varname> values
+are often helpful on higher latency media. Note that it is quite
+possible that a setting of <varname>commit_delay</varname> that is too
+high can increase transaction latency by so much that total transaction
+throughput suffers.
+</para>
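Review note: the recommended starting point (half of the average single-8kB-write flush time reported by pg_test_fsync) is simple arithmetic, sketched below. The function name is hypothetical, and the upper bound of 100000 microseconds is an assumption about the allowed range of commit_delay that should be checked against the server's documentation; the result is only a starting value to test against a representative workload.

```c
#include <assert.h>

/* Assumed upper limit for commit_delay, in microseconds. */
#define COMMIT_DELAY_MAX_US 100000

/*
 * Hypothetical helper: turn an average flush time (microseconds, as
 * read from pg_test_fsync output) into a starting commit_delay value.
 */
int
suggested_commit_delay_us(double avg_fsync_us)
{
    double half;

    assert(avg_fsync_us >= 0.0);

    /* The recommendation is half of the measured flush time. */
    half = avg_fsync_us / 2.0;

    /* Clamp to the assumed maximum, then round to the nearest integer. */
    if (half > COMMIT_DELAY_MAX_US)
        return COMMIT_DELAY_MAX_US;
    return (int) (half + 0.5);
}
```

For example, a reported average of 8000 microseconds per flush gives a starting `commit_delay` of 4000.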
+
+<para>
+When <varname>commit_delay</varname> is set to zero (the default), it
+is still possible for a form of group commit to occur, but each group
+will consist only of sessions that reach the point where they need to
+flush their commit records during the window in which the previous
+flush operation (if any) is occurring. At higher client counts a
+<quote>gangway effect</> tends to occur, so that the effects of group
+commit become significant even when <varname>commit_delay</varname> is
+zero, and thus explicitly setting <varname>commit_delay</varname> tends
+to help less. Setting <varname>commit_delay</varname> can only help
+when (1) there are some concurrently committing transactions, and (2)
+throughput is limited to some degree by commit rate; but with high
+rotational latency this setting can be effective in increasing
+transaction throughput with as few as two clients (that is, a single
+committing client with one sibling transaction).
 </para>
 
 <para>
@@ -574,9 +621,9 @@
 All the options should be the same in terms of reliability, with
 the exception of <literal>fsync_writethrough</>, which can sometimes
 force a flush of the disk cache even when other options do not do so.
-However, it's quite platform-specific which one will be the fastest;
-you can test option speeds using the <xref
-linkend="pgtestfsync"> module.
+However, it's quite platform-specific which one will be the fastest.
+You can test the speeds of different options using the <xref
+linkend="pgtestfsync"> program.
 Note that this parameter is irrelevant if <varname>fsync</varname>
 has been turned off.
 </para>
@@ -585,7 +632,7 @@
 Enabling the <xref linkend="guc-wal-debug"> configuration parameter
 (provided that <productname>PostgreSQL</productname> has been
 compiled with support for it) will result in each
-<function>LogInsert</function> and <function>LogFlush</function>
+<function>XLogInsert</function> and <function>XLogFlush</function>
 <acronym>WAL</acronym> call being logged to the server log. This
 option might be replaced by a more general mechanism in the future.
 </para>