<!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.68.2.4 2010/08/17 04:37:19 petere Exp $ -->

<chapter id="wal">
 <title>Reliability and the Write-Ahead Log</title>

 <para>
  This chapter explains how the Write-Ahead Log is used to obtain
  efficient, reliable operation.
 </para>

 <sect1 id="wal-reliability">
  <title>Reliability</title>

  <para>
   Reliability is an important property of any serious database
   system, and <productname>PostgreSQL</> does everything possible to
   guarantee reliable operation. One aspect of reliable operation is
   that all data recorded by a committed transaction should be stored
   in a nonvolatile area that is safe from power loss, operating
   system failure, and hardware failure (except failure of the
   nonvolatile area itself, of course).  Successfully writing the data
   to the computer's permanent storage (disk drive or equivalent)
   ordinarily meets this requirement.  In fact, even if a computer is
   fatally damaged, if the disk drives survive they can be moved to
   another computer with similar hardware and all committed
   transactions will remain intact.
  </para>

  <para>
   While forcing data periodically to the disk platters might seem like
   a simple operation, it is not. Because disk drives are dramatically
   slower than main memory and CPUs, several layers of caching exist
   between the computer's main memory and the disk platters.
   First, there is the operating system's buffer cache, which caches
   frequently requested disk blocks and combines disk writes. Fortunately,
   all operating systems give applications a way to force writes from
   the buffer cache to disk, and <productname>PostgreSQL</> uses those
   features.  (See the <xref linkend="guc-wal-sync-method"> parameter
   to adjust how this is done.)
  </para>

  <para>
   Next, there might be a cache in the disk drive controller; this is
   particularly common on <acronym>RAID</> controller cards. Some of
   these caches are <firstterm>write-through</>, meaning writes are sent
   to the drive as soon as they arrive. Others are
   <firstterm>write-back</>, meaning data is sent to the drive at
   some later time. Such caches can be a reliability hazard because the
   memory in the disk controller cache is volatile, and will lose its
   contents in a power failure.  Better controller cards have
   <firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
   the card has a battery that
   maintains power to the cache in case of system power loss.  After power
   is restored the data will be written to the disk drives.
  </para>

  <para>
   And finally, most disk drives have caches. Some are write-through
   while some are write-back, and the same concerns about data loss
   exist for write-back drive caches as exist for disk controller
   caches.  Consumer-grade IDE and SATA drives are particularly likely
   to have write-back caches that will not survive a power failure,
   though <acronym>ATAPI-6</> introduced a drive cache flush command
   (<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
   <acronym>ZFS</>, <acronym>ext4</>.  (The SCSI command
   <command>SYNCHRONIZE CACHE</> has long been available.) Many
   solid-state drives (SSD) also have volatile write-back caches, and
   many do not honor cache flush commands by default.
  </para>

  <para>
   To check write caching on <productname>Linux</>, use
   <command>hdparm -I</>; write caching is enabled if there is a
   <literal>*</> next to <literal>Write cache</>.  Use
   <command>hdparm -W 0</> to turn off write caching.  On
   <productname>FreeBSD</>, use <application>atacontrol</>.  (For SCSI
   disks, use <ulink
   url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</></ulink>
   to turn off <literal>WCE</>.)  On <productname>Solaris</>, the disk
   write cache is controlled by <ulink
   url="http://www.sun.com/bigadmin/content/submitted/format_utility.jsp"><literal>format
   -e</></ulink>. (The Solaris <acronym>ZFS</> file system is safe with
   the disk write cache enabled because it issues its own disk cache flush
   commands.)  On <productname>Windows</>, if <varname>wal_sync_method</>
   is <literal>open_datasync</> (the default), write caching can be disabled
   by unchecking <literal>My Computer\Open\{select disk
   drive}\Properties\Hardware\Properties\Policies\Enable write caching on
   the disk</>.  Also on Windows, <literal>fsync</> and
   <literal>fsync_writethrough</> never do write caching.  The
   <literal>fsync_writethrough</> option can also be used to disable
   write caching on <productname>Mac OS X</>.
  </para>

  <para>
   Many file systems that use write barriers (e.g.  <acronym>ZFS</>,
   <acronym>ext4</>) internally use <command>FLUSH CACHE EXT</> or
   <command>SYNCHRONIZE CACHE</> commands to flush data to the platters on
   write-back-enabled drives.  Unfortunately, such write barrier file
   systems behave suboptimally when combined with battery-backed unit
   (<acronym>BBU</>) disk controllers.  In such setups, the synchronize
   command forces all data from the BBU to the disks, eliminating much
   of the benefit of the BBU.  You can run the utility
   <filename>src/tools/fsync</> in the PostgreSQL source tree to see
   if you are affected.  If you are affected, the performance benefits
   of the BBU cache can be regained by turning off write barriers in
   the file system or reconfiguring the disk controller, if that is
   an option.  If write barriers are turned off, make sure the battery
   remains active; a faulty battery can potentially lead to data loss.
   Hopefully file system and disk controller designers will eventually
   address this suboptimal behavior.
  </para>

  <para>
   When the operating system sends a write request to the storage hardware,
   there is little it can do to make sure the data has arrived at a truly
   nonvolatile storage area. Rather, it is the
   administrator's responsibility to make certain that all storage components
   ensure data integrity.  Avoid disk controllers that have non-battery-backed
   write caches.  At the drive level, disable write-back caching if the
   drive cannot guarantee the data will be written before shutdown.
   You can test for reliable I/O subsystem behavior using <ulink
   url="http://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
  </para>

  <para>
   Another risk of data loss is posed by the disk platter write
   operations themselves. Disk platters are divided into sectors,
   commonly 512 bytes each.  Every physical read or write operation
   processes a whole sector.
   When a write request arrives at the drive, it might be for 512 bytes,
   1024 bytes, or 8192 bytes, and the process of writing could fail due
   to power loss at any time, meaning some of the 512-byte sectors were
   written, and others were not.  To guard against such failures,
   <productname>PostgreSQL</> periodically writes full page images to
   permanent WAL storage <emphasis>before</> modifying the actual page on
   disk. By doing this, during crash recovery <productname>PostgreSQL</> can
   restore partially-written pages.  If you have a battery-backed disk
   controller or file-system software that prevents partial page writes
   (e.g., ZFS),  you can turn off this page imaging by turning off the
   <xref linkend="guc-full-page-writes"> parameter.
  </para>
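
  <para>
   The partial-write hazard described above can be illustrated with a
   small Python sketch.  It is a toy model, not
   <productname>PostgreSQL</> code: the function
   <function>torn_write</function> and all variable names are
   hypothetical, and the page is just a byte string.  It shows an
   8192-byte page interrupted after five 512-byte sectors, and how a
   full page image saved beforehand restores a consistent page.
  </para>

<programlisting>
```python
SECTOR_SIZE = 512    # bytes per sector, as in the text
PAGE_SIZE = 8192     # bytes per data page, as in the text


def torn_write(old_page, new_page, sectors_written):
    """Return what is on disk if power fails after `sectors_written` sectors."""
    cut = sectors_written * SECTOR_SIZE
    return new_page[:cut] + old_page[cut:]


old_page = bytes([0x11]) * PAGE_SIZE
new_page = bytes([0x22]) * PAGE_SIZE

# Power is lost after 5 of the 16 sectors reached the platter.
on_disk = torn_write(old_page, new_page, sectors_written=5)
is_torn = on_disk != old_page and on_disk != new_page

# Toy recovery: a full copy of the page was saved in the log first,
# so it can simply overwrite the torn page.
full_page_image = new_page
restored = full_page_image if is_torn else on_disk

print(PAGE_SIZE // SECTOR_SIZE, is_torn, restored == new_page)
# prints: 16 True True
```
</programlisting>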
 </sect1>

  <sect1 id="wal-intro">
   <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>

   <indexterm zone="wal">
    <primary>WAL</primary>
   </indexterm>

   <indexterm>
    <primary>transaction log</primary>
    <see>WAL</see>
   </indexterm>

   <para>
    <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
    is a standard method for ensuring data integrity.  A detailed
    description can be found in most (if not all) books about
    transaction processing. Briefly, <acronym>WAL</acronym>'s central
    concept is that changes to data files (where tables and indexes
    reside) must be written only after those changes have been logged,
    that is, after log records describing the changes have been flushed
    to permanent storage. If we follow this procedure, we do not need
    to flush data pages to disk on every transaction commit, because we
    know that in the event of a crash we will be able to recover the
    database using the log: any changes that have not been applied to
    the data pages can be redone from the log records.  (This is
    roll-forward recovery, also known as REDO.)
   </para>
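
   <para>
    The log-first rule and roll-forward recovery can be sketched in a
    few lines of Python.  This is only a toy model under simplifying
    assumptions: the class <classname>ToyDatabase</classname> and its
    methods are hypothetical, the <quote>disk</quote> is a Python list
    and dictionary, and nothing here corresponds to actual
    <productname>PostgreSQL</> internals.
   </para>

<programlisting>
```python
class ToyDatabase:
    """Hypothetical key/value store that logs before it modifies pages."""

    def __init__(self):
        self.durable_log = []     # log records already flushed to "disk"
        self.durable_pages = {}   # data pages already written to "disk"
        self.cached_pages = {}    # in-memory pages, lost in a crash

    def commit(self, key, value):
        # Rule: the log record is made durable first ...
        self.durable_log.append((key, value))
        # ... and the data page is changed in memory only.
        self.cached_pages[key] = value

    def crash(self):
        # Volatile state disappears; the log and durable pages survive.
        self.cached_pages = {}

    def recover(self):
        # REDO: replay every logged change on top of the durable pages.
        self.cached_pages = dict(self.durable_pages)
        for key, value in self.durable_log:
            self.cached_pages[key] = value


db = ToyDatabase()
db.commit("a", 1)
db.commit("b", 2)
db.crash()
db.recover()
print(db.cached_pages)   # prints: {'a': 1, 'b': 2}
```
</programlisting>

   <para>
    Note that no data page was ever flushed in this sketch, yet both
    committed changes survive the simulated crash because they can be
    redone from the log.
   </para>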

   <tip>
    <para>
     Because <acronym>WAL</acronym> restores database file
     contents after a crash, journaled file systems are not necessary for
     reliable storage of the data files or WAL files.  In fact, journaling
     overhead can reduce performance, especially if journaling
     causes file system <emphasis>data</emphasis> to be flushed
     to disk.  Fortunately, data flushing during journaling can
     often be disabled with a file system mount option, e.g.
     <literal>data=writeback</> on a Linux ext3 file system.
     Journaled file systems do improve boot speed after a crash.
    </para>
   </tip>


   <para>
    Using <acronym>WAL</acronym> results in a
    significantly reduced number of disk writes, because only the log
    file needs to be flushed to disk to guarantee that a transaction is
    committed, rather than every data file changed by the transaction.
    The log file is written sequentially,
    and so the cost of syncing the log is much less than the cost of
    flushing the data pages.  This is especially true for servers
    handling many small transactions touching different parts of the data
    store.  Furthermore, when the server is processing many small concurrent
    transactions, one <function>fsync</function> of the log file may
    suffice to commit many transactions.
   </para>

   <para>
    <acronym>WAL</acronym> also makes it possible to support on-line
    backup and point-in-time recovery, as described in <xref
    linkend="continuous-archiving">.  By archiving the WAL data we can support
    reverting to any time instant covered by the available WAL data:
    we simply install a prior physical backup of the database, and
    replay the WAL log just as far as the desired time.  What's more,
    the physical backup doesn't have to be an instantaneous snapshot
    of the database state &mdash; if it is made over some period of time,
    then replaying the WAL log for that period will fix any internal
    inconsistencies.
   </para>
  </sect1>

 <sect1 id="wal-async-commit">
  <title>Asynchronous Commit</title>

   <indexterm>
    <primary>synchronous commit</primary>
   </indexterm>

   <indexterm>
    <primary>asynchronous commit</primary>
   </indexterm>

  <para>
   <firstterm>Asynchronous commit</> is an option that allows transactions
   to complete more quickly, at the cost that the most recent transactions may
   be lost if the database should crash.  In many applications this is an
   acceptable trade-off.
  </para>

  <para>
   As described in the previous section, transaction commit is normally
   <firstterm>synchronous</>: the server waits for the transaction's
   <acronym>WAL</acronym> records to be flushed to permanent storage
   before returning a success indication to the client.  The client is
   therefore guaranteed that a transaction reported to be committed will
   be preserved, even in the event of a server crash immediately after.
   However, for short transactions this delay is a major component of the
   total transaction time.  Selecting asynchronous commit mode means that
   the server returns success as soon as the transaction is logically
   completed, before the <acronym>WAL</acronym> records it generated have
   actually made their way to disk.  This can provide a significant boost
   in throughput for small transactions.
  </para>

  <para>
   Asynchronous commit introduces the risk of data loss. There is a short
   time window between the report of transaction completion to the client
   and the time that the transaction is truly committed (that is, it is
   guaranteed not to be lost if the server crashes).  Thus asynchronous
   commit should not be used if the client will take external actions
   relying on the assumption that the transaction will be remembered.
   As an example, a bank would certainly not use asynchronous commit for
   a transaction recording an ATM's dispensing of cash.  But in many
   scenarios, such as event logging, there is no need for a strong
   guarantee of this kind.
  </para>

  <para>
   The risk that is taken by using asynchronous commit is of data loss,
   not data corruption.  If the database should crash, it will recover
   by replaying <acronym>WAL</acronym> up to the last record that was
   flushed.  The database will therefore be restored to a self-consistent
   state, but any transactions that were not yet flushed to disk will
   not be reflected in that state.  The net effect is therefore loss of
   the last few transactions.  Because the transactions are replayed in
   commit order, no inconsistency can be introduced &mdash; for example,
   if transaction B made changes relying on the effects of a previous
   transaction A, it is not possible for A's effects to be lost while B's
   effects are preserved.
  </para>

  <para>
   The user can select the commit mode of each transaction, so that
   it is possible to have both synchronous and asynchronous commit
   transactions running concurrently.  This allows flexible trade-offs
   between performance and certainty of transaction durability.
   The commit mode is controlled by the user-settable parameter
   <xref linkend="guc-synchronous-commit">, which can be changed in any of
   the ways that a configuration parameter can be set.  The mode used for
   any one transaction depends on the value of
   <varname>synchronous_commit</varname> when transaction commit begins.
  </para>

  <para>
   Certain utility commands, for instance <command>DROP TABLE</>, are
   forced to commit synchronously regardless of the setting of
   <varname>synchronous_commit</varname>.  This is to ensure consistency
   between the server's file system and the logical state of the database.
   The commands supporting two-phase commit, such as <command>PREPARE
   TRANSACTION</>, are also always synchronous.
  </para>

  <para>
   If the database crashes during the risk window between an
   asynchronous commit and the writing of the transaction's
   <acronym>WAL</acronym> records,
   then changes made during that transaction <emphasis>will</> be lost.
   The duration of the
   risk window is limited because a background process (the <quote>WAL
   writer</>) flushes unwritten <acronym>WAL</acronym> records to disk
   every <xref linkend="guc-wal-writer-delay"> milliseconds.
   The actual maximum duration of the risk window is three times
   <varname>wal_writer_delay</varname> because the WAL writer is
   designed to favor writing whole pages at a time during busy periods.
  </para>
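
  <para>
   As a worked example of the bound just stated, the following Python
   fragment computes the maximum risk window.  The function name is
   hypothetical, and the 200 millisecond figure is assumed here as the
   <varname>wal_writer_delay</varname> setting for illustration; check
   the value in effect on your own server.
  </para>

<programlisting>
```python
def max_risk_window_ms(wal_writer_delay_ms):
    """Upper bound on the asynchronous-commit risk window, per the text."""
    return 3 * wal_writer_delay_ms


# Assumed setting for illustration: wal_writer_delay = 200 ms.
print(max_risk_window_ms(200))   # prints: 600
```
</programlisting>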

  <caution>
   <para>
    An immediate-mode shutdown is equivalent to a server crash, and will
    therefore cause loss of any unflushed asynchronous commits.
   </para>
  </caution>

  <para>
   Asynchronous commit provides behavior different from setting
   <xref linkend="guc-fsync"> = off.
   <varname>fsync</varname> is a server-wide
   setting that will alter the behavior of all transactions.  It disables
   all logic within <productname>PostgreSQL</> that attempts to synchronize
   writes to different portions of the database, and therefore a system
   crash (that is, a hardware or operating system crash, not a failure of
   <productname>PostgreSQL</> itself) could result in arbitrarily bad
   corruption of the database state.  In many scenarios, asynchronous
   commit provides most of the performance improvement that could be
   obtained by turning off <varname>fsync</varname>, but without the risk
   of data corruption.
  </para>

  <para>
   <xref linkend="guc-commit-delay"> also sounds very similar to
   asynchronous commit, but it is actually a synchronous commit method
   (in fact, <varname>commit_delay</varname> is ignored during an
   asynchronous commit).  <varname>commit_delay</varname> causes a delay
   just before a synchronous commit attempts to flush
   <acronym>WAL</acronym> to disk, in the hope that a single flush
   executed by one such transaction can also serve other transactions
   committing at about the same time.  Setting <varname>commit_delay</varname>
   can only help when there are many concurrently committing transactions,
   and it is difficult to tune it to a value that actually helps rather
   than hurts throughput.
  </para>

 </sect1>

 <sect1 id="wal-configuration">
  <title><acronym>WAL</acronym> Configuration</title>

  <para>
   There are several <acronym>WAL</>-related configuration parameters that
   affect database performance. This section explains their use.
   Consult <xref linkend="runtime-config"> for general information about
   setting server configuration parameters.
  </para>

  <para>
   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
   are points in the sequence of transactions at which it is guaranteed
   that the heap and index data files have been updated with all information written before
   the checkpoint.  At checkpoint time, all dirty data pages are flushed to
   disk and a special checkpoint record is written to the log file.
   (The changes were previously flushed to the <acronym>WAL</acronym> files.)
   In the event of a crash, the crash recovery procedure looks at the latest
   checkpoint record to determine the point in the log (known as the redo
   record) from which it should start the REDO operation.  Any changes made to
   data files before that point are guaranteed to be already on disk.  Hence, after
   a checkpoint, log segments preceding the one containing
   the redo record are no longer needed and can be recycled or removed. (When
   <acronym>WAL</acronym> archiving is being done, the log segments must be
   archived before being recycled or removed.)
  </para>

  <para>
   The checkpoint requirement of flushing all dirty data pages to disk
   can cause a significant I/O load.  For this reason, checkpoint
   activity is throttled so I/O begins at checkpoint start and completes
   before the next checkpoint starts;  this minimizes performance
   degradation during checkpoints.
  </para>

  <para>
   The server's background writer process automatically performs
   a checkpoint every so often.  A checkpoint is created every <xref
   linkend="guc-checkpoint-segments"> log segments, or every <xref
   linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
   The default settings are 3 segments and 300 seconds (5 minutes), respectively.
   It is also possible to force a checkpoint by using the SQL command
   <command>CHECKPOINT</command>.
  </para>
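
  <para>
   The <quote>whichever comes first</quote> rule can be made concrete
   with a rough Python estimate.  This sketch assumes a steady rate of
   WAL generation and the usual 16 MB segment size; the function name
   and the rate figures are hypothetical and only serve as an
   illustration.
  </para>

<programlisting>
```python
SEGMENT_MB = 16   # usual WAL segment size


def seconds_until_checkpoint(wal_mb_per_second,
                             checkpoint_segments=3,
                             checkpoint_timeout=300):
    """Estimate the time to the next checkpoint at a steady WAL rate."""
    if wal_mb_per_second > 0:
        seconds_to_fill = checkpoint_segments * SEGMENT_MB / wal_mb_per_second
    else:
        seconds_to_fill = float("inf")   # idle server: only the timeout applies
    # Whichever limit is reached first triggers the checkpoint.
    return min(seconds_to_fill, checkpoint_timeout)


print(seconds_until_checkpoint(0.1))   # prints: 300
print(seconds_until_checkpoint(4))     # prints: 12.0
```
</programlisting>

  <para>
   With the default settings, a lightly loaded server in this model
   checkpoints on the timeout, while a server writing 4 MB of WAL per
   second fills three segments and checkpoints every 12 seconds.
  </para>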

  <para>
   Reducing <varname>checkpoint_segments</varname> and/or
   <varname>checkpoint_timeout</varname> causes checkpoints to occur
   more often. This allows faster after-crash recovery (since less work
   will need to be redone). However, one must balance this against the
   increased cost of flushing dirty data pages more often. If
   <xref linkend="guc-full-page-writes"> is set (as is the default), there is
   another factor to consider. To ensure data page consistency,
   the first modification of a data page after each checkpoint results in
   logging the entire page content. In that case,
   a smaller checkpoint interval increases the volume of output to the WAL log,
   partially negating the goal of using a smaller interval,
   and in any case causing more disk I/O.
  </para>

  <para>
   Checkpoints are fairly expensive, first because they require writing
   out all currently dirty buffers, and second because they result in
   extra subsequent WAL traffic as discussed above.  It is therefore
   wise to set the checkpointing parameters high enough that checkpoints
   don't happen too often.  As a simple sanity check on your checkpointing
   parameters, you can set the <xref linkend="guc-checkpoint-warning">
   parameter.  If checkpoints happen closer together than
   <varname>checkpoint_warning</> seconds,
   a message will be output to the server log recommending increasing
   <varname>checkpoint_segments</varname>.  Occasional appearance of such
   a message is not cause for alarm, but if it appears often then the
   checkpoint control parameters should be increased. Bulk operations such
   as large <command>COPY</> transfers might cause a number of such warnings
   to appear if you have not set <varname>checkpoint_segments</> high
   enough.
  </para>

  <para>
   To avoid flooding the I/O system with a burst of page writes,
   writing dirty buffers during a checkpoint is spread over a period of time.
   That period is controlled by
   <xref linkend="guc-checkpoint-completion-target">, which is
   given as a fraction of the checkpoint interval.
   The I/O rate is adjusted so that the checkpoint finishes when the
   given fraction of <varname>checkpoint_segments</varname> WAL segments
   have been consumed since checkpoint start, or the given fraction of
   <varname>checkpoint_timeout</varname> seconds have elapsed,
   whichever is sooner.  With the default value of 0.5,
   <productname>PostgreSQL</> can be expected to complete each checkpoint
   in about half the time before the next checkpoint starts.  On a system
   that's very close to maximum I/O throughput during normal operation,
   you might want to increase <varname>checkpoint_completion_target</varname>
   to reduce the I/O load from checkpoints.  The disadvantage of this is that
   prolonging checkpoints affects recovery time, because more WAL segments
   will need to be kept around for possible use in recovery.  Although
   <varname>checkpoint_completion_target</varname> can be set as high as 1.0,
   it is best to keep it less than that (perhaps 0.9 at most) since
   checkpoints include some other activities besides writing dirty buffers.
   A setting of 1.0 is quite likely to result in checkpoints not being
   completed on time, which would result in performance loss due to
   unexpected variation in the number of WAL segments needed.
  </para>
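
  <para>
   For the time-based case, the spreading of checkpoint writes can be
   approximated as follows.  This Python sketch is a simplification
   that ignores the segment-based limit; the function name and the
   figure of 30000 dirty pages are hypothetical.
  </para>

<programlisting>
```python
def checkpoint_write_plan(dirty_pages, completion_target, checkpoint_timeout):
    """Return (write window in seconds, average pages written per second)."""
    window = completion_target * checkpoint_timeout
    return window, dirty_pages / window


# Defaults from the text: completion target 0.5, timeout 300 seconds.
print(checkpoint_write_plan(30000, 0.5, 300))   # prints: (150.0, 200.0)
```
</programlisting>

  <para>
   Raising the completion target lengthens the window and so lowers the
   average write rate, at the cost discussed above.
  </para>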

  <para>
   There will always be at least one WAL segment file, and there will normally
   not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
   or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
   files.  Each segment file is normally 16 MB (though this size can be
   altered when building the server).  You can use this to estimate space
   requirements for <acronym>WAL</acronym>.
   Ordinarily, when old log segment files are no longer needed, they
   are recycled (renamed to become the next segments in the numbered
   sequence). If, due to a short-term peak of log output rate, there
   are more than 3 * <varname>checkpoint_segments</varname> + 1
   segment files, the unneeded segment files will be deleted instead
   of recycled until the system gets back under this limit.
  </para>
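
  <para>
   The space estimate can be worked through in Python.  This sketch
   reads the <quote>or</quote> between the two formulas as
   <quote>whichever is larger</quote>, which is an interpretation
   rather than something the text states; the function names are
   hypothetical.
  </para>

<programlisting>
```python
SEGMENT_MB = 16   # normal segment size, per the text


def max_wal_files(checkpoint_segments, checkpoint_completion_target,
                  wal_keep_segments=0):
    """Normal upper bound on the number of WAL segment files."""
    by_checkpoints = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    by_keep = checkpoint_segments + wal_keep_segments + 1
    # Assumption: the larger of the two formulas is the effective bound.
    return max(by_checkpoints, by_keep)


def wal_space_mb(checkpoint_segments, checkpoint_completion_target,
                 wal_keep_segments=0):
    """Estimated disk space for WAL, in megabytes."""
    return SEGMENT_MB * max_wal_files(checkpoint_segments,
                                      checkpoint_completion_target,
                                      wal_keep_segments)


def recycle_limit(checkpoint_segments):
    """Above this many files, unneeded segments are deleted, not recycled."""
    return 3 * checkpoint_segments + 1


print(max_wal_files(3, 0.5))   # prints: 8.5
print(wal_space_mb(3, 0.5))    # prints: 136.0
print(recycle_limit(3))        # prints: 10
```
</programlisting>

  <para>
   With the default settings this gives roughly 136 MB; the formula
   yields a fractional file count, so round up when budgeting disk space.
  </para>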

  <para>
   In archive recovery or standby mode, the server periodically performs
   <firstterm>restartpoints</><indexterm><primary>restartpoint</></>
   which are similar to checkpoints in normal operation: the server forces
   all its state to disk, updates the <filename>pg_control</> file to
   indicate that the already-processed WAL data need not be scanned again,
   and then recycles any old log segment files in the <filename>pg_xlog</>
   directory. A restartpoint is triggered if at least one checkpoint record
   has been replayed and <varname>checkpoint_timeout</> seconds have passed
   since the last restartpoint. In standby mode, a restartpoint is also triggered
   if <varname>checkpoint_segments</> log segments have been replayed since
   the last restartpoint and at least one checkpoint record has been replayed.
   Restartpoints can't be performed more frequently than checkpoints in the
   master because restartpoints can only be performed at checkpoint records.
  </para>

  <para>
   There are two commonly used internal <acronym>WAL</acronym> functions:
   <function>LogInsert</function> and <function>LogFlush</function>.
   <function>LogInsert</function> is used to place a new record into
   the <acronym>WAL</acronym> buffers in shared memory. If there is no
   space for the new record, <function>LogInsert</function> will have
   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
   buffers. This is undesirable because <function>LogInsert</function>
   is used on every low-level database modification (for example, row
   insertion) at a time when an exclusive lock is held on affected
   data pages, so the operation needs to be as fast as possible.  What
   is worse, writing <acronym>WAL</acronym> buffers might also force the
   creation of a new log segment, which takes even more
   time. Normally, <acronym>WAL</acronym> buffers should be written
   and flushed by a <function>LogFlush</function> request, which is
   made, for the most part, at transaction commit time to ensure that
   transaction records are flushed to permanent storage. On systems
   with high log output, <function>LogFlush</function> requests might
   not occur often enough to prevent <function>LogInsert</function>
   from having to do writes.  On such systems
   one should increase the number of <acronym>WAL</acronym> buffers by
   modifying the configuration parameter <xref
   linkend="guc-wal-buffers">.  The default number of <acronym>WAL</acronym>
   buffers is 8.  Increasing this value will
   correspondingly increase shared memory usage.  When
   <xref linkend="guc-full-page-writes"> is set and the system is very busy,
   setting this value higher will help smooth response times during the
   period immediately following each checkpoint.
  </para>

  <para>
   The <xref linkend="guc-commit-delay"> parameter defines for how many
   microseconds the server process will sleep after writing a commit
   record to the log with <function>LogInsert</function> but before
   performing a <function>LogFlush</function>. This delay allows other
   server processes to add their commit records to the log so as to have all
   of them flushed with a single log sync. No sleep will occur if
   <xref linkend="guc-fsync">
   is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
   other sessions are currently in active transactions; this avoids
   sleeping when it's unlikely that any other session will commit soon.
   Note that on most platforms, the resolution of a sleep request is
   ten milliseconds, so that any nonzero <varname>commit_delay</varname>
   setting between 1 and 10000 microseconds would have the same effect.
   Good values for these parameters are not yet clear; experimentation
   is encouraged.
  </para>
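
  <para>
   The conditions under which the sleep occurs, and the effect of the
   ten-millisecond resolution, can be modeled in Python.  This is a
   sketch only: the function name is hypothetical, and rounding the
   request up to the next multiple of the resolution is an assumed
   model of platform behavior, not something the server guarantees.
  </para>

<programlisting>
```python
SLEEP_RESOLUTION_US = 10000   # ten milliseconds, per the text


def commit_sleep_us(commit_delay_us, fsync_enabled,
                    other_active_sessions, commit_siblings):
    """Model of how long a committing backend actually sleeps."""
    if commit_delay_us == 0 or not fsync_enabled:
        return 0
    if commit_siblings > other_active_sessions:
        return 0   # too few concurrent transactions to be worth waiting
    # Assumed platform behavior: round up to the sleep resolution.
    ticks = -(-commit_delay_us // SLEEP_RESOLUTION_US)   # ceiling division
    return ticks * SLEEP_RESOLUTION_US


print(commit_sleep_us(1, True, 10, 5))       # prints: 10000
print(commit_sleep_us(10000, True, 10, 5))   # prints: 10000
print(commit_sleep_us(5000, True, 2, 5))     # prints: 0
```
</programlisting>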

  <para>
   The <xref linkend="guc-wal-sync-method"> parameter determines how
   <productname>PostgreSQL</productname> will ask the kernel to force
   <acronym>WAL</acronym> updates out to disk.
   With the exception of <literal>fsync_writethrough</>, which can sometimes
   force a flush of the disk cache even when other options do not do so,
   all the options should be the same in terms of reliability.
   However, it's quite platform-specific which one will be the fastest.
   Note that this parameter is irrelevant if <varname>fsync</varname>
   has been turned off.
  </para>

  <para>
   Enabling the <xref linkend="guc-wal-debug"> configuration parameter
   (provided that <productname>PostgreSQL</productname> has been
   compiled with support for it) will result in each
   <function>LogInsert</function> and <function>LogFlush</function>
   <acronym>WAL</acronym> call being logged to the server log. This
   option might be replaced by a more general mechanism in the future.
  </para>
 </sect1>

 <sect1 id="wal-internals">
  <title>WAL Internals</title>

  <para>
   <acronym>WAL</acronym> is automatically enabled; no action is
   required from the administrator except ensuring that the
   disk-space requirements for the <acronym>WAL</acronym> logs are met,
   and that any necessary tuning is done (see <xref
   linkend="wal-configuration">).
  </para>

  <para>
   <acronym>WAL</acronym> logs are stored in the directory
   <filename>pg_xlog</filename> under the data directory, as a set of
   segment files, normally each 16 MB in size (but the size can be changed
   by altering the <option>--with-wal-segsize</> configure option when
   building the server).  Each segment is divided into pages, normally
   8 kB each (this size can be changed via the <option>--with-wal-blocksize</>
   configure option).  The log record headers are described in
   <filename>access/xlog.h</filename>; the record content is dependent
   on the type of event that is being logged.  Segment files are given
   ever-increasing numbers as names, starting at
   <filename>000000010000000000000000</filename>.  The numbers do not wrap,
   but it will take a very, very long time to exhaust the
   available stock of numbers.
  </para>
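
  <para>
   A segment file name such as the one above can be taken apart with a
   short Python helper.  The split into three eight-digit hexadecimal
   fields (timeline, log file, segment) is not spelled out in this
   section and is stated here as an assumption about the naming scheme
   of this release; the function and field names are hypothetical.
  </para>

<programlisting>
```python
def parse_segment_name(name):
    """Split a 24-hex-digit WAL file name into three integer fields."""
    if len(name) != 24:
        raise ValueError("expected 24 hexadecimal digits")
    timeline, log_id, seg_id = (int(name[i:i + 8], 16) for i in (0, 8, 16))
    return timeline, log_id, seg_id


def format_segment_name(timeline, log_id, seg_id):
    """Inverse of parse_segment_name()."""
    return f"{timeline:08X}{log_id:08X}{seg_id:08X}"


print(parse_segment_name("000000010000000000000000"))   # prints: (1, 0, 0)
print(format_segment_name(1, 0, 1))   # prints: 000000010000000000000001
```
</programlisting>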

  <para>
   It is advantageous if the log is located on a different disk from the
   main database files.  This can be achieved by moving the
   <filename>pg_xlog</filename> directory to another location (while the server
   is shut down, of course) and creating a symbolic link from the
   original location in the main data directory to the new location.
  </para>

  <para>
   The aim of <acronym>WAL</acronym> is to ensure that the log is
   written before database records are altered, but this can be subverted by
   disk drives<indexterm><primary>disk drive</></> that falsely report a
   successful write to the kernel,
   when in fact they have only cached the data and not yet stored it
   on the disk.  A power failure in such a situation might lead to
   irrecoverable data corruption.  Administrators should try to ensure
   that disks holding <productname>PostgreSQL</productname>'s
   <acronym>WAL</acronym> log files do not make such false reports.
   (See <xref linkend="wal-reliability">.)
  </para>

  <para>
   After a checkpoint has been made and the log flushed, the
   checkpoint's position is saved in the file
   <filename>pg_control</filename>. Therefore, at the start of recovery,
   the server first reads <filename>pg_control</filename> and
   then the checkpoint record; then it performs the REDO operation by
   scanning forward from the log position indicated in the checkpoint
   record.  Because the entire content of data pages is saved in the
   log on the first page modification after a checkpoint (assuming
   <xref linkend="guc-full-page-writes"> is not disabled), all pages
   changed since the checkpoint will be restored to a consistent
   state.
  </para>

  <para>
   To deal with the case where <filename>pg_control</filename> is
   corrupt, we should support the possibility of scanning existing log
   segments in reverse order &mdash; newest to oldest &mdash; in order to find the
   latest checkpoint.  This has not been implemented yet.
   <filename>pg_control</filename> is small enough (less than one disk page)
   that it is not subject to partial-write problems, and as of this writing
   there have been no reports of database failures due solely to the inability
   to read <filename>pg_control</filename> itself.  So while it is
   theoretically a weak spot, <filename>pg_control</filename> does not
   seem to be a problem in practice.
  </para>
 </sect1>
</chapter>