1 <!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.68.2.4 2010/08/17 04:37:19 petere Exp $ -->
4 <title>Reliability and the Write-Ahead Log</title>
7 This chapter explains how the Write-Ahead Log is used to obtain
8 efficient, reliable operation.
11 <sect1 id="wal-reliability">
12 <title>Reliability</title>
15 Reliability is an important property of any serious database
16 system, and <productname>PostgreSQL</> does everything possible to
17 guarantee reliable operation. One aspect of reliable operation is
18 that all data recorded by a committed transaction should be stored
19 in a nonvolatile area that is safe from power loss, operating
20 system failure, and hardware failure (except failure of the
21 nonvolatile area itself, of course). Successfully writing the data
22 to the computer's permanent storage (disk drive or equivalent)
23 ordinarily meets this requirement. In fact, even if a computer is
24 fatally damaged, if the disk drives survive they can be moved to
25 another computer with similar hardware and all committed
26 transactions will remain intact.
30 While forcing data periodically to the disk platters might seem like
31 a simple operation, it is not. Because disk drives are dramatically
32 slower than main memory and CPUs, several layers of caching exist
33 between the computer's main memory and the disk platters.
34 First, there is the operating system's buffer cache, which caches
35 frequently requested disk blocks and combines disk writes. Fortunately,
36 all operating systems give applications a way to force writes from
37 the buffer cache to disk, and <productname>PostgreSQL</> uses those
38 features. (See the <xref linkend="guc-wal-sync-method"> parameter
39 to adjust how this is done.)
43 Next, there might be a cache in the disk drive controller; this is
44 particularly common on <acronym>RAID</> controller cards. Some of
45 these caches are <firstterm>write-through</>, meaning writes are sent
46 to the drive as soon as they arrive. Others are
47 <firstterm>write-back</>, meaning data is sent to the drive at
48 some later time. Such caches can be a reliability hazard because the
49 memory in the disk controller cache is volatile, and will lose its
50 contents in a power failure. Better controller cards have
51 <firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
52 the card has a battery that
53 maintains power to the cache in case of system power loss. After power
54 is restored the data will be written to the disk drives.
58 And finally, most disk drives have caches. Some are write-through
59 while some are write-back, and the same concerns about data loss
60 exist for write-back drive caches as exist for disk controller
61 caches. Consumer-grade IDE and SATA drives are particularly likely
62 to have write-back caches that will not survive a power failure,
63 though <acronym>ATAPI-6</> introduced a drive cache flush command
64 (<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
65 <acronym>ZFS</>, <acronym>ext4</>. (The SCSI command
66 <command>SYNCHRONIZE CACHE</> has long been available.) Many
67 solid-state drives (SSD) also have volatile write-back caches, and
68 many do not honor cache flush commands by default.
72 To check write caching on <productname>Linux</> use
73 <command>hdparm -I</>; it is enabled if there is a <literal>*</> next
to <literal>Write cache</>; use <command>hdparm -W 0</> to turn off
75 write caching. On <productname>FreeBSD</> use
76 <application>atacontrol</>. (For SCSI disks use <ulink
77 url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</></ulink>
78 to turn off <literal>WCE</>.) On <productname>Solaris</> the disk
79 write cache is controlled by <ulink
80 url="http://www.sun.com/bigadmin/content/submitted/format_utility.jsp"><literal>format
81 -e</></ulink>. (The Solaris <acronym>ZFS</> file system is safe with
82 disk write-cache enabled because it issues its own disk cache flush
83 commands.) On <productname>Windows</> if <varname>wal_sync_method</>
84 is <literal>open_datasync</> (the default), write caching is disabled
85 by unchecking <literal>My Computer\Open\{select disk
86 drive}\Properties\Hardware\Properties\Policies\Enable write caching on
87 the disk</>. Also on Windows, <literal>fsync</> and
88 <literal>fsync_writethrough</> never do write caching. The
89 <literal>fsync_writethrough</> option can also be used to disable
90 write caching on <productname>MacOS X</>.
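As a concrete sketch of the Linux commands above (run as root; the device
name <filename>/dev/sda</> is a hypothetical example, and on some drives
the setting does not survive a power cycle):

<programlisting>
hdparm -I /dev/sda | grep 'Write cache'   # a leading * means caching is enabled
hdparm -W 0 /dev/sda                      # turn the drive's write cache off
hdparm -W 1 /dev/sda                      # turn it back on later, if desired
</programlisting>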
94 Many file systems that use write barriers (e.g. <acronym>ZFS</>,
95 <acronym>ext4</>) internally use <command>FLUSH CACHE EXT</> or
96 <command>SYNCHRONIZE CACHE</> commands to flush data to the platters on
97 write-back-enabled drives. Unfortunately, such write barrier file
98 systems behave suboptimally when combined with battery-backed unit
99 (<acronym>BBU</>) disk controllers. In such setups, the synchronize
100 command forces all data from the BBU to the disks, eliminating much
101 of the benefit of the BBU. You can run the utility
102 <filename>src/tools/fsync</> in the PostgreSQL source tree to see
103 if you are affected. If you are affected, the performance benefits
104 of the BBU cache can be regained by turning off write barriers in
105 the file system or reconfiguring the disk controller, if that is
106 an option. If write barriers are turned off, make sure the battery
107 remains active; a faulty battery can potentially lead to data loss.
108 Hopefully file system and disk controller designers will eventually
109 address this suboptimal behavior.
113 When the operating system sends a write request to the storage hardware,
114 there is little it can do to make sure the data has arrived at a truly
115 non-volatile storage area. Rather, it is the
116 administrator's responsibility to make certain that all storage components
117 ensure data integrity. Avoid disk controllers that have non-battery-backed
118 write caches. At the drive level, disable write-back caching if the
119 drive cannot guarantee the data will be written before shutdown.
120 You can test for reliable I/O subsystem behavior using <ulink
121 url="http://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
125 Another risk of data loss is posed by the disk platter write
126 operations themselves. Disk platters are divided into sectors,
127 commonly 512 bytes each. Every physical read or write operation
128 processes a whole sector.
129 When a write request arrives at the drive, it might be for 512 bytes,
130 1024 bytes, or 8192 bytes, and the process of writing could fail due
131 to power loss at any time, meaning some of the 512-byte sectors were
132 written, and others were not. To guard against such failures,
133 <productname>PostgreSQL</> periodically writes full page images to
134 permanent WAL storage <emphasis>before</> modifying the actual page on
135 disk. By doing this, during crash recovery <productname>PostgreSQL</> can
136 restore partially-written pages. If you have a battery-backed disk
137 controller or file-system software that prevents partial page writes
138 (e.g., ZFS), you can turn off this page imaging by turning off the
139 <xref linkend="guc-full-page-writes"> parameter.
143 <sect1 id="wal-intro">
144 <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
146 <indexterm zone="wal">
147 <primary>WAL</primary>
151 <primary>transaction log</primary>
156 <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
157 is a standard method for ensuring data integrity. A detailed
158 description can be found in most (if not all) books about
159 transaction processing. Briefly, <acronym>WAL</acronym>'s central
160 concept is that changes to data files (where tables and indexes
161 reside) must be written only after those changes have been logged,
162 that is, after log records describing the changes have been flushed
163 to permanent storage. If we follow this procedure, we do not need
164 to flush data pages to disk on every transaction commit, because we
165 know that in the event of a crash we will be able to recover the
166 database using the log: any changes that have not been applied to
167 the data pages can be redone from the log records. (This is
168 roll-forward recovery, also known as REDO.)
173 Because <acronym>WAL</acronym> restores database file
174 contents after a crash, journaled file systems are not necessary for
175 reliable storage of the data files or WAL files. In fact, journaling
176 overhead can reduce performance, especially if journaling
177 causes file system <emphasis>data</emphasis> to be flushed
178 to disk. Fortunately, data flushing during journaling can
179 often be disabled with a file system mount option, e.g.
180 <literal>data=writeback</> on a Linux ext3 file system.
181 Journaled file systems do improve boot speed after a crash.
187 Using <acronym>WAL</acronym> results in a
188 significantly reduced number of disk writes, because only the log
189 file needs to be flushed to disk to guarantee that a transaction is
190 committed, rather than every data file changed by the transaction.
191 The log file is written sequentially,
192 and so the cost of syncing the log is much less than the cost of
193 flushing the data pages. This is especially true for servers
194 handling many small transactions touching different parts of the data
195 store. Furthermore, when the server is processing many small concurrent
196 transactions, one <function>fsync</function> of the log file may
197 suffice to commit many transactions.
201 <acronym>WAL</acronym> also makes it possible to support on-line
202 backup and point-in-time recovery, as described in <xref
203 linkend="continuous-archiving">. By archiving the WAL data we can support
204 reverting to any time instant covered by the available WAL data:
205 we simply install a prior physical backup of the database, and
206 replay the WAL log just as far as the desired time. What's more,
207 the physical backup doesn't have to be an instantaneous snapshot
208 of the database state — if it is made over some period of time,
then replaying the WAL log for that period will fix any internal
inconsistencies.
214 <sect1 id="wal-async-commit">
215 <title>Asynchronous Commit</title>
218 <primary>synchronous commit</primary>
222 <primary>asynchronous commit</primary>
226 <firstterm>Asynchronous commit</> is an option that allows transactions
227 to complete more quickly, at the cost that the most recent transactions may
228 be lost if the database should crash. In many applications this is an
229 acceptable trade-off.
233 As described in the previous section, transaction commit is normally
234 <firstterm>synchronous</>: the server waits for the transaction's
235 <acronym>WAL</acronym> records to be flushed to permanent storage
236 before returning a success indication to the client. The client is
237 therefore guaranteed that a transaction reported to be committed will
238 be preserved, even in the event of a server crash immediately after.
239 However, for short transactions this delay is a major component of the
240 total transaction time. Selecting asynchronous commit mode means that
241 the server returns success as soon as the transaction is logically
242 completed, before the <acronym>WAL</acronym> records it generated have
243 actually made their way to disk. This can provide a significant boost
244 in throughput for small transactions.
248 Asynchronous commit introduces the risk of data loss. There is a short
249 time window between the report of transaction completion to the client
250 and the time that the transaction is truly committed (that is, it is
251 guaranteed not to be lost if the server crashes). Thus asynchronous
252 commit should not be used if the client will take external actions
253 relying on the assumption that the transaction will be remembered.
254 As an example, a bank would certainly not use asynchronous commit for
255 a transaction recording an ATM's dispensing of cash. But in many
256 scenarios, such as event logging, there is no need for a strong
257 guarantee of this kind.
261 The risk that is taken by using asynchronous commit is of data loss,
262 not data corruption. If the database should crash, it will recover
263 by replaying <acronym>WAL</acronym> up to the last record that was
264 flushed. The database will therefore be restored to a self-consistent
265 state, but any transactions that were not yet flushed to disk will
266 not be reflected in that state. The net effect is therefore loss of
267 the last few transactions. Because the transactions are replayed in
268 commit order, no inconsistency can be introduced — for example,
269 if transaction B made changes relying on the effects of a previous
270 transaction A, it is not possible for A's effects to be lost while B's
271 effects are preserved.
275 The user can select the commit mode of each transaction, so that
276 it is possible to have both synchronous and asynchronous commit
277 transactions running concurrently. This allows flexible trade-offs
278 between performance and certainty of transaction durability.
279 The commit mode is controlled by the user-settable parameter
280 <xref linkend="guc-synchronous-commit">, which can be changed in any of
281 the ways that a configuration parameter can be set. The mode used for
282 any one transaction depends on the value of
283 <varname>synchronous_commit</varname> when transaction commit begins.
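For example, an application might skip the flush wait for a single
noncritical transaction (a minimal sketch; the <structname>event_log</>
table is a hypothetical example):

<programlisting>
BEGIN;
SET LOCAL synchronous_commit TO OFF;   -- affects only this transaction
INSERT INTO event_log (message)        -- event_log is a hypothetical table
    VALUES ('page viewed');
COMMIT;    -- returns success before the WAL record reaches disk
</programlisting>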
287 Certain utility commands, for instance <command>DROP TABLE</>, are
288 forced to commit synchronously regardless of the setting of
289 <varname>synchronous_commit</varname>. This is to ensure consistency
290 between the server's file system and the logical state of the database.
291 The commands supporting two-phase commit, such as <command>PREPARE
292 TRANSACTION</>, are also always synchronous.
296 If the database crashes during the risk window between an
297 asynchronous commit and the writing of the transaction's
298 <acronym>WAL</acronym> records,
299 then changes made during that transaction <emphasis>will</> be lost.
The duration of the risk window is limited because a background process (the <quote>WAL
302 writer</>) flushes unwritten <acronym>WAL</acronym> records to disk
303 every <xref linkend="guc-wal-writer-delay"> milliseconds.
304 The actual maximum duration of the risk window is three times
305 <varname>wal_writer_delay</varname> because the WAL writer is
306 designed to favor writing whole pages at a time during busy periods.
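For example, with the default <varname>wal_writer_delay</varname> of 200
milliseconds, a transaction reported as committed might remain unflushed
for up to about 600 milliseconds.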
311 An immediate-mode shutdown is equivalent to a server crash, and will
312 therefore cause loss of any unflushed asynchronous commits.
317 Asynchronous commit provides behavior different from setting
318 <xref linkend="guc-fsync"> = off.
319 <varname>fsync</varname> is a server-wide
320 setting that will alter the behavior of all transactions. It disables
321 all logic within <productname>PostgreSQL</> that attempts to synchronize
322 writes to different portions of the database, and therefore a system
323 crash (that is, a hardware or operating system crash, not a failure of
324 <productname>PostgreSQL</> itself) could result in arbitrarily bad
325 corruption of the database state. In many scenarios, asynchronous
326 commit provides most of the performance improvement that could be
obtained by turning off <varname>fsync</varname>, but without the risk
of data corruption.
332 <xref linkend="guc-commit-delay"> also sounds very similar to
333 asynchronous commit, but it is actually a synchronous commit method
334 (in fact, <varname>commit_delay</varname> is ignored during an
335 asynchronous commit). <varname>commit_delay</varname> causes a delay
336 just before a synchronous commit attempts to flush
337 <acronym>WAL</acronym> to disk, in the hope that a single flush
338 executed by one such transaction can also serve other transactions
339 committing at about the same time. Setting <varname>commit_delay</varname>
340 can only help when there are many concurrently committing transactions,
341 and it is difficult to tune it to a value that actually helps rather
than hurts throughput.
347 <sect1 id="wal-configuration">
348 <title><acronym>WAL</acronym> Configuration</title>
351 There are several <acronym>WAL</>-related configuration parameters that
352 affect database performance. This section explains their use.
353 Consult <xref linkend="runtime-config"> for general information about
354 setting server configuration parameters.
358 <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
359 are points in the sequence of transactions at which it is guaranteed
360 that the heap and index data files have been updated with all information written before
361 the checkpoint. At checkpoint time, all dirty data pages are flushed to
362 disk and a special checkpoint record is written to the log file.
363 (The changes were previously flushed to the <acronym>WAL</acronym> files.)
364 In the event of a crash, the crash recovery procedure looks at the latest
365 checkpoint record to determine the point in the log (known as the redo
366 record) from which it should start the REDO operation. Any changes made to
367 data files before that point are guaranteed to be already on disk. Hence, after
368 a checkpoint, log segments preceding the one containing
369 the redo record are no longer needed and can be recycled or removed. (When
370 <acronym>WAL</acronym> archiving is being done, the log segments must be
371 archived before being recycled or removed.)
375 The checkpoint requirement of flushing all dirty data pages to disk
376 can cause a significant I/O load. For this reason, checkpoint
377 activity is throttled so I/O begins at checkpoint start and completes
378 before the next checkpoint starts; this minimizes performance
379 degradation during checkpoints.
383 The server's background writer process automatically performs
384 a checkpoint every so often. A checkpoint is created every <xref
385 linkend="guc-checkpoint-segments"> log segments, or every <xref
386 linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
387 The default settings are 3 segments and 300 seconds (5 minutes), respectively.
388 It is also possible to force a checkpoint by using the SQL command
389 <command>CHECKPOINT</command>.
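For example, checkpoint spacing could be relaxed in
<filename>postgresql.conf</>, and a checkpoint forced manually when
needed (the values below are purely illustrative, not recommendations):

<programlisting>
# postgresql.conf -- illustrative values, not recommendations
checkpoint_segments = 10      # checkpoint after this many WAL segments (default 3)
checkpoint_timeout = 10min    # or after this much time (default 5min)
</programlisting>

A checkpoint can also be requested immediately from any session:

<programlisting>
CHECKPOINT;
</programlisting>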
393 Reducing <varname>checkpoint_segments</varname> and/or
394 <varname>checkpoint_timeout</varname> causes checkpoints to occur
395 more often. This allows faster after-crash recovery (since less work
396 will need to be redone). However, one must balance this against the
397 increased cost of flushing dirty data pages more often. If
398 <xref linkend="guc-full-page-writes"> is set (as is the default), there is
399 another factor to consider. To ensure data page consistency,
400 the first modification of a data page after each checkpoint results in
401 logging the entire page content. In that case,
402 a smaller checkpoint interval increases the volume of output to the WAL log,
403 partially negating the goal of using a smaller interval,
404 and in any case causing more disk I/O.
408 Checkpoints are fairly expensive, first because they require writing
409 out all currently dirty buffers, and second because they result in
410 extra subsequent WAL traffic as discussed above. It is therefore
411 wise to set the checkpointing parameters high enough that checkpoints
412 don't happen too often. As a simple sanity check on your checkpointing
413 parameters, you can set the <xref linkend="guc-checkpoint-warning">
414 parameter. If checkpoints happen closer together than
415 <varname>checkpoint_warning</> seconds,
416 a message will be output to the server log recommending increasing
417 <varname>checkpoint_segments</varname>. Occasional appearance of such
418 a message is not cause for alarm, but if it appears often then the
419 checkpoint control parameters should be increased. Bulk operations such
420 as large <command>COPY</> transfers might cause a number of such warnings
to appear if you have not set <varname>checkpoint_segments</> high
enough.
426 To avoid flooding the I/O system with a burst of page writes,
427 writing dirty buffers during a checkpoint is spread over a period of time.
428 That period is controlled by
429 <xref linkend="guc-checkpoint-completion-target">, which is
430 given as a fraction of the checkpoint interval.
431 The I/O rate is adjusted so that the checkpoint finishes when the
432 given fraction of <varname>checkpoint_segments</varname> WAL segments
433 have been consumed since checkpoint start, or the given fraction of
434 <varname>checkpoint_timeout</varname> seconds have elapsed,
435 whichever is sooner. With the default value of 0.5,
436 <productname>PostgreSQL</> can be expected to complete each checkpoint
437 in about half the time before the next checkpoint starts. On a system
438 that's very close to maximum I/O throughput during normal operation,
439 you might want to increase <varname>checkpoint_completion_target</varname>
440 to reduce the I/O load from checkpoints. The disadvantage of this is that
441 prolonging checkpoints affects recovery time, because more WAL segments
442 will need to be kept around for possible use in recovery. Although
443 <varname>checkpoint_completion_target</varname> can be set as high as 1.0,
444 it is best to keep it less than that (perhaps 0.9 at most) since
445 checkpoints include some other activities besides writing dirty buffers.
446 A setting of 1.0 is quite likely to result in checkpoints not being
447 completed on time, which would result in performance loss due to
448 unexpected variation in the number of WAL segments needed.
There will always be at least one WAL segment file, and there will normally
453 not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
454 or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
455 files. Each segment file is normally 16 MB (though this size can be
456 altered when building the server). You can use this to estimate space
457 requirements for <acronym>WAL</acronym>.
458 Ordinarily, when old log segment files are no longer needed, they
459 are recycled (renamed to become the next segments in the numbered
460 sequence). If, due to a short-term peak of log output rate, there
461 are more than 3 * <varname>checkpoint_segments</varname> + 1
462 segment files, the unneeded segment files will be deleted instead
463 of recycled until the system gets back under this limit.
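For example, with the default settings of <varname>checkpoint_segments</> = 3,
<varname>checkpoint_completion_target</> = 0.5, and
<varname>wal_keep_segments</> = 0, the first formula gives
(2 + 0.5) * 3 + 1 = 8.5, so <filename>pg_xlog</> would normally contain
at most about 8 segment files, or roughly 128 MB at the standard 16 MB
segment size.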
467 In archive recovery or standby mode, the server periodically performs
468 <firstterm>restartpoints</><indexterm><primary>restartpoint</></>
469 which are similar to checkpoints in normal operation: the server forces
470 all its state to disk, updates the <filename>pg_control</> file to
471 indicate that the already-processed WAL data need not be scanned again,
and then recycles any old log segment files in the <filename>pg_xlog</>
473 directory. A restartpoint is triggered if at least one checkpoint record
has been replayed and <varname>checkpoint_timeout</> seconds have passed
since the last restartpoint. In standby mode, a restartpoint is also triggered
if <varname>checkpoint_segments</> log segments have been replayed since
the last restartpoint and at least one checkpoint record has been replayed.
478 Restartpoints can't be performed more frequently than checkpoints in the
479 master because restartpoints can only be performed at checkpoint records.
483 There are two commonly used internal <acronym>WAL</acronym> functions:
484 <function>LogInsert</function> and <function>LogFlush</function>.
485 <function>LogInsert</function> is used to place a new record into
486 the <acronym>WAL</acronym> buffers in shared memory. If there is no
487 space for the new record, <function>LogInsert</function> will have
488 to write (move to kernel cache) a few filled <acronym>WAL</acronym>
489 buffers. This is undesirable because <function>LogInsert</function>
490 is used on every database low level modification (for example, row
491 insertion) at a time when an exclusive lock is held on affected
492 data pages, so the operation needs to be as fast as possible. What
493 is worse, writing <acronym>WAL</acronym> buffers might also force the
494 creation of a new log segment, which takes even more
495 time. Normally, <acronym>WAL</acronym> buffers should be written
496 and flushed by a <function>LogFlush</function> request, which is
497 made, for the most part, at transaction commit time to ensure that
498 transaction records are flushed to permanent storage. On systems
499 with high log output, <function>LogFlush</function> requests might
500 not occur often enough to prevent <function>LogInsert</function>
501 from having to do writes. On such systems
502 one should increase the number of <acronym>WAL</acronym> buffers by
503 modifying the configuration parameter <xref
504 linkend="guc-wal-buffers">. The default number of <acronym>WAL</acronym>
505 buffers is 8. Increasing this value will
506 correspondingly increase shared memory usage. When
507 <xref linkend="guc-full-page-writes"> is set and the system is very busy,
508 setting this value higher will help smooth response times during the
509 period immediately following each checkpoint.
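For instance, a busy system might raise the setting in
<filename>postgresql.conf</> (an illustrative value; the best number
depends on the workload):

<programlisting>
# postgresql.conf -- illustrative value only
wal_buffers = 1MB       # default is 8 buffers (64kB with 8kB WAL pages)
</programlisting>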
513 The <xref linkend="guc-commit-delay"> parameter defines for how many
514 microseconds the server process will sleep after writing a commit
515 record to the log with <function>LogInsert</function> but before
516 performing a <function>LogFlush</function>. This delay allows other
517 server processes to add their commit records to the log so as to have all
518 of them flushed with a single log sync. No sleep will occur if
519 <xref linkend="guc-fsync">
520 is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
521 other sessions are currently in active transactions; this avoids
522 sleeping when it's unlikely that any other session will commit soon.
523 Note that on most platforms, the resolution of a sleep request is
524 ten milliseconds, so that any nonzero <varname>commit_delay</varname>
525 setting between 1 and 10000 microseconds would have the same effect.
Good values for these parameters are not yet clear; experimentation
is encouraged.
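A sketch of a starting point for such experimentation (illustrative
values only):

<programlisting>
# postgresql.conf -- illustrative values for experimentation only
commit_delay = 10000      # sleep 10000 microseconds before flushing the commit
commit_siblings = 5       # but only if at least 5 other transactions are active
</programlisting>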
531 The <xref linkend="guc-wal-sync-method"> parameter determines how
532 <productname>PostgreSQL</productname> will ask the kernel to force
533 <acronym>WAL</acronym> updates out to disk.
534 With the exception of <literal>fsync_writethrough</>, which can sometimes
535 force a flush of the disk cache even when other options do not do so,
536 all the options should be the same in terms of reliability.
537 However, it's quite platform-specific which one will be the fastest.
Note that this parameter is irrelevant if <varname>fsync</varname>
has been turned off.
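For example, to pin down a particular method while comparing alternatives
(illustration only; valid method names vary by platform, and the
<filename>src/tools/fsync</> utility mentioned earlier can help benchmark
them):

<programlisting>
# postgresql.conf -- illustration only; the fastest choice is platform-specific
wal_sync_method = fdatasync
</programlisting>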
543 Enabling the <xref linkend="guc-wal-debug"> configuration parameter
544 (provided that <productname>PostgreSQL</productname> has been
545 compiled with support for it) will result in each
546 <function>LogInsert</function> and <function>LogFlush</function>
547 <acronym>WAL</acronym> call being logged to the server log. This
548 option might be replaced by a more general mechanism in the future.
552 <sect1 id="wal-internals">
553 <title>WAL Internals</title>
556 <acronym>WAL</acronym> is automatically enabled; no action is
557 required from the administrator except ensuring that the
558 disk-space requirements for the <acronym>WAL</acronym> logs are met,
559 and that any necessary tuning is done (see <xref
560 linkend="wal-configuration">).
564 <acronym>WAL</acronym> logs are stored in the directory
565 <filename>pg_xlog</filename> under the data directory, as a set of
566 segment files, normally each 16 MB in size (but the size can be changed
567 by altering the <option>--with-wal-segsize</> configure option when
568 building the server). Each segment is divided into pages, normally
569 8 kB each (this size can be changed via the <option>--with-wal-blocksize</>
570 configure option). The log record headers are described in
571 <filename>access/xlog.h</filename>; the record content is dependent
572 on the type of event that is being logged. Segment files are given
573 ever-increasing numbers as names, starting at
574 <filename>000000010000000000000000</filename>. The numbers do not wrap,
575 but it will take a very, very long time to exhaust the
576 available stock of numbers.
580 It is advantageous if the log is located on a different disk from the
581 main database files. This can be achieved by moving the
582 <filename>pg_xlog</filename> directory to another location (while the server
583 is shut down, of course) and creating a symbolic link from the
584 original location in the main data directory to the new location.
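A minimal sketch of that procedure, assuming hypothetical paths and a
server that has already been stopped:

<programlisting>
# server must be stopped; both paths are hypothetical examples
mv /usr/local/pgsql/data/pg_xlog /mnt/fastdisk/pg_xlog
ln -s /mnt/fastdisk/pg_xlog /usr/local/pgsql/data/pg_xlog
</programlisting>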
588 The aim of <acronym>WAL</acronym> is to ensure that the log is
589 written before database records are altered, but this can be subverted by
590 disk drives<indexterm><primary>disk drive</></> that falsely report a
591 successful write to the kernel,
592 when in fact they have only cached the data and not yet stored it
593 on the disk. A power failure in such a situation might lead to
594 irrecoverable data corruption. Administrators should try to ensure
595 that disks holding <productname>PostgreSQL</productname>'s
596 <acronym>WAL</acronym> log files do not make such false reports.
597 (See <xref linkend="wal-reliability">.)
601 After a checkpoint has been made and the log flushed, the
602 checkpoint's position is saved in the file
603 <filename>pg_control</filename>. Therefore, at the start of recovery,
604 the server first reads <filename>pg_control</filename> and
605 then the checkpoint record; then it performs the REDO operation by
606 scanning forward from the log position indicated in the checkpoint
607 record. Because the entire content of data pages is saved in the
608 log on the first page modification after a checkpoint (assuming
609 <xref linkend="guc-full-page-writes"> is not disabled), all pages
changed since the checkpoint will be restored to a consistent
state.
615 To deal with the case where <filename>pg_control</filename> is
616 corrupt, we should support the possibility of scanning existing log
617 segments in reverse order — newest to oldest — in order to find the
618 latest checkpoint. This has not been implemented yet.
619 <filename>pg_control</filename> is small enough (less than one disk page)
620 that it is not subject to partial-write problems, and as of this writing
621 there have been no reports of database failures due solely to the inability
622 to read <filename>pg_control</filename> itself. So while it is
623 theoretically a weak spot, <filename>pg_control</filename> does not
624 seem to be a problem in practice.