From 43bb02863f870f2667ab5ae84c0fa5bbd2c8150f Mon Sep 17 00:00:00 2001
From: Bruce Momjian
Date: Thu, 14 Aug 2003 23:13:42 +0000
Subject: [PATCH] Add disk rotation idea to WAL todo emails.

---
 doc/TODO.detail/wal | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)

diff --git a/doc/TODO.detail/wal b/doc/TODO.detail/wal
index 6842d8daab..c1b1cad2dd 100644
--- a/doc/TODO.detail/wal
+++ b/doc/TODO.detail/wal
@@ -2698,3 +2698,142 @@ TIP 4: Don't 'kill -9' the postmaster
+From pgsql-hackers-owner+M31893@postgresql.org Fri Nov 15 11:25:58 2002
+Return-path:
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id gAFHPvR10276
+    for ; Fri, 15 Nov 2002 12:25:57 -0500 (EST)
+Received: from localhost (postgresql.org [64.49.215.8])
+    by postgresql.org (Postfix) with ESMTP
+    id A2D5A4774A1; Fri, 15 Nov 2002 11:34:54 -0500 (EST)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+    by postgresql.org (Postfix) with SMTP
+    id 5E898477132; Fri, 15 Nov 2002 11:15:45 -0500 (EST)
+Received: from localhost (postgresql.org [64.49.215.8])
+    by postgresql.org (Postfix) with ESMTP id 90CF1475B85
+    for ; Mon, 11 Nov 2002 15:33:47 -0500 (EST)
+Received: from Curtis-Vaio (unknown [63.164.0.45])
+    by postgresql.org (Postfix) with SMTP id C6CB1475A3F
+    for ; Mon, 11 Nov 2002 15:33:46 -0500 (EST)
+Received: from [127.0.0.1] by Curtis-Vaio
+    (ArGoSoft Mail Server Freeware, Version 1.8 (1.8.1.7)); Mon, 11 Nov 2002 16:33:42 -0400
+From: "Curtis Faith"
+To:
+Subject: [HACKERS] 500 tpsQL + WAL log implementation
+Date: Mon, 11 Nov 2002 16:33:41 -0400
+Message-ID:
+MIME-Version: 1.0
+Content-Type: text/plain;
+    charset="iso-8859-1"
+Content-Transfer-Encoding: 7bit
+X-Priority: 3 (Normal)
+X-MSMail-Priority: Normal
+X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
+X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
+Importance: Normal
+X-Virus-Scanned: by AMaViS new-20020517
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+X-Virus-Scanned: by AMaViS new-20020517
+Status: ORr
+
+I have been experimenting with empirical tests of file system and device
+level writes to determine the actual constraints in order to speed up the WAL
+logging code.
+
+Using a raw file partition and a time-based technique for determining the
+optimal write position, I am able to get 8K writes physically written to disk
+synchronously in the range of 500 to 650 writes per second using FreeBSD raw
+device partitions on IDE disks (with write cache disabled). I will be
+testing it soon under Linux with 10,000 RPM SCSI which should be even better.
+It is my belief that the mechanism used to achieve these speeds could be
+incorporated into the existing WAL logging code as an abstraction that looks
+to the WAL code just like the file level access currently used. The current
+speeds are limited by the speed of a single disk rotation. For a 7,200 RPM
+disk this is 120/second, for a 10,000 RPM disk this is 166.66/second.
+
+The mechanism works by adjusting the seek offset of the write by using
+gettimeofday to determine approximately where the disk head is in its
+rotation. The mechanism does not use any AIO calls.
+
+Assuming the following:
+
+1) Disk rotation time is 8.333ms or 8333us (7200 RPM).
+
+2) A write at offset 1,500K completes at system time 103s 000ms 000us
+
+3) A new write is requested at system time 103s 004ms 166us
+
+4) A 390K per rotation alignment of the data on the disk.
+
+5) A write must be sent at least 20K ahead of the current head position to
+ensure that it is written in less than one rotation.
+
+It can be determined from the above that a write for an offset of something
+slightly more than 195K past the last write, or offset 1,695K will be ahead
+of the current location of the head and will therefore complete in less than
+a single rotation's time.
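The arithmetic in the five assumptions above can be checked with a short sketch. This is only an illustration under stated assumptions, not the tester or the write path the email describes: the function names and parameters are hypothetical, offsets are treated as a simple linear position that advances by 390K per rotation, and the size of the write itself is ignored.

```python
# Hypothetical helpers illustrating the timing arithmetic from the email.
# All names and default values are assumptions taken from the worked example.

def estimate_head_offset_kb(last_offset_kb, last_done_us, now_us,
                            rotation_us=8333, kb_per_rotation=390):
    """Estimate where the head is now, in K, from the time elapsed since
    the last write completed at last_offset_kb."""
    elapsed_us = now_us - last_done_us
    # Fraction of a rotation completed since the last write finished.
    fraction = (elapsed_us % rotation_us) / rotation_us
    return last_offset_kb + fraction * kb_per_rotation


def next_write_offset_kb(last_offset_kb, last_done_us, now_us,
                         rotation_us=8333, kb_per_rotation=390, lead_kb=20):
    """Earliest offset, in K, that is at least lead_kb ahead of the
    estimated head position (one reading of assumption 5)."""
    head = estimate_head_offset_kb(last_offset_kb, last_done_us, now_us,
                                   rotation_us, kb_per_rotation)
    return head + lead_kb


# Times are microseconds: 103s 000ms 000us and 103s 004ms 166us.
head_kb = estimate_head_offset_kb(1500, 103_000_000, 103_004_166)
target_kb = next_write_offset_kb(1500, 103_000_000, 103_004_166)
```

With the numbers from the example, 4166us is half a rotation, so the estimated head position is about 1,695K, matching the figure above. If the 20K lead from assumption 5 is added on top of that, the target under this reading would be about 1,715K; the email does not spell out how the lead is applied.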
+
+The disk-specific metrics (rotation speed, bytes per rotation, base write
+time, etc.) can be derived empirically through a tester program that would
+take a few minutes to run and which could be run at log setup time.
+
+The obvious problem with the above mechanism is that the WAL log needs to be
+able to read from the log file in transaction order during recovery. This
+could be provided for using an abstraction that prepends the logical order
+for each block written to the disk and makes sure that the log blocks contain
+either a valid logical order number or some other marker indicating that the
+block is not being used.
+
+A bitmap of blocks that have already been used would be kept in memory for
+quickly determining the next set of possible unused blocks but this bitmap
+would not need to be written to disk except during normal shutdown since in
+the event of a failure the bitmaps would be reconstructed by reading all the
+blocks from the disk.
+
+Checkpointing and something akin to log rotation could be handled using this
+mechanism as well.
+
+So, MY REAL QUESTION is whether or not this is the sort of speed improvement
+that warrants the work of writing the required abstraction layer and making
+this very robust. The WAL code should remain essentially unchanged, with
+perhaps new calls for the five or six routines used to access the log files,
+and handle the equivalent of log rotation for raw device access. These new
+calls would either use the current file based implementation or the new
+logging mechanism depending on the configuration.
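The ordering scheme described earlier (a logical order number prepended to each block, a marker for unused blocks, and a used-block bitmap rebuilt by scanning the disk after a failure) can be sketched as follows. This is a minimal sketch under assumptions, not the proposed implementation: the 8-byte header layout, the use of 0 as the unused marker, and the names `make_block` and `recover` are all hypothetical, and the device is modeled as an in-memory byte string.

```python
import struct

BLOCK_SIZE = 8192                 # 8K blocks, as in the email
HEADER = struct.Struct("<Q")      # assumed: 8-byte logical order number
UNUSED = 0                        # assumed marker for "block not in use"


def make_block(seq, payload):
    """Build one log block: logical order number, then zero-padded payload."""
    if seq == UNUSED:
        raise ValueError("sequence number 0 is reserved for unused blocks")
    room = BLOCK_SIZE - HEADER.size
    if len(payload) > room:
        raise ValueError("payload does not fit in one block")
    return HEADER.pack(seq) + payload.ljust(room, b"\0")


def recover(device):
    """Scan every block; return (used_bitmap, block indexes in logical order)."""
    used = []
    ordered = []
    for start in range(0, len(device), BLOCK_SIZE):
        (seq,) = HEADER.unpack_from(device, start)
        used.append(seq != UNUSED)
        if seq != UNUSED:
            ordered.append((seq, start // BLOCK_SIZE))
    ordered.sort()                # replay order is by logical number
    return used, [index for _, index in ordered]


# Blocks were written out of physical order; block 1 was never used.
device = (make_block(2, b"second") + bytes(BLOCK_SIZE) +
          make_block(1, b"first") + make_block(3, b"third"))
used_bitmap, replay_order = recover(device)
```

Here the scan marks blocks 0, 2 and 3 as used and yields the replay order 2, 0, 3, which is the transaction order regardless of where each block landed on disk. A real layout would also need a checksum or similar to tell a torn write from a valid block, which this sketch leaves out.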
+
+I anticipate that the extra work required for a PostgreSQL administrator to
+use the proposed logging mechanism would be to:
+
+1) Create a raw device partition of the appropriate size
+2) Run the metrics tester for that device partition
+3) Set the appropriate configuration parameters to indicate raw WAL logging
+
+I anticipate that the additional space requirements for this system would be
+on the order of 10% to 15% beyond the current file-based implementation's
+requirements.
+
+So, is this worth doing? Would a robust implementation likely be accepted for
+7.4 assuming it can demonstrate speed improvements in the range of 500tps?
+
+- Curtis
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
+
-- 
2.11.0