2791 lines
124 KiB
Text
2791 lines
124 KiB
Text
From cjs@cynic.net Thu Jun 20 22:18:27 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5L2IPo22195
|
|
for <pgman@candle.pha.pa.us>; Thu, 20 Jun 2002 22:18:26 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id 88216F821; Fri, 21 Jun 2002 02:18:17 +0000 (UTC)
|
|
Date: Fri, 21 Jun 2002 11:18:14 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0206211106390.437-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: RO
|
|
|
|
On Thu, 20 Jun 2002, Bruce Momjian wrote:
|
|
|
|
> > MS SQL Server has an interesting way of dealing with this. They have a
|
|
> > "torn" bit in each 512-byte chunk of a page, and this bit is set the
|
|
> > same for each chunk. When they are about to write out a page, they first
|
|
> > flip all of the torn bits and then do the write. If the write does not
|
|
> > complete due to a system crash or whatever, this can be detected later
|
|
> > because the torn bits won't match across the entire page.
|
|
>
|
|
> I was wondering, how does knowing the block is corrupt help MS SQL?
|
|
|
|
I'm trying to recall, but I can't off hand. I'll have to look it
|
|
up in my Inside SQL Server book, which is at home right now,
|
|
unfortunately. I'll bring the book into work and let you know the
|
|
details later.
|
|
|
|
> Right now, we write changed pages to WAL, then later write them to disk.
|
|
|
|
Ah. You write the entire page? MS writes only the changed tuple.
|
|
And DB2, in fact, goes one better and writes only the part of the
|
|
tuple up to the change, IIRC. Thus, if you put smaller and/or more
|
|
frequently changed columns first, you'll have smaller logs.
|
|
|
|
> I have always been looking for a way to prevent these WAL writes. The
|
|
> 512-byte bit seems interesting, but how does it help?
|
|
|
|
Well, this would at least let you reduce the write to the 512-byte
|
|
chunk that changed, rather than writing the entire 8K page.
|
|
|
|
> And how does the bit help them with partial block writes? Is the bit at
|
|
> the end of the block? Is that reliable?
|
|
|
|
The bit is somewhere within every 512 byte "disk page" within the
|
|
8192 byte "filesystem/database page." So an 8KB page is divided up
|
|
like this:
|
|
|
|
| <----------------------- 8 Kb ----------------------> |
|
|
|
|
| 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b |
|
|
|
|
Thus, the tear bits start out like this:
|
|
|
|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|
|
|
|
After a successful write of the entire page, you have this:
|
|
|
|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
|
|
|
|
If the write is unsuccessful, you end up with something like this:
|
|
|
|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
|
|
|
|
And now you know which parts of your page got written, and which
|
|
parts didn't.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From cjs@cynic.net Sat Jun 22 04:41:54 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net ([63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5M8fpF04711
|
|
for <pgman@candle.pha.pa.us>; Sat, 22 Jun 2002 04:41:53 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id 415C8F820; Sat, 22 Jun 2002 08:41:33 +0000 (UTC)
|
|
Date: Sat, 22 Jun 2002 17:41:30 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Michael Loftis <mloftis@wgops.com>,
|
|
mlw <markw@mohawksoft.com>,
|
|
PostgreSQL-development <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <19332.1024668861@sss.pgh.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0206221731130.1091-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: RO
|
|
|
|
On Fri, 21 Jun 2002, Tom Lane wrote:
|
|
|
|
> Curt Sampson <cjs@cynic.net> writes:
|
|
> > And now you know which parts of your page got written, and which
|
|
> > parts didn't.
|
|
>
|
|
> Yes ... and what do you *do* about it?
|
|
|
|
Ok. Here's the extract from _Inside Microsoft SQL Server 7.0_, page 207:
|
|
|
|
torn page detection When TRUE, this option causes a bit to be
|
|
flipped for each 512-byte sector in a database page (8 KB)
|
|
whenever the page is written to disk. This option allows
|
|
SQL Server to detect incomplete I/O operations caused by
|
|
power failures or other system outages. If a bit is in the
|
|
wrong state when the page is later read by SQL Server, this
|
|
means the page was written incorrectly; a torn page has
|
|
been detected. Although SQL Server database pages are 8
|
|
KB, disks perform I/O operations using 512-byte sectors.
|
|
Therefore, 16 sectors are written per database page. A
|
|
torn page can occur if the system crashes (for example,
|
|
because of power failure) between the time the operating
|
|
system writes the first 512-byte sector to disk and the
|
|
completion of the 8-KB I/O operation. If the first sector
|
|
of a database page is successfully written before the crash,
|
|
it will appear that the database page on disk was updated,
|
|
although it might not have succeeded. Using battery-backed
|
|
disk caches can ensure that data is [sic] successfully
|
|
written to disk or not written at all. In this case, don't
|
|
set torn page detection to TRUE, as it isn't needed. If a
|
|
torn page is detected, the database will need to be restored
|
|
from backup because it will be physically inconsistent.
|
|
|
|
As I understand it, this is not a problem for postgres becuase the
|
|
entire page is written to the log. So postgres is safe, but quite
|
|
inefficient. (It would be much more efficient to write just the
|
|
changed tuple, or even just the changed values within the tuple,
|
|
to the log.)
|
|
|
|
Adding these torn bits would allow posgres at least to write to
|
|
the log just the 512-byte sectors that have changed, rather than
|
|
the entire 8 KB page.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From pgsql-hackers-owner+M24060@postgresql.org Sat Jun 22 18:31:21 2002
|
|
Return-path: <pgsql-hackers-owner+M24060@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5MMVKF20014
|
|
for <pgman@candle.pha.pa.us>; Sat, 22 Jun 2002 18:31:20 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 0ADFE476090; Sat, 22 Jun 2002 18:31:10 -0400 (EDT)
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 6B372475A96; Sat, 22 Jun 2002 18:28:42 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 47AD2475935
|
|
for <pgsql-hackers@postgresql.org>; Sat, 22 Jun 2002 18:28:40 -0400 (EDT)
|
|
Received: from hades.usol.com (hades.usol.com [208.232.58.41])
|
|
by postgresql.org (Postfix) with ESMTP id 1D5DA476166
|
|
for <pgsql-hackers@postgresql.org>; Sat, 22 Jun 2002 18:23:16 -0400 (EDT)
|
|
Received: from 01-081.024.popsite.net (01-081.024.popsite.net [216.126.160.81])
|
|
by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5MMMOj11344;
|
|
Sat, 22 Jun 2002 18:22:25 -0400
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
From: "J. R. Nield" <jrnield@usol.com>
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, Michael Loftis <mloftis@wgops.com>,
|
|
mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
|
|
Tom Lane <tgl@sss.pgh.pa.us>
|
|
In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us>
|
|
References: <200206210158.g5L1wFk20118@candle.pha.pa.us>
|
|
Content-Type: text/plain
|
|
Content-Transfer-Encoding: 7bit
|
|
Message-ID: <1024784514.1793.242.camel@localhost.localdomain>
|
|
MIME-Version: 1.0
|
|
X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6)
|
|
Date: 22 Jun 2002 18:22:58 -0400
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
Status: ROr
|
|
|
|
On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote:
|
|
> I was wondering, how does knowing the block is corrupt help MS SQL?
|
|
> Right now, we write changed pages to WAL, then later write them to disk.
|
|
> I have always been looking for a way to prevent these WAL writes. The
|
|
> 512-byte bit seems interesting, but how does it help?
|
|
>
|
|
> And how does the bit help them with partial block writes? Is the bit at
|
|
> the end of the block? Is that reliable?
|
|
>
|
|
|
|
My understanding of this is as follows:
|
|
|
|
1) On most commercial systems, if you get a corrupted block (from
|
|
partial write or whatever) you need to restore the file(s) from the most
|
|
recent backup, and replay the log from the log archive (usually only the
|
|
damaged files will be written to during replay).
|
|
|
|
2) If you can't deal with the downtime to recover the file, then EMC,
|
|
Sun, or IBM will sell you an expensive disk array with an NVRAM cache
|
|
that will do atomic writes. Some plain-vanilla SCSI disks are also
|
|
capable of atomic writes, though usually they don't use NVRAM to do it.
|
|
|
|
The database must then make sure that each page-write gets translated
|
|
into exactly one SCSI-level write. This is one reason why ORACLE and
|
|
Sybase recommend that you use raw disk partitions for high availability.
|
|
Some operating systems support this through the filesystem, but it is OS
|
|
dependent. I think Solaris 7 & 8 has support for this, but I'm not sure.
|
|
|
|
PostgreSQL has trouble because it can neither archive logs for replay,
|
|
nor use raw disk partitions.
|
|
|
|
|
|
One other point:
|
|
|
|
Page pre-image logging is fundamentally the same as what Jim Grey's
|
|
book[1] would call "careful writes". I don't believe they should be in
|
|
the XLOG, because we never need to keep the pre-images after we're sure
|
|
the buffer has made it to the disk. Instead, we should have the buffer
|
|
IO routines implement ping-pong writes of some kind if we want
|
|
protection from partial writes.
|
|
|
|
|
|
Does any of this make sense?
|
|
|
|
|
|
|
|
;jrnield
|
|
|
|
|
|
[1] Grey, J. and Reuter, A. (1993). "Transaction Processing: Concepts
|
|
and Techniques". Morgan Kaufmann.
|
|
|
|
--
|
|
J. R. Nield
|
|
jrnield@usol.com
|
|
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 4: Don't 'kill -9' the postmaster
|
|
|
|
From pgsql-hackers-owner+M24068@postgresql.org Sun Jun 23 08:40:27 2002
|
|
Return-path: <pgsql-hackers-owner+M24068@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCeQF01601
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 08:40:27 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 8AC4B475CBC; Sun, 23 Jun 2002 08:40:22 -0400 (EDT)
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 4683647599D; Sun, 23 Jun 2002 08:37:40 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 0D57847592A
|
|
for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 08:37:38 -0400 (EDT)
|
|
Received: from hades.usol.com (hades.usol.com [208.232.58.41])
|
|
by postgresql.org (Postfix) with ESMTP id 75326475876
|
|
for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 08:37:36 -0400 (EDT)
|
|
Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32])
|
|
by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111;
|
|
Sun, 23 Jun 2002 08:37:23 -0400
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
From: "J. R. Nield" <jrnield@usol.com>
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, Michael Loftis <mloftis@wgops.com>,
|
|
mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
|
|
Tom Lane <tgl@sss.pgh.pa.us>
|
|
In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us>
|
|
References: <200206222317.g5MNHBn23427@candle.pha.pa.us>
|
|
Content-Type: text/plain
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6)
|
|
Date: 23 Jun 2002 08:37:53 -0400
|
|
Message-ID: <1024835880.1793.264.camel@localhost.localdomain>
|
|
MIME-Version: 1.0
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
Status: RO
|
|
|
|
On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
|
|
> J. R. Nield wrote:
|
|
> > One other point:
|
|
> >
|
|
> > Page pre-image logging is fundamentally the same as what Jim Grey's
|
|
> > book[1] would call "careful writes". I don't believe they should be in
|
|
> > the XLOG, because we never need to keep the pre-images after we're sure
|
|
> > the buffer has made it to the disk. Instead, we should have the buffer
|
|
> > IO routines implement ping-pong writes of some kind if we want
|
|
> > protection from partial writes.
|
|
>
|
|
> Ping-pong writes to where? We have to fsync, and rather than fsync that
|
|
> area and WAL, we just do WAL. Not sure about a win there.
|
|
>
|
|
|
|
The key question is: do we have some method to ensure that the OS
|
|
doesn't do the writes in parallel?
|
|
|
|
If the OS will ensure that one of the two block writes of a ping-pong
|
|
completes before the other starts, then we don't need to fsync() at
|
|
all.
|
|
|
|
The only thing we are protecting against is the possibility of both
|
|
writes being partial. If neither is done, that's fine because WAL will
|
|
protect us. If the first write is partial, we will detect that and use
|
|
the old data from the other, then recover from WAL. If the first is
|
|
complete but the second is partial, then we detect that and use the
|
|
newer block from the first write. If the second is complete but the
|
|
first is partial, we detect that and use the newer block from the second
|
|
write.
|
|
|
|
So does anyone know a way to prevent parallel writes in one of the
|
|
common unix standards? Do they say anything about this?
|
|
|
|
It would seem to me that if the same process does both ping-pong writes,
|
|
then there should be a cheap way to enforce a serial order. I could be
|
|
wrong though.
|
|
|
|
As to where the first block of the ping-pong should go, maybe we could
|
|
reserve a file with nBlocks space for them, and write the information
|
|
about which block was being written to the XLOG for use in recovery.
|
|
There are many other ways to do it.
|
|
|
|
;jrnield
|
|
|
|
--
|
|
J. R. Nield
|
|
jrnield@usol.com
|
|
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 6: Have you searched our list archives?
|
|
|
|
http://archives.postgresql.org
|
|
|
|
From cjs@cynic.net Sun Jun 23 09:33:29 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NDXSF11543
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 09:33:28 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id A83ABF820; Sun, 23 Jun 2002 13:33:15 +0000 (UTC)
|
|
Date: Sun, 23 Jun 2002 22:33:07 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: "J. R. Nield" <jrnield@usol.com>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Michael Loftis <mloftis@wgops.com>,
|
|
mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
|
|
Tom Lane <tgl@sss.pgh.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <1024835880.1793.264.camel@localhost.localdomain>
|
|
Message-ID: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: RO
|
|
|
|
On 23 Jun 2002, J. R. Nield wrote:
|
|
|
|
> On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
|
|
> > J. R. Nield wrote:
|
|
> > > One other point:
|
|
> > >
|
|
> > > Page pre-image logging is fundamentally the same as what Jim Grey's
|
|
> > > book[1] would call "careful writes". I don't believe they should be in
|
|
> > > the XLOG, because we never need to keep the pre-images after we're sure
|
|
> > > the buffer has made it to the disk. Instead, we should have the buffer
|
|
> > > IO routines implement ping-pong writes of some kind if we want
|
|
> > > protection from partial writes.
|
|
> >
|
|
> > Ping-pong writes to where? We have to fsync, and rather than fsync that
|
|
> > area and WAL, we just do WAL. Not sure about a win there.
|
|
|
|
Presumably the win is that, "we never need to keep the pre-images
|
|
after we're sure the buffer has made it to the disk." So the
|
|
pre-image log can be completely ditched when we shut down the
|
|
server, so a full system sync, or whatever. This keeps the log file
|
|
size down, which means faster recovery, less to back up (when we
|
|
start getting transaction logs that can be backed up), etc.
|
|
|
|
This should also allow us to disable completely the ping-pong writes
|
|
if we have a disk subsystem that we trust. (E.g., a disk array with
|
|
battery backed memory.) That would, in theory, produce a nice little
|
|
performance increase when lots of inserts and/or updates are being
|
|
committed, as we have much, much less to write to the log file.
|
|
|
|
Are there stats that track, e.g., the bandwidth of writes to the
|
|
log file? I'd be interested in knowing just what kind of savings
|
|
one might see by doing this.
|
|
|
|
> The key question is: do we have some method to ensure that the OS
|
|
> doesn't do the writes in parallel?...
|
|
> It would seem to me that if the same process does both ping-pong writes,
|
|
> then there should be a cheap way to enforce a serial order. I could be
|
|
> wrong though.
|
|
|
|
Well, whether or not there's a cheap way depends on whether you consider
|
|
fsync to be cheap. :-)
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From pgsql-hackers-owner+M24073@postgresql.org Sun Jun 23 11:19:59 2002
|
|
Return-path: <pgsql-hackers-owner+M24073@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NFJxF19785
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 11:19:59 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 0BD5B475E79; Sun, 23 Jun 2002 11:19:55 -0400 (EDT)
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 5C0CB475D6A; Sun, 23 Jun 2002 11:19:50 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id E2353475C4B
|
|
for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 11:19:47 -0400 (EDT)
|
|
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
|
|
by postgresql.org (Postfix) with ESMTP id 746F8475AEA
|
|
for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 11:19:46 -0400 (EDT)
|
|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
|
|
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5NFJF108464;
|
|
Sun, 23 Jun 2002 11:19:15 -0400 (EDT)
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net>
|
|
References: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net>
|
|
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
|
|
message dated "Sun, 23 Jun 2002 22:33:07 +0900"
|
|
Date: Sun, 23 Jun 2002 11:19:15 -0400
|
|
Message-ID: <8461.1024845555@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
Status: RO
|
|
|
|
Curt Sampson <cjs@cynic.net> writes:
|
|
> This should also allow us to disable completely the ping-pong writes
|
|
> if we have a disk subsystem that we trust.
|
|
|
|
If we have a disk subsystem we trust, we just disable fsync on the
|
|
WAL and the performance issue largely goes away.
|
|
|
|
I concur with Bruce: the reason we keep page images in WAL is to
|
|
minimize the number of places we have to fsync, and thus the amount of
|
|
head movement required for a commit. Putting the page images elsewhere
|
|
cannot be a win AFAICS.
|
|
|
|
> Well, whether or not there's a cheap way depends on whether you consider
|
|
> fsync to be cheap. :-)
|
|
|
|
It's never cheap :-(
|
|
|
|
regards, tom lane
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 5: Have you checked our extensive FAQ?
|
|
|
|
http://www.postgresql.org/users-lounge/docs/faq.html
|
|
|
|
From cjs@cynic.net Sun Jun 23 12:10:44 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NGAgF22907
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 12:10:43 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id 57BFDF820; Sun, 23 Jun 2002 16:10:35 +0000 (UTC)
|
|
Date: Mon, 24 Jun 2002 01:10:26 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <8461.1024845555@sss.pgh.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: RO
|
|
|
|
On Sun, 23 Jun 2002, Tom Lane wrote:
|
|
|
|
> Curt Sampson <cjs@cynic.net> writes:
|
|
> > This should also allow us to disable completely the ping-pong writes
|
|
> > if we have a disk subsystem that we trust.
|
|
>
|
|
> If we have a disk subsystem we trust, we just disable fsync on the
|
|
> WAL and the performance issue largely goes away.
|
|
|
|
No, you can't do this. If you don't fsync(), there's no guarantee
|
|
that the write ever got out of the computer's buffer cache and to
|
|
the disk subsystem in the first place.
|
|
|
|
> I concur with Bruce: the reason we keep page images in WAL is to
|
|
> minimize the number of places we have to fsync, and thus the amount of
|
|
> head movement required for a commit.
|
|
|
|
An fsync() does not necessarially cause head movement, or any real
|
|
disk writes at all. If you're writing to many external disk arrays,
|
|
for example, the fsync() ensures that the data are in the disk array's
|
|
non-volatile or UPS-backed RAM, no more. The array might hold the data
|
|
for quite some time before it actually writes it to disk.
|
|
|
|
But you're right that it's faster, if you're going to write out changed
|
|
pages and have have the ping-pong file and the transaction log on the
|
|
same disk, just to write out the entire page to the transaction log.
|
|
|
|
So what we would really need to implement, if we wanted to be more
|
|
efficient with trusted disk subsystems, would be the option of writing
|
|
to the log only the changed row or changed part of the row, or writing
|
|
the entire changed page. I don't know how hard this would be....
|
|
|
|
> > Well, whether or not there's a cheap way depends on whether you consider
|
|
> > fsync to be cheap. :-)
|
|
>
|
|
> It's never cheap :-(
|
|
|
|
Actually, with a good external RAID system with non-volatile RAM,
|
|
it's a good two to four orders of magnitude cheaper than writing to a
|
|
directly connected disk that doesn't claim the write is complete until
|
|
it's physically on disk. I'd say that it qualifies as at least "not
|
|
expensive." Not that you want to do it more often than you have to
|
|
anyway....
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From jrnield@usol.com Sun Jun 23 13:56:59 2002
|
|
Return-path: <jrnield@usol.com>
|
|
Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NHusF00335
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 13:56:58 -0400 (EDT)
|
|
Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77])
|
|
by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NHunj18549;
|
|
Sun, 23 Jun 2002 13:56:49 -0400
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
From: "J. R. Nield" <jrnield@usol.com>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker
|
|
<pgsql-hackers@postgresql.org>
|
|
In-Reply-To: <8461.1024845555@sss.pgh.pa.us>
|
|
References: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net>
|
|
<8461.1024845555@sss.pgh.pa.us>
|
|
Content-Type: text/plain
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6)
|
|
Date: 23 Jun 2002 13:57:19 -0400
|
|
Message-ID: <1024855044.1793.414.camel@localhost.localdomain>
|
|
MIME-Version: 1.0
|
|
Status: ROr
|
|
|
|
On Sun, 2002-06-23 at 11:19, Tom Lane wrote:
|
|
> Curt Sampson <cjs@cynic.net> writes:
|
|
> > This should also allow us to disable completely the ping-pong writes
|
|
> > if we have a disk subsystem that we trust.
|
|
>
|
|
> If we have a disk subsystem we trust, we just disable fsync on the
|
|
> WAL and the performance issue largely goes away.
|
|
|
|
It wouldn't work because the OS buffering interferes, and we need those
|
|
WAL records on disk up to the greatest LSN of the Buffer we will be writing.
|
|
|
|
|
|
We already buffer WAL ourselves. We also already buffer regular pages.
|
|
Whenever we write a Buffer out of the buffer cache, it is because we
|
|
really want that page on disk and wanted to start an IO. If thats not
|
|
the case, then we should have more block buffers!
|
|
|
|
So since we have all this buffering designed especially to meet our
|
|
needs, and since the OS buffering is in the way, can someone explain to
|
|
me why postgresql would ever open a file without the O_DSYNC flag if the
|
|
platform supports it?
|
|
|
|
|
|
|
|
>
|
|
> I concur with Bruce: the reason we keep page images in WAL is to
|
|
> minimize the number of places we have to fsync, and thus the amount of
|
|
> head movement required for a commit. Putting the page images elsewhere
|
|
> cannot be a win AFAICS.
|
|
|
|
|
|
Why not put all the page images in a single pre-allocated file and treat
|
|
it as a ring? How could this be any worse than flushing them in the WAL
|
|
log?
|
|
|
|
Maybe fsync would be slower with two files, but I don't see how
|
|
fdatasync would be, and most platforms support that.
|
|
|
|
What would improve performance would be to have a dbflush process that
|
|
would work in the background flushing buffers in groups and trying to
|
|
stay ahead of ReadBuffer requests. That would let you do the temporary
|
|
side of the ping-pong as a huge O_DSYNC writev(2) request (or
|
|
fdatasync() once) and then write out the other buffers. It would also
|
|
tend to prevent the other backends from blocking on write requests.
|
|
|
|
A dbflush could also support aio_read/aio_write on platforms like
|
|
Solaris and WindowsNT that support it.
|
|
|
|
Am I correct that right now, buffers only get written when they get
|
|
removed from the free list for reuse? So a released dirty buffer will
|
|
sit in the buffer free list until it becomes the Least Recently Used
|
|
buffer, and will then cause a backend to block for IO in a call to
|
|
BufferAlloc?
|
|
|
|
This would explain why we like using the OS buffer cache, and why our
|
|
performance is troublesome when we have to do synchronous IO writes, and
|
|
why fsync() takes so long to complete. All of the backends block for
|
|
each call to BufferAlloc() after a large table update by a single
|
|
backend, and then the OS buffers are always full of our "written" data.
|
|
|
|
Am I reading the bufmgr code correctly? I already found an imaginary
|
|
race condition there once :-)
|
|
|
|
;jnield
|
|
|
|
|
|
>
|
|
> > Well, whether or not there's a cheap way depends on whether you consider
|
|
> > fsync to be cheap. :-)
|
|
>
|
|
> It's never cheap :-(
|
|
>
|
|
--
|
|
J. R. Nield
|
|
jrnield@usol.com
|
|
|
|
|
|
From cjs@cynic.net Sun Jun 23 14:15:15 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIFEF01698
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 14:15:15 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id 796E6F820; Sun, 23 Jun 2002 18:15:08 +0000 (UTC)
|
|
Date: Mon, 24 Jun 2002 03:15:01 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: "J. R. Nield" <jrnield@usol.com>
|
|
cc: Tom Lane <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain>
|
|
Message-ID: <Pine.NEB.4.43.0206240307550.511-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: ROr
|
|
|
|
On 23 Jun 2002, J. R. Nield wrote:
|
|
|
|
> So since we have all this buffering designed especially to meet our
|
|
> needs, and since the OS buffering is in the way, can someone explain to
|
|
> me why postgresql would ever open a file without the O_DSYNC flag if the
|
|
> platform supports it?
|
|
|
|
It's more code, if there are platforms out there that don't support
|
|
O_DYSNC. (We still have to keep the old fsync code.) On the other hand,
|
|
O_DSYNC could save us a disk arm movement over fsync() because it
|
|
appears to me that fsync is also going to force a metadata update, which
|
|
means that the inode blocks have to be written as well.
|
|
|
|
> Maybe fsync would be slower with two files, but I don't see how
|
|
> fdatasync would be, and most platforms support that.
|
|
|
|
Because, if both files are on the same disk, you still have to move
|
|
the disk arm from the cylinder at the current log file write point
|
|
to the cylinder at the current ping-pong file write point. And then back
|
|
again to the log file write point cylinder.
|
|
|
|
In the end, having a ping-pong file as well seems to me unnecessary
|
|
complexity, especially when anyone interested in really good
|
|
performance is going to buy a disk subsystem that guarantees no
|
|
torn pages and thus will want to turn off the ping-pong file writes
|
|
entirely, anyway.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From jrnield@usol.com Sun Jun 23 14:14:51 2002
|
|
Return-path: <jrnield@usol.com>
|
|
Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIEnF01649
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 14:14:50 -0400 (EDT)
|
|
Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77])
|
|
by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NIEkj19287;
|
|
Sun, 23 Jun 2002 14:14:46 -0400
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
From: "J. R. Nield" <jrnield@usol.com>
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
cc: Tom Lane <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker
|
|
<pgsql-hackers@postgresql.org>
|
|
In-Reply-To: <Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net>
|
|
References: <Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net>
|
|
Content-Type: text/plain
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6)
|
|
Date: 23 Jun 2002 14:15:17 -0400
|
|
Message-ID: <1024856120.3054.418.camel@localhost.localdomain>
|
|
MIME-Version: 1.0
|
|
Status: RO
|
|
|
|
On Sun, 2002-06-23 at 12:10, Curt Sampson wrote:
|
|
>
|
|
> So what we would really need to implement, if we wanted to be more
|
|
> efficient with trusted disk subsystems, would be the option of writing
|
|
> to the log only the changed row or changed part of the row, or writing
|
|
> the entire changed page. I don't know how hard this would be....
|
|
>
|
|
We already log that stuff. The page images are in addition to the
|
|
"Logical Changes", so we could just stop logging the page images.
|
|
|
|
--
|
|
J. R. Nield
|
|
jrnield@usol.com
|
|
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M24100@postgresql.org Mon Jun 24 13:13:41 2002
|
|
Return-path: <pgsql-hackers-owner+M24100@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OHDeF08564
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 13:13:40 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 05602475CBE; Mon, 24 Jun 2002 13:11:10 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 13:11:10 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 929A247633B; Mon, 24 Jun 2002 09:26:54 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 962C147631A
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:31:43 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:31:43 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id C112D475C3C
|
|
for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 15:35:20 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5NJYtL07449;
|
|
Sun, 23 Jun 2002 15:34:55 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206231934.g5NJYtL07449@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain>
|
|
To: "J. R. Nield" <jrnield@usol.com>
|
|
Date: Sun, 23 Jun 2002 15:34:55 -0400 (EDT)
|
|
cc: Tom Lane <tgl@sss.pgh.pa.us>, Curt Sampson <cjs@cynic.net>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
J. R. Nield wrote:
|
|
> So since we have all this buffering designed especially to meet our
|
|
> needs, and since the OS buffering is in the way, can someone explain to
|
|
> me why postgresql would ever open a file without the O_DSYNC flag if the
|
|
> platform supports it?
|
|
|
|
We sync only WAL, not the other pages, except for the sync() call we do
|
|
during checkpoint when we discard old WAL files.
|
|
|
|
> > I concur with Bruce: the reason we keep page images in WAL is to
|
|
> > minimize the number of places we have to fsync, and thus the amount of
|
|
> > head movement required for a commit. Putting the page images elsewhere
|
|
> > cannot be a win AFAICS.
|
|
>
|
|
>
|
|
> Why not put all the page images in a single pre-allocated file and treat
|
|
> it as a ring? How could this be any worse than flushing them in the WAL
|
|
> log?
|
|
>
|
|
> Maybe fsync would be slower with two files, but I don't see how
|
|
> fdatasync would be, and most platforms support that.
|
|
|
|
We have fdatasync option for WAL in postgresql.conf.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 4: Don't 'kill -9' the postmaster
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M24091@postgresql.org Mon Jun 24 12:54:22 2002
|
|
Return-path: <pgsql-hackers-owner+M24091@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGsMF07208
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 12:54:22 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 7DB7947679D; Mon, 24 Jun 2002 09:48:51 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 09:48:51 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 3FD37476491; Mon, 24 Jun 2002 08:55:34 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 2769E4762E3
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:27:39 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:27:39 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id ED459475C61
|
|
for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 15:37:08 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5NJasa07642;
|
|
Sun, 23 Jun 2002 15:36:54 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206231936.g5NJasa07642@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <Pine.NEB.4.43.0206240307550.511-100000@angelic.cynic.net>
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
Date: Sun, 23 Jun 2002 15:36:54 -0400 (EDT)
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Tom Lane <tgl@sss.pgh.pa.us>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
Curt Sampson wrote:
|
|
> On 23 Jun 2002, J. R. Nield wrote:
|
|
>
|
|
> > So since we have all this buffering designed especially to meet our
|
|
> > needs, and since the OS buffering is in the way, can someone explain to
|
|
> > me why postgresql would ever open a file without the O_DSYNC flag if the
|
|
> > platform supports it?
|
|
>
|
|
> It's more code, if there are platforms out there that don't support
|
|
> O_DYSNC. (We still have to keep the old fsync code.) On the other hand,
|
|
> O_DSYNC could save us a disk arm movement over fsync() because it
|
|
> appears to me that fsync is also going to force a metadata update, which
|
|
> means that the inode blocks have to be written as well.
|
|
|
|
Again, see postgresql.conf:
|
|
|
|
#wal_sync_method = fsync # the default varies across platforms:
|
|
# # fsync, fdatasync, open_sync, or open_datasync
|
|
|
|
>
|
|
> > Maybe fsync would be slower with two files, but I don't see how
|
|
> > fdatasync would be, and most platforms support that.
|
|
>
|
|
> Because, if both files are on the same disk, you still have to move
|
|
> the disk arm from the cylinder at the current log file write point
|
|
> to the cylinder at the current ping-pong file write point. And then back
|
|
> again to the log file write point cylinder.
|
|
>
|
|
> In the end, having a ping-pong file as well seems to me unnecessary
|
|
> complexity, especially when anyone interested in really good
|
|
> performance is going to buy a disk subsystem that guarantees no
|
|
> torn pages and thus will want to turn off the ping-pong file writes
|
|
> entirely, anyway.
|
|
|
|
Yes, I don't see writing to two files vs. one to be any win, especially
|
|
when we need to fsync both of them. What I would really like is to
|
|
avoid the double I/O of writing to WAL and to the data file; improving
|
|
that would be a huge win.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 4: Don't 'kill -9' the postmaster
|
|
|
|
|
|
|
|
From cjs@cynic.net Sun Jun 23 20:09:44 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O09hF00630
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 20:09:43 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id 6F45AF820; Mon, 24 Jun 2002 00:09:38 +0000 (UTC)
|
|
Date: Mon, 24 Jun 2002 09:09:30 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Tom Lane <tgl@sss.pgh.pa.us>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: RO
|
|
|
|
On Sun, 23 Jun 2002, Bruce Momjian wrote:
|
|
|
|
> Yes, I don't see writing to two files vs. one to be any win, especially
|
|
> when we need to fsync both of them. What I would really like is to
|
|
> avoid the double I/O of writing to WAL and to the data file; improving
|
|
> that would be a huge win.
|
|
|
|
You mean, the double I/O of writing the block to the WAL and data file?
|
|
(We'd still have to write the changed columns or whatever to the WAL,
|
|
right?)
|
|
|
|
I'd just add an option to turn it off. If you need it, you need it;
|
|
there's no way around that except to buy hardware that is really going
|
|
to guarantee your writes (which then means you don't need it).
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From jrnield@usol.com Sun Jun 23 21:28:58 2002
|
|
Return-path: <jrnield@usol.com>
|
|
Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O1SuF06381
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 21:28:57 -0400 (EDT)
|
|
Received: from 01-072.024.popsite.net (01-072.024.popsite.net [216.126.160.72])
|
|
by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5O1Ssj09303;
|
|
Sun, 23 Jun 2002 21:28:55 -0400
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
From: "J. R. Nield" <jrnield@usol.com>
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
|
|
Michael Loftis
|
|
<mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker
|
|
<pgsql-hackers@postgresql.org>
|
|
In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us>
|
|
References: <200206231936.g5NJasa07642@candle.pha.pa.us>
|
|
Content-Type: text/plain
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6)
|
|
Date: 23 Jun 2002 21:29:23 -0400
|
|
Message-ID: <1024882167.1793.733.camel@localhost.localdomain>
|
|
MIME-Version: 1.0
|
|
Status: ROr
|
|
|
|
On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
|
|
> Yes, I don't see writing to two files vs. one to be any win, especially
|
|
> when we need to fsync both of them. What I would really like is to
|
|
> avoid the double I/O of writing to WAL and to the data file; improving
|
|
> that would be a huge win.
|
|
>
|
|
|
|
If is impossible to do what you want. You can not protect against
|
|
partial writes without writing pages twice and calling fdatasync between
|
|
them while going through a generic filesystem. The best disk array will
|
|
not protect you if the operating system does not align block writes to
|
|
the structure of the underlying device. Even with raw devices, you need
|
|
special support or knowledge of the operating system and/or the disk
|
|
device to ensure that each write request will be atomic to the
|
|
underlying hardware.
|
|
|
|
All other systems rely on the fact that you can recover a damaged file
|
|
using the log archive. This means downtime in the rare case, but no data
|
|
loss. Until PostgreSQL can do this, then it will not be acceptable for
|
|
real critical production use. This is not to knock PostgreSQL, because
|
|
it is a very good database system, and clearly the best open-source one.
|
|
It even has feature advantages over the commercial systems. But at the
|
|
end of the day, unless you have complete understanding of the I/O system
|
|
from write(2) through to the disk system, the only sure ways to protect
|
|
against partial writes are by "careful writes" (in the WAL log or
|
|
elsewhere, writing pages twice), or by requiring (and allowing) users to
|
|
do log-replay recovery when a file is corrupted by a partial write. As
|
|
long as there is a UPS, and the operating system doesn't crash, then
|
|
there still should be no partial writes.
|
|
|
|
If we log pages to WAL, they are useless when archived (after a
|
|
checkpoint). So either we have a separate "log" for them (the ping-pong
|
|
file), or we should at least remove them when archived, which makes log
|
|
archiving more complex but is perfectly doable.
|
|
|
|
Finally, I would love to hear why we are using the operating system
|
|
buffer manager at all. The OS is acting as a secondary buffer manager
|
|
for us. Why is that? What flaw in our I/O system does this reveal? I
|
|
know that:
|
|
|
|
>We sync only WAL, not the other pages, except for the sync() call we do
|
|
> during checkpoint when we discard old WAL files.
|
|
|
|
But this is probably not a good thing. We should only be writing blocks
|
|
when they need to be on disk. We should not be expecting the OS to write
|
|
them "sometime later" and avoid blocking (as long) for the write. If we
|
|
need that, then our buffer management is wrong and we need to fix it.
|
|
The reason we are doing this is because we expect the OS buffer manager
|
|
to do asynchronous I/O for us, but then we don't control the order. That
|
|
is the reason why we have to call fdatasync(), to create "sequence
|
|
points".
|
|
|
|
The reason we have performance problems with either D_OSYNC or fdatasync
|
|
on the normal relations is because we have no dbflush process. This
|
|
causes an unacceptable amount of I/O blocking by other transactions.
|
|
|
|
The ORACLE people were not kidding when they said that they could not
|
|
certify Linux for production use until it supported O_DSYNC. Can you
|
|
explain why that was the case?
|
|
|
|
Finally, let me apologize if the above comes across as somewhat
|
|
belligerent. I know very well that I can't compete with you guys for
|
|
knowledge of the PosgreSQL system. I am still at a loss when I look at
|
|
the optimizer and executor modules, and it will take some time before I
|
|
can follow discussion of that area. Even then, I doubt my ability to
|
|
compare with people like Mr. Lane and Mr. Momjian in experience and
|
|
general intelligence, or in the field of database programming and
|
|
software development in particular. However, this discussion and a
|
|
search of the pgsql-hackers archives reveals this problem to be the KEY
|
|
area of PostgreSQL's failing, and general misunderstanding, when
|
|
compared to its commercial competitors.
|
|
|
|
Sincerely,
|
|
|
|
J. R. Nield
|
|
|
|
--
|
|
J. R. Nield
|
|
jrnield@usol.com
|
|
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M24090@postgresql.org Mon Jun 24 12:38:04 2002
|
|
Return-path: <pgsql-hackers-owner+M24090@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGc3F05962
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 12:38:03 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 81B9F4768DF; Mon, 24 Jun 2002 10:18:05 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 10:18:05 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 81F08476473; Mon, 24 Jun 2002 08:55:28 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id CDDFA475CC3
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:37:44 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:37:44 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id 5C971475858
|
|
for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 22:47:12 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5O2ki712992;
|
|
Sun, 23 Jun 2002 22:46:44 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206240246.g5O2ki712992@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain>
|
|
To: "J. R. Nield" <jrnield@usol.com>
|
|
Date: Sun, 23 Jun 2002 22:46:44 -0400 (EDT)
|
|
cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
J. R. Nield wrote:
|
|
> On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
|
|
> > Yes, I don't see writing to two files vs. one to be any win, especially
|
|
> > when we need to fsync both of them. What I would really like is to
|
|
> > avoid the double I/O of writing to WAL and to the data file; improving
|
|
> > that would be a huge win.
|
|
> >
|
|
>
|
|
> If is impossible to do what you want. You can not protect against
|
|
> partial writes without writing pages twice and calling fdatasync between
|
|
> them while going through a generic filesystem. The best disk array will
|
|
> not protect you if the operating system does not align block writes to
|
|
> the structure of the underlying device. Even with raw devices, you need
|
|
> special support or knowledge of the operating system and/or the disk
|
|
> device to ensure that each write request will be atomic to the
|
|
> underlying hardware.
|
|
|
|
Yes, I suspected it was impossible, but that doesn't mean I want it any
|
|
less. ;-)
|
|
|
|
> All other systems rely on the fact that you can recover a damaged file
|
|
> using the log archive. This means downtime in the rare case, but no data
|
|
> loss. Until PostgreSQL can do this, then it will not be acceptable for
|
|
> real critical production use. This is not to knock PostgreSQL, because
|
|
> it is a very good database system, and clearly the best open-source one.
|
|
> It even has feature advantages over the commercial systems. But at the
|
|
> end of the day, unless you have complete understanding of the I/O system
|
|
> from write(2) through to the disk system, the only sure ways to protect
|
|
> against partial writes are by "careful writes" (in the WAL log or
|
|
> elsewhere, writing pages twice), or by requiring (and allowing) users to
|
|
> do log-replay recovery when a file is corrupted by a partial write. As
|
|
> long as there is a UPS, and the operating system doesn't crash, then
|
|
> there still should be no partial writes.
|
|
|
|
You are talking point-in-time recovery, a major missing feature right
|
|
next to replication, and I agree it makes PostgreSQL unacceptable for
|
|
some applications. Point taken.
|
|
|
|
And the interesting thing you are saying is that with point-in-time
|
|
recovery, we don't need to write pre-write images of pages because if we
|
|
detect a partial page write, we then abort the database and tell the
|
|
user to do a point-in-time recovery, basically meaning we are using the
|
|
previous full backup as our pre-write page image and roll forward using
|
|
the logical logs. This is clearly a nice thing to be able to do because
|
|
it let's you take a pre-write image of the page once during full backup,
|
|
keep it offline, and bring it back in the rare case of a full page write
|
|
failure. I now can see how the MSSQL tearoff-bits would be used, not
|
|
for recovery, but to detect a partial write and force a point-in-time
|
|
recovery from the administrator.
|
|
|
|
|
|
> If we log pages to WAL, they are useless when archived (after a
|
|
> checkpoint). So either we have a separate "log" for them (the ping-pong
|
|
> file), or we should at least remove them when archived, which makes log
|
|
> archiving more complex but is perfectly doable.
|
|
|
|
Yes, that is how we will do point-in-time recovery; remove the
|
|
pre-write page images and archive the rest. It is more complex, but
|
|
having the fsync all in one file is too big a win.
|
|
|
|
> Finally, I would love to hear why we are using the operating system
|
|
> buffer manager at all. The OS is acting as a secondary buffer manager
|
|
> for us. Why is that? What flaw in our I/O system does this reveal? I
|
|
> know that:
|
|
>
|
|
> >We sync only WAL, not the other pages, except for the sync() call we do
|
|
> > during checkpoint when we discard old WAL files.
|
|
>
|
|
> But this is probably not a good thing. We should only be writing blocks
|
|
> when they need to be on disk. We should not be expecting the OS to write
|
|
> them "sometime later" and avoid blocking (as long) for the write. If we
|
|
> need that, then our buffer management is wrong and we need to fix it.
|
|
> The reason we are doing this is because we expect the OS buffer manager
|
|
> to do asynchronous I/O for us, but then we don't control the order. That
|
|
> is the reason why we have to call fdatasync(), to create "sequence
|
|
> points".
|
|
|
|
Yes. I think I understand. It is true we have to fsync WAL because we
|
|
can't control the individual writes by the OS.
|
|
|
|
> The reason we have performance problems with either D_OSYNC or fdatasync
|
|
> on the normal relations is because we have no dbflush process. This
|
|
> causes an unacceptable amount of I/O blocking by other transactions.
|
|
|
|
Uh, that would force writes all over the disk. Why do we really care how
|
|
the OS writes them? If we are going to fsync, let's just do the one
|
|
file and be done with it. What would a separate flusher process really
|
|
buy us if it has to use fsync too. The main backend doesn't have to
|
|
wait for the fsync, but then again, we can't say the transaction is
|
|
committed until it hits the disk, so how does a flusher help?
|
|
|
|
> The ORACLE people were not kidding when they said that they could not
|
|
> certify Linux for production use until it supported O_DSYNC. Can you
|
|
> explain why that was the case?
|
|
|
|
I don't see O_DSYNC as very different from write/fsync(or fdatasync).
|
|
|
|
> Finally, let me apologize if the above comes across as somewhat
|
|
> belligerent. I know very well that I can't compete with you guys for
|
|
> knowledge of the PostgreSQL system. I am still at a loss when I look at
|
|
> the optimizer and executor modules, and it will take some time before I
|
|
> can follow discussion of that area. Even then, I doubt my ability to
|
|
> compare with people like Mr. Lane and Mr. Momjian in experience and
|
|
> general intelligence, or in the field of database programming and
|
|
> software development in particular. However, this discussion and a
|
|
> search of the pgsql-hackers archives reveals this problem to be the KEY
|
|
> area of PostgreSQL's failing, and general misunderstanding, when
|
|
> compared to its commercial competitors.
|
|
|
|
We appreciate your ideas. Few of us are professional db folks so we are
|
|
always looking for good ideas.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 6: Have you searched our list archives?
|
|
|
|
http://archives.postgresql.org
|
|
|
|
|
|
|
|
From cjs@cynic.net Sun Jun 23 23:40:59 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O3evF17903
|
|
for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 23:40:58 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id 37F36F820; Mon, 24 Jun 2002 03:40:54 +0000 (UTC)
|
|
Date: Mon, 24 Jun 2002 12:40:51 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: "J. R. Nield" <jrnield@usol.com>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain>
|
|
Message-ID: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: RO
|
|
|
|
On 23 Jun 2002, J. R. Nield wrote:
|
|
|
|
> If is impossible to do what you want. You can not protect against
|
|
> partial writes without writing pages twice and calling fdatasync
|
|
> between them while going through a generic filesystem.
|
|
|
|
I agree with this.
|
|
|
|
> The best disk array will not protect you if the operating system does
|
|
> not align block writes to the structure of the underlying device.
|
|
|
|
This I don't quite understand. Assuming you're using a SCSI drive
|
|
(and this mostly applies to ATAPI/IDE, too), you can do naught but
|
|
align block writes to the structure of the underlying device. When you
|
|
initiate a SCSI WRITE command, you start by telling the device at which
|
|
block to start writing and how many blocks you intend to write. Then you
|
|
start passing the data.
|
|
|
|
(See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter
|
|
details for the SCSI WRITE(10) command. You may find the SCSI 2
|
|
specification, at http://www.danbbs.dk/~dino/SCSI/ to be a useful
|
|
reference here.)
|
|
|
|
> Even with raw devices, you need special support or knowledge of the
|
|
> operating system and/or the disk device to ensure that each write
|
|
> request will be atomic to the underlying hardware.
|
|
|
|
Well, so here I guess you're talking about two things:
|
|
|
|
1. When you request, say, an 8K block write, will the OS really
|
|
write it to disk in a single 8K or multiple of 8K SCSI write
|
|
command?
|
|
|
|
2. Does the SCSI device you're writing to consider these writes to
|
|
be transactional. That is, if the write is interrupted before being
|
|
completed, does the SCSI device guarantee that the partially-sent
|
|
data is not written, and the old data is maintained? And of course,
|
|
does it guarantee that, when it acknowledges a write, that write is
|
|
now in stable storage and will never go away?
|
|
|
|
Both of these are not hard to guarantee, actually. For a BSD-based OS,
|
|
for example, just make sure that your filesystem block size is the
|
|
same as or a multiple of the database block size. BSD will never write
|
|
anything other than a block or a sequence of blocks to a disk in a
|
|
single SCSI transaction (unless you've got a really odd SCSI driver).
|
|
And for your disk, buy a Baydel or Clarion disk array, or something
|
|
similar.
|
|
|
|
Given that it's not hard to set up a system that meets these criteria,
|
|
and this is in fact commonly done for database servers, it would seem a
|
|
good idea for postgres to have the option to take advantage of the time
|
|
and money spent and adjust its performance upward appropriately.
|
|
|
|
> All other systems rely on the fact that you can recover a damaged file
|
|
> using the log archive.
|
|
|
|
Not exactly. For MS SQL Server, at any rate, if it detects a page tear
|
|
you cannot restore based on the log file alone. You need a full or
|
|
partial backup that includes that entire torn block.
|
|
|
|
> This means downtime in the rare case, but no data loss. Until
|
|
> PostgreSQL can do this, then it will not be acceptable for real
|
|
> critical production use.
|
|
|
|
It seems to me that it is doing this right now. In fact, it's more
|
|
reliable than some commerial systems (such as SQL Server) because it can
|
|
recover from a torn block with just the logfile.
|
|
|
|
> But at the end of the day, unless you have complete understanding of
|
|
> the I/O system from write(2) through to the disk system, the only sure
|
|
> ways to protect against partial writes are by "careful writes" (in
|
|
> the WAL log or elsewhere, writing pages twice), or by requiring (and
|
|
> allowing) users to do log-replay recovery when a file is corrupted by
|
|
> a partial write.
|
|
|
|
I don't understand how, without a copy of the old data that was in the
|
|
torn block, you can restore that block from just log file entries. Can
|
|
you explain this to me? Take, as an example, a block with ten tuples,
|
|
only one of which has been changed "recently." (I.e., only that change
|
|
is in the log files.)
|
|
|
|
> If we log pages to WAL, they are useless when archived (after a
|
|
> checkpoint). So either we have a separate "log" for them (the
|
|
> ping-pong file), or we should at least remove them when archived,
|
|
> which makes log archiving more complex but is perfectly doable.
|
|
|
|
Right. That seems to me a better option, since we've now got only one
|
|
write point on the disk rather than two.
|
|
|
|
> Finally, I would love to hear why we are using the operating system
|
|
> buffer manager at all. The OS is acting as a secondary buffer manager
|
|
> for us. Why is that? What flaw in our I/O system does this reveal?
|
|
|
|
It's acting as a "second-level" buffer manager, yes, but to say it's
|
|
"secondary" may be a bit misleading. On most of the systems I've set
|
|
up, the OS buffer cache is doing the vast majority of the work, and the
|
|
postgres buffering is fairly minimal.
|
|
|
|
There are some good (and some perhaps not-so-good) reasons to do it this
|
|
way. I'll list them more or less in the order of best to worst:
|
|
|
|
1. The OS knows where the blocks physically reside on disk, and
|
|
postgres does not. Therefore it's in the interest of postgresql to
|
|
dispatch write responsibility back to the OS as quickly as possible
|
|
so that the OS can prioritize requests appropriately. Most operating
|
|
systems use an "elevator" algorithm to minimize disk head movement;
|
|
but if the OS does not have a block that it could write while the
|
|
head is "on the way" to another request, it can't write it in that
|
|
head pass.
|
|
|
|
2. Postgres does not know about any "bank-switching" tricks for
|
|
mapping more physical memory than it has address space. Thus, on
|
|
32-bit machines, postgres might be limited to mapping 2 or 3 GB of
|
|
memory, even though the machine has, say, 6 GB of physical RAM. The
|
|
OS can use all of the available memory for caching; postgres cannot.
|
|
|
|
3. A lot of work has been put into the seek algorithms, read-ahead
|
|
algorithms, block allocation algorithms, etc. in the OS. Why
|
|
duplicate all that work again in postgres?
|
|
|
|
When you say things like the following:
|
|
|
|
> We should only be writing blocks when they need to be on disk. We
|
|
> should not be expecting the OS to write them "sometime later" and
|
|
> avoid blocking (as long) for the write. If we need that, then our
|
|
> buffer management is wrong and we need to fix it.
|
|
|
|
you appear to be making the arugment that we should take the route of
|
|
other database systems, and use raw devices and our own management of
|
|
disk block allocation. If so, you might want first to look back through
|
|
the archives at the discussion I and several others had about this a
|
|
month or two ago. After looking in detail at what NetBSD, at least, does
|
|
in terms of its disk I/O algorithms and buffering, I've pretty much come
|
|
around, at least for the moment, to the attitude that we should stick
|
|
with using the OS. I wouldn't mind seeing postgres be able to manage all
|
|
of this stuff, but it's a *lot* of work for not all that much benefit
|
|
that I can see.
|
|
|
|
> The ORACLE people were not kidding when they said that they could not
|
|
> certify Linux for production use until it supported O_DSYNC. Can you
|
|
> explain why that was the case?
|
|
|
|
I'm suspecting it's because Linux at the time had no raw devices, so
|
|
O_DSYNC was the only other possible method of making sure that disk
|
|
writes actually got to disk.
|
|
|
|
You certainly don't want to use O_DSYNC if you can use another method,
|
|
because O_DSYNC still goes through the the operating system's buffer
|
|
cache, wasting memory and double-caching things. If you're doing your
|
|
own management, you need either to use a raw device or open files with
|
|
the flag that indicates that the buffer cache should not be used at all
|
|
for reads from and writes to that file.
|
|
|
|
> However, this discussion and a search of the pgsql-hackers archives
|
|
> reveals this problem to be the KEY area of PostgreSQL's failing, and
|
|
> general misunderstanding, when compared to its commercial competitors.
|
|
|
|
No, I think it's just that you're under a few minor misapprehensions
|
|
here about what postgres and the OS are actually doing. As I said, I
|
|
went through this whole exact argument a month or two ago, on this very
|
|
list, and I came around to the idea that what postgres is doing now
|
|
works quite well, at least on NetBSD. (Most other OSes have disk I/O
|
|
algorithms that are pretty much as good or better.) There might be a
|
|
very slight advantage to doing all one's own I/O management, but it's
|
|
a huge amount of work, and I think that much effort could be much more
|
|
usefully applied to other areas.
|
|
|
|
Just as a side note, I've been a NetBSD developer since about '96,
|
|
and have been delving into the details of OS design since well before
|
|
that time, so I'm coming to this with what I hope is reasonably good
|
|
knowledge of how disks work and how operating systems use them. (Not
|
|
that this should stop you from pointing out holes in my arguments. :-))
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From pgsql-hackers-owner+M24112@postgresql.org Mon Jun 24 18:16:36 2002
|
|
Return-path: <pgsql-hackers-owner+M24112@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OMGaF00910
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 18:16:36 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id A2EF1476475; Mon, 24 Jun 2002 16:43:38 -0400 (EDT)
|
|
Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 16:43:38 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id BA57D476148; Mon, 24 Jun 2002 14:14:00 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 93D6A477214
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 13:59:17 -0400 (EDT)
|
|
Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 13:59:17 2002
|
|
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
|
|
by postgresql.org (Postfix) with ESMTP id D70AA476401
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 10:06:26 -0400 (EDT)
|
|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
|
|
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OE6J117666;
|
|
Mon, 24 Jun 2002 10:06:19 -0400 (EDT)
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>, "J. R. Nield" <jrnield@usol.com>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net>
|
|
References: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net>
|
|
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
|
|
message dated "Mon, 24 Jun 2002 09:09:30 +0900"
|
|
Date: Mon, 24 Jun 2002 10:06:19 -0400
|
|
Message-ID: <17663.1024927579@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-5.3 required=5.0
|
|
tests=IN_REP_TO,X_NOT_PRESENT
|
|
version=2.30
|
|
Status: RO
|
|
|
|
> On Sun, 23 Jun 2002, Bruce Momjian wrote:
|
|
>> Yes, I don't see writing to two files vs. one to be any win, especially
|
|
>> when we need to fsync both of them. What I would really like is to
|
|
>> avoid the double I/O of writing to WAL and to the data file; improving
|
|
>> that would be a huge win.
|
|
|
|
I don't believe it's possible to eliminate the double I/O. Keep in mind
|
|
though that in the ideal case (plenty of shared buffers) you are only
|
|
paying two writes per modified block per checkpoint interval --- one to
|
|
the WAL during the first write of the interval, and then a write to the
|
|
real datafile issued by the checkpoint process. Anything that requires
|
|
transaction commits to write data blocks will likely result in more I/O
|
|
not less, at least for blocks that are modified by several successive
|
|
transactions.
|
|
|
|
The only thing I've been able to think of that seems like it might
|
|
improve matters is to make the WAL writing logic aware of the layout
|
|
of buffer pages --- specifically, to know that our pages generally
|
|
contain an uninteresting "hole" in the middle, and not write the hole.
|
|
Optimistically this might reduce the WAL data volume by something
|
|
approaching 50%; though pessimistically (if most pages are near full)
|
|
it wouldn't help much.
|
|
|
|
This was not very feasible when the WAL code was designed because the
|
|
buffer manager needed to cope with both normal pages and pg_log pages,
|
|
but as of 7.2 I think it'd be safe to assume that all pages have the
|
|
standard layout.
|
|
|
|
regards, tom lane
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 2: you can get off all lists at once with the unregister command
|
|
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M24116@postgresql.org Mon Jun 24 20:32:07 2002
|
|
Return-path: <pgsql-hackers-owner+M24116@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P0W7F10985
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 20:32:07 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id EBCE547632E; Mon, 24 Jun 2002 18:54:34 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 18:54:34 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 3EB93476D85; Mon, 24 Jun 2002 17:12:18 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id EBC20476E2E
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 14:54:40 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 14:54:40 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id 1C8874760C2
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 12:40:53 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5OGeVY06116;
|
|
Mon, 24 Jun 2002 12:40:31 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206241640.g5OGeVY06116@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <17663.1024927579@sss.pgh.pa.us>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Date: Mon, 24 Jun 2002 12:40:31 -0400 (EDT)
|
|
cc: Curt Sampson <cjs@cynic.net>, "J. R. Nield" <jrnield@usol.com>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
Tom Lane wrote:
|
|
> > On Sun, 23 Jun 2002, Bruce Momjian wrote:
|
|
> >> Yes, I don't see writing to two files vs. one to be any win, especially
|
|
> >> when we need to fsync both of them. What I would really like is to
|
|
> >> avoid the double I/O of writing to WAL and to the data file; improving
|
|
> >> that would be a huge win.
|
|
>
|
|
> I don't believe it's possible to eliminate the double I/O. Keep in mind
|
|
> though that in the ideal case (plenty of shared buffers) you are only
|
|
> paying two writes per modified block per checkpoint interval --- one to
|
|
> the WAL during the first write of the interval, and then a write to the
|
|
> real datafile issued by the checkpoint process. Anything that requires
|
|
> transaction commits to write data blocks will likely result in more I/O
|
|
> not less, at least for blocks that are modified by several successive
|
|
> transactions.
|
|
>
|
|
> The only thing I've been able to think of that seems like it might
|
|
> improve matters is to make the WAL writing logic aware of the layout
|
|
> of buffer pages --- specifically, to know that our pages generally
|
|
> contain an uninteresting "hole" in the middle, and not write the hole.
|
|
> Optimistically this might reduce the WAL data volume by something
|
|
> approaching 50%; though pessimistically (if most pages are near full)
|
|
> it wouldn't help much.
|
|
|
|
Good idea. How about putting the page through or TOAST compression
|
|
routine before writing it to WAL? Should be pretty easy and fast and
|
|
doesn't require any knowledge of the page format.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M24114@postgresql.org Mon Jun 24 17:54:35 2002
|
|
Return-path: <pgsql-hackers-owner+M24114@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLsZF28642
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:54:35 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id BD68F47683C; Mon, 24 Jun 2002 16:46:24 -0400 (EDT)
|
|
Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 16:46:24 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id B2719476B31; Mon, 24 Jun 2002 16:01:51 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 950004770BC
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 14:59:46 -0400 (EDT)
|
|
Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 14:59:46 2002
|
|
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
|
|
by postgresql.org (Postfix) with ESMTP id A0756475BB7
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 13:11:41 -0400 (EDT)
|
|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
|
|
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OHB1119826;
|
|
Mon, 24 Jun 2002 13:11:02 -0400 (EDT)
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, "J. R. Nield" <jrnield@usol.com>,
|
|
Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <200206241640.g5OGeVY06116@candle.pha.pa.us>
|
|
References: <200206241640.g5OGeVY06116@candle.pha.pa.us>
|
|
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
|
|
message dated "Mon, 24 Jun 2002 12:40:31 -0400"
|
|
Date: Mon, 24 Jun 2002 13:11:01 -0400
|
|
Message-ID: <19823.1024938661@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-5.3 required=5.0
|
|
tests=IN_REP_TO,X_NOT_PRESENT
|
|
version=2.30
|
|
Status: RO
|
|
|
|
Bruce Momjian <pgman@candle.pha.pa.us> writes:
|
|
>> The only thing I've been able to think of that seems like it might
|
|
>> improve matters is to make the WAL writing logic aware of the layout
|
|
>> of buffer pages --- specifically, to know that our pages generally
|
|
>> contain an uninteresting "hole" in the middle, and not write the hole.
|
|
>> Optimistically this might reduce the WAL data volume by something
|
|
>> approaching 50%; though pessimistically (if most pages are near full)
|
|
>> it wouldn't help much.
|
|
|
|
> Good idea. How about putting the page through or TOAST compression
|
|
> routine before writing it to WAL? Should be pretty easy and fast and
|
|
> doesn't require any knowledge of the page format.
|
|
|
|
Easy, maybe, but fast definitely NOT. The compressor is not speedy.
|
|
Given that we have to be holding various locks while we build WAL
|
|
records, I do not think it's a good idea to add CPU time there.
|
|
|
|
Also, compressing already-compressed data is not a win ...
|
|
|
|
regards, tom lane
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 3: if posting/reading through Usenet, please send an appropriate
|
|
subscribe-nomail command to majordomo@postgresql.org so that your
|
|
message can get through to the mailing list cleanly
|
|
|
|
|
|
|
|
From jrnield@usol.com Mon Jun 24 16:49:25 2002
|
|
Return-path: <jrnield@usol.com>
|
|
Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OKnNF23393
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 16:49:24 -0400 (EDT)
|
|
Received: from 08-113.024.popsite.net (08-113.024.popsite.net [66.19.4.113])
|
|
by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5OKnHV19100;
|
|
Mon, 24 Jun 2002 16:49:18 -0400
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
From: "J. R. Nield" <jrnield@usol.com>
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
In-Reply-To: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
|
|
References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
|
|
Content-Type: text/plain
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6)
|
|
Date: 24 Jun 2002 16:49:42 -0400
|
|
Message-ID: <1024951786.1793.865.camel@localhost.localdomain>
|
|
MIME-Version: 1.0
|
|
Status: ROr
|
|
|
|
On Sun, 2002-06-23 at 23:40, Curt Sampson wrote:
|
|
> On 23 Jun 2002, J. R. Nield wrote:
|
|
>
|
|
> > If is impossible to do what you want. You can not protect against
|
|
> > partial writes without writing pages twice and calling fdatasync
|
|
> > between them while going through a generic filesystem.
|
|
>
|
|
> I agree with this.
|
|
>
|
|
> > The best disk array will not protect you if the operating system does
|
|
> > not align block writes to the structure of the underlying device.
|
|
>
|
|
> This I don't quite understand. Assuming you're using a SCSI drive
|
|
> (and this mostly applies to ATAPI/IDE, too), you can do naught but
|
|
> align block writes to the structure of the underlying device. When you
|
|
> initiate a SCSI WRITE command, you start by telling the device at which
|
|
> block to start writing and how many blocks you intend to write. Then you
|
|
> start passing the data.
|
|
>
|
|
|
|
All I'm saying is that the entire postgresql block write must be
|
|
converted into exactly one SCSI write command in all cases, and I don't
|
|
know a portable way to ensure this.
|
|
|
|
> > Even with raw devices, you need special support or knowledge of the
|
|
> > operating system and/or the disk device to ensure that each write
|
|
> > request will be atomic to the underlying hardware.
|
|
>
|
|
> Well, so here I guess you're talking about two things:
|
|
>
|
|
> 1. When you request, say, an 8K block write, will the OS really
|
|
> write it to disk in a single 8K or multiple of 8K SCSI write
|
|
> command?
|
|
>
|
|
> 2. Does the SCSI device you're writing to consider these writes to
|
|
> be transactional. That is, if the write is interrupted before being
|
|
> completed, does the SCSI device guarantee that the partially-sent
|
|
> data is not written, and the old data is maintained? And of course,
|
|
> does it guarantee that, when it acknowledges a write, that write is
|
|
> now in stable storage and will never go away?
|
|
>
|
|
> Both of these are not hard to guarantee, actually. For a BSD-based OS,
|
|
> for example, just make sure that your filesystem block size is the
|
|
> same as or a multiple of the database block size. BSD will never write
|
|
> anything other than a block or a sequence of blocks to a disk in a
|
|
> single SCSI transaction (unless you've got a really odd SCSI driver).
|
|
> And for your disk, buy a Baydel or Clarion disk array, or something
|
|
> similar.
|
|
>
|
|
> Given that it's not hard to set up a system that meets these criteria,
|
|
> and this is in fact commonly done for database servers, it would seem a
|
|
> good idea for postgres to have the option to take advantage of the time
|
|
> and money spent and adjust its performance upward appropriately.
|
|
|
|
I agree with this. My point was only that you need to know what
|
|
guarantees your operating system/hardware combination provides on a
|
|
case-by-case basis, and there is no standard way for a program to
|
|
discover this. Most system administrators are not going to know this
|
|
either, unless databases are their main responsibility.
|
|
|
|
>
|
|
> > All other systems rely on the fact that you can recover a damaged file
|
|
> > using the log archive.
|
|
>
|
|
> Not exactly. For MS SQL Server, at any rate, if it detects a page tear
|
|
> you cannot restore based on the log file alone. You need a full or
|
|
> partial backup that includes that entire torn block.
|
|
>
|
|
|
|
I should have been more specific: you need a backup of the file from
|
|
some time ago, plus all the archived logs from then until the current
|
|
log sequence number.
|
|
|
|
> > This means downtime in the rare case, but no data loss. Until
|
|
> > PostgreSQL can do this, then it will not be acceptable for real
|
|
> > critical production use.
|
|
>
|
|
> It seems to me that it is doing this right now. In fact, it's more
|
|
> reliable than some commerial systems (such as SQL Server) because it can
|
|
> recover from a torn block with just the logfile.
|
|
|
|
Again, what I meant to say is that the commercial systems can recover
|
|
with an old file backup + logs. How old the backup can be depends only
|
|
on how much time you are willing to spend playing the logs forward. So
|
|
if you do a full backup once a week, and multiplex and backup the logs,
|
|
then even if a backup tape gets destroyed you can still survive. It just
|
|
takes longer.
|
|
|
|
Also, postgreSQL can't recover from any other type of block corruption,
|
|
while the commercial systems can. That's what I meant by the "critical
|
|
production use" comment, which was sort-of unfair.
|
|
|
|
So I would say they are equally reliable for torn pages (but not bad
|
|
blocks), and the commercial systems let you trade potential recovery
|
|
time for not having to write the blocks twice. You do need to back-up
|
|
the log archives though.
|
|
|
|
>
|
|
> > But at the end of the day, unless you have complete understanding of
|
|
> > the I/O system from write(2) through to the disk system, the only sure
|
|
> > ways to protect against partial writes are by "careful writes" (in
|
|
> > the WAL log or elsewhere, writing pages twice), or by requiring (and
|
|
> > allowing) users to do log-replay recovery when a file is corrupted by
|
|
> > a partial write.
|
|
>
|
|
> I don't understand how, without a copy of the old data that was in the
|
|
> torn block, you can restore that block from just log file entries. Can
|
|
> you explain this to me? Take, as an example, a block with ten tuples,
|
|
> only one of which has been changed "recently." (I.e., only that change
|
|
> is in the log files.)
|
|
>
|
|
>
|
|
> > If we log pages to WAL, they are useless when archived (after a
|
|
> > checkpoint). So either we have a separate "log" for them (the
|
|
> > ping-pong file), or we should at least remove them when archived,
|
|
> > which makes log archiving more complex but is perfectly doable.
|
|
>
|
|
> Right. That seems to me a better option, since we've now got only one
|
|
> write point on the disk rather than two.
|
|
|
|
OK. I agree with this now.
|
|
|
|
>
|
|
> > Finally, I would love to hear why we are using the operating system
|
|
> > buffer manager at all. The OS is acting as a secondary buffer manager
|
|
> > for us. Why is that? What flaw in our I/O system does this reveal?
|
|
>
|
|
> It's acting as a "second-level" buffer manager, yes, but to say it's
|
|
> "secondary" may be a bit misleading. On most of the systems I've set
|
|
> up, the OS buffer cache is doing the vast majority of the work, and the
|
|
> postgres buffering is fairly minimal.
|
|
>
|
|
> There are some good (and some perhaps not-so-good) reasons to do it this
|
|
> way. I'll list them more or less in the order of best to worst:
|
|
>
|
|
> 1. The OS knows where the blocks physically reside on disk, and
|
|
> postgres does not. Therefore it's in the interest of postgresql to
|
|
> dispatch write responsibility back to the OS as quickly as possible
|
|
> so that the OS can prioritize requests appropriately. Most operating
|
|
> systems use an "elevator" algorithm to minimize disk head movement;
|
|
> but if the OS does not have a block that it could write while the
|
|
> head is "on the way" to another request, it can't write it in that
|
|
> head pass.
|
|
>
|
|
> 2. Postgres does not know about any "bank-switching" tricks for
|
|
> mapping more physical memory than it has address space. Thus, on
|
|
> 32-bit machines, postgres might be limited to mapping 2 or 3 GB of
|
|
> memory, even though the machine has, say, 6 GB of physical RAM. The
|
|
> OS can use all of the available memory for caching; postgres cannot.
|
|
>
|
|
> 3. A lot of work has been put into the seek algorithms, read-ahead
|
|
> algorithms, block allocation algorithms, etc. in the OS. Why
|
|
> duplicate all that work again in postgres?
|
|
>
|
|
> When you say things like the following:
|
|
>
|
|
> > We should only be writing blocks when they need to be on disk. We
|
|
> > should not be expecting the OS to write them "sometime later" and
|
|
> > avoid blocking (as long) for the write. If we need that, then our
|
|
> > buffer management is wrong and we need to fix it.
|
|
>
|
|
> you appear to be making the arugment that we should take the route of
|
|
> other database systems, and use raw devices and our own management of
|
|
> disk block allocation. If so, you might want first to look back through
|
|
> the archives at the discussion I and several others had about this a
|
|
> month or two ago. After looking in detail at what NetBSD, at least, does
|
|
> in terms of its disk I/O algorithms and buffering, I've pretty much come
|
|
> around, at least for the moment, to the attitude that we should stick
|
|
> with using the OS. I wouldn't mind seeing postgres be able to manage all
|
|
> of this stuff, but it's a *lot* of work for not all that much benefit
|
|
> that I can see.
|
|
|
|
I'll back off on that. I don't know if we want to use the OS buffer
|
|
manager, but shouldn't we try to have our buffer manager group writes
|
|
together by files, and pro-actively get them out to disk? Right now, it
|
|
looks like all our write requests are delayed as long as possible and
|
|
the order in which they are written is pretty-much random, as is the
|
|
backend that writes the block, so there is no locality of reference even
|
|
when the blocks are adjacent on disk, and the write calls are spread-out
|
|
over all the backends.
|
|
|
|
Would it not be the case that things like read-ahead, grouping writes,
|
|
and caching written data are probably best done by PostgreSQL, because
|
|
only our buffer manager can understand when they will be useful or when
|
|
they will thrash the cache?
|
|
|
|
I may likely be wrong on this, and I haven't done any performance
|
|
testing. I shouldn't have brought this up alongside the logging issues,
|
|
but there seemed to be some question about whether the OS was actually
|
|
doing all these things behind the scene.
|
|
|
|
|
|
>
|
|
> > The ORACLE people were not kidding when they said that they could not
|
|
> > certify Linux for production use until it supported O_DSYNC. Can you
|
|
> > explain why that was the case?
|
|
>
|
|
> I'm suspecting it's because Linux at the time had no raw devices, so
|
|
> O_DSYNC was the only other possible method of making sure that disk
|
|
> writes actually got to disk.
|
|
>
|
|
> You certainly don't want to use O_DSYNC if you can use another method,
|
|
> because O_DSYNC still goes through the the operating system's buffer
|
|
> cache, wasting memory and double-caching things. If you're doing your
|
|
> own management, you need either to use a raw device or open files with
|
|
> the flag that indicates that the buffer cache should not be used at all
|
|
> for reads from and writes to that file.
|
|
|
|
Would O_DSYNC|O_RSYNC turn off the cache?
|
|
|
|
>
|
|
> > However, this discussion and a search of the pgsql-hackers archives
|
|
> > reveals this problem to be the KEY area of PostgreSQL's failing, and
|
|
> > general misunderstanding, when compared to its commercial competitors.
|
|
>
|
|
> No, I think it's just that you're under a few minor misapprehensions
|
|
> here about what postgres and the OS are actually doing. As I said, I
|
|
> went through this whole exact argument a month or two ago, on this very
|
|
> list, and I came around to the idea that what postgres is doing now
|
|
> works quite well, at least on NetBSD. (Most other OSes have disk I/O
|
|
> algorithms that are pretty much as good or better.) There might be a
|
|
> very slight advantage to doing all one's own I/O management, but it's
|
|
> a huge amount of work, and I think that much effort could be much more
|
|
> usefully applied to other areas.
|
|
|
|
I will look for that discussion in the archives.
|
|
|
|
The logging issue is a key one I think. At least I would be very nervous
|
|
as a DBA if I were running a system where any damaged file would cause
|
|
data loss.
|
|
|
|
Does anyone know what the major barriers to infinite log replay are in
|
|
PostgreSQL? I'm trying to look for everything that might need to be
|
|
changed outside xlog.c, but surely this has come up before. Searching
|
|
the archives hasn't revealed much.
|
|
|
|
|
|
|
|
As to the I/O issue:
|
|
|
|
Since you know a lot about NetBSD internals, I'd be interested in
|
|
hearing about what postgresql looks like to the NetBSD buffer manager.
|
|
Am I right that strings of successive writes get randomized? What do our
|
|
cache-hit percentages look like? I'm going to do some experimenting with
|
|
this.
|
|
|
|
>
|
|
> Just as a side note, I've been a NetBSD developer since about '96,
|
|
> and have been delving into the details of OS design since well before
|
|
> that time, so I'm coming to this with what I hope is reasonably good
|
|
> knowledge of how disks work and how operating systems use them. (Not
|
|
> that this should stop you from pointing out holes in my arguments. :-))
|
|
>
|
|
|
|
This stuff is very difficult to get right. Glad to know you follow this
|
|
list.
|
|
|
|
|
|
> cjs
|
|
> --
|
|
> Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
> Don't you know, in this new Dark Age, we're all light. --XTC
|
|
>
|
|
--
|
|
J. R. Nield
|
|
jrnield@usol.com
|
|
|
|
|
|
|
|
|
|
From tgl@sss.pgh.pa.us Mon Jun 24 17:16:06 2002
|
|
Return-path: <tgl@sss.pgh.pa.us>
|
|
Received: from sss.pgh.pa.us (root@[192.204.191.242])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLG5F25284
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:16:05 -0400 (EDT)
|
|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
|
|
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLG2121379;
|
|
Mon, 24 Jun 2002 17:16:02 -0400 (EDT)
|
|
To: "J. R. Nield" <jrnield@usol.com>
|
|
cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain>
|
|
References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net> <1024951786.1793.865.camel@localhost.localdomain>
|
|
Comments: In-reply-to "J. R. Nield" <jrnield@usol.com>
|
|
message dated "24 Jun 2002 16:49:42 -0400"
|
|
Date: Mon, 24 Jun 2002 17:16:01 -0400
|
|
Message-ID: <21376.1024953361@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Status: RO
|
|
|
|
"J. R. Nield" <jrnield@usol.com> writes:
|
|
> Also, postgreSQL can't recover from any other type of block corruption,
|
|
> while the commercial systems can.
|
|
|
|
Say again?
|
|
|
|
> Would it not be the case that things like read-ahead, grouping writes,
|
|
> and caching written data are probably best done by PostgreSQL, because
|
|
> only our buffer manager can understand when they will be useful or when
|
|
> they will thrash the cache?
|
|
|
|
I think you have been missing the point. No one denies that there will
|
|
be some incremental gain if we do all that. However, the conclusion of
|
|
everyone who has thought much about it (and I see Curt has joined that
|
|
group) is that the effort would be far out of proportion to the probable
|
|
gain. There are a lot of other things we desperately need to spend time
|
|
on that would not amount to re-engineering large quantities of OS-level
|
|
code. Given that most Unixen have perfectly respectable disk management
|
|
subsystems, we prefer to tune our code to make use of that stuff, rather
|
|
than follow the "conventional wisdom" that databases need to bypass it.
|
|
|
|
Oracle can afford to do that sort of thing because they have umpteen
|
|
thousand developers available. Postgres does not.
|
|
|
|
regards, tom lane
|
|
|
|
From pgsql-hackers-owner+M24128@postgresql.org Mon Jun 24 22:01:58 2002
|
|
Return-path: <pgsql-hackers-owner+M24128@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P21vF19918
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 22:01:57 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 540B8475B33; Mon, 24 Jun 2002 21:34:40 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 21:34:40 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 0A13F476965; Mon, 24 Jun 2002 19:30:14 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id B4F62476E4A
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 18:53:59 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 18:53:59 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id 36043475BF6
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 17:25:28 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5OLPFG26140;
|
|
Mon, 24 Jun 2002 17:25:15 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206242125.g5OLPFG26140@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain>
|
|
To: "J. R. Nield" <jrnield@usol.com>
|
|
Date: Mon, 24 Jun 2002 17:25:14 -0400 (EDT)
|
|
cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
J. R. Nield wrote:
|
|
> > This I don't quite understand. Assuming you're using a SCSI drive
|
|
> > (and this mostly applies to ATAPI/IDE, too), you can do naught but
|
|
> > align block writes to the structure of the underlying device. When you
|
|
> > initiate a SCSI WRITE command, you start by telling the device at which
|
|
> > block to start writing and how many blocks you intend to write. Then you
|
|
> > start passing the data.
|
|
> >
|
|
>
|
|
> All I'm saying is that the entire postgresql block write must be
|
|
> converted into exactly one SCSI write command in all cases, and I don't
|
|
> know a portable way to ensure this.
|
|
|
|
...
|
|
|
|
> I agree with this. My point was only that you need to know what
|
|
> guarantees your operating system/hardware combination provides on a
|
|
> case-by-case basis, and there is no standard way for a program to
|
|
> discover this. Most system administrators are not going to know this
|
|
> either, unless databases are their main responsibility.
|
|
|
|
Yes, agreed. >1% are going to know the answer to this question so we
|
|
have to assume worst case.
|
|
|
|
> > It seems to me that it is doing this right now. In fact, it's more
|
|
> > reliable than some commerial systems (such as SQL Server) because it can
|
|
> > recover from a torn block with just the logfile.
|
|
>
|
|
> Again, what I meant to say is that the commercial systems can recover
|
|
> with an old file backup + logs. How old the backup can be depends only
|
|
> on how much time you are willing to spend playing the logs forward. So
|
|
> if you do a full backup once a week, and multiplex and backup the logs,
|
|
> then even if a backup tape gets destroyed you can still survive. It just
|
|
> takes longer.
|
|
>
|
|
> Also, postgreSQL can't recover from any other type of block corruption,
|
|
> while the commercial systems can. That's what I meant by the "critical
|
|
> production use" comment, which was sort-of unfair.
|
|
>
|
|
> So I would say they are equally reliable for torn pages (but not bad
|
|
> blocks), and the commercial systems let you trade potential recovery
|
|
> time for not having to write the blocks twice. You do need to back-up
|
|
> the log archives though.
|
|
|
|
Yes, good tradeoff analysis. We recover from partial writes quicker,
|
|
and don't require saving of log files, _but_ we don't recover from bad
|
|
disk blocks. Good summary.
|
|
|
|
> I'll back off on that. I don't know if we want to use the OS buffer
|
|
> manager, but shouldn't we try to have our buffer manager group writes
|
|
> together by files, and pro-actively get them out to disk? Right now, it
|
|
> looks like all our write requests are delayed as long as possible and
|
|
> the order in which they are written is pretty-much random, as is the
|
|
> backend that writes the block, so there is no locality of reference even
|
|
> when the blocks are adjacent on disk, and the write calls are spread-out
|
|
> over all the backends.
|
|
>
|
|
> Would it not be the case that things like read-ahead, grouping writes,
|
|
> and caching written data are probably best done by PostgreSQL, because
|
|
> only our buffer manager can understand when they will be useful or when
|
|
> they will thrash the cache?
|
|
|
|
The OS should handle all of this. We are doing main table writes but no
|
|
sync until checkpoint, so the OS can keep those blocks around and write
|
|
them at its convenience. It knows the size of the buffer cache and when
|
|
stuff is forced to disk. We can't second-guess that.
|
|
|
|
> I may likely be wrong on this, and I haven't done any performance
|
|
> testing. I shouldn't have brought this up alongside the logging issues,
|
|
> but there seemed to be some question about whether the OS was actually
|
|
> doing all these things behind the scene.
|
|
|
|
It had better. Looking at the kernel source is the way to know.
|
|
|
|
> Does anyone know what the major barriers to infinite log replay are in
|
|
> PostgreSQL? I'm trying to look for everything that might need to be
|
|
> changed outside xlog.c, but surely this has come up before. Searching
|
|
> the archives hasn't revealed much.
|
|
|
|
This has been brought up. Could we just save WAL files and get replay?
|
|
I believe some things have to be added to WAL to allow this, but it
|
|
seems possible. However, the pg_dump is just a data dump and does not
|
|
have the file offsets and things. Somehow you would need a tar-type
|
|
backup of the database, and with a running db, it is hard to get a valid
|
|
snapshot of that.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 3: if posting/reading through Usenet, please send an appropriate
|
|
subscribe-nomail command to majordomo@postgresql.org so that your
|
|
message can get through to the mailing list cleanly
|
|
|
|
|
|
|
|
From tgl@sss.pgh.pa.us Mon Jun 24 17:31:57 2002
|
|
Return-path: <tgl@sss.pgh.pa.us>
|
|
Received: from sss.pgh.pa.us (root@[192.204.191.242])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLVuF26684
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:31:56 -0400 (EDT)
|
|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
|
|
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLVu121485;
|
|
Mon, 24 Jun 2002 17:31:56 -0400 (EDT)
|
|
To: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Curt Sampson <cjs@cynic.net>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <200206242125.g5OLPFG26140@candle.pha.pa.us>
|
|
References: <200206242125.g5OLPFG26140@candle.pha.pa.us>
|
|
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
|
|
message dated "Mon, 24 Jun 2002 17:25:14 -0400"
|
|
Date: Mon, 24 Jun 2002 17:31:56 -0400
|
|
Message-ID: <21482.1024954316@sss.pgh.pa.us>
|
|
From: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Status: ROr
|
|
|
|
Bruce Momjian <pgman@candle.pha.pa.us> writes:
|
|
>> Does anyone know what the major barriers to infinite log replay are in
|
|
>> PostgreSQL? I'm trying to look for everything that might need to be
|
|
>> changed outside xlog.c, but surely this has come up before. Searching
|
|
>> the archives hasn't revealed much.
|
|
|
|
> This has been brought up. Could we just save WAL files and get replay?
|
|
> I believe some things have to be added to WAL to allow this, but it
|
|
> seems possible.
|
|
|
|
The Red Hat group has been looking at this somewhat; so far there seem
|
|
to be some minor tweaks that would be needed, but no showstoppers.
|
|
|
|
> Somehow you would need a tar-type
|
|
> backup of the database, and with a running db, it is hard to get a valid
|
|
> snapshot of that.
|
|
|
|
But you don't *need* a "valid snapshot", only a correct copy of
|
|
every block older than the first checkpoint in your WAL log series.
|
|
Any inconsistencies in your tar dump will look like repairable damage;
|
|
replaying the WAL log will fix 'em.
|
|
|
|
regards, tom lane
|
|
|
|
From pgsql-hackers-owner+M24133@postgresql.org Mon Jun 24 22:19:55 2002
|
|
Return-path: <pgsql-hackers-owner+M24133@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P2JsF21543
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 22:19:54 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id 42391476E53; Mon, 24 Jun 2002 22:09:49 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 22:09:49 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 191654774EB; Mon, 24 Jun 2002 20:26:08 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 8EB90476101
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 19:43:19 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 19:43:19 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id 08018476931
|
|
for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 17:33:53 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5OLXhl26908;
|
|
Mon, 24 Jun 2002 17:33:43 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206242133.g5OLXhl26908@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <21482.1024954316@sss.pgh.pa.us>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
Date: Mon, 24 Jun 2002 17:33:43 -0400 (EDT)
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Curt Sampson <cjs@cynic.net>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
Tom Lane wrote:
|
|
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
|
|
> >> Does anyone know what the major barriers to infinite log replay are in
|
|
> >> PostgreSQL? I'm trying to look for everything that might need to be
|
|
> >> changed outside xlog.c, but surely this has come up before. Searching
|
|
> >> the archives hasn't revealed much.
|
|
>
|
|
> > This has been brought up. Could we just save WAL files and get replay?
|
|
> > I believe some things have to be added to WAL to allow this, but it
|
|
> > seems possible.
|
|
>
|
|
> The Red Hat group has been looking at this somewhat; so far there seem
|
|
> to be some minor tweaks that would be needed, but no showstoppers.
|
|
|
|
|
|
Good.
|
|
|
|
> > Somehow you would need a tar-type
|
|
> > backup of the database, and with a running db, it is hard to get a valid
|
|
> > snapshot of that.
|
|
>
|
|
> But you don't *need* a "valid snapshot", only a correct copy of
|
|
> every block older than the first checkpoint in your WAL log series.
|
|
> Any inconsistencies in your tar dump will look like repairable damage;
|
|
> replaying the WAL log will fix 'em.
|
|
|
|
Yes, my point was that you need physical file backups, not pg_dump, and
|
|
you have to be tricky about the files changing during the backup. You
|
|
_can_ work around changes to the files during backup.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
/usr/local/bin/mime: cannot create /dev/ttyp3: permission denied
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
|
|
|
|
|
|
|
|
From jrnield@usol.com Mon Jun 24 20:27:45 2002
|
|
Return-path: <jrnield@usol.com>
|
|
Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P0RhF10711
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 20:27:44 -0400 (EDT)
|
|
Received: from 08-159.024.popsite.net (08-159.024.popsite.net [66.19.4.159])
|
|
by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5P0RbV01261;
|
|
Mon, 24 Jun 2002 20:27:37 -0400
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
From: "J. R. Nield" <jrnield@usol.com>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
In-Reply-To: <21376.1024953361@sss.pgh.pa.us>
|
|
References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
|
|
<1024951786.1793.865.camel@localhost.localdomain>
|
|
<21376.1024953361@sss.pgh.pa.us>
|
|
Content-Type: text/plain
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6)
|
|
Date: 24 Jun 2002 20:28:00 -0400
|
|
Message-ID: <1024964884.3031.876.camel@localhost.localdomain>
|
|
MIME-Version: 1.0
|
|
Status: RO
|
|
|
|
On Mon, 2002-06-24 at 17:16, Tom Lane wrote:
|
|
|
|
> I think you have been missing the point...
|
|
Yes, this appears to be the case. Thanks especially to Curt for clearing
|
|
things up for me.
|
|
|
|
--
|
|
J. R. Nield
|
|
jrnield@usol.com
|
|
|
|
|
|
|
|
|
|
From cjs@cynic.net Mon Jun 24 23:32:23 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net ([63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P3WMF28287
|
|
for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 23:32:23 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id 28AB5F820; Tue, 25 Jun 2002 03:32:08 +0000 (UTC)
|
|
Date: Tue, 25 Jun 2002 12:32:05 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: "J. R. Nield" <jrnield@usol.com>
|
|
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain>
|
|
Message-ID: <Pine.NEB.4.43.0206251229010.17448-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: RO
|
|
|
|
On 24 Jun 2002, J. R. Nield wrote:
|
|
|
|
> All I'm saying is that the entire postgresql block write must be
|
|
> converted into exactly one SCSI write command in all cases, and I don't
|
|
> know a portable way to ensure this.
|
|
|
|
No, there's no portable way. All you can do is give the admin who
|
|
is able to set things up safely the ability to turn of the now-unneeded
|
|
(and expensive) safety-related stuff that postgres does.
|
|
|
|
> I agree with this. My point was only that you need to know what
|
|
> guarantees your operating system/hardware combination provides on a
|
|
> case-by-case basis, and there is no standard way for a program to
|
|
> discover this. Most system administrators are not going to know this
|
|
> either, unless databases are their main responsibility.
|
|
|
|
Certainly this is true of pretty much every database system out there.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From cjs@cynic.net Tue Jun 25 01:09:02 2002
|
|
Return-path: <cjs@cynic.net>
|
|
Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P591F07292
|
|
for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 01:09:01 -0400 (EDT)
|
|
Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
|
|
by academic.cynic.net (Postfix) with ESMTP
|
|
id 517BEF820; Tue, 25 Jun 2002 05:09:02 +0000 (UTC)
|
|
Date: Tue, 25 Jun 2002 14:08:59 +0900 (JST)
|
|
From: Curt Sampson <cjs@cynic.net>
|
|
To: Tom Lane <tgl@sss.pgh.pa.us>
|
|
cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <21376.1024953361@sss.pgh.pa.us>
|
|
Message-ID: <Pine.NEB.4.43.0206251406390.17448-100000@angelic.cynic.net>
|
|
MIME-Version: 1.0
|
|
Content-Type: TEXT/PLAIN; charset=US-ASCII
|
|
Status: ROr
|
|
|
|
On Mon, 24 Jun 2002, Tom Lane wrote:
|
|
|
|
> There are a lot of other things we desperately need to spend time
|
|
> on that would not amount to re-engineering large quantities of OS-level
|
|
> code. Given that most Unixen have perfectly respectable disk management
|
|
> subsystems, we prefer to tune our code to make use of that stuff, rather
|
|
> than follow the "conventional wisdom" that databases need to bypass it.
|
|
> ...
|
|
> Oracle can afford to do that sort of thing because they have umpteen
|
|
> thousand developers available. Postgres does not.
|
|
|
|
Well, Oracle also started out, a long long time ago, on systems without
|
|
unified buffer cache and so on, and so they *had* to write this stuff
|
|
because otherwise data would not be cached. So Oracle can also afford to
|
|
maintain it now because the code already exists.
|
|
|
|
cjs
|
|
--
|
|
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
|
|
Don't you know, in this new Dark Age, we're all light. --XTC
|
|
|
|
|
|
From pgsql-hackers-owner+M24154@postgresql.org Tue Jun 25 09:22:38 2002
|
|
Return-path: <pgsql-hackers-owner+M24154@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PDMbF03932
|
|
for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 09:22:37 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP
|
|
id C12C3475E4A; Tue, 25 Jun 2002 09:22:32 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 09:22:32 2002
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 65471475C7A; Tue, 25 Jun 2002 09:22:23 -0400 (EDT)
|
|
Received: from localhost.localdomain (postgresql.org [64.49.215.8])
|
|
by localhost (Postfix) with ESMTP id 97C8C475A7C
|
|
for <pgsql-hackers@postgresql.org>; Tue, 25 Jun 2002 09:22:20 -0400 (EDT)
|
|
Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 09:22:20 2002
|
|
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
|
|
by postgresql.org (Postfix) with ESMTP id 42C0B475A64
|
|
for <pgsql-hackers@postgresql.org>; Tue, 25 Jun 2002 09:22:19 -0400 (EDT)
|
|
Received: (from pgman@localhost)
|
|
by candle.pha.pa.us (8.11.6/8.10.1) id g5PDM5B03772;
|
|
Tue, 25 Jun 2002 09:22:05 -0400 (EDT)
|
|
From: Bruce Momjian <pgman@candle.pha.pa.us>
|
|
Message-ID: <200206251322.g5PDM5B03772@candle.pha.pa.us>
|
|
Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
|
|
In-Reply-To: <Pine.NEB.4.43.0206251406390.17448-100000@angelic.cynic.net>
|
|
To: Curt Sampson <cjs@cynic.net>
|
|
Date: Tue, 25 Jun 2002 09:22:05 -0400 (EDT)
|
|
cc: Tom Lane <tgl@sss.pgh.pa.us>, "J. R. Nield" <jrnield@usol.com>,
|
|
PostgreSQL Hacker <pgsql-hackers@postgresql.org>
|
|
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
|
|
MIME-Version: 1.0
|
|
Content-Transfer-Encoding: 7bit
|
|
Content-Type: text/plain; charset=US-ASCII
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Spam-Status: No, hits=-3.4 required=5.0
|
|
tests=IN_REP_TO
|
|
version=2.30
|
|
Status: RO
|
|
|
|
Curt Sampson wrote:
|
|
> On Mon, 24 Jun 2002, Tom Lane wrote:
|
|
>
|
|
> > There are a lot of other things we desperately need to spend time
|
|
> > on that would not amount to re-engineering large quantities of OS-level
|
|
> > code. Given that most Unixen have perfectly respectable disk management
|
|
> > subsystems, we prefer to tune our code to make use of that stuff, rather
|
|
> > than follow the "conventional wisdom" that databases need to bypass it.
|
|
> > ...
|
|
> > Oracle can afford to do that sort of thing because they have umpteen
|
|
> > thousand developers available. Postgres does not.
|
|
>
|
|
> Well, Oracle also started out, a long long time ago, on systems without
|
|
> unified buffer cache and so on, and so they *had* to write this stuff
|
|
> because otherwise data would not be cached. So Oracle can also afford to
|
|
> maintain it now because the code already exists.
|
|
|
|
Well, actually, it isn't unified buffer cache that is the issue, but
|
|
rather the older SysV file system had pretty poor performance so
|
|
bypassing it was a bigger win that it is today.
|
|
|
|
--
|
|
Bruce Momjian | http://candle.pha.pa.us
|
|
pgman@candle.pha.pa.us | (610) 853-3000
|
|
+ If your life is a hard drive, | 830 Blythe Avenue
|
|
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 4: Don't 'kill -9' the postmaster
|
|
|
|
|
|
|
|
From pgsql-hackers-owner+M31893@postgresql.org Fri Nov 15 11:25:58 2002
|
|
Return-path: <pgsql-hackers-owner+M31893@postgresql.org>
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id gAFHPvR10276
|
|
for <pgman@candle.pha.pa.us>; Fri, 15 Nov 2002 12:25:57 -0500 (EST)
|
|
Received: from localhost (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with ESMTP
|
|
id A2D5A4774A1; Fri, 15 Nov 2002 11:34:54 -0500 (EST)
|
|
Received: from postgresql.org (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with SMTP
|
|
id 5E898477132; Fri, 15 Nov 2002 11:15:45 -0500 (EST)
|
|
Received: from localhost (postgresql.org [64.49.215.8])
|
|
by postgresql.org (Postfix) with ESMTP id 90CF1475B85
|
|
for <pgsql-hackers@postgresql.org>; Mon, 11 Nov 2002 15:33:47 -0500 (EST)
|
|
Received: from Curtis-Vaio (unknown [63.164.0.45])
|
|
by postgresql.org (Postfix) with SMTP id C6CB1475A3F
|
|
for <pgsql-hackers@postgresql.org>; Mon, 11 Nov 2002 15:33:46 -0500 (EST)
|
|
Received: from [127.0.0.1] by Curtis-Vaio
|
|
(ArGoSoft Mail Server Freeware, Version 1.8 (1.8.1.7)); Mon, 11 Nov 2002 16:33:42 -0400
|
|
From: "Curtis Faith" <curtis@galtcapital.com>
|
|
To: <pgsql-hackers@postgresql.org>
|
|
Subject: [HACKERS] 500 tpsQL + WAL log implementation
|
|
Date: Mon, 11 Nov 2002 16:33:41 -0400
|
|
Message-ID: <DMEEJMCDOJAKPPFACMPMCEBMCFAA.curtis@galtcapital.com>
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain;
|
|
charset="iso-8859-1"
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Priority: 3 (Normal)
|
|
X-MSMail-Priority: Normal
|
|
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
|
|
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
|
|
Importance: Normal
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by AMaViS new-20020517
|
|
Status: ROr
|
|
|
|
I have been experimenting with empirical tests of file system and device
|
|
level writes to determine the actual constraints in order to speed up the WAL
|
|
logging code.
|
|
|
|
Using a raw file partition and a time-based technique for determining the
|
|
optimal write position, I am able to get 8K writes physically written to disk
|
|
synchronously in the range of 500 to 650 writes per second using FreeBSD raw
|
|
device partitions on IDE disks (with write cache disabled). I will be
|
|
testing it soon under linux with 10,00RPM SCSI which should be even better.
|
|
It is my belief that the mechanism used to achieve these speeds could be
|
|
incorporated into the existing WAL logging code as an abstraction that looks
|
|
to the WAL code just like the file level access currently used. The current
|
|
speeds are limited by the speed of a single disk rotation. For a 7,200 RPM
|
|
disk this is 120/second, for a 10,000 RPM disk this is 166.66/second
|
|
|
|
The mechanism works by adjusting the seek offset of the write by using
|
|
gettimeofday to determine approximately where the disk head is in its
|
|
rotation. The mechanism does not use any AIO calls.
|
|
|
|
Assuming the following:
|
|
|
|
1) Disk rotation time is 8.333ms or 8333us (7200 RPM).
|
|
|
|
2) A write at offset 1,500K completes at system time 103s 000ms 000us
|
|
|
|
3) A new write is requested at system time 103s 004ms 166us
|
|
|
|
4) A 390K per rotation alignment of the data on the disk.
|
|
|
|
5) A write must be sent at least 20K ahead of the current head position to
|
|
ensure that it is written in less than one rotation.
|
|
|
|
It can be determined from the above that a write for an offset of something
|
|
slightly more than 195K past the last write, or offset 1,695K will be ahead
|
|
of the current location of the head and will therefore complete in less than
|
|
a single rotation's time.
|
|
|
|
The disk specific metrics (rotation speed, bytes per rotation, base write
|
|
time, etc.) can be derived empirically through a tester program that would
|
|
take a few minutes to run and which could be run at log setup time.
|
|
|
|
The obvious problem with the above mechanism is that the WAL log needs to be
|
|
able to read from the log file in transaction order during recovery. This
|
|
could be provided for using an abstraction that prepends the logical order
|
|
for each block written to the disk and makes sure that the log blocks contain
|
|
either a valid logical order number or some other marker indicating that the
|
|
block is not being used.
|
|
|
|
A bitmap of blocks that have already been used would be kept in memory for
|
|
quickly determining the next set of possible unused blocks but this bitmap
|
|
would not need to be written to disk except during normal shutdown since in
|
|
the even of a failure the bitmaps would be reconstructed by reading all the
|
|
blocks from the disk.
|
|
|
|
Checkpointing and something akin to log rotation could be handled using this
|
|
mechanism as well.
|
|
|
|
So, MY REAL QUESTION is whether or not this is the sort of speed improvement
|
|
that warrants the work of writing the required abstraction layer and making
|
|
this very robust. The WAL code should remain essentially unchanged, with
|
|
perhaps new calls for the five or six routines used to access the log files,
|
|
and handle the equivalent of log rotation for raw device access. These new
|
|
calls would either use the current file based implementation or the new
|
|
logging mechanism depending on the configuration.
|
|
|
|
I anticipate that the extra work required for a PostgreSQL administrator to
|
|
use the proposed logging mechanism would be to:
|
|
|
|
1) Create a raw device partition of the appropriate size
|
|
2) Run the metrics tester for that device partition
|
|
3) Set the appropriate configuration parameters to indicate raw WAL logging
|
|
|
|
I anticipate that the additional space requirements for this system would be
|
|
on the order of 10% to 15% beyond the current file-based implementation's
|
|
requirements.
|
|
|
|
So, is this worth doing? Would a robust implementation likely be accepted for
|
|
7.4 assuming it can demonstrate speed improvements in the range of 500tps?
|
|
|
|
- Curtis
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
|
|
|
|
On Sun, 2005-04-10 at 21:12 -0400, Bruce Momjian wrote:
|
|
> Jim C. Nasby wrote:
|
|
> > Maybe better for -hackers, but here it goes anyway...
|
|
> >
|
|
> > Has anyone looked at compressing WAL's before writing to disk? On a
|
|
> > system generating a lot of WAL it seems there might be some gains to be
|
|
> > had WAL data could be compressed before going to disk, since today's
|
|
> > machines are generally more I/O bound than CPU bound. And unlike the
|
|
> > base tables, you generally don't need to read the WAL, so you don't
|
|
> > really need to worry about not being able to quickly scan through the
|
|
> > data without decompressing it.
|
|
>
|
|
> I have never heard anyone talk about it, but it seems useful. I think
|
|
> compressing the page images written on first page modification since
|
|
> checkpoint would be a big win.
|
|
|
|
Well it was discussed 2-3 years ago as part of the PITR preamble. You
|
|
may be surprised to read that over...
|
|
|
|
A summary of thoughts to date on this are:
|
|
|
|
xlog.c XLogInsert places backup blocks into the wal buffers before
|
|
insertion, so is the right place to do this. It would be possible to do
|
|
this before any LWlocks are taken, so would not not necessarily impair
|
|
scalability.
|
|
|
|
Currently XLogInsert is a severe CPU bottleneck around the CRC
|
|
calculation, as identified recently by Tom. Digging further, the code
|
|
used seems to cause processor stalls on Intel CPUs, possibly responsible
|
|
for much of the CPU time. Discussions to move to a 32-bit CRC would also
|
|
be effected by this because of the byte-by-byte nature of the algorithm,
|
|
whatever the length of the generating polynomial. PostgreSQL's CRC
|
|
algorithm is the fastest BSD code available. Until improvement is made
|
|
there, I would not investigate compression further. Some input from
|
|
hardware tuning specialists is required...
|
|
|
|
The current LZW compression code uses a 4096 byte lookback size, so that
|
|
would need to be modified to extend across a whole block. An
|
|
alternative, suggested originally by Tom and rediscovered by me because
|
|
I just don't read everybody's fine words in history, is to simply take
|
|
out the freespace in the middle of every heap block that consists of
|
|
zeros.
|
|
|
|
Any solution in this area must take into account the variability of the
|
|
size of freespace in database blocks. Some databases have mostly full
|
|
blocks, others vary. There would also be considerable variation in
|
|
compressability of blocks, especially since some blocks (e.g. TOAST) are
|
|
likely to already be compressed. There'd need to be some testing done to
|
|
see exactly the point where the costs of compression produce realisable
|
|
benefits.
|
|
|
|
So any solution must be able to cope with both compressed blocks and
|
|
non-compressed blocks. My current thinking is that this could be
|
|
achieved by using the spare fourth bit of the BkpBlocks portion of the
|
|
XLog structure, so that either all included BkpBlocks are compressed or
|
|
none of them are, and hope that allows benefit to shine through. Not
|
|
thought about heap/index issues.
|
|
|
|
It is possible that an XLogWriter process could be used to assist in the
|
|
CRC and compression calculations also, an a similar process used to
|
|
assist decompression for recovery, in time.
|
|
|
|
I regret I do not currently have time to pursue further.
|
|
|
|
Best Regards, Simon Riggs
|
|
|
|
From pgsql-hackers-owner+M65147=pgman=candle.pha.pa.us@postgresql.org Fri Mar 11 12:35:29 2005
|
|
Return-path: <pgsql-hackers-owner+M65147=pgman=candle.pha.pa.us@postgresql.org>
|
|
Received: from svr1.postgresql.org (svr1.postgresql.org [200.46.204.71])
|
|
by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id j2BIZSk26269
|
|
for <pgman@candle.pha.pa.us>; Fri, 11 Mar 2005 13:35:29 -0500 (EST)
|
|
Received: from localhost (unknown [200.46.204.144])
|
|
by svr1.postgresql.org (Postfix) with ESMTP id 7C7DD545B4
|
|
for <pgman@candle.pha.pa.us>; Fri, 11 Mar 2005 18:35:25 +0000 (GMT)
|
|
Received: from svr1.postgresql.org ([200.46.204.71])
|
|
by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024)
|
|
with ESMTP id 31351-02 for <pgman@candle.pha.pa.us>;
|
|
Fri, 11 Mar 2005 18:35:25 +0000 (GMT)
|
|
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
|
|
by svr1.postgresql.org (Postfix) with ESMTP id D316E545A2
|
|
for <pgman@candle.pha.pa.us>; Fri, 11 Mar 2005 18:35:24 +0000 (GMT)
|
|
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
|
|
Received: from localhost (unknown [200.46.204.144])
|
|
by svr1.postgresql.org (Postfix) with ESMTP id E70B65640C
|
|
for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 11 Mar 2005 18:33:21 +0000 (GMT)
|
|
Received: from svr1.postgresql.org ([200.46.204.71])
|
|
by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024)
|
|
with ESMTP id 29957-07
|
|
for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
|
|
Fri, 11 Mar 2005 18:33:19 +0000 (GMT)
|
|
Received: from hal.kabsi.at (bottom.kabsi.at [195.202.128.74])
|
|
by svr1.postgresql.org (Postfix) with ESMTP id 9D9EE545EE
|
|
for <pgsql-hackers@postgresql.org>; Fri, 11 Mar 2005 18:33:17 +0000 (GMT)
|
|
Received: from [192.168.0.5] (h062040243020.plc.cm.kabsi.at [62.40.243.20])
|
|
by hal.kabsi.at (8.11.1/) with ESMTP id j2BIX7a0000748810;
|
|
Fri, 11 Mar 2005 19:33:07 +0100 (CET)
|
|
Message-ID: <4231E416.4030900@cybertec.at>
|
|
Date: Fri, 11 Mar 2005 19:31:50 +0100
|
|
From: =?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= <postgres@cybertec.at>
|
|
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20040922
|
|
X-Accept-Language: en-us, en
|
|
MIME-Version: 1.0
|
|
To: Simon Riggs <simon@2ndquadrant.com>
|
|
cc: Tom Lane <tgl@sss.pgh.pa.us>,
|
|
Mark Cave-Ayland <m.cave-ayland@webbased.co.uk>,
|
|
pgsql-hackers@postgresql.org
|
|
Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations
|
|
References: <9EB50F1A91413F4FA63019487FCD251D113169@WEBBASEDDC.webbasedltd.local> <23031.1110206390@sss.pgh.pa.us> <1110239639.6117.197.camel@localhost.localdomain>
|
|
In-Reply-To: <1110239639.6117.197.camel@localhost.localdomain>
|
|
Content-Type: text/plain; charset=us-ascii; format=flowed
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Virus-Scanned: by amavisd-new at hub.org
|
|
X-Mailing-List: pgsql-hackers
|
|
Precedence: bulk
|
|
Sender: pgsql-hackers-owner@postgresql.org
|
|
X-Virus-Scanned: by amavisd-new at hub.org
|
|
Status: RO
|
|
|
|
|
|
> One of the things I was thinking about was whether we could use up those
|
|
> cycles more effectively. If we were to include a compression routine
|
|
> before we calculated the CRC that would
|
|
> - reduce the size of the blocks to be written, hence reduce size of xlog
|
|
> - reduce the following CRC calculation
|
|
>
|
|
> I was thinking about using a simple run-length encoding to massively
|
|
> shrink half-empty blocks with lots of zero padding, but we've already
|
|
> got code to LZW the data down also.
|
|
>
|
|
> Best Regards, Simon Riggs
|
|
>
|
|
>
|
|
> ---------------------------(end of broadcast)---------------------------
|
|
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
|
|
|
|
|
|
Simon,
|
|
|
|
I think having a compression routine in there could make real sense.
|
|
We have done some major I/O testing involving compression for a large
|
|
customer some time ago. We have seen that compressing / decompressing on
|
|
the fly is in MOST cases much faster than uncompressed I/O (try a simple
|
|
"cat file | ..." vs." zcat file.gz | ...") - the zcat version will be
|
|
faster on all platforms we have tried (Linux, AIX, Sun on some SAN
|
|
system, etc. ...).
|
|
Also, when building up a large database within one transaction the xlog
|
|
will eat a lot of storage - this can be quite annoying when you have to
|
|
deal with a lot of data).
|
|
Are there any technical reasons which would prevent somebody from
|
|
implementing compression?
|
|
|
|
Best regards,
|
|
|
|
Hans
|
|
|
|
--
|
|
Cybertec Geschwinde u Schoenig
|
|
Schoengrabern 134, A-2020 Hollabrunn, Austria
|
|
Tel: +43/660/816 40 77
|
|
www.cybertec.at, www.postgresql.at
|
|
|
|
|
|
---------------------------(end of broadcast)---------------------------
|
|
TIP 5: Have you checked our extensive FAQ?
|
|
|
|
http://www.postgresql.org/docs/faq
|
|
|