ERROR: missing chunk number 0 for toast value 382548694 in pg_toast

Discussion:

ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

Michael Harris

2011-02-28 00:25:02 UTC

Hi,

We have a PG 8.4 database approx 5TB in size.

We were recently testing our restore procedure against our latest dump. The dumps are taken using the Continuous Archiving method with base dumps taken using tar. Our tar script is set up to ignore missing/modified files but should stop on all other errors.

We are testing the restore on a completely separate machine of a similar spec to the server on which the dumps are made.

On our first attempt we were using PG 8.4.0. We struck problems on restore with the error 'Unexpected timeline ID 0 in log file'. After some googling we identified that this was a known bug in PG 8.4.0 so we upgraded the target machine to 8.4.7 and began the recovery again.

On the second attempt the recovery went through smoothly. However, after starting the application again we started to get the following error:

ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

The table it is trying to access is one in which our application stores various schema information that is used each time the application starts up.

As far as we know the base dump is good and the DB did not complain about any of the WAL files. We are now nervous that our backups are no good, so we want to get to the bottom of it. What can be the reason that these backups are not restoring properly?

Thanks in advance,

Regards
Mike Harris

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Vibhor Kumar

2011-02-28 04:06:25 UTC

Permalink

Post by Michael Harris
ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

This seems more like a corrupted toast table.

Did you try to reindex the pg_toast_847386?
REINDEX table pg_toast.pg_toast_847386;
VACUUM ANALYZE <tablename>;

Thanks & Regards,
Vibhor Kumar
***@enterprisedb.com
Blog:http://vibhork.blogspot.com

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Michael Harris

2011-02-28 04:16:41 UTC

Permalink

Post by Vibhor Kumar

Post by Michael Harris
ERROR: missing chunk number 0 for toast value 382548694 in
pg_toast_847386

This seems more like a corrupted toast table.
Did you try to reindex the pg_toast_847386?
REINDEX table pg_toast.pg_toast_847386;
VACUUM ANALYZE <tablename>;

Hi Vibhor,

Thanks for the suggestion.

We didn't try that yet, even though we did see others recommending this as a solution for similar corruptions in the past.

The main reason we have not gone down that path that after getting this error we do not have any confidence in the integrity of the rest of the database after performing the restore - maybe there are many more tables with corruption in them.

We are mainly concerned that our backups are valid. The fault does not seem to be present on the original database, only after restoring from the base backup + PITR.

Regards
Mike Harris

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Vibhor Kumar

2011-02-28 04:24:47 UTC

Permalink

Post by Michael Harris
The main reason we have not gone down that path that after getting this error we do not have any confidence in the integrity of the rest of the database after performing the restore - maybe there are many more tables with corruption in them.

May be or may not be.

Did you find anything suspicious in pg_log?
Please share recovery.conf information.

Post by Michael Harris
We are mainly concerned that our backups are valid. The fault does not seem to be present on the original database, only after restoring from the base backup + PITR.

Did you verify base backup (Restoring without PITR) ?

Thanks & Regards,
Vibhor Kumar
***@enterprisedb.com
Blog:http://vibhork.blogspot.com

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Michael Harris

2011-02-28 04:43:55 UTC

Permalink

Hi Vibhor,

Post by Vibhor Kumar

Post by Vibhor Kumar
Did you find anything suspicious in pg_log?

We've been through it all and did not see anything we didn't expect.

Post by Vibhor Kumar

Post by Vibhor Kumar
Please share recovery.conf information.

We did interrupt the restore a few times. The initial recovery.conf file contained only:

restore_command = 'gunzip -c /mnt/dbsbackup/pg_xlog/%f.gz > %p'

Later we decided to replace the recovery command with a wrapper script that would allow us to leave the restore going unattended over the weekend, and complete up until the latest WAL file on the original database (which is still running). We changed the recovery command to:

restore_command = '/var/lib/pgsql/data/db_restore_dm %f %p'

where the script db_restore_dm contained:

#!/usr/bin/perl

use strict;

my ($pg_f, $pg_p) = @ARGV;
exit 1 if $pg_f eq '00000001.history';

my $xlogBackupFile = "/mnt/dbsbackup/pg_xlog/$pg_f.gz";

while (! -f $xlogBackupFile and !$triggered) {
sleep 2;
}

while (1) {
system("gunzip -c $xlogBackupFile > $pg_p");
last if ($? >> 8 == 0);
sleep 2;
}

We were concerned that shutting down / starting up while recovery is ongoing might cause some problems, but the pg documentation indicates this should be OK and we saw no cause for concern in the pg logs.

Post by Vibhor Kumar
Did you verify base backup (Restoring without PITR) ?

I guess you mean did we restore it up until consistency reached? No so far we have continued restoration until we reached the last WAL file made by the original database.

Regards // Mike

-----Original Message-----
From: Vibhor Kumar [mailto:***@enterprisedb.com]
Sent: Monday, 28 February 2011 3:25 PM
To: Michael Harris
Cc: pgsql-***@postgresql.org
Subject: Re: [GENERAL] ERROR: missing chunk number 0 for toast value 382548694 in pg_toast_847386

Post by Vibhor Kumar
The main reason we have not gone down that path that after getting this error we do not have any confidence in the integrity of the rest of the database after performing the restore - maybe there are many more tables with corruption in them.

May be or may not be.

Did you find anything suspicious in pg_log?
Please share recovery.conf information.

Post by Vibhor Kumar
We are mainly concerned that our backups are valid. The fault does not seem to be present on the original database, only after restoring from the base backup + PITR.

Did you verify base backup (Restoring without PITR) ?

Thanks & Regards,
Vibhor Kumar
***@enterprisedb.com
Blog:http://vibhork.blogspot.com

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Vibhor Kumar

2011-02-28 05:57:05 UTC

Permalink

Post by Michael Harris

Post by Vibhor Kumar
Did you find anything suspicious in pg_log?

We've been through it all and did not see anything we didn't expect.

Post by Vibhor Kumar
Please share recovery.conf information.

restore_command = 'gunzip -c /mnt/dbsbackup/pg_xlog/%f.gz > %p'
restore_command = '/var/lib/pgsql/data/db_restore_dm %f %p'
#!/usr/bin/perl
use strict;
exit 1 if $pg_f eq '00000001.history';
my $xlogBackupFile = "/mnt/dbsbackup/pg_xlog/$pg_f.gz";
while (! -f $xlogBackupFile and !$triggered) {
sleep 2;
}
while (1) {
system("gunzip -c $xlogBackupFile > $pg_p");
last if ($? >> 8 == 0);
sleep 2;
}

Not sure about above wrapper function. However, if you can share some information from pg_log when you have started the restore with backup_label information.

Try following steps:
1. Untar all the gzipped WAL File in One Location
2. Use Following restore command:
cp <WAL Location>/%f %p

Post by Michael Harris
We were concerned that shutting down / starting up while recovery is ongoing might cause some problems, but the pg documentation indicates this should be OK and we saw no cause for concern in the pg logs.

What options have you used for shutting down?

Thanks & Regards,
Vibhor Kumar
***@enterprisedb.com
Blog:http://vibhork.blogspot.com

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Michael Harris

2011-02-28 06:27:08 UTC

Permalink

Hi Vibhor,

Post by Vibhor Kumar

Post by Vibhor Kumar
Not sure about above wrapper function. However, if you can share some
information from pg_log when you have started the restore with
backup_label information.

Here it is at the beginning:
[2011-02-25 09:40:11 EST] LOG: database system was interrupted; last known up at 2011-02-01 01:04:12 EST
[2011-02-25 09:40:11 EST] LOG: starting archive recovery
[2011-02-25 09:40:11 EST] LOG: restore_command = 'gunzip -c /mnt/dbsbackup/pg_xlog/%f.gz > %p'
gunzip: /mnt/dbsbackup/pg_xlog/00000001.history.gz: No such file or directory
[2011-02-25 09:40:12 EST] LOG: restored log file "0000000100006F720000006F.00024AE0.backup" from archive
[2011-02-25 09:40:12 EST] LOG: restored log file "0000000100006F7200000094" from archive
[2011-02-25 09:40:12 EST] LOG: restored log file "0000000100006F720000006F" from archive
[2011-02-25 09:40:12 EST] LOG: automatic recovery in progress
[2011-02-25 09:40:12 EST] LOG: redo starts at 6F72/6F024AE0, consistency will be reached at 70B1/B75C9AF0
[2011-02-25 09:40:14 EST] LOG: restored log file "0000000100006F7200000070" from archive
[2011-02-25 09:40:14 EST] LOG: restored log file "0000000100006F7200000071" from archive
[2011-02-25 09:40:15 EST] LOG: restored log file "0000000100006F7200000072" from archive

...etc...

We did reach consistency eventually:

[2011-02-25 21:29:28 EST] LOG: restored log file "00000001000070B1000000B5" from archive
[2011-02-25 21:29:29 EST] LOG: restored log file "00000001000070B1000000B6" from archive
[2011-02-25 21:29:29 EST] LOG: restored log file "00000001000070B1000000B7" from archive
[2011-02-25 21:29:29 EST] LOG: consistent recovery state reached
[2011-02-25 21:29:29 EST] LOG: restored log file "00000001000070B1000000B8" from archive
[2011-02-25 21:29:30 EST] LOG: restored log file "00000001000070B1000000B9" from archive
...

Post by Vibhor Kumar

Post by Vibhor Kumar
1. Untar all the gzipped WAL File in One Location
cp <WAL Location>/%f %p

Unfortunately it is not practical to do this, since there are many terabytes of WAL files. They are available on an NFS mounted volume which is accessible from the target machine.

What we can do (and will do if no better suggestions come forward) is to restore again using only the PITR files needed to achieve consistency, ie. up to 70B1/B75C9AF0, a much smaller quantity of WAL files.

Post by Vibhor Kumar

Post by Vibhor Kumar
We were concerned that shutting down / starting up while recovery is
ongoing might cause some problems, but the pg documentation indicates
this should be OK and we saw no cause for concern in the pg logs.

What options have you used for shutting down?

We are using Centos and used the supplied init scripts - the actual command is: service postgresql stop, which translates to:

pg_ctl stop -D '$PGDATA' -s -m fast

Regards
Mike Harris

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general