pg_stat_replication in 9.3

Discussion:

Torsten Förtsch

2014-09-14 12:03:08 UTC

Hi,

I noticed a strange behaviour regarding pg_stat_replication in 9.3. If
called from psql using the \watch command, I see all my replicas. From
time to time one of them drops out and reconnects in a short period of
time, typically ~30 sec.

If I use the same select in plpgsql like:

FOR r IN SELECT application_name,
                client_addr,
                flush_location,
                clock_timestamp() AS lmd
           FROM pg_stat_replication
          ORDER BY application_name, client_addr
LOOP
    RAISE NOTICE 'aname=%, ca=%, lmd=%, loc=%, cur=%, lag=%',
        r.application_name, r.client_addr, r.lmd,
        r.flush_location,
        pg_current_xlog_location(),
        pg_size_pretty(
            pg_xlog_location_diff(
                pg_current_xlog_location(),
                r.flush_location
            )
        );
END LOOP;

I see one of the replicas dropping out but never coming back again while
in a parallel session using psql and \watch it indeed does come back.

Is that intended?

Torsten

--
Sent via pgsql-general mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Andy Colson

2014-09-14 14:24:19 UTC

Post by Torsten Förtsch
Hi,
I noticed a strange behaviour regarding pg_stat_replication in 9.3. If
called from psql using the \watch command, I see all my replicas. From
time to time one of them drops out and reconnects in a short period of
time, typically ~30 sec.
FOR r IN SELECT application_name,
                client_addr,
                flush_location,
                clock_timestamp() AS lmd
           FROM pg_stat_replication
          ORDER BY application_name, client_addr
LOOP
    RAISE NOTICE 'aname=%, ca=%, lmd=%, loc=%, cur=%, lag=%',
        r.application_name, r.client_addr, r.lmd,
        r.flush_location,
        pg_current_xlog_location(),
        pg_size_pretty(
            pg_xlog_location_diff(
                pg_current_xlog_location(),
                r.flush_location
            )
        );
END LOOP;
I see one of the replicas dropping out but never coming back again while
in a parallel session using psql and \watch it indeed does come back.
Is that intended?
Torsten

I wonder if it's a transaction thing? Maybe \watch is using a transaction for each query (or isn't using transactions at all), whereas the plpgsql is one long transaction?

Also, if one of your replicas is far away, it doesn't really surprise me that it might lose its connection every once in a while. On the other hand, if the box is on the same subnet, right next to the master, and it was losing its connection, that would be a bad thing.

So, how far away is the replica? And does 'ps ax|grep postgr' show 'idle' or 'idle in transaction' on the \watch and the plpgsql?

-Andy

Torsten Förtsch

2014-09-14 14:59:07 UTC

Post by Andy Colson
I wonder if it's a transaction thing? Maybe \watch is using a
transaction for each query (or isn't using transactions at all), whereas
the plpgsql is one long transaction?
Also, if one of your replicas is far away, it doesn't really surprise me
that it might lose its connection every once in a while. On the other
hand, if the box is on the same subnet, right next to the master, and it
was losing its connection, that would be a bad thing.
So, how far away is the replica? And does 'ps ax|grep postgr' show
'idle' or 'idle in transaction' on the \watch and the plpgsql?

The replicas are far away, intercontinental far. I am not complaining
that the replica loses the connection. What makes me wonder is that
within a transaction, pg_stat_replication can forget rows but cannot
acquire new ones. I'd think it should either report the state at the
beginning of the transaction, like now(), or the current state, like
clock_timestamp(). But currently it's reporting half the current state.

Torsten

Tom Lane

2014-09-14 16:55:29 UTC

Post by Torsten Förtsch
The replicas are far away, intercontinental far. I am not complaining
that the replica loses the connection. What makes me wonder is that
within a transaction, pg_stat_replication can forget rows but cannot
acquire new ones. I'd think it should either report the state at the
beginning of the transaction, like now(), or the current state, like
clock_timestamp(). But currently it's reporting half the current state.

Are you watching the state in a loop inside a single plpgsql function?
If so, I wonder whether the problem is that the plpgsql function's
snapshot isn't changing. From memory, marking the function VOLATILE
would help if that's the issue.

regards, tom lane
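
For reference, a function's volatility marking can be read from the pg_proc catalog. The query below is a minimal sketch; the function name wait_for_replicas is hypothetical and stands in for whatever the monitoring function is actually called.

```sql
-- Show the volatility class of a function.
-- 'wait_for_replicas' is a hypothetical name used only for illustration.
SELECT proname,
       CASE provolatile
           WHEN 'v' THEN 'VOLATILE'
           WHEN 's' THEN 'STABLE'
           WHEN 'i' THEN 'IMMUTABLE'
       END AS volatility
  FROM pg_proc
 WHERE proname = 'wait_for_replicas';
```

VOLATILE is the default when CREATE FUNCTION specifies no volatility class, so a function written without any marking should already report 'v' here.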

Torsten Förtsch

2014-09-14 20:47:41 UTC

Post by Tom Lane
Are you watching the state in a loop inside a single plpgsql function?
If so, I wonder whether the problem is that the plpgsql function's
snapshot isn't changing. From memory, marking the function VOLATILE
would help if that's the issue.

The function is VOLATILE. I attached 2 versions of it. fn-old.sql does
not work because once a slave has disconnected it drops out and does not
come back. fn.sql uses dblink to work around the problem. But it
consumes 2 db connections.

The intent of the function is to be called between operations that may
cause slaves to lag behind. If the lag is below a certain limit, it
simply returns. Otherwise, it waits until the lag drops below a second
limit.

If it were a VOLATILE problem, the function would not be able to see
when a slave drops out, nor changes in the data. But it does see these
changes. The only thing it does not see within the current transaction
is a slave coming back online.

Torsten
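
One possible explanation, which nothing in the messages above confirms, is that statistics data is cached per transaction: the PostgreSQL documentation says a backend keeps using its statistics snapshot until the end of the current transaction, and that pg_stat_clear_snapshot() discards it. The sketch below shows how the two-limit wait described above might look under that assumption. The function name, parameter names, byte-based limits and one-second polling interval are all hypothetical; this is not the attached fn.sql or fn-old.sql.

```sql
-- Hypothetical sketch for 9.3; limits are expressed in bytes of WAL.
CREATE OR REPLACE FUNCTION wait_for_replicas(
    high_limit numeric,  -- start waiting if any replica lags more than this
    low_limit  numeric   -- stop waiting once every replica is below this
) RETURNS void
LANGUAGE plpgsql VOLATILE AS $$
DECLARE
    max_lag numeric;
BEGIN
    -- Assumption: discarding the cached statistics snapshot makes
    -- reconnected replicas visible again within this transaction.
    PERFORM pg_stat_clear_snapshot();

    SELECT max(pg_xlog_location_diff(pg_current_xlog_location(),
                                     flush_location))
      INTO max_lag
      FROM pg_stat_replication;

    -- No replica connected, or lag within the first limit: return at once.
    IF max_lag IS NULL OR max_lag <= high_limit THEN
        RETURN;
    END IF;

    -- Otherwise poll until the worst lag drops below the second limit.
    LOOP
        PERFORM pg_sleep(1);
        PERFORM pg_stat_clear_snapshot();

        SELECT max(pg_xlog_location_diff(pg_current_xlog_location(),
                                         flush_location))
          INTO max_lag
          FROM pg_stat_replication;

        EXIT WHEN max_lag IS NULL OR max_lag < low_limit;
    END LOOP;
END;
$$;
```

It could be called between bulk operations as, for example, SELECT wait_for_replicas(64 * 1024 * 1024, 16 * 1024 * 1024). Note that this sketch treats "no replica connected" as "no lag" and returns immediately; a real function might instead keep waiting for a known replica to reappear.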