Will pg_repack improve this query performance?
Abelard Hoffman
2014-10-15 02:33:46 UTC
I believe this query is well optimized, but it's slow if all the blocks
aren't already in memory.

Here's example explain output. You can see it takes over 7 seconds to run
when it needs to hit the disk, and almost all of it is related to checking
if the user has "messages."

http://explain.depesz.com/s/BLT

On a second run, it's extremely fast (< 50ms). So I'm thinking it's a lack
of clustering on the "Index Cond: (to_id = users.user_id)" that's the
culprit.

I'm afraid of using CLUSTER due to the exclusive lock, but I found
pg_repack while researching:
http://reorg.github.io/pg_repack/

Does it seem likely that doing an --order-by on the to_id column would have
a significant impact in this case? Does pg_repack seem pretty stable and safe
at this point?
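
For concreteness, the invocation I have in mind is roughly this (the table
and database names are placeholders for my real ones):

    pg_repack --table=messages --order-by=to_id mydb

As I understand it, pg_repack rewrites the table in to_id order and then
swaps it in, holding the exclusive lock only briefly at the start and end.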

I am going to try and test this in a dev environment first, but wanted
feedback on whether this seems like a good direction.

Thanks.
--
Best,
AH
Alban Hertroys
2014-10-15 09:03:38 UTC
Post by Abelard Hoffman
I believe this query is well optimized, but it's slow if all the blocks aren't already in memory.
Here's example explain output. You can see it takes over 7 seconds to run when it needs to hit the disk, and almost all of it is related to checking if the user has "messages."
http://explain.depesz.com/s/BLT
From that plan it is obvious that the index scan takes the most time. It looks like you have 3315 rows matching to_id = users.user_id, of which only 10 match your query conditions after applying the filter.
With your current setup, the database first needs to find candidate rows in the index and then has to check the other conditions against the table, which is likely to involve some disk access.
Post by Abelard Hoffman
On a second run, it's extremely fast (< 50ms). So I'm thinking it's a lack of clustering on the "Index Cond: (to_id = users.user_id)" that's the culprit.
That probably means that the relevant parts of the table were still in memory, so the scan did not need to visit the disk to load the matched rows and filter the NULL conditions in your query.
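
One way to confirm that is to run the query with buffer accounting (a
sketch; substitute your actual query for the placeholder):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT ... your query ... ;

The "Buffers: shared hit=... read=..." lines per node show how many blocks
were found in shared memory versus fetched from outside it (OS cache or
disk).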
Post by Abelard Hoffman
I'm afraid of using CLUSTER due to the exclusive lock, but I found pg_repack while researching:
http://reorg.github.io/pg_repack/
A CLUSTER would help by putting rows with the same to_id together. Disk access would be less random that way, so it would help some.
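
For reference, that would look something like this (the table and index
names are placeholders for yours):

    CLUSTER messages USING messages_to_id_idx;

Keep in mind that CLUSTER takes an exclusive lock for the duration of the
rewrite - the concern you mentioned - and the ordering degrades again as
new rows come in, so it has to be repeated periodically.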

According to your query plan, accessing disks (assuming that’s what made the difference) was 154 (7700 ms / 50 ms) times slower than accessing memory. I don’t have the numbers for your disks or memory, but that doesn’t look like an incredibly unrealistic difference. That raises the question: how random was that disk access, and how much can be gained from clustering that data?

Did you try a partial index on to_id with those NULL conditions? That should result in a much smaller index size, which in turn makes it faster to scan - all the more so if the index is difficult to keep in memory because of its size. More importantly though, the scan wouldn’t need to visit the table to verify those NULL fields.
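
A sketch of what I mean - the WHERE clause columns here are made up, since
your plan link doesn't show the actual column names inline:

    CREATE INDEX CONCURRENTLY messages_to_id_unread_idx
        ON messages (to_id)
        WHERE read_at IS NULL AND deleted_at IS NULL;

Building it with CONCURRENTLY avoids blocking writes to the table while the
index is created.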
Post by Abelard Hoffman
Does it seem likely that doing an --order-by on the to_id column would have a significant impact in this case? Does pg_repack seem pretty stable and safe at this point?
Not being familiar with pg_repack I can’t advise on that.
Post by Abelard Hoffman
I am going to try and test this in a dev environment first, but wanted feedback on whether this seems like a good direction.
You can try that CLUSTER or the approach with pg_repack regardless of my suggestion for the partial index. It should speed up disk access to those records regardless of how they are indexed.

Good luck!

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
--
Josh Kupershmidt
2014-10-16 20:32:44 UTC
Post by Alban Hertroys
A CLUSTER would help putting rows with the same to_id together. Disk access would be less random that way, so it would help some.
According to your query plan, accessing disks (assuming that’s what made the difference) was 154 (7700 ms / 50 ms) times slower than accessing memory. I don’t have the numbers for your disks or memory, but that doesn’t look like an incredibly unrealistic difference. That begs the question, how random was that disk access and how much can be gained from clustering that data?
Other than grouping tuples in a more favorable order to minimize I/O,
the big benefit of running a CLUSTER or pg_repack is that you
eliminate any accumulated bloat. (And if bloat is your real problem,
ideally you can adjust your autovacuum settings to avoid the problem
in the future.) You may want to check on the bloat of that table and
its indexes with something like this:

https://wiki.postgresql.org/wiki/Show_database_bloat
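
For a quick first look before running that full query, the dead tuple
counters give a rough indication (the table name is a placeholder):

    SELECT relname, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'messages';

And if bloat does turn out to be the culprit, autovacuum can be made more
aggressive for just that table, for example:

    ALTER TABLE messages SET (autovacuum_vacuum_scale_factor = 0.05);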
--
Abelard Hoffman
2014-10-16 21:41:11 UTC
Post by Abelard Hoffman
I believe this query is well optimized, but it's slow if all the
blocks aren't already in memory.
Post by Abelard Hoffman
Here's example explain output. You can see it takes over 7 seconds to
run when it needs to hit the disk, and almost all of it is related to
checking if the user has "messages."
Post by Abelard Hoffman
http://explain.depesz.com/s/BLT
Post by Alban Hertroys
From that plan it is obvious that the index scan takes the most time. It
looks like you have 3315 rows matching to_id = users.user_id, of which only
10 match your query conditions after applying the filter.
With your current setup, the database first needs to find candidate rows
in the index and then has to check the other conditions against the table,
which is likely to involve some disk access.
Post by Abelard Hoffman
On a second run, it's extremely fast (< 50ms). So I'm thinking it's a
lack of clustering on the "Index Cond: (to_id = users.user_id)" that's the
culprit.
Post by Alban Hertroys
That probably means that the relevant parts of the table were still in
memory, so the scan did not need to visit the disk to load the matched rows
and filter the NULL conditions in your query.
Post by Abelard Hoffman
I'm afraid of using CLUSTER due to the exclusive lock, but I found
http://reorg.github.io/pg_repack/
Post by Alban Hertroys
A CLUSTER would help by putting rows with the same to_id together. Disk
access would be less random that way, so it would help some.
According to your query plan, accessing disks (assuming that’s what made
the difference) was 154 (7700 ms / 50 ms) times slower than accessing
memory. I don’t have the numbers for your disks or memory, but that doesn’t
look like an incredibly unrealistic difference. That raises the question:
how random was that disk access, and how much can be gained from clustering
that data?
Did you try a partial index on to_id with those NULL conditions? That
should result in a much smaller index size, which in turn makes it faster
to scan - all the more so if the index is difficult to keep in memory
because of its size. More importantly though, the scan wouldn’t need to
visit the table to verify those NULL fields.
No, I haven't tried a more constrained index. Good point, makes sense.
Post by Abelard Hoffman
Does it seem likely that doing an --order-by on the to_id column would
have a significant impact in this case? Does pg_repack seem pretty stable
and safe at this point?
Post by Alban Hertroys
Not being familiar with pg_repack I can’t advise on that.
Post by Abelard Hoffman
I am going to try and test this in a dev environment first, but wanted
feedback on whether this seems like a good direction.
Post by Alban Hertroys
You can try that CLUSTER or the approach with pg_repack regardless of my
suggestion for the partial index. It should speed up disk access to those
records regardless of how they are indexed.
I tried pg_repack in dev and it did make a dramatic improvement (pg_repack
took ~ 65 minutes to run). After the repack, I couldn't get the query to
take longer than 750ms. Should be much, much faster in production too.

It seems like maybe the partial index is a better long-term fix though.

And thank you, Josh, about the tip on table bloat. I'll take a look at that
too.
--
Best,
AH
aron123
2014-10-17 10:06:20 UTC
Thanks for sharing…




-----
harry