Discussion: Querying a time range across multiple partitions
Cal Heldenbrand
2014-09-05 17:31:30 UTC
Hi everyone,

I'm trying to run a select query across a span of child partitions,
separated out into daily tables, in Postgres 9.1.5. The parent looks like
this:

# \d logins
Table "public.logins"
Column | Type | Modifiers
-------------+-----------------------------+-----------
username | character varying(50) |
event | character varying(20) |
time | timestamp without time zone |
host | character varying(18) |
hash | character varying(32) |
Triggers:
logins_log_trigger BEFORE INSERT ON logins FOR EACH ROW EXECUTE
PROCEDURE logins_insert_trigger()
Number of child tables: 1581 (Use \d+ to list them.)

And an example child table:

# \d logins_20140904
Table "public.logins_20140904"
Column | Type | Modifiers
-------------+-----------------------------+-----------
username | character varying(50) |
event | character varying(20) |
time | timestamp without time zone |
host | character varying(18) |
hash | character varying(32) |
Indexes:
"logins_20140904_event" hash (event)
"logins_20140904_event_time" btree (event, "time")
"logins_20140904_username" hash (username)
"logins_20140904_username_time" btree (username, "time")
Check constraints:
"logins_20140904_time_check" CHECK ("time" >= '2014-09-04
00:00:00'::timestamp without time zone AND "time" <= '2014-09-04
23:59:59.99'::timestamp without time zone)
Inherits: logins
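
As a side note, each child follows the inheritance-partitioning pattern
from the PostgreSQL docs, roughly like this (a sketch only; it uses a
half-open range, whereas the check constraint above leaves a small gap
between 23:59:59.99 and midnight):

CREATE TABLE logins_20140905 (
    CHECK ("time" >= DATE '2014-09-05' AND "time" < DATE '2014-09-06')
) INHERITS (logins);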

I'm attempting to run a query that looks something like this:

explain analyze select time,event from logins
where username='bob' and hash='1234' and time > current_date - interval
'1 week';

Result (cost=0.00..765.11 rows=1582 width=14)
-> Append (cost=0.00..765.11 rows=1582 width=14)
-> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66)
Filter: (((username)::text = 'bob'::text) AND ((hash)::text =
'1234'::text) AND ("time" > (('now'::text)::date - '7 days'::interval)))
-> Index Scan using logins_20100501_username_time on
logins_20100501 logins (cost=0.01..0.48 rows=1 width=14)
...

This shows that it's attempting to run the query against all of my 1500
child tables.

If I change it to manually specify dates with an IN clause, it selects the
appropriate tables:

explain analyze select time,event from logins
where username='bob' and hash='1234' and time in ('2014-09-04',
'2014-09-05', '2014-09-03');

Result (cost=0.00..2.41 rows=3 width=31) (actual time=0.060..0.060 rows=0
loops=1)
-> Append (cost=0.00..2.41 rows=3 width=31) (actual time=0.060..0.060
rows=0 loops=1)
-> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66) (actual
time=0.001..0.001 rows=0 loops=1)
Filter: (((username)::text = 'bob'::text) AND ((hash)::text
= '1234'::text) AND ("time" = ANY ('{"2014-09-04 00:00:00","2014-09-05
00:00:00","2014-09-03 00:00:00"}'::timestamp without time zone[])))
-> Bitmap Heap Scan on logins_20140903 logins (cost=1.09..1.20
rows=1 width=14) (actual time=0.039..0.039 rows=0 loops=1)
...

I know I could construct a query manually in my client by walking back
through the calendar dates, but for my own curiosity I'd like to find out a
more elegant way to run this query.

So far, I've been messing around with generating a series, then collecting
it back into an array, but nothing I've tried seems to work.

A few examples:

explain analyze select time,event from logins
where username='bob' and hash='1234' and time in (
generate_series(current_date - interval '3 days', current_date,
interval '1 day')
);
ERROR: argument of IN must not return a set

explain analyze select time,event from logins
where username='bob' and hash='1234' and time in (
select array_agg(series)
from generate_series(current_date - interval '3 days', current_date,
interval '1 day')
as u(series)
);
ERROR: operator does not exist: timestamp without time zone = timestamp
without time zone[]

explain analyze select time,event from logins
where username='bob' and hash='1234' and time in (
select unnest(array_agg(date_trunc('day',series))) from
generate_series(current_date - interval '3 days', current_date,
interval '1 day') as u(series)
)
-- No errors, but still queries all child tables.

Thank you!

--Cal
Alban Hertroys
2014-09-05 21:14:31 UTC
Post by Cal Heldenbrand
explain analyze select time,event from logins
where username='bob' and hash='1234' and time > current_date - interval '1 week';
Result (cost=0.00..765.11 rows=1582 width=14)
-> Append (cost=0.00..765.11 rows=1582 width=14)
-> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66)
Filter: (((username)::text = 'bob'::text) AND ((hash)::text = '1234'::text) AND ("time" > (('now'::text)::date - '7 days'::interval)))
-> Index Scan using logins_20100501_username_time on logins_20100501 logins (cost=0.01..0.48 rows=1 width=14)
...
This shows that it's attempting to run the query against all of my 1500 child tables.
What about:
explain analyze select time,event from logins
where username='bob' and hash='1234' and time > (current_date - interval '1 week')::timestamp without time zone;

Also, you don't appear to have an index that starts with "time", so none
of the indexes will be particularly efficient at finding a specific time
range. It's quite possible that makes PG think that "time" is not a very
good candidate to filter on, simply because the optimizer doesn't look
that far.
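
For what it's worth, such an index would look something like this (a
sketch only; the index name is made up):

CREATE INDEX logins_20140904_time_username
    ON logins_20140904 ("time", username);

Whether the planner actually prefers it over the existing (username,
"time") index depends on the selectivity of the predicates.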

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
Cal Heldenbrand
2014-09-05 21:45:34 UTC
Post by Alban Hertroys
explain analyze select time,event from logins
where username='bob' and hash='1234' and time > (current_date - interval
'1 week')::timestamp without time zone;
Also, you don't appear to have an index that starts with "time", so none
of the indexes will be particularly efficient at finding a specific time
range. It's quite possible that makes PG think that "time" is not a very
good candidate to filter on, simply because the optimizer doesn't look
that far.
That didn't seem to work either. The thought did occur to me that the
query planner wasn't using my combined-column indexes. I tried adding just
a btree index on time, and it still had the same problem.

Your example query there also scans tables going all the way back to 2010.

Interestingly enough, this query actually works:

# explain analyze select time,event from logins
where username='bob' and hash='1234' and time > '2014-08-29';

Result (cost=0.00..8.21 rows=8 width=20) (actual time=0.074..0.074 rows=0
loops=1)
-> Append (cost=0.00..8.21 rows=8 width=20) (actual time=0.074..0.074
rows=0 loops=1)
-> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66) (actual
time=0.002..0.002 rows=0 loops=1)
Filter: (("time" > '2014-08-29 00:00:00'::timestamp without
time zone) AND ((username)::text = 'bob'::text) AND ((hash)::text =
'1234'::text))
-> Index Scan using logins_20140829_username on logins_20140829
logins (cost=0.00..1.21 rows=1 width=14) (actual time=0.021..0.021 rows=0
loops=1)
Index Cond: ((username)::text = 'bob'::text)

So it seems to me that the query planner isn't pre-evaluating
"current_date - interval", but a statically defined time span *does* work.

Doesn't that seem weird?
John R Pierce
2014-09-05 21:17:48 UTC
Post by Cal Heldenbrand
Number of child tables: 1581
that's an insane number of children. We try to limit it to 50 or so
child tables; for instance, 6 months of retention by week, for data with
millions of rows/day.
--
john r pierce 37N 122W
somewhere on the middle of the left coast
Cal Heldenbrand
2014-09-05 21:50:53 UTC
This particular use case is for user behavior data mining. The hardware is
beefy, and has tablespaces split out onto SSD/spindle for new & old data.
All of my queries pretty much run as a nightly cron process, and I don't
really care too much about speed. Scanning the full 4 years of data
takes about 30 seconds per query anyway... but I thought it'd be nice to
speed it up when the difference is milliseconds vs 30 seconds.
Post by John R Pierce
Post by Cal Heldenbrand
Number of child tables: 1581
that's an insane number of children. We try to limit it to 50 or so
child tables; for instance, 6 months of retention by week, for data with
millions of rows/day.
--
john r pierce 37N 122W
somewhere on the middle of the left coast
Jeff Janes
2014-09-07 19:55:24 UTC
Post by John R Pierce
Post by Cal Heldenbrand
Number of child tables: 1581
that's an insane number of children. We try to limit it to 50 or so
child tables; for instance, 6 months of retention by week, for data with
millions of rows/day.
I've used more than that many for testing purposes, and there was little
problem. The main thing is that your insert trigger (if you have one on
the master table) needs to be structured as a binary-search-like nesting
of if..elsif, not a linear-search-like structure. Unless of course almost
all inserts go into the newest partition, in which case it might make more
sense to do the linear search with that being the first test. But for
performance, it's better to just insert directly into the correct child
table.
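
For illustration, a minimal sketch of the binary-search-like shape,
showing only four daily partitions (a real trigger would be generated to
cover all of them; the table names just follow the naming in this thread):

CREATE OR REPLACE FUNCTION logins_insert_trigger() RETURNS trigger AS $$
BEGIN
    -- halve the remaining date range at each comparison instead of
    -- testing the partitions one by one from a fixed end
    IF NEW."time" >= DATE '2014-09-04' THEN
        IF NEW."time" >= DATE '2014-09-05' THEN
            INSERT INTO logins_20140905 VALUES (NEW.*);
        ELSE
            INSERT INTO logins_20140904 VALUES (NEW.*);
        END IF;
    ELSIF NEW."time" >= DATE '2014-09-03' THEN
        INSERT INTO logins_20140903 VALUES (NEW.*);
    ELSE
        INSERT INTO logins_20140902 VALUES (NEW.*);
    END IF;
    RETURN NULL;  -- the row was routed to a child; skip the parent
END;
$$ LANGUAGE plpgsql;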

I don't know if that many tables is a good idea in this case, but it is
not prima facie insane.

Cheers,

Jeff
John R Pierce
2014-09-07 20:20:36 UTC
Post by Jeff Janes
Post by John R Pierce
Post by Cal Heldenbrand
Number of child tables: 1581
that's an insane number of children. We try to limit it to 50
or so child tables; for instance, 6 months of retention by week,
for data with millions of rows/day.
I've used more than that many for testing purposes, and there was
little problem. The main thing is that your insert trigger (if you
have one on the master table) needs to be structured as a
binary-search-like nesting of if..elsif, not a linear-search-like
structure. Unless of course almost all inserts go into the newest
partition, in which case it might make more sense to do the linear
search with that being the first test. But for performance, it's
better to just insert directly into the correct child table.
any select that can't be preplanned to a specific child will need to
check all 1500 children. this is far less efficient than checking,
say, 50 and letting the b-tree index of each child reject or narrow down
to the specific row(s). The one is roughly 1500*log(N/1500) while the
other is 50*log(N/50), at least to a first-order approximation.
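
To put rough numbers on that (illustrative only; assume N = 1.5 billion
rows and base-2 logs):

  1500 * log2(1.5e9/1500) ≈ 1500 * 20 ≈ 30000 comparisons
    50 * log2(1.5e9/50)   ≈   50 * 25 ≈  1250 comparisons
         log2(1.5e9)      ≈              31 comparisons (a single table)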
--
john r pierce 37N 122W
somewhere on the middle of the left coast
Andreas Brandl
2014-09-08 20:40:21 UTC
John,
Post by John R Pierce
Post by Jeff Janes
Post by John R Pierce
Post by Cal Heldenbrand
Number of child tables: 1581
that's an insane number of children. We try to limit it to 50 or so
child tables; for instance, 6 months of retention by week, for data
with millions of rows/day.
I've used more than that many for testing purposes, and there was
little problem. The main thing is that your insert trigger (if you
have one on the master table) needs to be structured as a
binary-search-like nesting of if..elsif, not a linear-search-like
structure. Unless of course almost all inserts go into the newest
partition, in which case it might make more sense to do the linear
search with that being the first test. But for performance, it's
better to just insert directly into the correct child table.
any select that can't be preplanned to a specific child will need to
check all 1500 children. this is far less efficient than checking,
say, 50 and letting the b-tree index of each child reject or narrow
down to the specific row(s). The one is roughly 1500*log(N/1500)
while the other is 50*log(N/50), at least to a first-order
approximation.
can you explain that further? In the end, that argument sounds like it would always be more efficient to use a single table and its index instead, rather than partitioning it (log(N) < c*log(N/c) for any c > 1, if I'm not totally lost today).

So, are there any insights on how many partitions are still useful?

I have tables with roughly 1000 partitions and have not had any issues so
far. Even with INSERT rules that are ordered worst-case (from past to
current, while data is always being inserted for the current date), I
haven't seen any considerable degradation of INSERT performance.

Thanks,
Andreas
John R Pierce
2014-09-08 20:51:06 UTC
Post by Andreas Brandl
can you explain that further? In the end, that argument sounds like it would always be more efficient to use a single table and its index instead, rather than partitioning it (log(N) < c*log(N/c) for any c > 1, if I'm not totally lost today).
it indeed would.

good reasons for partitioning include...

* efficient date-based bulk deletion (we have a very large table that
has 6 months retention, so we partition by week and delete the oldest
week when a new week starts... dropping a partition is far faster than
deleting 20 million records by date; see the sketches after this list)

* needing to put data across several tablespaces - I haven't had to do
this.

* more efficient vacuuming - really really large tables, like 100 GB,
take a LONG time to vacuum. sane-sized partitions will vacuum in less
time, and since older time-based partitions aren't typically updated,
they can be frozen (again, see the sketches below).
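
For instance (sketches only; the partition names follow this thread's
layout):

-- bulk deletion: detach and drop an old partition instead of
-- DELETEing millions of rows
ALTER TABLE logins_20100501 NO INHERIT logins;
DROP TABLE logins_20100501;

-- vacuuming: an old partition that no longer receives updates can
-- be frozen once and then left alone
VACUUM (FREEZE) logins_20140101;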
--
john r pierce 37N 122W
somewhere on the middle of the left coast
Jeff Janes
2014-09-08 21:40:52 UTC
Post by Andreas Brandl
John,
Post by John R Pierce
Post by Jeff Janes
Post by John R Pierce
Post by Cal Heldenbrand
Number of child tables: 1581
that's an insane number of children. We try to limit it to 50 or so
child tables; for instance, 6 months of retention by week, for data
with millions of rows/day.
I've used more than that many for testing purposes, and there was
little problem. The main thing is that your insert trigger (if you
have one on the master table) needs to be structured as a
binary-search-like nesting of if..elsif, not a linear-search-like
structure. Unless of course almost all inserts go into the newest
partition, in which case it might make more sense to do the linear
search with that being the first test. But for performance, it's
better to just insert directly into the correct child table.
any select that can't be preplanned to a specific child will need to
check all 1500 children. this is far less efficient than checking,
say, 50 and letting the b-tree index of each child reject or narrow
down to the specific row(s). The one is roughly 1500*log(N/1500)
while the other is 50*log(N/50), at least to a first-order
approximation.
can you explain that further? In the end, that argument sounds like it
would always be more efficient to use a single table and its index instead,
rather than partitioning it (log(N) < c*log(N/c) for any c > 1, if I'm not
totally lost today).
Right. Partitioning is NOT a generic way to improve query performance.
With rare exceptions, partitioning makes query performance worse, and then
partition pruning may or may not serve to recover most of that lost
performance.

Partitioning allows you to conspire with PostgreSQL to micromanage the
layout of the data in ways that improve manageability and maintainability.
For example, you pay a little overhead each time you run a query and
(perhaps) each time you insert a row, but in exchange for that you can
"delete" a year of data with a nearly-instantaneous command.

Cheers,

Jeff

David G Johnston
2014-09-05 22:28:29 UTC
Post by Cal Heldenbrand
explain analyze select time,event from logins
where username='bob' and hash='1234' and time in (
select array_agg(series)
from generate_series(current_date - interval '3 days', current_date,
interval '1 day')
as u(series)
);
ERROR: operator does not exist: timestamp without time zone = timestamp
without time zone[]
explain analyze select time,event from logins
Probably doesn't help, but the above would have to be written:

WHERE time = ANY(
SELECT array_agg(...) FROM ...
)

I don't know whether the planner is smart enough to optimize on "= ANY()" -
whether statically generated or determined at runtime.

I am pretty sure that constraint exclusion must be done during the planning
stages and that you have to pass in a literal set of values (or an array)
that you generate in a previous query.
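
A sketch of that two-step approach (illustrative; the second statement is
what the client would send after interpolating the first query's result
as a literal):

-- step 1: compute the wanted days
SELECT array_agg((d::date)::timestamp)
FROM generate_series(current_date - interval '3 days',
                     current_date, interval '1 day') AS u(d);

-- step 2: the real query, with the array embedded as a literal
SELECT "time", event FROM logins
WHERE username = 'bob' AND hash = '1234'
  AND "time" = ANY ('{"2014-09-02 00:00:00","2014-09-03 00:00:00","2014-09-04 00:00:00","2014-09-05 00:00:00"}'::timestamp[]);

Note that, as with the IN example earlier in the thread, equality against
midnight timestamps only matches rows at exactly midnight; a literal range
("time" >= '2014-09-02' AND "time" < '2014-09-06') prunes just as well and
matches the whole days.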

David J.
Jeff Janes
2014-09-07 19:44:08 UTC
Post by Cal Heldenbrand
explain analyze select time,event from logins
where username='bob' and hash='1234' and time > current_date - interval
'1 week';
Result (cost=0.00..765.11 rows=1582 width=14)
-> Append (cost=0.00..765.11 rows=1582 width=14)
-> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66)
Filter: (((username)::text = 'bob'::text) AND ((hash)::text
= '1234'::text) AND ("time" > (('now'::text)::date - '7 days'::interval)))
-> Index Scan using logins_20100501_username_time on
logins_20100501 logins (cost=0.01..0.48 rows=1 width=14)
...
This shows that it's attempting to run the query against all of my 1500
child tables.
I believe the problem is that the planner (which does the partition
pruning) is not willing to materialize the value of current_date, so it
can't use a specific value to prune partitions. After all, the date might
change between planning and execution, if you leave the plan open for a
long time, or make the call very close to midnight.

You will probably have to select the current_date (or get it from your
system, or cron, or whoever triggers the script), and then hardcode it into
the query.
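
If you want to keep it inside the database, one workaround (a sketch; the
function name and signature are made up) is a plpgsql function that issues
the statement via EXECUTE, so it is planned on each call with the date
already folded to a literal:

CREATE OR REPLACE FUNCTION recent_logins(uname text, h text, days int)
RETURNS TABLE ("time" timestamp, event varchar) AS $$
BEGIN
    -- format()'s %L embeds the values as quoted literals, so constraint
    -- exclusion can prune partitions when EXECUTE plans the query
    RETURN QUERY EXECUTE format(
        'SELECT "time", event FROM logins
          WHERE username = %L AND hash = %L AND "time" > %L',
        uname, h, (current_date - days)::timestamp);
END;
$$ LANGUAGE plpgsql;

-- e.g.:  SELECT * FROM recent_logins('bob', '1234', 7);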

Cheers,

Jeff
Cal Heldenbrand
2014-09-08 13:16:06 UTC
Thanks Jeff! That's what I wanted to confirm, that I need to hard code /
pregenerate my dates in the query. I was mainly curious why it didn't work
with current_date, and that answers it.

And BTW, all my inserts happen on the most recent table, so my insert
trigger with a linear search is ordered date descending. It seems to work
fine since my use case is to insert data once, and never change it again.
Of course my only problem is when a select query confuses the planner, and
searches my entire set. ;-)

Thanks,

--Cal
Post by Jeff Janes
Post by Cal Heldenbrand
explain analyze select time,event from logins
where username='bob' and hash='1234' and time > current_date - interval
'1 week';
Result (cost=0.00..765.11 rows=1582 width=14)
-> Append (cost=0.00..765.11 rows=1582 width=14)
-> Seq Scan on logins (cost=0.00..0.00 rows=1 width=66)
Filter: (((username)::text = 'bob'::text) AND ((hash)::text
= '1234'::text) AND ("time" > (('now'::text)::date - '7 days'::interval)))
-> Index Scan using logins_20100501_username_time on
logins_20100501 logins (cost=0.01..0.48 rows=1 width=14)
...
This shows that it's attempting to run the query against all of my 1500
child tables.
I believe the problem is that the planner (which does the partition
pruning) is not willing to materialize the value of current_date, so it
can't use a specific value to prune partitions. After all, the date might
change between planning and execution, if you leave the plan open for a
long time, or make the call very close to midnight.
You will probably have to select the current_date (or get it from your
system, or cron, or whoever triggers the script), and then hardcode it into
the query.
Cheers,
Jeff