Discussion:
How to find greatest record before known values fast
Andrus
2014-10-02 20:49:49 UTC
I’m looking for a way to increase select statement speed in Postgres 9.0.

The table has the required index. The desired result could be obtained from the (kuupaev, kellaaeg) index immediately.
However, Postgres scans all rows:

explain analyze SELECT
max( kuupaev||kellaaeg ) as res
from firma2.ALGSA
where laonr=1 and kuupaev <=current_date and
(kuupaev,kellaaeg) <= ( current_date, '23 59' )

"Aggregate (cost=6932.65..6932.67 rows=1 width=10) (actual time=1608.590..1608.592 rows=1 loops=1)"
" -> Seq Scan on algsa (cost=0.00..6571.49 rows=144464 width=10) (actual time=0.032..922.431 rows=144458 loops=1)"
" Filter: ((laonr = 1::numeric) AND (kuupaev <= ('now'::text)::date) AND (ROW(kuupaev, kellaaeg) <= ROW(('now'::text)::date, '23 59'::bpchar)))"
"Total runtime: 1608.846 ms"

In the real query, the literals 1, current_date and '23 59' are variable parameters.

The table has both indexes, but Postgres will not use them.
The indexes can be changed and the query can be rewritten if that helps.
The table structure cannot be changed. The char columns cannot be replaced with varchar columns: kuupaev must stay date and kellaaeg must stay char(5).

The query contains the redundant condition `kuupaev <= current_date`, but the index is still not used.

I also tried `SELECT max( (kuupaev, kellaaeg) )`, but got an error that the max() function does not exist for record types.

How can I speed up this query?


The table structure is:

CREATE TABLE firma2.algsa
(
id serial NOT NULL,
laonr numeric(2,0),
kuupaev date NOT NULL,
kellaaeg character(5) NOT NULL DEFAULT ''::bpchar,
osak character(10) NOT NULL,
toode character(20) NOT NULL,
partii character(15),
kogus numeric(12,4) NOT NULL DEFAULT 0,
hind numeric(15,5) NOT NULL DEFAULT 0,
kulum numeric(15,5) NOT NULL DEFAULT 0,
tegkogus numeric(12,4),
stkuupaev date,
klient character(12),
masin character(5),
CONSTRAINT algsa_pkey PRIMARY KEY (id)
);


CREATE INDEX algsa_kuupaev_idx
ON firma2.algsa
USING btree
(kuupaev);

CREATE INDEX algsa_kuupaev_kellaaeg_idx
ON firma2.algsa
USING btree
(kuupaev, kellaaeg);

using

"PostgreSQL 9.0.3, compiled by Visual C++ build 1500, 32-bit"

Posted also in

http://stackoverflow.com/questions/26165745/how-find-greatest-tuple-before-given-2-column-tuple-in-postgres-fast

Andrus.
Jim Nasby
2014-10-03 02:44:38 UTC
So kellaaeg is a time? Your best bet here would be to create an index that is an actual timestamp comprised of both kuupaev and kellaaeg. You could do this with to_timestamp by concatenating both fields together, or it may be easier to replace the space in kellaaeg with a colon and cast it to time, then add the two:

kuupaev + replace( kellaaeg, ' ', ':' )::time

I know you can't alter the table, but can you create a view on top of the table? If you did that, you could have a real timestamp field in the view that is calculated from kuupaev and kellaaeg and you can create a functional index that uses the same calculation. That would be the easiest way to use this.
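
A minimal sketch of that approach (all new names here are invented for illustration; wrapping the expression in a small helper keeps the view and the index expression identical, and the function is declared IMMUTABLE on the assumption that kellaaeg only ever holds plain 'HH MM' text; nullif() guards the empty-string default):

CREATE FUNCTION firma2.algsa_ts(date, character) RETURNS timestamp
LANGUAGE sql IMMUTABLE AS
$$ SELECT $1 + replace(nullif($2, ''), ' ', ':')::time $$;

CREATE INDEX algsa_laonr_ts_idx
ON firma2.algsa (laonr, firma2.algsa_ts(kuupaev, kellaaeg));

CREATE VIEW firma2.algsa_v AS
SELECT a.*, firma2.algsa_ts(a.kuupaev, a.kellaaeg) AS toimumisaeg
FROM firma2.algsa a;

-- the query can then filter and sort on a real timestamp, e.g.
SELECT toimumisaeg
FROM firma2.algsa_v
WHERE laonr = 1
AND toimumisaeg <= current_date + time '23:59'
ORDER BY toimumisaeg DESC
LIMIT 1;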
Tom Lane
2014-10-03 03:54:30 UTC
Post by Jim Nasby
So kellaaeg is a time? Your best bet here would be to create an index
that is an actual timestamp comprised of both kuupaev and kellaaeg.
The real problem with this query, or at least with the index design,
is that the index design isn't accounting for the need to constrain
"laonr". The best way to create the index is with laonr first, and
instead of the max() write something like

regression=# create table foo (f1 int, f2 date, f3 bpchar(5));
CREATE TABLE
regression=# create index on foo (f1, f2, f3);
CREATE INDEX
regression=# explain select * from foo where f1 = 1 and (f2,f3) <= (current_date, '23 59') order by f2 desc, f3 desc limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Limit (cost=0.01..5.44 rows=1 width=17)
-> Index Scan Backward using foo_f1_f2_f3_idx on foo (cost=0.01..16.32 rows=3 width=17)
Index Cond: ((f1 = 1) AND (ROW(f2, f3) <= ROW(('now'::text)::date, '23 59'::bpchar)))
(3 rows)

(tested on 9.0.18, should work on newer versions too)

BTW, the fact that newer versions are refusing to create an index on
"kuupaev||kellaaeg" should not be dismissed as mere pedantry.
The reason for that is that this expression involves a cast from
date to text, and the format of the text you get depends on DateStyle.
And that should call your attention to the fact that *the original query
gives the wrong answer*, or at least an answer that I bet is not the one
you want, unless DateStyle chances to be ISO.
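
For instance, the hazard is easy to demonstrate with a couple of made-up dates:

SET datestyle = 'SQL, DMY';
SELECT (date '2014-01-02' || ' 10 00') < (date '2014-02-01' || ' 09 00') AS text_lt;
-- the dates render as '02/01/2014' and '01/02/2014', so the text comparison
-- returns false even though the first moment is chronologically earlier;
-- with DateStyle = ISO the same comparison happens to give the right answer
RESET datestyle;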

IMO, one of the ten deadly sins of database design is using text-string
mashing to accomplish operations that are not naturally textual. This
query and the table design itself are in dire need of sackcloth and ashes.
Had the table designer had the wit to use a timestamp field rather than
this unholy mashup, we'd not be having this conversation.

regards, tom lane
Andrus
2014-10-03 06:28:57 UTC
Hi!
Post by Jim Nasby
So kellaaeg is a time? Your best bet here would be to create an index that
is an actual timestamp comprised of both kuupaev and kellaaeg. You could
do this with to_timestamp by concatenating both fields together, or it may
be easier to replace the space in kellaaeg with a colon and cast it to
time, then add the two:
kuupaev + replace( kellaaeg, ' ', ':' )::time
I know you can't alter the table, but can you create a view on top of the
table? If you did that, you could have a real timestamp field in the view
that is calculated from kuupaev and kellaaeg and you can create a
functional index that uses the same calculation. That would be the easiest
way to use this.
Thank you.
I solved this by creating a composite index on 3 columns and rewriting the query
as Tom recommended.
It looks like Tom's recommendation is simpler for me.
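
In other words, something along these lines (reconstructed from Tom's example; not necessarily the exact statements that were run):

CREATE INDEX algsa_laonr_kuupaev_kellaaeg_idx
ON firma2.algsa (laonr, kuupaev, kellaaeg);

SELECT kuupaev, kellaaeg
FROM firma2.algsa
WHERE laonr = 1
AND (kuupaev, kellaaeg) <= (current_date, '23 59')
ORDER BY kuupaev DESC, kellaaeg DESC
LIMIT 1;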

Andrus.
Merlin Moncure
2014-10-03 14:44:26 UTC
Post by Andrus
Hi!
Post by Jim Nasby
So kellaaeg is a time? Your best bet here would be to create an index that
is an actual timestamp comprised of both kuupaev and kellaaeg. You could do
this with to_timestamp by concatenating both fields together, or it may be
easier to replace the space in kellaaeg with a colon and cast it to time,
then add the two:
kuupaev + replace( kellaaeg, ' ', ':' )::time
I know you can't alter the table, but can you create a view on top of the
table? If you did that, you could have a real timestamp field in the view
that is calculated from kuupaev and kellaaeg and you can create a functional
index that uses the same calculation. That would be the easiest way to use
this.
Thank you.
I solved this by creating composite index on 3 columns and re-writing query
as Tom recommended.
It looks like Tom's recommendation is simpler for me.
Also,

*) quit using char() type. use varchar() -- at least in postgres,
it's better in every respect. the char type pads out the fields on
disk. (this is a common newbie error in postgres since that may not
necessarily be true in other databases)

*) numeric type gives fixed point operations and clean comparisons and
so is generally necessary, but in some cases the floating point types
(float4/float8) are more compact and give better performance without
much downside.

*) 9.0 is up to 9.0.18. Time to upgrade. (it's a binary only replacement).

*) indexes like this: (kuupaev), (kuupaev, kellaaeg) are mostly
redundant because the composite index can service queries on kuupaev
nearly as well as the single field index. Either kill the single
field index to get better memory utilization or reverse the fields in
the composite index to (kellaaeg, kuupaev) if you make frequent
searches on 'kellaaeg'.

Indexes match quals in left to right order to give the best
performance. So, an index on a,b,c gives good performance for
searches on (a), (a,b), and (a,b,c). There are certain limited
exceptions to this rule but it's a good design principle to know.
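
For example (throwaway names, purely to illustrate the rule):

create table t (a int, b int, c int, payload text);
create index t_a_b_c_idx on t (a, b, c);
-- the index can serve these predicates through its leading columns:
--   where a = 1
--   where a = 1 and b = 2
--   where a = 1 and b = 2 and c = 3
-- but generally not this one, because the leading column a is unconstrained:
--   where b = 2 and c = 3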

merlin
Jonathan Vanasco
2014-10-03 16:21:46 UTC
I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one. I'm hoping someone has another strategy.

I have 2 tables:
resource
resource_2_tag

I want to calculate the top 25 "tag_ids" in "resource_2_tag" for resources that match a given attribute on the "resource" table.

both tables have around 1.6million records.

If the database needs to warm up and read into cache, this can take 60 seconds to read the data off disk.
If the database doesn't need to warm up, it averages 1.76 seconds.

The 1.76s time is troubling me.
Searching for the discrete elements of this is pretty lightweight.

here's an explain -- http://explain.depesz.com/s/PndC

I tried a subquery instead of a join, and the planner optimized it to the same plan.

I'm hoping someone will see something that I just don't see.



Table "public.resource_2_tag"
Column | Type | Modifiers
-----------------------+---------+-----------
resource_id | integer |
tag_id | integer |
Indexes:
"_idx_speed_resource_2_tag__resource_id" btree (resource_id)
"_idx_speed_resource_2_tag__tag_id" btree (tag_id)

Table "public.resource"
Column | Type | Modifiers
-------------------------------------+-----------------------------+----------------------------------------------------------
id | integer | not null default nextval('resource_id_seq'::regclass)
resource_attribute1_id | integer |
lots of other columns | |
Indexes:
"resource_attribute1_idx" btree (resource_attribute1_id)

--------------------------------------------------------------------------------

select count(*) from resource;
-- 1669729

select count(*) from resource_2_tag;
-- 1676594

select count(*) from resource where resource_attribute1_id = 614;
-- 5184
-- 4.386ms

select id from resource where resource_attribute1_id = 614;
-- 5184
-- 87.303ms

Popping the 5k elements into an "in" clause will run the query in around 100ms.
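
(That is, roughly the following, with the 5184 ids produced by a separate query; the literal list is elided here:)

SELECT tag_id, count(*) AS counted
FROM resource_2_tag
WHERE resource_id IN ( ... )
GROUP BY tag_id
ORDER BY counted DESC
LIMIT 25;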


EXPLAIN ANALYZE
SELECT
resource_2_tag.tag_id AS resource_2_tag_tag_id,
count(resource_2_tag.tag_id) AS counted
FROM
resource_2_tag
JOIN resource ON resource.id = resource_2_tag.resource_id
WHERE
resource.resource_attribute1_id = 614
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25 OFFSET 0;

--------------------------------------------------------------------------------

Limit (cost=76659.61..76659.68 rows=25 width=4) (actual time=1502.902..1502.913 rows=25 loops=1)
-> Sort (cost=76659.61..76672.47 rows=5141 width=4) (actual time=1502.900..1502.906 rows=25 loops=1)
Sort Key: (count(resource_2_tag.tag_id))
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=76463.13..76514.54 rows=5141 width=4) (actual time=1487.016..1495.206 rows=13887 loops=1)
-> Hash Join (cost=35867.88..76437.42 rows=5141 width=4) (actual time=97.654..1453.337 rows=27068 loops=1)
Hash Cond: (resource_2_tag.resource_id = resource.id)
-> Seq Scan on resource_2_tag (cost=0.00..25847.94 rows=1676594 width=8) (actual time=0.032..513.046 rows=1676594 loops=1)
-> Hash (cost=35803.88..35803.88 rows=5120 width=4) (actual time=97.576..97.576 rows=5184 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 183kB
-> Bitmap Heap Scan on resource (cost=272.68..35803.88 rows=5120 width=4) (actual time=5.911..90.264 rows=5184 loops=1)
Recheck Cond: (resource_attribute1_id = 614)
-> Bitmap Index Scan on resource_attribute1_idx (cost=0.00..271.40 rows=5120 width=0) (actual time=3.575..3.575 rows=5184 loops=1)
Index Cond: (resource_attribute1_id = 614)
Total runtime: 1503.146 ms
Jim Nasby
2014-10-06 21:56:19 UTC
Post by Jonathan Vanasco
I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one. I'm hoping someone has another strategy.
resource
resource_2_tag
I want to calculate the top 25 "tag_ids" in "resource_2_tag " for resources that match a given attribute on the "resource" table.
both tables have around 1.6million records.
If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk.
If the database doesn't need to warm up, it averages 1.76seconds.
The 1.76s time is troubling me.
Searching for the discrete elements of this is pretty lightweight.
here's an explain -- http://explain.depesz.com/s/PndC
I tried a subquery instead of a join, and the query optimized the plan to the same.
i'm hoping someone will see something that I just don't see.
Table "public.resource_2_tag"
Column | Type | Modifiers
-----------------------+---------+-----------
resource_id | integer |
tag_id | integer |
"_idx_speed_resource_2_tag__resource_id" btree (resource_id)
"_idx_speed_resource_2_tag__tag_id" btree (tag_id)
Table "public.resource"
Column | Type | Modifiers
-------------------------------------+-----------------------------+----------------------------------------------------------
id | integer | not null default nextval('resource_id_seq'::regclass)
resource_attribute1_id | integer |
lots of other columns | |
"resource_attribute1_idx" btree (resource_attribute1_id)
--------------------------------------------------------------------------------
select count(*) from resource;
-- 1669729
select count(*) from resource_2_tag;
-- 1676594
select count(*) from resource where resource_attribute1_id = 614;
-- 5184
-- 4.386ms
select id from resource where resource_attribute1_id = 614;
-- 5184
-- 87.303ms
popping the 5k elements into an "in" clause, will run the query in around 100ms.
EXPLAIN ANALYZE
SELECT
resource_2_tag.tag_id AS resource_2_tag_tag_id,
count(resource_2_tag.tag_id) AS counted
FROM
resource_2_tag
JOIN resource ON resource.id = resource_2_tag.resource_id
WHERE
resource.resource_attribute1_id = 614
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25 OFFSET 0;
Don't join to the resource table; there's no reason to because you're not pulling anything from it.

If for some reason you do need data out of the resource table, do the LIMIT 25 first, in a sub-select:

SELECT r.*, counted
FROM resource r
JOIN (
SELECT tag_id, count(*) AS counted
FROM resource_2_tag
GROUP BY tag_id
ORDER BY counted DESC
LIMIT 25
) t ON ...
;
--
Jim Nasby, Data Architect, Blue Treble
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jonathan Vanasco
2014-10-06 23:07:35 UTC
Post by Jim Nasby
Don't join to the resource table; there's no reason to because you're not pulling anything from it.
Thanks the reply!

I'm not pulling anything from the resource table, but the join is necessary because I'm filtering based on it (see the WHERE clause).

I'm not trying to find the most-used tags overall, but the ones used by resources with a (variable) id in a column on the resource table.
Marc Mamin
2014-10-07 14:02:05 UTC
Post by Jonathan Vanasco
I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one. I'm hoping someone has another strategy.
resource
resource_2_tag
I want to calculate the top 25 "tag_ids" in "resource_2_tag " for resources that match a given attribute on the "resource" table.
both tables have around 1.6million records.
If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk.
If the database doesn't need to warm up, it averages 1.76seconds.
The 1.76s time is troubling me.
Searching for the discrete elements of this is pretty lightweight.
here's an explain -- http://explain.depesz.com/s/PndC
I tried a subquery instead of a join, and the query optimized the plan to the same.
i'm hoping someone will see something that I just don't see.
Table "public.resource_2_tag"
Column | Type | Modifiers
-----------------------+---------+-----------
resource_id | integer |
tag_id | integer |
"_idx_speed_resource_2_tag__resource_id" btree (resource_id)
"_idx_speed_resource_2_tag__tag_id" btree (tag_id)
Table "public.resource"
Column | Type | Modifiers
-------------------------------------+-----------------------------+----------------------------------------------------------
id | integer | not null default nextval('resource_id_seq'::regclass)
resource_attribute1_id | integer |
lots of other columns | |
"resource_attribute1_idx" btree (resource_attribute1_id)
--------------------------------------------------------------------------------
select count(*) from resource;
-- 1669729
select count(*) from resource_2_tag;
-- 1676594
select count(*) from resource where resource_attribute1_id = 614;
-- 5184
-- 4.386ms
select id from resource where resource_attribute1_id = 614;
-- 5184
-- 87.303ms
popping the 5k elements into an "in" clause, will run the query in around 100ms.
EXPLAIN ANALYZE
SELECT
resource_2_tag.tag_id AS resource_2_tag_tag_id,
count(resource_2_tag.tag_id) AS counted
FROM
resource_2_tag
JOIN resource ON resource.id = resource_2_tag.resource_id
WHERE
resource.resource_attribute1_id = 614
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25 OFFSET 0;
Hi,
It seems to me that your subquery may deliver duplicate ids.
And with the selectivity of your example, I would expect index usage
instead of a table scan. You may check how up to date your statistics are
and try raising the statistics target on the column resource_2_tag.tag_id.
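
(For reference, raising the statistics target would look something like this, followed by a fresh ANALYZE; the target value 1000 is only an example:)

ALTER TABLE resource_2_tag ALTER COLUMN tag_id SET STATISTICS 1000;
ANALYZE resource_2_tag;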
Also try a CTE form for your query:


WITH A as (SELECT DISTINCT id FROM resource
WHERE resource_attribute1_id = 614
)
SELECT
resource_2_tag.tag_id AS resource_2_tag_tag_id,
count(resource_2_tag.tag_id) AS counted
FROM
resource_2_tag
JOIN A ON A.id = resource_2_tag.resource_id
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25;

regards,
Marc Mamin
Jonathan Vanasco
2014-10-07 15:42:42 UTC
Post by Marc Mamin
Hi,
it seems to me that your subquery may deliver duplicate ids.
And with the selectivity of your example, I would expect an index usage
instead of a table scan. You may check how up to date your statistics are
and try to raise the statistic target on the column resource_2_tag.tag_id.
It shouldn't be able to deliver duplicate ids.

=> SELECT COUNT(*) FROM (SELECT DISTINCT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

=> SELECT COUNT(*) FROM (SELECT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

However, adding in the DISTINCT drastically changed the query plan, and did give a speedup.

Your comment made me focus on the notion of a Table Scan. I assumed it did the seq scan - and there would not be much savings otherwise - because the table is just 2 ids.

I was wrong.

I noticed that I never put a PRIMARY KEY constraint on that table.

So I tried adding a PRIMARY KEY constraint, then running vacuum analyze...
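
Roughly like this (the column list is an assumption; the table holds only the two id columns):

ALTER TABLE resource_2_tag ADD PRIMARY KEY (resource_id, tag_id);
VACUUM ANALYZE resource_2_tag;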

And that solved all my problems.

The original query ended up being the fastest at 260ms (down from 1760ms).

Join - 260ms
Subquery w/DISTINCT - 300ms
CTE - 330ms
CTE w/DISTINCT - 345ms
Subquery (no DISTINCT) - 1500ms
Andrus
2014-10-03 20:28:27 UTC
Hi!

Thank you for explanations.
the char type pads out the fields on disk.
It looks like you wrote that char takes more disk space.

from

http://www.pgcon.org/2013/schedule/attachments/269_tour-of-postgresql-data-types.pdf

page 28:

Unlike many
databases, char(n) is NOT stored as a fixed-sized field
in Postgres. It is treated exactly the same as
varchar(n) except for being padded

So the char type does not take more space than varchar.

Andrus.
Merlin Moncure
2014-10-03 20:45:23 UTC
Post by Andrus
Hi!
Thank you for explanations.
the char type pads out the fields on disk.
It looks like you wrote that char takes more disk space.
from
http://www.pgcon.org/2013/schedule/attachments/269_tour-of-postgresql-data-types.pdf
Unlike many
databases, char(n) is NOT stored as a fixed-sized field in
Postgres. It is treated exactly the same as
varchar(n) except for being padded
So char type does not take more space than varchar.
I beg to differ:

postgres=# create table t1(v char(100));
CREATE TABLE
postgres=# create table t2(v varchar(100));
CREATE TABLE
postgres=# insert into t1 select '' from generate_series(1,1000000);
INSERT 0 1000000
Time: 5951.023 ms
postgres=# insert into t2 select '' from generate_series(1,1000000);
INSERT 0 1000000
Time: 2083.323 ms
postgres=# select pg_size_pretty(pg_relation_size(oid)) from pg_class
where relname = 't1';
pg_size_pretty
────────────────
128 MB
(1 row)

postgres=# select pg_size_pretty(pg_relation_size(oid)) from pg_class
where relname = 't2';
pg_size_pretty
────────────────
35 MB
(1 row)

merlin
Andrus
2014-10-04 20:34:40 UTC
Hi!

Thank you.

People often look for sales from different periods using different filters.
There are a lot of sales, and every sale is an individual record in the sales table.
So increasing sequential scan speed is important.

I tried

create table t1(v char(100), p numeric(12,5));
create table t2(v varchar(100), p numeric(12,5));
insert into t1 select '', generate_series from generate_series(1,1000000);
insert into t2 select '', generate_series from generate_series(1,1000000);

and after that measured speed of

select sum(p) from t1

and

select sum(p) from t2

both of them took approximately 800 ms

So it looks like there is no difference in sequential scan speed and thus no
need to change the char types.

Andrus
Andrus
2014-10-04 20:46:02 UTC
Hi!

Thank you.

In my db people often look for sales from different periods using different
filters and sum them.
There are a lot of sales, and every sale is an individual record in the sales table.
So increasing sequential scan speed is important.

I tried

create table t1(v char(100), p numeric(12,5));
create table t2(v varchar(100), p numeric(12,5));
insert into t1 select '', generate_series from generate_series(1,1000000);
insert into t2 select '', generate_series from generate_series(1,1000000);

and after that measured speed of

select sum(p) from t1

and

select sum(p) from t2

Both of them took approximately 800 ms

Also select max(length(v)) from t1 and select max(length(v)) from t2

speed is the same

I also tested a production database with 1600000 rows and a char(100) column by
running

select sum(decimalcolumn) from testtable

before and after running

alter table testtable alter char100column type varchar(100)

The select sum command took 1700 ms in both cases.

So there is no difference in sequential scan speed.
Replacing char with varchar requires rewriting some parts of the code.
Disk space is a minor issue compared to the cost of a code rewrite.
It looks like it is not reasonable to replace char with varchar.

Andrus
Merlin Moncure
2014-10-06 15:10:47 UTC
Post by Andrus
In my db people often looks for different period sales using different
filters and will sum
There are lot of sales and every sale is individual record in sales table.
So increasing sequential scan speed is important.
I tried
create table t1(v char(100), p numeric(12,5));
create table t2(v varchar(100), p numeric(12,5));
insert into t1 select '', generate_series from generate_series(1,1000000);
insert into t2 select '', generate_series from generate_series(1,1000000);
and after that measured speed of
select sum(p) from t1
and
select sum(p) from t2
Both of them took approximately 800 ms
So there is no difference in sequential scan speed.
Replacing char with varchar requires re-writing some parts of code.
Disk space is minor issue compared to cost of code-rewrite.
It looks like it is not reasonable to replace char with varchar.
Sure, in this trivial case it's not different (both tables are small,
fit in cache, and the numeric calculation is where the bulk of time is
getting spent). But if your table is double the size it's going to
have impacts on many real world workloads. I'm not in any way saying
to go change up your database but I'd definitely avoid char() for all
new code.

merlin
Adrian Klaver
2014-10-03 22:36:50 UTC
Post by Andrus
Hi!
Thank you for explanations.
the char type pads out the fields on disk.
It looks like you wrote that char takes more disk space.
from
http://www.pgcon.org/2013/schedule/attachments/269_tour-of-postgresql-data-types.pdf
Unlike many
databases, char(n) is NOT stored as a fixed-sized field
in Postgres. It is treated exactly the same as
varchar(n) except for being padded
So char type does not take more space than varchar.
Which directly contradicts the information on page 27:

Character Types (or Strings)

Name          Description
varchar(n)    variable-length with limit
char(n)       fixed-length, blank padded
text          variable unlimited length


and the docs:

http://www.postgresql.org/docs/9.3/interactive/datatype-character.html

Values of type character are physically padded with spaces to the
specified width n, and are stored and displayed that way. However, the
padding spaces are treated as semantically insignificant. Trailing
spaces are disregarded when comparing two values of type character, and
they will be removed when converting a character value to one of the
other string types. Note that trailing spaces are semantically
significant in character varying and text values, and when using pattern
matching, e.g. LIKE, regular expressions.


Tip: There is no performance difference among these three types, apart
from increased storage space when using the blank-padded type, and a few
extra CPU cycles to check the length when storing into a
length-constrained column. While character(n) has performance advantages
in some other database systems, there is no such advantage in
PostgreSQL; in fact character(n) is usually the slowest of the three
because of its additional storage costs. In most situations text or
character varying should be used instead.
Post by Andrus
Andrus.
--
Adrian Klaver
***@aklaver.com
Tom Lane
2014-10-03 22:59:36 UTC
Post by Andrus
Unlike many
databases, char(n) is NOT stored as a fixed-sized field
in Postgres. It is treated exactly the same as
varchar(n) except for being padded
This info is probably not as well worded as it could be, but it's not
really wrong. The key point is that char(N) is blank-padded (thereby
wasting space) to be N *characters*, but that is not necessarily N
*bytes*, because of possible multi-byte characters. Therefore the engine
has to treat it as a variable-length datatype.
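
A quick way to see this, assuming a UTF-8 database (õ is a two-byte character there):

SELECT octet_length('õõ'::char(5)) AS char_bytes,
       octet_length('õõ'::varchar(5)) AS varchar_bytes;
-- 7 vs 4: char(5) blank-pads the value to five *characters* (õõ plus three
-- spaces), which is seven bytes here, not a fixed five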

I believe in some other DBMSes, char(N) means N *bytes* and is
treated as a fixed-size datatype. Dunno what they do with multibyte
encodings.

regards, tom lane
Adrian Klaver
2014-10-03 23:15:58 UTC
Post by Tom Lane
Post by Andrus
Unlike many
databases, char(n) is NOT stored as a fixed-sized field
in Postgres. It is treated exactly the same as
varchar(n) except for being padded
This info is probably not as well worded as it could be, but it's not
really wrong. The key point is that char(N) is blank-padded (thereby
wasting space) to be N *characters*, but that is not necessarily N
*bytes*, because of possible multi-byte characters. Therefore the engine
has to treat it as a variable-length datatype.
Well, that is a distinction I had missed, thanks for the heads up. So the
bottom line is that char(N) is variable length, but values that do not
reach length N will require more storage space than varchar(n).
Post by Tom Lane
I believe in some some other DBMSes, char(N) means N *bytes* and is
treated as a fixed-size datatype. Dunno what they do with multibyte
encodings.
regards, tom lane
--
Adrian Klaver
***@aklaver.com
David G Johnston
2014-10-03 03:04:21 UTC
Post by Andrus
Table has both indexes present but postgres will not use them.
Indexes can changed and query can re-written if this helps.
Table structure cannot changed. char columns cannot replaced with varchar
columns. kuupaev must be date and kellaaeg must be char(5) type.
Don't discard the option to add additional trigger-maintained columns to the
table and then modify the where clause to use those new columns while still
outputting the originals.
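
A rough sketch of that idea (all names invented; it does add a column, so it only applies if the "no structure changes" rule can be bent for additions):

ALTER TABLE firma2.algsa ADD COLUMN toimumisaeg timestamp;

CREATE FUNCTION firma2.algsa_set_ts() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  NEW.toimumisaeg := NEW.kuupaev + replace(nullif(NEW.kellaaeg, ''), ' ', ':')::time;
  RETURN NEW;
END;
$$;

CREATE TRIGGER algsa_set_ts
BEFORE INSERT OR UPDATE ON firma2.algsa
FOR EACH ROW EXECUTE PROCEDURE firma2.algsa_set_ts();

CREATE INDEX algsa_laonr_toimumisaeg_idx
ON firma2.algsa (laonr, toimumisaeg);

-- existing rows would still need a one-off UPDATE to populate the new column,
-- and the WHERE clause would then compare against toimumisaeg directly.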

You also need to explain whether you can update from 9.0.3 to 9.0.18.

As Craig Ringer already said, working with seldom-used types in unusual ways
on ancient releases means that you may be stuck with second-rate workarounds
or just living with the performance degradation. I guess at worst you could
delete data so the scan doesn't take as long...

David J.


