Discussion:
How does PostgerSQL planner decide driving table
Shingo horiuchi
2014-10-06 05:38:16 UTC
Permalink
Dear All,



I have read about query tuning and attempt to check the impact of

exchanging the driving table in Join condition.



To test Simple Join condition, I prepared Two tables.

One is the table for employees and another is the table for departments.

Employees table has the foreign key which referencing departments table.



The proportion of each table, filtered table and joined table is below:

(E means employees table and D means departments table.)

#rows in E #rows in D #filtered rows in E
#filtered rows in D #rows in E and D

10000 490
1000
245 9800



After make the index on the filtered column, I tried the query:

EXPLAIN ANALYZE

SELECT D.Department_Name, E.Last_Name, E.First_Name

FROM Employees E, Departments D

WHERE E.Department_Id=D.Department_Id

AND E.Exempt_Flag='Y'

AND D.US_Based_Flag='Y'

;



Result was:

Hash Join (cost=8.85..241.59 rows=499 width=15) (actual time=0.105..2.052
rows=518 loops=1)

Hash Cond: (e.department_id = d.department_id)

-> Seq Scan on employees e (cost=0.00..209.00 rows=5000 width=17)
(actual time=0.007..1.541 rows=5000 loops=1)

Filter: (exempt_flag = 'Y'::bpchar)

Rows Removed by Filter: 5000

-> Hash (cost=8.24..8.24 rows=49 width=14) (actual time=0.087..0.087
rows=49 loops=1)

Buckets: 1024 Batches: 1 Memory Usage: 3kB

-> Bitmap Heap Scan on departments d (cost=4.63..8.24 rows=49
width=14) (actual time=0.069..0.078 rows=49 loops=1)

Recheck Cond: (us_based_flag = 'Y'::bpchar)

-> Bitmap Index Scan on dept2_flg_idx (cost=0.00..4.62
rows=49 width=0) (actual time=0.063..0.063 rows=49 loops=1)

Index Cond: (us_based_flag = 'Y'::bpchar)

Total runtime: 2.095 ms



In order to exchange the driving table, I tried the query:

EXPLAIN ANALYZE

SELECT D.Department_Name, E.Last_Name, E.First_Name

FROM Departments D, Employees E

WHERE D.Department_Id=E.Department_Id

AND E.Exempt_Flag='Y'

AND D.US_Based_Flag='Y'

;



However, the result was same.

I think this is because the query planner can optimizer the 2nd query based
on table statistics of E and D.

E being the larger number of records and has higher filtering rate so it
continues to be driving table.



Now, I tried another test case to confirm my assumption.

The proportion of another test case is different from above one.

The proportion of each table, filtered table and joined table is below:

(E means employees table and D means departments table.)

#rows in E #rows in D #filtered rows in E
#filtered rows in D #rows in E and D

10000 490 5000
49 9800

The important point is the difference in the filtering rate.

In this case, departments table is higher filtering rate,

so taking departments table as driving table will be able to cut the
computational cost, I think.



I tried same query:

EXPLAIN ANALYZE

SELECT D.Department_Name, E.Last_Name, E.First_Name

FROM Employees E, Departments D

WHERE E.Department_Id=D.Department_Id

AND E.Exempt_Flag='Y'

AND D.US_Based_Flag='Y'

And

EXPLAIN ANALYZE

SELECT D.Department_Name, E.Last_Name, E.First_Name

FROM Employees E, Departments D

WHERE E.Department_Id=D.Department_Id

AND E.Exempt_Flag='Y'

AND D.US_Based_Flag='Y'



The results was same as earlier , though I expected driving table to be
Departments table in both these cases

Question:

How does PostgerSQL planner decide driving table?



My environment is

Postgres ver. 9.2.7 on x86_64-redhat-linux-gnu

CentOS ver. 7.0.1406

on Virtual Box ver. 4.3.16

in Windows7



Regards,



Shingo Horiuchi
Emanuel Calvo
2014-10-06 16:24:24 UTC
Permalink
Post by Shingo horiuchi
Dear All,
I have read about query tuning and attempt to check the impact of
exchanging the driving table in Join condition.
To test Simple Join condition, I prepared Two tables.
One is the table for employees and another is the table for departments.
Employees table has the foreign key which referencing departments table.
(E means employees table and D means departments table.)
#rows in E #rows in D #filtered rows in
E #filtered rows in D #rows in E
and D
10000 490
1000
245 9800
EXPLAIN ANALYZE
SELECT D.Department_Name, E.Last_Name, E.First_Name
FROM Employees E, Departments D
WHERE E.Department_Id=D.Department_Id
AND E.Exempt_Flag='Y'
AND D.US_Based_Flag='Y'
;
Hash Join (cost=8.85..241.59 rows=499 width=15) (actual
time=0.105..2.052 rows=518 loops=1)
Hash Cond: (e.department_id = d.department_id)
-> Seq Scan on employees e (cost=0.00..209.00 rows=5000 width=17)
(actual time=0.007..1.541 rows=5000 loops=1)
Filter: (exempt_flag = 'Y'::bpchar)
Rows Removed by Filter: 5000
-> Hash (cost=8.24..8.24 rows=49 width=14) (actual
time=0.087..0.087 rows=49 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 3kB
-> Bitmap Heap Scan on departments d (cost=4.63..8.24
rows=49 width=14) (actual time=0.069..0.078 rows=49 loops=1)
Recheck Cond: (us_based_flag = 'Y'::bpchar)
-> Bitmap Index Scan on dept2_flg_idx
(cost=0.00..4.62 rows=49 width=0) (actual time=0.063..0.063 rows=49
loops=1)
Index Cond: (us_based_flag = 'Y'::bpchar)
Total runtime: 2.095 ms
I would recommend to run an ANALYZE on both tables.

You can't exchange the seqscan table on that query due that it needs to
read ALL the
records on E table. If you seqscan D first, the engine needs to read
again all the E
records.

If you want all the employees across all departments, you'll always end
up reading all
the employees.

What you can do to other plans, is disabling enable_hashjoin.
Post by Shingo horiuchi
*/ /*
EXPLAIN ANALYZE
SELECT D.Department_Name, E.Last_Name, E.First_Name
FROM Departments D, Employees E
WHERE D.Department_Id=E.Department_Id
AND E.Exempt_Flag='Y'
AND D.US_Based_Flag='Y'
;
However, the result was same.
I think this is because the query planner can optimizer the 2^nd
query based on table statistics of E and D.
E being the larger number of records and has higher filtering rate so
it continues to be driving table.
Now, I tried another test case to confirm my assumption.
The proportion of another test case is different from above one.
(E means employees table and D means departments table.)
#rows in E #rows in D #filtered rows in
E #filtered rows in D #rows in E
and D
10000
490 5000
49 9800
The important point is the difference in the filtering rate.
In this case, departments table is higher filtering rate,
so taking departments table as driving table will be able to cut the
computational cost, I think.
EXPLAIN ANALYZE
SELECT D.Department_Name, E.Last_Name, E.First_Name
FROM Employees E, Departments D
WHERE E.Department_Id=D.Department_Id
AND E.Exempt_Flag='Y'
AND D.US_Based_Flag='Y'
And
EXPLAIN ANALYZE
SELECT D.Department_Name, E.Last_Name, E.First_Name
FROM Employees E, Departments D
WHERE E.Department_Id=D.Department_Id
AND E.Exempt_Flag='Y'
AND D.US_Based_Flag='Y'
Both queries are identical (?).
--
--
Emanuel Calvo http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Loading...