Grouping Sets
From PostgreSQL Wiki
I collect info about Grouping Sets feature.
What is it:
- http://it.toolbox.com/blogs/db2luw/olap-sql-part-3-grouping-sets-8125
- http://www.inf.uni-konstanz.de/dbis/teaching/ws0405/information-systems/tutorial-04.pdf
- http://www.nabble.com/proposal---GROUPING-SETS-td19509168.html - my proposal
- http://www.pgsql.cz/patches/gsets.diff.gz - WIP patch
Grouping Sets allows more repeated grouping clauses in one query. The purpose is to support the analytic multidimensional view of data. The keywords CUBE and ROLLUP ware used originally only as a group clause flag, but in ANSI SQL they provide more general solutions. It's a big enhancement of GROUP BY clause:
Contents |
Introduction
Table cars:
create table cars(name varchar, place varchar, count integer);
insert into cars values('skoda', 'czech rep.', 10000);
insert into cars values('skoda', 'germany', 5000);
insert into cars values('bmw', 'czech rep.', 100);
insert into cars values('bmw', 'germany', 1000);
insert into cars values('opel', 'czech rep.', 7000);
insert into cars values('opel', 'germany', 7000);
postgres=# select * from cars;
name | place | count
-------+------------+-------
skoda | czech rep. | 10000
skoda | germany | 5000
bmw | czech rep. | 100
bmw | germany | 1000
opel | czech rep. | 7000
opel | germany | 7000
(6 rows)
Classic queries are:
postgres=# select name, sum(count) from cars group by name; name | sum -------+------- bmw | 1100 skoda | 15000 opel | 14000 (3 rows) postgres=# select place, sum(count) from cars group by place; place | sum ------------+------- germany | 13000 czech rep. | 17100 (2 rows)
But these queries are isolated and don't allow rich view on data - they are one dimensional.
Multidimensional queries allow us to combine partial views into one report:
<postgres=# select name, place, sum(count) from cars group by grouping sets(name, place, ());
name | place | sum
-------+------------+-------
bmw | | 1100
skoda | | 15000
opel | | 14000
| germany | 13000
| czech rep. | 17100
| | 30100
(6 rows)
Last query allows fast view on count of cars grouped by name and by place (there are two dimensions). Last query is equivalent to query:
select name, null as place, sum(count) from cars group by name union all select null, place, sum(count) from cars group by place union all select null, null, sum(count) from cars;
All combination of attributes do CUBE operator:
postgres=# select name, place, sum(count) from cars group by cube(name, place);
name | place | sum
-------+------------+-------
skoda | czech rep. | 10000
opel | germany | 7000
bmw | germany | 1000
bmw | czech rep. | 100
opel | czech rep. | 7000
skoda | germany | 5000
bmw | | 1100
skoda | | 15000
opel | | 14000
| germany | 13000
| czech rep. | 17100
| | 30100
(12 rows)
Rollup analyze is supported with ROLLUP operator
postgres=# select name, place, sum(count) from cars group by rollup(name, place);
name | place | sum
-------+------------+-------
skoda | czech rep. | 10000
opel | germany | 7000
bmw | germany | 1000
bmw | czech rep. | 100
opel | czech rep. | 7000
skoda | germany | 5000
bmw | | 1100
skoda | | 15000
opel | | 14000
| | 30100
(10 rows)
ROLLUP and CUBE operators should be simple to transform to GROUPING SETS spec:
select name, place, sum(count) from cars group by grouping sets((name, place),name, place,());
name | place | sum
-------+------------+-------
skoda | czech rep. | 10000
opel | germany | 7000
bmw | germany | 1000
bmw | czech rep. | 100
opel | czech rep. | 7000
skoda | germany | 5000
bmw | | 1100
skoda | | 15000
opel | | 14000
| germany | 13000
| czech rep. | 17100
| | 30100
(12 rows)
Virtual values
Sure, there are some fields that don't have any value (SQL uses NULL value for this case). We need to detect NULLs in data and also generated NULLs. For this purpose ANSI SQL has function grouping(col) that returns one when column is used for grouping or zero when column and all derivated values are NULL. From world of commercial databases comes the function grouping_id:
postgres=# select name, place, sum(count), grouping(name), grouping(place), grouping_id(name, place)
from cars group by grouping sets((name, place),name, place,());
name | place | sum | grouping | grouping | grouping_id
-------+------------+-------+----------+----------+-------------
skoda | czech rep. | 10000 | 1 | 1 | 3
opel | germany | 7000 | 1 | 1 | 3
bmw | germany | 1000 | 1 | 1 | 3
bmw | czech rep. | 100 | 1 | 1 | 3
opel | czech rep. | 7000 | 1 | 1 | 3
skoda | germany | 5000 | 1 | 1 | 3
bmw | | 1100 | 1 | 0 | 2
skoda | | 15000 | 1 | 0 | 2
opel | | 14000 | 1 | 0 | 2
| germany | 13000 | 0 | 1 | 1
| czech rep. | 17100 | 0 | 1 | 1
| | 30100 | 0 | 0 | 0
(12 rows)
Implementation
Current implementation is based on modification of normal hash aggregation. Hash are used because it allows one scan of data and parallel processing of all grouping sets (it doesn't need materialization). The disadvantage is that only types that support hashing are supported, and there could be some "out of memory problem" for very large results. This feature is a little bit similar CTE. It allows repeated scan of some source (result, table), and probably GROUPING SETS should be based on CTE (non hashed mode). CTEs are more general, but GROUPING SETS would be implemented more effectively, because difference to normal query is only in GROUP BY clause. Internally GROUPING SETS suffers from same problem as CTE. It broke current internal system of PostgreSQL data pipeline which is based on data pulling. GS needs data pushing. It's solved with helper node that should carry one tuple (feeder) and special node operation, these ensure only one tuple processing. Hash operations in PostgreSQL are bulk operations - first, hash table is completely filled and later is read row by row.
You can see feeder node in query plan (one feeder is used many times):
postgres=# explain select name, place, sum(count),
grouping(name), grouping(place), grouping_id(name, place)
from cars group by grouping sets((name, place),name, place,());
QUERY PLAN
-----------------------------------------------------------
Grouping Sets (cost=1.08..1.29 rows=9 width=18)
-> Seq Scan on cars (cost=0.00..1.06 rows=6 width=18)
-> HashAggregate (cost=1.10..1.14 rows=3 width=18)
-> Feeder (cost=0.00..1.06 rows=6 width=18)
-> HashAggregate (cost=1.09..1.13 rows=3 width=18)
-> Feeder (cost=0.00..1.06 rows=6 width=18)
-> HashAggregate (cost=1.09..1.11 rows=2 width=18)
-> Feeder (cost=0.00..1.06 rows=6 width=18)
-> Aggregate (cost=1.08..1.09 rows=1 width=18)
-> Feeder (cost=0.00..1.06 rows=6 width=18)
(10 rows)
Calls of functions grouping and grouping_id are replaced at planner time with adequate values:
postgres=# explain verbose select name, place, sum(count),
grouping(name), grouping(place), grouping_id(name, place)
from cars group by grouping sets((name, place),name, place,());
QUERY PLAN
---------------------------------------------------------------------------------------
Grouping Sets (cost=1.08..1.29 rows=9 width=18)
Output: name, place, sum(count), 0, 0, 0
-> Seq Scan on cars (cost=0.00..1.06 rows=6 width=18)
Output: name, place, count
-> HashAggregate (cost=1.10..1.14 rows=3 width=18)
Output: name, place, sum(count), 1, 1, 3
-> Feeder (cost=0.00..1.06 rows=6 width=18)
Output: name, place, count
-> HashAggregate (cost=1.09..1.13 rows=3 width=18)
Output: name, NULL::character varying, sum(count), 1, 0, 2
-> Feeder (cost=0.00..1.06 rows=6 width=18)
Output: name, place, count
-> HashAggregate (cost=1.09..1.11 rows=2 width=18)
Output: NULL::character varying, place, sum(count), 0, 1, 1
-> Feeder (cost=0.00..1.06 rows=6 width=18)
Output: name, place, count
-> Aggregate (cost=1.08..1.09 rows=1 width=18)
Output: NULL::character varying, NULL::character varying, sum(count), 0, 0, 0
-> Feeder (cost=0.00..1.06 rows=6 width=18)
Output: name, place, count
(20 rows)
Known limits
Only hash based mode is supported - code for materialization mode (with rescans) should be probably shared with CTE.
Known bugs
a) problem with functions - when target list contains functions, that are not in group by clause:
postgres=# explain verbose select sum(sin(count)) from cars group by grouping sets(name,place);
QUERY PLAN
-----------------------------------------------------------
Grouping Sets (cost=1.09..1.21 rows=5 width=18)
Output: sum(sin((count)::double precision))
-> Seq Scan on cars (cost=0.00..1.06 rows=6 width=18)
Output: count, name, place
-> HashAggregate (cost=1.09..1.14 rows=3 width=18)
Output: sum(sin((count)::double precision))
-> Feeder (cost=0.00..1.06 rows=6 width=18)
Output: count, name, place
-> HashAggregate (cost=1.09..1.12 rows=2 width=18)
Output: sum(sin((count)::double precision))
-> Feeder (cost=0.00..1.06 rows=6 width=18)
Output: count, name, place
(12 rows)
postgres=# select sum(sin(count)) from cars group by grouping sets(name,place);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
Problem is in source plan, it should to do scalar function call too, like:
postgres=# explain verbose select sum(sin(count)) from cars group by grouping sets(sin(count));
QUERY PLAN
-------------------------------------------------------------
Grouping Sets (cost=1.09..1.20 rows=5 width=4)
Output: sum(sin((count)::double precision))
-> Seq Scan on cars (cost=0.00..1.06 rows=6 width=4)
Output: count, sin((count)::double precision)
-> HashAggregate (cost=1.09..1.20 rows=5 width=4)
Output: sum((sin((count)::double precision)))
-> Feeder (cost=0.00..1.06 rows=6 width=4)
Output: count, sin((count)::double precision)
(8 rows)
b) some strange hashing (in some cases)
postgres=# select name, place from cars group by grouping sets(name, place);
name | place
-------+------------
skoda |
bmw |
opel |
| czech rep.
| czech rep.
| czech rep.
(6 rows)
--with one aggregate the result is correct:
postgres=# select name, place, sum(count) from cars group by grouping sets(name, place);
name | place | sum
-------+------------+-------
skoda | | 15000
bmw | | 1100
opel | | 14000
| germany | 13000
| czech rep. | 17100
(5 rows)
Current state
Current patch is very experimental. Mainly, the planer stage should be refactored. There are a lot of lines of code redundant with GROUP clause implementation (that would be subset of GROUPING SETS implementation)- Any planner experts are welcome! Current GROUPING SETS planner part is a reason for lot of crashes :(.
ToDo
- grouping planner refactoring
- documentation
Notes
- using qualifier in GROUP BY clause (default is ALL)
SELECT ... GROUP BY [ALL|DISTINCT] GROUPING SETS(... GROUP BY DISTINCT means SELECT ... UNION SELECT ... UNION ... SELECT
