It has been a while since a PostgreSQL release brought as many things interesting to developers as version 9.5 does. We found a lot of features that will make our lives easier and our code safer. Atomic UPSERT and skipping locked rows with SELECT ... SKIP LOCKED are definitely features we have long been waiting for.
UPSERT
UPSERT is probably one of the most wanted features in PostgreSQL, especially among developers working on highly concurrent software. It gives us the ability to atomically execute either an insert or an update, depending on whether the new row would violate a constraint.
Let’s prepare the table for testing the UPSERT command:
CREATE TABLE upsert_test (
id SERIAL NOT NULL PRIMARY KEY,
type int4 NOT NULL,
number int4 NOT NULL,
name text NOT NULL
);
ALTER TABLE upsert_test ADD CONSTRAINT upsert_test_unique UNIQUE (type, number);
Now we can try some basic UPSERT variations. Let’s suppose we populated this table with two rows, so it looks like this:
id | type | number | name
----+------+--------+---------
1 | 1 | 1 | Name 1
2 | 1 | 2 | Name 2
According to the documentation, we can react to a conflict during an INSERT in one of two ways: DO NOTHING (skip the insert) or execute an UPDATE on the conflicting row. Most importantly, PostgreSQL ensures the check and the subsequent insert or update are executed atomically: it can't happen that another transaction adds a conflicting row between the check and the insert. This is great, because until now developers had to handle this on their own with some SQL gymnastics. Let's look at the most common use case. We want to insert a new entry into the table, but in case of a constraint violation we skip the insert (assuming another transaction has already inserted the wanted value):
INSERT INTO upsert_test (type, number, name) VALUES (1, 1, 'Name 3') ON CONFLICT DO NOTHING;
INSERT 0 0
As we can see, the insert finished without any errors, but no new row was inserted. It is important to handle this properly in code: successful inserts are sometimes handled with the assumption that exactly one row was inserted, and now we also have to consider the case where no rows are inserted because the conflict was ignored.
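A simple way to detect the skipped insert is to add a RETURNING clause: when the conflict causes the insert to be skipped, no row comes back.
-- Returns the new id when a row is inserted; returns an empty result set when skipped
INSERT INTO upsert_test (type, number, name) VALUES (1, 1, 'Name 3') ON CONFLICT DO NOTHING RETURNING id;
What happens if we want to change the conflicting row instead: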
-- If there is conflict on unique constraint on columns type and number, update conflicted row with new name
INSERT INTO upsert_test (type, number, name) VALUES (1, 1, 'Name 3') ON CONFLICT (type, number) DO UPDATE SET name = EXCLUDED.name;
INSERT 0 1
It is clear from the example that if the conflicting row exists, it will be updated with the new name. It is important to know that we must have a unique index on the columns type and number; if such an index doesn't exist, an error is raised. We can also target the exact constraint by name, but that wouldn't be wise: it is better to target the columns, because we may rename the constraint in the future or create other constraints with a different order of columns. We can also add a predicate to the conflict target:
-- If there is conflict on unique constraint on columns type and number, update conflicted row with new name
INSERT INTO upsert_test (type, number, name) VALUES (1, 1, 'Name 3') ON CONFLICT (type, number) WHERE type = 1 DO UPDATE SET name = EXCLUDED.name;
If we have a predicate (in this case WHERE type = 1) and there is no exactly matching partial index, PostgreSQL will try to find an index without a predicate (a non-partial index) and use it. This means that the conflict will be checked on all conflicts on type and number, not just those with type = 1! We can also add a WHERE clause to the update of the conflicting row:
-- If there is conflict on unique constraint on columns type and number, update conflicted row with new name
INSERT INTO upsert_test (type, number, name) VALUES (1, 1, 'Name 3') ON CONFLICT (type, number) DO UPDATE SET name = EXCLUDED.name WHERE type = 1;
In this example only rows with type = 1 will be updated. If the conflicting row doesn't satisfy that condition, it will be ignored. It is important to know that all conflicting rows will be locked, even those that are not updated because of the WHERE condition.
New options for GROUP BY
PostgreSQL 9.5 brings new options for GROUP BY in the form of grouping sets. They come in three variants: GROUPING SETS, ROLLUP and CUBE. Let's suppose we have this employee table:
SELECT * FROM employee;
id | full_name | age | salary | gender | role
----+------------------+-----+-----------+--------+-----------------
1 | Barry Sutton | 31 | 60000.00 | Male | Manager
2 | Austin Jensen | 26 | 52500.00 | Male | Developer
3 | Charles Robbins | 48 | 80400.00 | Male | Project Manager
4 | Sylvester Carson | 22 | 45800.00 | Male | Developer
5 | Julius Gonzales | 38 | 71000.00 | Male | Sales
6 | Lana Lewis | 34 | 63700.00 | Female | Manager
7 | Gail Ingram | 30 | 54300.00 | Female | Developer
8 | Delores Brock | 24 | 48250.00 | Female | Sales
9 | Whitney Mcguire | 51 | 79800.00 | Female | Project Manager
10 | Kristy Romero | 55 | 102800.00 | Female | CEO
(10 rows)
If we want to see the average salary separately by role, by gender, by age group and overall, we can list those grouping sets explicitly:
SELECT role, gender, (age / 10)::text || 'x' AS age_group, avg(salary) FROM employee GROUP BY GROUPING SETS (role, gender, age / 10, ());
role | gender | age_group | avg
-----------------+--------+-----------+--------
CEO | | | 102800
Developer | | | 50867
Manager | | | 61850
Project Manager | | | 80100
Sales | | | 59625
| | | 65855
| | 2x | 48850
| | 3x | 62250
| | 4x | 80400
| | 5x | 91300
| Female | | 69770
| Male | | 61940
(12 rows)
We can see average salaries for each of those groupings and also overall (the line with empty values for role, gender and age_group). If, instead, we want the average salaries for every possible combination of the three groupings, we can use CUBE:
SELECT role, gender, (age / 10)::text || 'x' AS age_group, avg(salary) FROM employee GROUP BY CUBE (role, gender, age / 10);
role | gender | age_group | avg
-----------------+--------+-----------+--------
CEO | Female | 5x | 102800
CEO | Female | | 102800
CEO | | | 102800
Developer | Female | 3x | 54300
Developer | Female | | 54300
Developer | Male | 2x | 49150
Developer | Male | | 49150
Developer | | | 50867
Manager | Female | 3x | 63700
Manager | Female | | 63700
Manager | Male | 3x | 60000
Manager | Male | | 60000
Manager | | | 61850
Project Manager | Female | 5x | 79800
Project Manager | Female | | 79800
Project Manager | Male | 4x | 80400
Project Manager | Male | | 80400
Project Manager | | | 80100
Sales | Female | 2x | 48250
Sales | Female | | 48250
Sales | Male | 3x | 71000
Sales | Male | | 71000
Sales | | | 59625
| | | 65855
Developer | | 2x | 49150
Sales | | 2x | 48250
| | 2x | 48850
Developer | | 3x | 54300
Manager | | 3x | 61850
Sales | | 3x | 71000
| | 3x | 62250
Project Manager | | 4x | 80400
| | 4x | 80400
CEO | | 5x | 102800
Project Manager | | 5x | 79800
| | 5x | 91300
| Female | 2x | 48250
| Female | 3x | 59000
| Female | 5x | 91300
| Female | | 69770
| Male | 2x | 49150
| Male | 3x | 65500
| Male | 4x | 80400
| Male | | 61940
(44 rows)
We now see all possible combinations of values from the columns mentioned in the GROUP BY clause. Finally, if we want progressively coarser groupings from left to right (all three columns; role and gender; role alone; overall), we can use the ROLLUP option:
SELECT role, gender, (age / 10)::text || 'x' AS age_group, avg(salary) FROM employee GROUP BY ROLLUP (role, gender, age / 10);
role | gender | age_group | avg
-----------------+--------+-----------+--------
CEO | Female | 5x | 102800
CEO | Female | | 102800
CEO | | | 102800
Developer | Female | 3x | 54300
Developer | Female | | 54300
Developer | Male | 2x | 49150
Developer | Male | | 49150
Developer | | | 50867
Manager | Female | 3x | 63700
Manager | Female | | 63700
Manager | Male | 3x | 60000
Manager | Male | | 60000
Manager | | | 61850
Project Manager | Female | 5x | 79800
Project Manager | Female | | 79800
Project Manager | Male | 4x | 80400
Project Manager | Male | | 80400
Project Manager | | | 80100
Sales | Female | 2x | 48250
Sales | Female | | 48250
Sales | Male | 3x | 71000
Sales | Male | | 71000
Sales | | | 59625
| | | 65855
(24 rows)
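For reference, both ROLLUP and CUBE are just shorthand for explicit grouping sets; the ROLLUP query above could also be written as:
SELECT role, gender, (age / 10)::text || 'x' AS age_group, avg(salary) FROM employee GROUP BY GROUPING SETS ((role, gender, age / 10), (role, gender), (role), ());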
SELECT … SKIP LOCKED
SKIP LOCKED is added as an alternative to NOWAIT in the locking clause of a SELECT query. The NOWAIT option (available in previous versions of PostgreSQL) simply raises an error if a row can't be locked immediately; SKIP LOCKED, on the other hand, just skips that row. This is very useful in highly concurrent environments when we are selecting a certain number of rows but can live without some of them (for example, selecting a number of random rows to be shown to the user).
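A minimal sketch of the pattern, assuming a hypothetical tasks table: each worker locks up to ten rows that no other worker currently holds, without blocking or raising an error on locked ones.
BEGIN;
-- Rows locked by other workers are silently skipped instead of blocking
SELECT id, payload
FROM tasks
WHERE processed = false
ORDER BY id
LIMIT 10
FOR UPDATE SKIP LOCKED;
-- ... process the selected rows, then mark them as processed ...
COMMIT;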
SELECT … TABLESAMPLE
The TABLESAMPLE clause allows us to read only a random sample of a table, for example just one percent of its rows. This is particularly useful for statistical analyses on large tables. By default, there are two methods of choosing a random part of a table: SYSTEM and BERNOULLI. SYSTEM picks one or more physical parts of the table (pages on disk) and returns all the rows on them. If our inserts are evenly distributed this is good enough, but if they are not, we should use the BERNOULLI method, which scans the entire table and returns random rows. It is slower, especially when we take a lower percentage of the table. For example, on a table with one million entries, sampling one percent of the rows with the SYSTEM method takes (in my testing environment) about 7 ms and with BERNOULLI about 43 ms. For 50 percent of the rows, the SYSTEM method takes about 292 ms and BERNOULLI about 331 ms. We can pass a random seed via the REPEATABLE clause and, if the table has not changed in the meantime, the same row set is returned on each call; if we omit the seed, a new one is generated for every query. For example:
SELECT * FROM employees TABLESAMPLE BERNOULLI (1);
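And with a fixed seed, so that repeated runs over an unchanged table return the same sample:
SELECT * FROM employees TABLESAMPLE BERNOULLI (1) REPEATABLE (42);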
UPDATE multiple columns from SELECT sub-query
To update multiple columns with a sub-select query that returns a row with the same number and types of columns, we can now do this:
UPDATE executors ex
SET (full_name, role) = (
SELECT em.name, em.role
FROM employee em
WHERE em.employee_id = ex.employee_id
);
In the above query, two columns (full_name and role) are updated with the result of the sub-query. There is no need to write two separate assignments, keep the result of the sub-query in temporary variables, or use a WITH statement.
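One caveat worth knowing: if the sub-query returns no row for a given executors row, the target columns are set to NULL. A sketch of guarding against that, reusing the hypothetical tables from the example above:
-- Only update executors that actually have a matching employee row
UPDATE executors ex
SET (full_name, role) = (
    SELECT em.name, em.role
    FROM employee em
    WHERE em.employee_id = ex.employee_id
)
WHERE EXISTS (
    SELECT 1 FROM employee em WHERE em.employee_id = ex.employee_id
);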
Suggest misspelled column names
This small addition is very useful, especially given the limited autocompletion in the default psql client. We often misspell column names when writing queries:
SELECT id, nam FROM employees;
ERROR: column "nam" does not exist
LINE 1: select id, nam from employees;
^
HINT: Perhaps you meant to reference the column "employees.name".
Unfortunately, if we misspell more than one column name, we will be notified about the first one only, so we have to execute the same query again to see the issue in subsequent parts of the query.
DDL changes
The most important change is that an IF NOT EXISTS clause is added to CREATE TABLE AS, CREATE INDEX, CREATE SEQUENCE and CREATE MATERIALIZED VIEW, and an IF EXISTS clause is added to ALTER TABLE ... RENAME CONSTRAINT.
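For example, these statements (with illustrative object names, reusing the upsert_test table from earlier) can now be re-run safely without raising an error:
CREATE INDEX IF NOT EXISTS upsert_test_name_idx ON upsert_test (name);
CREATE SEQUENCE IF NOT EXISTS upsert_test_seq;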
Changes in indexes
The Block Range Index (BRIN) is added in PostgreSQL 9.5. A BRIN index is useful for columns whose values correlate with their physical location in the table (the pages on disk). For example, if we store the time of row insertion, rows with similar times are mostly inserted in the same period and thus end up on the same disk pages. This is very useful for huge tables with large B-tree indexes. A B-tree index is still much faster for lookups, because a BRIN index sometimes fetches pages that don't contain the searched value; on the other hand, BRIN indexes are much smaller and faster to create. Which index to use is a compromise that depends on the use case of the particular table. We can compare some values from a table with 66 million rows:
|            | BRIN   | B-Tree   |
|------------|--------|----------|
| Index size | 136 KB | 1414 MB  |
| Duration   | 4.5 ms | 0.374 ms |
This is an analysis for counting the number of rows with a certain int value; the same query without an index takes about 11.6 seconds. That is a big difference in both size and performance. With these numbers, the BRIN index will definitely be used for huge tables and queries that are executed occasionally. It is important to make sure the index is maintained regularly. If new rows are added within already indexed value ranges, the index is maintained as expected: the new pages are added to the existing summarized ranges. On the other hand, if a new range is added (for example a value greater than any previously inserted), the index must be maintained manually. This can be done by running VACUUM on the table, or by calling the brin_summarize_new_values(regclass) function.
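A minimal sketch, assuming a hypothetical events table whose created_at column correlates with insertion order:
-- created_at correlates with physical row order, a good fit for BRIN
CREATE INDEX events_created_at_brin ON events USING brin (created_at);
-- Summarize page ranges added since the last VACUUM
SELECT brin_summarize_new_values('events_created_at_brin');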
Sorting optimization
According to the release notes for PostgreSQL 9.5, sorting should be significantly faster than before. I ran some tests and the speed-up is obvious, especially for non-indexed tables. In one specific example, sorting 1 million random 32-character strings took 6.2 seconds on PostgreSQL 9.4 and 3.4 seconds on PostgreSQL 9.5. On indexed columns the improvement is less visible, but it exists: it is about 5%.
Utilities
The most notable change in the tools is that vacuumdb now has a -j option specifying the number of jobs to execute in parallel:
vacuumdb -j 8
Note that each job creates its own connection to the database, so make sure max_connections is greater than the number of jobs.