Indexing and Querying Postgres Hstore
Hstore
is a really useful datatype in Postgres because you can easily extend your schema, and store different data. But, you might get to a point where you’ll have to query for specific keys. That’s what happened to us.
The Use Case
We have an entity in our database which contains an hstore column, to store some specifics. At a certain point during our development, we decided we want to query the hstore column, and, of course, we wanted it to be fast. We have a limited amount of keys, several of them, relatively static. The table itself contained a dozen million entries, where some keys were contained for many entries, and some for a smaller percent. The values were unique between themselves for a certain key. We are using PostgreSQL 9.2 and all the examples will be run on this version.
Preparing Test Data
So, let’s prepare the test by first installing the hstore
extension into our database:
CREATE EXTENSION hstore;
Now we can insert some data into it:
CREATE TABLE entity (
id serial NOT NULL PRIMARY KEY,
store hstore
);
Let’s say we have three keys: key1
, key2
and key3
to be used in our hstore. We’ll insert 1 million entries with all three keys, 1 million with keys 1 and 2, 2 millions with key 1, 1.5 million with key 2, and 500 thousand entries with key3. Four million entries will have no values in hstore.
INSERT INTO entity(store)
SELECT ('key1 =>' || i || ',key2=>' || i || ',key3=>' || i)::hstore
FROM generate_series(1,1000000) AS i;
INSERT INTO entity(store)
SELECT ('key1 =>' || i || ',key2=>' || i)::hstore
FROM generate_series(1000001,2000000) AS i;
INSERT INTO entity(store)
SELECT ('key1 =>' || i)::hstore
FROM generate_series(2000001,4000000) AS i;
INSERT INTO entity(store)
SELECT ('key2 =>' || i)::hstore
FROM generate_series(4000001,5500000) AS i;
INSERT INTO entity(store)
SELECT ('key3 =>' || i)::hstore
FROM generate_series(5500001,6000000) AS i;
INSERT INTO entity(store)
SELECT ''::hstore
FROM generate_series(6000001,10000000) AS i;
Testing
So, here are now a few timings for some queries:
EXPLAIN ANALYZE SELECT count(*) FROM entity WHERE exist(store, 'key1');
QUERY PLAN
--------------------------------------------------------------------------
Aggregate (cost=108408.67..108408.67 rows=1 width=0)
(actual time=2849.406..2849.407 rows=1 loops=1)
-> Seq Scan on entity (cost=0.00..106742.00 rows=3333333 width=0)
(actual time=0.019..2443.097 rows=4000000 loops=1)
Filter: exist(store, 'key1'::text)
Rows Removed by Filter: 6000000
Total runtime: 2849.450 ms
(5 rows)
EXPLAIN ANALYZE SELECT * FROM entity WHERE store->'key1' = '950555';
QUERY PLAN
--------------------------------------------------------------------------
Seq Scan on entity (cost=0.00..111742.00 rows=50000 width=36)
(actual time=509.584..2531.219 rows=1 loops=1)
Filter: ((store -> 'key1'::text) = '950555'::text)
Rows Removed by Filter: 9999999
Total runtime: 2531.248 ms
(4 rows)
As can be read, over 2 seconds isn’t something regarded as fast when talking about fetching a single row from a database, and sequential scan is performed, so we should add an index. We’ll test on key1 only. Create the index and check the query again.
CREATE UNIQUE INDEX "key1_idx" ON "entity" (
(store -> 'key1')
);
EXPLAIN ANALYZE SELECT count(*) FROM entity WHERE exist(store, 'key1');
QUERY PLAN
-------------------------------------------------------------------------------
Aggregate (cost=108408.67..108408.67 rows=1 width=0)
(actual time=2601.304..2601.304 rows=1 loops=1)
-> Seq Scan on entity (cost=0.00..106742.00 rows=3333333 width=0)
(actual time=0.017..2238.583 rows=4000000 loops=1)
Filter: exist(store, 'key1'::text)
Rows Removed by Filter: 6000000
Total runtime: 2601.344 ms
(5 rows)
EXPLAIN ANALYZE SELECT * FROM entity WHERE store->'key1' = '950555';
QUERY PLAN
-------------------------------------------------------------------------------
Index Scan using key1_idx on entity (cost=0.00..2.33 rows=1 width=36)
(actual time=0.060..0.061 rows=1 loops=1)
Index Cond: ((store -> 'key1'::text) = '950555'::text)
Total runtime: 0.088 ms
(3 rows)
Now the fetching of an entity is much faster, since it is using the index we created. Counting of entities containing key1 is still slow, and it uses sequential scan, but it’s not our use case. But when we are talking about indices we should also consider their spatial efficiency. Key1 is the most common one, and only 40% of entities contain it. With the current index, we have 60% of unnecessary entries in the index. So, we’ll add the condition used for checking the number of entities containing key1.
DROP INDEX "key1_idx";
CREATE UNIQUE INDEX "key1_idx" ON "entity" (
(store -> 'key1')
) WHERE exist(store, 'key1');
EXPLAIN ANALYZE SELECT count(*) FROM entity WHERE exist(store, 'key1');
QUERY PLAN
----------------------------------------------------------------------------
Aggregate (cost=97550.82..97550.83 rows=1 width=0)
(actual time=1585.147..1585.148 rows=1 loops=1)
-> Index Scan using key1_idx on entity
(cost=0.00..95884.16 rows=3333333 width=0)
(actual time=0.054..1203.578 rows=4000000 loops=1)
Total runtime: 1585.187 ms
(3 rows)
EXPLAIN ANALYZE SELECT * FROM entity WHERE store->'key1' = '950555';
QUERY PLAN
----------------------------------------------------------------------------
Seq Scan on entity (cost=0.00..111742.00 rows=50000 width=36)
(actual time=523.964..2826.130 rows=1 loops=1)
Filter: ((store -> 'key1'::text) = '950555'::text)
Rows Removed by Filter: 9999999
Total runtime: 2826.160 ms
(4 rows)
Well, we decreased the runtime for the first query. On the other hand, what’s really bad is that our main query for fetching the entity is once again not using the index. This is the point where we do some digging in our index knowledge, and find that to use the index, query must use the condition set in the index. So, let’s try a different query.
EXPLAIN ANALYZE SELECT * FROM entity
WHERE exist(store,'key1') AND store->'key1' = '950555';
QUERY PLAN
---------------------------------------------------------------------------
Index Scan using key1_idx on entity (cost=0.00..2.17 rows=1 width=36)
(actual time=0.035..0.036 rows=1 loops=1)
Index Cond: ((store -> 'key1'::text) = '950555'::text)
Total runtime: 0.063 ms
(3 rows)
This is something that works. Now that we have choosen our index, and our query, we should add all indices to our database. This is one of the possible weakpoints to this solution, we have to statically create indices over all used keys, but this works in our use case, since we have only several of them, and they are static.
We didn’t take index maintenance into consideration, since our ration of reading towards writing allows it, but depending on your use case you might want to test that as well.
Conclusion
An approach to querying an hstore column over a limited and static set of keys can be solved using conditional indices. That’s what worked for us. There are other ways to index an hstore, of course, and you should try it with your use case. But, always measure, and always measure according to your specific use case. When creating an index, model it together with the query you are using.