Skip to content

carlosganzerla/pg_set

Repository files navigation

build

pg_set

This extensions adds integer sets to PostgreSQL. It provides a new data type pg_set that can store a set of integers (int4) efficiently, along with various functions and operators to manipulate these sets.

Features

  • Written in C
  • Efficient storage of integer sets using a mask (it's smaller than int4[] for small enough sets)
  • Array-compatible text representation and efficient casting to and from array
  • Mathematical set properties and operations: union, intersection, difference, containment, etc.
  • Index support for GIN, GiST and Hash
  • Statistics collection and array-like selectivity support for all indexable operators
  • Null elements are not supported. This is annoying for arrays and don't really make sense for sets.

Motivation

Originally I wrote this extension to learn about more the low-level details of Postgres types, but I ended up adding more and more features so I decided to publish it. The case that motivated me to write this extension was one where we stored sets of IDs that referenced an external table (see the quasi_monotonic on the benchmark). We felt that creating a table was overkill, so we needed to store sets of values. We also had a requirement that values could not be duplicated and the order didn't matter, so what we needed was a set, but a constrained array worked. Traditionally, PostgreSQL has the following options:

  • Use a plain int4[] array and ensure uniqueness at the application level or with functions (e.g. intarray).
  • Use hstore extension to store sets of integers as keys
  • Use jsonb extension to store sets of integers as keys

Back then we used intarray with constraints as it's pretty straightforward to use, but I thought that it'd be nice to have a set type to make it natural. Since I didn't find one, I decided to write this extension to learn more about PostgreSQL internals.

Installation

Clone the repository and run:

make
sudo make install

Then, in your database:

CREATE EXTENSION pg_set;

Usage

Basic usage

-- Creating sets
-- Array-compatible representation
SELECT '{1,2,3}'::pg_set;

-- No duplication
SELECT '{1,1,1,1,1}'::pg_set;

-- With int4 args
SELECT pg_set_create(1,2,3);

CREATE TABLE reference (
    id int4 GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    external_ids pg_set NOT NULL
);

INSERT INTO reference (external_ids) VALUES
    ('{1,2,3}'),
    ('{3,4,5,6}'),
    ('{7,8,9}'),
    ('{1,3,5,7,9}');

-- contains 1 and 3
SELECT * FROM reference WHERE external_ids @> '{1,3}';

-- is contained by the set
SELECT * FROM reference WHERE external_ids <@ '{1,2,3,4,7,8,9}';

-- contains 4
SELECT * FROM reference WHERE external_ids @> 4;

-- Same as above
SELECT * FROM reference WHERE 4 <@ external_ids;

-- contains 1 or 3
SELECT * FROM reference WHERE external_ids && '{1,3}';

-- Equality
SELECT * FROM reference WHERE external_ids = '{1,2,3}';

-- Inequality
SELECT * FROM reference WHERE external_ids <> '{1,2,3}';

-- Count of elements in the set
SELECT * FROM reference WHERE pg_set_count(external_ids) > 3;

Set operations

-- Union
UPDATE
    reference
SET
    external_ids = external_ids + '{4}'
RETURNING *;

-- Add element
UPDATE
    reference
SET
    external_ids = external_ids + 5
RETURNING *;

-- Works on both sides
UPDATE
    reference
SET
    external_ids = 5 + external_ids
RETURNING *;

-- Can also remove an element
UPDATE
    reference
SET
    external_ids = external_ids - 5
RETURNING *;

-- Interesection
UPDATE
    reference
SET
    external_ids = external_ids * '{3,4,5}'
WHERE
    external_ids && '{3,4,5}'
RETURNING *;

-- Difference
UPDATE
    reference
SET
    external_ids = external_ids - '{3,4,5}'
WHERE
    external_ids && '{3,4,5}'
RETURNING *;

ANALYZE support

Supports almost the same statistics as arrays (default stats + most_common_elems, most_common_elem_freqs and element_count_histogram). Does not support correlation and histogram_bounds as sets don't support less-than operators. Selectivity functions work for the @> (both set and integer cases), && and = operators (Postgres default works well here), which are all indexable operators so far.

Index support

GiST

The GiST index supports the @>, && and = operators through the gist_pg_set_ops. The implementation uses an RD-tree data structure with built-in lossy compression. It approximates sets as a bit array mask and also contains the minimum and maximum set elements to speed up overlap queries at the expense of index size. It has an optional masklen parameter which is the mask length in bits. It goes from 16 bits (2 bytes) to 16064 bits (2016 bytes). The default is 16 bytes. A higher masklen will increase precision at the expense of index size. So masklen must be balanced with index size for optimal performance.

CREATE INDEX ON reference USING gist (external_ids);
CREATE INDEX ON reference USING gist (external_ids gist_pg_set_ops (masklen=2048));

Exclusion constraints

You can also use exclusion constraints with sets using GiST. This makes theam great to use in conjunction with btree_gist:

CREATE EXTENSION btree_gist;

CREATE TABLE room_booking (
    id int4 GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    room_id int NOT NULL,
    booked_slots pg_set NOT NULL,
    EXCLUDE USING gist (room_id WITH =, booked_slots WITH &&)
);

INSERT INTO room_booking (room_id, booked_slots) VALUES
    (1, '{1,2,3}'),
    (1, '{5,6,7}'),
    (2, '{1,2,3}');

INSERT INTO room_booking (room_id, booked_slots) VALUES
    (1, '{3,9}'); -- Fails, overlaps with first entry

GIN

Supports the same operators as GiST. Works exactly as an int4[] GIN, as GIN is a tree of elements. It's implemented on the gin_pg_set_ops operator class.

CREATE INDEX ON reference USING gin (external_ids);

Hash

Supports only = as usual. It's implemented on the hash_pg_set_ops operator class.

CREATE INDEX ON reference USING hash (external_ids);

Benchmark

See pg_set_benchmark.

Future work

Currently, this extension supports only int4 element sets. It would be nice to add support for int2 and int8, but that requires a good amount of work, as we need to redefine basically all functions at the SQL level and find a way to reuse the internals without compromising performance. The same applies to float4 and float8, and pretty much any fixed-length, sortable type.

About

Integer set data type for PostgreSQL

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages