`pg_set`

This extensions adds integer sets to PostgreSQL. It provides a new data type pg_set that can store a set of integers (int4) efficiently, along with various functions and operators to manipulate these sets.

Features

Written in C
Efficient storage of integer sets using a mask (it's smaller than int4[] for small enough sets)
Array-compatible text representation and efficient casting to and from array
Mathematical set properties and operations: union, intersection, difference, containment, etc.
Index support for GIN, GiST and Hash
Statistics collection and array-like selectivity support for all indexable operators
Null elements are not supported. This is annoying for arrays and don't really make sense for sets.

Motivation

Originally I wrote this extension to learn about more the low-level details of Postgres types, but I ended up adding more and more features so I decided to publish it. The case that motivated me to write this extension was one where we stored sets of IDs that referenced an external table (see the quasi_monotonic on the benchmark). We felt that creating a table was overkill, so we needed to store sets of values. We also had a requirement that values could not be duplicated and the order didn't matter, so what we needed was a set, but a constrained array worked. Traditionally, PostgreSQL has the following options:

Use a plain int4[] array and ensure uniqueness at the application level or with functions (e.g. intarray).
Use hstore extension to store sets of integers as keys
Use jsonb extension to store sets of integers as keys

Back then we used intarray with constraints as it's pretty straightforward to use, but I thought that it'd be nice to have a set type to make it natural. Since I didn't find one, I decided to write this extension to learn more about PostgreSQL internals.

Installation

Clone the repository and run:

make
sudo make install

Then, in your database:

CREATE EXTENSION pg_set;

Usage

Basic usage

-- Creating sets
-- Array-compatible representation
SELECT '{1,2,3}'::pg_set;

-- No duplication
SELECT '{1,1,1,1,1}'::pg_set;

-- With int4 args
SELECT pg_set_create(1,2,3);

CREATE TABLE reference (
    id int4 GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    external_ids pg_set NOT NULL
);

INSERT INTO reference (external_ids) VALUES
    ('{1,2,3}'),
    ('{3,4,5,6}'),
    ('{7,8,9}'),
    ('{1,3,5,7,9}');

-- contains 1 and 3
SELECT * FROM reference WHERE external_ids @> '{1,3}';

-- is contained by the set
SELECT * FROM reference WHERE external_ids <@ '{1,2,3,4,7,8,9}';

-- contains 4
SELECT * FROM reference WHERE external_ids @> 4;

-- Same as above
SELECT * FROM reference WHERE 4 <@ external_ids;

-- contains 1 or 3
SELECT * FROM reference WHERE external_ids && '{1,3}';

-- Equality
SELECT * FROM reference WHERE external_ids = '{1,2,3}';

-- Inequality
SELECT * FROM reference WHERE external_ids <> '{1,2,3}';

-- Count of elements in the set
SELECT * FROM reference WHERE pg_set_count(external_ids) > 3;

Set operations

-- Union
UPDATE
    reference
SET
    external_ids = external_ids + '{4}'
RETURNING *;

-- Add element
UPDATE
    reference
SET
    external_ids = external_ids + 5
RETURNING *;

-- Works on both sides
UPDATE
    reference
SET
    external_ids = 5 + external_ids
RETURNING *;

-- Can also remove an element
UPDATE
    reference
SET
    external_ids = external_ids - 5
RETURNING *;

-- Interesection
UPDATE
    reference
SET
    external_ids = external_ids * '{3,4,5}'
WHERE
    external_ids && '{3,4,5}'
RETURNING *;

-- Difference
UPDATE
    reference
SET
    external_ids = external_ids - '{3,4,5}'
WHERE
    external_ids && '{3,4,5}'
RETURNING *;

`ANALYZE` support

Supports almost the same statistics as arrays (default stats + most_common_elems, most_common_elem_freqs and element_count_histogram). Does not support correlation and histogram_bounds as sets don't support less-than operators. Selectivity functions work for the @> (both set and integer cases), && and = operators (Postgres default works well here), which are all indexable operators so far.

Index support

GiST

The GiST index supports the @>, && and = operators through the gist_pg_set_ops. The implementation uses an RD-tree data structure with built-in lossy compression. It approximates sets as a bit array mask and also contains the minimum and maximum set elements to speed up overlap queries at the expense of index size. It has an optional masklen parameter which is the mask length in bits. It goes from 16 bits (2 bytes) to 16064 bits (2016 bytes). The default is 16 bytes. A higher masklen will increase precision at the expense of index size. So masklen must be balanced with index size for optimal performance.

CREATE INDEX ON reference USING gist (external_ids);
CREATE INDEX ON reference USING gist (external_ids gist_pg_set_ops (masklen=2048));

Exclusion constraints

You can also use exclusion constraints with sets using GiST. This makes theam great to use in conjunction with btree_gist:

CREATE EXTENSION btree_gist;

CREATE TABLE room_booking (
    id int4 GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    room_id int NOT NULL,
    booked_slots pg_set NOT NULL,
    EXCLUDE USING gist (room_id WITH =, booked_slots WITH &&)
);

INSERT INTO room_booking (room_id, booked_slots) VALUES
    (1, '{1,2,3}'),
    (1, '{5,6,7}'),
    (2, '{1,2,3}');

INSERT INTO room_booking (room_id, booked_slots) VALUES
    (1, '{3,9}'); -- Fails, overlaps with first entry

GIN

Supports the same operators as GiST. Works exactly as an int4[] GIN, as GIN is a tree of elements. It's implemented on the gin_pg_set_ops operator class.

CREATE INDEX ON reference USING gin (external_ids);

Hash

Supports only = as usual. It's implemented on the hash_pg_set_ops operator class.

CREATE INDEX ON reference USING hash (external_ids);

Benchmark

See pg_set_benchmark.

Future work

Currently, this extension supports only int4 element sets. It would be nice to add support for int2 and int8, but that requires a good amount of work, as we need to redefine basically all functions at the SQL level and find a way to reuse the internals without compromising performance. The same applies to float4 and float8, and pretty much any fixed-length, sortable type.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.github/workflows		.github/workflows
data		data
expected		expected
sql		sql
.clang-format		.clang-format
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
format.sh		format.sh
hash_set.c		hash_set.c
pg_set--1.0.sql		pg_set--1.0.sql
pg_set.control		pg_set.control
pg_set.h		pg_set.h
pg_set_analyze.c		pg_set_analyze.c
pg_set_gin.c		pg_set_gin.c
pg_set_gist.c		pg_set_gist.c
pg_set_io.c		pg_set_io.c
pg_set_op.c		pg_set_op.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

`pg_set`

Features

Motivation

Installation

Usage

Basic usage

Set operations

`ANALYZE` support

Index support

GiST

Exclusion constraints

GIN

Hash

Benchmark

Future work

About

Uh oh!

Releases

Packages

Languages

License

carlosganzerla/pg_set

Folders and files

Latest commit

History

Repository files navigation

pg_set

Features

Motivation

Installation

Usage

Basic usage

Set operations

ANALYZE support

Index support

GiST

Exclusion constraints

GIN

Hash

Benchmark

Future work

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`pg_set`

`ANALYZE` support

Packages