Compress raw payload #6

@simark

Just a random idea, but we could compress the raw payload stored in the dynamic data table, since that's the table that is going to grow a lot over time.

I checked for the length of the values in that column in my inventory:

100-200 chars:      284
300-400 chars:     3575
400-500 chars:   100767
500-600 chars:     1200
600-700 chars:       42
700-800 chars:        2
1100-1200 chars:    291
1200-1300 chars:     27
1500-1600 chars:     33
1600-1700 chars:      1

I searched a bit about compression of small strings. The main suggestion I found was to use an algorithm that allows passing an initial dictionary already "trained" on a data set representative of what you want to compress. You just need to make sure to pass that same dictionary to the decompressor. So we would generate such a dictionary once, embed it in the source, and it would never change.

A pre-trained dictionary matters less for long inputs, since the algorithm "optimizes" its dictionary as it goes, which amortizes the cost of compressing the beginning of the input with a default (non-optimal for this input) dictionary. But for short strings like these, that non-optimal portion makes up most of the input.

In the Python standard library, there's zlib we could use, with the zdict parameter:

https://docs.python.org/3/library/zlib.html#zlib.compressobj
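To illustrate the mechanism, here is a minimal sketch of zlib's zdict parameter. The dictionary bytes and payload below are made up for illustration; a real dictionary would be built from representative payloads and embedded in the source:

```python
import zlib

# Hypothetical pre-shared dictionary: substrings expected to occur
# often in the payloads.
zdict = b'{"sku": "", "price": , "currency": "CAD"}'

payload = b'{"sku": "ABC123", "price": 19.99, "currency": "CAD"}'

# Compress with the preset dictionary.
co = zlib.compressobj(zdict=zdict)
compressed = co.compress(payload) + co.flush()

# The decompressor must be given the exact same dictionary,
# otherwise decompression fails.
do = zlib.decompressobj(zdict=zdict)
assert do.decompress(compressed) == payload
```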

There's also the pyzstd library:

https://pyzstd.readthedocs.io/en/latest/#zstd-dict

I made this test for fun with pyzstd, to see how much we would gain by using a pre-trained dictionary:

import sqlite3
import pyzstd

s = sqlite3.connect("/home/smarchi/src/CanadianTracker/inventory.db")
rows = s.execute("SELECT raw_payload FROM products_dynamic").fetchall()
rows = [x[0].encode() for x in rows]

print("Building dictionary")
# Train a ~110 KiB dictionary on the payloads themselves.
d = pyzstd.train_dict(rows, 110 * 1024)

compressor_without_d = pyzstd.ZstdCompressor()
compressor_with_d = pyzstd.ZstdCompressor(zstd_dict=d)


def compress_without_d(r):
    # FLUSH_FRAME makes each row an independent frame, so each row could
    # be decompressed individually, as it would be in the database.
    return compressor_without_d.compress(r, mode=pyzstd.ZstdCompressor.FLUSH_FRAME)


def compress_with_d(r):
    return compressor_with_d.compress(r, mode=pyzstd.ZstdCompressor.FLUSH_FRAME)


print("Compressing without dict")
rows_compressed_without_d = [compress_without_d(r) for r in rows]
print("Compressing with dict")
rows_compressed_with_d = [compress_with_d(r) for r in rows]

size_uncompressed = sum(len(x) for x in rows)
size_without_d = sum(len(x) for x in rows_compressed_without_d)
size_with_d = sum(len(x) for x in rows_compressed_with_d)

print(f"Size uncompressed: {size_uncompressed // 1024 // 1024} MiB")
print(f"Size without dict: {size_without_d // 1024 // 1024} MiB")
print(f"Size with dict: {size_with_d // 1024 // 1024} MiB")

Results are:

Size uncompressed: 46 MiB
Size without dict: 33 MiB
Size with dict: 7 MiB

So, it seems worth it to me.
