Compress raw payload #6

@simark

Just a random idea, but we could compress the raw payload stored in the dynamic data table, since that's the table that is going to grow a lot over time.

I checked for the length of the values in that column in my inventory:

100-200 chars:      284
300-400 chars:     3575
400-500 chars:   100767
500-600 chars:     1200
600-700 chars:       42
700-800 chars:        2
1100-1200 chars:    291
1200-1300 chars:     27
1500-1600 chars:     33
1600-1700 chars:      1

I searched a bit about compression of small strings. The main suggestion I found was to use an algorithm that allows passing an initial dictionary already "trained" on a data set representative of what you want to compress. You just need to make sure to pass that same dictionary to the decompressor. So we would generate such a dictionary once, embed it in the source, and it would never change.

A pre-trained dictionary matters less for long inputs, since the algorithm "optimizes" its dictionary as it goes, which amortizes the cost of compressing the beginning of the input with a default (non-optimal for this input) dictionary. But for short strings like these, that non-optimal portion makes up most of the input.

In the Python standard library, there's zlib we could use, with the zdict parameter:

https://docs.python.org/3/library/zlib.html#zlib.compressobj
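To illustrate the mechanism, here is a minimal sketch of zlib's zdict parameter. The dictionary bytes and payload below are made up for illustration; a real dictionary would be built from representative payloads and embedded in the source:

```python
import zlib

# Hypothetical pre-shared dictionary: substrings expected to occur
# often in the payloads.
zdict = b'{"sku": "", "price": , "currency": "CAD"}'

payload = b'{"sku": "ABC123", "price": 19.99, "currency": "CAD"}'

# Compress with the preset dictionary.
co = zlib.compressobj(zdict=zdict)
compressed = co.compress(payload) + co.flush()

# The decompressor must be given the exact same dictionary,
# otherwise decompression fails.
do = zlib.decompressobj(zdict=zdict)
assert do.decompress(compressed) == payload
```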

There's also the pyzstd library:

https://pyzstd.readthedocs.io/en/latest/#zstd-dict

I made this test for fun with pyzstd, to see how much we would gain by using a pre-trained dictionary:

import sqlite3
import pyzstd

s = sqlite3.connect("/home/smarchi/src/CanadianTracker/inventory.db")
rows = s.execute("SELECT raw_payload FROM products_dynamic").fetchall()
rows = [x[0].encode() for x in rows]

print("Building dictionary")
# Train a ~110 KiB dictionary on the payloads themselves.
d = pyzstd.train_dict(rows, 110 * 1024)

compressor_without_d = pyzstd.ZstdCompressor()
compressor_with_d = pyzstd.ZstdCompressor(zstd_dict=d)


def compress_without_d(r):
    # FLUSH_FRAME makes each row an independent frame, so each row could
    # be decompressed individually, as it would be in the database.
    return compressor_without_d.compress(r, mode=pyzstd.ZstdCompressor.FLUSH_FRAME)


def compress_with_d(r):
    return compressor_with_d.compress(r, mode=pyzstd.ZstdCompressor.FLUSH_FRAME)


print("Compressing without dict")
rows_compressed_without_d = [compress_without_d(r) for r in rows]
print("Compressing with dict")
rows_compressed_with_d = [compress_with_d(r) for r in rows]

size_uncompressed = sum(len(x) for x in rows)
size_without_d = sum(len(x) for x in rows_compressed_without_d)
size_with_d = sum(len(x) for x in rows_compressed_with_d)

print(f"Size uncompressed: {size_uncompressed // 1024 // 1024} MiB")
print(f"Size without dict: {size_without_d // 1024 // 1024} MiB")
print(f"Size with dict: {size_with_d // 1024 // 1024} MiB")

Results are:

Size uncompressed: 46 MiB
Size without dict: 33 MiB
Size with dict: 7 MiB

So, it seems worth it to me.
