
Performance improvements #11

Open
Glandos wants to merge 10 commits into DyonR:main from Glandos:pg_perf

Conversation

Glandos (Contributor) commented Feb 16, 2025

It changes the SQL a lot, but I kept all existing checks:

  • don't query files for torrents that are not inserted
  • use basic types as much as possible (no strftime, no hex)
  • run SQLite queries in batches instead of using a window
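The batching idea can be sketched with `sqlite3`'s `fetchmany()`, which pulls a fixed number of rows per call from a single cursor instead of re-running a windowed/paged query per chunk. Table name, schema, and sizes here are illustrative, not the PR's actual schema:

```python
import sqlite3

# Illustrative schema: a tiny stand-in for the magnetico source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE torrents (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO torrents (name) VALUES (?)",
    [(f"t{i}",) for i in range(10)],
)

BATCH_SIZE = 4
cursor = conn.execute("SELECT id, name FROM torrents ORDER BY id")
batches = []
while True:
    rows = cursor.fetchmany(BATCH_SIZE)  # one scan, no re-query per page
    if not rows:
        break
    batches.append(rows)

print([len(b) for b in batches])
```

With 10 rows and a batch size of 4 this yields batches of 4, 4, and 2, all served from one cursor pass over the table.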

Here are my results, run on an AMD Ryzen Pro 4750G with 16 GB RAM on an SSD, against a magnetico database of 29 million entries. Argument list: --source-name magnetico --add-files --add-files-limit 200 --insert-content

  • On an empty PostgreSQL target: from 260 it/s to 980 it/s
  • On an already filled PostgreSQL target (so a lot of conflicts): from 260 it/s (unchanged) to 6000 it/s

So it's a huge win for a pre-filled PostgreSQL target, since in that case the import never needs to query.
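The "never needs to query" behavior suggests letting the database skip duplicates itself, e.g. with an `INSERT ... ON CONFLICT DO NOTHING`, instead of a SELECT-then-INSERT round trip per row. A minimal sketch with hypothetical table and column names, demonstrated with SQLite's compatible upsert syntax (PostgreSQL accepts the same clause):

```python
import sqlite3

# Hypothetical schema: duplicates are rejected server-side by the primary
# key, so the importer never has to SELECT first to check for existence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE torrents (info_hash BLOB PRIMARY KEY, name TEXT)")

rows = [
    (b"\x01" * 20, "first"),
    (b"\x01" * 20, "duplicate of first"),  # conflicting info_hash
    (b"\x02" * 20, "second"),
]
conn.executemany(
    "INSERT INTO torrents (info_hash, name) VALUES (?, ?) "
    "ON CONFLICT (info_hash) DO NOTHING",
    rows,
)
count = conn.execute("SELECT COUNT(*) FROM torrents").fetchone()[0]
print(count)
```

Only two rows survive; the conflicting insert is silently dropped without any read query from the client side.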

Please double-check that I didn't miss something.
I didn't include #10 in this, but I ran my tests with it.

Glandos and others added 9 commits February 9, 2025 16:20

- don't query files for torrents that are not inserted
- use basic types as much as possible
- run sqlite in batch instead of window
- but use it after computing details, since it filters paths with only nul bytes
Glandos (Contributor, Author) commented Feb 26, 2025

I've added support for pgcopy.

The code is somewhat more complicated, but performance is nearly doubled.
The only missing bit is insertion into torrent_contents, because it uses a tsvector, which pgcopy does not support yet: altaurog/pgcopy#47
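For context, pgcopy's bulk loading revolves around its `CopyManager`, which streams rows over PostgreSQL's binary COPY protocol in a single round trip. A minimal sketch, with hypothetical table/column names and record shapes (the actual code in this PR differs, and a live psycopg2 connection is needed to run the copy):

```python
from datetime import datetime, timezone

def copy_torrents(conn, records):
    """Bulk-load records via binary COPY instead of per-row INSERTs.

    `conn` is assumed to be an open psycopg2 connection; table and
    column names below are illustrative only.
    """
    from pgcopy import CopyManager  # imported lazily: requires psycopg2

    cols = ("info_hash", "name", "size", "published_at")
    mgr = CopyManager(conn, "torrents", cols)
    mgr.copy(records)  # streams all rows in one COPY ... FROM STDIN

# Example record shape: native Python types throughout, consistent with
# the PR's "no strftime, no hex" approach.
example = (b"\x01" * 20, "ubuntu.iso", 4_700_000_000, datetime.now(timezone.utc))
print(len(example))
```

Note pgcopy encodes each Python value directly into PostgreSQL's binary wire format, which is why unsupported column types such as tsvector block the remaining torrent_contents insertion.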

Glandos marked this pull request as ready for review February 26, 2025 23:00
Glandos (Contributor, Author) commented Feb 26, 2025

Using pgcopy for file insertion removed the detection of duplicated files that are all empty. I don't know whether I can restore that easily, since it's done in a generator.

Glandos (Contributor, Author) commented Mar 1, 2025

| Target DB status | main output | pg_perf output |
| --- | --- | --- |
| empty | 2282/29000001 [00:30<107:59:03, 74.59it/s] | 58000/29000001 [00:30<4:15:12, 1890.13it/s] |
| full | 9811/29000001 [00:30<25:05:19, 320.97it/s] | 226000/29000001 [00:30<1:05:05, 7367.77it/s] |

"empty" is a database without any entries. "full" is the database with all entries existing. This is useful for re-importing entries with only a few changes, e.g. merging divergent data.

The benchmark was run with:

find ./bitmagnet/base/ -type f -exec dd if={} iflag=nocache count=0 status=none \; && systemctl restart postgresql@17-bitmagnet
dd if=./database.sqlite3 iflag=nocache count=0; uv run --with 'pgcopy @ https://github.com/Glandos/pgcopy/archive/refs/heads/tsvector.zip' --with 'tqdm' magnetico2database.py --dbname bitmagnet_full --user bitmagnet --password 'XXX' --host localhost --port 5433 --source-name magnetico --add-files --add-files-limit 200 --insert-content ./database.sqlite3
  • dd is useful to evict the file cache.
  • pgcopy is my own branch, waiting to be merged.
  • I replaced bitmagnet_full with bitmagnet_empty as the dbname for the other test.
