Add script for compression + pickle deletion#1095

Merged
KaspariK merged 2 commits into master from u/kkasp/compress-uncompressed on Feb 26, 2026
Conversation

@KaspariK (Member) commented Feb 18, 2026

Script for migrating Tron's DynamoDB state store to gzip-compressed binary. I've added 3 commands:

  • status: Scans the table and reports how many keys need migration.
  • compress: Gzip-compresses items/partitions using TransactWriteItems. Initially this was slow enough that I wanted to run it while Tron was live, so I added some ConditionExpression guards to help prevent conflicting writes. I've since sped it up quite a bit, so if we really want we can take Tron down while this runs and it shouldn't be too terrible; that said, the most heavily written jobs are already migrated, so what remains is mostly historic runs and less frequent jobs.
  • delete-pickles: Removes pickle data (val, num_partitions). This will run once we've stopped writing pickles in #TODO
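For illustration, here's a minimal sketch of what one compress step might look like. The table name, key schema, and attribute names below are assumptions for the sketch, not the script's actual schema: gzip the stored value, then write it back as one entry of a TransactWriteItems call whose ConditionExpression guards against a concurrent write.

```python
import gzip

TABLE = "tron-state"  # hypothetical table name, not the real one


def compress_value(raw: bytes) -> bytes:
    """Gzip the raw stored value; gzip.decompress() reverses it on read."""
    return gzip.compress(raw)


def build_update(key: str, index: int, raw: bytes) -> dict:
    """One Update entry for a TransactWriteItems request.

    The ConditionExpression only allows the write if the item still holds
    the exact bytes we read, so a concurrent write by a running Tron makes
    the transaction fail instead of being silently clobbered.
    """
    return {
        "Update": {
            "TableName": TABLE,
            "Key": {"key": {"S": key}, "index": {"N": str(index)}},
            "UpdateExpression": "SET val = :new",
            "ConditionExpression": "val = :old",
            "ExpressionAttributeValues": {
                ":old": {"B": raw},
                ":new": {"B": compress_value(raw)},
            },
        }
    }
```

Batches of such entries would then be passed to boto3's `client.transact_write_items(TransactItems=[...])`.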

Testing

  • Dev: https://fluffy.yelpcorp.com/i/gnnHl4fLZMwDNcGlCJpkZHCpQXtDjMX6.html
  • Prod: https://fluffy.yelpcorp.com/i/J1B6tKLhsK4m9p448Hr26STtqqSf6fzz.html
  • 8 workers seems like a sweet spot. I think this could be cut down a bit more by skipping some of the JSON validation, but I'm not trying to set records or anything 🤷‍♂️.
  • Peaks at around 30% of our write capacity, which is totally fine with our overall write usage.
  • The failed keys are from throttling (the usual suspects); they throttle on literally the first transaction. Since the write and the deletes (technically also writes) are in the same transaction, we generate enough traffic to hit the hot-key issue. I can either convert these manually with --keys a few at a time beforehand, or do so afterwards; I don't think it makes a difference.
  • Restarts before and after succeeded with no issues.
  • Skipped keys are old failed deletes. Harmless cruft.
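The 8-worker fan-out with failed-key collection described above could be sketched roughly like this (migrate_one and the error handling are placeholders; the real script's worker logic and retry story will differ). Throttled keys are collected rather than retried inline, so they can be re-run later via --keys:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_migration(keys, migrate_one, workers=8):
    """Compress keys on a small worker pool.

    Returns the keys whose migration raised (e.g. throttled hot
    partitions) so they can be retried later with --keys.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(migrate_one, key): key for key in keys}
        for fut in as_completed(futures):
            if fut.exception() is not None:
                failed.append(futures[fut])
    return sorted(failed)
```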

General plan:

  • Compress
    • Via this script after it's merged
  • Remove pickle reads (doing this first keeps things more revert-safe, since in the worst case we'd still be writing new runs)
  • Remove pickle writes
    • TODO
  • Remove pickles
    • Via this script after it's merged

@KaspariK KaspariK marked this pull request as ready for review February 24, 2026 21:11
@KaspariK KaspariK requested a review from a team as a code owner February 24, 2026 21:11
@nemacysts (Member) left a comment

(i'm assuming this is mostly temporary, and the general skeleton looks fine to me!)

# Max DynamoDB object size is 400KB. Since we save two copies of the object (pickled and JSON),
# we need to consider this max size applies to the entire item, so we use a max size of 200KB
# for each version.
OBJECT_SIZE = 150_000
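As an aside, a minimal sketch of the partitioning arithmetic this cap implies (assumed from the comment above; the actual logic lives in tron/serialize/runstate/dynamodb_state_store.py):

```python
import math

OBJECT_SIZE = 150_000  # per-item cap quoted in the snippet above


def partitions_needed(blob: bytes) -> int:
    """Number of DynamoDB items a serialized value spans under the cap."""
    return max(1, math.ceil(len(blob) / OBJECT_SIZE))
```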
@nemacysts:

i was gonna ask if we should grab this from tron/serialize/runstate/dynamodb_state_store.py - but i assume we'll delete this once we're done so it doesn't really matter?

@KaspariK (Author) replied:

Yeah pretty much

@KaspariK KaspariK merged commit 0530334 into master Feb 26, 2026
3 checks passed
