
Conversation


9ary commented Aug 17, 2024

Hi!

This PR is a dump of changes I've made while operating this script. Since this repo was relatively easy to find when I was looking for this exact functionality, I'm putting this here for visibility. I don't expect things to be mergeable in this state, but I'd be happy to clean it up if there's interest.

Other than the first commit (updating dependencies), which was necessary to get going, the only change I originally intended to make was adding support for syncing tags. Things quickly spiraled out of control when that change made the runtime explode, sending me down the rabbit hole of optimizing the script.
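
Roughly, the tag syncing boils down to a sketch like the following (not the literal diff; src_repo and dst_repo stand for PyGithub Repository handles for the upstream repo and its mirror fork):

```python
# Minimal sketch of tag syncing with PyGithub (not the literal diff).
# src_repo / dst_repo are Repository handles for the upstream repo and
# its mirror fork.
existing = {t.name: t.commit.sha for t in dst_repo.get_tags()}

for tag in src_repo.get_tags():
    if tag.name not in existing:
        # Forks share their network's object store, so the sha already
        # exists downstream and a new ref can point at it directly.
        dst_repo.create_git_ref(ref=f"refs/tags/{tag.name}", sha=tag.commit.sha)
    elif existing[tag.name] != tag.commit.sha:
        # The tag moved upstream: force-move the mirror's ref to match.
        dst_repo.get_git_ref(f"tags/{tag.name}").edit(tag.commit.sha, force=True)
```

Note how the naive version costs one request per out-of-date ref, which is exactly what blows up on forks with hundreds of tags.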

For context, the organization I'm mirroring has ~90 repos, most of which rarely change, but a few are forks of high-profile projects with hundreds of tags. On my first attempt, the job ran into the rate limit after about 40 minutes, and I ended up interrupting it.

I noticed the code was fetching downstream references one by one, so I changed it to fetch the entire list at once. This cut the number of requests dramatically and let the job complete faster than it did before I touched it.
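
Concretely, the batched version is essentially one paginated listing per mirror instead of one request per ref (a sketch, assuming PyGithub; each page of get_git_refs() is a single request covering many refs):

```python
# One paginated get_git_refs() listing per mirror instead of one
# get_git_ref() request per branch/tag.
dst_refs = {r.ref: r.object.sha for r in dst_repo.get_git_refs()}

# Afterwards, lookups are plain dict reads with no extra API requests:
sha = dst_refs.get("refs/tags/v1.0")
```
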
I then added some more batching, which improved runtimes a bit more, and finally threading for request concurrency, which got the runtime on par with the old fast updates.
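
The concurrency part is a plain thread pool over the per-repo work (a sketch; sync_repo is a hypothetical function that updates one mirror, org is the PyGithub Organization handle, and max_workers is a number to tune against the rate limit):

```python
from concurrent.futures import ThreadPoolExecutor

# PyGithub calls are blocking HTTP requests, so a thread pool overlaps
# the network latency across repos. sync_repo is a hypothetical per-repo
# update function; 8 workers is a guess, tune against the rate limit.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(sync_repo, org.get_repos()))
```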

I also attempted to reduce the amount of work by comparing dates for all repos, but ended up reverting that since the reliability tradeoff wasn't worth it.
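
For the curious, it was roughly this kind of short-circuit (a sketch of the reverted idea, not the exact commit; repo_pairs is a hypothetical list of (upstream, mirror) handles):

```python
# Sketch of the reverted idea: trust pushed_at timestamps to skip repos
# whose upstream hasn't changed. One plausible catch: if a run dies
# partway, the timestamps can claim a mirror is current when it isn't.
for src_repo, dst_repo in repo_pairs:  # hypothetical (upstream, mirror) pairs
    if src_repo.pushed_at <= dst_repo.pushed_at:
        continue  # assume the mirror is already up to date
    ...  # otherwise do the full branch/tag sync
```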

Since it's so fast now, I disabled fast updates entirely and am now running full updates twice daily.

Here are some rough numbers to give you an idea of the runtime progression:

  • initial run: 3 minutes (this only involved creating forks)
  • subsequent updates: 6-8 minutes for full runs and about 30s for fast runs
  • initial attempt at adding tags: 1h+, did not finish (interrupted)
  • first successful run with tags: 5m9s
  • subsequent full runs with tags: under 4m
  • fetching downstream repos in a single call: under 3m
  • caching pip packages: no measurable speedup
  • threading: 40s

It's possible to reduce the number of requests a bit more, by combining src_repo.get_branches() and src_repo.get_tags() into a single .get_git_refs() call, but otherwise I think it's close to optimal (unless it's possible to do more batching).
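
That consolidation would look roughly like this (a sketch; one caveat is that for annotated tags, get_git_refs() yields the sha of the tag object rather than the commit, unlike get_tags()):

```python
# One get_git_refs() listing replaces the separate get_branches() and
# get_tags() calls.
branches, tags = {}, {}
for ref in src_repo.get_git_refs():
    name = ref.ref.split("/", 2)[-1]
    if ref.ref.startswith("refs/heads/"):
        branches[name] = ref.object.sha
    elif ref.ref.startswith("refs/tags/"):
        tags[name] = ref.object.sha  # tag-object sha for annotated tags
```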

Reducing the number of API calls helps a lot with avoiding rate limits, while pure wall-clock optimizations are useful if the repo running the mirroring job is private, since github then "bills" by the minute (the free plan gets 2000 minutes monthly, but public repos are apparently unmetered).
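
(Back of the envelope: two full runs a day at under 4 minutes each, rounded up per job, is about 2 × 4 × 30 ≈ 240 billed minutes a month, comfortably inside the 2000-minute free tier.)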

9ary added 13 commits May 13, 2024 14:53

  • This should cut down on the number of requests significantly to speed up the process and avoid hitting the rate limit.
  • Fast runs finish very quickly, but still count as a full minute due to how github does accounting. It's cheaper on credits to run the full update a bit more often, and more reliable too.
  • No point over-optimizing for now, let's prioritize reliability. This reverts commit 1ba8993.
