Misc improvements #3
Hi!
This PR is a dump of changes I've made while operating this script. Since this repo was relatively easy to find when I was looking for this exact functionality, I'm putting this here for visibility. I don't expect things to be mergeable in this state, but I'd be happy to clean it up if there's interest.
Other than the first commit (updating dependencies), which was necessary to get going, the only change I originally intended to make was adding support for syncing tags. That change made the runtime explode, which sent me down the rabbit hole of optimizing the script.
For context, the organization I'm mirroring has ~90 repos, most of which rarely change, but a few are forks of high-profile projects with hundreds of tags. On my first attempt, the job ran into the rate limit after about 40 minutes, and I ended up interrupting it.
I noticed the code was fetching downstream references one by one, so I changed it to fetch the entire list at once. This cut the number of requests dramatically and allowed the job to complete faster than it did before I touched it.
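To illustrate the batching change, here's a minimal sketch. The `Repo` interface, method names, and ref representation are all assumptions for illustration, not the script's actual API:

```python
# Sketch of the one-by-one vs. batched lookup, using a hypothetical
# downstream repo object. upstream_refs is a list of (ref_name, sha) pairs.

def refs_needing_update(upstream_refs, downstream_repo):
    """Slow variant: one API request per ref to look up the downstream SHA."""
    stale = []
    for name, sha in upstream_refs:
        if downstream_repo.get_ref_sha(name) != sha:  # one request each
            stale.append(name)
    return stale

def refs_needing_update_batched(upstream_refs, downstream_repo):
    """Fast variant: fetch the whole downstream ref list in one go."""
    downstream = dict(downstream_repo.get_git_refs())  # single (paginated) call
    return [name for name, sha in upstream_refs
            if downstream.get(name) != sha]
```

Both variants return the same list of stale refs; the batched one issues one listing request instead of one request per ref, which is what cuts the request count for repos with hundreds of tags.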
I then added some more batching, which improved runtimes a bit further, and finally added threading for request concurrency, which brought the runtime on par with the old fast updates.
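The threading part could be as simple as a thread pool over repos; this is a sketch under assumed names (`mirror_one`, `max_workers` are illustrative), not the script's actual structure:

```python
from concurrent.futures import ThreadPoolExecutor

def mirror_all(repos, mirror_one, max_workers=8):
    """Mirror every repo, processing several in parallel.

    The per-repo work is dominated by API round-trips (I/O-bound), so
    plain threads overlap the waiting without needing multiprocessing.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises worker exceptions
        return list(pool.map(mirror_one, repos))
```

Note that concurrency only improves wall-clock time, not the request count, so it doesn't help with rate limits on its own.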
I also attempted to reduce the amount of work by comparing last-modified dates for all repos and skipping unchanged ones, but ended up reverting that, since the reliability tradeoff wasn't worth it.
Since it's so fast now, I disabled fast updates entirely and am now running full updates twice daily.
Here are some rough numbers to give you an idea of the runtime progression:
It's possible to reduce the number of requests a bit further by combining `src_repo.get_branches()` and `src_repo.get_tags()` into a single `.get_git_refs()` call, but otherwise I think it's close to optimal (unless more batching is possible).

Reducing the number of API calls helps a lot to avoid hitting rate limits, while pure wall-clock runtime optimizations are useful if the repo running the mirroring job is private, since GitHub then "bills" by the minute (the free plan gets 2,000 minutes monthly, but public repos are apparently unmetered).
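If the combined call were adopted, splitting the single listing back into branches and tags is straightforward, since Git namespaces its refs. A sketch, assuming the refs come back as fully qualified names like `refs/heads/main` and `refs/tags/v1.0`:

```python
def split_refs(ref_names):
    """Split a combined ref listing (one get_git_refs()-style call) into
    branch and tag names. Branches live under refs/heads/ and tags under
    refs/tags/; anything else (e.g. refs/pull/*) is ignored."""
    branches, tags = [], []
    for ref in ref_names:
        if ref.startswith("refs/heads/"):
            branches.append(ref[len("refs/heads/"):])
        elif ref.startswith("refs/tags/"):
            tags.append(ref[len("refs/tags/"):])
    return branches, tags
```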