fix: use read lock for clone reuse check to unblock parallel plans#6376
fix: use read lock for clone reuse check to unblock parallel plans#6376matthewmrichter wants to merge 9 commits intorunatlantis:mainfrom
Conversation
|
This PR addresses the root cause behind #5364 and #1618. The core issue is that The fix adds a read-lock fast path for the common case (directory exists, correct commit), allowing all goroutines to pass through simultaneously. |
5184efa to
59f7b17
Compare
There was a problem hiding this comment.
Pull request overview
This PR implements an optimistic read-lock fast path in the Clone() method to unblock parallel plans. When the clone directory already exists and is at the correct commit (the common case for parallel plans), the check now runs under a read lock instead of an exclusive write lock. This allows concurrent plan goroutines to pass the clone check simultaneously instead of serializing on the write lock. If the fast path fails, execution falls through to the existing write-lock path for cloning or updating.
Changes:
- Add read-lock fast path for the common case (directory exists and is up-to-date)
- Move
wrappedGitContextdeclaration earlier to support both fast and slow paths - Preserve existing write-lock path as fallback for clone/update operations
Root Cause: introduced in v0.41.0 by PR #6230The serialization behavior was introduced by commit That PR made two changes that together create the serialization:
The combination creates a deadlock-like serialization: Prior to v0.41.0, The fix in this PR adds a read-lock fast path in |
Safety analysis — this fix preserves the protections from #6230CC @pseudomorph @jamengual @lukemassa — as author/reviewers of #6230 which introduced the git fencing. #6230 solved two real problems with
This PR's fast path is safe because it only fires when no git mutation is needed:
Merge strategy is handled correctly: The pre-existing gap between Net effect: parallel plans that don't need git mutations (the common autoplan case) can run concurrently, while plans that do need mutations still serialize safely through the write lock. |
|
This seems sensible to me. Will need the others to chime in for approval. |
| // Fast path: if the directory already exists and is at the right commit, | ||
| // verify under a READ lock so we don't block concurrent plan operations. | ||
| // isBranchAtTargetRef only runs "git rev-parse" which is read-only. | ||
| if _, err := os.Stat(cloneDir); err == nil { |
There was a problem hiding this comment.
This is a pre-existing pattern, not introduced by this PR. The cloneDir variable is constructed by w.cloneDir() (line 100) which builds a path under Atlantis's managed data directory — this same variable is already used in os.Stat(), os.RemoveAll(), os.MkdirAll(), and git clone throughout the existing code (attemptReuseCloneDir, forceClone, etc.).
This PR adds one new os.Stat(cloneDir) call on line 106 as an early check before the existing code paths. The path construction and all downstream filesystem operations are unchanged.
The values that feed into cloneDir (repo name, PR number, workspace) come from GitHub webhook payloads which are validated by Atlantis's repo allowlist (ATLANTIS_REPO_ALLOWLIST) and GitHub's own repo naming constraints. Path traversal is not possible through these values.
|
I think this issue might also be triggered with MergeAgain. We've been running an image with this fix in. So far, so good. I'll put a fix together for the MergeAgain issue and test it out. If it works, I'll put the patch up as a review comment. |
|
Confirmed and fixed. This is what I used: ❯ git diff upstream/main..HEAD server/events/working_dir.go | cat
diff --git a/server/events/working_dir.go b/server/events/working_dir.go
index ecf7c6e0..160892e8 100644
--- a/server/events/working_dir.go
+++ b/server/events/working_dir.go
@@ -157,6 +169,12 @@ func (w *FileWorkspace) attemptReuseCloneDir(logger logging.SimpleLogging, c wra
// MergeAgain merges again with upstream if we are using the merge checkout strategy,
// and upstream has been modified since we last checked.
// It returns a flag indicating whether we had to merge with upstream again.
+//
+// Locking strategy: recheckDiverged uses the read and ref lock to serialize git metadata
+// operations (URL updates, fetch, status) without taking the repo write lock.
+// The write lock is only acquired when we actually need to merge.
func (w *FileWorkspace) MergeAgain(
logger logging.SimpleLogging,
headRepo models.Repo,
@@ -168,28 +186,31 @@ func (w *FileWorkspace) MergeAgain(
}
cloneDir := w.cloneDir(p.BaseRepo, p, workspace)
- // We atomically set the recheckRequiredMap flag here before grabbing the clone lock.
- // If the flag is cleared after we grab the lock, it means some other thread
- // did the necessary work late enough that we do not have to do it again.
+ // We atomically set the recheckRequiredMap flag here before grabbing any lock.
+ // If the flag is cleared after we grab the write lock, it means some other
+ // thread did the necessary work late enough that we do not have to do it again.
recheckRequiredMap.Store(cloneDir, struct{}{})
- // Unconditionally wait for the clone lock here, if anyone else is doing any clone
- // operation in this directory, we wait for it to finish before we check anything.
+ c := wrappedGitContext{cloneDir, headRepo, p}
+
+ if !w.recheckDiverged(logger, p, headRepo, cloneDir) {
+ recheckRequiredMap.Delete(cloneDir)
+ return false, nil
+ }
+
+ // Diverged — acquire the write lock to perform the merge, which mutates the
+ // working tree (git reset --hard + git merge).
gitWriteUnlockFn := w.gitWriteLock(cloneDir)
defer gitWriteUnlockFn()
if _, exists := recheckRequiredMap.Load(cloneDir); !exists {
- logger.Debug("Skipping upstream check. Some other thread has done this for us")
+ logger.Debug("Skipping upstream merge. Some other thread has done this for us")
return false, nil
}
recheckRequiredMap.Delete(cloneDir)
- c := wrappedGitContext{cloneDir, headRepo, p}
- if w.recheckDiverged(logger, p, headRepo, cloneDir) {
- logger.Info("base branch may have been updated, using merge strategy and will merge again")
- return true, w.mergeAgain(logger, c)
- }
- return false, nil
+ logger.Info("base branch may have been updated, using merge strategy and will merge again")
+ return true, w.mergeAgain(logger, c)
}
// recheckDiverged returns true if the branch we're merging into has diverged
@@ -199,16 +220,20 @@ func (w *FileWorkspace) MergeAgain(
// and we have to perform a new merge.
// If there are any errors we return true since we prefer to assume divergence
// for safety.
-// Locks are acquired by the caller.
+// Acquires gitRefLock internally to serialize git metadata operations.
+// Does NOT require the caller to hold the repo read or write lock.
func (w *FileWorkspace) recheckDiverged(logger logging.SimpleLogging, p models.PullRequest, headRepo models.Repo, cloneDir string) bool {
if !w.CheckoutMerge {
- // It only makes sense to warn that main has diverged if we're using
- // the checkout merge strategy. If we're just checking out the branch,
- // then it doesn't matter what's going on with main because we've
- // decided to always run off the branch.
return false
}
+ // Hold the ref and read locks for the entire sequence: URL updates write .git/config,
+ // remote update writes .git/refs/remotes/, and hasDiverged does fetch+status.
+ // All of these touch shared git metadata and must be serialized with concurrent HasDiverged callers.
+ unlockRef := w.gitRefLock(cloneDir)
+ defer unlockRef()
+ gitReadUnlockFn := w.gitReadLock(cloneDir)
+ defer gitReadUnlockFn()
// Bring our remote refs up to date.
// Reset the URL in case we are using github app credentials since these might have
// expired and refreshed and the URL would now be different.
@@ -237,9 +262,6 @@ func (w *FileWorkspace) recheckDiverged(logger logging.SimpleLogging, p models.P
}
}
- // We already hold the write lock; take ref lock so fetch is serialized with other HasDiverged callers.
- unlockRef := w.gitRefLock(cloneDir)
- defer unlockRef()
return w.hasDiverged(logger, cloneDir)
}
@@ -251,7 +273,7 @@ func (w *FileWorkspace) HasDiverged(logger logging.SimpleLogging, cloneDir strin
}
// Hold ref lock and repo read lock for the full duration (fetch + status) so we don't race with
- // clone/reset/merge. recheckDiverged does not take the read lock because it already holds the write lock.
+ // clone/reset/merge.
unlockGitRefLock := w.gitRefLock(cloneDir)
defer unlockGitRefLock()
@@ -265,8 +287,8 @@ func (w *FileWorkspace) HasDiverged(logger logging.SimpleLogging, cloneDir strin
}
// hasDiverged runs fetch and git status to detect divergence. Caller must hold
-// gitRefLock(cloneDir); if not already holding the repo write lock (e.g. from recheckDiverged),
-// caller must also hold gitReadLock(cloneDir).
+// gitRefLock(cloneDir) and gitReadLock(cloneDir) (or the write lock)
func (w *FileWorkspace) hasDiverged(logger logging.SimpleLogging, cloneDir string) bool {
logger.Debug("HasDiverged: running git fetch in %s", cloneDir)
cmd := exec.Command("git", "fetch")
@@ -360,8 +382,8 @@ func (w *FileWorkspace) hasDivergedForPatterns(logger logging.SimpleLogging, clo
for _, file := range nonEmptyChangedFiles {
match, err := pm.MatchesOrParentMatches(file)
if err != nil {
- logger.Debug("HasDiverged: match error for file %q: %s, treating as diverged", file, err)
- return true
+ logger.Debug("HasDiverged: match error for file %q: %s", file, err)
+ continue
}
if match {
logger.Debug("HasDiverged: file %q matched patterns - branch has diverged", file) |
|
I can patch in your changes to this PR if you want @pseudomorph, thanks! Also if anyone has a suggestion to overcome the seemingly false positive CodeQL check, let me know. |
c751477 to
85ecf2b
Compare
|
Applied your patch — pushed as a follow-up commit. Tests pass locally. Thanks @pseudomorph for extending the fast-path pattern to Summary of the second commit:
|
|
@matthewmrichter could you fix the conflicts? |
…l plans When parallel_plan is enabled, all project goroutines call Clone() which acquires an exclusive write lock on the shared clone directory before checking if the repo is at the correct commit. Meanwhile, runSteps() holds a read lock on the same directory for plan execution. Since RWMutex.Lock() waits for all readers to release, each project's Clone() blocks until the previous project's plan completes — serializing parallel plans to effectively 1 at a time. Fix: add an optimistic fast path that checks the commit under a READ lock first (git rev-parse is read-only). If the directory exists and is at the correct commit, return immediately without ever acquiring the write lock. Only fall through to the write-lock path when the directory needs cloning or updating. This enables true parallel plan execution for the common case where the clone directory is already checked out at the right commit (e.g., autoplan on push, re-plan via comment). Fixes runatlantis#5364 Relates to runatlantis#1618 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Richter <matthew.richter@reserv.com>
Restructures MergeAgain() to check divergence first under read locks, only acquiring the write lock when a merge is actually needed. This mirrors the Clone() optimization in this PR and eliminates the second serialization point for parallel plans using merge checkout strategy. Changes: - MergeAgain() now calls recheckDiverged() first; write lock acquired only when divergence is detected and a merge must be performed. - recheckDiverged() now acquires its own gitRefLock + gitReadLock internally, making the locking contract explicit rather than relying on an implicit "caller holds the write lock" convention. - hasDivergedForPatterns() now continues on per-file match errors instead of returning true (avoids false positives from transient pattern errors forcing unnecessary merges). Signed-off-by: Matt Richter <matthew.richter@reserv.com>
85ecf2b to
39875d3
Compare
Done! |
|
@pseudomorph one more review before we merge? |
Co-authored-by: Ross Strickland <rkstrickland@gmail.com> Signed-off-by: Matt Richter <4481324+matthewmrichter@users.noreply.github.com>
|
@matthewmrichter I added suggestion if you can merge them |
Co-authored-by: Nicolas Vanheuverzwijn <nicolas.vanheu@gmail.com> Signed-off-by: Matt Richter <4481324+matthewmrichter@users.noreply.github.com>
Co-authored-by: Nicolas Vanheuverzwijn <nicolas.vanheu@gmail.com> Signed-off-by: Matt Richter <4481324+matthewmrichter@users.noreply.github.com>
|
hey @matthewmrichter , if you could merge my suggested changes, I messed up the lint. After that, I think we should merge that asap. |
Co-authored-by: Nicolas Vanheuverzwijn <nicolas.vanheu@gmail.com> Signed-off-by: Matt Richter <4481324+matthewmrichter@users.noreply.github.com>
|
the tests will need to be fixed. @matthewmrichter |
|
I enable auto-merge once tests are fixed. A maintainer will need to re-approve/run the tests once the tests are fixed. |
Summary
Clone()so parallel plan goroutines don't serialize on the shared clone directory write lockProblem
When
parallel_plan: trueis enabled, Atlantis dispatches project plans as concurrent goroutines viasizedwaitgroup. However, each goroutine callsClone()which unconditionally acquires an exclusive write lock (RWMutex.Lock()) on the shared clone directory before checking if the repo is at the correct commit.Meanwhile,
runSteps()holds a read lock (RWMutex.RLock()) on the same directory during plan execution. SinceRWMutex.Lock()waits for all readers to release, each project'sClone()blocks until the previous project's plan finishes:This serializes parallel plans to effectively 1 at a time, regardless of pool size.
Fix
Add a fast path before the write lock: check if the directory exists and is at the correct commit under a read lock.
isBranchAtTargetRef()only runsgit rev-parsewhich is read-only and safe under a shared lock.All goroutines can now pass the fast path simultaneously and proceed to plan in parallel.
If the fast path fails (directory doesn't exist, wrong commit, error), execution falls through to the existing write-lock path which re-checks via
attemptReuseCloneDir()— no TOCTOU race.Testing
All existing clone tests pass. The fast path is exercised by
TestClone_MasterHasDivergedwhich logs"repo is at correct commit ... (fast path)".Fixes #5364
Relates to #1618
🤖 Generated with Claude Code