Conversation

@amirejaz
Contributor

Summary

This PR implements automatic authentication monitoring for remote workloads using OAuth2 authentication. The system proactively detects when authentication tokens expire or fail, automatically marking workloads as unauthenticated without requiring client requests.

Changes

New Component: AuthenticatedTokenSource

  • Wrapper around oauth2.TokenSource that provides:
    • Per-request token retrieval via Token() method
    • Background monitoring goroutine that checks token validity
    • Automatic status updates to WorkloadStatusUnauthenticated on authentication failures
    • Exponential backoff retry mechanism for transient errors

Key Features

1. Intelligent Error Classification

  • Only marks workloads as unauthenticated for hard authentication failures:
    • oauth2.RetrieveError with HTTP status codes 400/401
    • Error bodies containing invalid_grant, invalid_client, or invalid_token
  • Transient errors (network timeouts, generic errors, context cancellations) are not treated as authentication failures
  • Prevents false positives from temporary network issues

2. Background Monitoring

  • Monitors token validity in the background using a timer-based approach
  • Schedules token checks at token expiry time ("lazy" scheduling)
  • Automatically detects expired tokens and attempts refresh
  • Stops monitoring when authentication failure is detected (requires re-authentication)

3. Exponential Backoff

  • Implements exponential backoff for transient errors
  • Starts at 1 second, doubles on each retry, capped at 2 minutes
  • Allows monitoring to recover from temporary network issues
  • Resets backoff to 1 second on successful token retrieval

4. Lifecycle Management

  • Integrated with Runner lifecycle:
    • Monitoring starts when remote workload is initialized
    • Monitoring stops gracefully when Runner.Cleanup() is called (via context cancellation)
  • Uses sync.Once for idempotent channel closure
  • Thread-safe timer management with proper cleanup

Integration Points

Runner Integration

  • AuthenticatedTokenSource is created for remote workloads with authentication
  • Uses child context for monitoring that can be cancelled during cleanup
  • Status updates go through existing StatusManager interface

Status Updates

  • Calls StatusManager.SetWorkloadStatus() with WorkloadStatusUnauthenticated
  • Reason includes the authentication error details for debugging
  • Status manager handles idempotency (multiple calls are safe)

Technical Details

Error Detection

// Only oauth2.RetrieveError with specific characteristics are considered auth errors
- HTTP 400/401 status codes
- Error body contains: invalid_grant, invalid_client, invalid_token
// All other errors (including network errors) are treated as transient

Monitoring Flow

  1. Initial check runs immediately (1ms timer)
  2. On successful token retrieval, schedules next check at token expiry
  3. On token expiry, attempts refresh
  4. On authentication error, marks as unauthenticated and stops monitoring
  5. On transient error, retries with exponential backoff

Testing

  • Comprehensive unit test suite covering:
    • Successful token retrieval
    • Authentication error detection and status updates
    • Transient error handling (no false positives)
    • Background monitoring with token expiry
    • Exponential backoff recovery
    • Context cancellation and cleanup
    • Error classification logic
  • All tests use gomock for StatusManager mocks
  • Tests account for timing variability in background monitoring

Benefits

  • Proactive Detection: Authentication failures are detected automatically, not just when clients make requests
  • Accurate Status: Only marks unauthenticated for actual authentication failures, not transient errors
  • Better UX: Users see accurate workload status without needing to make requests
  • Resilient: Recovers from transient errors automatically
  • Clean Integration: Fits seamlessly into existing Runner and status management systems

Example Scenarios

Scenario 1: Refresh Token Expires

  • Background monitor detects expired token
  • Attempts refresh, receives invalid_grant error
  • Immediately marks workload as WorkloadStatusUnauthenticated
  • Monitoring stops (requires manual re-authentication)

Scenario 2: Temporary Network Issue

  • Background monitor encounters network timeout
  • Retries with exponential backoff (

@JAORMX
Collaborator

JAORMX commented Nov 3, 2025

@claude please review this PR and put emphasis on security as well

@claude
Contributor

claude bot commented Nov 3, 2025

Claude finished @JAORMX's task


Security-Focused Code Review

I've completed a thorough security review of this PR, focusing on the authentication monitoring implementation. Overall, this is a well-designed and secure implementation with excellent error handling and safety patterns. Below are my findings:


✅ Security Strengths

1. Excellent Error Classification (pkg/runner/authenticated_token_source.go:154-172)

The isAuthenticationError() function properly distinguishes between authentication failures and transient errors:

  • ✅ Only treats oauth2.RetrieveError with HTTP 400/401 or specific error bodies as auth failures
  • ✅ Network timeouts, context cancellations, and generic errors are correctly treated as transient
  • ✅ Prevents false positives that could incorrectly mark workloads as unauthenticated

Security Impact: This prevents denial-of-service scenarios where transient network issues would permanently mark workloads as unauthenticated.

2. No Token Leakage in Logs

  • ✅ Error messages (pkg/runner/authenticated_token_source.go:55, 114) format errors with %v, which prints only the string returned by the error's Error() method
  • ✅ The oauth2.RetrieveError type doesn't expose actual token values in its error string
  • ✅ Status reasons are logged for debugging but don't contain sensitive token data

Security Impact: Tokens are never logged or exposed through error messages.

3. Thread-Safe Implementation

  • ✅ Uses sync.Once for idempotent channel closure (pkg/runner/authenticated_token_source.go:26, 151)
  • ✅ Proper mutex usage in mockTokenSource tests
  • ✅ Timer management properly handles races with stopTimer() method (94-101)
  • ✅ Context cancellation is properly synchronized

Security Impact: No race conditions that could lead to a double close of the channel, a panic, or incorrect state.

4. Token Not Stored in Memory Longer Than Needed

  • AuthenticatedTokenSource wraps oauth2.TokenSource and delegates token retrieval (52)
  • ✅ Tokens are not cached in the wrapper; the underlying TokenSource handles caching
  • ✅ When monitoring detects expiry, it immediately attempts refresh rather than storing expired tokens

Security Impact: Minimizes attack surface for token theft from memory.

5. Proper Lifecycle Management (pkg/runner/runner.go:198-206, 400-406)

  • ✅ Monitoring context is properly created as a child context that can be cancelled
  • monitoringCancel() is called during Cleanup() to stop background goroutine
  • ✅ Goroutine properly exits on context cancellation (pkg/runner/authenticated_token_source.go:78-80)
  • ✅ No goroutine leaks

Security Impact: Prevents resource exhaustion attacks and ensures clean shutdown.

6. Secure Middleware Integration (pkg/transport/middleware/token_injection.go)

  • ✅ Token retrieval errors return HTTP 401 immediately (25)
  • ✅ Token is properly set in Authorization header with Bearer scheme (29)
  • ✅ Middleware handles nil tokenSource gracefully (19)

Security Impact: Proper HTTP status codes prevent information leakage about auth state.

7. Comprehensive Test Coverage

The test suite covers critical security scenarios:

  • ✅ Authentication error detection and status updates
  • ✅ Transient error handling (no false positives)
  • ✅ Multiple concurrent Token() calls
  • ✅ Background monitoring with token expiry
  • ✅ Context cancellation and cleanup
  • ✅ Error classification logic

Security Impact: High confidence that security-critical paths are tested.


⚠️ Security Considerations (Not Issues, but Worth Noting)

1. Status Manager Error Handling (pkg/runner/authenticated_token_source.go:145)

The error from SetWorkloadStatus() is intentionally ignored (_ = ats.statusManager.SetWorkloadStatus(...)).

Analysis: This is acceptable because:

  • The status manager is idempotent (multiple calls are safe per PR description)
  • Authentication failure is a "best effort" notification
  • The actual authentication failure is still returned to the caller
  • The monitoring loop stops regardless of status update success

Recommendation: Consider logging a debug message if status update fails, but current approach is acceptable.

2. Background Context Usage (pkg/runner/authenticated_token_source.go:146)

The markAsUnauthenticated() method uses context.Background() instead of the monitoring context.

Analysis: This is correct because:

  • We want the status update to complete even if monitoring context is cancelled
  • Status update is a final cleanup operation
  • Using monitoring context could cause status update to be cancelled prematurely

Verdict: Correct design choice.

3. Exponential Backoff Overflow Check (pkg/runner/authenticated_token_source.go:123)

The code checks ats.backoff == 0 to detect overflow after ats.backoff *= 2.

Analysis:

  • In Go, time.Duration is an int64
  • When int64 overflows during multiplication, it wraps to negative, then eventually zero
  • The check catches this and resets to maxBackoff

Verdict: Works correctly, though the condition ats.backoff > ats.maxBackoff would catch it earlier.

4. Timer Edge Cases (pkg/runner/authenticated_token_source.go:94-101)

The stopTimer() method properly handles the edge case where the timer has already fired.

Analysis:

  • Uses non-blocking select to drain channel if timer fired between Stop() returning false and the select
  • Prevents goroutine from blocking forever
  • Standard Go pattern for timer cleanup

Verdict: Correctly implemented.


🔒 Security Best Practices Followed

  1. Principle of Least Privilege: Monitoring only updates status, doesn't modify auth config
  2. Defense in Depth: Multiple checks (HTTP status + error body) for auth errors
  3. Fail Secure: On uncertainty (transient errors), doesn't mark as unauthenticated
  4. Secure Defaults: Exponential backoff prevents retry storms
  5. Clean Resource Management: Proper context cancellation and goroutine cleanup
  6. Separation of Concerns: Status updates separate from token retrieval
  7. Idempotency: Status manager handles duplicate updates gracefully

📊 Code Quality Observations

Positive:

  • Clear separation between per-request token retrieval (Token()) and background monitoring
  • Excellent comments explaining design decisions
  • Comprehensive test coverage with clear test names
  • Proper use of Go idioms (sync.Once, context cancellation, select statements)

Minor Code Quality Notes (Non-Security):

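  1. Line pkg/runner/authenticated_token_source.go:123: The overflow check could likely be simplified to just ats.backoff > ats.maxBackoff. A wrapped (negative) value would not satisfy that comparison, but if the cap is applied after every doubling, the backoff never grows past twice maxBackoff and so never reaches the overflow range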
  2. Test at pkg/runner/authenticated_token_source_test.go:217: Uses polling with sleep which is fine but consider using event channels for more reliable timing in CI environments

🎯 Final Security Assessment

Overall Security Rating: STRONG ✅

This PR implements authentication monitoring with excellent security properties:

  • No token leakage vectors identified
  • Proper thread safety and resource management
  • Intelligent error classification prevents false positives
  • Clean lifecycle management without resource leaks
  • Comprehensive test coverage of security-critical paths

The implementation follows Go security best practices and demonstrates defensive programming with proper error handling, context management, and concurrency control.

Recommendation: APPROVE from a security perspective. This is a well-designed feature that enhances security by proactively detecting authentication failures without introducing new vulnerabilities.


📋 Checklist for Reviewer

  • No credential leakage in logs or errors
  • No race conditions or data races
  • Proper resource cleanup (goroutines, timers, contexts)
  • Correct error classification (no false positives)
  • Thread-safe concurrent access
  • No integer overflows or undefined behavior
  • Proper HTTP status codes (401 for auth failures)
  • Comprehensive test coverage
  • No information disclosure vulnerabilities
  • Follows principle of least privilege

@codecov

codecov bot commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 62.35294% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.36%. Comparing base (538b803) to head (4303809).

Files with missing lines                       Patch %   Lines
pkg/runner/runner.go                           0.00%     12 Missing ⚠️
pkg/transport/middleware/token_injection.go    0.00%     11 Missing ⚠️
pkg/auth/monitored_token_source.go             89.83%    5 Missing and 1 partial ⚠️
pkg/vmcp/aggregator/cli_discoverer.go          0.00%     2 Missing ⚠️
pkg/transport/http.go                          0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2421      +/-   ##
==========================================
+ Coverage   55.35%   55.36%   +0.01%     
==========================================
  Files         277      279       +2     
  Lines       26451    26524      +73     
==========================================
+ Hits        14641    14686      +45     
- Misses      10459    10490      +31     
+ Partials     1351     1348       -3     


@@ -0,0 +1,171 @@
package runner
Collaborator

I'm not sure this code goes in the runner; wouldn't it be better if it's part of pkg/auth?

Contributor Author

We cannot access statusManager inside pkg/auth because that would create an import cycle. I agree this belongs in pkg/auth, though, so I can define a small adapter interface in this file that exposes the SetWorkloadStatus function, with the runner passing in the implementation:

type StatusUpdater interface {
	SetWorkloadStatus(ctx context.Context, workloadName string, status string, reason string) error
}

Contributor Author

Moved under pkg/auth.

ats.stopOnce.Do(func() { close(ats.stopMonitoring) })
}

func isAuthenticationError(err error) bool {
Member

I don't think this check is reliable.
Unfortunately the oauth library doesn't always return a consistent error when there's an authentication issue.
For example if there's no refresh token and the access token is expired it just returns a simple error. I can't link the code but in oauth2.go this is the logic:

	if tf.refreshToken == "" {
		return nil, errors.New("oauth2: token expired and refresh token is not set")
	}

Perhaps we can remove this function and always assume that if tokenSource.Token() has an error then it's unauthenticated.

Contributor Author

Good point, I considered that too, but added this check to avoid marking the state as unauthenticated due to transient or network errors.

Contributor Author

This now marks the state as unauthenticated whenever Token() returns an error.

return tok, nil
}

func (ats *AuthenticatedTokenSource) startBackgroundMonitoring() {
Member

Is the background monitoring necessary in this scenario?
Would it be easier to add a check in ListWorkloads and GetWorkload like we do for the unhealthy status?

Contributor Author

Good question, I think background monitoring is needed for proactive authentication checks when there are no client requests or user commands. It’s very lightweight and runs once per proxy after each token expiry.
