Circuit Resilience: Add Health Tracking, Auto-Rebuild, and Relay Blacklisting #1

@TONresistor

Description

Problem Statement

Currently, tonnet-proxy lacks resilience mechanisms when relay nodes become unavailable. This leads to poor user experience and extended downtime periods.

Issue 1: No Health Check Before Relay Selection

Current behavior:

// directory.go:115
entryRelay := entries[cryptoRandInt(len(entries))]  // Blind random selection

The SelectCircuitRelays() function selects relays randomly without checking availability. Dead relays can be selected repeatedly, wasting retry attempts.

Impact: Failed circuit builds even when healthy relays are available.


Issue 2: No Automatic Detection of Broken Circuit

Current behavior:

// main.go:84-90
streamID, err := backend.OpenStream(ctx, r.Host, 80)
if err != nil {
    http.Error(w, "Failed to connect to site", http.StatusBadGateway)
    return  // Error returned, no tracking
}

When requests fail, the proxy returns 502 errors but doesn't track the failures, so there is no way to distinguish a broken circuit from a transient network issue.

Impact: Users experience continuous 502 errors until manual restart or scheduled rotation.


Issue 3: No Automatic Circuit Rebuild

Current behavior:

Once a circuit is established, it remains in use until:

  • Manual restart of the proxy
  • Scheduled rotation (if --rotate flag is set)

There's no automatic rebuild when the circuit becomes unusable.

Impact: Extended downtime periods. Users must wait for rotation or restart the proxy manually.


Issue 4: No Temporary Blacklist for Failed Relays

Current behavior:

// main.go:232-237
for attempt := 1; attempt <= maxRetries; attempt++ {
    relays, err := dir.SelectCircuitRelays()  // Can select the same dead relay again!

When a circuit build fails, the same relay can be selected on the next attempt since there's no memory of which relays failed.

Impact:

  • Retry attempts wasted on known-dead relays
  • Example: Relay A is down → selected in attempt 1, 2, and 3 → all fail

Proposed Solution

A unified approach using two simple mechanisms:

  1. Relay Blacklist in Directory - Track failed relays temporarily
  2. Circuit Health Monitor in ProxyHandler - Detect failures and trigger rebuild

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                     ProxyHandler                        │
│  ┌─────────────────┐  ┌─────────────────────────────┐  │
│  │ consecutiveErr  │  │ rebuilding (atomic flag)    │  │
│  └────────┬────────┘  └─────────────────────────────┘  │
│           │                                             │
│           ▼                                             │
│  ┌──────────────────────────────────────────────────┐  │
│  │  On request failure: increment counter           │  │
│  │  If counter >= 3: trigger async rebuild          │  │
│  │  On success: reset counter                       │  │
│  └──────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                      Directory                          │
│  ┌──────────────────────────────────────────────────┐  │
│  │  blacklist map[string]time.Time                  │  │
│  │                                                  │  │
│  │  MarkFailed(addr)  → add to blacklist            │  │
│  │  MarkSuccess(addr) → remove from blacklist       │  │
│  │  FilterByRole()    → excludes blacklisted        │  │
│  └──────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

Implementation Details

1. Modify internal/directory/directory.go

Add blacklist tracking:

const BlacklistDuration = 5 * time.Minute

type Directory struct {
    Version   int     `json:"version"`
    Updated   string  `json:"updated"`
    Relays    []Relay `json:"relays"`

    // Health tracking
    mu        sync.RWMutex
    blacklist map[string]time.Time // address -> failure timestamp
}

// MarkFailed adds a relay to the temporary blacklist
func (d *Directory) MarkFailed(addr string) {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.blacklist == nil {
        d.blacklist = make(map[string]time.Time)
    }
    d.blacklist[addr] = time.Now()
}

// MarkSuccess removes a relay from the blacklist
func (d *Directory) MarkSuccess(addr string) {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.blacklist != nil {
        delete(d.blacklist, addr)
    }
}

// isBlacklisted checks if a relay is temporarily unavailable
func (d *Directory) isBlacklisted(addr string) bool {
    d.mu.RLock()
    defer d.mu.RUnlock()
    if d.blacklist == nil {
        return false
    }
    failTime, exists := d.blacklist[addr]
    if !exists {
        return false
    }
    return time.Since(failTime) <= BlacklistDuration
}

// FilterByRole - modified to exclude blacklisted relays
func (d *Directory) FilterByRole(role string) []Relay {
    var result []Relay
    for _, r := range d.Relays {
        if r.HasRole(role) && !d.isBlacklisted(r.Address) {
            result = append(result, r)
        }
    }
    return result
}

// FilterByRoleIncludeBlacklisted - fallback when too few relays available
func (d *Directory) FilterByRoleIncludeBlacklisted(role string) []Relay {
    var result []Relay
    for _, r := range d.Relays {
        if r.HasRole(role) {
            result = append(result, r)
        }
    }
    return result
}
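The blacklist semantics above can be exercised in isolation. Below is a self-contained sketch; the `Relay` fields and `HasRole` implementation here are minimal stand-ins for illustration, not the real types from `internal/directory`:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const BlacklistDuration = 5 * time.Minute

// Relay is a minimal stand-in; the real type carries more fields.
type Relay struct {
	Address string
	Roles   []string
}

// HasRole reports whether the relay advertises the given role.
func (r Relay) HasRole(role string) bool {
	for _, got := range r.Roles {
		if got == role {
			return true
		}
	}
	return false
}

type Directory struct {
	Relays    []Relay
	mu        sync.RWMutex
	blacklist map[string]time.Time
}

// MarkFailed adds a relay to the temporary blacklist.
func (d *Directory) MarkFailed(addr string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.blacklist == nil {
		d.blacklist = make(map[string]time.Time)
	}
	d.blacklist[addr] = time.Now()
}

// MarkSuccess removes a relay from the blacklist
// (delete on a nil map is a safe no-op in Go).
func (d *Directory) MarkSuccess(addr string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.blacklist, addr)
}

// isBlacklisted reports whether a relay failed within BlacklistDuration.
func (d *Directory) isBlacklisted(addr string) bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	failTime, ok := d.blacklist[addr]
	return ok && time.Since(failTime) <= BlacklistDuration
}

// FilterByRole returns relays with the role, skipping blacklisted ones.
func (d *Directory) FilterByRole(role string) []Relay {
	var result []Relay
	for _, r := range d.Relays {
		if r.HasRole(role) && !d.isBlacklisted(r.Address) {
			result = append(result, r)
		}
	}
	return result
}

func main() {
	d := &Directory{Relays: []Relay{
		{Address: "relay-a:9000", Roles: []string{"entry"}},
		{Address: "relay-b:9000", Roles: []string{"entry"}},
	}}
	fmt.Println(len(d.FilterByRole("entry"))) // 2: both available
	d.MarkFailed("relay-a:9000")
	fmt.Println(len(d.FilterByRole("entry"))) // 1: relay-a excluded
	d.MarkSuccess("relay-a:9000")
	fmt.Println(len(d.FilterByRole("entry"))) // 2: relay-a restored
}
```

Note that a relay is never permanently lost: even without `MarkSuccess`, its blacklist entry stops matching once `BlacklistDuration` elapses.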

Update SelectCircuit() to fall back when the blacklist filters out too many relays:

func (d *Directory) SelectCircuit() (entry, middle, exit *Relay, err error) {
    entries := d.FilterByRole("entry")
    middles := d.FilterByRole("middle")
    exits := d.FilterByRole("exit")

    // Fallback to all relays if too few available
    if len(entries) == 0 {
        entries = d.FilterByRoleIncludeBlacklisted("entry")
    }
    if len(middles) == 0 {
        middles = d.FilterByRoleIncludeBlacklisted("middle")
    }
    if len(exits) == 0 {
        exits = d.FilterByRoleIncludeBlacklisted("exit")
    }
    // ... rest unchanged
}

2. Modify cmd/main.go

Add circuit health monitoring:

const MaxConsecutiveFailures = 3

type ProxyHandler struct {
    // ... existing fields ...
    
    // Circuit health tracking
    consecutiveFailures int32 // atomic counter
    rebuilding          int32 // atomic flag
    lastRelays          []client.RelayInfo
    lastRelaysMu        sync.RWMutex
}

// trackFailure increments failure counter and triggers rebuild if needed
func (h *ProxyHandler) trackFailure(ctx context.Context) {
    if h.directMode {
        return
    }
    
    count := atomic.AddInt32(&h.consecutiveFailures, 1)
    
    if count >= MaxConsecutiveFailures {
        if atomic.CompareAndSwapInt32(&h.rebuilding, 0, 1) {
            go h.rebuildCircuit(context.Background())
        }
    }
}

// trackSuccess resets the failure counter
func (h *ProxyHandler) trackSuccess() {
    if h.directMode {
        return
    }
    atomic.StoreInt32(&h.consecutiveFailures, 0)
    
    // Mark current relays as healthy
    if h.dir != nil {
        h.lastRelaysMu.RLock()
        for _, r := range h.lastRelays {
            h.dir.MarkSuccess(r.Address)
        }
        h.lastRelaysMu.RUnlock()
    }
}

// rebuildCircuit attempts to rebuild after consecutive failures
func (h *ProxyHandler) rebuildCircuit(ctx context.Context) {
    defer atomic.StoreInt32(&h.rebuilding, 0)
    defer atomic.StoreInt32(&h.consecutiveFailures, 0)
    
    fmt.Println("\nCircuit appears broken, attempting rebuild...")
    
    // Mark current relays as failed
    if h.dir != nil {
        h.lastRelaysMu.RLock()
        for _, r := range h.lastRelays {
            h.dir.MarkFailed(r.Address)
        }
        h.lastRelaysMu.RUnlock()
    }
    
    newCircuit, newRelays, err := buildCircuitWithRetryAndRelays(ctx, h.gate, h.dir, h.retries)
    if err != nil {
        fmt.Printf("Rebuild failed: %v (keeping current circuit)\n", err)
        return
    }
    
    // Swap circuits atomically
    h.backendMu.Lock()
    oldBackend := h.backend
    h.backend = newCircuit
    h.backendMu.Unlock()
    
    h.lastRelaysMu.Lock()
    h.lastRelays = newRelays
    h.lastRelaysMu.Unlock()
    
    if oldBackend != nil {
        oldBackend.Close()
    }
    
    fmt.Printf("Circuit rebuilt [%s]\n\n", hex.EncodeToString(newCircuit.ID)[:8])
}

Update ServeHTTP to call tracking functions:

func (h *ProxyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // ... existing code ...
    
    streamID, err := backend.OpenStream(ctx, r.Host, 80)
    if err != nil {
        h.trackFailure(ctx)  // ADD THIS
        http.Error(w, "Failed to connect to site", http.StatusBadGateway)
        return
    }
    
    // ... send request ...
    
    respData, err := backend.RecvData(ctx, streamID)
    if err != nil {
        h.trackFailure(ctx)  // ADD THIS
        http.Error(w, "Failed to receive response", http.StatusBadGateway)
        return
    }
    
    h.trackSuccess()  // ADD THIS
    
    // ... rest unchanged ...
}

Update buildCircuitWithRetry to mark relays:

func buildCircuitWithRetryAndRelays(ctx context.Context, gate *adnl.Gateway, dir *directory.Directory, maxRetries int) (*client.ClientCircuit, []client.RelayInfo, error) {
    var lastErr error

    for attempt := 1; attempt <= maxRetries; attempt++ {
        relays, err := dir.SelectCircuitRelays()
        if err != nil {
            lastErr = err
            continue
        }

        circuit, err := client.NewClientCircuit(ctx, gate, relays)
        if err == nil {
            // Success - mark relays as healthy
            for _, r := range relays {
                dir.MarkSuccess(r.Address)
            }
            return circuit, relays, nil
        }

        lastErr = err
        
        // Mark failed relays
        for _, r := range relays {
            dir.MarkFailed(r.Address)
        }
    }

    return nil, nil, fmt.Errorf("failed after %d attempts: %w", maxRetries, lastErr)
}

Behavior After Implementation

Scenario: Relay Dies Mid-Session

Request 1: Success (failCount = 0)
Request 2: 502 Error (failCount = 1)
Request 3: 502 Error (failCount = 2)
Request 4: 502 Error (failCount = 3) → Triggers rebuildCircuit()
           ├── Current relays blacklisted for 5 min
           ├── New circuit built with different relays
           └── failCount reset to 0
Request 5: Success (using new circuit)

Scenario: Multiple Relays Down During Build

Attempt 1: Select A, B, C → A is down → Fail → A, B, C blacklisted
Attempt 2: Select D, E, F → All healthy → Success

Summary of Changes

File          Change                                       Lines
directory.go  Add blacklist map and methods                ~60
directory.go  Modify FilterByRole to exclude blacklisted   ~5
main.go       Add failure tracking and auto-rebuild        ~80
main.go       Modify buildCircuitWithRetry                 ~15

Total: ~160 lines of new code


Backwards Compatibility

  • No breaking changes to CLI flags
  • No changes to circuit protocol
  • No changes to directory JSON format
  • Existing deployments gain resilience features automatically
