Circuit Resilience: Add Health Tracking, Auto-Rebuild, and Relay Blacklisting #1

@TONresistor

Description

Problem Statement

Currently, tonnet-proxy lacks resilience mechanisms when relay nodes become unavailable. This leads to poor user experience and extended downtime periods.

Issue 1: No Health Check Before Relay Selection

Current behavior:

// directory.go:115
entryRelay := entries[cryptoRandInt(len(entries))]  // Blind random selection

The SelectCircuitRelays() function selects relays randomly without checking availability. Dead relays can be selected repeatedly, wasting retry attempts.

Impact: Failed circuit builds even when healthy relays are available.


Issue 2: No Automatic Detection of Broken Circuit

Current behavior:

// main.go:84-90
streamID, err := backend.OpenStream(ctx, r.Host, 80)
if err != nil {
    http.Error(w, "Failed to connect to site", http.StatusBadGateway)
    return  // Error returned, no tracking
}

When requests fail, the proxy returns 502 errors but doesn't track the failures, so there is no way to distinguish a broken circuit from a transient network issue.

Impact: Users experience continuous 502 errors until manual restart or scheduled rotation.


Issue 3: No Automatic Circuit Rebuild

Current behavior:

Once a circuit is established, it remains in use until:

  • Manual restart of the proxy
  • Scheduled rotation (if --rotate flag is set)

There's no automatic rebuild when the circuit becomes unusable.

Impact: Extended downtime periods. Users must wait for rotation or restart the proxy manually.


Issue 4: No Temporary Blacklist for Failed Relays

Current behavior:

// main.go:232-237
for attempt := 1; attempt <= maxRetries; attempt++ {
    relays, err := dir.SelectCircuitRelays()  // Can select the same dead relay again!

When a circuit build fails, the same relay can be selected on the next attempt since there's no memory of which relays failed.

Impact:

  • Retry attempts wasted on known-dead relays
  • Example: Relay A is down → selected in attempt 1, 2, and 3 → all fail

Proposed Solution

A unified approach using two simple mechanisms:

  1. Relay Blacklist in Directory - Track failed relays temporarily
  2. Circuit Health Monitor in ProxyHandler - Detect failures and trigger rebuild

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                     ProxyHandler                        │
│  ┌─────────────────┐  ┌─────────────────────────────┐  │
│  │ consecutiveErr  │  │ rebuilding (atomic flag)    │  │
│  └────────┬────────┘  └─────────────────────────────┘  │
│           │                                             │
│           ▼                                             │
│  ┌──────────────────────────────────────────────────┐  │
│  │  On request failure: increment counter           │  │
│  │  If counter >= 3: trigger async rebuild          │  │
│  │  On success: reset counter                       │  │
│  └──────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                      Directory                          │
│  ┌──────────────────────────────────────────────────┐  │
│  │  blacklist map[string]time.Time                  │  │
│  │                                                  │  │
│  │  MarkFailed(addr)  → add to blacklist            │  │
│  │  MarkSuccess(addr) → remove from blacklist       │  │
│  │  FilterByRole()    → excludes blacklisted        │  │
│  └──────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

Implementation Details

1. Modify internal/directory/directory.go

Add blacklist tracking:

const BlacklistDuration = 5 * time.Minute

type Directory struct {
    Version   int     `json:"version"`
    Updated   string  `json:"updated"`
    Relays    []Relay `json:"relays"`

    // Health tracking
    mu        sync.RWMutex
    blacklist map[string]time.Time // address -> failure timestamp
}

// MarkFailed adds a relay to the temporary blacklist
func (d *Directory) MarkFailed(addr string) {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.blacklist == nil {
        d.blacklist = make(map[string]time.Time)
    }
    d.blacklist[addr] = time.Now()
}

// MarkSuccess removes a relay from the blacklist
func (d *Directory) MarkSuccess(addr string) {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.blacklist != nil {
        delete(d.blacklist, addr)
    }
}

// isBlacklisted checks if a relay is temporarily unavailable
func (d *Directory) isBlacklisted(addr string) bool {
    d.mu.RLock()
    defer d.mu.RUnlock()
    if d.blacklist == nil {
        return false
    }
    failTime, exists := d.blacklist[addr]
    if !exists {
        return false
    }
    return time.Since(failTime) <= BlacklistDuration
}

// FilterByRole - modified to exclude blacklisted relays
func (d *Directory) FilterByRole(role string) []Relay {
    var result []Relay
    for _, r := range d.Relays {
        if r.HasRole(role) && !d.isBlacklisted(r.Address) {
            result = append(result, r)
        }
    }
    return result
}

// FilterByRoleIncludeBlacklisted - fallback when too few relays available
func (d *Directory) FilterByRoleIncludeBlacklisted(role string) []Relay {
    var result []Relay
    for _, r := range d.Relays {
        if r.HasRole(role) {
            result = append(result, r)
        }
    }
    return result
}
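The blacklist semantics above can be exercised in isolation. Below is a self-contained sketch; the `Relay` fields and `HasRole` implementation here are minimal stand-ins for illustration, not the real types from `internal/directory`:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const BlacklistDuration = 5 * time.Minute

// Relay is a minimal stand-in; the real type carries more fields.
type Relay struct {
	Address string
	Roles   []string
}

// HasRole reports whether the relay advertises the given role.
func (r Relay) HasRole(role string) bool {
	for _, got := range r.Roles {
		if got == role {
			return true
		}
	}
	return false
}

type Directory struct {
	Relays    []Relay
	mu        sync.RWMutex
	blacklist map[string]time.Time
}

// MarkFailed adds a relay to the temporary blacklist.
func (d *Directory) MarkFailed(addr string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.blacklist == nil {
		d.blacklist = make(map[string]time.Time)
	}
	d.blacklist[addr] = time.Now()
}

// MarkSuccess removes a relay from the blacklist
// (delete on a nil map is a safe no-op in Go).
func (d *Directory) MarkSuccess(addr string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.blacklist, addr)
}

// isBlacklisted reports whether a relay failed within BlacklistDuration.
func (d *Directory) isBlacklisted(addr string) bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	failTime, ok := d.blacklist[addr]
	return ok && time.Since(failTime) <= BlacklistDuration
}

// FilterByRole returns relays with the role, skipping blacklisted ones.
func (d *Directory) FilterByRole(role string) []Relay {
	var result []Relay
	for _, r := range d.Relays {
		if r.HasRole(role) && !d.isBlacklisted(r.Address) {
			result = append(result, r)
		}
	}
	return result
}

func main() {
	d := &Directory{Relays: []Relay{
		{Address: "relay-a:9000", Roles: []string{"entry"}},
		{Address: "relay-b:9000", Roles: []string{"entry"}},
	}}
	fmt.Println(len(d.FilterByRole("entry"))) // 2: both available
	d.MarkFailed("relay-a:9000")
	fmt.Println(len(d.FilterByRole("entry"))) // 1: relay-a excluded
	d.MarkSuccess("relay-a:9000")
	fmt.Println(len(d.FilterByRole("entry"))) // 2: relay-a restored
}
```

Note that a relay is never permanently lost: even without `MarkSuccess`, its blacklist entry stops matching once `BlacklistDuration` elapses.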

Update SelectCircuit() to fall back when the blacklist filters out too many relays:

func (d *Directory) SelectCircuit() (entry, middle, exit *Relay, err error) {
    entries := d.FilterByRole("entry")
    middles := d.FilterByRole("middle")
    exits := d.FilterByRole("exit")

    // Fallback to all relays if too few available
    if len(entries) == 0 {
        entries = d.FilterByRoleIncludeBlacklisted("entry")
    }
    if len(middles) == 0 {
        middles = d.FilterByRoleIncludeBlacklisted("middle")
    }
    if len(exits) == 0 {
        exits = d.FilterByRoleIncludeBlacklisted("exit")
    }
    // ... rest unchanged
}

2. Modify cmd/main.go

Add circuit health monitoring:

const MaxConsecutiveFailures = 3

type ProxyHandler struct {
    // ... existing fields ...
    
    // Circuit health tracking
    consecutiveFailures int32 // atomic counter
    rebuilding          int32 // atomic flag
    lastRelays          []client.RelayInfo
    lastRelaysMu        sync.RWMutex
}

// trackFailure increments failure counter and triggers rebuild if needed
func (h *ProxyHandler) trackFailure(ctx context.Context) {
    if h.directMode {
        return
    }
    
    count := atomic.AddInt32(&h.consecutiveFailures, 1)
    
    if count >= MaxConsecutiveFailures {
        if atomic.CompareAndSwapInt32(&h.rebuilding, 0, 1) {
            go h.rebuildCircuit(context.Background())
        }
    }
}

// trackSuccess resets the failure counter
func (h *ProxyHandler) trackSuccess() {
    if h.directMode {
        return
    }
    atomic.StoreInt32(&h.consecutiveFailures, 0)
    
    // Mark current relays as healthy
    if h.dir != nil {
        h.lastRelaysMu.RLock()
        for _, r := range h.lastRelays {
            h.dir.MarkSuccess(r.Address)
        }
        h.lastRelaysMu.RUnlock()
    }
}

// rebuildCircuit attempts to rebuild after consecutive failures
func (h *ProxyHandler) rebuildCircuit(ctx context.Context) {
    defer atomic.StoreInt32(&h.rebuilding, 0)
    defer atomic.StoreInt32(&h.consecutiveFailures, 0)
    
    fmt.Println("\nCircuit appears broken, attempting rebuild...")
    
    // Mark current relays as failed
    if h.dir != nil {
        h.lastRelaysMu.RLock()
        for _, r := range h.lastRelays {
            h.dir.MarkFailed(r.Address)
        }
        h.lastRelaysMu.RUnlock()
    }
    
    newCircuit, newRelays, err := buildCircuitWithRetryAndRelays(ctx, h.gate, h.dir, h.retries)
    if err != nil {
        fmt.Printf("Rebuild failed: %v (keeping current circuit)\n", err)
        return
    }
    
    // Swap circuits atomically
    h.backendMu.Lock()
    oldBackend := h.backend
    h.backend = newCircuit
    h.backendMu.Unlock()
    
    h.lastRelaysMu.Lock()
    h.lastRelays = newRelays
    h.lastRelaysMu.Unlock()
    
    if oldBackend != nil {
        oldBackend.Close()
    }
    
    fmt.Printf("Circuit rebuilt [%s]\n\n", hex.EncodeToString(newCircuit.ID)[:8])
}

Update ServeHTTP to call tracking functions:

func (h *ProxyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // ... existing code ...
    
    streamID, err := backend.OpenStream(ctx, r.Host, 80)
    if err != nil {
        h.trackFailure(ctx)  // ADD THIS
        http.Error(w, "Failed to connect to site", http.StatusBadGateway)
        return
    }
    
    // ... send request ...
    
    respData, err := backend.RecvData(ctx, streamID)
    if err != nil {
        h.trackFailure(ctx)  // ADD THIS
        http.Error(w, "Failed to receive response", http.StatusBadGateway)
        return
    }
    
    h.trackSuccess()  // ADD THIS
    
    // ... rest unchanged ...
}

Update buildCircuitWithRetry to mark relays:

func buildCircuitWithRetryAndRelays(ctx context.Context, gate *adnl.Gateway, dir *directory.Directory, maxRetries int) (*client.ClientCircuit, []client.RelayInfo, error) {
    var lastErr error

    for attempt := 1; attempt <= maxRetries; attempt++ {
        relays, err := dir.SelectCircuitRelays()
        if err != nil {
            lastErr = err
            continue
        }

        circuit, err := client.NewClientCircuit(ctx, gate, relays)
        if err == nil {
            // Success - mark relays as healthy
            for _, r := range relays {
                dir.MarkSuccess(r.Address)
            }
            return circuit, relays, nil
        }

        lastErr = err
        
        // Mark failed relays
        for _, r := range relays {
            dir.MarkFailed(r.Address)
        }
    }

    return nil, nil, fmt.Errorf("failed after %d attempts: %w", maxRetries, lastErr)
}

Behavior After Implementation

Scenario: Relay Dies Mid-Session

Request 1: Success (failCount = 0)
Request 2: 502 Error (failCount = 1)
Request 3: 502 Error (failCount = 2)
Request 4: 502 Error (failCount = 3) → Triggers rebuildCircuit()
           ├── Current relays blacklisted for 5 min
           ├── New circuit built with different relays
           └── failCount reset to 0
Request 5: Success (using new circuit)

Scenario: Multiple Relays Down During Build

Attempt 1: Select A, B, C → A is down → Fail → A, B, C blacklisted
Attempt 2: Select D, E, F → All healthy → Success

Summary of Changes

File          Change                                       Lines
directory.go  Add blacklist map and methods                ~60
directory.go  Modify FilterByRole to exclude blacklisted   ~5
main.go       Add failure tracking and auto-rebuild        ~80
main.go       Modify buildCircuitWithRetry                 ~15

Total: ~160 lines of new code


Backwards Compatibility

  • No breaking changes to CLI flags
  • No changes to circuit protocol
  • No changes to directory JSON format
  • Existing deployments gain resilience features automatically
