Problem Statement
Currently, tonnet-proxy lacks resilience mechanisms when relay nodes become unavailable. This leads to poor user experience and extended downtime periods.
Issue 1: No Health Check Before Relay Selection
Current behavior:
```go
// directory.go:115
entryRelay := entries[cryptoRandInt(len(entries))] // Blind random selection
```

The `SelectCircuitRelays()` function selects relays at random without checking availability. Dead relays can be selected repeatedly, wasting retry attempts.
Impact: Failed circuit builds even when healthy relays are available.
Issue 2: No Automatic Detection of Broken Circuit
Current behavior:
```go
// main.go:84-90
streamID, err := backend.OpenStream(ctx, r.Host, 80)
if err != nil {
	http.Error(w, "Failed to connect to site", http.StatusBadGateway)
	return // Error returned, no tracking
}
```

When requests fail, the proxy returns 502 errors but doesn't track the failures. There's no mechanism to distinguish a broken circuit from a temporary network issue.
Impact: Users experience continuous 502 errors until manual restart or scheduled rotation.
Issue 3: No Automatic Circuit Rebuild
Current behavior:
Once a circuit is established, it remains in use until:
- Manual restart of the proxy
- Scheduled rotation (if the `--rotate` flag is set)

There's no automatic rebuild when the circuit becomes unusable.
Impact: Extended downtime periods. Users must wait for rotation or restart the proxy manually.
Issue 4: No Temporary Blacklist for Failed Relays
Current behavior:
```go
// main.go:232-237
for attempt := 1; attempt <= maxRetries; attempt++ {
	relays, err := dir.SelectCircuitRelays() // Can select the same dead relay again!
```

When a circuit build fails, the same relay can be selected on the next attempt, since there's no memory of which relays failed.
Impact:
- Retry attempts wasted on known-dead relays
- Example: Relay A is down → selected in attempts 1, 2, and 3 → all three fail
Proposed Solution
A unified approach using two simple mechanisms:
- Relay blacklist in `Directory`: track failed relays temporarily
- Circuit health monitor in `ProxyHandler`: detect failures and trigger a rebuild
Architecture Overview
```
┌─────────────────────────────────────────────────────────┐
│                      ProxyHandler                       │
│  ┌─────────────────┐  ┌─────────────────────────────┐   │
│  │ consecutiveErr  │  │  rebuilding (atomic flag)   │   │
│  └────────┬────────┘  └─────────────────────────────┘   │
│           │                                             │
│           ▼                                             │
│  ┌──────────────────────────────────────────────────┐   │
│  │ On request failure: increment counter            │   │
│  │ If counter >= 3:    trigger async rebuild        │   │
│  │ On success:         reset counter                │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                        Directory                        │
│  ┌──────────────────────────────────────────────────┐   │
│  │ blacklist map[string]time.Time                   │   │
│  │                                                  │   │
│  │ MarkFailed(addr)  → add to blacklist             │   │
│  │ MarkSuccess(addr) → remove from blacklist        │   │
│  │ FilterByRole()    → excludes blacklisted         │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘
```
Implementation Details
1. Modify internal/directory/directory.go
Add blacklist tracking:
```go
const BlacklistDuration = 5 * time.Minute

type Directory struct {
	Version int     `json:"version"`
	Updated string  `json:"updated"`
	Relays  []Relay `json:"relays"`

	// Health tracking
	mu        sync.RWMutex
	blacklist map[string]time.Time // address -> failure timestamp
}

// MarkFailed adds a relay to the temporary blacklist
func (d *Directory) MarkFailed(addr string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.blacklist == nil {
		d.blacklist = make(map[string]time.Time)
	}
	d.blacklist[addr] = time.Now()
}

// MarkSuccess removes a relay from the blacklist
func (d *Directory) MarkSuccess(addr string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.blacklist, addr) // delete on a nil map is a no-op
}

// isBlacklisted checks if a relay is temporarily unavailable
func (d *Directory) isBlacklisted(addr string) bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	failTime, exists := d.blacklist[addr] // reading a nil map is safe
	if !exists {
		return false
	}
	return time.Since(failTime) <= BlacklistDuration
}

// FilterByRole - modified to exclude blacklisted relays
func (d *Directory) FilterByRole(role string) []Relay {
	var result []Relay
	for _, r := range d.Relays {
		if r.HasRole(role) && !d.isBlacklisted(r.Address) {
			result = append(result, r)
		}
	}
	return result
}

// FilterByRoleIncludeBlacklisted - fallback when too few relays are available
func (d *Directory) FilterByRoleIncludeBlacklisted(role string) []Relay {
	var result []Relay
	for _, r := range d.Relays {
		if r.HasRole(role) {
			result = append(result, r)
		}
	}
	return result
}
```

Update `SelectCircuit()` to fall back when the blacklist filters out too many relays:
```go
func (d *Directory) SelectCircuit() (entry, middle, exit *Relay, err error) {
	entries := d.FilterByRole("entry")
	middles := d.FilterByRole("middle")
	exits := d.FilterByRole("exit")

	// Fall back to all relays if too few are available
	if len(entries) == 0 {
		entries = d.FilterByRoleIncludeBlacklisted("entry")
	}
	if len(middles) == 0 {
		middles = d.FilterByRoleIncludeBlacklisted("middle")
	}
	if len(exits) == 0 {
		exits = d.FilterByRoleIncludeBlacklisted("exit")
	}
	// ... rest unchanged
}
```

2. Modify cmd/main.go
Add circuit health monitoring:
```go
const MaxConsecutiveFailures = 3

type ProxyHandler struct {
	// ... existing fields ...

	// Circuit health tracking
	consecutiveFailures int32 // atomic counter
	rebuilding          int32 // atomic flag
	lastRelays          []client.RelayInfo
	lastRelaysMu        sync.RWMutex
}

// trackFailure increments the failure counter and triggers a rebuild if needed
func (h *ProxyHandler) trackFailure(ctx context.Context) {
	if h.directMode {
		return
	}
	count := atomic.AddInt32(&h.consecutiveFailures, 1)
	if count >= MaxConsecutiveFailures {
		if atomic.CompareAndSwapInt32(&h.rebuilding, 0, 1) {
			go h.rebuildCircuit(context.Background())
		}
	}
}

// trackSuccess resets the failure counter
func (h *ProxyHandler) trackSuccess() {
	if h.directMode {
		return
	}
	atomic.StoreInt32(&h.consecutiveFailures, 0)
	// Mark current relays as healthy
	if h.dir != nil {
		h.lastRelaysMu.RLock()
		for _, r := range h.lastRelays {
			h.dir.MarkSuccess(r.Address)
		}
		h.lastRelaysMu.RUnlock()
	}
}

// rebuildCircuit attempts to rebuild after consecutive failures
func (h *ProxyHandler) rebuildCircuit(ctx context.Context) {
	defer atomic.StoreInt32(&h.rebuilding, 0)
	defer atomic.StoreInt32(&h.consecutiveFailures, 0)
	fmt.Println("\nCircuit appears broken, attempting rebuild...")

	// Mark current relays as failed
	if h.dir != nil {
		h.lastRelaysMu.RLock()
		for _, r := range h.lastRelays {
			h.dir.MarkFailed(r.Address)
		}
		h.lastRelaysMu.RUnlock()
	}

	newCircuit, newRelays, err := buildCircuitWithRetryAndRelays(ctx, h.gate, h.dir, h.retries)
	if err != nil {
		fmt.Printf("Rebuild failed: %v (keeping current circuit)\n", err)
		return
	}

	// Swap circuits under the lock
	h.backendMu.Lock()
	oldBackend := h.backend
	h.backend = newCircuit
	h.backendMu.Unlock()

	h.lastRelaysMu.Lock()
	h.lastRelays = newRelays
	h.lastRelaysMu.Unlock()

	if oldBackend != nil {
		oldBackend.Close()
	}
	fmt.Printf("Circuit rebuilt [%s]\n\n", hex.EncodeToString(newCircuit.ID)[:8])
}
```

Update `ServeHTTP` to call the tracking functions:
```go
func (h *ProxyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// ... existing code ...
	streamID, err := backend.OpenStream(ctx, r.Host, 80)
	if err != nil {
		h.trackFailure(ctx) // ADD THIS
		http.Error(w, "Failed to connect to site", http.StatusBadGateway)
		return
	}

	// ... send request ...

	respData, err := backend.RecvData(ctx, streamID)
	if err != nil {
		h.trackFailure(ctx) // ADD THIS
		http.Error(w, "Failed to receive response", http.StatusBadGateway)
		return
	}
	h.trackSuccess() // ADD THIS
	// ... rest unchanged ...
}
```

Update `buildCircuitWithRetry` to mark relays:
```go
func buildCircuitWithRetryAndRelays(ctx context.Context, gate *adnl.Gateway, dir *directory.Directory, maxRetries int) (*client.ClientCircuit, []client.RelayInfo, error) {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		relays, err := dir.SelectCircuitRelays()
		if err != nil {
			lastErr = err
			continue
		}
		circuit, err := client.NewClientCircuit(ctx, gate, relays)
		if err == nil {
			// Success - mark relays as healthy
			for _, r := range relays {
				dir.MarkSuccess(r.Address)
			}
			return circuit, relays, nil
		}
		lastErr = err
		// Mark failed relays so the next attempt avoids them
		for _, r := range relays {
			dir.MarkFailed(r.Address)
		}
	}
	return nil, nil, fmt.Errorf("failed after %d attempts: %w", maxRetries, lastErr)
}
```

Behavior After Implementation
Scenario: Relay Dies Mid-Session
```
Request 1: Success   (failCount = 0)
Request 2: 502 Error (failCount = 1)
Request 3: 502 Error (failCount = 2)
Request 4: 502 Error (failCount = 3) → triggers rebuildCircuit()
           ├── Current relays blacklisted for 5 min
           ├── New circuit built with different relays
           └── failCount reset to 0
Request 5: Success   (using the new circuit)
```
Scenario: Multiple Relays Down During Build
```
Attempt 1: Select A, B, C → A is down → fail → A, B, C blacklisted
Attempt 2: Select D, E, F → all healthy → success
```
Summary of Changes
| File | Change | Lines |
|---|---|---|
| `directory.go` | Add blacklist map and methods | ~60 |
| `directory.go` | Modify `FilterByRole` to exclude blacklisted relays | ~5 |
| `main.go` | Add failure tracking and auto-rebuild | ~80 |
| `main.go` | Modify `buildCircuitWithRetry` | ~15 |
Total: ~160 lines of new code
Backwards Compatibility
- No breaking changes to CLI flags
- No changes to circuit protocol
- No changes to directory JSON format
- Existing deployments gain resilience features automatically