80 changes: 80 additions & 0 deletions docs/architecture/ADR-001-static-allocations.md
@@ -0,0 +1,80 @@
# ADR-001: Static Allocations via Box::leak

## Status

Accepted

## Context

The graph-gateway uses Axum as its HTTP framework. Axum's state management requires types to implement `Clone` and have `'static` lifetime. Several gateway components are heavyweight singletons that:

1. Are initialized once at startup
2. Never need to be deallocated (process lifetime)
3. Are expensive to clone (contain channels, cryptographic keys, etc.)

These components include:

- `ReceiptSigner` - TAP receipt signing with private keys
- `Budgeter` - PID controller state for fee management
- `Chains` - Chain head tracking with per-chain state
- `Eip712Domain` (attestation domains) - EIP-712 signing domains

## Decision

Use `Box::leak()` to convert owned `Box<T>` into `&'static T` references for singleton components.

```rust
// Example from main.rs
let receipt_signer: &'static ReceiptSigner = Box::leak(Box::new(ReceiptSigner::new(...)));

let chains: &'static Chains = Box::leak(Box::new(Chains::new(...)));
```
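
The pattern can be tried in isolation with a minimal sketch. `Signer` and `leak_signer` below are hypothetical stand-ins, not the gateway's `ReceiptSigner`, and the "signature" is a toy XOR used only to give the type something to do. The point is that the leaked reference is `Copy` and can be moved into several threads without `Arc`.

```rust
use std::thread;

// Hypothetical stand-in for a heavyweight singleton such as `ReceiptSigner`.
struct Signer {
    key: [u8; 32],
}

impl Signer {
    // Toy "signature": XOR of all key and message bytes (illustration only).
    fn sign(&self, msg: &[u8]) -> u8 {
        self.key.iter().chain(msg).fold(0, |acc, b| acc ^ b)
    }
}

// Leak a boxed value to obtain a reference valid for the rest of the process.
fn leak_signer(key: [u8; 32]) -> &'static Signer {
    Box::leak(Box::new(Signer { key }))
}

fn main() {
    let signer: &'static Signer = leak_signer([7u8; 32]);

    // `&'static Signer` is `Copy`: every closure gets its own copy of the pointer.
    let handles: Vec<_> = (0..4u8)
        .map(|i| thread::spawn(move || signer.sign(&[i])))
        .collect();

    for (i, handle) in handles.into_iter().enumerate() {
        println!("thread {i}: {}", handle.join().unwrap());
    }

    // The original reference is still usable after being copied into the threads.
    println!("main: {}", signer.sign(b"hello"));
}
```

Sharing across threads this way requires the leaked type to be `Sync`, which the real components would also need in order to be used from concurrent handlers.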

## Consequences

### Positive

1. **Zero-cost sharing**: `&'static T` is `Copy`, so passing to handlers has no overhead
2. **No Arc overhead**: Avoids atomic reference counting on every request
3. **Simpler lifetimes**: No need to propagate lifetime parameters through handler types
4. **Explicit intent**: Makes it clear these are process-lifetime singletons

### Negative

1. **Memory never freed**: The leaked memory is never reclaimed. Acceptable because:
   - Components live for the entire process lifetime anyway
   - Total leaked memory is small and bounded (< 1 KB)
   - Process termination reclaims all memory

2. **Not suitable for tests**: Tests that need fresh state must use different patterns. Currently mitigated by limited test coverage.

## Alternatives Considered

### `Arc<T>` (Rejected)

```rust
let receipt_signer: Arc<ReceiptSigner> = Arc::new(ReceiptSigner::new(...));
```

Problems:

- Atomic operations on every clone (per-request overhead)
- More complex to share across Axum handlers
- Implies shared ownership when sole ownership is the intent

### `once_cell::sync::Lazy` (Rejected)

```rust
static RECEIPT_SIGNER: Lazy<ReceiptSigner> = Lazy::new(|| ...);
```

Problems:

- Requires initialization logic in static context
- Cannot use async initialization
- Configuration not available at static init time

## References

- [Axum State Documentation](https://docs.rs/axum/latest/axum/extract/struct.State.html)
- [Box::leak documentation](https://doc.rust-lang.org/std/boxed/struct.Box.html#method.leak)
131 changes: 131 additions & 0 deletions docs/architecture/ADR-002-type-state-pattern.md
@@ -0,0 +1,131 @@
# ADR-002: Type-State Pattern for Indexer Processing

## Status

Accepted

## Context

Indexer information flows through multiple processing stages, with each stage enriching the data:

1. **Raw** - Basic indexer info from network subgraph
2. **Version resolved** - After fetching indexer-service version
3. **Progress resolved** - After fetching indexing progress (block height)
4. **Cost resolved** - After fetching cost model/fee info

Processing order matters: we need version info before we can query for progress (different API versions), and we need progress before cost resolution makes sense (stale indexers are filtered).

A naive approach would use `Option<T>` fields that get populated:

```rust
struct IndexingInfo {
    indexer: IndexerId,
    deployment: DeploymentId,
    version: Option<Version>,      // Filled in stage 2
    progress: Option<BlockNumber>, // Filled in stage 3
    fee: Option<GRT>,              // Filled in stage 4
}
```

This leads to `unwrap()` calls throughout the codebase and runtime errors when accessing fields before they're populated.

## Decision

Use the type-state pattern with generic parameters to encode processing stage at compile time.

```rust
// Type markers for processing stages
struct Unresolved;
struct VersionResolved(Version);
struct ProgressResolved { version: Version, block: BlockNumber }
struct FullyResolved { version: Version, block: BlockNumber, fee: GRT }

// Generic struct parameterized by stage
struct IndexingInfo<Stage> {
    indexer: IndexerId,
    deployment: DeploymentId,
    stage: Stage,
}

// Stage transitions are explicit methods
impl IndexingInfo<Unresolved> {
    fn resolve_version(self, version: Version) -> IndexingInfo<VersionResolved> {
        IndexingInfo {
            indexer: self.indexer,
            deployment: self.deployment,
            stage: VersionResolved(version),
        }
    }
}
```

See `src/network/indexer_processing.rs` for the actual implementation.
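
For readers who want to experiment with the pattern, here is a self-contained sketch of the full chain of transitions. The type aliases (`u32`, `u64`, `u128`, a tuple for `Version`), the `Fee` alias, and the `new` constructor are simplifying assumptions for illustration; the real types and method bodies live in the implementation referenced above.

```rust
// Simplified stand-ins for the real domain types (assumptions for this sketch).
type IndexerId = u32;
type DeploymentId = u32;
type Version = (u32, u32);
type BlockNumber = u64;
type Fee = u128;

// Stage markers: each one carries exactly the data known at that stage.
struct Unresolved;
struct VersionResolved { version: Version }
struct ProgressResolved { version: Version, block: BlockNumber }
struct FullyResolved { version: Version, block: BlockNumber, fee: Fee }

struct IndexingInfo<Stage> {
    indexer: IndexerId,
    deployment: DeploymentId,
    stage: Stage,
}

impl IndexingInfo<Unresolved> {
    fn new(indexer: IndexerId, deployment: DeploymentId) -> Self {
        IndexingInfo { indexer, deployment, stage: Unresolved }
    }

    fn resolve_version(self, version: Version) -> IndexingInfo<VersionResolved> {
        IndexingInfo {
            indexer: self.indexer,
            deployment: self.deployment,
            stage: VersionResolved { version },
        }
    }
}

impl IndexingInfo<VersionResolved> {
    fn resolve_progress(self, block: BlockNumber) -> IndexingInfo<ProgressResolved> {
        IndexingInfo {
            indexer: self.indexer,
            deployment: self.deployment,
            stage: ProgressResolved { version: self.stage.version, block },
        }
    }
}

impl IndexingInfo<ProgressResolved> {
    fn resolve_cost(self, fee: Fee) -> IndexingInfo<FullyResolved> {
        let ProgressResolved { version, block } = self.stage;
        IndexingInfo {
            indexer: self.indexer,
            deployment: self.deployment,
            stage: FullyResolved { version, block, fee },
        }
    }
}

fn main() {
    let info = IndexingInfo::<Unresolved>::new(1, 42)
        .resolve_version((1, 2))
        .resolve_progress(100)
        .resolve_cost(5);

    println!(
        "indexer={} deployment={} version={:?} block={} fee={}",
        info.indexer, info.deployment, info.stage.version, info.stage.block, info.stage.fee
    );

    // Skipping a stage is rejected by the compiler, because `resolve_cost`
    // is only defined for `IndexingInfo<ProgressResolved>`:
    // IndexingInfo::<Unresolved>::new(1, 42).resolve_cost(5);
}
```

Because each transition takes `self` by value, the earlier-stage value cannot be used again after it has been advanced.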

## Consequences

### Positive

1. **Compile-time safety**: Impossible to access version info before it's resolved
2. **Self-documenting**: Function signatures show required processing stage
3. **No runtime checks**: The stage is tracked in the type system and resolved at compile time, so no `Option` checks or runtime stage tags are needed
4. **Explicit transitions**: Stage changes are visible method calls, not silent mutations

### Negative

1. **Verbose types**: `IndexingInfo<ProgressResolved>` is longer than `IndexingInfo`
2. **Learning curve**: Pattern is less common, may confuse new contributors
3. **More boilerplate**: Stage transition methods must be written explicitly

## Pattern Usage

```rust
// Functions declare their required stage in the signature
fn select_candidate(info: &IndexingInfo<FullyResolved>) -> Score {
    // Safe to access info.stage.fee - compiler guarantees it exists
    calculate_score(info.stage.fee, info.stage.block)
}

// Processing pipeline. Each fetch borrows the current value and completes
// before the transition method consumes it; borrowing inside the call's
// argument list would be rejected as a borrow of a moved value.
async fn process_indexer(raw: IndexingInfo<Unresolved>) -> Result<IndexingInfo<FullyResolved>> {
    let version = fetch_version(&raw.indexer).await?;
    let with_version = raw.resolve_version(version);

    let progress = fetch_progress(&with_version).await?;
    let with_progress = with_version.resolve_progress(progress);

    let cost = fetch_cost(&with_progress).await?;
    Ok(with_progress.resolve_cost(cost))
}
```

## Alternatives Considered

### Builder Pattern (Rejected)

```rust
IndexingInfoBuilder::new(indexer, deployment)
.version(v)
.progress(p)
.fee(f)
.build()
```

Problems:

- Runtime validation only
- `build()` must check all fields are set
- No compile-time guarantee of processing order

### Separate Structs (Rejected)

```rust
struct RawIndexingInfo { ... }
struct ResolvedIndexingInfo { ... }
```

Problems:

- Code duplication across struct definitions
- Harder to share common logic
- Type relationships not explicit

## References

- [Typestate Pattern in Rust](https://cliffle.com/blog/rust-typestate/)
- [Parse, don't validate](https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/)
149 changes: 149 additions & 0 deletions docs/architecture/ADR-003-pid-budget-controller.md
@@ -0,0 +1,149 @@
# ADR-003: PID Controller for Fee Budget Management

## Status

Accepted

## Context

The gateway must manage query fee budgets to balance:

1. **Cost efficiency** - Minimize fees paid to indexers
2. **Query success rate** - Ensure queries succeed by offering competitive fees
3. **Responsiveness** - Adapt quickly to market conditions

Static fee budgets fail because:

- Too low: Indexers reject queries, degraded service
- Too high: Overpaying, wasted budget
- Market conditions change: Indexer fees fluctuate based on demand

We need a dynamic system that automatically adjusts fee budgets based on observed success rates.

## Decision

Implement a PID (Proportional-Integral-Derivative) controller to dynamically adjust fee budgets based on query success rate.

### PID Controller Overview

The PID controller continuously adjusts the fee budget using three terms:

```
adjustment = Kp * error + Ki * integral + Kd * derivative

where:
  error      = target_success_rate - actual_success_rate
  integral   = sum of past errors
  derivative = rate of error change
```

- **P (Proportional)**: Immediate response to current error
- **I (Integral)**: Corrects persistent bias over time
- **D (Derivative)**: Dampens oscillations, smooths response
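
The three terms can be sketched as a small discrete controller. This is an illustration under stated assumptions, not the gateway's `PidController`: the `Pid` type, its fields, the unit time step, and the choice to use a zero derivative on the first sample are all hypothetical.

```rust
/// Gains and running state for a discrete PID controller (hypothetical names).
struct Pid {
    kp: f64,
    ki: f64,
    kd: f64,
    target: f64,
    integral: f64,
    prev_error: Option<f64>,
}

impl Pid {
    fn new(kp: f64, ki: f64, kd: f64, target: f64) -> Self {
        Pid { kp, ki, kd, target, integral: 0.0, prev_error: None }
    }

    /// Feed one measurement and return the adjustment for this step.
    fn update(&mut self, actual: f64) -> f64 {
        let error = self.target - actual;

        // Integral term: running sum of all errors seen so far.
        self.integral += error;

        // Derivative term: change in error since the previous sample
        // (assumed to be zero on the very first sample).
        let derivative = self.prev_error.map_or(0.0, |prev| error - prev);
        self.prev_error = Some(error);

        self.kp * error + self.ki * self.integral + self.kd * derivative
    }
}

fn main() {
    // Gains chosen only to keep the arithmetic easy to follow.
    let mut pid = Pid::new(0.5, 0.25, 0.125, 1.0);
    for actual in [0.5, 0.75, 1.0] {
        println!("actual={actual:.2} adjustment={:.5}", pid.update(actual));
    }
}
```

Note that the adjustment stays positive for a while after the measurement reaches the target, because the integral term still carries the accumulated past error.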

### Implementation

See `src/budgets.rs` for implementation:

```rust
pub struct Budgeter {
    controller: PidController,
    decay_buffer: DecayBuffer,
    budget_per_query: f64,
}

impl Budgeter {
    // Takes `&mut self` because recording feedback updates the buffer,
    // the controller state, and the current budget.
    pub fn feedback(&mut self, success: bool) {
        self.decay_buffer.record(success);
        let success_rate = self.decay_buffer.success_rate();
        let adjustment = self.controller.update(success_rate);
        self.budget_per_query *= adjustment;
    }
}
```

### Decay Buffer

Success rate is calculated using exponential decay to weight recent observations more heavily:

```
weighted_sum = sum(success_i * decay^i)
weighted_count = sum(decay^i)
success_rate = weighted_sum / weighted_count
```

This provides:

- Fast response to changing conditions
- Natural forgetting of stale data
- Bounded memory usage
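
One way to get the bounded-memory property is to keep the two weighted sums as running totals instead of storing individual observations. The sketch below uses that recurrence; it is an assumption about one possible formulation, and `DecayingRate` is a hypothetical name, not the gateway's `DecayBuffer`.

```rust
/// Exponentially decayed success rate kept in O(1) memory (hypothetical type).
struct DecayingRate {
    decay: f64,
    weighted_sum: f64,
    weighted_count: f64,
}

impl DecayingRate {
    fn new(decay: f64) -> Self {
        DecayingRate { decay, weighted_sum: 0.0, weighted_count: 0.0 }
    }

    /// Age every earlier observation by one step, then add the new one
    /// with weight 1 (that is, decay^0).
    fn record(&mut self, success: bool) {
        let value = if success { 1.0 } else { 0.0 };
        self.weighted_sum = self.weighted_sum * self.decay + value;
        self.weighted_count = self.weighted_count * self.decay + 1.0;
    }

    /// Returns `None` until at least one observation has been recorded.
    fn success_rate(&self) -> Option<f64> {
        if self.weighted_count > 0.0 {
            Some(self.weighted_sum / self.weighted_count)
        } else {
            None
        }
    }
}

fn main() {
    let mut rate = DecayingRate::new(0.5);
    for outcome in [true, true, false] {
        rate.record(outcome);
        println!("recorded {outcome}: rate = {:?}", rate.success_rate());
    }
}
```

After `n` calls to `record`, the totals equal the sums in the formula above with `i` counting how many observations ago each result was seen, so the newest result always has the largest weight.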

## Consequences

### Positive

1. **Self-tuning**: Budget automatically moves toward the level that meets the target success rate
2. **Adaptive**: Responds to market changes without manual intervention
3. **Stable**: PID controllers are well-understood and tuneable
4. **Observable**: Budget changes can be monitored via metrics

### Negative

1. **Tuning required**: PID gains (Kp, Ki, Kd) must be tuned for the system
2. **Oscillation risk**: Poorly tuned controller can oscillate
3. **Complexity**: More complex than static budgets
4. **Cold start**: Initial budget must be set heuristically

## Tuning Parameters

Current parameters (may need adjustment based on production data):

| Parameter | Value | Purpose |
| --------- | ----- | ----------------------------------------- |
| Kp | 0.1 | Proportional gain - immediate response |
| Ki | 0.01 | Integral gain - bias correction |
| Kd | 0.05 | Derivative gain - oscillation damping |
| Target | 0.95 | Target success rate (95%) |
| Decay | 0.99 | Decay factor for success rate calculation |
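
As a worked illustration of how these gains combine, assume (hypothetically) an observed success rate of 0.85, a previous error of 0.08, an accumulated error sum of 0.40 including the current sample, and a unit time step. None of these sample values come from production data.

```math
\begin{aligned}
e &= 0.95 - 0.85 = 0.10 \\
\text{adjustment} &= K_p \, e + K_i \sum e + K_d \, \Delta e \\
&= 0.1 \times 0.10 + 0.01 \times 0.40 + 0.05 \times (0.10 - 0.08) \\
&= 0.010 + 0.004 + 0.001 = 0.015
\end{aligned}
```

With these gains the proportional term dominates a single step, while the integral term only becomes significant when the error persists over many samples.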

## Alternatives Considered

### Static Budget (Rejected)

```rust
const BUDGET_PER_QUERY: GRT = GRT::from_wei(1_000_000);
```

Problems:

- Cannot adapt to market conditions
- Requires manual intervention to change
- Either overpays or fails queries

### Threshold-based Adjustment (Rejected)

```rust
if success_rate < 0.9 { budget *= 1.1; }
if success_rate > 0.95 { budget *= 0.9; }
```

Problems:

- Oscillates around thresholds
- Step changes cause instability
- No derivative term to dampen oscillations

### Machine Learning Model (Rejected)

Train a model to predict optimal budget based on features.

Problems:

- Requires training data
- Black box behavior
- Overkill for this use case

## References

- [PID Controller (Wikipedia)](https://en.wikipedia.org/wiki/PID_controller)
- [Control Theory for Software Engineers](https://blog.acolyer.org/2015/05/01/feedback-control-for-computer-systems/)
29 changes: 29 additions & 0 deletions src/auth.rs
@@ -1,3 +1,32 @@
//! API Key Authentication
//!
//! Handles API key validation, payment status checks, and domain authorization.
//!
//! # Authentication Flow
//!
//! 1. Extract API key from `Authorization: Bearer <key>` header
//! 2. Parse and validate key format (32-char hex string → 16 bytes)
//! 3. Look up key in `api_keys` map (from Studio API or Kafka)
//! 4. Check payment status (`QueryStatus::Active`, `ServiceShutoff`, `MonthlyCapReached`)
//! 5. Verify origin domain against authorized domains list
//! 6. Return [`AuthSettings`] with user address and authorized subgraphs
//!
//! # Special API Keys
//!
//! Keys in `special_api_keys` bypass payment checks. Used for admin/monitoring.
//!
//! # Domain Authorization
//!
//! The `domains` field supports wildcards:
//! - `"example.com"` → exact match only
//! - `"*.example.com"` → matches `foo.example.com`, `bar.example.com`
//! - Empty list → all domains authorized
//!
//! # API Key Sources
//!
//! - [`studio_api`]: Poll HTTP endpoint periodically
//! - [`kafka`]: Stream updates from Kafka topic

pub mod kafka;
pub mod studio_api;
