
StackbiltCloudExporter: silent no-flush under low-QPS CF Worker isolates #10

@stackbilt-admin

Description

Filed after integration work in `Stackbilt-dev/aegis` (see aegis#432 + aegis#529). Reference impl `stackbilt-web#57` uses `LocalObserveExporter` (direct D1), so this is the first stress-test of the HTTP path under real CF Worker conditions.

Symptom

A Worker configured with:

```ts
createMonitoring({
  service: 'aegis-web',
  version: '...',
  enableTracing: true,
  tracingSampling: 0.2, // or 1.0
  stackbilt: {
    endpoint: 'https://stackbilder.com/api/observe/ingest',
    token: env.STACKBILT_OBSERVE_TOKEN,
    maxBatchSize: 1,
  },
})
```

...emits metrics/spans at normal call sites (`metrics.increment`, wrapped `trace()`), but nothing appears at the ingest endpoint. A manual `curl` POST to the same endpoint with the same token returns HTTP 202 and shows up in `/api/observe/summary` immediately, so the endpoint, token, and network path are all fine. The drop happens somewhere inside the library path.
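For reproduction, the manual probe is equivalent to something like this standalone Node script (the payload field names mirror the workaround exporter below and are assumptions, not documented API):

```ts
// Standalone probe (Node 18+, global fetch): POST one metric point directly.
// Payload shape is an assumption mirroring the workaround exporter below.
const res = await fetch('https://stackbilder.com/api/observe/ingest', {
  method: 'POST',
  headers: {
    authorization: `Bearer ${process.env.STACKBILT_OBSERVE_TOKEN}`,
    'content-type': 'application/json',
  },
  body: JSON.stringify({
    service: 'aegis-web',
    metrics: [{ name: 'probe.manual', value: 1, timestamp: Date.now() }],
  }),
});
console.log(res.status); // 202 here confirms endpoint + token + network path
```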

What was tried (all no-op)

| Attempt | Result |
| --- | --- |
| `maxBatchSize: 1`, so `StackbiltCloudExporter.maybeFlush`'s threshold check is `1 < 1 = false` and falls through to `flush()` on every emit | Not visible |
| `tracer.flush()` in `ctx.waitUntil` after each request (see the sketch below) | Not visible |
| `metrics.flush()` in `ctx.waitUntil` after each request (`MetricsCollector` has its own pre-exporter buffer) | Not visible |
| Both `metrics.flush()` + `tracer.flush()` under `Promise.all` + `ctx.waitUntil` | Not visible |
| Pre-warm the `createMonitoring` cache with the raw Env before the MCP `apiHandler` (Hono middleware is bypassed by OAuthProvider) | Not visible |
| `tracingSampling: 1.0` to rule out sample-drop | Not visible |
| Synchronous `await fetch(endpoint, ...)` as a direct probe from inside a request handler, bypassing the library | Also not visible — `wrangler tail` never showed `console.log` output for this Worker at all, even though it was clearly serving requests |
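The `ctx.waitUntil` attempts (rows 2-4) followed the standard Workers pattern, roughly like this (a sketch; the destructured shape of `createMonitoring`'s return value is an assumption, and `handleRequest` is a hypothetical stand-in for the real handler):

```ts
// Sketch of the flush-after-response attempts (rows 2-4 above).
export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    // Assumed shape: createMonitoring() exposes the collector and tracer.
    const { metrics, tracer } = createMonitoring({ /* config as above */ });
    const response = await handleRequest(request, env); // hypothetical handler
    // Keep the isolate alive past the response so buffered telemetry can drain.
    ctx.waitUntil(Promise.all([metrics.flush(), tracer.flush()]));
    return response;
  },
};
```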

Likely suspects

Without tail visibility the exact layer is opaque, but the ordered chain through the library is:

  1. `metrics.increment()` → `record()` pushes to `MetricsCollector.buffer`
  2. `record()` only flushes when `this.options.batchSize && buffer.length >= batchSize` — but `createMonitoring` wires `MetricsCollector` with `flushInterval: 30000` (a `setInterval`) and no explicit `batchSize`, so the size-triggered flush can never run. On CF Workers the `setInterval` is registered, but the timer is isolate-bound and may never actually fire between short request lifetimes (see the sketch after this list)
  3. An explicit `metrics.flush()` calls `await this.options.export.export(metrics)` — which is `StackbiltCloudExporter.export(items, kind)` — which pushes to `this.metrics[]` and calls `maybeFlush()`
  4. `maybeFlush()` with `maxBatchSize: 1` — the guard `totalItems < this.maxBatchSize` is false for any `totalItems >= 1`, so it should fall through to `flush()`
  5. `flush()` → `doFlush()` → `fetch(endpoint, ...)`
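A reconstruction of the step-2 wiring, as a sketch; every name and body here is inferred from the behavior described in this issue, not taken from the library source:

```ts
// Reconstructed sketch of MetricsCollector's buffering (step 2) - an
// inference from observed behavior, NOT the actual library source.
class MetricsCollectorSketch {
  private buffer: MetricPoint[] = [];

  constructor(private options: { batchSize?: number; flushInterval?: number; export: MetricsExporter }) {
    // createMonitoring wires flushInterval: 30000 and no batchSize.
    if (options.flushInterval) {
      // Isolate-bound timer: between short-lived requests the 30 s tick may
      // never fire before the isolate is recycled, so this path may be dead.
      setInterval(() => void this.flush(), options.flushInterval);
    }
  }

  record(point: MetricPoint) {
    this.buffer.push(point);
    // With batchSize undefined, this size-triggered flush can never run.
    if (this.options.batchSize && this.buffer.length >= this.options.batchSize) {
      void this.flush();
    }
  }

  async flush() {
    if (this.buffer.length === 0) return; // suspect in the first bullet below
    const batch = this.buffer.splice(0);
    await this.options.export.export(batch);
  }
}
```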

The break is somewhere in steps 3–5. Either:

  • `MetricsCollector.flush()` finds `buffer.length === 0` on the explicit call because `record()` wrote straight to the internal buffer but CF Worker memory / isolate state isn't what we expect
  • Or `exporter.maybeFlush()` has a subtle condition that prevents the HTTP POST (see the reconstruction sketched below)
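Likewise, a reconstruction of the exporter side (steps 3–5); again an inference from this issue, not the actual source:

```ts
// Reconstructed sketch of StackbiltCloudExporter's batching (steps 3-5);
// names and control flow are inferences, NOT the library source.
class StackbiltCloudExporterSketch {
  private metrics: MetricPoint[] = [];
  private spans: TraceSpan[] = [];

  constructor(private endpoint: string, private token: string, private maxBatchSize: number) {}

  async export(items: MetricPoint[] | TraceSpan[], kind: 'metrics' | 'spans') {
    if (kind === 'spans') this.spans.push(...(items as TraceSpan[]));
    else this.metrics.push(...(items as MetricPoint[]));
    await this.maybeFlush();
  }

  private async maybeFlush() {
    const totalItems = this.metrics.length + this.spans.length;
    // With maxBatchSize: 1 this guard is false for any totalItems >= 1,
    // so every export() should reach flush(), yet no POST is observed.
    if (totalItems < this.maxBatchSize) return;
    await this.flush();
  }

  async flush() {
    // doFlush() would drain both buffers and fetch(endpoint, ...); if step 4
    // really falls through, the drop has to be here or in the handoff above.
  }
}
```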

Workaround (shipped in aegis)

Bypass `createMonitoring`'s `stackbilt` wiring entirely. Construct `Logger` + `MetricsCollector` + `Tracer` manually. Pass a custom exporter that implements all three of `MetricsExporter`, `SpanExporter`, and `LogOutput` with zero internal buffering — every `export()` is an immediate `fetch()` POST, with `console.error` on any non-OK response.

```ts
class DirectCloudExporter implements MetricsExporter, SpanExporter, LogOutput {
  // Fields the original snippet referenced via `this.*`.
  constructor(private endpoint: string, private token: string, private service: string) {}

  async export(items: MetricPoint[] | TraceSpan[]) {
    // Spans carry a traceId; metric points don't, so one export() handles both.
    const isSpans = items.length > 0 && 'traceId' in items[0];
    await this.post(isSpans ? { service: this.service, spans: items } : { service: this.service, metrics: items });
  }

  async write(entry: LogEntry) {
    await this.post({ service: this.service, logs: [entry] });
  }

  private async post(payload: unknown) {
    // Zero buffering: every call is an immediate POST.
    const res = await fetch(this.endpoint, {
      method: 'POST',
      headers: { authorization: `Bearer ${this.token}`, 'content-type': 'application/json' },
      body: JSON.stringify(payload),
    });
    if (!res.ok) console.error(`[exporter] POST failed status=${res.status}`);
  }
}
```
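Wiring it in looks roughly like the following; the `Logger`/`MetricsCollector`/`Tracer` option names are assumptions based on the handoff described above, not confirmed signatures:

```ts
// Hypothetical wiring sketch - option names are assumptions, not the real API.
const exporter = new DirectCloudExporter(
  'https://stackbilder.com/api/observe/ingest',
  env.STACKBILT_OBSERVE_TOKEN,
  'aegis-web',
);

// No batchSize / flushInterval: every record lands at the exporter immediately.
const metrics = new MetricsCollector({ export: exporter });
const tracer = new Tracer({ export: exporter, sampling: 0.2 });
const logger = new Logger({ output: exporter });
```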

Full implementation at `Stackbilt-dev/aegis` `web/src/lib/direct-cloud-exporter.ts` + `web/src/monitoring.ts` (landed in aegis#530, 2026-04-18).

Verification that the workaround works

With 20% sampling, MCP call volume produces proportional trace counts at stackbilder.com/observe within seconds. Pre-workaround, `aegis-web` showed a single trace (the manual curl from debugging). Post-workaround, traces grow in real time as requests fire: 2 → 8 after 25 MCP calls, i.e. 6 sampled spans, within expected variance for a 20% rate.

Ask

If you can repro: the fix is probably in `StackbiltCloudExporter.maybeFlush()` or the `MetricsCollector` ↔ exporter handoff under CF Worker isolate timing. The `DirectCloudExporter` shape could also be a candidate for inclusion in the library as an alternative to `StackbiltCloudExporter` — especially useful for low-QPS Workers where batching isn't a cost concern.

References

  • aegis#432 (parent, dogfood instrumentation)
  • aegis#529 (debug thread)
  • aegis#530 (workaround PR)
  • `stackbilt-web#57` (reference impl using `LocalObserveExporter`)
