From 2de7de592593cfa88322ee072a1cd8709122c63f Mon Sep 17 00:00:00 2001
From: Sash Zats
Date: Tue, 2 Sep 2025 06:31:46 +0300
Subject: [PATCH] Simplify token usage snippet

---
 ...stretching-context-in-foundation-models.md | 103 ++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 src/content/blog/stretching-context-in-foundation-models.md

diff --git a/src/content/blog/stretching-context-in-foundation-models.md b/src/content/blog/stretching-context-in-foundation-models.md
new file mode 100644
index 0000000..c45523f
--- /dev/null
+++ b/src/content/blog/stretching-context-in-foundation-models.md
@@ -0,0 +1,103 @@
---
title: "Stretching Context in Apple’s Foundation Models"
description: "Strategies to estimate usage and gracefully manage Apple’s 4,096-token limit."
pubDate: 2025-09-15
draft: true
---

# Stretching Context in Apple’s Foundation Models

When I published *Counting Tokens in Foundation Models*, I showed a practical way to approximate usage inside Apple’s 4,096-token hard limit. That post ended with a promise: a follow-up on how to stretch effective context so your app degrades gracefully instead of slamming into the ceiling. This is that follow-up.

## Why Token Estimation Matters

Apple’s Foundation Models don’t expose a tokenizer or usage telemetry. You only find out you’ve crossed the boundary when generation fails, and that’s not acceptable for production apps.

The fix is to build a guardrail: track characters across prompts, instructions, and responses, then convert them to an estimated token count. In my runs, ~4.2 characters per token was a reliable baseline for English prose. Cross-checking against OpenAI’s tiktoken confirmed the trend: the character-based heuristic tends to overstate usage, which is the safe direction to err in, but the failure mode at the real limit is still abrupt.

Here’s the pattern I use:

```swift
// Estimate usage for the whole transcript against the 4,096-token budget.
let result = await TokenUsageEstimator.buildSummary(
    for: session.transcript,
    maxContextTokens: 4096
)
// Cache the stable estimate; it drives every trimming decision below.
self.stableEstimatedTokens = result.stableEstimatedTokens
self.tokenEstimatesSummary = result.summary
```

That single number — `stableEstimatedTokens` — becomes the trigger for every context management decision downstream.

## Strategy 1: Sliding Windows

The simplest way to avoid blowing the budget is to keep only the last few turns. Drop the oldest once you cross a threshold (say, 3,000 tokens).

```swift
// keepRecentTurns(count:) is a small helper that keeps only the newest N turns.
if estimator.stableEstimatedTokens > 3000 {
    transcript = transcript.keepRecentTurns(count: 6)
}
```

This works well for short, focused exchanges. But if your app supports longer conversations, you’ll need more than blunt trimming.

## Strategy 2: Opportunistic Summarization

Instead of throwing away history, compress it. When the budget tightens, collapse older turns into compact summaries: “User asked about X, model replied Y.” This preserves semantic continuity while making room for new input.

```swift
// summarizeOlderEntries() replaces older turns with short, role-aware summaries.
if estimator.stableEstimatedTokens > 3200 {
    transcript = transcript.summarizeOlderEntries()
}
```

The key is to do this proactively, before you hit the wall. Summarization itself consumes tokens, so keep the summaries short and role-aware.

## Strategy 3: Hierarchical Condensation

Flat summaries only take you so far. A better pattern is layered condensation:

- First pass: shorten individual messages. For example:
  - Original: “Sure, I can walk you through the setup of the provisioning API step by step with full code examples and background theory.”
  - Condensed: “Explained provisioning API setup with examples.”
- Second pass: roll multiple compressed messages into topic summaries. For example:
  - Original: a thread of 10 turns about debugging SQLite.
  - Topic summary: “Resolved SQLite concurrency issues, recommended using read-only connections and GRDB notifications.”

This gives you a semantic spine of the conversation that survives multiple rounds of compression while still retaining the key technical outcomes.
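Concretely, the two passes can live in a couple of small helpers. The sketch below is illustrative: `Turn`, `condense`, `summarizeTopic`, and `condenseHistory` are hypothetical app-side names, not Foundation Models API, and a production second pass would usually ask the model for the topic summary instead of joining strings.

```swift
// Illustrative, app-side types and helpers; nothing here is Foundation Models API.
struct Turn {
    enum Role: String { case user = "User", assistant = "Model" }
    var role: Role
    var text: String
}

/// First pass: shorten an individual message that exceeds a character budget.
/// Truncation keeps the sketch self-contained; a real implementation might run
/// a tiny per-message summarization prompt instead.
func condense(_ turn: Turn, maxCharacters: Int = 120) -> Turn {
    guard turn.text.count > maxCharacters else { return turn }
    return Turn(role: turn.role, text: String(turn.text.prefix(maxCharacters)) + "…")
}

/// Second pass: roll a run of already-condensed turns into one topic-summary turn.
func summarizeTopic(_ turns: [Turn]) -> Turn {
    let bullets = turns.map { "\($0.role.rawValue): \($0.text)" }
    return Turn(role: .assistant, text: "Topic summary: " + bullets.joined(separator: "; "))
}

/// Layered condensation: shorten every older turn, collapse the shortened turns
/// into a single summary, and keep the most recent turns verbatim.
func condenseHistory(_ turns: [Turn], keepRecent: Int = 4) -> [Turn] {
    guard turns.count > keepRecent else { return turns }
    let older = turns.dropLast(keepRecent).map { condense($0) }
    return [summarizeTopic(older)] + Array(turns.suffix(keepRecent))
}
```

The exact helpers matter less than the shape: each pass is cheap and composable, so you can rerun condensation whenever the estimate crosses a threshold.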
## Strategy 4: Selective Retention

Not all context is equally important. Tool outputs and logs can eat your budget quickly, so the trick is to retain essential state while discarding noise.

- **Keep:** parameters, IDs, and results the model must reuse. For example: “Calendar event created: ID 87213, date 2025-09-15.”
- **Drop:** verbose output such as raw JSON payloads, intermediate HTTP traces, or multi-line stack traces, unless they’re required for reasoning.

In code, this can look like:

```swift
// retainImportantTools(_:), isCritical, and containsIdentifiers are app-side helpers:
// keep a tool call only if it carries state the model will need to reuse.
transcript = transcript.retainImportantTools { toolCall in
    toolCall.isCritical || toolCall.containsIdentifiers
}
```

By treating structured data as state rather than dialogue, you prevent irrelevant clutter from consuming tokens.

## A Graceful Degradation Policy

Put these strategies together into a pipeline:

1. Monitor estimated tokens continuously.
2. Above 75% of budget, trim the earliest turns.
3. Above 85%, summarize mid-aged turns.
4. If the history is long and repetitive, apply hierarchical condensation.
5. Above 90%, strip tool chatter.
6. Above 95%, fail safe: block new input or force a summarization pass.

The goal isn’t to never hit 4,096 — it’s to never hit it suddenly. A consolidated sketch of this policy appears at the end of the post.

## Closing Thoughts

In this post we covered the why and how of token estimation, and walked through strategies to stretch context without tripping over Apple’s hard 4,096-token limit. Estimation provides the guardrail; context management provides the steering wheel. Together they turn an abrupt ceiling into a manageable budget.

Next time I’ll focus on Strategy 2 — opportunistic summarization — as a happy medium between implementation complexity and token efficiency, and show how to implement it cleanly in production.
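As a closing reference, here is the degradation policy collapsed into a single decision function. It is a minimal sketch: the thresholds mirror the pipeline above, `ContextAction` and `nextAction` are names invented for illustration, and hierarchical condensation is omitted because it is triggered by the shape of the history rather than a percentage.

```swift
/// Actions an app can take as the estimate approaches the 4,096-token budget.
/// Illustrative only; the cases map onto the strategies above, not Apple's API.
enum ContextAction {
    case carryOn
    case trimEarliestTurns       // Strategy 1, above 75% of budget
    case summarizeOlderEntries   // Strategy 2, above 85%
    case stripToolChatter        // Strategy 4, above 90%
    case blockInputAndSummarize  // fail-safe, above 95%
}

/// Map the current estimate to the mildest action that keeps the session under the ceiling.
func nextAction(estimatedTokens: Int, budget: Int = 4096) -> ContextAction {
    let utilization = Double(estimatedTokens) / Double(budget)
    switch utilization {
    case ..<0.75: return .carryOn
    case ..<0.85: return .trimEarliestTurns
    case ..<0.90: return .summarizeOlderEntries
    case ..<0.95: return .stripToolChatter
    default:      return .blockInputAndSummarize
    }
}
```

Dispatching on the result is then a plain `switch` that calls whichever helpers implement each strategy in your app, such as the `keepRecentTurns`, `summarizeOlderEntries`, and `retainImportantTools` snippets earlier in the post.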