From 81600de524a5fac9c33cc5d9b35268f096bee13a Mon Sep 17 00:00:00 2001 From: Vitaly Arbuzov Date: Tue, 1 Jul 2025 16:15:13 -0700 Subject: [PATCH 1/5] Blog post about agentic pipeline using valkey --- ...5-07-01-valkey-powered-agentic-pipeline.md | 110 ++++++++++++++++++ 1 file changed, 110 insertions(+) create mode 100644 content/blog/2025-07-01-valkey-powered-agentic-pipeline.md diff --git a/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md new file mode 100644 index 00000000..d84aee75 --- /dev/null +++ b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md @@ -0,0 +1,110 @@ +### From Tweet to Tailored Feed: How We Built Lightning-Fast Agent Communication with Valkey + +--- + +## Why Agentic Architectures Matter + +The software world is quickly shifting toward agent-based architectures—small, autonomous programs working together to sense their environment, make decisions, and take action. When you have hundreds or even thousands of these agents talking to each other, their communication needs to be rock-solid. It has to be **blazing fast**, completely transparent and observable, and flexible enough to adapt as your agents evolve. + +We found that **Valkey**, a modern fork of Redis, fits perfectly here. It gives us the lightning-fast, in-memory performance we expect from Redis but also bundles first-class modules, a friendlier open-source license, and vibrant community-driven development. Crucially, Valkey offers powerful built-in Streams, Lua scripting, and JSON/Search capabilities—all packed neatly inside a lightweight server. + +To demonstrate Valkey’s capabilities, we built a fun yet realistic demo: a Twitter-style news feed pipeline. Here's what we ended up with: + +``` +NewsFetcher → Enricher → Fan-out → UserFeedBuilder +``` + +Each step runs as a tiny Python agent, glued together seamlessly by Valkey Streams. We hooked it up to a Grafana dashboard so we could watch our little agent ecosystem in action—tracking backlogs, throughput, and latency. + +--- + +## What Does the System Actually Do? + +When new articles come in, the `NewsFetcher` grabs them from external sources (like APIs or RSS feeds) and pushes them into a raw news stream. The `Enricher` then quickly classifies each article’s topic and creates a concise summary before publishing it to a dedicated stream for that topic. + +From there, the `Fan-out` agent takes over, broadcasting articles into thousands of personalized user feeds. Finally, the `UserFeedBuilder` streams these directly into user browsers, updating in real-time. + +This setup lets users see fresh, personalized content instantly—no waiting, no duplicates, and very little memory footprint. + +--- + +## Why Did We Choose Valkey? + +Valkey stood out because it naturally fits agent workloads: + +* **Ultra-Fast Streams and Consumer Groups:** Messages travel between agents in under a millisecond, reliably delivered at least once. +* **Server-Side Logic with Lua:** Complex fan-out and trimming operations happen directly inside Valkey, keeping our Python agents slim and efficient. +* **Built-in JSON and Search Modules:** Enriching or querying payloads happens entirely in memory, dramatically reducing latency. +* **Easy Metrics Integration:** Built-in monitoring lets Grafana show us backlog sizes, latency, and memory usage at a glance. +* **Wide Language Support:** We could easily integrate with Python today and maybe Rust or Go tomorrow without changing the API. 
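
To make the first of those points concrete, here is a minimal sketch of two agents exchanging work through a Valkey Stream with a consumer group. It is illustrative only: the stream, group, and consumer names are invented rather than taken from the demo code, and it uses the Redis-compatible Python client (the Valkey Python client, a redis-py fork, exposes the same commands).

```python
# Illustrative producer/consumer pair over a Valkey Stream (not demo code).
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
STREAM, GROUP, CONSUMER = "news_raw", "enrichers", "enricher-1"

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass

# Producer side: one XADD per article.
r.xadd(STREAM, {"title": "Example headline", "body": "..."})

# Consumer side: read new entries for this group, process, then acknowledge.
entries = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=10, block=5000)
for _stream, messages in entries or []:
    for msg_id, fields in messages:
        print("processing:", fields["title"])
        r.xack(STREAM, GROUP, msg_id)  # ack only after successful processing
```

Because entries are acknowledged only after they are processed, anything left pending by a crashed consumer can be claimed by another member of the group, which is where the at-least-once guarantee comes from.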
+ +--- + +## The Real Story of Our Development Journey + +Like all real-world projects, we hit some bumps and learned plenty along the way. + +We faced a puzzling issue dubbed the "Slinky backlog." When bursts of news came in, the fan-out queues formed a staircase pattern, causing delays. The fix? We moved trimming logic into Valkey itself using Lua scripting. Suddenly, bursts became smooth streams, and our backlogs flattened. + +Another challenge was duplicate articles popping into user feeds. Annoying, right? We solved this by introducing a deduplication step with a simple Redis set (`feed_seen`). This tiny adjustment cut duplicates from an annoying 3% down to a negligible 0.05%. + +Early on, our user interface had a quirky bug—it showed only a single message initially, with everything else piling up in a confusing "Refresh" bucket. After tweaking our React hooks with a short idle timer, the backlog smoothly appeared in the timeline, making our UI feel responsive and intuitive. + +We also discovered issues with missing modules during CI testing. By adding automated checks that confirm Valkey modules load properly in GitHub Actions, we caught configuration mishaps early, saving us headaches down the line. + +Finally, our Grafana dashboard initially looked like a complicated airplane cockpit with over 40 panels! To tame this complexity, we auto-generated simpler layouts, color-coding each pipeline stage to highlight anomalies immediately. Now, spotting problems is effortless. + +--- + +## Observability: Making Monitoring Feel Natural + +Valkey’s native metrics integration was delightful. With just a glance at our Grafana dashboard, we see: + +* How quickly articles are ingested and processed. +* How long messages take to move through the pipeline. +* Memory usage and potential bottlenecks. + +Observability went from a daunting chore to something genuinely enjoyable. + +--- + +## Performance & Reliability We’re Proud Of + +Our modest setup comfortably handles 250 new articles per second, rapidly expanding into 300,000 personalized feed messages, all with just 12 MB of RAM usage. Even better, our end-to-end latency stayed impressively low at around 170 microseconds per Valkey operation. + +Scaling was equally painless—just a single Docker command scaled out our enrichment and fan-out stages effortlessly. For GPU acceleration (for faster classification and summarization), switching from CPU to GPU mode was as easy as flipping a single configuration flag. + +--- + +## Looking Ahead: Integrating LangChain & Beyond + +The next big step is connecting our pipeline to powerful LLM frameworks like LangChain. Imagine conversational agents effortlessly storing context, logging traces, and using natural abstractions like `ValkeyStreamTool`. We’re also prototyping an intuitive Message-Control-Plane (MCP) server to automatically provision streams, set permissions, and trace agent interactions—simplifying agent deployment dramatically. + +Contributors and curious minds are welcome—join us! + +--- + +## Why This Matters to You + +Whether you're building an AI-driven recommendation engine, real-time feature store, or orchestrating IoT devices, Valkey gives you everything needed for lightning-fast, reliable agent communication. Prototype on your laptop, scale to a production-grade cluster, and enjoy a frictionless experience. + +--- + +## Try it Yourself! + +Want to see it in action? 
+ +```bash +git clone https://github.com/vitarb/valkey_agentic_demo.git +cd valkey_agentic_demo +make dev # starts Valkey, agents, Grafana & React UI +``` + +Then open: + +* UI: [http://localhost:8500](http://localhost:8500) +* Grafana: [http://localhost:3000](http://localhost:3000) (login: admin/admin) + +Have questions, ideas, or want to help improve it? Open an issue or PR on GitHub—we’d love to collaborate and see what you build! + + From 7b66bffd3fe2eb2e5044546e65c6b59e193b51e7 Mon Sep 17 00:00:00 2001 From: Vitaly Arbuzov Date: Tue, 1 Jul 2025 16:32:52 -0700 Subject: [PATCH 2/5] Add a section about EC2 execution --- .../2025-07-01-valkey-powered-agentic-pipeline.md | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-) diff --git a/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md index d84aee75..c94e92fa 100644 --- a/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md +++ b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md @@ -40,19 +40,10 @@ Valkey stood out because it naturally fits agent workloads: --- -## The Real Story of Our Development Journey +## Running on GPU +We've invested time creating a self-contained EC2 launch path—a one-liner that provisions everything from scratch: Docker, NVIDIA drivers, cached models, and the full Valkey agentic stack. But getting that to work wasn’t trivial. The default Amazon Linux 2 AMIs lacked modern GPU drivers, and HF Transformers would fail silently if their cache wasn’t pre-warmed. We fixed this with a layered Dockerfile split: a DEPS stage builds the model cache offline, while the final runtime image stays minimal. Add in shell script automation, metadata tagging, and manage.py to orchestrate runs—and now the EC2 path works reliably, GPU or not. -Like all real-world projects, we hit some bumps and learned plenty along the way. - -We faced a puzzling issue dubbed the "Slinky backlog." When bursts of news came in, the fan-out queues formed a staircase pattern, causing delays. The fix? We moved trimming logic into Valkey itself using Lua scripting. Suddenly, bursts became smooth streams, and our backlogs flattened. - -Another challenge was duplicate articles popping into user feeds. Annoying, right? We solved this by introducing a deduplication step with a simple Redis set (`feed_seen`). This tiny adjustment cut duplicates from an annoying 3% down to a negligible 0.05%. - -Early on, our user interface had a quirky bug—it showed only a single message initially, with everything else piling up in a confusing "Refresh" bucket. After tweaking our React hooks with a short idle timer, the backlog smoothly appeared in the timeline, making our UI feel responsive and intuitive. - -We also discovered issues with missing modules during CI testing. By adding automated checks that confirm Valkey modules load properly in GitHub Actions, we caught configuration mishaps early, saving us headaches down the line. - -Finally, our Grafana dashboard initially looked like a complicated airplane cockpit with over 40 panels! To tame this complexity, we auto-generated simpler layouts, color-coding each pipeline stage to highlight anomalies immediately. Now, spotting problems is effortless. +That setup now lets anyone launch the full demo on a fresh AWS box with GPU acceleration and working ports in ~5 minutes. 
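
To give a feel for what that launch path does, here is a rough, hypothetical user-data sketch. It is not the repository's actual script: package names vary by AMI, and GPU hosts additionally need the NVIDIA driver and container toolkit, while CPU-only hosts can skip that step entirely.

```bash
#!/usr/bin/env bash
# Hypothetical EC2 user-data sketch (illustrative, not the repo's launch script).
set -euo pipefail

yum install -y git docker        # Amazon Linux; use apt-get on Debian/Ubuntu AMIs
systemctl enable --now docker

# GPU instances also need the NVIDIA driver + nvidia-container-toolkit here;
# the demo falls back to CPU automatically when no GPU is visible.

git clone https://github.com/vitarb/valkey_agentic_demo.git /opt/valkey_agentic_demo
cd /opt/valkey_agentic_demo
make dev                         # Valkey, agents, Grafana, React UI
```

In the real setup the Hugging Face model cache is pre-built in the Dockerfile's DEPS stage, so the first run does not have to download model weights at boot.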
--- From 9417961487c5306cdb2713b7fc50fc36f1bbdc0f Mon Sep 17 00:00:00 2001 From: Vitaly Arbuzov Date: Wed, 2 Jul 2025 15:07:46 -0700 Subject: [PATCH 3/5] Improve blog post --- ...5-07-01-valkey-powered-agentic-pipeline.md | 215 ++++++++++++++---- 1 file changed, 173 insertions(+), 42 deletions(-) diff --git a/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md index c94e92fa..46054ced 100644 --- a/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md +++ b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md @@ -1,101 +1,232 @@ -### From Tweet to Tailored Feed: How We Built Lightning-Fast Agent Communication with Valkey +### From Tweet to Tailored Feed: How We Built Lightning‑Fast Agent Communication with **Valkey** --- ## Why Agentic Architectures Matter -The software world is quickly shifting toward agent-based architectures—small, autonomous programs working together to sense their environment, make decisions, and take action. When you have hundreds or even thousands of these agents talking to each other, their communication needs to be rock-solid. It has to be **blazing fast**, completely transparent and observable, and flexible enough to adapt as your agents evolve. +The software world is quickly shifting toward **agent‑based** architectures—small, autonomous programs that sense, decide, and act. When you are running hundreds of these agents, their communication layer must be -We found that **Valkey**, a modern fork of Redis, fits perfectly here. It gives us the lightning-fast, in-memory performance we expect from Redis but also bundles first-class modules, a friendlier open-source license, and vibrant community-driven development. Crucially, Valkey offers powerful built-in Streams, Lua scripting, and JSON/Search capabilities—all packed neatly inside a lightweight server. +* **Blazing‑fast** (so the system feels alive) +* **Observable** (so you can prove it is alive) +* **Flexible** (so you can evolve without rewiring everything) -To demonstrate Valkey’s capabilities, we built a fun yet realistic demo: a Twitter-style news feed pipeline. Here's what we ended up with: +**Valkey**—a community‑driven fork of Redis—hits the sweet spot: Redis‑grade speed, an Apache‑2 license, built‑in Streams, Lua, JSON, Search, plus an extension mechanism that lets you keep everything in a single lightweight server. + +--- + +## The Demo Pipeline ``` -NewsFetcher → Enricher → Fan-out → UserFeedBuilder +NewsFetcher ➜ Enricher ➜ Fan‑out ➜ UserFeedBuilder (React UI) ``` -Each step runs as a tiny Python agent, glued together seamlessly by Valkey Streams. We hooked it up to a Grafana dashboard so we could watch our little agent ecosystem in action—tracking backlogs, throughput, and latency. +Each box is a **tiny Python agent** talking over Valkey Streams. A one‑liner (`make dev`) launches the full stack—Valkey, Prometheus, Grafana, agents, and a React UI—in ≈ 5 minutes on a fresh EC2 box (GPU optional). --- -## What Does the System Actually Do? +## What the System Does -When new articles come in, the `NewsFetcher` grabs them from external sources (like APIs or RSS feeds) and pushes them into a raw news stream. The `Enricher` then quickly classifies each article’s topic and creates a concise summary before publishing it to a dedicated stream for that topic. +1. **`NewsFetcher`** ingests raw news articles into the `news_raw` Stream. +2. **`Enricher`** classifies each article’s topic and writes it to `topic:` Streams. +3. 
**`Fan‑out`** distributes every article to thousands of personalised per‑user feeds. +4. **`UserFeedBuilder`** (a thin WebSocket gateway) pushes updates straight into browsers. + +Users see a personalised timeline **instantly**—no duplicates, tiny memory footprint. + +--- -From there, the `Fan-out` agent takes over, broadcasting articles into thousands of personalized user feeds. Finally, the `UserFeedBuilder` streams these directly into user browsers, updating in real-time. +## Why We Picked Valkey -This setup lets users see fresh, personalized content instantly—no waiting, no duplicates, and very little memory footprint. +* **Streams + Consumer Groups** → sub‑millisecond hops, at‑least‑once delivery. +* **Server‑side Lua** → heavy fan‑out logic stays *inside* Valkey. +* **JSON & Search Modules** → enrich / query payloads without touching disk. +* **First‑class Metrics** → Prometheus exporter shows backlog, latency, memory. +* **Language‑agnostic** → today Python, tomorrow Rust/Go, same API. --- -## Why Did We Choose Valkey? +## Battle‑tested Fixes (and the Code Behind Them) -Valkey stood out because it naturally fits agent workloads: +### 1. Smoothing the “Slinky Backlog” with Lua -* **Ultra-Fast Streams and Consumer Groups:** Messages travel between agents in under a millisecond, reliably delivered at least once. -* **Server-Side Logic with Lua:** Complex fan-out and trimming operations happen directly inside Valkey, keeping our Python agents slim and efficient. -* **Built-in JSON and Search Modules:** Enriching or querying payloads happens entirely in memory, dramatically reducing latency. -* **Easy Metrics Integration:** Built-in monitoring lets Grafana show us backlog sizes, latency, and memory usage at a glance. -* **Wide Language Support:** We could easily integrate with Python today and maybe Rust or Go tomorrow without changing the API. +Bursty input created staircase‑shaped queues. +Instead of trimming in Python we let Valkey do it: + +```lua +-- fanout.lua – executed atomically inside Valkey +-- KEYS[1] = topic stream key +-- ARGV[1] = max_len +redis.call('XTRIM', KEYS[1], 'MAXLEN', tonumber(ARGV[1])) +return 1 +``` + +Loaded once, invoked thousands of times per second—no extra RTT, no backlog waves. --- -## Running on GPU -We've invested time creating a self-contained EC2 launch path—a one-liner that provisions everything from scratch: Docker, NVIDIA drivers, cached models, and the full Valkey agentic stack. But getting that to work wasn’t trivial. The default Amazon Linux 2 AMIs lacked modern GPU drivers, and HF Transformers would fail silently if their cache wasn’t pre-warmed. We fixed this with a layered Dockerfile split: a DEPS stage builds the model cache offline, while the final runtime image stays minimal. Add in shell script automation, metadata tagging, and manage.py to orchestrate runs—and now the EC2 path works reliably, GPU or not. +### 2. Killing Duplicates with a 24 h “Seen” Set + +```python +# agents/fanout.py (excerpt) +seen_key = f"feed_seen:{uid}" +added = await r.sadd(seen_key, doc_id) +if added == 0: # already delivered → skip + DUP_SKIP.inc() + continue +await r.expire(seen_key, SEEN_TTL, nx=True) # lazy‑set 24 h TTL +``` -That setup now lets anyone launch the full demo on a fresh AWS box with GPU acceleration and working ports in ~5 minutes. +A six‑line patch helped to get rid of duplicate posts. --- -## Observability: Making Monitoring Feel Natural +### 3. GPU? Flip One Flag, Export One Metric -Valkey’s native metrics integration was delightful. 
With just a glance at our Grafana dashboard, we see: +```python +# agents/enrich.py (device auto‑select + Prometheus gauge) +USE_CUDA_ENV = os.getenv("ENRICH_USE_CUDA", "auto").lower() +DEVICE = 0 if USE_CUDA_ENV == "1" or ( + USE_CUDA_ENV == "auto" and torch.cuda.is_available()) else -1 -* How quickly articles are ingested and processed. -* How long messages take to move through the pipeline. -* Memory usage and potential bottlenecks. +GPU_GAUGE = Gauge( + "enrich_gpu", + "1 if this enrich replica is running on GPU; 0 otherwise", +) +GPU_GAUGE.set(1 if DEVICE >= 0 else 0) +``` -Observability went from a daunting chore to something genuinely enjoyable. +Replica‑level visibility means Grafana instantly shows how many workers actually run on CUDA after a deploy. --- -## Performance & Reliability We’re Proud Of +### 4. Autoscaling the Feed Reader—No K8s Required -Our modest setup comfortably handles 250 new articles per second, rapidly expanding into 300,000 personalized feed messages, all with just 12 MB of RAM usage. Even better, our end-to-end latency stayed impressively low at around 170 microseconds per Valkey operation. +```python +# agents/user_reader.py (dynamic pops / sec) +latest_uid = int(await r.get("latest_uid") or 0) +target_rps = min(MAX_RPS, max(1.0, latest_uid * POP_RATE)) +delay = 1.0 / target_rps +TARGET_RPS.set(target_rps) # Prometheus gauge +``` -Scaling was equally painless—just a single Docker command scaled out our enrichment and fan-out stages effortlessly. For GPU acceleration (for faster classification and summarization), switching from CPU to GPU mode was as easy as flipping a single configuration flag. +More users appear → the agent raises its own throughput linearly, capped for safety. Zero orchestrator glue code. --- -## Looking Ahead: Integrating LangChain & Beyond +### 5. CI Guardrails: “Fail Fast if Valkey is Mis‑configured” + +```yaml +# .github/workflows/build.yml +- name: Verify Valkey JSON module present + run: | + docker run -d --name valkey-check valkey/valkey-extensions:8.1-bookworm + for i in {1..5}; do + if docker exec valkey-check valkey-cli MODULE LIST | grep -q json; then + docker rm -f valkey-check && exit 0 + fi + sleep 1 + done + echo "Valkey JSON module missing"; docker logs valkey-check || true; exit 1 +``` + +One flaky staging deploy convinced us to turn the check into a mandatory gate. + +--- -The next big step is connecting our pipeline to powerful LLM frameworks like LangChain. Imagine conversational agents effortlessly storing context, logging traces, and using natural abstractions like `ValkeyStreamTool`. We’re also prototyping an intuitive Message-Control-Plane (MCP) server to automatically provision streams, set permissions, and trace agent interactions—simplifying agent deployment dramatically. +## Running on GPU (the Docker Magic) + +A two‑stage Dockerfile keeps the final image small **and** ships a warmed‑up HF model cache: + +```dockerfile +# builder stage +FROM python:3.12-slim AS DEPS +... +RUN --mount=type=cache,target=/root/.cache/pip \ + pip install torch==2.2.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html +RUN --mount=type=cache,target=/opt/hf_cache \ + python - <<'PY' +from transformers import pipeline +pipeline('zero-shot-classification', + model='typeform/distilbert-base-uncased-mnli') +PY + +# runtime stage – just copy deps + cache +FROM python:3.12-slim +COPY --from=DEPS /usr/local /usr/local +COPY --from=DEPS /opt/hf_cache/ /opt/hf_cache/ +``` -Contributors and curious minds are welcome—join us! 
+Cold‑start on an EC2 g5.xlarge? **≈ 30 s** until the first batch is classified. --- -## Why This Matters to You +## Observability Feels Native + +Prometheus + Grafana came almost “for free” because every agent exports its own counters & histograms. Highlights: + +* `enrich_classifier_latency_seconds` → p99 stays < 12 ms on A10G. +* `topic_stream_len` → reveals hot topics at a glance. +* `histogram_quantile()` over Valkey’s ping histogram → live *µs* latency. -Whether you're building an AI-driven recommendation engine, real-time feature store, or orchestrating IoT devices, Valkey gives you everything needed for lightning-fast, reliable agent communication. Prototype on your laptop, scale to a production-grade cluster, and enjoy a frictionless experience. +Grafana auto‑generates a 4‑column dashboard (yes, via a Python script in `tools/bootstrap_grafana.py`!), so adding a metric is a one‑line change. --- -## Try it Yourself! +## Performance Snapshot -Want to see it in action? +| Metric | Value | +| ---------------------- | ------------------------------ | +| Articles ingested | **250 / s** | +| Personalised feed msgs | **300 k / s** | +| Valkey RAM | **12 MB** steady | +| p99 Valkey op | **< 200 µs** | +| GPU uplift | **3.6×** faster classification | + +Scaling up was as simple as: + +```bash +docker compose up \ + --scale enrich=6 \ + --scale fanout=3 \ + --scale reader=4 +``` + +--- + +## Looking Ahead + +* **LangChain integration** – drop‑in `ValkeyStreamTool`. +* **Message‑Control‑Plane (MCP)** – auto‑provision streams & ACLs. +* **Rust agents** – same Streams API, zero Python. + +PRs, ideas, and critiques are all welcome—**join us!** + +--- + +## Why This Matters to You + +If you are building… + +* an AI‑driven recommendation engine, +* a real‑time feature store, **or** +* an IoT swarm with thousands of sensors, + +…Valkey is the **glue layer** that keeps state consistent and messages flying while your agents stay blissfully simple. + +--- + +## Try It Yourself ```bash git clone https://github.com/vitarb/valkey_agentic_demo.git cd valkey_agentic_demo -make dev # starts Valkey, agents, Grafana & React UI +make dev # Valkey + agents + Grafana + React UI ``` -Then open: - -* UI: [http://localhost:8500](http://localhost:8500) -* Grafana: [http://localhost:3000](http://localhost:3000) (login: admin/admin) +Open: -Have questions, ideas, or want to help improve it? Open an issue or PR on GitHub—we’d love to collaborate and see what you build! +* **Feed UI:** [http://localhost:8500](http://localhost:8500) +* **Grafana:** [http://localhost:3000](http://localhost:3000) (`admin / admin`) +Questions? Ideas? Open an issue or PR—we’d love to see what you build next. 
From 7814aab48a567120eac66d102d01d5fc1d449851 Mon Sep 17 00:00:00 2001 From: Vitaly Arbuzov Date: Wed, 16 Jul 2025 15:35:05 -0700 Subject: [PATCH 4/5] Update blogpost making it more narrative driven --- ...5-07-01-valkey-powered-agentic-pipeline.md | 252 +++++++----------- 1 file changed, 97 insertions(+), 155 deletions(-) diff --git a/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md index 46054ced..888c9abf 100644 --- a/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md +++ b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md @@ -1,232 +1,174 @@ -### From Tweet to Tailored Feed: How We Built Lightning‑Fast Agent Communication with **Valkey** +## Lightning-Fast Agent Messaging with Valkey ---- - -## Why Agentic Architectures Matter - -The software world is quickly shifting toward **agent‑based** architectures—small, autonomous programs that sense, decide, and act. When you are running hundreds of these agents, their communication layer must be +Modern applications are slipping away from monoliths toward fleets of specialised agents—small programs that sense, decide, and act in tight, real-time loops. When hundreds of them interact, their messaging layer must be lightning-quick, observable, and flexible enough to evolve without painful rewrites. -* **Blazing‑fast** (so the system feels alive) -* **Observable** (so you can prove it is alive) -* **Flexible** (so you can evolve without rewiring everything) - -**Valkey**—a community‑driven fork of Redis—hits the sweet spot: Redis‑grade speed, an Apache‑2 license, built‑in Streams, Lua, JSON, Search, plus an extension mechanism that lets you keep everything in a single lightweight server. +That need led us to **Valkey**—an open-source, community-driven, in-memory database fully compatible with Redis. Streams, Lua scripting, a mature JSON & Search stack, and a lightweight extension mechanism all live inside one process, giving our agents a fast, shared nervous system. --- -## The Demo Pipeline - -``` -NewsFetcher ➜ Enricher ➜ Fan‑out ➜ UserFeedBuilder (React UI) -``` +### Inside the Pipeline — Code & Commentary -Each box is a **tiny Python agent** talking over Valkey Streams. A one‑liner (`make dev`) launches the full stack—Valkey, Prometheus, Grafana, agents, and a React UI—in ≈ 5 minutes on a fresh EC2 box (GPU optional). +Whenever a headline arrives, it passes through four focused agents. We’ll trace that journey and highlight the micro-optimisations that keep agent-to-agent latency in the low microseconds. ---- - -## What the System Does +```text +NewsFetcher → Enricher → Fan-out → UserFeedBuilder → React UI +``` -1. **`NewsFetcher`** ingests raw news articles into the `news_raw` Stream. -2. **`Enricher`** classifies each article’s topic and writes it to `topic:` Streams. -3. **`Fan‑out`** distributes every article to thousands of personalised per‑user feeds. -4. **`UserFeedBuilder`** (a thin WebSocket gateway) pushes updates straight into browsers. +#### Stage 1 – NewsFetcher (pushes raw headlines) -Users see a personalised timeline **instantly**—no duplicates, tiny memory footprint. +```python +# fetcher.py – 250 msgs / s +await r.xadd("news_raw", {"id": idx, "title": title, "body": text}) +``` ---- +Adds each raw article to the `news_raw` stream so downstream agents can pick it up. -## Why We Picked Valkey +#### Stage 2 – Enricher (classifies on GPU if available) -* **Streams + Consumer Groups** → sub‑millisecond hops, at‑least‑once delivery. 
-* **Server‑side Lua** → heavy fan‑out logic stays *inside* Valkey. -* **JSON & Search Modules** → enrich / query payloads without touching disk. -* **First‑class Metrics** → Prometheus exporter shows backlog, latency, memory. -* **Language‑agnostic** → today Python, tomorrow Rust/Go, same API. +```python +# enrich.py – device pick, GPU gauge +DEVICE = 0 if torch.cuda.is_available() else -1 # −1 → CPU +GPU_GAUGE.set(1 if DEVICE >= 0 else 0) +``` ---- +Detects whether a GPU is present and records the result in a Prometheus gauge. -## Battle‑tested Fixes (and the Code Behind Them) +```python +# classify then publish to a topic stream +pipe.xadd(f"topic:{doc['topic']}", {"data": json.dumps(payload)}) +``` -### 1. Smoothing the “Slinky Backlog” with Lua +Writes the enriched article into its `topic:` stream for later fan-out. -Bursty input created staircase‑shaped queues. -Instead of trimming in Python we let Valkey do it: +#### Stage 3 – Fan-out (duplicates to per-user feeds + dedupe) ```lua --- fanout.lua – executed atomically inside Valkey --- KEYS[1] = topic stream key --- ARGV[1] = max_len +-- fanout.lua – smooths burst traffic +-- ARGV[1] = max stream length (e.g. 10000) +-- Trim ensures old messages don’t balloon memory or backlog redis.call('XTRIM', KEYS[1], 'MAXLEN', tonumber(ARGV[1])) -return 1 ``` -Loaded once, invoked thousands of times per second—no extra RTT, no backlog waves. - ---- - -### 2. Killing Duplicates with a 24 h “Seen” Set +Atomically trims each topic stream inside Valkey to keep memory and queues flat. ```python -# agents/fanout.py (excerpt) -seen_key = f"feed_seen:{uid}" -added = await r.sadd(seen_key, doc_id) -if added == 0: # already delivered → skip - DUP_SKIP.inc() - continue -await r.expire(seen_key, SEEN_TTL, nx=True) # lazy‑set 24 h TTL +# fanout.py – per-user de-duplication +added = await r.sadd(f"feed_seen:{uid}", doc_id) +if added == 0: + continue # duplicate → skip +# 24h TTL for dedup tracking; NX avoids overwriting if already set +await r.expire(f"feed_seen:{uid}", 86_400, nx=True) ``` -A six‑line patch helped to get rid of duplicate posts. +Skips any article a user has already seen by tracking IDs in a 24-hour set. ---- - -### 3. GPU? Flip One Flag, Export One Metric +#### Stage 4 – UserFeedBuilder (tails the stream over WebSockets) ```python -# agents/enrich.py (device auto‑select + Prometheus gauge) -USE_CUDA_ENV = os.getenv("ENRICH_USE_CUDA", "auto").lower() -DEVICE = 0 if USE_CUDA_ENV == "1" or ( - USE_CUDA_ENV == "auto" and torch.cuda.is_available()) else -1 - -GPU_GAUGE = Gauge( - "enrich_gpu", - "1 if this enrich replica is running on GPU; 0 otherwise", -) -GPU_GAUGE.set(1 if DEVICE >= 0 else 0) +# gateway/main.py – live feed push +msgs = await r.xread({stream: last_id}, block=0, count=1) +await ws.send_json(json.loads(msgs[0][1][0][1]["data"])) ``` -Replica‑level visibility means Grafana instantly shows how many workers actually run on CUDA after a deploy. +Continuously reads from the user’s feed stream and pushes each new item over a WebSocket. ---- - -### 4. 
Autoscaling the Feed Reader—No K8s Required +#### Self-Tuning Readers (load generator & consumer) ```python -# agents/user_reader.py (dynamic pops / sec) -latest_uid = int(await r.get("latest_uid") or 0) +# user_reader.py – dynamic pacing target_rps = min(MAX_RPS, max(1.0, latest_uid * POP_RATE)) -delay = 1.0 / target_rps -TARGET_RPS.set(target_rps) # Prometheus gauge +await asyncio.sleep(1.0 / target_rps) ``` -More users appear → the agent raises its own throughput linearly, capped for safety. Zero orchestrator glue code. +Adjusts its own consumption rate to match the current user count—no external autoscaler needed. --- -### 5. CI Guardrails: “Fail Fast if Valkey is Mis‑configured” - -```yaml -# .github/workflows/build.yml -- name: Verify Valkey JSON module present - run: | - docker run -d --name valkey-check valkey/valkey-extensions:8.1-bookworm - for i in {1..5}; do - if docker exec valkey-check valkey-cli MODULE LIST | grep -q json; then - docker rm -f valkey-check && exit 0 - fi - sleep 1 - done - echo "Valkey JSON module missing"; docker logs valkey-check || true; exit 1 -``` - -One flaky staging deploy convinced us to turn the check into a mandatory gate. +A single `make` command spins up Valkey, agents, Grafana, and the UI under Docker Compose in roughly five minutes. If the host has a GPU, the Enricher detects and uses it automatically; otherwise it proceeds on CPU with the same code path. --- -## Running on GPU (the Docker Magic) - -A two‑stage Dockerfile keeps the final image small **and** ships a warmed‑up HF model cache: - -```dockerfile -# builder stage -FROM python:3.12-slim AS DEPS -... -RUN --mount=type=cache,target=/root/.cache/pip \ - pip install torch==2.2.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html -RUN --mount=type=cache,target=/opt/hf_cache \ - python - <<'PY' -from transformers import pipeline -pipeline('zero-shot-classification', - model='typeform/distilbert-base-uncased-mnli') -PY - -# runtime stage – just copy deps + cache -FROM python:3.12-slim -COPY --from=DEPS /usr/local /usr/local -COPY --from=DEPS /opt/hf_cache/ /opt/hf_cache/ -``` +### Why We Bet on Valkey -Cold‑start on an EC2 g5.xlarge? **≈ 30 s** until the first batch is classified. +Streams and consumer groups move messages in well under a millisecond, Lua keeps heavy fan-out logic server-side, and JSON / Search lets enrichment stay in memory. Grafana began charting backlog lengths and latency immediately, and swapping Python agents for Rust or Go required no datastore changes. The Redis compatibility is genuine—we didn’t tweak a single configuration knob when moving from Redis to Valkey. --- -## Observability Feels Native +### Challenges on the Road — and How We Solved Them -Prometheus + Grafana came almost “for free” because every agent exports its own counters & histograms. Highlights: +**Bursty traffic turned streams into “slinkies.”** +Our first load test looked like a staircase: sudden article bursts piled up and only drained once the wave had passed. Pushing a ten-line Lua XTRIM script into Valkey meant trimming happened atomically, right where the data lived. Queue lengths flattened almost instantly. -* `enrich_classifier_latency_seconds` → p99 stays < 12 ms on A10G. -* `topic_stream_len` → reveals hot topics at a glance. -* `histogram_quantile()` over Valkey’s ping histogram → live *µs* latency. +**Users started seeing déjà-vu in their feeds.** +A subtle race caused the same article ID to reach a user twice. 
We fixed it by introducing a tiny “seen” set per user (`feed_seen:`). If `SADD` returns 0, the item is silently skipped. Dupes dropped from roughly 3% to effectively zero, and the extra memory footprint was trivial. -Grafana auto‑generates a 4‑column dashboard (yes, via a Python script in `tools/bootstrap_grafana.py`!), so adding a metric is a one‑line change. +**Some replicas bragged about GPUs they didn’t have.** +On mixed CPU/GPU clusters, a few Enricher containers claimed CUDA but actually ran on CPU. Emitting a one-shot Prometheus gauge (`enrich_gpu`) exposed the truth in Grafana, so mis-scheduled pods are obvious at a glance. + +**Reader throughput lagged behind user growth.** +Instead of wiring up a Kubernetes HPA, we let the reader recalculate its own pops-per-second each second (`latest_uid * POP_RATE`). More users? Faster loop. Peak load? The delay clamps at a safe maximum. No Helm charts, no YAML deep dives. + +**A missing module once took down staging.** +Someone built Valkey without the JSON module; enrichment crashed only after deploy. Our CI pipeline now boots a throw-away Valkey container, runs `MODULE LIST`, and fails the build if anything critical is absent—misconfigurations caught before merge. --- -## Performance Snapshot +### Observability That Comes Standard -| Metric | Value | -| ---------------------- | ------------------------------ | -| Articles ingested | **250 / s** | -| Personalised feed msgs | **300 k / s** | -| Valkey RAM | **12 MB** steady | -| p99 Valkey op | **< 200 µs** | -| GPU uplift | **3.6×** faster classification | +Because every agent exports counters and histograms, Grafana’s Agent Overview dashboard fills itself: -Scaling up was as simple as: +* ingestion, enrichment, and fan-out rates +* topic-specific backlog lengths +* p50 / p99 command latency (µs) +* dataset-only memory use, network throughput, connected-client count +* exact number of Enricher replicas running on GPU right now -```bash -docker compose up \ - --scale enrich=6 \ - --scale fanout=3 \ - --scale reader=4 -``` +A helper script (`tools/bootstrap_grafana.py`) rewrites the dashboard whenever we add a metric, so panels stay readable and colour-coded. --- -## Looking Ahead +### Performance Snapshot -* **LangChain integration** – drop‑in `ValkeyStreamTool`. -* **Message‑Control‑Plane (MCP)** – auto‑provision streams & ACLs. -* **Rust agents** – same Streams API, zero Python. +* **Raw articles ingested:** 250 / s +* **Personalised feed messages:** 300k / s +* **Valkey RAM (steady):** 12 MB +* **p99 Valkey op latency:** ≈ 200 µs +* **GPU uplift (A10G):** 3.6× faster enrichment -PRs, ideas, and critiques are all welcome—**join us!** +Scaling up is a single Docker command—no Helm charts, no YAML deep dives. --- -## Why This Matters to You +### What’s Next + +Our long-term goal is to make agent networks something you can spin up and evolve in minutes, not weeks. We’re betting that agent-based infrastructure is the next primitive—and we want it to be drop-in simple. -If you are building… +* **LangChain integration** — idiomatic `ValkeyStreamTool` for LLM workflows +* **Message Control Plane** — auto-provision streams, ACLs, metrics +* **Rust agents** — lower memory, same Streams API -* an AI‑driven recommendation engine, -* a real‑time feature store, **or** -* an IoT swarm with thousands of sensors, +Pull requests and fresh ideas are always welcome. -…Valkey is the **glue layer** that keeps state consistent and messages flying while your agents stay blissfully simple. 
+--- + +### Why It Might Matter to You + +Whether you’re building a recommendation engine, a real-time feature store, or an IoT swarm, Valkey supplies stateful speed, built-in observability, and room to evolve—while your agents stay blissfully focused on their own jobs. --- -## Try It Yourself +### Try It Yourself + +You can spin up the full system in one command: ```bash git clone https://github.com/vitarb/valkey_agentic_demo.git cd valkey_agentic_demo -make dev # Valkey + agents + Grafana + React UI +make ``` -Open: - -* **Feed UI:** [http://localhost:8500](http://localhost:8500) -* **Grafana:** [http://localhost:3000](http://localhost:3000) (`admin / admin`) +Then open: -Questions? Ideas? Open an issue or PR—we’d love to see what you build next. +* **Feed UI**: [http://localhost:8500](http://localhost:8500) +* **Grafana**: [http://localhost:3000](http://localhost:3000) (admin / admin) From ea098b4507cdbd19173c327f37ebfe00c7192db2 Mon Sep 17 00:00:00 2001 From: Vitaly Arbuzov Date: Wed, 23 Jul 2025 14:10:28 -0700 Subject: [PATCH 5/5] Make some edits improving readibility and flow of the article --- ...5-07-01-valkey-powered-agentic-pipeline.md | 155 +++++++++--------- 1 file changed, 79 insertions(+), 76 deletions(-) diff --git a/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md index 888c9abf..a5909542 100644 --- a/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md +++ b/content/blog/2025-07-01-valkey-powered-agentic-pipeline.md @@ -1,68 +1,78 @@ -## Lightning-Fast Agent Messaging with Valkey +# Lightning-Fast Agent Messaging with Valkey -Modern applications are slipping away from monoliths toward fleets of specialised agents—small programs that sense, decide, and act in tight, real-time loops. When hundreds of them interact, their messaging layer must be lightning-quick, observable, and flexible enough to evolve without painful rewrites. +## From Tweet to Tailored Feed -That need led us to **Valkey**—an open-source, community-driven, in-memory database fully compatible with Redis. Streams, Lua scripting, a mature JSON & Search stack, and a lightweight extension mechanism all live inside one process, giving our agents a fast, shared nervous system. +Modern applications are moving beyond monoliths into distributed fleets of specialized agents—small programs that sense, decide, and act in real-time. When hundreds of these interact, their messaging layer must be lightning-fast, observable, and flexible enough to evolve without rewrites. ---- +That requirement led us to **Valkey**: an open-source, community-driven, in-memory database fully compatible with Redis. With streams, Lua scripting, a mature JSON & Search stack, and a lightweight extension system, Valkey provides our agents with a fast, shared nervous system. -### Inside the Pipeline — Code & Commentary +## Inside the Pipeline: Code & Commentary -Whenever a headline arrives, it passes through four focused agents. We’ll trace that journey and highlight the micro-optimisations that keep agent-to-agent latency in the low microseconds. +Each incoming headline flows through four agents. 
Here's that journey, including key optimizations that keep agent-to-agent latency in the low microseconds: -```text +``` NewsFetcher → Enricher → Fan-out → UserFeedBuilder → React UI ``` -#### Stage 1 – NewsFetcher (pushes raw headlines) +### Stage 1 – NewsFetcher (pushes raw headlines) ```python -# fetcher.py – 250 msgs / s +# fetcher.py – ~250 msgs/s await r.xadd("news_raw", {"id": idx, "title": title, "body": text}) ``` -Adds each raw article to the `news_raw` stream so downstream agents can pick it up. +Adds each raw article to the `news_raw` stream for downstream agents to consume. -#### Stage 2 – Enricher (classifies on GPU if available) +### Stage 2 – Enricher (tags topics and summarizes) ```python # enrich.py – device pick, GPU gauge -DEVICE = 0 if torch.cuda.is_available() else -1 # −1 → CPU +DEVICE = 0 if torch.cuda.is_available() else -1 GPU_GAUGE.set(1 if DEVICE >= 0 else 0) ``` -Detects whether a GPU is present and records the result in a Prometheus gauge. +Detects GPU availability and exposes the result to Prometheus. ```python -# classify then publish to a topic stream -pipe.xadd(f"topic:{doc['topic']}", {"data": json.dumps(payload)}) +# enrich.py – run the classifier with LangChain +from langchain_community.llms import HuggingFacePipeline +from transformers import pipeline + +zeroshot = pipeline( + "zero-shot-classification", + model="typeform/distilbert-base-uncased-mnli", + device=DEVICE, +) +llm = HuggingFacePipeline(pipeline=zeroshot) + +topic = llm("Which topic best fits: " + doc["title"], labels=TOPICS).labels[0] +payload = {**doc, "topic": topic} +pipe.xadd(f"topic:{topic}", {"data": json.dumps(payload)}) ``` -Writes the enriched article into its `topic:` stream for later fan-out. +Uses a Hugging Face zero-shot model—wrapped in LangChain—to label articles and route them into topic streams. -#### Stage 3 – Fan-out (duplicates to per-user feeds + dedupe) +### Stage 3 – Fan-out (duplicates to per-user feeds + deduplication) ```lua -- fanout.lua – smooths burst traffic -- ARGV[1] = max stream length (e.g. 10000) --- Trim ensures old messages don’t balloon memory or backlog redis.call('XTRIM', KEYS[1], 'MAXLEN', tonumber(ARGV[1])) ``` -Atomically trims each topic stream inside Valkey to keep memory and queues flat. +Trims topic streams inside Valkey to prevent unbounded growth. ```python # fanout.py – per-user de-duplication added = await r.sadd(f"feed_seen:{uid}", doc_id) if added == 0: - continue # duplicate → skip -# 24h TTL for dedup tracking; NX avoids overwriting if already set + continue # duplicate → skip await r.expire(f"feed_seen:{uid}", 86_400, nx=True) ``` -Skips any article a user has already seen by tracking IDs in a 24-hour set. +Skips already-seen articles by tracking IDs in a 24-hour `feed_seen` set. -#### Stage 4 – UserFeedBuilder (tails the stream over WebSockets) +### Stage 4 – UserFeedBuilder (streams updates via WebSockets) ```python # gateway/main.py – live feed push @@ -70,9 +80,9 @@ msgs = await r.xread({stream: last_id}, block=0, count=1) await ws.send_json(json.loads(msgs[0][1][0][1]["data"])) ``` -Continuously reads from the user’s feed stream and pushes each new item over a WebSocket. +Tails the per-user stream and emits new entries directly to the browser. 
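
For context on where that Stage 4 snippet lives, here is a minimal, hypothetical FastAPI endpoint built around the same read loop. The route, the `feed:{uid}` stream naming, and the client setup are assumptions made for illustration; the demo's actual `gateway/main.py` may differ.

```python
# Hypothetical WebSocket gateway sketch (illustrative; not the demo's gateway).
import json

import redis.asyncio as redis
from fastapi import FastAPI, WebSocket

app = FastAPI()
r = redis.Redis(decode_responses=True)  # localhost:6379 by default

@app.websocket("/ws/{uid}")
async def feed(ws: WebSocket, uid: int):
    await ws.accept()
    stream, last_id = f"feed:{uid}", "$"   # "$" → only entries added from now on
    while True:
        # Block until the next item lands in this user's feed stream.
        msgs = await r.xread({stream: last_id}, block=0, count=1)
        _name, entries = msgs[0]
        last_id, fields = entries[0]       # advance the cursor
        await ws.send_json(json.loads(fields["data"]))
```

Each connected browser holds one such socket, and every new stream entry arrives as a JSON message with no polling.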
-#### Self-Tuning Readers (load generator & consumer) +### Self-Tuning Readers (load generator & consumer) ```python # user_reader.py – dynamic pacing @@ -80,95 +90,88 @@ target_rps = min(MAX_RPS, max(1.0, latest_uid * POP_RATE)) await asyncio.sleep(1.0 / target_rps) ``` -Adjusts its own consumption rate to match the current user count—no external autoscaler needed. - ---- +Dynamically adjusts consumption rate based on user count—no external autoscaler needed. -A single `make` command spins up Valkey, agents, Grafana, and the UI under Docker Compose in roughly five minutes. If the host has a GPU, the Enricher detects and uses it automatically; otherwise it proceeds on CPU with the same code path. +A single `make` command launches Valkey, agents, Grafana, and the UI via Docker Compose in \~5 minutes. If a GPU is present, the Enricher uses it automatically. --- -### Why We Bet on Valkey +## Why We Bet on Valkey -Streams and consumer groups move messages in well under a millisecond, Lua keeps heavy fan-out logic server-side, and JSON / Search lets enrichment stay in memory. Grafana began charting backlog lengths and latency immediately, and swapping Python agents for Rust or Go required no datastore changes. The Redis compatibility is genuine—we didn’t tweak a single configuration knob when moving from Redis to Valkey. +Valkey Streams and consumer groups move messages in <1 ms. Lua keeps fan-out logic server-side. JSON/Search allows enrichment to stay in-memory. Grafana charts latency and backlog immediately. Python agents can be swapped for Rust or Go with no changes to the datastore. ---- +Redis compatibility was seamless—no config changes needed. -### Challenges on the Road — and How We Solved Them +## Real-World Bumps (and the Fixes That Worked) -**Bursty traffic turned streams into “slinkies.”** -Our first load test looked like a staircase: sudden article bursts piled up and only drained once the wave had passed. Pushing a ten-line Lua XTRIM script into Valkey meant trimming happened atomically, right where the data lived. Queue lengths flattened almost instantly. +**1. Enricher bottlenecked the pipeline** +A c6.xlarge maxed out at \~10 msg/s on CPU. GPU offload + batch processing (32 articles) on an A10G raised throughput to 60 msg/s. -**Users started seeing déjà-vu in their feeds.** -A subtle race caused the same article ID to reach a user twice. We fixed it by introducing a tiny “seen” set per user (`feed_seen:`). If `SADD` returns 0, the item is silently skipped. Dupes dropped from roughly 3% to effectively zero, and the extra memory footprint was trivial. +**2. Messages got stuck in consumer groups** +Missed `XACK` left IDs in `PENDING`. Fix: immediately `XACK` after processing + a 30s "reaper" to reclaim old messages. -**Some replicas bragged about GPUs they didn’t have.** -On mixed CPU/GPU clusters, a few Enricher containers claimed CUDA but actually ran on CPU. Emitting a one-shot Prometheus gauge (`enrich_gpu`) exposed the truth in Grafana, so mis-scheduled pods are obvious at a glance. +**3. Duplicate articles appeared** +Fan-out crashes between user push and stream trim caused retries. `feed_seen` set made idempotency explicit. Dupes dropped to zero. -**Reader throughput lagged behind user growth.** -Instead of wiring up a Kubernetes HPA, we let the reader recalculate its own pops-per-second each second (`latest_uid * POP_RATE`). More users? Faster loop. Peak load? The delay clamps at a safe maximum. No Helm charts, no YAML deep dives. +**4. 
Readers fell behind during spikes** +Fixed 50 pops/sec couldn’t keep up with 10k users. Self-tuning delay (`latest_uid * POP_RATE`) scaled up to 200 pops/sec. -**A missing module once took down staging.** -Someone built Valkey without the JSON module; enrichment crashed only after deploy. Our CI pipeline now boots a throw-away Valkey container, runs `MODULE LIST`, and fails the build if anything critical is absent—misconfigurations caught before merge. +All fixes are now defaults in the repo. --- -### Observability That Comes Standard +## Observability That Comes Standard -Because every agent exports counters and histograms, Grafana’s Agent Overview dashboard fills itself: +Every agent exports metrics. Grafana's dashboard auto-populates: -* ingestion, enrichment, and fan-out rates -* topic-specific backlog lengths -* p50 / p99 command latency (µs) -* dataset-only memory use, network throughput, connected-client count -* exact number of Enricher replicas running on GPU right now +* Ingestion, enrichment, and fan-out rates +* Topic-specific backlog lengths +* p50 / p99 command latency (in µs) +* Dataset memory use, network throughput, connected clients +* Enricher replicas on GPU (via `enrich_gpu` gauge) -A helper script (`tools/bootstrap_grafana.py`) rewrites the dashboard whenever we add a metric, so panels stay readable and colour-coded. +`tools/bootstrap_grafana.py` auto-updates the dashboard when new metrics are added. ---- +## Performance Snapshot -### Performance Snapshot +| Metric | Result | +| -------------------------- | ---------------------- | +| Raw articles ingested | 250 /s | +| Personalized feed messages | 300k /s | +| Valkey RAM (steady) | 12 MB | +| p99 Valkey op latency | ≈ 200 µs | +| GPU uplift (A10G) | 5x faster enrichment | -* **Raw articles ingested:** 250 / s -* **Personalised feed messages:** 300k / s -* **Valkey RAM (steady):** 12 MB -* **p99 Valkey op latency:** ≈ 200 µs -* **GPU uplift (A10G):** 3.6× faster enrichment - -Scaling up is a single Docker command—no Helm charts, no YAML deep dives. +Scaling up? One Docker command. No Helm. No YAML deep dives. --- -### What’s Next - -Our long-term goal is to make agent networks something you can spin up and evolve in minutes, not weeks. We’re betting that agent-based infrastructure is the next primitive—and we want it to be drop-in simple. +## What's Next -* **LangChain integration** — idiomatic `ValkeyStreamTool` for LLM workflows -* **Message Control Plane** — auto-provision streams, ACLs, metrics -* **Rust agents** — lower memory, same Streams API +We aim to make agent networks something you can spin up in minutes, not weeks. Our roadmap: -Pull requests and fresh ideas are always welcome. +* **LangChain-powered MCP (Message Control Plane)** to declaratively wire chains to Valkey. +* **Rust agents** using the same Streams API but with lower memory. +* **Auto-provisioned ACLs & metrics** via the MCP server. ---- +Pull requests and fresh ideas welcome. -### Why It Might Matter to You +## Why It Might Matter to You -Whether you’re building a recommendation engine, a real-time feature store, or an IoT swarm, Valkey supplies stateful speed, built-in observability, and room to evolve—while your agents stay blissfully focused on their own jobs. - ---- +Whether you're building recommendation engines, real-time feature stores, or IoT swarms—Valkey offers stateful speed, built-in observability, and freedom to evolve. Your agents stay blissfully focused on their own jobs. 
-### Try It Yourself +## Try It Yourself -You can spin up the full system in one command: +Spin up the full system in one command: ```bash git clone https://github.com/vitarb/valkey_agentic_demo.git cd valkey_agentic_demo -make +make dev ``` Then open: * **Feed UI**: [http://localhost:8500](http://localhost:8500) -* **Grafana**: [http://localhost:3000](http://localhost:3000) (admin / admin) +* **Grafana**: [http://localhost:3000](http://localhost:3000) (login: `admin` / `admin`)