Automated daily AI industry digest. A cron job on a self-hosted machine invokes headless Claude Code each morning: a Python script fetches and filters RSS feeds, Claude does web discovery for additional coverage, synthesizes the digest as Hugo markdown, commits, and pushes. Cloudflare Pages auto-deploys on push. A summary notification goes out via ntfy.sh.
No server-side application to maintain. The pipeline is a set of
instructions in CLAUDE.md, backed by a feed-fetcher script and
repo-committed config and state.
Daily cron invocation (11:00 UTC)
-> Wrapper script pulls latest main
-> Invokes `claude -p` headless, pointed at CLAUDE.md
-> Step 1-2: Runs scripts/fetch_feeds.py
-> Fetches 33 RSS/Atom feeds + 3 ArXiv feeds with per-fetch jitter
-> Retries once on 5xx/429/timeouts; identifies as a named bot
-> Detects content-mismatch responses (200 OK where body isn't a feed)
-> Classifies errors by type (status:*, timeout, content_mismatch, parse_error)
-> Normalizes URLs, SHA-256 hashes, filters ArXiv by keywords
-> Deduplicates against state/seen.json
-> Tracks per-feed consecutive-failure counts in feed_health
-> Drops items older than 48 hours
-> Emits structured JSON + disable_candidates list
-> Step 3-4: Web discovery (+ passive feed-candidate capture)
-> Step 5: Triages ArXiv papers by significance
-> Step 6-7: Synthesizes digest, writes content/posts/{YYYY-MM-DD}.md
-> Step 8: Post-write semantic dedup against last 2 digests
-> Step 9: Retires feeds past the hard-failure threshold into
scrape: (if homepage reachable) or disabled:
-> Step 10: Updates state/seen.json (seen hashes + feed_health)
and state/citation_tracking.json (non-feed source
citations for passive feed discovery)
-> Step 11: Commits + pushes to main, recovery-branch fallback
(code changes and digest content go as separate commits)
-> Step 12: Sends summary to ntfy.sh
-> Cloudflare Pages builds Hugo and deploys
- Web: sorcerousmachine.com/ai-news-digest
- RSS: sorcerousmachine.com/ai-news-digest/feed.xml
- Notifications: ntfy.sh/ai-news-digest
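
The normalization, hashing, and recency rules in Steps 1-2 are deterministic Python rather than LLM judgment. A minimal sketch of that logic, assuming illustrative helper names and normalization rules that may not match what scripts/fetch_feeds.py actually does, with a placeholder set standing in for the hashes loaded from state/seen.json:

```python
import hashlib
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Illustrative rules only: lowercase scheme and host, drop the fragment,
    # strip a trailing slash. The real script's normalization may differ.
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def url_hash(url: str) -> str:
    # One SHA-256 hex digest per normalized URL; hashes like these are what
    # the dedup state stores instead of full URLs.
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()

def is_recent(published: datetime, window_hours: int = 48) -> bool:
    return datetime.now(timezone.utc) - published <= timedelta(hours=window_hours)

def filter_new(items: list[dict], seen_hashes: set[str]) -> list[dict]:
    # Keep items inside the recency window whose hash hasn't been seen before.
    return [i for i in items
            if is_recent(i["published"]) and url_hash(i["url"]) not in seen_hashes]

items = [{"url": "https://Example.com/post/",
          "published": datetime.now(timezone.utc) - timedelta(hours=3)}]
print(filter_new(items, seen_hashes=set()))
```
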
scripts/fetch_feeds.py # Feed fetcher: RSS parsing, dedup, filtering, health tracking
config/feeds.yaml # Active feeds + scrape: and disabled: sections for retired entries
state/seen.json # Dedup state (URL hashes) + per-feed/scrape health
state/citation_tracking.json # Per-source citation dates + origin URL (powers /sources)
content/posts/ # Generated digest posts (Hugo markdown)
content/sources.md # /sources transparency page content
layouts/ # Hugo templates
assets/css/style.css # Site styles (fingerprinted at build)
hugo.toml # Hugo configuration
CLAUDE.md # Pipeline instructions loaded by each run
33 feeds across 7 categories, plus 3 ArXiv feeds with keyword filtering:
- Vendor -- OpenAI, Google DeepMind, Anthropic, Hugging Face
- News -- TechCrunch, Ars Technica, MIT Technology Review, The Register, Hacker News, Lobsters
- Newsletters -- Simon Willison, Nathan Lambert, Jack Clark, Ethan Mollick, Lilian Weng, Sebastian Raschka, Andrej Karpathy, Zvi Mowshowitz, SemiAnalysis, and more
- Open Source -- LangChain, Weights & Biases, PyTorch
- Research -- ArXiv (cs.AI, cs.CL, cs.LG), Google Research, AI Alignment Forum, HF Daily Papers
- Regulatory -- Stanford HAI, NIST
- Infrastructure -- NVIDIA, Semiconductor Engineering, AWS ML
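
Keyword filtering on the ArXiv feeds happens in the fetcher, before anything reaches Claude. A hedged sketch of what that filter could look like; the keyword list and item fields here are placeholders, not the ones the script actually uses:

```python
# Placeholder keywords; the real list lives with the fetcher/config.
ARXIV_KEYWORDS = {"language model", "alignment", "reinforcement learning"}

def matches_keywords(title: str, summary: str, keywords=ARXIV_KEYWORDS) -> bool:
    text = f"{title} {summary}".lower()
    return any(kw in text for kw in keywords)

papers = [
    {"title": "Scaling Laws for Language Models", "summary": "..."},
    {"title": "A Survey of Graph Databases", "summary": "..."},
]
kept = [p for p in papers if matches_keywords(p["title"], p["summary"])]
print([p["title"] for p in kept])  # only the language-model paper survives
```
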
Configured in config/feeds.yaml. Feeds that produce three consecutive
hard failures are retired automatically. Routing is based on whether
the config entry has a homepage: value — a failure on the feed URL
says nothing about whether the publisher's site is alive, so homepage
presence is the right signal:
- `homepage:` set → `scrape:` section. Step 3 of the pipeline direct-fetches the homepage to recover coverage without going through RSS. Controlled by the `DIGEST_SCRAPE_ENABLED` env var (default on).
- `homepage:` absent → `disabled:` section. No known recovery path; skipped on subsequent runs.
Manual re-enable or promotion between sections is a hand edit — move
the entry back into the active feeds: list.
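
A sketch of that routing decision, assuming a feed entry is a plain dict with an optional homepage value and that the consecutive-failure count comes from the fetcher's feed_health tracking; the real mechanics are split between the script's disable_candidates output and Step 9 of CLAUDE.md:

```python
HARD_FAILURES = {"status:404", "status:410", "content_mismatch"}
RETIRE_AFTER = 3  # consecutive hard failures before retirement

def route_feed(entry: dict, consecutive_hard_failures: int) -> str:
    """Decide which section of config/feeds.yaml a feed belongs in.

    Returns "feeds" (stay active), "scrape" (homepage known, direct-fetch it),
    or "disabled" (no recovery path).
    """
    if consecutive_hard_failures < RETIRE_AFTER:
        return "feeds"
    return "scrape" if entry.get("homepage") else "disabled"

# Example: a feed whose XML endpoint 404s while the publisher's site is still up.
entry = {"name": "Example Blog",
         "url": "https://example.com/feed.xml",
         "homepage": "https://example.com/"}
print(route_feed(entry, consecutive_hard_failures=3))  # -> "scrape"
```
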
- Python for feed processing. atoma (pure-Python, defusedxml-based) handles RSS/Atom parsing deterministically. URL normalization and hashing are exact, not LLM-approximate. Claude receives clean JSON instead of raw XML.
- 48-hour recency window. The script drops items older than 48 hours before Claude sees them. Keeps context small and focused on what's new.
- JSON state, not SQLite. Produces readable git diffs. One URL hash per line.
- URL hashes, not full URLs. SHA-256 keeps the state file compact.
- 90-day retention. Caps state at ~5,000-9,000 entries. Pruned each run.
- Per-feed health tracking with auto-retirement. Hard failures (status:404, status:410, content_mismatch) increment a consecutive-failure counter; soft failures (5xx, timeouts, parse errors on XML-ish bodies) preserve it. Three consecutive hard failures retires the feed automatically. Prevents the error log from being dominated by feeds that have permanently moved or gone dark.
- Post-write semantic deduplication. URL-hash dedup can't catch the same story surfaced from a different URL day-to-day. After the digest is written, Claude reads the last two digests — matching the 48-hour recency window upstream — and removes items that cover stories already reported, keeping the synthesis context clean of prior-post bias.
- Passive feed discovery via persistent state. Each pipeline run records the source attributions it emitted that day to state/citation_tracking.json, keyed by source name with a list of citation dates and the most recent primary citation's origin URL (scheme://host). Dates older than 30 days are pruned. Any source name that accumulates 3+ citations in the rolling window AND isn't already in `feeds.yaml` surfaces as a candidate feed in the commit message for manual review (see the sketch after this list). No auto-addition — web search ranks for traffic, not insight, so the curation stays human. The state file is what makes cross-day signal possible: each cron run is a fresh Claude session with no memory of prior days, so the tracking has to be written down deterministically.
- Sources transparency page. The same `citation_tracking.json` feeds a `/sources` page listing every cited source from the rolling 30-day window, split into an RSS-feeds bucket (matched against `feeds.yaml` via case-insensitive token-set equality after stripping generic descriptors like "news" and "blog") and a web-discovered bucket. Each source links to its homepage — the configured `homepage:` for RSS-backed sources, the captured `origin` for web-discovered ones.
- No theme dependency. Templates are self-contained in `layouts/`.
- No JavaScript. CSS-only. Progressive enhancement only.
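
A sketch of that cross-day mechanism, assuming an illustrative shape for the tracking state (source name mapped to a "dates" list plus an "origin" URL) and a simplified version of the token-set match; the field names and descriptor list are placeholders:

```python
from datetime import date, timedelta

GENERIC = {"news", "blog"}            # descriptors stripped before matching
WINDOW_DAYS, MIN_CITATIONS = 30, 3

def token_key(name: str) -> frozenset:
    # Case-insensitive token set, ignoring generic descriptors.
    return frozenset(t for t in name.lower().split() if t not in GENERIC)

def feed_candidates(tracking: dict, feed_names: list[str], today: date) -> list[str]:
    cutoff = today - timedelta(days=WINDOW_DAYS)
    known = {token_key(n) for n in feed_names}
    candidates = []
    for source, record in tracking.items():
        recent = [d for d in record["dates"] if date.fromisoformat(d) >= cutoff]
        record["dates"] = recent                  # prune citations older than 30 days
        if len(recent) >= MIN_CITATIONS and token_key(source) not in known:
            candidates.append(source)
    return candidates

tracking = {"Example Analytics Weekly": {
    "dates": ["2025-05-01", "2025-05-08", "2025-05-20"],
    "origin": "https://example-analytics.example"}}
print(feed_candidates(tracking, ["TechCrunch", "Ars Technica"], date(2025, 5, 21)))
```
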
Two parts: Cloudflare Pages for hosting, and a cron job on a machine
you control for the daily pipeline. No part of the pipeline runs in
a managed service — anything that can run claude and git on a
schedule will do.
- Connect this repo to Cloudflare Pages. Framework: Hugo. Output directory: `public`. Set `HUGO_VERSION=0.147.0` as a build env var.
- Set build watch paths to `content/**`, `layouts/**`, `assets/**`, `static/**`, `hugo.toml` — config and state changes shouldn't trigger rebuilds.
- Point a custom domain at the Pages project.
Requirements: a long-running host (small VPS or homelab machine) with git, Python 3.10+, and network access to GitHub + Anthropic + RSS origins.
- Install the Claude Code CLI and log in — this creates `~/.claude/.credentials.json`.
- Install `gh` (GitHub CLI), authenticate with `repo` scope, and enable the git credential helper so pushes work over HTTPS.
- `pip install atoma pyyaml` (use `--user` or a venv depending on your distribution's Python policy).
- Clone this repo locally.
- Write a wrapper script that `cd`s into the repo, pulls, and invokes:

  ```
  claude -p --permission-mode bypassPermissions \
    --model 'claude-opus-4-6[1m]' \
    "Run the daily AI digest pipeline per CLAUDE.md. Follow every step in order."
  ```

  Redirect stdout+stderr to a per-day log file so stream interruptions are visible after the fact. The model is pinned to Opus 4.6 (1M context variant) because digest synthesis can run into six-figure token counts and Opus 4.7 has documented retrieval degradation past 256k tokens. Re-evaluate the pin when a successor model ships with verified long-context behavior.
- Add to crontab with `TZ=UTC`:

  ```
  0 11 * * * /path/to/wrapper.sh
  ```

  11:00 UTC lands after ArXiv's nightly update and before most US readers wake up.
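
One possible shape for that wrapper, written in Python to match the rest of the repo's tooling; the paths and log directory are placeholders, and a short shell script that pulls and runs the same claude command would work just as well:

```python
#!/usr/bin/env python3
"""Illustrative cron wrapper: pull, run the pipeline, log to a per-day file."""
import subprocess
from datetime import datetime, timezone
from pathlib import Path

REPO = Path("/home/digest/ai-news-digest")   # placeholder clone location
LOG_DIR = Path("/home/digest/logs")          # placeholder log directory
PROMPT = "Run the daily AI digest pipeline per CLAUDE.md. Follow every step in order."

def main() -> None:
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    log_path = LOG_DIR / f"{datetime.now(timezone.utc):%Y-%m-%d}.log"
    with open(log_path, "a") as log:
        subprocess.run(["git", "pull", "--ff-only"],
                       cwd=REPO, stdout=log, stderr=log, check=True)
        subprocess.run(["claude", "-p", "--permission-mode", "bypassPermissions",
                        "--model", "claude-opus-4-6[1m]", PROMPT],
                       cwd=REPO, stdout=log, stderr=log, check=False)

if __name__ == "__main__":
    main()
```
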
Claude loads CLAUDE.md at the start of each run for its full
pipeline spec.
- `DIGEST_SCRAPE_ENABLED` — controls whether the pipeline direct-fetches publisher homepages listed under `scrape:` in `config/feeds.yaml` during Step 3 web discovery. Default is enabled; set to `false` (or `0`, `no`, `off`) on the cron host's wrapper env to disable. When disabled, retired `content_mismatch` sites stop being visited — their coverage drops entirely, but the pipeline runs faster and uses less context.
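
A minimal sketch of the enabled-by-default check that behavior implies; the actual parsing in the pipeline may differ:

```python
import os

FALSY = {"false", "0", "no", "off"}

def scrape_enabled() -> bool:
    # Enabled unless DIGEST_SCRAPE_ENABLED is explicitly set to a falsy value.
    return os.environ.get("DIGEST_SCRAPE_ENABLED", "").strip().lower() not in FALSY
```
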
A handful of Hugo site params can be set via environment variables at build time:
- `HUGO_PARAMS_GITHUBREPO` — if set (e.g. `owner/repo`), renders a GitHub icon link in the header pointing to that repo. Omit to hide the link.
- `HUGO_PARAMS_NOINDEX` — if set to `true`, emits a restrictive `robots.txt` and a `<meta name="robots" content="noindex, nofollow">` tag on every page. Use for builds that should stay out of search indexes.
- `HUGO_PARAMS_BUILTBY` — if set to an organization or individual name (e.g. `Acme Corp`), renders a "Built by {name}." line in the footer and emits schema.org JSON-LD marking the site as a `Blog` published by that `Organization`. Search engines pick up the parent/child relationship for sitelinks and knowledge-panel purposes. Omit to hide both the footer attribution and the structured data.
- `HUGO_PARAMS_BUILTBYURL` — if set alongside `HUGO_PARAMS_BUILTBY`, wraps the footer name in a link and adds the `url` field to the JSON-LD `Organization` object. Ignored when `HUGO_PARAMS_BUILTBY` is unset.
Instrument Serif (headings) and Source Sans 3 (body) are self-hosted
from static/fonts/. No external font CDN is contacted at runtime —
the site is fully self-contained.
The .woff2 files were sourced from the @fontsource npm packages:
@fontsource/instrument-serif
@fontsource/source-sans-3
To regenerate or add a weight:

```
mkdir /tmp/fontsource && cd /tmp/fontsource
npm init -y && npm install @fontsource/instrument-serif @fontsource/source-sans-3
# files live under node_modules/@fontsource/<family>/files/
#   <family>-latin-<weight>-<style>.woff2
# Copy the ones you need into static/fonts/ with the naming pattern:
#   <family>-<weight>[-italic].woff2
```

Then add a matching @font-face rule in assets/css/style.css. Fonts
referenced in @font-face but missing from static/fonts/ will simply
fall through to the system-font fallback chain declared in the
:root font variables.
Only the latin subset is vendored. Names with extended Latin, Greek,
or Cyrillic characters fall back to system fonts for those codepoints.
Add latin-ext etc. if you start publishing content that exercises
them regularly.
MIT