AI News Daily Digest

Automated daily AI industry digest. A cron job on a self-hosted machine invokes headless Claude Code each morning: a Python script fetches and filters RSS feeds, Claude does web discovery for additional coverage, synthesizes the digest as Hugo markdown, commits, and pushes. Cloudflare Pages auto-deploys on push. A summary notification goes out via ntfy.sh.

No server-side application to maintain. The pipeline is a set of instructions in CLAUDE.md, backed by a feed-fetcher script and repo-committed config and state.

How It Works

Daily cron invocation (11:00 UTC)
  -> Wrapper script pulls latest main
  -> Invokes `claude -p` headless, pointed at CLAUDE.md
     -> Step 1-2: Runs scripts/fetch_feeds.py
        -> Fetches 33 RSS/Atom feeds + 3 ArXiv feeds with per-fetch jitter
        -> Retries once on 5xx/429/timeouts; identifies as a named bot
        -> Detects content-mismatch responses (200 OK where body isn't a feed)
        -> Classifies errors by type (status:*, timeout, content_mismatch, parse_error)
        -> Normalizes URLs, SHA-256 hashes, filters ArXiv by keywords
        -> Deduplicates against state/seen.json
        -> Tracks per-feed consecutive-failure counts in feed_health
        -> Drops items older than 48 hours
        -> Emits structured JSON + disable_candidates list
     -> Step 3-4: Web discovery (+ passive feed-candidate capture)
     -> Step 5: Triages ArXiv papers by significance
     -> Step 6-7: Synthesizes digest, writes content/posts/{YYYY-MM-DD}.md
     -> Step 8: Post-write semantic dedup against last 2 digests
     -> Step 9: Retires feeds past the hard-failure threshold into
                scrape: (if homepage reachable) or disabled:
     -> Step 10: Updates state/seen.json (seen hashes + feed_health)
                 and state/citation_tracking.json (non-feed source
                 citations for passive feed discovery)
     -> Step 11: Commits + pushes to main, recovery-branch fallback
                 (code changes and digest content go as separate commits)
     -> Step 12: Sends summary to ntfy.sh
  -> Cloudflare Pages builds Hugo and deploys
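
Steps 1-2 are the deterministic core of the run. Below is a minimal sketch of the dedup path, assuming illustrative item fields (link and published_ts) and helper names; the real logic lives in scripts/fetch_feeds.py:

import hashlib
import time
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    # Lowercase scheme/host, drop fragments and trailing slashes so the
    # same article hashes identically across feed variants.
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), parts.query, ""))

def url_hash(url: str) -> str:
    return hashlib.sha256(normalize(url).encode()).hexdigest()

def fresh_unseen(items: list[dict], seen_hashes: set[str],
                 window_hours: int = 48) -> list[dict]:
    # Keep only items that are new (hash not in state/seen.json) and
    # published inside the recency window.
    now = time.time()
    out = []
    for item in items:
        h = url_hash(item["link"])
        if h in seen_hashes or now - item["published_ts"] > window_hours * 3600:
            continue
        seen_hashes.add(h)
        out.append(item)
    return out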

Subscribe

Repository Structure

scripts/fetch_feeds.py     # Feed fetcher: RSS parsing, dedup, filtering, health tracking
config/feeds.yaml          # Active feeds + scrape: and disabled: sections for retired entries
state/seen.json            # Dedup state (URL hashes) + per-feed/scrape health
state/citation_tracking.json  # Per-source citation dates + origin URL (powers /sources)
content/posts/             # Generated digest posts (Hugo markdown)
content/sources.md         # /sources transparency page content
layouts/                   # Hugo templates
assets/css/style.css       # Site styles (fingerprinted at build)
hugo.toml                  # Hugo configuration
CLAUDE.md                  # Pipeline instructions loaded by each run

Feed Sources

33 feeds across 7 categories, plus 3 ArXiv feeds with keyword filtering:

  • Vendor -- OpenAI, Google DeepMind, Anthropic, Hugging Face
  • News -- TechCrunch, Ars Technica, MIT Technology Review, The Register, Hacker News, Lobsters
  • Newsletters -- Simon Willison, Nathan Lambert, Jack Clark, Ethan Mollick, Lilian Weng, Sebastian Raschka, Andrej Karpathy, Zvi Mowshowitz, SemiAnalysis, and more
  • Open Source -- LangChain, Weights & Biases, PyTorch
  • Research -- ArXiv (cs.AI, cs.CL, cs.LG), Google Research, AI Alignment Forum, HF Daily Papers
  • Regulatory -- Stanford HAI, NIST
  • Infrastructure -- NVIDIA, Semiconductor Engineering, AWS ML

Configured in config/feeds.yaml. Feeds that produce three consecutive hard failures are retired automatically. Routing is based on whether the config entry has a homepage: value — a failure on the feed URL says nothing about whether the publisher's site is alive, so homepage presence is the right signal:

  • homepage set → scrape: section. Step 3 of the pipeline directed-fetches the homepage to recover coverage without going through RSS. Controlled by the DIGEST_SCRAPE_ENABLED env var (default on).
  • homepage absent → disabled: section. No known recovery path; skipped on subsequent runs.

Manual re-enable or promotion between sections is a hand edit — move the entry back into the active feeds: list.
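
The routing rule itself is mechanical. A sketch under the assumption that a retired entry mirrors its feeds.yaml fields as a dict (with an optional homepage key); the actual step is executed by Claude per CLAUDE.md, not by a standalone script:

HARD_FAILURE_THRESHOLD = 3

def route_retired(entry: dict) -> str:
    # A hard failure on the feed URL says nothing about the publisher's
    # site, so homepage presence picks the recovery path.
    return "scrape" if entry.get("homepage") else "disabled"

def maybe_retire(entry: dict, consecutive_hard_failures: int) -> str | None:
    if consecutive_hard_failures >= HARD_FAILURE_THRESHOLD:
        return route_retired(entry)   # section to move the entry into
    return None                       # stays under the active feeds: list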

Architecture Decisions

  • Python for feed processing. atoma (pure-Python, defusedxml-based) handles RSS/Atom parsing deterministically. URL normalization and hashing are exact, not LLM-approximate. Claude receives clean JSON instead of raw XML.
  • 48-hour recency window. The script drops items older than 48 hours before Claude sees them. Keeps context small and focused on what's new.
  • JSON state, not SQLite. Produces readable git diffs. One URL hash per line.
  • URL hashes, not full URLs. SHA-256 keeps the state file compact.
  • 90-day retention. Caps state at ~5,000-9,000 entries. Pruned each run.
  • Per-feed health tracking with auto-retirement. Hard failures (status:404, status:410, content_mismatch) increment a consecutive-failure counter; soft failures (5xx, timeouts, parse errors on XML-ish bodies) leave it unchanged. Three consecutive hard failures retire the feed automatically (counter update sketched after this list). Prevents the error log from being dominated by feeds that have permanently moved or gone dark.
  • Post-write semantic deduplication. URL-hash dedup can't catch the same story surfaced from a different URL day-to-day. After the digest is written, Claude reads the last two digests — matching the 48-hour recency window upstream — and removes items that cover stories already reported, keeping the synthesis context clean of prior-post bias.
  • Passive feed discovery via persistent state. Each pipeline run records the source attributions it emitted that day to state/citation_tracking.json, keyed by source name with a list of citation dates and the most recent primary citation's origin URL (scheme://host). Dates older than 30 days are pruned. Any source name that accumulates 3+ citations in the rolling window and isn't already in feeds.yaml surfaces as a candidate feed in the commit message for manual review (pruning and candidate detection are sketched after this list). No auto-addition — web search ranks for traffic, not insight, so the curation stays human. The state file is what makes cross-day signal possible: each cron run is a fresh Claude session with no memory of prior days, so the tracking has to be written down deterministically.
  • Sources transparency page. The same citation_tracking.json feeds a /sources page listing every cited source from the rolling 30-day window, split into an RSS-feeds bucket (matched against feeds.yaml via case-insensitive token-set equality after stripping generic descriptors like "news" and "blog"; see the token_set helper in the same sketch) and a web-discovered bucket. Each source links to its homepage — the configured homepage: for RSS-backed sources, the captured origin for web-discovered ones.
  • No theme dependency. Templates are self-contained in layouts/.
  • No JavaScript. CSS-only. Progressive enhancement only.
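
The failure counter behind auto-retirement, sketched with the classifier's error labels; the feed_health shape shown is illustrative rather than the exact state/seen.json schema:

HARD = {"status:404", "status:410", "content_mismatch"}

def update_health(feed_health: dict, feed: str, error: str | None) -> int:
    # feed_health maps feed name -> consecutive hard-failure count.
    count = feed_health.get(feed, 0)
    if error is None:
        count = 0                 # success resets the streak
    elif error in HARD:
        count += 1                # hard failure extends it
    # soft failures (status:5xx, timeout, parse_error) leave it unchanged
    feed_health[feed] = count
    return count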
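
Passive discovery and the /sources split reduce to set operations over the same state. A sketch assuming the citation_tracking.json shape described above (source name mapped to citation dates plus an origin URL); the generic-descriptor list and helper names are illustrative:

from datetime import date, timedelta

GENERIC = {"news", "blog"}   # descriptors stripped before matching; not exhaustive

def token_set(name: str) -> frozenset:
    return frozenset(t for t in name.lower().split() if t not in GENERIC)

def feed_candidates(tracking: dict, feed_names: list[str],
                    today: date, window_days: int = 30,
                    threshold: int = 3) -> list[tuple[str, str]]:
    feeds = {token_set(n) for n in feed_names}
    cutoff = (today - timedelta(days=window_days)).isoformat()
    candidates = []
    for source, rec in tracking.items():
        # ISO dates compare lexicographically, so string comparison prunes.
        rec["dates"] = [d for d in rec["dates"] if d >= cutoff]
        if len(rec["dates"]) >= threshold and token_set(source) not in feeds:
            candidates.append((source, rec["origin"]))
    return candidates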

Setup

Two parts: Cloudflare Pages for hosting, and a cron job on a machine you control for the daily pipeline. No part of the pipeline runs in a managed service — anything that can run claude and git on a schedule will do.

Hosting (Cloudflare Pages)

  1. Connect this repo to Cloudflare Pages. Framework: Hugo. Output directory: public. Set HUGO_VERSION=0.147.0 as a build env var.
  2. Set build watch paths to content/**, layouts/**, assets/**, static/**, hugo.toml — config and state changes shouldn't trigger rebuilds.
  3. Point a custom domain at the Pages project.

Pipeline (cron)

Requirements: a long-running host (small VPS or homelab machine) with git, Python 3.10+, and network access to GitHub + Anthropic + RSS origins.

  1. Install the Claude Code CLI and log in — this creates ~/.claude/.credentials.json.
  2. Install gh (GitHub CLI), authenticate with repo scope, and enable the git credential helper so pushes work over HTTPS.
  3. pip install atoma pyyaml (use --user or a venv depending on your distribution's Python policy).
  4. Clone this repo locally.
  5. Write a wrapper script that cds into the repo, pulls, and invokes headless Claude, redirecting stdout+stderr to a per-day log file so stream interruptions are visible after the fact:
    #!/usr/bin/env bash
    cd /path/to/repo && git pull --ff-only
    claude -p --permission-mode bypassPermissions \
      --model 'claude-opus-4-6[1m]' \
      "Run the daily AI digest pipeline per CLAUDE.md. Follow every step in order." \
      >> "/path/to/logs/$(date -u +%F).log" 2>&1
    
    The model is pinned to Opus 4.6 (1M context variant) because digest synthesis can run into six-figure token counts and Opus 4.7 has documented retrieval degradation past 256k tokens. Re-evaluate the pin when a successor model ships with verified long-context behavior.
  6. Add to crontab with TZ=UTC:
    0 11 * * * /path/to/wrapper.sh
    
    11:00 UTC lands after ArXiv's nightly update and before most US readers wake up.

Claude loads CLAUDE.md at the start of each run for its full pipeline spec.

Pipeline environment variables

  • DIGEST_SCRAPE_ENABLED — controls whether the pipeline directed-fetches publisher homepages listed under scrape: in config/feeds.yaml during Step 3 web discovery. Default is enabled; set to false (or 0, no, off) in the cron host's wrapper env to disable. When disabled, feeds retired into scrape: are no longer visited; their coverage drops entirely, but the pipeline runs faster and uses less context.
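
A sketch of reading the flag, honoring the falsy spellings listed above; DIGEST_SCRAPE_ENABLED is the real variable name, the helper is illustrative:

import os

def scrape_enabled() -> bool:
    # Default on; only an explicit falsy spelling disables homepage scraping.
    value = os.environ.get("DIGEST_SCRAPE_ENABLED", "true").strip().lower()
    return value not in {"false", "0", "no", "off"}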

Build Parameters

A handful of Hugo site params can be set via environment variables at build time:

  • HUGO_PARAMS_GITHUBREPO — if set (e.g. owner/repo), renders a GitHub icon link in the header pointing to that repo. Omit to hide the link.
  • HUGO_PARAMS_NOINDEX — if set to true, emits a restrictive robots.txt and a <meta name="robots" content="noindex, nofollow"> tag on every page. Use for builds that should stay out of search indexes.
  • HUGO_PARAMS_BUILTBY — if set to an organization or individual name (e.g. Acme Corp), renders a "Built by {name}." line in the footer and emits schema.org JSON-LD marking the site as a Blog published by that Organization. Search engines pick up the parent/child relationship for sitelinks and knowledge-panel purposes. Omit to hide both the footer attribution and the structured data.
  • HUGO_PARAMS_BUILTBYURL — if set alongside HUGO_PARAMS_BUILTBY, wraps the footer name in a link and adds the url field to the JSON-LD Organization object. Ignored when HUGO_PARAMS_BUILTBY is unset.

Fonts

Instrument Serif (headings) and Source Sans 3 (body) are self-hosted from static/fonts/. No external font CDN is contacted at runtime — the site is fully self-contained.

The .woff2 files were sourced from the @fontsource npm packages:

@fontsource/instrument-serif
@fontsource/source-sans-3

To regenerate or add a weight:

mkdir /tmp/fontsource && cd /tmp/fontsource
npm init -y && npm install @fontsource/instrument-serif @fontsource/source-sans-3
# files live under node_modules/@fontsource/<family>/files/
#   <family>-latin-<weight>-<style>.woff2
# Copy the ones you need into static/fonts/ with the naming pattern:
#   <family>-<weight>[-italic].woff2

Then add a matching @font-face rule in assets/css/style.css. Fonts referenced in @font-face but missing from static/fonts/ simply fall through to the system-font fallback chain declared in the :root font variables.

Only the latin subset is vendored. Names with extended Latin, Greek, or Cyrillic characters fall back to system fonts for those codepoints. Add latin-ext etc. if you start publishing content that exercises them regularly.

License

MIT