Archive a curated OPML list of RSS and Atom feeds, persist article full text into the GitHub repository, and publish public JSON and RSS artifacts via GitHub Pages.
The repository is designed for a GitHub Actions scheduled run:
- Read feed definitions from `feeds.opml`.
- Fetch each source feed from GitHub-hosted runners.
- Parse feed items and assign stable `article_id` values.
- Reuse full text directly from RSS when available; otherwise fetch the article page and extract the main content.
- Persist article JSON files under `archive/articles/<feed>/<article-id>.json`.
- Commit `archive/` back into `main` so the corpus survives future feed churn.
- Publish a static site to GitHub Pages with:
  - raw mirrored feed files under `feeds/<slug>.xml`
  - archive index at `archive/index.json`
  - archived article JSON files under `archive/articles/...`
  - archived article HTML pages under `archive/articles/...`
  - combined item JSON at `feeds/combined.json`
  - a recent 72-hour full-text RSS feed at `feeds/fulltext.xml`
  - an explicit recent 72-hour full-text RSS feed at `feeds/fulltext-72h.xml`
  - a recent 72-hour summary RSS feed at `feeds/history-72h.xml`, with links to GitHub-hosted full article pages
  - a historical summary RSS feed at `feeds/history.xml`, with links to GitHub-hosted full article pages

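The ID, full-text, and persistence steps in the list above could be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/build_site.py`: the hashing scheme, the helper names (`make_article_id`, `pick_full_text`, `persist_article`), and the item field names `guid` and `link` are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Optional


def make_article_id(feed_slug: str, guid: Optional[str], link: str) -> str:
    """Derive a stable ID from the feed slug plus the GUID (link as fallback).

    Assumption: a truncated SHA-256 digest; the real scheme may differ.
    """
    key = f"{feed_slug}\n{guid or link}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def archive_path(archive_dir: Path, feed_slug: str, article_id: str) -> Path:
    """Build archive/articles/<feed>/<article-id>.json under archive_dir."""
    return archive_dir / "articles" / feed_slug / f"{article_id}.json"


def pick_full_text(item: dict, fetch_and_extract: Callable[[str], str]) -> str:
    """Reuse feed-provided full text if present, else fetch the article page."""
    feed_html = (item.get("content_html") or "").strip()
    if feed_html:
        return feed_html
    # Hypothetical callback that downloads the page and extracts main content.
    return fetch_and_extract(item["link"])


def persist_article(archive_dir: Path, feed_slug: str, item: dict) -> Path:
    """Write the article JSON once; later runs leave existing files untouched."""
    article_id = make_article_id(feed_slug, item.get("guid"), item["link"])
    path = archive_path(archive_dir, feed_slug, article_id)
    if path.exists():
        return path
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"article_id": article_id, **item}
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Because the ID depends only on the feed slug and the GUID (or link), re-running the job on the same item maps to the same file, which is what lets the archive survive items dropping out of the origin feed.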
The workflow is scheduled every 5 minutes using GitHub Actions cron. To reduce the risk of GitHub's top-of-hour scheduling delays, it runs at minutes 2, 7, 12, ..., 57 of each hour rather than exactly at :00, :05, :10, and so on.
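The offset schedule can be enumerated with a short sketch. The cron string built here is one way to express those minutes; the exact expression used in the workflow file is not shown in this README, so treat it as an assumption.

```python
OFFSET = 2    # first minute of the hour on which the job runs
INTERVAL = 5  # minutes between runs

# Minutes of the hour on which the job fires: 2, 7, 12, ..., 57.
minutes = list(range(OFFSET, 60, INTERVAL))

# An explicit comma-separated minute field, followed by "every hour,
# every day, every month, every weekday".
cron_expr = ",".join(str(m) for m in minutes) + " * * * *"

if __name__ == "__main__":
    print(cron_expr)
```

The schedule still yields 12 runs per hour, but none of them lands on minute 0.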
After Pages is enabled, the default URLs are:

- `https://ThinkPeace.github.io/rss-cache/`
- `https://ThinkPeace.github.io/rss-cache/archive/index.json`
- `https://ThinkPeace.github.io/rss-cache/feeds/index.json`
- `https://ThinkPeace.github.io/rss-cache/feeds/combined.json`
- `https://ThinkPeace.github.io/rss-cache/feeds/fulltext.xml` (recent 72 hours)
- `https://ThinkPeace.github.io/rss-cache/feeds/fulltext-72h.xml` (recent 72 hours)
- `https://ThinkPeace.github.io/rss-cache/feeds/history-72h.xml` (recent 72-hour summary feed, links to GitHub-hosted article pages)
- `https://ThinkPeace.github.io/rss-cache/feeds/history.xml` (historical summary feed, links to GitHub-hosted article pages)

Compatibility note:

- `feeds/combined.xml` remains as a legacy alias of the 72-hour summary feed.

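As a rough illustration of how a 72-hour feed differs from the historical one, the sketch below keeps only items published inside the window. The `published` field name, its ISO 8601 format with an explicit UTC offset, and the `recent_items` helper are assumptions for the example, not the build script's actual code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WINDOW = timedelta(hours=72)


def recent_items(items: list, now: Optional[datetime] = None,
                 window: timedelta = WINDOW) -> list:
    """Return items published within `window` of `now`, newest first.

    Assumes each item has an ISO 8601 `published` string with a UTC offset,
    e.g. "2024-01-03T12:00:00+00:00".
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - window

    def published(item: dict) -> datetime:
        return datetime.fromisoformat(item["published"])

    kept = [item for item in items if published(item) >= cutoff]
    return sorted(kept, key=published, reverse=True)
```

Passing `now` explicitly makes the selection reproducible, which helps when comparing the output of two builds.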
Install dependencies first:

```bash
python3 -m pip install -r requirements.txt
```

Then build:

```bash
python3 scripts/build_site.py \
  --opml feeds.opml \
  --archive-dir archive \
  --output dist \
  --site-url https://ThinkPeace.github.io/rss-cache
```

For a smaller smoke test:

```bash
python3 scripts/build_site.py \
  --opml feeds.opml \
  --archive-dir /tmp/rss-cache-archive \
  --output /tmp/rss-cache-smoke \
  --site-url https://example.com/rss-cache \
  --max-feeds 2
```

Each article JSON includes:

- source feed metadata
- canonical article link and GUID
- published timestamp
- summary
- extracted `content_text`
- extracted or synthesized `content_html`
- archive JSON URL for public access

This structure is intended to make later custom RSS generation trivial: the next layer can read `archive/index.json` and emit whatever feed format you want without touching the origin sites again.
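A downstream generator along these lines might look like the sketch below. It assumes `archive/index.json` is a JSON list of objects with `article_id`, `title`, `link`, `published` (ISO 8601), and `summary` keys; the real index layout is not documented here, so adjust the field names to match it. `build_rss` is a hypothetical helper.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime
from pathlib import Path


def build_rss(entries: list, title: str, site_url: str) -> str:
    """Render index entries as a minimal RSS 2.0 document (returned as a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = f"Custom feed built from {site_url}"
    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        # article_id is not a URL, so mark the GUID as a non-permalink.
        ET.SubElement(item, "guid", isPermaLink="false").text = entry["article_id"]
        published = datetime.fromisoformat(entry["published"])
        ET.SubElement(item, "pubDate").text = format_datetime(published)
        ET.SubElement(item, "description").text = entry.get("summary", "")
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Reads only the local archive; no request is made to the origin sites.
    entries = json.loads(Path("archive/index.json").read_text(encoding="utf-8"))
    print(build_rss(entries, "Custom feed", "https://example.com/rss-cache"))
```

Since the generator works purely from archived data, it can be re-run with different filters or output formats at any time.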