Add category support for RSS/Atom feed articles#12
Conversation
- Use DataDog go-secure-sdk for SSRF protection (blocks private IPs, loopback, link-local, cloud metadata endpoints) - Refactor rss.Fetcher and scraper.Scraper as structs holding *http.Client - Scanner accepts Fetcher + Scraper as dependencies (no global state) - Add --unsafe-client / BLOGWATCHER_UNSAFE_CLIENT flag for local dev - ScanBlog now returns error (hard fail, not soft error string) - E2e test verifies safe client blocks loopback requests - Tests use plain http.Client (no SSRF check needed for httptest) - Update AGENTS.md: ground claims in evidence, not guesses Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parse and store categories/tags from RSS and Atom feeds, with a new --category flag on the articles command to filter by category. Inspired by Hyaxia/blogwatcher#11 (by @weronikakombat). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds SSRF-safe configurable HTTP clients and propagates them into RSS fetcher and scraper; refactors scanner into a dependency-injected type returning explicit errors; introduces Article.Categories (parsing, DB column, JSON helpers, filtering, CLI flag), updates tests/e2e fixtures, and extends AGENTS.md with a “Before speaking” section. Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 4 (Complex) | ⏱️ ~70 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/scanner/scanner.go (1)
153-190:⚠️ Potential issue | 🔴 CriticalThe current worker-pool error path can deadlock
ScanAllBlogs.
jobsis unbuffered, anderrsis not drained until after every job has been queued. If all workers exit early, the send loop blocks forever; if only one fails, the function can still return while other workers are still writing to the database.💡 Suggested fix
- jobs := make(chan job) + jobs := make(chan job, len(blogs)) results := make([]ScanResult, len(blogs)) errs := make(chan error, workers) @@ - for i := 0; i < workers; i++ { - if err := <-errs; err != nil { - return nil, err - } - } - - return results, nil + var firstErr error + for i := 0; i < workers; i++ { + if err := <-errs; err != nil && firstErr == nil { + firstErr = err + } + } + if firstErr != nil { + return nil, firstErr + } + + return results, nil🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scanner/scanner.go` around lines 153 - 190, ScanAllBlogs can deadlock because jobs is unbuffered and errs is only read after enqueuing all jobs; fix by adding cancellation and proper synchronization: create a cancellable context (ctx, cancel := context.WithCancel(ctx)), make errs buffered to capacity workers, use a sync.WaitGroup to track worker goroutines, have each worker select on ctx.Done() while reading from jobs and when encountering an error call cancel() and send the error into errs (use non-blocking send or buffered errs so send cannot block), ensure the job-sender loop selects on ctx.Done() to stop enqueuing when cancelled, close jobs after enqueuing, wait for wg.Done and then close(errs) and range over errs to return the first non-nil error; reference the jobs channel, errs channel, the worker goroutine that calls storage.OpenDatabase and s.ScanBlog, and ensure workerDB.Close runs in defer as currently written.
🧹 Nitpick comments (2)
AGENTS.md (1)
3-3: Consider varying the sentence structure.Three consecutive sentences begin with "If you're unsure," which slightly reduces readability. You might rephrase for variety while keeping the same meaning.
✍️ Optional rephrasing for variety
-Never guess. Ground every claim in actual files, docs, or command output. If you're unsure about a dependency's size, run `go get` and check. If you're unsure about an API, read the source or docs. If you're unsure about behavior, write a test. State what you verified and how — don't speculate. +Never guess. Ground every claim in actual files, docs, or command output. When uncertain about a dependency's size, run `go get` and check. For APIs, read the source or documentation. For behavior, write a test to confirm. Always state what you verified and how — don't speculate.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@AGENTS.md` at line 3, Update the three consecutive sentences that start with "If you're unsure" in AGENTS.md to vary sentence openings while preserving meaning; reword one or two to begin differently (e.g., "When in doubt, run `go get`...", "To confirm an API's behavior, read the source or docs", "If behavior is unclear, write a test") and keep each instruction concrete about what to verify and how to verify.internal/rss/rss_test.go (1)
48-80: Consider adding an Atom fixture here too.This currently only locks in RSS category parsing. An Atom sample would cover the second advertised feed format and make regressions here much easier to spot.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/rss/rss_test.go` around lines 48 - 80, TestParseFeedWithCategories only exercises RSS; add an Atom fixture and assert Atom category parsing to prevent regressions. Update the test (TestParseFeedWithCategories) to include an Atom feed string (with <entry> elements and <category> tags) and either start a second httptest server or extend the existing handler to serve the Atom fixture at a distinct path, then call newTestFetcher().ParseFeed for that Atom URL and add assertions analogous to require.Equal/require.Nil that validate the returned articles' Categories for the Atom entries. Ensure the test references ParseFeed and the same article struct expectations so Atom parsing is covered alongside RSS.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scanner/scanner.go`:
- Around line 45-49: The code currently sets blog.FeedURL = discovered and calls
db.UpdateBlog(ctx, blog) immediately when a discovery string is returned, which
can persist a bad URL before ParseFeed runs; change the flow in the scanner
(scanner.go) so that you do NOT call db.UpdateBlog or mutate blog.FeedURL when
discovered != "" up front—instead assign feedURL = discovered only for use in
the following ParseFeed call (the ParseFeed function invocation), attempt to
parse, and only if ParseFeed succeeds update blog.FeedURL and call
db.UpdateBlog(ctx, blog); ensure any early returns on parse failure do not
persist the discovered value so future scans can re-discover.
In `@internal/storage/database.go`:
- Around line 363-365: The code currently uses sq.Like on the "categories"
string column (checking variable category with query.Where(sq.Like{"categories":
"%" + *category + "%"}) ), which causes boundary-blind matches; change the
storage and query to a structured categories representation (e.g., JSONB array
or SQL array) and perform exact matching using the DB's containment or
membership operator (e.g., jsonb @> or ARRAY containment) instead of LIKE.
Update the query construction in the database.go block that references category
(and the similar logic around lines 523-535) to use sq.Expr with the appropriate
operator (e.g., "categories @> to_jsonb(array[?])" or "categories @> ?::jsonb"/"
? = ANY(categories)" depending on chosen schema) and add a migration to convert
the existing comma-joined values to the new array/JSONB format.
---
Outside diff comments:
In `@internal/scanner/scanner.go`:
- Around line 153-190: ScanAllBlogs can deadlock because jobs is unbuffered and
errs is only read after enqueuing all jobs; fix by adding cancellation and
proper synchronization: create a cancellable context (ctx, cancel :=
context.WithCancel(ctx)), make errs buffered to capacity workers, use a
sync.WaitGroup to track worker goroutines, have each worker select on ctx.Done()
while reading from jobs and when encountering an error call cancel() and send
the error into errs (use non-blocking send or buffered errs so send cannot
block), ensure the job-sender loop selects on ctx.Done() to stop enqueuing when
cancelled, close jobs after enqueuing, wait for wg.Done and then close(errs) and
range over errs to return the first non-nil error; reference the jobs channel,
errs channel, the worker goroutine that calls storage.OpenDatabase and
s.ScanBlog, and ensure workerDB.Close runs in defer as currently written.
---
Nitpick comments:
In `@AGENTS.md`:
- Line 3: Update the three consecutive sentences that start with "If you're
unsure" in AGENTS.md to vary sentence openings while preserving meaning; reword
one or two to begin differently (e.g., "When in doubt, run `go get`...", "To
confirm an API's behavior, read the source or docs", "If behavior is unclear,
write a test") and keep each instruction concrete about what to verify and how
to verify.
In `@internal/rss/rss_test.go`:
- Around line 48-80: TestParseFeedWithCategories only exercises RSS; add an Atom
fixture and assert Atom category parsing to prevent regressions. Update the test
(TestParseFeedWithCategories) to include an Atom feed string (with <entry>
elements and <category> tags) and either start a second httptest server or
extend the existing handler to serve the Atom fixture at a distinct path, then
call newTestFetcher().ParseFeed for that Atom URL and add assertions analogous
to require.Equal/require.Nil that validate the returned articles' Categories for
the Atom entries. Ensure the test references ParseFeed and the same article
struct expectations so Atom parsing is covered alongside RSS.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4f3a0f4b-53b5-4c04-9edc-d850611b07ae
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (18)
AGENTS.mde2e/e2e_test.gogo.modinternal/cli/commands.gointernal/cli/root.gointernal/controller/controller.gointernal/controller/controller_test.gointernal/model/model.gointernal/rss/rss.gointernal/rss/rss_test.gointernal/scanner/scanner.gointernal/scanner/scanner_test.gointernal/scraper/scraper.gointernal/scraper/scraper_test.gointernal/storage/database.gointernal/storage/database_test.gointernal/storage/migrations/000002_add_categories.down.sqlinternal/storage/migrations/000002_add_categories.up.sql
- Only persist discovered feed URL after ParseFeed succeeds - Fix category LIKE query to use delimited matching (prevents "AI" matching "FAIR") - Replace scanner worker pool with errgroup to prevent deadlock - Add categories to GitHub RSS fixture and e2e test for --category filter Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use json.Marshal/Unmarshal for categories serialization - Query with json_each() for exact element matching instead of LIKE Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/rss/rss_test.go (1)
79-79: Avoid nil-specific assertion for empty categoriesLine 79 is stricter than needed and can make the test brittle. Prefer asserting emptiness rather than exact
nil.Proposed test assertion change
- require.Nil(t, articles[1].Categories) + require.Empty(t, articles[1].Categories)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/rss/rss_test.go` at line 79, The test is too strict by asserting a nil slice; update the assertion to check for emptiness instead—replace the nil-specific check on articles[1].Categories with an emptiness assertion (e.g., require.Empty(t, articles[1].Categories)) so the test passes for both nil and zero-length slices while still verifying no categories are present.internal/storage/database.go (1)
363-372: Delimited LIKE matching improves but doesn't fully solve boundary issues.The four-way
ORpattern correctly handles prefix/suffix/middle positions to avoid "AI" matching "FAIR". However, if a category value itself contains a comma (e.g.,"AI, ML"), the round-trip throughcategoriesToString/categoriesFromStringwill corrupt it into two separate categories.This is acceptable for an MVP if category values are controlled/sanitized upstream (RSS feeds rarely use commas in tags). Consider documenting this limitation or escaping/quoting values if user-defined categories are planned.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/storage/database.go` around lines 363 - 372, The current delimited LIKE matching around the `category` filter avoids partial matches but breaks when a category value contains commas (since `categoriesToString`/`categoriesFromString` round-trip will split on commas); either document this limitation near the serialization helpers or implement escaping/quoting in `categoriesToString` and `categoriesFromString` and update the SQL matching logic to use the escaped form so commas are preserved; locate the serialization helpers (`categoriesToString`, `categoriesFromString`) and either add a clear comment about the comma limitation or change them to escape commas (and adjust the code that builds `query` using `category` accordingly).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/rss/rss_test.go`:
- Line 79: The test is too strict by asserting a nil slice; update the assertion
to check for emptiness instead—replace the nil-specific check on
articles[1].Categories with an emptiness assertion (e.g., require.Empty(t,
articles[1].Categories)) so the test passes for both nil and zero-length slices
while still verifying no categories are present.
In `@internal/storage/database.go`:
- Around line 363-372: The current delimited LIKE matching around the `category`
filter avoids partial matches but breaks when a category value contains commas
(since `categoriesToString`/`categoriesFromString` round-trip will split on
commas); either document this limitation near the serialization helpers or
implement escaping/quoting in `categoriesToString` and `categoriesFromString`
and update the SQL matching logic to use the escaped form so commas are
preserved; locate the serialization helpers (`categoriesToString`,
`categoriesFromString`) and either add a clear comment about the comma
limitation or change them to escape commas (and adjust the code that builds
`query` using `category` accordingly).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1b02f9dc-76c3-438a-a4ea-2ed17dd1a59b
📒 Files selected for processing (10)
e2e/e2e_test.goe2e/expected/11_articles_unread.txte2e/expected/12b_articles_filter_category.txte2e/expected/20_articles_all.txte2e/testdata/github_blog.rssgo.modinternal/rss/rss.gointernal/rss/rss_test.gointernal/scanner/scanner.gointernal/storage/database.go
✅ Files skipped from review due to trivial changes (5)
- e2e/expected/12b_articles_filter_category.txt
- e2e/testdata/github_blog.rss
- e2e/expected/11_articles_unread.txt
- e2e/expected/20_articles_all.txt
- go.mod
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/rss/rss.go
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/storage/database.go (1)
223-257:⚠️ Potential issue | 🟠 MajorDon't silently suppress category serialization/deserialization errors.
categoriesToJSONandcategoriesFromJSONswallow JSON errors by returningnil, which silently loses categories when marshaling or database values are malformed. This violates the guideline to always handle errors in Go code.Modify both helpers to return errors, and update call sites in
AddArticle(line 224),AddArticlesBulk(line 257), andscanArticle(line 481) to propagate them.Proposed patch
-func categoriesToJSON(categories []string) *string { +func categoriesToJSON(categories []string) (*string, error) { if len(categories) == 0 { - return nil + return nil, nil } b, err := json.Marshal(categories) if err != nil { - return nil + return nil, err } s := string(b) - return &s + return &s, nil } -func categoriesFromJSON(s sql.NullString) []string { +func categoriesFromJSON(s sql.NullString) ([]string, error) { if !s.Valid || s.String == "" { - return nil + return nil, nil } var cats []string if err := json.Unmarshal([]byte(s.String), &cats); err != nil { - return nil + return nil, err } - return cats + return cats, nil }- Values(article.BlogID, article.Title, article.URL, formatTimePtr(article.PublishedDate), formatTimePtr(article.DiscoveredDate), article.IsRead, categoriesToJSON(article.Categories)). + categoriesJSON, err := categoriesToJSON(article.Categories) + if err != nil { + return article, err + } + result, err := sq.Insert("articles"). + Columns("blog_id", "title", "url", "published_date", "discovered_date", "is_read", "categories"). + Values(article.BlogID, article.Title, article.URL, formatTimePtr(article.PublishedDate), formatTimePtr(article.DiscoveredDate), article.IsRead, categoriesJSON).- categoriesToJSON(article.Categories), + categoriesJSON,- Categories: categoriesFromJSON(categories), + cats, err := categoriesFromJSON(categories) + if err != nil { + return nil, err + } + article := &model.Article{ + ID: id, + BlogID: blogID, + Title: title, + URL: url, + IsRead: isRead, + Categories: cats, + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/storage/database.go` around lines 223 - 257, The helpers categoriesToJSON and categoriesFromJSON currently swallow JSON errors; change their signatures to return ([]byte, error) and ([]string, error) (or appropriate types plus error) and propagate JSON marshal/unmarshal errors instead of returning nil; update all call sites — AddArticle (where categoriesToJSON is used in the single-insert block), AddArticlesBulk (inside the loop building insert.Values), and scanArticle (where categoriesFromJSON is used when scanning DB rows) — to check the error, return it to the caller, and in AddArticlesBulk ensure the transaction is aborted (tx.Rollback) on any serialization error before returning so partial inserts are not committed. Ensure function signatures and returned errors are updated accordingly throughout the call chain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/storage/database.go`:
- Around line 363-367: Replace the raw SQL predicate string in the categories
filter with a squirrel expression: instead of query.Where("EXISTS (SELECT 1 FROM
json_each(categories) WHERE json_each.value = ?)", *category), call query =
query.Where(sq.Expr("EXISTS (SELECT 1 FROM json_each(categories) WHERE
json_each.value = ?)", *category)) so the filter uses a squirrel expression like
the existing blogID filter that uses sq.Eq{}; update the code around the query,
category, and categories references accordingly.
---
Outside diff comments:
In `@internal/storage/database.go`:
- Around line 223-257: The helpers categoriesToJSON and categoriesFromJSON
currently swallow JSON errors; change their signatures to return ([]byte, error)
and ([]string, error) (or appropriate types plus error) and propagate JSON
marshal/unmarshal errors instead of returning nil; update all call sites —
AddArticle (where categoriesToJSON is used in the single-insert block),
AddArticlesBulk (inside the loop building insert.Values), and scanArticle (where
categoriesFromJSON is used when scanning DB rows) — to check the error, return
it to the caller, and in AddArticlesBulk ensure the transaction is aborted
(tx.Rollback) on any serialization error before returning so partial inserts are
not committed. Ensure function signatures and returned errors are updated
accordingly throughout the call chain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e4d4e2d4-3123-47a1-a7b4-2dabf4145f39
📒 Files selected for processing (1)
internal/storage/database.go
| if category != nil && *category != "" { | ||
| // Categories are stored as a JSON string array. Use json_each() | ||
| // for exact element matching. | ||
| query = query.Where("EXISTS (SELECT 1 FROM json_each(categories) WHERE json_each.value = ?)", *category) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify no raw SQL predicate remains for category filtering and sq.Expr is used.
rg -nP --type=go 'Where\("EXISTS \(SELECT 1 FROM json_each\(categories\) WHERE json_each\.value = \?"' internal/storage/database.go
rg -nP --type=go 'Where\(sq\.Expr\("EXISTS \(SELECT 1 FROM json_each\(categories\) WHERE json_each\.value = \?"' internal/storage/database.goRepository: JulienTant/blogwatcher-cli
Length of output: 52
🏁 Script executed:
# Read the actual content at the specified lines
sed -n '360,370p' internal/storage/database.goRepository: JulienTant/blogwatcher-cli
Length of output: 465
Use a squirrel expression instead of the raw SQL string in the Where call.
Line 366 uses a raw SQL predicate string, which violates the project's requirement to use the squirrel library exclusively. The blogID filter on line 362 correctly uses sq.Eq{}, so apply the same pattern for the category filter.
Proposed patch
- query = query.Where("EXISTS (SELECT 1 FROM json_each(categories) WHERE json_each.value = ?)", *category)
+ query = query.Where(sq.Expr("EXISTS (SELECT 1 FROM json_each(categories) WHERE json_each.value = ?)", *category))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/storage/database.go` around lines 363 - 367, Replace the raw SQL
predicate string in the categories filter with a squirrel expression: instead of
query.Where("EXISTS (SELECT 1 FROM json_each(categories) WHERE json_each.value =
?)", *category), call query = query.Where(sq.Expr("EXISTS (SELECT 1 FROM
json_each(categories) WHERE json_each.value = ?)", *category)) so the filter
uses a squirrel expression like the existing blogID filter that uses sq.Eq{};
update the code around the query, category, and categories references
accordingly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
--category/-cflag to thearticlescommand for filtering000002_add_categories) adds thecategoriescolumnInspired by Hyaxia/blogwatcher#11 by @weronikakombat, reimplemented to follow our project patterns (squirrel, golang-migrate, context threading, testify).
Usage
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests
Documentation