Add SSRF-safe HTTP client with --unsafe-client flag #9
Conversation
- Use DataDog go-secure-sdk for SSRF protection (blocks private IPs, loopback, link-local, cloud metadata endpoints)
- Refactor rss.Fetcher and scraper.Scraper as structs holding *http.Client
- Scanner accepts Fetcher + Scraper as dependencies (no global state)
- Add --unsafe-client / BLOGWATCHER_UNSAFE_CLIENT flag for local dev
- ScanBlog now returns error (hard fail, not soft error string)
- E2e test verifies safe client blocks loopback requests
- Tests use plain http.Client (no SSRF check needed for httptest)
- Update AGENTS.md: ground claims in evidence, not guesses

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (2)
📝 Walkthrough

This PR improves SSRF protection testing by sanitizing the process environment to ensure consistent test behavior, and enhances feed discovery error handling by propagating errors instead of silently returning empty values.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/rss/rss.go (1)
82-108: ⚠️ Potential issue | 🟡 Minor

Errors are silently swallowed in DiscoverFeedURL.

Multiple error paths return `("", nil)` instead of propagating the error:

- Line 85: `http.NewRequestWithContext` failure
- Line 89: `f.client.Do` failure
- Line 97: Non-2xx HTTP status
- Line 102: `url.Parse` failure
- Line 107: `goquery.NewDocumentFromReader` failure

While "discovery not found" may be a valid soft-failure scenario, silently discarding errors (especially network/SSRF errors from `f.client.Do`) makes debugging difficult and violates the guideline to never ignore errors.

Consider returning errors for actual failures while returning `("", nil)` only for "not found" scenarios:

Proposed fix to distinguish errors from "not found"

```diff
 func (f *Fetcher) DiscoverFeedURL(ctx context.Context, blogURL string) (string, error) {
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, blogURL, nil)
 	if err != nil {
-		return "", nil
+		return "", fmt.Errorf("create request: %w", err)
 	}
 	response, err := f.client.Do(req)
 	if err != nil {
-		return "", nil
+		return "", fmt.Errorf("fetch blog URL: %w", err)
 	}
 	defer func() {
 		if err := response.Body.Close(); err != nil {
 			fmt.Fprintf(os.Stderr, "close: %v\n", err)
 		}
 	}()
 	if response.StatusCode < 200 || response.StatusCode >= 300 {
-		return "", nil
+		return "", nil // Not found is acceptable
 	}
 	base, err := url.Parse(blogURL)
 	if err != nil {
-		return "", nil
+		return "", fmt.Errorf("parse blog URL: %w", err)
 	}
 	doc, err := goquery.NewDocumentFromReader(response.Body)
 	if err != nil {
-		return "", nil
+		return "", fmt.Errorf("parse HTML: %w", err)
 	}
```

As per coding guidelines: "Never ignore errors in Go - handle all errors returned by functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/rss/rss.go` around lines 82 - 108, the DiscoverFeedURL function currently swallows all errors (e.g., failures from http.NewRequestWithContext, f.client.Do, non-2xx responses, url.Parse, and goquery.NewDocumentFromReader) by returning ("", nil); change it to propagate real failures as errors (return "", err or a wrapped error) for network/parse/http errors and reserve returning ("", nil) only for legitimate "feed not found" cases (e.g., when discovery markup is absent or HTTP 404/204 where you decide not to treat as an error). Update the error returns in DiscoverFeedURL, ensure f.client.Do errors are returned, non-2xx responses return a descriptive error (unless explicitly handled as not-found), and parsing errors from url.Parse and goquery.NewDocumentFromReader are returned so callers can distinguish real failures from "not found".

internal/scanner/scanner.go (1)
157-193: ⚠️ Potential issue | 🟠 Major

Worker early return on error can cause goroutine leak and deadlock.

When a worker encounters an error (lines 171-174), it sends the error and returns immediately without draining remaining jobs from the channel. Meanwhile, the main goroutine continues sending jobs (lines 181-183). If workers exit early, `jobs <- job{...}` will block forever since no worker is receiving.

Additionally, remaining workers that finish successfully will send `nil` to `errs`, but if one worker exited early, you may not receive the expected number of messages, or you may return an error while other workers are still running.

Proposed fix using errgroup for cleaner cancellation

```diff
+import "golang.org/x/sync/errgroup"
+
 func (s *Scanner) ScanAllBlogs(ctx context.Context, db *storage.Database, workers int) ([]ScanResult, error) {
 	blogs, err := db.ListBlogs(ctx)
 	if err != nil {
 		return nil, err
 	}
 	if workers <= 1 {
 		// ... sequential path unchanged ...
 	}
 	type job struct {
 		Index int
 		Blog  model.Blog
 	}
 	jobs := make(chan job)
 	results := make([]ScanResult, len(blogs))
-	errs := make(chan error, workers)
+
+	g, gctx := errgroup.WithContext(ctx)
 	for i := 0; i < workers; i++ {
-		go func() {
-			workerDB, err := storage.OpenDatabase(ctx, db.Path())
+		g.Go(func() error {
+			workerDB, err := storage.OpenDatabase(gctx, db.Path())
 			if err != nil {
-				errs <- err
-				return
+				return err
 			}
 			defer func() {
 				if err := workerDB.Close(); err != nil {
 					fmt.Fprintf(os.Stderr, "close: %v\n", err)
 				}
 			}()
 			for item := range jobs {
-				result, err := s.ScanBlog(ctx, workerDB, item.Blog)
+				result, err := s.ScanBlog(gctx, workerDB, item.Blog)
 				if err != nil {
-					errs <- fmt.Errorf("scan %s: %w", item.Blog.Name, err)
-					return
+					return fmt.Errorf("scan %s: %w", item.Blog.Name, err)
 				}
 				results[item.Index] = result
 			}
-			errs <- nil
-		}()
+			return nil
+		})
 	}
+	// Send jobs in a separate goroutine so we can close channel when context cancels
+	go func() {
+		defer close(jobs)
+		for index, blog := range blogs {
+			select {
+			case jobs <- job{Index: index, Blog: blog}:
+			case <-gctx.Done():
+				return
+			}
+		}
+	}()
-	for index, blog := range blogs {
-		jobs <- job{Index: index, Blog: blog}
-	}
-	close(jobs)
-
-	for i := 0; i < workers; i++ {
-		if err := <-errs; err != nil {
-			return nil, err
-		}
-	}
+	if err := g.Wait(); err != nil {
+		return nil, err
+	}
 	return results, nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scanner/scanner.go` around lines 157 - 193, The worker goroutine currently returns early on error while the main sender keeps pushing to jobs, causing deadlocks; replace the manual workers/errs pattern with an errgroup (use errgroup.WithContext(ctx)) and spawn workers via g.Go so that a single cancellation stops all workers and the sender; inside each worker (the anonymous goroutine that currently calls storage.OpenDatabase and s.ScanBlog) open the DB, defer Close, then loop receiving from jobs with select on ctx.Done() and the jobs channel (handle ok==false to exit), and return any scan error so the errgroup cancels the context; in the sender loop that enqueues job{Index, Blog} select on ctx.Done() as well to stop early and close(jobs) after enqueuing, then call g.Wait() to collect the first error and return results.
🧹 Nitpick comments (2)
AGENTS.md (1)
1-3: Excellent addition to document agent verification behavior.

The new "Before speaking" section establishes a clear principle: ground claims in evidence rather than speculation. The specific verification methods (go get, read source/docs, write tests) are concrete and actionable, which aligns well with the high-standards philosophy of this file. As per coding guidelines, this properly documents agent responsibilities and capabilities.

📝 Optional: Reduce repetitive sentence structure

The three successive sentences beginning with "If you're unsure" create a repetitive pattern. Consider rewording for variety:

```diff
-Never guess. Ground every claim in actual files, docs, or command output. If you're unsure about a dependency's size, run `go get` and check. If you're unsure about an API, read the source or docs. If you're unsure about behavior, write a test. State what you verified and how — don't speculate.
+Never guess. Ground every claim in actual files, docs, or command output. When uncertain about a dependency's size, run `go get` and check. For API questions, read the source or docs. To verify behavior, write a test. State what you verified and how — don't speculate.
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@AGENTS.md` around lines 1 - 3, the "Before speaking" section has three consecutive sentences all starting with "If you're unsure", which is repetitive; edit the paragraph under the "Before speaking" header to vary sentence openings and tighten flow while preserving the three verification actions (run `go get` and check, read source/docs, write a test) and the instruction to state what was verified and how — target the three sentences specifically for rewording so the meaning stays identical but reads less repetitive.

e2e/e2e_test.go (1)
423-424: Assert on the repo-owned error prefix instead of the SDK's exact wording.

The test currently asserts on `"is not authorized"` from go-secure-sdk, which can change with dependency updates. Since `internal/rss/rss.go` wraps fetch failures with `"failed to fetch feed:"` (line 48), asserting on that repo-owned prefix is more stable and equally validates that SSRF protection is blocking the request.

Diff

```diff
-require.Contains(t, string(out), "is not authorized", "expected SSRF error message")
+require.Contains(t, string(out), "failed to fetch feed:", "expected SSRF error message")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@e2e/e2e_test.go` around lines 423 - 424, Update the assertion in e2e_test.go to check for the repo-owned error prefix instead of the SDK message: replace the require.Contains that expects "is not authorized" with one that looks for "failed to fetch feed:" (the wrapper string emitted by internal/rss/rss.go when fetch fails). This ensures the test validates the repo's wrapped error from the fetch helper (see internal/rss/rss.go) and remains stable against upstream SDK message changes; modify the require.Contains call in the test accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@e2e/e2e_test.go`:
- Around line 411-423: The test spawns subprocesses with cmd :=
exec.CommandContext(...) and currently uses cmd.Env = append(os.Environ(),
"NO_COLOR=1"), which can inherit BLOGWATCHER_UNSAFE_CLIENT from the environment;
update both places where cmd.Env is set (the "add" and "scan" invocations) to
remove or override BLOGWATCHER_UNSAFE_CLIENT (e.g. filter it out from
os.Environ() or explicitly set "BLOGWATCHER_UNSAFE_CLIENT=" or
"BLOGWATCHER_UNSAFE_CLIENT=false" before appending "NO_COLOR=1") so the test
reliably exercises the safe client path when invoking the binaryPath with
dbPath.
In `@internal/scraper/scraper.go`:
- Around line 40-45: The Scraper currently panics when its HTTP client is nil
(e.g., NewScraper(nil)) because ScrapeBlog calls s.client.Do(req); update this
by validating the injected client and failing fast: in NewScraper return an
error if the provided http.Client is nil (or set a default client) and/or add a
nil-check at the start of ScrapeBlog to return a clear ScrapeError (include
context like "nil http client") instead of dereferencing s.client in ScrapeBlog;
reference the Scraper type, NewScraper function, ScrapeBlog method, and the
s.client.Do call when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e320a777-4963-4818-88c5-b9667bd32f88
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (11)
AGENTS.md, e2e/e2e_test.go, go.mod, internal/cli/commands.go, internal/cli/root.go, internal/rss/rss.go, internal/rss/rss_test.go, internal/scanner/scanner.go, internal/scanner/scanner_test.go, internal/scraper/scraper.go, internal/scraper/scraper_test.go
```diff
 func (s *Scraper) ScrapeBlog(ctx context.Context, blogURL string, selector string) ([]ScrapedArticle, error) {
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, blogURL, nil)
 	if err != nil {
 		return nil, ScrapeError{Message: fmt.Sprintf("failed to create request: %v", err)}
 	}
-	response, err := client.Do(req)
+	response, err := s.client.Do(req)
```
Fail fast when the scraper client is missing.
NewScraper(nil) currently panics later at s.client.Do(req). Since the injected client now controls timeout and SSRF posture, this should return a regular error instead of dereferencing nil.
🐛 Proposed fix
```diff
 func (s *Scraper) ScrapeBlog(ctx context.Context, blogURL string, selector string) ([]ScrapedArticle, error) {
+	if s == nil || s.client == nil {
+		return nil, ScrapeError{Message: "http client is not configured"}
+	}
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, blogURL, nil)
 	if err != nil {
 		return nil, ScrapeError{Message: fmt.Sprintf("failed to create request: %v", err)}
 	}
 	response, err := s.client.Do(req)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func (s *Scraper) ScrapeBlog(ctx context.Context, blogURL string, selector string) ([]ScrapedArticle, error) {
	if s == nil || s.client == nil {
		return nil, ScrapeError{Message: "http client is not configured"}
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, blogURL, nil)
	if err != nil {
		return nil, ScrapeError{Message: fmt.Sprintf("failed to create request: %v", err)}
	}
	response, err := s.client.Do(req)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scraper/scraper.go` around lines 40 - 45, The Scraper currently
panics when its HTTP client is nil (e.g., NewScraper(nil)) because ScrapeBlog
calls s.client.Do(req); update this by validating the injected client and
failing fast: in NewScraper return an error if the provided http.Client is nil
(or set a default client) and/or add a nil-check at the start of ScrapeBlog to
return a clear ScrapeError (include context like "nil http client") instead of
dereferencing s.client in ScrapeBlog; reference the Scraper type, NewScraper
function, ScrapeBlog method, and the s.client.Do call when making the change.
…v leak
- DiscoverFeedURL now propagates real errors (network, context cancel,
parse) instead of swallowing them as ("", nil)
- Replace scanner worker pool with errgroup to prevent deadlock when a
worker returns early on error while sender is still pushing jobs
- Filter BLOGWATCHER_UNSAFE_CLIENT from env in SSRF e2e test to ensure
the safe client path is always exercised
- Check our own wrapped error message in SSRF test instead of SDK string
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- SSRF-safe HTTP client via DataDog's go-secure-sdk — blocks requests to private IPs, loopback, link-local, and cloud metadata endpoints (CWE-918)
- `rss.Fetcher` and `scraper.Scraper` refactored as structs holding `*http.Client` (dependency injection, no global state)
- `Scanner` accepts `Fetcher` + `Scraper` as dependencies
- `ScanBlog` now returns hard errors instead of soft error strings
- `--unsafe-client` / `BLOGWATCHER_UNSAFE_CLIENT` flag to disable SSRF protection for local dev

Inspired by Hyaxia/blogwatcher#8 by @mihajlo-jovanovic, reimplemented with dependency injection and DataDog's go-secure-sdk.
Test plan
- `golangci-lint run` passes
- `gotestsum -- ./...` passes (23 tests)
- `TestSSRFProtection` verifies loopback is blocked without `--unsafe-client`

🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests