
Add SSRF-safe HTTP client with --unsafe-client flag#9

Merged
JulienTant merged 4 commits into main from ssrf-safe-client
Apr 4, 2026
Conversation

@JulienTant (Owner) commented Apr 3, 2026

Summary

  • SSRF protection via DataDog go-secure-sdk — blocks requests to private IPs, loopback, link-local, and cloud metadata endpoints (CWE-918)
  • Refactored rss.Fetcher and scraper.Scraper as structs holding *http.Client (dependency injection, no global state)
  • Scanner accepts Fetcher + Scraper as dependencies
  • ScanBlog now returns hard errors instead of soft error strings
  • Added --unsafe-client / BLOGWATCHER_UNSAFE_CLIENT flag to disable SSRF protection for local dev
  • E2e test verifies safe client actually blocks loopback requests
  • Updated AGENTS.md with "ground claims in evidence" rule

Inspired by Hyaxia/blogwatcher#8 by @mihajlo-jovanovic, reimplemented with dependency injection and DataDog's go-secure-sdk.

Test plan

  • golangci-lint run passes
  • gotestsum -- ./... passes (23 tests)
  • TestSSRFProtection verifies loopback is blocked without --unsafe-client

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced error handling in feed discovery to return informative error messages for request failures instead of silent failures.
  • Tests

    • Improved test robustness for SSRF protection validation by isolating test environment from caller's shell settings.

- Use DataDog go-secure-sdk for SSRF protection (blocks private IPs,
  loopback, link-local, cloud metadata endpoints)
- Refactor rss.Fetcher and scraper.Scraper as structs holding *http.Client
- Scanner accepts Fetcher + Scraper as dependencies (no global state)
- Add --unsafe-client / BLOGWATCHER_UNSAFE_CLIENT flag for local dev
- ScanBlog now returns error (hard fail, not soft error string)
- E2e test verifies safe client blocks loopback requests
- Tests use plain http.Client (no SSRF check needed for httptest)
- Update AGENTS.md: ground claims in evidence, not guesses

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
coderabbitai bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 29b9e4b4-6d2a-4a85-a442-30a6f4f2f64a

📥 Commits

Reviewing files that changed from the base of the PR and between 75c6119 and cd311c6.

📒 Files selected for processing (2)
  • e2e/e2e_test.go
  • internal/rss/rss.go

📝 Walkthrough

This PR improves SSRF protection testing by sanitizing the process environment to ensure consistent test behavior, and enhances feed discovery error handling by propagating errors instead of silently returning empty values.

Changes

  • SSRF Protection Testing — e2e/e2e_test.go: Added a filterEnv helper to remove environment variables from the test context. Updated TestSSRFProtection to sanitize the process environment by filtering out BLOGWATCHER_UNSAFE_CLIENT before test execution, ensuring consistent safe-client behavior, and adjusted the assertion to expect the "failed to fetch feed:" error message.
  • Feed Discovery Error Handling — internal/rss/rss.go: Modified DiscoverFeedURL to wrap and return errors for request construction, HTTP request failures, and document parsing failures instead of silently returning empty values. Non-2xx HTTP status codes are still treated as "not found".

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 With whiskers twitching, I hopped through the code,
Cleaning the paths where our tests once strode.
Error wrapping blooms like clover so sweet,
Now feed discovery's honest and neat! 🌱

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 14.29%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check ✅ — The title accurately summarizes the main change: adding SSRF protection with an unsafe-client flag for local development.
  • Description check ✅ — The description covers all required template sections (Summary and Test plan) with comprehensive details about the changes, testing, and implementation approach.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/rss/rss.go (1)

82-108: ⚠️ Potential issue | 🟡 Minor

Errors are silently swallowed in DiscoverFeedURL.

Multiple error paths return ("", nil) instead of propagating the error:

  • Line 85: http.NewRequestWithContext failure
  • Line 89: f.client.Do failure
  • Line 97: Non-2xx HTTP status
  • Line 102: url.Parse failure
  • Line 107: goquery.NewDocumentFromReader failure

While "discovery not found" may be a valid soft-failure scenario, silently discarding errors (especially network/SSRF errors from f.client.Do) makes debugging difficult and violates the guideline to never ignore errors.

Consider returning errors for actual failures while returning ("", nil) only for "not found" scenarios:

Proposed fix to distinguish errors from "not found"
 func (f *Fetcher) DiscoverFeedURL(ctx context.Context, blogURL string) (string, error) {
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, blogURL, nil)
 	if err != nil {
-		return "", nil
+		return "", fmt.Errorf("create request: %w", err)
 	}
 	response, err := f.client.Do(req)
 	if err != nil {
-		return "", nil
+		return "", fmt.Errorf("fetch blog URL: %w", err)
 	}
 	defer func() {
 		if err := response.Body.Close(); err != nil {
 			fmt.Fprintf(os.Stderr, "close: %v\n", err)
 		}
 	}()
 	if response.StatusCode < 200 || response.StatusCode >= 300 {
-		return "", nil
+		return "", nil // Not found is acceptable
 	}
 
 	base, err := url.Parse(blogURL)
 	if err != nil {
-		return "", nil
+		return "", fmt.Errorf("parse blog URL: %w", err)
 	}
 
 	doc, err := goquery.NewDocumentFromReader(response.Body)
 	if err != nil {
-		return "", nil
+		return "", fmt.Errorf("parse HTML: %w", err)
 	}

As per coding guidelines: "Never ignore errors in Go - handle all errors returned by functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/rss/rss.go` around lines 82 - 108, The DiscoverFeedURL function
currently swallows all errors (e.g., failures from http.NewRequestWithContext,
f.client.Do, non-2xx responses, url.Parse, and goquery.NewDocumentFromReader) by
returning ("", nil); change it to propagate real failures as errors (return "",
err or a wrapped error) for network/parse/http errors and reserve returning ("",
nil) only for legitimate "feed not found" cases (e.g., when discovery markup is
absent or HTTP 404/204 where you decide not to treat as an error). Update the
error returns in DiscoverFeedURL, ensure f.client.Do errors are returned,
non-2xx responses return a descriptive error (unless explicitly handled as
not-found), and parsing errors from url.Parse and goquery.NewDocumentFromReader
are returned so callers can distinguish real failures from "not found."
internal/scanner/scanner.go (1)

157-193: ⚠️ Potential issue | 🟠 Major

Worker early return on error can cause goroutine leak and deadlock.

When a worker encounters an error (lines 171-174), it sends the error and returns immediately without draining remaining jobs from the channel. Meanwhile, the main goroutine continues sending jobs (lines 181-183). If workers exit early, jobs <- job{...} will block forever since no worker is receiving.

Additionally, remaining workers that finish successfully will send nil to errs, but if one worker exited early, you may not receive the expected number of messages, or you may return an error while other workers are still running.

Proposed fix using errgroup for cleaner cancellation
+import "golang.org/x/sync/errgroup"
+
 func (s *Scanner) ScanAllBlogs(ctx context.Context, db *storage.Database, workers int) ([]ScanResult, error) {
 	blogs, err := db.ListBlogs(ctx)
 	if err != nil {
 		return nil, err
 	}
 	if workers <= 1 {
 		// ... sequential path unchanged ...
 	}

 	type job struct {
 		Index int
 		Blog  model.Blog
 	}
 	jobs := make(chan job)
 	results := make([]ScanResult, len(blogs))
-	errs := make(chan error, workers)
+
+	g, gctx := errgroup.WithContext(ctx)

 	for i := 0; i < workers; i++ {
-		go func() {
-			workerDB, err := storage.OpenDatabase(ctx, db.Path())
+		g.Go(func() error {
+			workerDB, err := storage.OpenDatabase(gctx, db.Path())
 			if err != nil {
-				errs <- err
-				return
+				return err
 			}
 			defer func() {
 				if err := workerDB.Close(); err != nil {
 					fmt.Fprintf(os.Stderr, "close: %v\n", err)
 				}
 			}()
 			for item := range jobs {
-				result, err := s.ScanBlog(ctx, workerDB, item.Blog)
+				result, err := s.ScanBlog(gctx, workerDB, item.Blog)
 				if err != nil {
-					errs <- fmt.Errorf("scan %s: %w", item.Blog.Name, err)
-					return
+					return fmt.Errorf("scan %s: %w", item.Blog.Name, err)
 				}
 				results[item.Index] = result
 			}
-			errs <- nil
-		}()
+			return nil
+		})
 	}

+	// Send jobs in a separate goroutine so we can close channel when context cancels
+	go func() {
+		defer close(jobs)
+		for index, blog := range blogs {
+			select {
+			case jobs <- job{Index: index, Blog: blog}:
+			case <-gctx.Done():
+				return
+			}
+		}
+	}()

-	for index, blog := range blogs {
-		jobs <- job{Index: index, Blog: blog}
-	}
-	close(jobs)
-
-	for i := 0; i < workers; i++ {
-		if err := <-errs; err != nil {
-			return nil, err
-		}
-	}
+	if err := g.Wait(); err != nil {
+		return nil, err
+	}

 	return results, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scanner/scanner.go` around lines 157 - 193, The worker goroutine
currently returns early on error while the main sender keeps pushing to jobs,
causing deadlocks; replace the manual workers/errs pattern with an errgroup (use
errgroup.WithContext(ctx)) and spawn workers via g.Go so that a single
cancellation stops all workers and the sender; inside each worker (the anonymous
goroutine that currently calls storage.OpenDatabase and s.ScanBlog) open the DB,
defer Close, then loop receiving from jobs with select on ctx.Done() and the
jobs channel (handle ok==false to exit), and return any scan error so the
errgroup cancels the context; in the sender loop that enqueues job{Index, Blog}
select on ctx.Done() as well to stop early and close(jobs) after enqueuing, then
call g.Wait() to collect the first error and return results.
🧹 Nitpick comments (2)
AGENTS.md (1)

1-3: Excellent addition to document agent verification behavior.

The new "Before speaking" section establishes a clear principle: ground claims in evidence rather than speculation. The specific verification methods (go get, read source/docs, write tests) are concrete and actionable, which aligns well with the high-standards philosophy of this file. As per coding guidelines, this properly documents agent responsibilities and capabilities.

📝 Optional: Reduce repetitive sentence structure

The three successive sentences beginning with "If you're unsure" create a repetitive pattern. Consider rewording for variety:

-Never guess. Ground every claim in actual files, docs, or command output. If you're unsure about a dependency's size, run `go get` and check. If you're unsure about an API, read the source or docs. If you're unsure about behavior, write a test. State what you verified and how — don't speculate.
+Never guess. Ground every claim in actual files, docs, or command output. When uncertain about a dependency's size, run `go get` and check. For API questions, read the source or docs. To verify behavior, write a test. State what you verified and how — don't speculate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` around lines 1 - 3, The "Before speaking" section has three
consecutive sentences all starting with "If you're unsure" which is repetitive;
edit the paragraph under the "Before speaking" header to vary sentence openings
and tighten flow while preserving the three verification actions (run `go get`
and check, read source/docs, write a test) and the instruction to state what was
verified and how—target the three sentences specifically for rewording so the
meaning stays identical but reads less repetitive.
e2e/e2e_test.go (1)

423-424: Assert on the repo-owned error prefix instead of the SDK's exact wording.

The test currently asserts on "is not authorized" from go-secure-sdk, which can change with dependency updates. Since internal/rss/rss.go wraps fetch failures with "failed to fetch feed:" (line 48), asserting on that repo-owned prefix is more stable and equally validates that SSRF protection is blocking the request.

Diff
-	require.Contains(t, string(out), "is not authorized", "expected SSRF error message")
+	require.Contains(t, string(out), "failed to fetch feed:", "expected SSRF error message")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e/e2e_test.go` around lines 423 - 424, Update the assertion in e2e_test.go
to check for the repo-owned error prefix instead of the SDK message: replace the
require.Contains that expects "is not authorized" with one that looks for
"failed to fetch feed:" (the wrapper string emitted by internal/rss/rss.go when
fetch fails). This ensures the test validates the repo's wrapped error from the
fetch helper (see internal/rss/rss.go) and remains stable against upstream SDK
message changes; modify the require.Contains call in the test accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e/e2e_test.go`:
- Around line 411-423: The test spawns subprocesses with cmd :=
exec.CommandContext(...) and currently uses cmd.Env = append(os.Environ(),
"NO_COLOR=1"), which can inherit BLOGWATCHER_UNSAFE_CLIENT from the environment;
update both places where cmd.Env is set (the "add" and "scan" invocations) to
remove or override BLOGWATCHER_UNSAFE_CLIENT (e.g. filter it out from
os.Environ() or explicitly set "BLOGWATCHER_UNSAFE_CLIENT=" or
"BLOGWATCHER_UNSAFE_CLIENT=false" before appending "NO_COLOR=1") so the test
reliably exercises the safe client path when invoking the binaryPath with
dbPath.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e320a777-4963-4818-88c5-b9667bd32f88

📥 Commits

Reviewing files that changed from the base of the PR and between 2972f35 and 75c6119.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (11)
  • AGENTS.md
  • e2e/e2e_test.go
  • go.mod
  • internal/cli/commands.go
  • internal/cli/root.go
  • internal/rss/rss.go
  • internal/rss/rss_test.go
  • internal/scanner/scanner.go
  • internal/scanner/scanner_test.go
  • internal/scraper/scraper.go
  • internal/scraper/scraper_test.go

Comment on lines +40 to +45 in internal/scraper/scraper.go

 func (s *Scraper) ScrapeBlog(ctx context.Context, blogURL string, selector string) ([]ScrapedArticle, error) {
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, blogURL, nil)
 	if err != nil {
 		return nil, ScrapeError{Message: fmt.Sprintf("failed to create request: %v", err)}
 	}
-	response, err := client.Do(req)
+	response, err := s.client.Do(req)

⚠️ Potential issue | 🟠 Major

Fail fast when the scraper client is missing.

NewScraper(nil) currently panics later at s.client.Do(req). Since the injected client now controls timeout and SSRF posture, this should return a regular error instead of dereferencing nil.

🐛 Proposed fix
 func (s *Scraper) ScrapeBlog(ctx context.Context, blogURL string, selector string) ([]ScrapedArticle, error) {
+	if s == nil || s.client == nil {
+		return nil, ScrapeError{Message: "http client is not configured"}
+	}
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, blogURL, nil)
 	if err != nil {
 		return nil, ScrapeError{Message: fmt.Sprintf("failed to create request: %v", err)}
 	}
 	response, err := s.client.Do(req)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 func (s *Scraper) ScrapeBlog(ctx context.Context, blogURL string, selector string) ([]ScrapedArticle, error) {
+	if s == nil || s.client == nil {
+		return nil, ScrapeError{Message: "http client is not configured"}
+	}
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, blogURL, nil)
 	if err != nil {
 		return nil, ScrapeError{Message: fmt.Sprintf("failed to create request: %v", err)}
 	}
-	response, err := client.Do(req)
+	response, err := s.client.Do(req)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scraper/scraper.go` around lines 40 - 45, The Scraper currently
panics when its HTTP client is nil (e.g., NewScraper(nil)) because ScrapeBlog
calls s.client.Do(req); update this by validating the injected client and
failing fast: in NewScraper return an error if the provided http.Client is nil
(or set a default client) and/or add a nil-check at the start of ScrapeBlog to
return a clear ScrapeError (include context like "nil http client") instead of
dereferencing s.client in ScrapeBlog; reference the Scraper type, NewScraper
function, ScrapeBlog method, and the s.client.Do call when making the change.

JulienTant and others added 3 commits April 3, 2026 15:16
…v leak

- DiscoverFeedURL now propagates real errors (network, context cancel,
  parse) instead of swallowing them as ("", nil)
- Replace scanner worker pool with errgroup to prevent deadlock when a
  worker returns early on error while sender is still pushing jobs
- Filter BLOGWATCHER_UNSAFE_CLIENT from env in SSRF e2e test to ensure
  the safe client path is always exercised
- Check our own wrapped error message in SSRF test instead of SDK string

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JulienTant merged commit dbdcb35 into main Apr 4, 2026
2 checks passed
JulienTant deleted the ssrf-safe-client branch April 4, 2026 00:53
