fix(conf): fixed bot configs #12

Merged
cnlangzi merged 6 commits into main from fix/bots
Jan 11, 2026

@cnlangzi cnlangzi commented Jan 11, 2026

Summary by Sourcery

Update bot configuration domains and add new integration tests for external IP sources.

Bug Fixes:

  • Correct bot reverse DNS domain mappings for several known crawlers and services.

Enhancements:

  • Add integration test for parsing Cloudflare IPv4 prefix lists from their published TXT data.
  • Add integration test for parsing Google special crawler IP ranges from the official JSON source.

Tests:

  • Extend integration test suite with coverage for Cloudflare IPv4 ranges and Google special crawler IP ranges.
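
To make the Cloudflare check concrete, here is a minimal sketch of parsing a newline-separated CIDR list and requiring every entry to be IPv4. `parseIPv4Prefixes` is a hypothetical helper, not the repository's TxtParser, and skipping blank lines and `#` comments is an assumed convention. It reads the downloaded bytes through bytes.NewReader rather than converting them to a string first.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"net/netip"
	"strings"
)

// parseIPv4Prefixes is a hypothetical helper (not the repository's TxtParser).
// It reads one CIDR per line, skips blank lines and '#' comments (an assumed
// convention), and returns an error for anything that is not an IPv4 prefix.
func parseIPv4Prefixes(data []byte) ([]netip.Prefix, error) {
	var out []netip.Prefix
	sc := bufio.NewScanner(bytes.NewReader(data))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		p, err := netip.ParsePrefix(line)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", line, err)
		}
		if !p.Addr().Is4() {
			return nil, fmt.Errorf("%q is not an IPv4 prefix", line)
		}
		// Masked() zeroes the host bits so entries are in canonical form.
		out = append(out, p.Masked())
	}
	return out, sc.Err()
}
```

An integration test built on this would download the list, call the helper, and assert that the result is non-empty.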

- Add TestIntegration_Cloudflare for IPv4 IP list
- Add TestIntegration_GoogleSpecial for special crawlers
- Update meta-externalagent to use RDNS (no public IP list available)
- Remove TestIntegration_MetaExternalAgent (no valid URL)
- Add tfbnw.net and facebook.com to verified domains
- Meta crawlers use multiple domain suffixes
- mj12bot: mj12bot.com (not majestic.com)
- yandexbot: yandex.net
- semrushbot: bl.bot.semrush.com
- semrushbot-ba: bl.bot.semrush.com
- sogou: crawl.sogou.com
- pinterestbot: pinimg.com
- google-storebot: googlebot.com, rate-limited-proxy.google.com
- chrome-lighthouse: googlebot.com, rate-limited-proxy.google.com
- linkedinbot: fwd.linkedin.com
- chatgpt-user: chat.openai.com

sourcery-ai bot commented Jan 11, 2026

Reviewer's Guide

Adds new integration tests for Cloudflare IP ranges and Google special crawlers, and tightens multiple bot verification configurations by updating RDNS/domain rules to better match official sources.

Sequence diagram for updated bot RDNS and domain verification

sequenceDiagram
    actor Client
    participant WebServer
    participant BotDetector
    participant RDNSResolver
    participant BotConfigStore

    Client->>WebServer: HTTP request with UserAgent and IP
    WebServer->>BotDetector: Classify request

    BotDetector->>BotConfigStore: Load bot config for UserAgent
    BotConfigStore-->>BotDetector: Config with rdns flag and domains list

    alt rdns is true
        BotDetector->>RDNSResolver: Reverse DNS lookup for IP
        RDNSResolver-->>BotDetector: Hostname
        BotDetector->>BotDetector: Check hostname ends with allowed domains
        alt hostname matches allowed domains
            BotDetector-->>WebServer: Mark as verified bot
        else hostname does not match
            BotDetector-->>WebServer: Mark as unverified bot
        end
    else rdns is false
        BotDetector-->>WebServer: Use other parsing logic
    end

    WebServer-->>Client: Response with bot-aware handling
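
The RDNS branch of the diagram can be sketched in Go roughly as follows. This is an illustration under assumptions, not the project's code: `hostMatchesDomains` and `verifyRDNS` are hypothetical names, and matching on a label boundary (a leading dot) is a stricter reading of "hostname ends with allowed domains" than a plain suffix check. The lookup function is injected so the logic can be exercised without network access; forward confirmation of the hostname is not shown.

```go
package main

import (
	"net"
	"strings"
)

// hostMatchesDomains reports whether host equals one of the allowed domains
// or is a subdomain of one. Requiring a "." before the suffix keeps
// "evilgooglebot.com" from passing as "googlebot.com".
func hostMatchesDomains(host string, domains []string) bool {
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	for _, d := range domains {
		d = strings.ToLower(strings.TrimSuffix(d, "."))
		if d == "" {
			continue
		}
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

// verifyRDNS is a hypothetical verifier: it resolves ip to hostnames with the
// supplied lookup function and accepts the IP if any hostname matches.
func verifyRDNS(ip string, domains []string, lookup func(string) ([]string, error)) bool {
	names, err := lookup(ip)
	if err != nil {
		return false
	}
	for _, n := range names {
		if hostMatchesDomains(n, domains) {
			return true
		}
	}
	return false
}

// defaultLookup wires in the standard library resolver for production use.
var defaultLookup = net.LookupAddr
```

In production the call would be `verifyRDNS(ip, cfg.Domains, defaultLookup)`, where `cfg.Domains` stands in for whatever field holds the domains list from the bot config.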

File-Level Changes

Change: Added integration tests to validate parsing of Cloudflare IPv4 ranges and Google special crawler IP ranges from their live endpoints.
Details:
  • Introduce TestIntegration_Cloudflare using TxtParser against Cloudflare IPv4 list URL and ensure all parsed prefixes are IPv4 and non-empty.
  • Introduce TestIntegration_GoogleSpecial using GoogleParser against Google special crawlers JSON and assert the resulting prefix list is non-empty, skipping when network is unavailable or tests run in short mode.
Files:
  • parser/parser_txt_test.go
  • parser/parser_google_test.go

Change: Adjusted bot configuration files to more accurately reflect official RDNS and domain patterns for several major bots.
Details:
  • Switch meta-externalagent from Google JSON parser to RDNS-based validation, using Meta-owned domains instead of a URL feed.
  • Expand Chrome-Lighthouse and GoogleStoreBot domain lists to include googlebot.com and rate-limited-proxy.google.com for more complete Google RDNS coverage.
  • Correct MJ12Bot RDNS domain from majestic.com to mj12bot.com.
  • Augment ChatGPT-User, LinkedInBot, PinterestBot, and SemrushBot (including backlinks) with additional known RDNS domains used in practice.
  • Extend Sogou and YandexBot configurations with their crawler-specific and additional TLD domains while keeping RDNS verification enabled.
Files:
  • bots/conf.d/meta-externalagent.yaml
  • bots/conf.d/chrome-lighthouse.yaml
  • bots/conf.d/google-storebot.yaml
  • bots/conf.d/mj12bot.yaml
  • bots/conf.d/chatgpt-user.yaml
  • bots/conf.d/linkedinbot.yaml
  • bots/conf.d/pinterestbot.yaml
  • bots/conf.d/semrushbot-backlinks.yaml
  • bots/conf.d/semrushbot.yaml
  • bots/conf.d/sogou.yaml
  • bots/conf.d/yandexbot.yaml
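
For the Google special crawler ranges, a parser along the following lines is one plausible shape. It is a sketch, not the repository's GoogleParser: `parseGooglePrefixes` and `prefixFile` are hypothetical names, and the JSON layout (a `prefixes` array whose entries carry either `ipv4Prefix` or `ipv6Prefix`) is assumed from Google's published crawler range files.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/netip"
)

// prefixFile mirrors the assumed JSON layout: a "prefixes" array whose
// entries carry either an ipv4Prefix or an ipv6Prefix field.
type prefixFile struct {
	Prefixes []struct {
		IPv4 string `json:"ipv4Prefix"`
		IPv6 string `json:"ipv6Prefix"`
	} `json:"prefixes"`
}

// parseGooglePrefixes is a hypothetical helper that decodes the JSON document
// and returns every prefix it contains, failing on the first malformed entry.
func parseGooglePrefixes(data []byte) ([]netip.Prefix, error) {
	var f prefixFile
	if err := json.Unmarshal(data, &f); err != nil {
		return nil, fmt.Errorf("decode: %w", err)
	}
	var out []netip.Prefix
	for _, e := range f.Prefixes {
		raw := e.IPv4
		if raw == "" {
			raw = e.IPv6
		}
		if raw == "" {
			continue // entry carries neither field; ignore it
		}
		p, err := netip.ParsePrefix(raw)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", raw, err)
		}
		out = append(out, p)
	}
	return out, nil
}
```

A test in the spirit of TestIntegration_GoogleSpecial would fetch the live JSON, pass the bytes to this function, and assert that the returned slice is non-empty.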

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The new integration tests treat network issues differently (Cloudflare fails the test while GoogleSpecial skips on error); consider making their behavior consistent so that transient network problems don’t cause flaky test runs.
  • Both new tests convert the downloaded bytes to string and back into a reader; you can avoid this extra allocation by passing bytes.NewReader(data) directly to the parser.
  • Several bot configs now share very similar rdns domain lists (e.g., google-related domains for Chrome-Lighthouse and GoogleStoreBot); if the config system allows it, consider centralizing or reusing these to reduce duplication and drift over time.

## Individual Comments

### Comment 1
<location> `bots/conf.d/meta-externalagent.yaml:6-10` </location>
<code_context>
 rdns: true
 domains:
   - "openai.com"
</code_context>

<issue_to_address>
**question (bug_risk):** Replacing parser/url-based validation with rdns+domain-only checks may broaden what gets classified as Meta-ExternalAgent.

The previous implementation relied on a canonical JSON definition via the `google` parser, which tightly constrained the IP ranges considered valid. Switching to `rdns: true` with these domains removes that single source of truth and may classify a wider set of IPs as Meta-ExternalAgent if those domains are reused elsewhere. Please confirm that this broader rdns-based behavior is intended, or consider retaining the JSON-based definition (e.g., as a fallback) to keep the scope constrained.
</issue_to_address>
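
One way to picture the fallback suggested above: prefer a published IP range list when one is configured, and use the RDNS result only when no list exists. The sketch below is hypothetical (`verify`, `ipInPrefixes`, and `mustPrefixes` are illustrative names, not part of the project) and the RDNS check is passed in as a callback.

```go
package main

import "net/netip"

// mustPrefixes parses CIDR strings and panics on malformed input. It is a
// convenience for examples and tests only.
func mustPrefixes(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

// ipInPrefixes reports whether ip falls inside any of the given prefixes.
func ipInPrefixes(ip netip.Addr, prefixes []netip.Prefix) bool {
	ip = ip.Unmap() // treat IPv4-mapped IPv6 addresses as plain IPv4
	for _, p := range prefixes {
		if p.Contains(ip) {
			return true
		}
	}
	return false
}

// verify is a hypothetical policy: when an IP range list is configured it is
// the only source of truth; otherwise the RDNS callback decides.
func verify(ipStr string, prefixes []netip.Prefix, rdnsCheck func(ip string) bool) bool {
	ip, err := netip.ParseAddr(ipStr)
	if err != nil {
		return false
	}
	if len(prefixes) > 0 {
		return ipInPrefixes(ip, prefixes)
	}
	return rdnsCheck != nil && rdnsCheck(ipStr)
}
```

With this policy a hostname match cannot widen the accepted set once a range list exists, which is the constraint the review comment is asking about.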

codecov bot commented Jan 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.85%. Comparing base (60ce193) to head (1a88eaf).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #12      +/-   ##
==========================================
- Coverage   72.76%   71.85%   -0.92%     
==========================================
  Files          15       17       +2     
  Lines         661      732      +71     
==========================================
+ Hits          481      526      +45     
- Misses        136      152      +16     
- Partials       44       54      +10     

github-actions bot commented Jan 11, 2026

Benchmark Results

BenchmarkFindBotByUA_Hit_First             	 1088859	      1371 ns/op	      11 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_First-4           	 3042296	       595.9 ns/op	      11 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_First-8           	 3862365	       679.5 ns/op	      12 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Middle            	 1000000	      1283 ns/op	      17 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Middle-4          	 3722469	       357.7 ns/op	      14 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Middle-8          	 2649879	       406.9 ns/op	       9 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Last              	 1000000	      1204 ns/op	      39 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Last-4            	 3856455	       550.1 ns/op	      15 B/op	       0 allocs/op
BenchmarkFindBotByUA_Hit_Last-8            	 3279913	       460.6 ns/op	      10 B/op	       0 allocs/op
BenchmarkFindBotByUA_Miss                  	  520724	      2588 ns/op	      51 B/op	       0 allocs/op
BenchmarkFindBotByUA_Miss-4                	 1341723	       885.4 ns/op	      24 B/op	       0 allocs/op
BenchmarkFindBotByUA_Miss-8                	 1336435	       891.4 ns/op	      30 B/op	       0 allocs/op
BenchmarkFindBotByUA_CaseSensitive         	 1676103	      1060 ns/op	      28 B/op	       0 allocs/op
BenchmarkFindBotByUA_CaseSensitive-4       	 4232935	       384.8 ns/op	       9 B/op	       0 allocs/op
BenchmarkFindBotByUA_CaseSensitive-8       	 4064095	       476.9 ns/op	      12 B/op	       0 allocs/op
BenchmarkValidate_KnownBot_IPHit           	 1000000	      1483 ns/op	      36 B/op	       0 allocs/op
BenchmarkValidate_KnownBot_IPHit-4         	 4355862	       358.2 ns/op	       8 B/op	       0 allocs/op
BenchmarkValidate_KnownBot_IPHit-8         	 3194966	       382.5 ns/op	       6 B/op	       0 allocs/op
BenchmarkValidate_Browser                  	  236318	      5611 ns/op	     164 B/op	       1 allocs/op
BenchmarkValidate_Browser-4                	  627374	      2040 ns/op	      53 B/op	       0 allocs/op
BenchmarkValidate_Browser-8                	  504490	      2066 ns/op	      69 B/op	       0 allocs/op
BenchmarkContainsWord                      	73282026	        19.11 ns/op	       0 B/op	       0 allocs/op
BenchmarkContainsWord-4                    	73101067	        16.42 ns/op	       0 B/op	       0 allocs/op
BenchmarkContainsWord-8                    	73885605	        16.45 ns/op	       0 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA                	 1000000	      1174 ns/op	      14 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA-4              	 2781686	       440.0 ns/op	       7 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA-8              	 2542948	       450.7 ns/op	      12 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA_IPMismatch     	  826449	      1774 ns/op	      62 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA_IPMismatch-4   	 2155012	       571.0 ns/op	      19 B/op	       0 allocs/op
BenchmarkValidate_WithBotUA_IPMismatch-8   	 2174548	       565.5 ns/op	      19 B/op	       0 allocs/op
BenchmarkValidate_BrowserUA                	  289954	      4397 ns/op	      91 B/op	       0 allocs/op
BenchmarkValidate_BrowserUA-4              	  837538	      1536 ns/op	      48 B/op	       0 allocs/op
BenchmarkValidate_BrowserUA-8              	  852486	      1563 ns/op	      34 B/op	       0 allocs/op
BenchmarkValidate_UnknownBotUA             	 8180988	       173.2 ns/op	       5 B/op	       0 allocs/op
BenchmarkValidate_UnknownBotUA-4           	20317515	        57.00 ns/op	       2 B/op	       0 allocs/op
BenchmarkValidate_UnknownBotUA-8           	20970991	        62.46 ns/op	       2 B/op	       0 allocs/op
BenchmarkContainsIP                        	51838513	        35.69 ns/op	       1 B/op	       0 allocs/op
BenchmarkContainsIP-4                      	97986963	        12.01 ns/op	       0 B/op	       0 allocs/op
BenchmarkContainsIP-8                      	100000000	        14.76 ns/op	       0 B/op	       0 allocs/op
BenchmarkFindBotByUA                       	  803632	      1880 ns/op	      60 B/op	       0 allocs/op
BenchmarkFindBotByUA-4                     	 1973498	       581.1 ns/op	      15 B/op	       0 allocs/op
BenchmarkFindBotByUA-8                     	 2079121	       593.8 ns/op	      18 B/op	       0 allocs/op
BenchmarkClassifyUA                        	 2037619	       591.2 ns/op	       7 B/op	       0 allocs/op
BenchmarkClassifyUA-4                      	 4534287	       251.0 ns/op	       0 B/op	       0 allocs/op
BenchmarkClassifyUA-8                      	 4847118	       248.5 ns/op	       0 B/op	       0 allocs/op
Benchmark_MixedTraffic                     	  476725	      2570 ns/op	      15 B/op	       0 allocs/op
Benchmark_MixedTraffic-4                   	 1290186	       917.4 ns/op	      22 B/op	       0 allocs/op
Benchmark_MixedTraffic-8                   	--- FAIL: Benchmark_MixedTraffic-8
BenchmarkReload                            	     841	   1516048 ns/op	  692339 B/op	    6696 allocs/op
BenchmarkReload-4                          	     940	   1254507 ns/op	  690223 B/op	    6637 allocs/op
BenchmarkReload-8                          	     914	   1216603 ns/op	  688656 B/op	    6689 allocs/op
PASS
ok  	github.com/cnlangzi/knownbots	89.059s

- Removed domains that are covered by shorter suffixes:
  - rate-limited-proxy.google.com (covered by google.com)
  - fwd.linkedin.com (covered by linkedin.com)
  - chat.openai.com (covered by openai.com)
  - bl.bot.semrush.com (covered by semrush.com)
  - crawl.sogou.com (covered by sogou.com)
- Use googlebot.json for IP range verification (faster than RDNS)
- Update parser to google style
- Add official documentation reference
- Use t.Fatalf for network errors in both Cloudflare and GoogleSpecial tests
- Use bytes.NewReader instead of strings.NewReader to avoid extra allocation
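
The "covered by shorter suffixes" cleanup above can be expressed as a small routine. This is only a sketch of the idea, with `pruneCoveredDomains` as a hypothetical name; it assumes suffix matching happens on a label boundary, so `googlebot.com` is not treated as covered by `google.com`.

```go
package main

import (
	"sort"
	"strings"
)

// pruneCoveredDomains drops every domain that is a subdomain of another
// entry in the list, since the shorter suffix already matches it.
func pruneCoveredDomains(domains []string) []string {
	sorted := append([]string(nil), domains...)
	// Shorter names first, so a parent domain is always seen before its children.
	sort.SliceStable(sorted, func(i, j int) bool { return len(sorted[i]) < len(sorted[j]) })

	var kept []string
	for _, d := range sorted {
		covered := false
		for _, k := range kept {
			if d == k || strings.HasSuffix(d, "."+k) {
				covered = true
				break
			}
		}
		if !covered {
			kept = append(kept, d)
		}
	}
	return kept
}
```

Note the trade-off: keeping only the shorter suffix simplifies the config but accepts any hostname under that domain, not just the crawler-specific subdomain.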
@cnlangzi cnlangzi merged commit 5810d8b into main Jan 11, 2026
4 of 5 checks passed
@cnlangzi cnlangzi deleted the fix/bots branch January 11, 2026 10:32