
Remove rules that redirect URLs made up by AI/bot crawlers #277

Merged
speth merged 1 commit into Cantera:main from speth:manage-crawlers
Oct 23, 2025

@speth speth commented Oct 22, 2025

The bandwidth usage from cantera.org has recently surged. Investigating the logs showed that the increased bandwidth was mainly due to a flood of requests for URLs such as:

  • /documentation/reactors/releasenotes/thermo/thermo/kinetics/python/constants.html
  • /documentation/python/kinetics/reactors/yaml/reactors/reactors/reactors/releasenotes/releasenotes/v3.1.html
  • /documentation/dev/doxygen/reference/cxx/thermo/yaml/kinetics/yaml/yaml2ck.html
  • /documentation/dev/doxygen/reference/releasenotes/examples/reactors/thermo/python/python/lxcat_conversion.html

The /documentation prefix was used in the old (pre-Cantera 3.1) website, while the rest of each URL seems to be composed of components of valid URLs on the Cantera website, arranged in a random order that corresponds to no page that has ever existed. All of these requests provide user agents claiming to be a real web browser rather than identifying themselves as bots, provide no referrer URL, and are distributed across thousands of IPs. I can only assume this is some AI or bot trying to scrape content for training.
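As a rough illustration of the pattern described above, here is a minimal Python sketch for flagging such paths in a request log. The function names are hypothetical, and the "repeated directory name" check is only a heuristic suggested by the sample URLs, not the method actually used to investigate the logs.

```python
from collections import Counter
from typing import List

# Prefix used by the old (pre-Cantera 3.1) website, per the description above.
LEGACY_PREFIX = "/documentation"


def has_legacy_prefix(path: str) -> bool:
    """True if the path is /documentation or lies underneath it."""
    return path == LEGACY_PREFIX or path.startswith(LEGACY_PREFIX + "/")


def repeated_segments(path: str) -> List[str]:
    """Return directory names that occur more than once in the path.

    The final component (the file name) is ignored. This is an assumed
    heuristic: the made-up URLs tend to reuse directory names.
    """
    directories = [part for part in path.split("/") if part][:-1]
    counts = Counter(directories)
    return sorted(name for name, n in counts.items() if n > 1)


def looks_fabricated(path: str) -> bool:
    """Flag paths with the legacy prefix and at least one repeated directory."""
    return has_legacy_prefix(path) and bool(repeated_segments(path))
```

All four sample URLs listed above are flagged by `looks_fabricated`, while a path without the legacy prefix is not.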

The high bandwidth usage was mainly due to the fact that we were redirecting any URL starting with /documentation to the root of the reference documentation, on the basis that this would be better than giving a 404 to an old deep link into the docs. However, by doing so, these crawlers then read not only this page but also the full set of resources (.css and .js files) needed to render it, consuming quite a bit of bandwidth.

By removing this redirect and returning 404 for these URLs, the bandwidth usage has dropped back down to more manageable levels. I'm hoping this botnet will give up at some point. I've also dropped another rule that could have a similar effect if a bot started to explore that space.
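The before/after behavior can be sketched as a routing decision for a path that matches no real page. This is only an illustration in Python, not the site's actual redirect configuration: `respond`, the `redirect_legacy` flag, the `/reference/` target, and the 302 status are all assumptions made for the example.

```python
from typing import Optional, Tuple

LEGACY_PREFIX = "/documentation"
# Hypothetical redirect target; the real rule's target is not shown in this PR text.
DOCS_ROOT = "/reference/"


def respond(path: str, redirect_legacy: bool) -> Tuple[int, Optional[str]]:
    """Return (status, Location header) for a path that matches no real page.

    redirect_legacy=True models the old catch-all rule;
    redirect_legacy=False models the behavior after this change.
    """
    is_legacy = path == LEGACY_PREFIX or path.startswith(LEGACY_PREFIX + "/")
    if is_legacy and redirect_legacy:
        # Old behavior: the client follows the redirect and then fetches the
        # docs root page plus its .css and .js resources.
        return 302, DOCS_ROOT
    # New behavior: a bare 404, with nothing further to fetch.
    return 404, None
```

With the old rule, `respond("/documentation/thermo/thermo/x.html", redirect_legacy=True)` yields a redirect to the docs root; with the rule removed, the same request yields `(404, None)`.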

@bryanwweber bryanwweber left a comment

Thanks Ray! I'd been wondering about those emails from Linode and I can confirm I haven't gotten one recently.

@speth speth merged commit 25cf8f9 into Cantera:main Oct 23, 2025
1 check passed
@speth speth deleted the manage-crawlers branch October 23, 2025 12:18