Remove rules that redirect URLs made up by AI/bot crawlers #277
Merged
speth merged 1 commit into Cantera:main on Oct 23, 2025
bryanwweber (Member) approved these changes on Oct 23, 2025 and left a comment:
Thanks Ray! I'd been wondering about those emails from Linode and I can confirm I haven't gotten one recently.
The bandwidth usage from cantera.org has recently surged. Investigating the logs showed that the increased bandwidth was mainly due to a flood of requests for URLs such as:
/documentation/reactors/releasenotes/thermo/thermo/kinetics/python/constants.html
/documentation/python/kinetics/reactors/yaml/reactors/reactors/reactors/releasenotes/releasenotes/v3.1.html
/documentation/dev/doxygen/reference/cxx/thermo/yaml/kinetics/yaml/yaml2ck.html
/documentation/dev/doxygen/reference/releasenotes/examples/reactors/thermo/python/python/lxcat_conversion.html

The /documentation prefix was used in the old (pre-Cantera 3.1) website, while the rest of each URL seems to be composed of components of valid URLs on the Cantera website, arranged in a random order that corresponds to no page that has ever existed. All of these requests provide user agents claiming to be a real web browser rather than identifying themselves as bots, provide no referrer URL, and are distributed across thousands of IPs. I can only assume this is some AI or bot trying to scrape content for training.

The high bandwidth usage was mainly due to the fact that we were redirecting any URL starting with /documentation to the root of the reference documentation, on the basis that this would be better than giving a 404 for an old deep link into the docs. However, as a result, these crawlers then fetched not only that page but also the full set of resources (.css and .js files) needed to render it, consuming quite a bit of bandwidth.

By removing this redirect and returning 404 for these URLs, the bandwidth usage has dropped back down to more manageable levels. I'm hoping this botnet will give up at some point. I've also dropped another rule that could have a similar effect if a bot started to explore that space.
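
For anyone wanting to repeat this kind of log check, here is a minimal Python sketch that tallies response bytes and distinct client IPs for requests under a given prefix. It is not the script used for this PR. It assumes the access log is in the common "combined" format, which may not match the server's actual configuration. The names tally_bandwidth, LOG_LINE and SAMPLE are hypothetical, and the sample lines (IPs, timestamps, sizes, and the /stable/index.html path) are invented for illustration only.

```python
import re
from collections import defaultdict

# Assumption: the access log uses the "combined" format. Adjust the pattern
# if the server is configured with a different log format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def tally_bandwidth(lines, prefix="/documentation"):
    """Return (bytes_by_status, distinct_ip_count) for requests under prefix."""
    bytes_by_status = defaultdict(int)
    ips = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or not match.group("path").startswith(prefix):
            continue  # skip unparseable lines and unrelated paths
        size = match.group("size")
        status = int(match.group("status"))
        bytes_by_status[status] += 0 if size == "-" else int(size)
        ips.add(match.group("ip"))
    return dict(bytes_by_status), len(ips)


# Invented sample lines for illustration; not taken from the real logs.
SAMPLE = [
    '203.0.113.5 - - [23/Oct/2025:10:00:00 +0000] "GET /documentation/reactors/'
    'releasenotes/thermo/thermo/kinetics/python/constants.html HTTP/1.1" '
    '301 178 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [23/Oct/2025:10:00:01 +0000] "GET /documentation/dev/'
    'doxygen/reference/cxx/thermo/yaml/kinetics/yaml/yaml2ck.html HTTP/1.1" '
    '404 153 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [23/Oct/2025:10:00:02 +0000] "GET /stable/index.html '
    'HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    print(tally_bandwidth(SAMPLE))
```

Note that the redirect responses themselves are small. As described above, the cost came from the follow-up requests for the target page and its .css and .js resources, so a fuller analysis would also need to count those requests from the same clients.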