Skip to content

Resolve redirects when testing URLs for equality #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

hseg
Copy link

@hseg hseg commented Apr 15, 2025

The URL DOIs resolve to can move around, with redirects pointing to the new location. To make the tests more robust, only fail if the URLs differ after redirections.

See also https://www.crossref.org/blog/urls-and-dois-a-complicated-relationship/ and item 10 in https://pardalotus.tech/posts/2024-10-02-falsehoods-programmers-believe-about-dois/

@hseg
Copy link
Author

hseg commented Apr 15, 2025

I noticed this since the test case with https://aip.scitation.org/doi/10.1063/1.5081715 started failing

@hseg hseg force-pushed the canonicalize_urls branch from 9c3e06c to b301736 Compare April 15, 2025 18:52
@hseg
Copy link
Author

hseg commented Apr 15, 2025

Short-circuited the normalized equality so as to cut down on the number of attempts at resolving redirects -- otherwise, have been running into 403 errors that seem to arise from rate-limiting (since they go away with more attempts)

@hseg hseg force-pushed the canonicalize_urls branch from b301736 to c51f6be Compare April 15, 2025 18:56
@alejandrogallo
Copy link
Member

alejandrogallo commented Apr 16, 2025 via email

@hseg
Copy link
Author

hseg commented Apr 16, 2025 via email

@hseg hseg force-pushed the canonicalize_urls branch 2 times, most recently from 289b398 to c67beee Compare April 17, 2025 17:13
@hseg
Copy link
Author

hseg commented Apr 17, 2025

OK, so I've addressed most of my concerns -- I've documented the problem with unconditionally upgrading to https, have made the test case warn when it needs to fall back to resolving redirects, and removed the HTTPException (further testing revealed that the information it provided is accessible by invoking pytest with --showlocals).
However, further testing does indeed show that the test remains flaky, with scitation abusing HTTP 403 when they want to respond with 429. I'd recommend adding pytest-rerunfailures or similar to the dev requirements and marking this particular test as flaky and needing to be rerun on HTTPException.

@hseg
Copy link
Author

hseg commented Apr 17, 2025

Added pytest-rerunfailures to address the flakiness, I have no more notes.

@hseg
Copy link
Author

hseg commented Apr 27, 2025

No good, the test still fails with 403. Can't think of what else they're probing that we need to send -- curl -A Mozilla/5.0 works fine at the https endpoint.

@hseg hseg mentioned this pull request Jun 23, 2025
hseg added 2 commits June 23, 2025 21:59
The URL DOIs resolve to can move around, with redirects pointing to the
new location. To make the tests more robust, only fail if the URLs
differ after redirections.

See also
https://www.crossref.org/blog/urls-and-dois-a-complicated-relationship/
@hseg hseg force-pushed the canonicalize_urls branch from c2929c9 to 2e4b622 Compare June 24, 2025 16:56
@hseg
Copy link
Author

hseg commented Jun 24, 2025

OK, I've rebased this on top of #2 (including alexfikl#1 on the assumption @alexfikl will merge it), so now both can be merged with no issue.
I've dropped the pytest-rerunfailures commit, since ultimately what seems to be the correct advice is to just update the URLs when they change, as done in #2.

@hseg
Copy link
Author

hseg commented Jun 24, 2025

Hm. I'm not fully satisfied by my implementation -- I would like to figure out how come I'm managing to get the page to load in the browser but urllib returns 403. I have discovered that urllib is properly redirecting the URL, it's just that the final URL returns 403 (presumably because Cloudflare is detecting the scraper).

@alexfikl
Copy link

I have discovered that urllib is properly redirecting the URL, it's just that the final URL returns 403 (presumably because Cloudflare is detecting the scraper).

Yeah, Cloudflare anti-scraper stuff is making a lot of the downloaders in Papis also fail (even though they would work properly if fed the actual webpage). I'm not sure there's much we can do there.. from what I recall the "fix" for that is to use something like playwright that can fake being a known browser.

@hseg
Copy link
Author

hseg commented Jun 24, 2025 via email

@hseg
Copy link
Author

hseg commented Jun 29, 2025

OK, cloudscraper indeed seems to work:

import cloudscraper
scraper = cloudscraper.create_scraper()
print(scraper.get('http://aip.scitation.org/doi/10.1063/1.5081715').url)
# https://pubs.aip.org/aip/jcp/article-abstract/150/7/074102/197572/Exact-two-component-equation-of-motion-coupled?redirectedFrom=fulltext

@hseg hseg force-pushed the canonicalize_urls branch 2 times, most recently from 7cfe04a to b161e62 Compare June 29, 2025 14:40
@hseg
Copy link
Author

hseg commented Jun 29, 2025

OK, added a [challenges] extra to bypass websites using CloudFlare (and friends, in the future) to protect against DDoS. Also added a test to check my redirect resolution code works, as @alexfikl asked for in #2 (comment), though ATM it only has testcases for cloudscraper-dependent sites. Still, I'm happy with the code as it stands now, and would like to have it merged.
@alejandrogallo, can we get this merged?

Also put in a fallback using requests, but it is hacky and only works
sometimes. cloudscraper stands a better chance of consistently being
able to get to the final URL
This eg makes it easier to spot which particular iteration breaks
@hseg hseg force-pushed the canonicalize_urls branch from b161e62 to 8e5f3c9 Compare June 29, 2025 14:46
@hseg
Copy link
Author

hseg commented Jun 29, 2025

Note this still isn't perfect -- my testing must've gotten my on scitation's warnlist, because I'm starting to get redirected to their captcha page.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants