Resolve redirects when testing URLs for equality #3

hseg · 2025-04-15T18:16:56Z

The URLs that DOIs resolve to can move around, with redirects pointing to the new location. To make the tests more robust, only fail if the URLs still differ after following redirects.

See also https://www.crossref.org/blog/urls-and-dois-a-complicated-relationship/ and item 10 in https://pardalotus.tech/posts/2024-10-02-falsehoods-programmers-believe-about-dois/

hseg · 2025-04-15T18:17:57Z

I noticed this because the test case with https://aip.scitation.org/doi/10.1063/1.5081715 started failing.

hseg · 2025-04-15T18:55:40Z

Short-circuited on normalized equality, so redirects are only resolved when the normalized URLs differ. This cuts down on the number of requests -- otherwise I keep running into 403 errors that seem to arise from rate-limiting (they go away on later attempts).
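For readers following along, the short-circuit described above could look roughly like the sketch below. The names `normalize` and `urls_match` and the injected `resolve` callable are hypothetical, chosen for illustration; they are not taken from this PR's diff, and the normalization shown is deliberately minimal.

```python
from typing import Callable
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase the host and drop a trailing slash (illustrative only)."""
    parts = urlsplit(url)
    return parts._replace(netloc=parts.netloc.lower()).geturl().rstrip("/")


def urls_match(expected: str, actual: str, resolve: Callable[[str], str]) -> bool:
    """Compare cheaply first; only resolve redirects if the cheap check fails."""
    if normalize(expected) == normalize(actual):
        return True  # short-circuit: no network requests are sent
    # Fall back to comparing the URLs after following redirects.
    return normalize(resolve(expected)) == normalize(resolve(actual))
```

Passing the resolver in as an argument keeps the comparison testable offline, since a test can supply a stub instead of a function that hits the network.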

alejandrogallo · 2025-04-16T06:26:59Z

Is this good to go?

…

-- Alejandro Gallo

hseg · 2025-04-16T07:24:00Z

There are three points I'm unhappy/uncertain about that I'd like feedback on:
- this just makes the tests _more_ robust, but I'm unsure it entirely avoids the intermittent 403s I was getting during testing. Setting a User-agent and forcing https looks like it should work, from my testing
- it might be undesirable to unconditionally upgrade to https
- printing the HTTPException when it's caught might be unnecessary - there might be a pytest option for that. I was doing it to try to get more detail on the exception.

Other than that, I'm happy with this.

…

hseg · 2025-04-17T17:18:43Z

OK, so I've addressed most of my concerns -- I've documented the problem with unconditionally upgrading to https, have made the test case warn when it needs to fall back to resolving redirects, and removed the HTTPException (further testing revealed that the information it provided is accessible by invoking pytest with --showlocals).
However, further testing does indeed show that the test remains flaky, with scitation abusing HTTP 403 when they want to respond with 429. I'd recommend adding pytest-rerunfailures or similar to the dev requirements and marking this particular test as flaky and needing to be rerun on HTTPException.

hseg · 2025-04-17T17:23:49Z

Added pytest-rerunfailures to address the flakiness. I have no more notes.

hseg · 2025-04-27T17:18:35Z

No good, the test still fails with 403. Can't think of what else they're probing that we need to send -- curl -A Mozilla/5.0 works fine at the https endpoint.

hseg · 2025-06-24T17:00:49Z

OK, I've rebased this on top of #2 (including alexfikl#1 on the assumption @alexfikl will merge it), so now both can be merged with no issue.
I've dropped the pytest-rerunfailures commit, since ultimately what seems to be the correct advice is to just update the URLs when they change, as done in #2.

hseg · 2025-06-24T17:28:19Z

Hm. I'm not fully satisfied by my implementation -- I would like to figure out how come I'm managing to get the page to load in the browser but urllib returns 403. I have discovered that urllib is properly redirecting the URL, it's just that the final URL returns 403 (presumably because Cloudflare is detecting the scraper).
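A minimal sketch of the urllib approach discussed in this thread (browser-like User-Agent, upgrade to https, follow redirects) is below. The helper names `build_request` and `resolve` are hypothetical, and the User-Agent value is assumed to match the `curl -A Mozilla/5.0` test mentioned earlier; the PR's actual code may differ.

```python
import urllib.request
from urllib.parse import urlsplit

USER_AGENT = "Mozilla/5.0"  # assumption: same value as the curl test in this thread


def build_request(url: str) -> urllib.request.Request:
    """Upgrade http to https and attach a browser-like User-Agent header."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urllib.request.Request(parts.geturl(), headers={"User-Agent": USER_AGENT})


def resolve(url: str, timeout: float = 10.0) -> str:
    """Follow redirects and return the final URL.

    urlopen follows redirects by default; an error status on the final
    page (such as the 403 seen here) is raised as urllib.error.HTTPError.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.url
```

Note that the redirect chain itself can succeed while the last hop still raises `HTTPError` with `code == 403`, which matches the behaviour described above.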

alexfikl · 2025-06-24T19:00:02Z

> I have discovered that urllib is properly redirecting the URL, it's just that the final URL returns 403 (presumably because Cloudflare is detecting the scraper).

Yeah, Cloudflare anti-scraper stuff is making a lot of the downloaders in Papis also fail (even though they would work properly if fed the actual webpage). I'm not sure there's much we can do there. From what I recall, the "fix" for that is to use something like playwright that can fake being a known browser.

hseg · 2025-06-24T19:08:40Z

Alternatively, we could take a page out of fanficfare's book and use cloudscraper.

hseg · 2025-06-29T12:58:59Z

OK, cloudscraper indeed seems to work:

```python
import cloudscraper

# create_scraper() returns a requests-style session
scraper = cloudscraper.create_scraper()
# .url on the response is the final URL after all redirects
print(scraper.get('http://aip.scitation.org/doi/10.1063/1.5081715').url)
# https://pubs.aip.org/aip/jcp/article-abstract/150/7/074102/197572/Exact-two-component-equation-of-motion-coupled?redirectedFrom=fulltext
```

hseg · 2025-06-29T14:42:04Z

OK, added a [challenges] extra to bypass websites using CloudFlare (and friends, in the future) to protect against DDoS. Also added a test to check my redirect resolution code works, as @alexfikl asked for in #2 (comment), though ATM it only has testcases for cloudscraper-dependent sites. Still, I'm happy with the code as it stands now, and would like to have it merged.
@alejandrogallo, can we get this merged?

Also put in a fallback using requests, but it is hacky and only works sometimes. cloudscraper stands a better chance of consistently being able to get to the final URL
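One way an optional dependency with a fallback chain might be wired up is sketched below. This is an illustration under assumptions, not the code in this PR: `pick_backend` and `resolve_final_url` are hypothetical names, and the final fallback to plain urllib is an addition for the sake of a self-contained example.

```python
import importlib
import importlib.util
import urllib.request


def pick_backend() -> str:
    """Return the name of the best HTTP backend that is installed."""
    for name in ("cloudscraper", "requests"):
        if importlib.util.find_spec(name) is not None:
            return name
    return "urllib"  # assumption: stdlib as a last resort


def resolve_final_url(url: str, timeout: float = 10.0) -> str:
    """Follow redirects with the chosen backend and return the final URL."""
    backend = pick_backend()
    if backend == "cloudscraper":
        # Only available when the optional extra is installed.
        session = importlib.import_module("cloudscraper").create_scraper()
    elif backend == "requests":
        session = importlib.import_module("requests").Session()
    else:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.url
    return session.get(url, timeout=timeout).url
```

Checking for the module with `find_spec` keeps the import lazy, so installs without the extra never pay for, or fail on, the optional dependency.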

hseg · 2025-06-29T17:41:55Z

Note this still isn't perfect -- my testing must've gotten me on scitation's warnlist, because I'm starting to get redirected to their captcha page.

makefile: simplify targets

f27f38d

hseg force-pushed the canonicalize_urls branch from 9c3e06c to b301736 Compare April 15, 2025 18:52

hseg force-pushed the canonicalize_urls branch from b301736 to c51f6be Compare April 15, 2025 18:56

hseg force-pushed the canonicalize_urls branch 2 times, most recently from 289b398 to c67beee Compare April 17, 2025 17:13

alexfikl added 5 commits April 17, 2025 20:56

setup: switch to pyproject and hatchling

3a03a4a

ci: remove travis.yaml

758318b

ci: add tests to ci

507acb0

style: fix flake8 issues

a64941e

tests: fix tests

844fdfd

hseg mentioned this pull request Jun 23, 2025

Modernize #2

Open

hseg added 2 commits June 23, 2025 21:59

tests: Configure pytest to ignore docs

34ea67c

hseg force-pushed the canonicalize_urls branch from c2929c9 to 2e4b622 Compare June 24, 2025 16:56

hseg force-pushed the canonicalize_urls branch 2 times, most recently from 7cfe04a to b161e62 Compare June 29, 2025 14:40

Use cloudscraper to solve cloudflare challenges

ca6dcf0

Also put in a fallback using requests, but it is hacky and only works sometimes. cloudscraper stands a better chance of consistently being able to get to the final URL

Parametrize tests

8e5f3c9

This makes it easier, e.g., to spot which particular iteration breaks.

hseg force-pushed the canonicalize_urls branch from b161e62 to 8e5f3c9 Compare June 29, 2025 14:46

Make test_redirect cases prettier

ab9d72a
