Resolve redirects when testing URLs for equality #3
I noticed this because the test case for https://aip.scitation.org/doi/10.1063/1.5081715 started failing.
I short-circuit on normalized equality to cut down on the number of attempts at resolving redirects -- otherwise I have been running into 403 errors that seem to arise from rate-limiting (they go away with more attempts).
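A minimal sketch of the short-circuit idea, for illustration only: the names `normalize` and `same_destination` and the exact normalization rules are assumptions, not what the PR implements. The redirect resolver is passed in as a callable, so the network is only touched when the cheap comparison fails.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Hypothetical normalization: lowercase scheme and host, drop trailing slash and fragment."""
    parts = urlsplit(url.strip())
    return (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), parts.query)


def same_destination(a, b, resolve):
    """Compare two URLs, resolving redirects only when the normalized forms differ.

    `resolve` is any callable mapping a URL to its final URL after redirects.
    """
    if normalize(a) == normalize(b):
        return True  # short-circuit: no HTTP requests at all
    return normalize(resolve(a)) == normalize(resolve(b))
```

Since most test cases should match on the first comparison, this keeps the number of requests (and so the chance of tripping a rate limit) low.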
Is this good to go?
--
Alejandro Gallo
There are three points I'm unhappy/uncertain about that I'd like feedback on:
- this just makes the tests _more_ robust, but I'm unsure it entirely avoids the intermittent 403s I was getting during testing. Setting a User-Agent and forcing https look like they should work, from my testing
- it might be undesirable to unconditionally upgrade to https
- printing the HTTPException when it's caught might be unnecessary - there might be a pytest option for that. I was doing it to try to get more detail on the exception.
Other than that, I'm happy with this.
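For reference, roughly what the User-Agent and https points amount to, as a sketch using only the standard library. The `USER_AGENT` string and the helper names `force_https` and `build_request` are placeholders made up for this example, not the values used in the PR.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit

# Placeholder value (an assumption); the PR may send a different User-Agent.
USER_AGENT = "Mozilla/5.0 (compatible; doi-tests)"


def force_https(url):
    """Upgrade http:// to https:// unconditionally; other schemes pass through unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def build_request(url):
    """Build (but do not send) a request with an explicit User-Agent header."""
    return urllib.request.Request(force_https(url), headers={"User-Agent": USER_AGENT})
```

The request would then be sent with `urllib.request.urlopen(build_request(url))`. Keeping the upgrade in its own function makes it easy to turn off if upgrading unconditionally turns out to be undesirable.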
On 16 Apr 2025 at 09:27:22, Alejandro Gallo wrote:
Is this good to go?
--
Alejandro Gallo
289b398 to c67beee
OK, so I've addressed most of my concerns -- I've documented the problem with unconditionally upgrading to https, have made the test case warn when it needs to fall back to resolving redirects, and removed the |
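The warn-on-fallback behaviour could look something like the sketch below. `check_url` and its `resolve` parameter are hypothetical names for this example; the actual test helper in the PR may be structured differently.

```python
import warnings


def check_url(expected, actual, resolve):
    """Assert two URLs match, warning when redirects had to be resolved to decide.

    `resolve` is any callable mapping a URL to its final URL after redirects.
    """
    if expected == actual:
        return
    warnings.warn(
        f"{actual!r} != {expected!r}; falling back to resolving redirects",
        stacklevel=2,
    )
    assert resolve(expected) == resolve(actual)
```

Under pytest the warning shows up in the warnings summary, so a test case whose recorded URL has gone stale stays visible without failing the run.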
Added |
No good, the test still fails with 403. Can't think of what else they're probing that we need to send -- |
The URL DOIs resolve to can move around, with redirects pointing to the new location. To make the tests more robust, only fail if the URLs differ after redirections. See also https://www.crossref.org/blog/urls-and-dois-a-complicated-relationship/
OK, I've rebased this on top of #2 (including alexfikl#1, on the assumption that @alexfikl will merge it), so now both can be merged without issue.
Hm. I'm not fully satisfied by my implementation -- I would like to figure out how come I'm managing to get the page to load in the browser but |
Yeah, Cloudflare anti-scraper stuff is making a lot of the downloaders in Papis also fail (even though they would work properly if fed the actual webpage). I'm not sure there's much we can do there. From what I recall, the "fix" for that is to use something like playwright, which can fake being a known browser.
On 24 Jun 2025 at 22:00:24, Alex Fikl wrote:
I have discovered that urllib _is_ properly redirecting the URL; it's just that the final URL returns 403 (presumably because Cloudflare is detecting the scraper).
Yeah, Cloudflare anti-scraper stuff is making a lot of the downloaders in Papis also fail (even though they would work properly if fed the actual webpage). I'm not sure there's much we can do there. From what I recall, the "fix" for that is to use something like playwright (https://playwright.dev/), which can fake being a known browser.
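If urllib has already followed the redirects by the time the 403 comes back, the final URL may still be recoverable from the exception itself. A sketch, assuming the `HTTPError` raised after redirection carries the URL of the last request in its documented `url` attribute; `final_url` and its `opener` parameter are names made up for this example.

```python
import urllib.error
import urllib.request


def final_url(url, opener=urllib.request.urlopen):
    """Return (final_url, status), even when the last hop answers with an HTTP error."""
    try:
        with opener(url) as response:
            return response.geturl(), response.status
    except urllib.error.HTTPError as err:
        # Assumption: redirects were already followed, so err.url is the URL
        # that answered with the error status, not the one we started from.
        return err.url, err.code
```

For the purpose of comparing destinations, a 403 from the final host then still tells us where the DOI ended up, even though the page body is unavailable.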
Alternatively, we could take a page out of fanficfare's book and use cloudscraper.
OK,
import cloudscraper

scraper = cloudscraper.create_scraper()
print(scraper.get('http://aip.scitation.org/doi/10.1063/1.5081715').url)
# https://pubs.aip.org/aip/jcp/article-abstract/150/7/074102/197572/Exact-two-component-equation-of-motion-coupled?redirectedFrom=fulltext
7cfe04a to b161e62
OK, added a |
Also put in a fallback using requests, but it is hacky and only works sometimes. cloudscraper stands a better chance of consistently being able to get to the final URL
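The cloudscraper-first, requests-second ordering can be sketched as below. This is an illustration under assumptions, not the code in this PR: `make_sessions` and `resolve` are hypothetical names, and both libraries are treated as optional imports.

```python
def make_sessions():
    """Collect whichever HTTP clients are installed, most capable first."""
    sessions = []
    try:
        import cloudscraper
        sessions.append(cloudscraper.create_scraper())
    except ImportError:
        pass
    try:
        import requests
        sessions.append(requests.Session())
    except ImportError:
        pass
    return sessions


def resolve(url, sessions, timeout=10):
    """Return the final URL from the first session that gets a successful response, else None."""
    for session in sessions:
        try:
            response = session.get(url, allow_redirects=True, timeout=timeout)
        except Exception:
            # Broad catch keeps the sketch independent of either library's exception types.
            continue
        if response.ok:
            return response.url
    return None
```

Usage would be `resolve(url, make_sessions())`; a `None` result means every client failed, which the caller can treat as "could not verify" rather than as a mismatch.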
This, e.g., makes it easier to spot which particular iteration breaks.
Note this still isn't perfect -- my testing must've gotten me on scitation's warnlist, because I'm starting to get redirected to their captcha page.
The URL DOIs resolve to can move around, with redirects pointing to the new location. To make the tests more robust, only fail if the URLs differ after redirections.
See also https://www.crossref.org/blog/urls-and-dois-a-complicated-relationship/ and item 10 in https://pardalotus.tech/posts/2024-10-02-falsehoods-programmers-believe-about-dois/