fix: number of google pages results, limited perms, improvements from WCC #81
Conversation
jirispilka
left a comment
If possible I would use the reference implementation; I think it is better to have one source of truth.
These are the LLM's complaints:
The reference implementation is more robust because it:
- Validates pagination links exist in HTML
- Tracks page numbers to avoid non-existent pages
- Uses multiple fallback conditions
- Handles Google's inconsistent pagination behavior
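The pagination-link check mentioned above could be sketched roughly like this (an illustrative sketch only, not the actual reference implementation; `hasNextPageLink` is a hypothetical helper name): before enqueueing the next results page, verify the SERP HTML actually contains a pagination link for it, so the crawler never requests pages that don't exist.

```typescript
// Illustrative sketch: Google's pagination links carry the `start`
// offset of the target page, so a cheap existence check is to look
// for that offset in the rendered HTML before enqueueing the page.
// Note: a plain substring match is deliberately loose (e.g. `start=10`
// also matches `start=100`); a real implementation would parse links.
function hasNextPageLink(html: string, nextStart: number): boolean {
  return html.includes(`start=${nextStart}`);
}
```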
matyascimbulka
left a comment
I like moving the validation of the results to the collection, and the move to fix pagination also looks good. But I have found some duplicate code and some opportunities to make the code cleaner.
Co-authored-by: Matyáš Cimbulka <matyas.cimbulka@apify.com>
Thank you for the review, I think I addressed your comments 👍
Regarding the "reference" implementation, Ondra mentioned that the page tracking logic is legacy and they just left it there, so our simpler implementation should be fine.
matyascimbulka
left a comment
Looks better, but I still found some issues that would most likely cause problems down the line. It would be good to see some test runs of the entire PR stack.
matyascimbulka
left a comment
I just realized that my comments about it failing were wrong. Still, it would be nice to see some test runs.
Ok, thanks for the clarification. IMHO it is always better to ask than to miss some edge cases.
@matyascimbulka thanks for the comments, I appreciate it. I'll leave this to you two :)
Here are a few of my test runs:
Then I also tested some edge cases and haven't encountered any issues with the Google SERP crawling; feel free to test.
…d, fix wrong imports, firefox patches and policies, NY TZ (#82)

* feat: update libs for limited perms, fix and speed up Dockerfile build, fix wrong imports
* magic number to const
* lint
* fix lint, update crawlee
* add firefox policies, patches, set NY TZ
* add ghostery blocker
* Squashed commit of the following:

  commit 7aadecd
  Author: MQ37 <themq37@gmail.com>
  Date: Wed Nov 26 11:14:09 2025 +0100

  add var for better readability

  commit fd02b0a
  Author: MQ37 <themq37@gmail.com>
  Date: Wed Nov 26 11:09:57 2025 +0100

  make code more typescripty
@matyascimbulka it is now broken because of the PR I merged here; working on a fix, sorry.
@matyascimbulka fixed, it should be ready now.
matyascimbulka
left a comment
Looks good, I just found a small nit.
Co-authored-by: Matyáš Cimbulka <matyas.cimbulka@apify.com>
closes #80
Fixes the scraping of the requested number of Google result pages. Instead of using the ?num param, which Google now ignores, it uses ?start for offset pagination to build the Google result page URLs. Implementation based on https://github.com/apify-store/google-search
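The offset pagination described above can be sketched as follows (a minimal illustration of the approach, not the code from this PR; `buildSearchUrls` and `RESULTS_PER_PAGE` are hypothetical names):

```typescript
// Sketch of offset pagination: Google now ignores the `num` parameter,
// so each result page gets its own URL with a `start` offset instead.
const RESULTS_PER_PAGE = 10; // Google's default page size

function buildSearchUrls(query: string, maxResults: number): string[] {
  const urls: string[] = [];
  const pages = Math.ceil(maxResults / RESULTS_PER_PAGE);
  for (let page = 0; page < pages; page++) {
    const url = new URL('https://www.google.com/search');
    url.searchParams.set('q', query);
    if (page > 0) {
      // `start` is a zero-based result offset: 10 for page 2, 20 for page 3, ...
      url.searchParams.set('start', String(page * RESULTS_PER_PAGE));
    }
    urls.push(url.toString());
  }
  return urls;
}
```

For example, requesting 25 results yields three page URLs with offsets 0 (omitted), 10, and 20.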