Contributor

@MQ37 MQ37 commented Nov 25, 2025

closes #80

⚠️ MERGE ONLY AFTER #82 IS MERGED ⚠️

Fixes scraping of the requested number of Google results. Google now ignores the ?num parameter, so instead of relying on it, this PR uses the ?start parameter for offset pagination to collect the URLs of the Google result pages. The implementation is based on https://github.com/apify-store/google-search
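
For readers unfamiliar with offset pagination, here is a minimal sketch of the idea, not the code from this PR. The function name `buildGoogleSearchUrls`, the constant `RESULTS_PER_PAGE`, and the assumption of roughly 10 results per page are all hypothetical and used only for illustration.

```typescript
// Hypothetical sketch: the names and the page size are assumptions, not the PR's code.
// Assumed page size: roughly 10 organic results per Google result page.
const RESULTS_PER_PAGE = 10;

/**
 * Builds one search URL per result page needed to cover `maxResults`,
 * using the `start` offset parameter instead of the ignored `num` parameter.
 */
function buildGoogleSearchUrls(query: string, maxResults: number): string[] {
    // Always request at least the first page.
    const pageCount = Math.max(1, Math.ceil(maxResults / RESULTS_PER_PAGE));
    const base = `https://www.google.com/search?q=${encodeURIComponent(query)}`;
    const urls: string[] = [];
    for (let page = 0; page < pageCount; page++) {
        const offset = page * RESULTS_PER_PAGE;
        // The first page needs no offset; later pages start at 10, 20, ...
        urls.push(offset === 0 ? base : `${base}&start=${offset}`);
    }
    return urls;
}

// Usage: 25 requested results span three pages (offsets 0, 10, 20).
console.log(buildGoogleSearchUrls('rag web browser', 25));
```

A fixed offset list like this does not check whether a later page actually exists, so the caller would still need to tolerate pages that return fewer results or none.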

@MQ37 MQ37 requested a review from jirispilka November 25, 2025 09:58
@github-actions github-actions bot added the t-ai Issues owned by the AI team. label Nov 25, 2025
@MQ37 MQ37 requested a review from matyascimbulka November 25, 2025 10:57
Collaborator

@jirispilka jirispilka left a comment

If possible, I would use the reference implementation; I think it is better to have one source of truth.

These are the LLM's complaints:
The reference implementation is more robust because it:

  • Validates pagination links exist in HTML
  • Tracks page numbers to avoid non-existent pages
  • Uses multiple fallback conditions
  • Handles Google's inconsistent pagination behavior

Collaborator

@matyascimbulka matyascimbulka left a comment

I like moving the validation of the results to the collection, and the move to fix pagination also looks good. But I have found some duplicate code and some opportunities to make the code cleaner.

MQ37 and others added 4 commits November 25, 2025 16:20
Co-authored-by: Matyáš Cimbulka <matyas.cimbulka@apify.com>
Co-authored-by: Matyáš Cimbulka <matyas.cimbulka@apify.com>
@MQ37 MQ37 requested a review from matyascimbulka November 25, 2025 15:45
Contributor Author

MQ37 commented Nov 25, 2025

I like moving the validation of the results to the collection, and the move to fix pagination also looks good. But I have found some duplicate code and some opportunities to make the code cleaner.

Thank you for the review, I think I addressed your comments 👍

Contributor Author

MQ37 commented Nov 25, 2025

If possible, I would use the reference implementation; I think it is better to have one source of truth.

These are the LLM's complaints: The reference implementation is more robust because it:

* Validates pagination links exist in HTML

* Tracks page numbers to avoid non-existent pages

* Uses multiple fallback conditions

* Handles Google's inconsistent pagination behavior

Regarding the "reference" implementation, Ondra mentioned that the page tracking logic is legacy and they just left it there, so our simpler implementation should be fine.

Collaborator

@matyascimbulka matyascimbulka left a comment

Looks better, but I still found some problems that would most likely cause issues down the line. It would be good to see some test runs of the entire PR stack.

Collaborator

@matyascimbulka matyascimbulka left a comment

I just realized that my comments about it failing are wrong. Still, it would be nice to see some test runs.

@jirispilka
Collaborator

Regarding the "reference" implementation Ondra mentioned that the page tracking logic is legacy and they just left it there, so our simpler implementation should be fine.

OK, thanks for the clarification. IMHO it is always better to ask than to miss some edge cases.

@jirispilka
Collaborator

@matyascimbulka thanks for the comments. I appreciate it. I'll leave this to you two :)

…d, fix wrong imports, firefox patches and policies, NY TZ (#82)

* feat: update libs for limited perms, fix and speed up Dockerfile build, fix wrong imports

* magic number to const

* lint

* fix lint, update crawlee

* add firefox policies, patches, set NY TZ

* add ghostery blocker

* Squashed commit of the following:

commit 7aadecd
Author: MQ37 <themq37@gmail.com>
Date:   Wed Nov 26 11:14:09 2025 +0100

    add var for better readability

commit fd02b0a
Author: MQ37 <themq37@gmail.com>
Date:   Wed Nov 26 11:09:57 2025 +0100

    make code more typescripty
@MQ37 MQ37 changed the title fix: number of google pages results fix: number of google pages results, limited perms, improvements from WCC Nov 26, 2025
Contributor Author

MQ37 commented Nov 26, 2025

I just realized that my comments about it failing are wrong. Still, it would be nice to see some test runs.

Here are a few of my test runs:

* basic run Cheerio - https://console.apify.com/view/runs/Z50RSxXrERLUGulIC

* basic run Playwright - https://console.apify.com/view/runs/EuFSPFMcLoQghzRMN

* run to test the google serp pages crawling with obscure search term - https://console.apify.com/view/runs/FwSzRAAsyEhBE8RCJ

I also tested some edge cases and haven't encountered any issues with the Google SERP crawling. Feel free to test.

@matyascimbulka it is now broken because of the PR I merged here. I'm working on a fix, sorry.

Contributor Author

MQ37 commented Nov 26, 2025

@matyascimbulka fixed and should be ready now

Collaborator

@matyascimbulka matyascimbulka left a comment

Looks good, I just found a small nit.

Co-authored-by: Matyáš Cimbulka <matyas.cimbulka@apify.com>
@MQ37 MQ37 merged commit c851b23 into master Nov 26, 2025
2 checks passed
@MQ37 MQ37 deleted the fix-google-num-results branch November 26, 2025 12:55
Labels

t-ai Issues owned by the AI team.

Development

Successfully merging this pull request may close these issues.

Rag Web Browser only returns 10 results

4 participants