switch to google custom search engine#31

Open
raylu wants to merge 1 commit into mat-1:master from raylu:google

Conversation

@raylu
Contributor

raylu commented Oct 10, 2025

copying 4get's homework: https://4get.ca/help-me.html
https://git.lolcat.ca/lolcat/4get/src/branch/master/scraper/google_api.php

  • this disables the google search engine by default since it no longer works
  • to use the new metasearch2 google engine, you must create and configure a custom_search_api_key
    without setting up billing, this gives you 100 free searches a day
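
For readers who haven't used it: a minimal sketch of what a Custom Search JSON API request and response mapping could look like. This is not the code in this PR. The function names and the `engineId` parameter (the search engine ID, sent as `cx`) are assumptions for illustration; only `custom_search_api_key` is named above.

```javascript
// Sketch only: builds a request URL for the Custom Search JSON API and maps
// a response body to simple results. `apiKey` and `engineId` are placeholders.
const ENDPOINT = 'https://www.googleapis.com/customsearch/v1';

function buildSearchUrl(apiKey, engineId, query) {
  const url = new URL(ENDPOINT);
  url.searchParams.set('key', apiKey);   // the API key
  url.searchParams.set('cx', engineId);  // the custom search engine ID
  url.searchParams.set('q', query);      // the search terms
  return url.toString();
}

function parseResults(body) {
  // `items` is absent when the search returned nothing.
  return (body.items ?? []).map((item) => ({
    title: item.title,
    url: item.link,
    description: item.snippet,
  }));
}
```

The URL would be fetched with any HTTP client and the parsed JSON body passed to `parseResults`.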

@mat-1
Owner

mat-1 commented Oct 10, 2025

This looks useful but it doesn't really fit my vision for metasearch (which is that it should Just Work without needing API keys). The scraper for Google (and Bing 🥴) should be fixed instead.

@raylu
Contributor Author

raylu commented Oct 11, 2025

do you want me to split out the disable-by-default then?
google search doesn't work as is but it's on by default now
(it also won't work until you start parsing some JS)

@mat-1
Owner

mat-1 commented Oct 11, 2025

I'd merge if Google Custom Search was implemented as its own separate engine from Google and the google engine was kept as the default, so in theory everything works out of the box but the user can configure Custom Search if they want to.

@raylu
Contributor Author

raylu commented Oct 14, 2025

er... but the current google search scraper does not work right now and it likely never will (unless we start parsing some JS)

@mat-1
Owner

mat-1 commented Oct 14, 2025

The plan is to do whatever it takes to make the scrapers work, I just haven't found the time to update them yet :(

If SearxNG can get something working then it's relatively easy for me to copy their homework, too.

@mat-1
Owner

mat-1 commented Oct 20, 2025

Just to give an update on this, I have been working on fixing the Google engine. The main change is that requesting the search page (on your first search when logged out) now gives you a JavaScript challenge that you have to execute, setting the SG_SS cookie to its result. After making the first search with the NID cookie that Google gave you and the SG_SS value that you generated, you only need to persist the NID cookie to keep getting actual results.
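
A rough sketch of the cookie bookkeeping described above, assuming the flow works exactly as stated (NID plus SG_SS on the first search, NID alone afterwards). The `GoogleSession` class and its method names are hypothetical, and solving the challenge itself is out of scope here.

```javascript
// Hypothetical helper: tracks which cookies to send on each search request.
class GoogleSession {
  constructor() {
    this.nid = null;        // NID cookie issued by Google
    this.sgSs = null;       // SG_SS value produced by solving the challenge
    this.verified = false;  // true once a search succeeded with NID + SG_SS
  }

  // Pick the NID value out of raw Set-Cookie header values.
  storeSetCookie(setCookieHeaders) {
    for (const header of setCookieHeaders) {
      const match = /^NID=([^;]*)/.exec(header);
      if (match) this.nid = match[1];
    }
  }

  // Build the Cookie header: SG_SS is only sent until the first success.
  cookieHeader() {
    const parts = [];
    if (this.nid) parts.push(`NID=${this.nid}`);
    if (this.sgSs && !this.verified) parts.push(`SG_SS=${this.sgSs}`);
    return parts.join('; ');
  }

  // Call after the first search that returns real results.
  markVerified() {
    this.verified = true;
    this.sgSs = null;
  }
}
```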

The challenge is some insane cipher (that involves a big base64 string that Google gives you and the current system time) obfuscated with equally insane Closure Compiler passes (you can tell someone at Google had fun writing them). It's very easily solvable by just including a JavaScript runtime (I was able to get a valid challenge solution in NodeJS/Deno by just setting a few global variables), but I'd prefer deobfuscating the challenge and reimplementing it instead, since fully executing Google's JS would make it way easier for them to detect us in the future by adding more random integrity checks. This deobfuscation just happens to be very time-consuming, as you may expect (and as Google intended). :)

If you're curious, some of the obfuscation passes that stood out to me were:

  • Every function/variable/field has a randomly generated name (this is just a normal Closure Compiler thing), but one function is named after the first three characters of the big string (plus an underscore).
  • Statements are often merged into expressions, like a = 1; b = 2 becomes b = (a = 1, 2).
  • Most class fields are stored in an array with random indexes.
  • Most functions are now state machines to make their control flow harder to follow.
  • Sometimes multiple functions are merged into one, with a number field that's checked with bitwise operators to determine which function it is.
  • Sometimes constant values are included as parameters.
  • Most booleans are represented in funny randomly-generated ways like NaN != NaN and ![] == true.
  • Rarely, statements are wrapped like while (true) { statement; if (true) break }.
  • Some if statements are converted into switch statements.
  • Sometimes if/else statements are reversed.
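
To make a few of these concrete, here is a small hand-written illustration of the shapes described in the list, each next to a readable equivalent. None of this is Google's code; the names and constants are made up, and the "merged functions" case is simplified to a number argument rather than a field.

```javascript
// 1. Statements merged into a comma expression.
let a, b;
b = (a = 1, 2);                // same as: a = 1; b = 2;

// 2. Booleans spelled as expressions.
const obfTrue = NaN != NaN;    // NaN never equals itself, so this is true
const obfFalse = ![] == true;  // ![] is false, so this is false

// 3. A statement wrapped in a loop that runs exactly once.
let ran = 0;
while (true) { ran += 1; if (true) break; }

// 4. A straight-line function turned into a state machine.
function plainSum(x, y) { const s = x + y; return s * 2; }
function stateMachineSum(x, y) {
  let state = 7, s, result;
  for (;;) {
    switch (state) {
      case 7: s = x + y; state = 3; break;
      case 3: result = s * 2; state = 9; break;
      case 9: return result;
    }
  }
}

// 5. Two functions merged into one, selected by a bit in a number.
function merged(mode, v) {
  if (mode & 1) return v + 1;  // "function A"
  return v * 10;               // "function B"
}
```

A deobfuscator essentially has to recognize each shape and rewrite it back into the form on the right-hand side of these comments.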

@rinj-shine

Can you share the part of Google code that generates the SG_SS please?

@mat-1
Owner

mat-1 commented Oct 20, 2025

> Can you share the part of Google code that generates the SG_SS please?

Make a request to https://google.com/search?query=meow without any cookies and find the script with challenge_version = 0 and the script immediately before that. The earlier script has a string with JavaScript code that contains the bulk of the challenge.
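
The manual extraction step could be automated along these lines. This is a sketch that assumes the two scripts are plain inline `<script>` elements in document order; the function name is hypothetical and a regex is a crude stand-in for a real HTML parser.

```javascript
// Hypothetical helper: returns the script containing `marker` and the one before it.
function findChallengeScripts(html, marker = 'challenge_version') {
  // Collect the bodies of all inline scripts, in document order.
  const scripts = [...html.matchAll(/<script\b[^>]*>([\s\S]*?)<\/script>/gi)]
    .map((m) => m[1]);
  const index = scripts.findIndex((body) => body.includes(marker));
  if (index < 1) return null; // marker missing, or no preceding script
  return { inner: scripts[index - 1], challenge: scripts[index] };
}
```

The string holding the challenge code would still need to be pulled out of `inner` and unescaped.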

Here's the NodeJS code I used (uncleaned and barely tested; the code in the string and the big base64 string were manually extracted, though that could easily be automated):

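// Stub the few browser globals the challenge script expects.
globalThis.window = globalThis
globalThis.performance = { now: () => Date.now() }
// At the top level of a CommonJS module, `this` is module.exports; the
// eval'd script below runs with the same `this`.
this.document = {
    readyState: 'complete'
}
// inner.js holds the JavaScript code extracted from the string in the earlier script.
const innerJs = require('fs').readFileSync('./inner.js', 'utf8')
eval(innerJs);
const bigString = "insert the value of the big base64 string here";
// knitsail is the name the challenge script registered in this particular response.
const res = this.knitsail.a(bigString, function () { }, false)
console.log('SG_SS:', res[0]([]))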

@lukasmaz

@mat-1 I was trying to run the script after small fixes but I failed. I'm not sure what part of Google Search response must be extracted to inner.js & bigString. I want to run it as a proof of concept for now and improve it later. When you have a moment to add some more details please do.

@mat-1
Owner

mat-1 commented Oct 20, 2025

Sure, I'll write a better proof of concept in a bit when I have time

@mat-1
Owner

mat-1 commented Oct 20, 2025

> I'm confused, how are you able to run their javascript code without an actual browser runtime? Doesn't it check for things like Canvas, GPU configuration and such?

Nope, at least for now their "integrity" checks are quite weak. :)

That's part of why I don't want to rely on a server-side JS runtime though; it wouldn't be particularly hard for Google to improve this in the future and I'd rather not be forced to run a full browser since I know my software is often deployed on very weak servers. In theory (assuming I can consistently reverse-engineer whatever Google does), reimplementing Google's challenge is the most reliable solution for my use-case.

@yeyuchen198

> @mat-1 I was trying to run the script after small fixes but I failed. I'm not sure what part of Google Search response must be extracted to inner.js & bigString. I want to run it as a proof of concept for now and improve it later. When you have a moment to add some more details please do.

inner.js:
(function(){var *********.call(this);'].join('\n')));}).call(this);

bigString:
var p='*******'

@lukasmaz

@yeyuchen198 thanks, yeah, I managed to make it work now but even with the SG_SS cookie generated & set, I'm getting flagged (HTTP 429), like you wrote here.

@yeyuchen198

> @yeyuchen198 thanks, yeah, I managed to make it work now but even with the SG_SS cookie generated & set, I'm getting flagged (HTTP 429), like you wrote here.

I got the same result: the generated SG_SS seems to be invalid, which is why the request returns a 429 error. I suspect the JavaScript environment check didn't pass.

@mat-1
Owner

mat-1 commented Oct 21, 2025

Oh yeah, you're right, the tokens generated from my NodeJS snippet do always result in a captcha page. I still don't believe the integrity check is particularly sophisticated but it seems there's more to it than I realized.

@unixfox

unixfox commented Oct 22, 2025

Maybe you should try it in jsdom. It supports canvas, and it's what YouTube.js uses to load a similar "challenge" JavaScript (the PoToken) from YouTube.

Example: https://github.com/LuanRT/BgUtils/blob/main/examples/node/innertube-challenge-fetcher-example.ts

@yeyuchen198


The following is part of the environment detection information I obtained in the browser. The result differs in Node.js because the exposed environment sends the JS detection down a different control flow. For example, one check is performance.nodeTiming, which exists in Node.js but not in the browser.



[Hook] performance.getEntriesByType("navigation") -> [object PerformanceNavigationTiming]
[Hook][get] performance.timeOrigin -> 1761214372048.5
[Hook][get] performance.timing -> [object PerformanceTiming]
[Hook] performance.now() -> 870.9000000953674
[Hook][get] performance.memory -> [object MemoryInfo]


[Hook][get] navigator.webdriver -> false
[Hook][get] navigator.hardwareConcurrency -> 12
[Hook][get] navigator.maxTouchPoints -> 0
[Hook][get] navigator.languages -> zh-CN
[Hook][get] navigator.deviceMemory -> 8
[Hook][get] navigator.connection -> [object NetworkInformation]



[Hook] document.createElement("iframe") -> [object HTMLIFrameElement]
[Hook] document.appendChild({}) -> [object HTMLIFrameElement]
[Hook] document.createElement("div") -> [object HTMLDivElement]
[Hook] document.createElement("img") -> [object HTMLImageElement]
[Hook] document.createEvent("MouseEvents") -> [object MouseEvent]
[Hook] document.removeChild({}) -> [object HTMLIFrameElement]
[Hook] document.createElement("a") -> 
[Hook] document.createElement("iframe") -> [object HTMLIFrameElement]



[Hook][get] window.isSecureContext -> true
[Hook][get] window.trustedTypes -> [object TrustedTypePolicyFactory]
[Hook][get] window.parent -> [object Window]
[Hook][get] window.self -> [object Window]
[Hook][get] window.outerWidth -> 1280
[Hook][get] window.outerHeight -> 672
[Hook][get] window.innerWidth -> 1280
[Hook][get] window.innerHeight -> 551
[Hook][get] window.devicePixelRatio -> 1.5
[Hook][get] window.opener -> null
[Hook][get] window.screen -> [object Screen]
[Hook][get] window.performance -> [object Performance]
[Hook][get] window.navigator -> [object Navigator]
[Hook][get] window.history -> [object History]
[Hook][get] window.localStorage -> [object Storage]
[Hook][get] window.sessionStorage -> [object Storage]


[Hook][get] screen.width -> 1280
[Hook][get] screen.height -> 720
[Hook][get] screen.availWidth -> 1280
[Hook][get] screen.availHeight -> 672
[Hook][get] screen.availLeft -> 0
[Hook][get] screen.availTop -> 0




[Hook][get] history.length -> 50



[Hook] sessionStorage.setItem("o__5aIyZNY2Ixc8P3ZmM8Ag", "1761214372048") -> undefined
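
For anyone wanting to collect a similar trace: one way to produce log lines in this shape is to wrap the objects in a `Proxy`. This is a guess at the technique, not the script used above, and the `hook` function is hypothetical.

```javascript
// Sketch: wrap an object so every property read and method call is logged.
function hook(name, target, log) {
  return new Proxy(target, {
    get(obj, prop) {
      const value = Reflect.get(obj, prop);
      if (typeof prop === 'symbol') return value; // skip internal symbol lookups
      if (typeof value === 'function') {
        // Return a wrapper that logs the arguments and the result of the call.
        return (...args) => {
          const result = value.apply(obj, args);
          const shown = args.map((x) => JSON.stringify(x)).join(', ');
          log(`[Hook] ${name}.${prop}(${shown}) -> ${result}`);
          return result;
        };
      }
      log(`[Hook][get] ${name}.${prop} -> ${value}`);
      return value;
    },
  });
}
```

In a browser this would be applied to copies or stand-ins of `navigator`, `screen`, and so on before the challenge script runs, with `log` set to `console.log`.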

@mat-1
Owner

mat-1 commented Oct 23, 2025

Nice! I'm still working on writing my deobfuscator (going well but taking a while as I don't have much free time) but this is good to know.

@lukasmaz

I've seen some (successful) efforts on csdn.net:
https://blog.csdn.net/m0_66390393/article/details/151690312

but I don't know if there is an article with more details available somewhere, I don't have an account there

@raylu
Contributor Author

raylu commented Jan 11, 2026

searxng/searxng#5644

6 participants