Conversation
|
This looks useful but it doesn't really fit my vision for metasearch (which is that it should Just Work without needing API keys). The scraper for Google (and Bing 🥴) should be fixed instead. |
|
do you want me to split out the disable-by-default then? |
|
I'd merge if Google Custom Search was implemented as its own separate engine from Google and the |
|
er... but the current google search scraper does not work right now and it likely never will (unless we start parsing some JS) |
|
The plan is to do whatever it takes to make the scrapers work, I just haven't found the time to update them yet :( If SearxNG can get something working then it's relatively easy for me to copy their homework, too. |
|
Just to give an update on this, I have been working on fixing the Google engine. The main change is that requesting the search page (on your first search when logged out) now gives you a JavaScript challenge that you have to execute, setting the SG_SS cookie to its result. After making the first search with the NID cookie that Google gave you and the SG_SS that you generated, you only need to persist the NID cookie to keep getting actual results. The challenge is some insane cipher (that involves a big base64 string that Google gives you and the current system time) obfuscated with insane Closure Compiler passes (you can tell someone at Google had fun writing them). It's very easily solvable by just including a JavaScript runtime (I was able to get a valid challenge solution in NodeJS/Deno by just setting a few global variables) but I'd prefer deobfuscating the challenge and reimplementing it instead, since fully executing Google's JS would make it way easier for them to detect us in the future by adding more random integrity checks later. This deobfuscation just happens to be very time-consuming, as you may expect (and as Google intended). :) If you're curious, some of the obfuscation passes that stood out to me were:
|
|
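The cookie flow described above (send NID plus the generated SG_SS on the first search, then persist only NID) could be sketched roughly like this. The session object, `buildCookieHeader`, and `markVerified` are hypothetical names made up for illustration; none of them come from the metasearch codebase, and the cookie values are placeholders.

```javascript
// Hypothetical session shape: { nid, sgSs, verified }. Illustrative only.
function buildCookieHeader(session) {
  const parts = [];
  if (session.nid) parts.push(`NID=${session.nid}`);
  // SG_SS is only sent until the first search has gone through.
  if (session.sgSs && !session.verified) parts.push(`SG_SS=${session.sgSs}`);
  return parts.join('; ');
}

// After the first successful search, keep NID and drop SG_SS.
function markVerified(session) {
  return { nid: session.nid, sgSs: null, verified: true };
}

const first = { nid: 'nid-value', sgSs: 'solved-challenge', verified: false };
console.log(buildCookieHeader(first));
console.log(buildCookieHeader(markVerified(first)));
```

Whether Google keeps accepting a bare NID cookie indefinitely is not something this sketch can show; it only mirrors the behaviour reported in the comment above.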
Can you share the part of Google code that generates the SG_SS please? |
Make a request to https://google.com/search?query=meow without any cookies and find the script with
Here's the NodeJS code I used (uncleaned and barely tested; the code in the string and the big base64 string were manually extracted, though that could easily be done automatically):
// Stub the browser globals that the challenge script expects.
globalThis.window = globalThis
globalThis.performance = { now: () => Date.now() }
this.document = {
  readyState: 'complete'
}
// inner.js holds the manually extracted challenge code.
const innerJs = require('fs').readFileSync('./inner.js', 'utf8')
eval(innerJs);
// The big base64 string, also manually extracted from the response.
const bigString = "insert the value of the big base64 string here";
const res = this.knitsail.a(bigString, function () { }, false)
console.log('SG_SS:', res[0]([])) |
|
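As a variation on the snippet above, the extracted script could be run inside a separate `vm` context rather than with `eval` on the real globals, so the stubs do not leak into the rest of the process. This is only a sketch: the `innerJs` string below is a stand-in that mimics the call shape `knitsail.a(bigString, callback, false)` from the snippet, not Google's actual challenge code, and `solveInSandbox` is a hypothetical helper name. Node's `vm` module is not a security boundary, and isolating the globals does nothing to make the environment look more like a browser.

```javascript
const vm = require('vm');

// Stand-in for the extracted inner.js (NOT the real challenge script).
const innerJs = `
  this.knitsail = {
    a: function (bigString, callback, flag) {
      return [function () { return 'token-for-' + bigString.length; }];
    }
  };
`;

// Hypothetical helper: runs the script with the same stubs as the snippet above.
function solveInSandbox(source, bigString) {
  const sandbox = {
    performance: { now: () => Date.now() },
    document: { readyState: 'complete' },
  };
  vm.createContext(sandbox);
  // Make `window` point at the context's own global object.
  vm.runInContext('this.window = this;', sandbox);
  vm.runInContext(source, sandbox, { timeout: 1000 });
  const res = sandbox.knitsail.a(bigString, function () {}, false);
  return res[0]([]);
}

console.log('SG_SS:', solveInSandbox(innerJs, 'AAAA'));
```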
@mat-1 I was trying to run the script after small fixes but I failed. I'm not sure what part of Google Search response must be extracted to inner.js & bigString. I want to run it as a proof of concept for now and improve it later. When you have a moment to add some more details please do. |
|
Sure, I'll write a better proof of concept in a bit when I have time |
Nope, at least for now their "integrity" checks are quite weak. :) That's part of why I don't want to rely on a server-side JS runtime though; it wouldn't be particularly hard for Google to improve this in the future and I'd rather not be forced to run a full browser since I know my software is often deployed on very weak servers. In theory (assuming I can consistently reverse-engineer whatever Google does), reimplementing Google's challenge is the most reliable solution for my use-case. |
inner.js: bigString: |
|
@yeyuchen198 thanks, yeah, I managed to make it work now but even with the SG_SS cookie generated & set, I'm getting flagged (HTTP 429), like you wrote here. |
I got the same result — the generated SG_SS seems to be invalid, which is why it returns a 429 error. I suspect the JavaScript environment check didn’t pass. |
|
Oh yeah, you're right, the tokens generated from my NodeJS snippet do always result in a captcha page. I still don't believe the integrity check is particularly sophisticated but it seems there's more to it than I realized. |
|
Maybe you should try it in jsdom. It supports canvas. And that's what Youtube.js uses for also loading a "challenge" javascript code (called Potoken) from YouTube. Example: https://github.com/LuanRT/BgUtils/blob/main/examples/node/innertube-challenge-fetcher-example.ts |
|
The following is a part of the environment detection information I obtained in the browser. It differs in Node.js because the exposed environment causes the JS detection to enter a different control flow; for example, one check is
[Hook] performance.getEntriesByType("navigation") -> [object PerformanceNavigationTiming]
[Hook][get] performance.timeOrigin -> 1761214372048.5
[Hook][get] performance.timing -> [object PerformanceTiming]
[Hook] performance.now() -> 870.9000000953674
[Hook][get] performance.memory -> [object MemoryInfo]
[Hook][get] navigator.webdriver -> false
[Hook][get] navigator.hardwareConcurrency -> 12
[Hook][get] navigator.maxTouchPoints -> 0
[Hook][get] navigator.languages -> zh-CN
[Hook][get] navigator.deviceMemory -> 8
[Hook][get] navigator.connection -> [object NetworkInformation]
[Hook] document.createElement("iframe") -> [object HTMLIFrameElement]
[Hook] document.appendChild({}) -> [object HTMLIFrameElement]
[Hook] document.createElement("div") -> [object HTMLDivElement]
[Hook] document.createElement("img") -> [object HTMLImageElement]
[Hook] document.createEvent("MouseEvents") -> [object MouseEvent]
[Hook] document.removeChild({}) -> [object HTMLIFrameElement]
[Hook] document.createElement("a") ->
[Hook] document.createElement("iframe") -> [object HTMLIFrameElement]
[Hook][get] window.isSecureContext -> true
[Hook][get] window.trustedTypes -> [object TrustedTypePolicyFactory]
[Hook][get] window.parent -> [object Window]
[Hook][get] window.self -> [object Window]
[Hook][get] window.outerWidth -> 1280
[Hook][get] window.outerHeight -> 672
[Hook][get] window.innerWidth -> 1280
[Hook][get] window.innerHeight -> 551
[Hook][get] window.devicePixelRatio -> 1.5
[Hook][get] window.opener -> null
[Hook][get] window.screen -> [object Screen]
[Hook][get] window.performance -> [object Performance]
[Hook][get] window.navigator -> [object Navigator]
[Hook][get] window.history -> [object History]
[Hook][get] window.localStorage -> [object Storage]
[Hook][get] window.sessionStorage -> [object Storage]
[Hook][get] screen.width -> 1280
[Hook][get] screen.height -> 720
[Hook][get] screen.availWidth -> 1280
[Hook][get] screen.availHeight -> 672
[Hook][get] screen.availLeft -> 0
[Hook][get] screen.availTop -> 0
[Hook][get] history.length -> 50
[Hook] sessionStorage.setItem("o__5aIyZNY2Ixc8P3ZmM8Ag", "1761214372048") -> undefined
|
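A log in this shape can be produced by wrapping the objects a script touches in a `Proxy` that records every property read and method call. The following is a minimal sketch of that idea, not the commenter's actual tooling (which is not shown in the thread); the `hook` helper and the fake `navigator` and `performance` objects are hypothetical. Hooking the real browser objects would take more work than this.

```javascript
// Hypothetical logger: records property reads and method calls on `target`.
function hook(name, target, log) {
  return new Proxy(target, {
    get(obj, prop) {
      const value = Reflect.get(obj, prop, obj);
      if (typeof prop === 'symbol') return value;
      if (typeof value === 'function') {
        // Wrap methods so the call and its result are logged.
        return (...args) => {
          const result = value.apply(obj, args);
          const shown = args.map((a) => JSON.stringify(a)).join(', ');
          log.push(`[Hook] ${name}.${prop}(${shown}) -> ${String(result)}`);
          return result;
        };
      }
      log.push(`[Hook][get] ${name}.${prop} -> ${String(value)}`);
      return value;
    },
  });
}

const log = [];
const fakeNavigator = hook('navigator', { webdriver: false, hardwareConcurrency: 12 }, log);
const fakePerformance = hook('performance', { now: () => 870.9 }, log);

fakeNavigator.webdriver;
fakePerformance.now();
console.log(log.join('\n'));
```

Running the challenge against wrapped stubs like these in Node, and comparing the resulting log with the browser log above, is one way to see where the two control flows diverge.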
|
Nice! I'm still working on writing my deobfuscator (going well but taking a while as I don't have much free time) but this is good to know. |
|
I've seen some (successful) efforts on csdn.net: but I don't know if there is an article with more details available somewhere; I don't have an account there |
copying 4get's homework: https://4get.ca/help-me.html
https://git.lolcat.ca/lolcat/4get/src/branch/master/scraper/google_api.php
custom_search_api_key: without setting up billing, this gives you 100 free searches a day
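
For reference, a request to the Custom Search JSON API takes the API key, a search engine ID (`cx`), and the query (`q`) as URL parameters. Below is a rough sketch of building that URL plus a simple in-memory guard for the 100-searches-a-day free quota mentioned above. `buildCustomSearchUrl` and `makeDailyQuota` are hypothetical helper names, and resetting the counter on the UTC date is an assumption that may not match when Google actually resets the quota.

```javascript
// Builds a Custom Search JSON API request URL (no request is sent here).
function buildCustomSearchUrl(apiKey, engineId, query) {
  const params = new URLSearchParams({ key: apiKey, cx: engineId, q: query });
  return `https://www.googleapis.com/customsearch/v1?${params}`;
}

// Hypothetical quota guard; resets when the UTC date changes (an assumption).
function makeDailyQuota(limit) {
  let day = null;
  let used = 0;
  return function tryConsume(now = new Date()) {
    const today = now.toISOString().slice(0, 10);
    if (today !== day) {
      day = today;
      used = 0;
    }
    if (used >= limit) return false;
    used += 1;
    return true;
  };
}

const tryConsume = makeDailyQuota(100);
if (tryConsume()) {
  console.log(buildCustomSearchUrl('KEY', 'ENGINE', 'cute cats'));
}
```

A guard like this lets an engine fall back to another source once the free quota is used up, instead of sending requests that would be rejected.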