Conversation
- The original implementation requested pages directly via axios and suffered from stale cookies, wrong selectors, missing URL decoding, and similar issues
- The new implementation searches by driving the system-installed Edge/Chrome browser through puppeteer-core
- Added a session warm-up mechanism to work around cn.bing.com's anti-bot handling of multi-word Chinese queries
- Spawn the browser process manually and connect over WebSocket, avoiding stdio conflicts with the MCP server
- Added a decodeBingUrl function to decode Bing redirect URLs
- Each search uses a unique temporary directory, cleaned up automatically when the search finishes
- On Windows, create the browser process via WMI to stop the VS Code Job Object from terminating child processes; on Linux/macOS, use a detached spawn
- Cache the browser session and reuse it across searches instead of creating a new one each time
- Automatically clean up the browser and temporary directories on process exit
- [Copilot review #3] Added an AbortController timeout to the fetch polling
- [Copilot review Aas-ee#6] Added an error event listener to the Linux spawn
- [Copilot review Aas-ee#7] Clean up resources when puppeteer.connect fails
- [Copilot review #1, #2, Aas-ee#4, Aas-ee#5] Not applicable / already resolved; reasons documented in comments
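The AbortController timeout added to the fetch polling (Copilot review #3) can be sketched as a small dependency-free wrapper. `withTimeout` and its callback shape are illustrative names, not the project's actual implementation:

```typescript
// Illustrative sketch: run async work under an AbortController-based deadline.
// A polling loop (e.g. one that fetches the DevTools WebSocket endpoint) passes
// the signal to fetch() so a hung request is cancelled, not merely ignored.
async function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await work(controller.signal);
  } finally {
    clearTimeout(timer); // don't leave a live timer behind on success
  }
}
```

For example, `withTimeout(signal => fetch(url, { signal }), 2000)` bounds each poll iteration to two seconds.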
Issue Aas-ee#29 - Baidu search blocked:
- Switched the Baidu engine from axios HTTP to the shared puppeteer browser
- Added anti-automation-detection measures (disable-blink-features, hiding webdriver via CDP)
- Popular keywords (weather forecast, Python tutorial, etc.) now return results normally

Issue Aas-ee#19 - engine names are case-sensitive:
- Removed the z.enum() constraint in favor of z.string() + transform normalization
- Input in any case is accepted (e.g. Bing / BAIDU / DuckDuckGo)
- Proxies such as MCPO that send capitalized engine names no longer error

Architecture:
- New global shared browser module (engines/shared/browser.ts)
- Bing and Baidu share a single browser instance, reducing resource overhead
- The Bing engine shrank from roughly 330 lines to roughly 110

Tests:
- New Baidu availability test (test-baidu-availability)
- New engine-name case test (test-engine-case)
- New search result relevance test (test-search-relevance)
- Corrected the PROXY_URL default to http://127.0.0.1:10809 (matching the code)
- Documented the new MAX_DESCRIPTION_LENGTH environment variable
- Documented the search tool's new maxDescriptionLength parameter
- Added linuxdo to the DEFAULT_SEARCH_ENGINE options
- Corrected the tool count from four to five (adding fetchJuejinArticle)
- Listed the sites supported for article fetching: juejin and linux.do
- Synced the Docker deployment environment-variable table (adding the missing variables and engines)
- Documented the limit parameter range (1-50)
- Noted that the engines parameter is case-insensitive
- Updated the Chinese and English READMEs in sync
- Rewrote bing.ts to submit queries through the search box (#sb_form_q) instead of navigating to the URL directly (Bing returns degraded results for direct URLs)
- Switched the browser from headless to GUI mode (Bing detects headless server-side)
- On Windows, hide the browser window via WMI Win32_ProcessStartup ShowWindow=0 (SW_HIDE)
- Added test-bing-quality.ts, which uses a random-suffix method to verify search result relevance
Pull request overview
This PR is a comprehensive overhaul of the search engine architecture that addresses critical functionality issues with Bing and Baidu search engines. The changes replace the axios-based HTTP scraping approach with a Puppeteer-based browser automation solution to bypass anti-bot measures.
Changes:
- Rewrote Bing and Baidu search engines using Puppeteer with headless browser automation
- Implemented shared browser management module to reduce resource overhead
- Fixed case-insensitive engine name validation issue (#19)
- Added configurable description length truncation feature
- Updated documentation to reflect code changes and fix inconsistencies
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/engines/shared/browser.ts` | New shared Puppeteer browser lifecycle management with Windows WMI process spawning |
| `src/engines/bing/bing.ts` | Complete rewrite using Puppeteer, search box submission, and Bing URL decoding |
| `src/engines/baidu/baidu.ts` | Rewrite using Puppeteer with CDP script injection to hide automation features |
| `src/tools/setupTools.ts` | Added case-insensitive engine name handling and description truncation logic |
| `src/config.ts` | Added maxDescriptionLength configuration option |
| `src/engines/linuxdo/fetchLinuxDoArticle.ts` | Updated URL regex to support both /topic/ and /t/ formats |
| `src/engines/juejin/juejin.ts` | Changed console.log to console.error for MCP compatibility |
| `package.json` | Added puppeteer-core and zod dependencies |
| `README.md` | Comprehensive documentation update with corrected default values |
| Test files | Added comprehensive test coverage for new features |
Comments suppressed due to low confidence (10)
package.json:34
- The package-lock.json shows that zod is marked as "peer: true" (line 7643), but it's listed as a regular dependency in package.json. This inconsistency could lead to installation issues. The zod package should either be a regular dependency (not peer) or moved to peerDependencies in package.json if it's expected to be provided by the consuming application.
"zod": "^3.23.0"
src/engines/baidu/baidu.ts:29
- There's a potential resource leak in the searchBaidu function. If an error occurs after creating a page but before it's closed (e.g., during page.goto or cheerio operations), the page will never be closed, leading to memory leaks. The page.close() call should be wrapped in a try-finally block or moved to a finally block to ensure it always executes.
const page = await browser.newPage();
// 通过 CDP 隐藏 webdriver/自动化特征,绕过百度反爬检测
const client = await page.createCDPSession();
await client.send('Page.addScriptToEvaluateOnNewDocument', {
source: `
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
delete navigator.__proto__.webdriver;
`
});
const searchUrl = `https://www.baidu.com/s?wd=${encodeURIComponent(query)}&pn=${pn}&ie=utf-8`;
await page.goto(searchUrl, { waitUntil: 'networkidle2', timeout: 15000 });
await new Promise(r => setTimeout(r, 1000));
const html = await page.content();
await page.close();
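The try/finally fix this comment asks for can be sketched with a small helper. `withPage` and the minimal structural types below are hypothetical stand-ins for the real puppeteer-core types, so the sketch stays self-contained:

```typescript
// Minimal structural stand-ins for puppeteer-core's Browser/Page (assumption:
// only newPage()/close() are needed for this sketch).
interface PageLike { close(): Promise<void>; }
interface BrowserLike { newPage(): Promise<PageLike>; }

// Guarantees page.close() runs even if the scraping work throws midway
// (e.g. during page.goto or cheerio parsing) — the leak the review describes.
async function withPage<T>(
  browser: BrowserLike,
  work: (page: PageLike) => Promise<T>
): Promise<T> {
  const page = await browser.newPage();
  try {
    return await work(page);
  } finally {
    await page.close();
  }
}
```

With this shape, `searchBaidu` would wrap its CDP setup, navigation, and parsing inside the `work` callback.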
src/tools/setupTools.ts:218
- The description truncation implementation adds '...' unconditionally when truncating, which makes the final length `descLimit + 3` characters instead of exactly `descLimit` characters. This could be misleading to users who expect maxDescriptionLength to be the actual maximum length. Consider either including the ellipsis in the limit (e.g., `r.description.slice(0, descLimit - 3) + '...'`) or documenting that the actual length may exceed the limit by 3 characters.
? results.map(r => ({
...r,
description: r.description.length > descLimit
? r.description.slice(0, descLimit) + '...'
: r.description
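The ellipsis-inclusive alternative the comment suggests can be sketched as a standalone helper (a hypothetical name; the real code inlines this in the map callback, and this sketch assumes `limit >= 3`):

```typescript
// Truncate so the result, ellipsis included, never exceeds `limit`
// (the review's first suggested option). For limit < 3 the bare
// ellipsis would still exceed the limit, so callers should keep limit >= 3.
function truncateDescription(text: string, limit: number): string {
  if (text.length <= limit) return text;
  return text.slice(0, Math.max(0, limit - 3)) + '...';
}
```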
src/test/test-engine-case.ts:18
- The test file test-engine-case.ts doesn't match the actual implementation in setupTools.ts. The test uses a simple transform to normalize and filter engines, but setupTools.ts uses `z.string().transform(s => s.toLowerCase()).pipe(getEnginesEnum())`, which validates against the enum AFTER lowercasing. The test should replicate this exact behavior to accurately test the implementation.
const enginesSchema = z.array(z.string())
.min(1).default(['bing'])
.transform(requestedEngines => {
const allowed = getAllowedEngines();
const normalized = requestedEngines.map(e => e.toLowerCase());
const valid = normalized.filter(e => allowed.includes(e));
if (valid.length === 0) {
throw new Error(`No valid engine found. Allowed engines: ${allowed.join(', ')}`);
}
return valid;
});
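The behavior the comment wants replicated — lowercase first, then validate each entry against the enum — can be shown without zod. `ALLOWED` and `normalizeEngines` are illustrative; the real list comes from `getAllowedEngines()`:

```typescript
// Illustrative allowed-engine list (assumption; not the project's actual set).
const ALLOWED = ['bing', 'baidu', 'duckduckgo', 'linuxdo'] as const;

// Mirrors z.string().transform(s => s.toLowerCase()).pipe(enum): each entry is
// lowercased FIRST, then checked against the allowed set, rejecting on mismatch
// (rather than filtering invalid entries out, as the test file does).
function normalizeEngines(requested: string[]): string[] {
  if (requested.length === 0) throw new Error('at least one engine required');
  return requested.map(original => {
    const lower = original.toLowerCase();
    if (!(ALLOWED as readonly string[]).includes(lower)) {
      // Reports the original spelling, which also addresses the
      // error-message readability concern raised elsewhere in this review.
      throw new Error(`Unknown engine "${original}". Allowed: ${ALLOWED.join(', ')}`);
    }
    return lower;
  });
}
```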
src/engines/shared/browser.ts:38
- The path concatenation on lines 30-38 uses string concatenation instead of path.join(), which could lead to incorrect paths on some systems. While this works for Windows paths, it's better practice to use path.join() or template literals with proper path separators for cross-platform compatibility and readability.
candidates.push(pf86 + '\\Microsoft\\Edge\\Application\\msedge.exe');
candidates.push(pf86 + '\\Google\\Chrome\\Application\\chrome.exe');
}
if (pf) {
candidates.push(pf + '\\Microsoft\\Edge\\Application\\msedge.exe');
candidates.push(pf + '\\Google\\Chrome\\Application\\chrome.exe');
}
if (localAppData) {
candidates.push(localAppData + '\\Google\\Chrome\\Application\\chrome.exe');
src/engines/linuxdo/fetchLinuxDoArticle.ts:6
- The regex pattern for validating Linux.do URLs has an issue. The pattern `/(?:\/topic\/|\/)t\/(?:[^\/]+\/)?(\d+)/` will incorrectly match URLs like `/t/123` at the root level due to the alternation `(?:\/topic\/|\/)`. The second alternative `\/` matches any single slash, so it will match the slash before `t` in `/t/123`. This should be `/(?:\/topic|\/t)\/(?:[^\/]+\/)?(\d+)/` to match either `/topic/` or `/t/` but not just any slash.
const match = url.match(/(?:\/topic\/|\/)t\/(?:[^\/]+\/)?(\d+)/) || url.match(/\/topic\/(\d+)/);
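The corrected pattern from the comment can be exercised directly; `extractTopicId` is a hypothetical wrapper added for illustration:

```typescript
// The review's corrected pattern: require a literal /topic/ or /t/ segment,
// optionally followed by a slug segment, then capture the numeric topic id.
const TOPIC_RE = /(?:\/topic|\/t)\/(?:[^\/]+\/)?(\d+)/;

function extractTopicId(url: string): string | null {
  const m = url.match(TOPIC_RE);
  return m ? m[1] : null;
}
```

This accepts both the `/topic/123` and `/t/slug/123` formats that fetchLinuxDoArticle.ts expects, while rejecting arbitrary paths.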
src/tools/setupTools.ts:190
- The engines parameter uses a complex chained transformation: `z.array(z.string().transform(s => s.toLowerCase()).pipe(getEnginesEnum()))`. This validates each string after lowercasing it against the enum. However, if an invalid engine name is provided (e.g., "InvalidEngine"), the error message will show the lowercased version "invalidengine" rather than the original input, which could be confusing to users. Consider validating before transforming or providing custom error messages that reference the original input.
engines: z.array(z.string().transform(s => s.toLowerCase()).pipe(getEnginesEnum())).min(1).default([config.defaultSearchEngine])
src/engines/bing/bing.ts:19
- The decodeBingUrl function assumes the encoded URL parameter always starts with 'a1' and strips the first 2 characters. However, there's no validation that the parameter actually starts with 'a1' or has at least 2 characters. If Bing changes this format or if the parameter has a different prefix, the function will silently decode invalid data or fail. Consider adding validation:
if (!encodedUrl.startsWith('a1') || encodedUrl.length < 3) return bingUrl;
const base64Part = encodedUrl.substring(2);
const decodedUrl = Buffer.from(base64Part, 'base64').toString('utf-8');
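Putting the suggested validation together, a hedged end-to-end sketch of the decoder follows. The `u` query parameter and `a1` prefix are as described in the comment above; the base64url character translation and the final http(s) sanity check are added assumptions, not confirmed details of the real implementation:

```typescript
// Sketch: unwrap a Bing redirect of the form
// https://www.bing.com/ck/a?...&u=a1<base64url-of-target>, falling back to the
// original URL whenever the input doesn't match the assumed format.
function decodeBingUrl(bingUrl: string): string {
  try {
    const u = new URL(bingUrl).searchParams.get('u');
    if (!u || !u.startsWith('a1') || u.length < 3) return bingUrl;
    // Assumption: the payload is base64url, so map it back to standard base64.
    const b64 = u.slice(2).replace(/-/g, '+').replace(/_/g, '/');
    const decoded = Buffer.from(b64, 'base64').toString('utf-8');
    // Sanity check (assumption): only trust the result if it looks like a URL.
    return /^https?:\/\//.test(decoded) ? decoded : bingUrl;
  } catch {
    return bingUrl; // malformed input: return unchanged rather than throw
  }
}
```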
src/tools/setupTools.ts:90
- The validation regex in setupTools.ts has the same issue as fetchLinuxDoArticle.ts. The pattern `/\/t(opic)?\//.test(url)` will match `/topic/` or `/t/` correctly, but there's an inconsistency: fetchLinuxDoArticle.ts expects both `/topic/123` and `/t/slug/123` formats, while the validation only checks for the path pattern without ensuring the full format matches what the fetch function expects.
return urlObj.hostname === 'linux.do' && /\/t(opic)?\//.test(url);
src/engines/shared/browser.ts:123
- The parseInt call on line 123 doesn't specify a radix. While this works in most cases, it's a best practice to always specify the radix (base 10) to avoid potential issues with leading zeros being interpreted as octal. Change to `parseInt(output.trim(), 10)`.
browserPid = parseInt(output.trim());
- Windows: launch the browser on an invisible desktop via CreateDesktop + CreateProcessW; child processes inherit the desktop automatically, so zero pop-ups on the user's desktop
- DuplicateHandle copies the desktop handle into the browser process, preventing the desktop from being destroyed after the launch script exits
- Removed the Baidu CDP webdriver property override (GUI mode + AutomationControlled is sufficient)
Verified that in GUI mode, direct URL navigation and search-box submission return exactly the same results; the only decisive factor is GUI vs headless, independent of how the query is submitted. Removed the submitSearchViaSearchBox function and its dependencies (the Page type import).
- Zhihu/LinuxDo: replaced axios HTTP requests with puppeteer browser searches, fixing the broken site: operator that caused irrelevant results
- Baidu: fixed navigation failures on special queries such as weather, caused by the iframe being destroyed; fell back to domcontentloaded
- Bing: fixed a waitForNavigation race condition during pagination; now compatible with AJAX pagination
- CSDN: fixed a crash when the digest field of library-type content is undefined
- Security: removed the npx dependency, eliminating 65 security vulnerabilities
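The CSDN crash fix in the list above comes down to defensive defaults on optional fields. The `CsdnHit` shape, field names other than `digest`, and the tag-stripping step are illustrative assumptions, not the project's actual types:

```typescript
// Illustrative shape of the fix: fields like `digest` may be absent on
// library-type hits, so default them instead of calling string methods
// on undefined (which crashed before).
interface CsdnHit { title?: string; digest?: string; url?: string; }

function toResult(hit: CsdnHit) {
  return {
    title: hit.title ?? '',
    description: (hit.digest ?? '').replace(/<[^>]+>/g, ''), // strip highlight markup
    url: hit.url ?? '',
  };
}
```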
- Reworked browser shutdown: graceful browser.close() with a forced-kill fallback; the setTimeout is unref'd so it cannot block process exit
- Listen for the MCP server's onclose event; when stdin closes, proactively clean up the browser and exit, fixing slow server shutdown
- Removed the ws/@types/ws dependencies (UnrefTransport is no longer used)
- executeSearch now returns a {results, errors} structure; when every engine fails, isError: true is reported to the agent
- When only some engines fail, the error details are attached in warnings
- DuckDuckGo sub-method errors are no longer swallowed; they propagate correctly
- New getErrorMessage() helper that handles the empty-message problem of AxiosError/AggregateError
- DuckDuckGo gains an errMsg() helper for empty error messages
- All console.error calls now emit concise single-line logs instead of dumping entire error objects
- New global requestTimeout config (REQUEST_TIMEOUT environment variable, default 30000ms)
- Hard-coded timeouts in all engines replaced with config.requestTimeout
- Fixed a Juejin search crash caused by a null result_model
- Test scripts now call destroySharedBrowser() for cleanup
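The close-with-fallback behavior described above can be sketched as follows. The function name, the 3-second default, and the `forceKill` callback (standing in for killing the spawned browser PID) are illustrative assumptions:

```typescript
// Graceful close with forced-kill fallback. The timer is unref'd so a stuck
// close() can never keep the Node process alive at shutdown; on successful
// close the timer is cleared so the kill never fires.
async function closeBrowserGracefully(
  browser: { close(): Promise<void> },
  forceKill: () => void,
  timeoutMs = 3000
): Promise<void> {
  const timer = setTimeout(forceKill, timeoutMs);
  // unref() is Node-specific; guard it so the sketch also type-checks elsewhere.
  (timer as unknown as { unref?: () => void }).unref?.();
  try {
    await browser.close();
  } finally {
    clearTimeout(timer);
  }
}
```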
- Distinguish total from partial failure: only errors.length === engines.length counts as total failure
- With zero results and some engines failing, report both the failing engines' errors and the succeeding engines' zero-result status
- Prevents a DuckDuckGo timeout from masking the fact that Bing returned zero results
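The all-vs-partial distinction above reduces to a small classification; the types and names here are illustrative, not the actual `executeSearch` return shape:

```typescript
interface EngineError { engine: string; message: string; }

// "All failed" only when every engine errored. Zero results with some
// failures is "partial": the caller must surface both the errors and the
// zero-result status of the engines that did run.
function classifySearch(
  resultCount: number,
  errors: EngineError[],
  engineCount: number
): 'all_failed' | 'partial_failure' | 'empty' | 'ok' {
  if (errors.length === engineCount) return 'all_failed';
  if (errors.length > 0) return 'partial_failure';
  return resultCount === 0 ? 'empty' : 'ok';
}
```

Only `all_failed` maps to `isError: true`; `partial_failure` puts the details into warnings instead.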
- Bing: wait for the search result DOM with waitForSelector instead of a fixed 500ms delay
- Bing: automatically retry when the first parse comes back empty, accommodating cn.bing.com's delayed asynchronous rendering
- Bing: also use waitForSelector instead of a fixed wait after pagination
- DuckDuckGo: log Chinese-language warnings when preload or the HTML method returns empty results
- Search tool: new branch for all engines succeeding with zero results, returning an informative Chinese-language message
- All newly added log and message text is in Chinese
Overview
This PR fixes several known issues and optimizes the search engine architecture, mainly including:
- Configurable description truncation (the `maxDescriptionLength` parameter + the `MAX_DESCRIPTION_LENGTH` environment variable)

Related issues
Change details
1. Rewrote the Bing search engine with Puppeteer
Problem: the original implementation sent HTTP requests directly via axios, which had the following issues:
Solution:
- Drive the system-installed Edge/Chrome browser via `puppeteer-core`
- Create the process on a hidden desktop via `CreateProcessW`, preventing the VS Code Job Object from terminating child processes
- A `decodeBingUrl()` function correctly decodes Bing redirect URLs

Files:
- `src/engines/bing/bing.ts` — complete rewrite
- `src/engines/shared/browser.ts` — new global shared browser management module

2. Fixed Bing search result quality
Problem: for queries such as "MCP tool rename", Bing returned irrelevant "Microsoft Certified Professional" results instead of the expected "Model Context Protocol" results.
Root cause analysis (pinpointed after extensive experiments ruled out the UA, domain, ensearch parameter, search-box submission method, and anti-detection injection as hypotheses):
Solution:
- Create an invisible desktop with the `CreateDesktop` API and launch the browser via `CreateProcessW` with `STARTUPINFO.lpDesktop` set to that desktop; all of Edge's child processes (GPU, renderers, etc.) inherit the parent desktop automatically and are entirely invisible on the user's desktop
- Copy the desktop handle into the browser process with `DuplicateHandle` so the desktop is not destroyed when the launch script exits
- `--window-position=-32000,-32000 --window-size=1,1` as a cross-platform fallback

Verification (double-checked with both Puppeteer and Playwright):
Files:
- `src/engines/bing/bing.ts` — rewritten as Puppeteer + direct URL navigation
- `src/engines/shared/browser.ts` — GUI mode + hidden desktop
- `src/test/test-bing-quality.ts` — new search result quality test (random suffix to defeat caching)

3. Fixed Baidu search being blocked (Issue #29)
Problem: Baidu search returned the "Baidu security verification" page.
Solution:
- `--disable-blink-features=AutomationControlled` plus GUI mode is enough to bypass Baidu's anti-crawler detection; no extra CDP injection is needed

Files:
- `src/engines/baidu/baidu.ts` — rewritten on Puppeteer
- `src/test/test-baidu-availability.ts` — new Baidu availability test

4. Fixed case-sensitive engine names (Issue #19)
Problem: when calling through the MCPO proxy, the proxy sends capitalized engine names such as "Bing"/"Baidu", but `z.enum()` demands strict lowercase, so parameter validation fails.

Solution:
- Dropped the `z.enum()` constraint in favor of `z.string()` + `.transform(v => v.toLowerCase())` normalization
- Input in any case is supported (e.g. Bing, BAIDU, DuckDuckGo)

Files:
- `src/tools/setupTools.ts` — engine-name validation logic
- `src/test/test-engine-case.ts` — new case-sensitivity test

5. Added search result description truncation
Requirement: the `description` field of a search result can be long and consume many tokens; users want to be able to cap its length.

Solution:
- `config.ts`: new `maxDescriptionLength` option whose global default comes from the `MAX_DESCRIPTION_LENGTH` environment variable
- `setupTools.ts`: the `search` tool gains an optional `maxDescriptionLength` parameter that overrides the global config per call; truncated descriptions are suffixed with `...`

Files:
- `src/config.ts` — new `maxDescriptionLength` option
- `src/tools/setupTools.ts` — new parameter and truncation logic
- `src/test/test-description-length.ts` — new unit + integration tests for truncation

6. Global shared browser architecture
Added the `src/engines/shared/browser.ts` module, which manages the browser lifecycle in one place:
- Process creation: Windows `CreateProcessW` (bypassing the VS Code Job Object); Linux/macOS detached spawn
- Window hiding: `CreateDesktop` invisible desktop + `CreateProcessW` + `DuplicateHandle` keep-alive, with off-screen positioning as the cross-platform fallback

7. README documentation fixes
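The shared-browser lifecycle above reduces to a lazily-created, explicitly-destroyed singleton. `SharedResource` is an illustrative abstraction sketched under that assumption, not the module's actual API:

```typescript
type Closable = { close(): Promise<void> };

// Lazily creates the resource on first use, hands the same instance to every
// caller (Bing and Baidu share one browser), and tears it down exactly once
// on destroy() (e.g. at process exit).
class SharedResource<T extends Closable> {
  private instance: Promise<T> | null = null;
  constructor(private readonly factory: () => Promise<T>) {}

  get(): Promise<T> {
    if (!this.instance) this.instance = this.factory();
    return this.instance; // concurrent callers share the same pending promise
  }

  async destroy(): Promise<void> {
    if (!this.instance) return;
    const pending = this.instance;
    this.instance = null; // a later get() creates a fresh instance
    await (await pending).close();
  }
}
```

Storing the pending promise (rather than the resolved instance) is what prevents two concurrent searches from each launching a browser.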
- Chinese is now the primary README (`README.md`); the English version moved to `README-en.md`
- `PROXY_URL` default now matches the code (http://127.0.0.1:10809)
- Documented the `MAX_DESCRIPTION_LENGTH` environment variable
- Documented the `search` tool's `maxDescriptionLength` parameter
- `linuxdo` engine (strikethrough marker removed)

New files
- `src/engines/shared/browser.ts`
- `src/test/test-bing-quality.ts`
- `src/test/test-baidu-availability.ts`
- `src/test/test-engine-case.ts`
- `src/test/test-search-relevance.ts`
- `src/test/test-description-length.ts`
- `src/test/fetchLinuxDoArticleTests.ts`
- `README-en.md`

Modified files
- `src/engines/bing/bing.ts`
- `src/engines/baidu/baidu.ts`
- `src/tools/setupTools.ts`
- `src/config.ts`
- `src/engines/juejin/juejin.ts`
- `src/engines/linuxdo/fetchLinuxDoArticle.ts`
- `README.md`
- `package.json` — added the `puppeteer-core` dependency

Deleted files
- `README-zh.md`

Tests
All tests pass:
Why Puppeteer rather than Playwright
During development, Puppeteer and Playwright were compared head to head; they behaved identically against Bing's anti-automation detection (both detected in headless mode, both fine in GUI mode). `puppeteer-core` was ultimately chosen because:
- `puppeteer-core` is only ~2MB and bundles no browser; Playwright requires installing dedicated browser binaries (~400MB+)
- `puppeteer-core` natively supports `connect()` to attach to an existing browser process, which fits this project's shared browser architecture; Playwright's `connectOverCDP()` was added later and its API is less mature than Puppeteer's
- (an `npx` one-line launch) added unnecessary complexity

Change statistics
puppeteer-core的原因:puppeteer-core仅 ~2MB,不捆绑浏览器;Playwright 需安装专用浏览器二进制 (~400MB+)puppeteer-core原生支持connect()连接已有浏览器进程,适合本项目的共享浏览器架构;Playwright 的connectOverCDP()为后加入的功能,API 不如 Puppeteer 成熟npx一行启动)增加了不必要的复杂度变更统计