Fix Bing/Baidu search engines, engine-name case sensitivity, and documentation sync #38

Open
Ebola-Chan-bot wants to merge 17 commits into Aas-ee:main from Ebola-Chan-bot:main
Conversation


Ebola-Chan-bot commented Feb 25, 2026

Overview

This PR fixes several known issues and improves the search-engine architecture. Highlights:

  1. Rewrote the Bing and Baidu engines with Puppeteer, fixing the original axios implementation being blocked by the sites' anti-scraping measures
  2. Fixed Bing result quality by switching from headless to GUI mode
  3. Fixed case-sensitive engine names (Issue Pb of mapping with MPCO from openwebui #19)
  4. Fixed Baidu searches being blocked by the "security verification" page (Issue baidu被ban了 #29)
  5. Added search-result description truncation (maxDescriptionLength parameter + MAX_DESCRIPTION_LENGTH environment variable)
  6. Fixed inconsistencies between the README and the code

Related issues

  • #19 — Pb of mapping with MPCO from openwebui
  • #29 — baidu被ban了

Changes in detail

1. Rewrote the Bing engine with Puppeteer

Problem: the original implementation sent HTTP requests directly via axios and suffered from:

  • hard-coded cookies that broke searches once they expired
  • HTML selectors that no longer matched Bing's page structure
  • Bing redirect URLs left undecoded
  • cn.bing.com triggering anti-bot checks on multi-word Chinese queries

Approach

  • Drive a system-installed Edge/Chrome browser via puppeteer-core
  • Spawn the browser process manually and connect over WebSocket (avoids clashing with the MCP server's stdio)
  • On Windows, create the process on a hidden desktop via CreateProcessW so VS Code's Job Object cannot terminate the child process
  • Add a decodeBingUrl() function to correctly decode Bing redirect URLs
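The redirect decoding mentioned above can be sketched as follows. This is a minimal illustration, not the actual src/engines/bing/bing.ts code: the `u` parameter name and the validation details are assumptions, while the "a1" prefix + base64 payload format is the one the PR describes.

```typescript
// Hypothetical sketch of decodeBingUrl: Bing wraps result links as
// redirect URLs whose `u` query parameter (name assumed here) carries
// "a1" followed by the base64-encoded target URL.
function decodeBingUrl(bingUrl: string): string {
    try {
        const u = new URL(bingUrl).searchParams.get("u");
        // Validate the observed "a1" prefix before stripping it, so an
        // unexpected format falls through to the original URL unchanged.
        if (!u || !u.startsWith("a1") || u.length < 3) return bingUrl;
        const decoded = Buffer.from(u.slice(2), "base64").toString("utf-8");
        // Only trust the decode if it yields a plausible absolute URL.
        return /^https?:\/\//.test(decoded) ? decoded : bingUrl;
    } catch {
        return bingUrl;
    }
}
```

Falling back to the input URL on any parse or decode failure keeps the search results usable even if Bing changes the wrapping format.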

Files

  • src/engines/bing/bing.ts — complete rewrite
  • src/engines/shared/browser.ts — new globally shared browser-management module
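The spawn-then-connect flow can be sketched like this. `extractWsEndpoint` is a hypothetical helper name; the real src/engines/shared/browser.ts may organize this differently. What is factual is that a Chromium-based browser launched with a remote-debugging port prints a `DevTools listening on ws://…` line to stderr, and that WebSocket URL is what `puppeteer.connect()` consumes.

```typescript
// A browser started with --remote-debugging-port prints, on stderr:
//   DevTools listening on ws://127.0.0.1:<port>/devtools/browser/<uuid>
// This helper pulls that endpoint out of the stderr stream.
function extractWsEndpoint(stderrText: string): string | null {
    const m = stderrText.match(/DevTools listening on (ws:\/\/\S+)/);
    return m ? m[1] : null;
}

// Connection step (not executed here): spawn the browser detached so the
// MCP server's stdio stays untouched, then attach over the endpoint:
//   const browser = await puppeteer.connect({ browserWSEndpoint: ws });
```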

2. Fixed Bing search-result quality

Problem: for queries such as "MCP tool rename", Bing returned irrelevant "Microsoft Certified Professional" results instead of the expected "Model Context Protocol" ones.

Root cause (located after extensive experiments ruled out the UA, domain, ensearch parameter, search-box submission, and anti-detection-injection hypotheses):

  • Bing detects headless browsers server-side; in headless mode, both direct URL navigation and search-box submission return degraded/empty results
  • In GUI mode, direct URL navigation alone yields correct results; the submission method is irrelevant

Approach

  • Switch the browser from headless to GUI mode and navigate directly to the search URL
  • On Windows, create an invisible desktop via the Win32 CreateDesktop API and launch the browser with CreateProcessW, pointing STARTUPINFO.lpDesktop at that desktop; all of Edge's child processes (GPU, renderers, etc.) inherit the parent's desktop, so nothing appears on the user's desktop
  • Duplicate the desktop handle into the browser process via DuplicateHandle so the desktop is not destroyed when the launcher script exits
  • Also pass --window-position=-32000,-32000 --window-size=1,1 as a cross-platform fallback

Verification (double-checked with both Puppeteer and Playwright):

Mode      Bing results         Baidu results
Headless  ❌ degraded/empty    ✅ normal
GUI       ✅ 10/10 relevant    ✅ normal

Files

  • src/engines/bing/bing.ts — rewritten as Puppeteer + direct URL navigation
  • src/engines/shared/browser.ts — GUI mode + hidden desktop
  • src/test/test-bing-quality.ts — new result-quality test (random suffix to defeat caching)
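The random-suffix trick used by the quality test can be sketched as below (the helper name is an assumption; the PR only states that the test appends a random suffix). A short random token makes every query string unique, so the engine cannot answer repeated test runs from a cached results page.

```typescript
// Append a short random base-36 token (e.g. "k3f9qz") to the query so
// each test run hits the engine with a never-before-seen string.
function withCacheBuster(query: string): string {
    const suffix = Math.random().toString(36).slice(2, 8);
    return `${query} ${suffix}`;
}
```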

3. Fixed Baidu searches being blocked (Issue #29)

Problem: Baidu searches returned the "百度安全验证" (Baidu security verification) page.

Approach

  • Switched the Baidu engine from axios HTTP requests to the shared Puppeteer browser
  • The shared browser's --disable-blink-features=AutomationControlled plus GUI mode is enough to pass Baidu's anti-bot checks; no extra CDP injection is needed

Files

  • src/engines/baidu/baidu.ts — rewritten with Puppeteer
  • src/test/test-baidu-availability.ts — new Baidu availability test

4. Fixed case-sensitive engine names (Issue #19)

Problem: when calling through the MCPO proxy, the proxy sends capitalized engine names such as "Bing" / "Baidu", but z.enum() required strict lowercase, so parameter validation failed.

Approach

  • Removed the z.enum() constraint in favor of z.string() + .transform(v => v.toLowerCase()) normalization
  • Input is accepted in any case (e.g. Bing, BAIDU, DuckDuckGo)
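Without pulling in zod, the normalization described above amounts to the following dependency-free sketch (the real code uses zod's `z.string().transform(...)`; the engine list here is illustrative, not the project's authoritative set):

```typescript
// Illustrative engine list; the project's actual set may differ.
const ALLOWED_ENGINES = ["bing", "baidu", "duckduckgo", "juejin", "linuxdo"];

// Lowercase every requested name, keep only known engines, and reject
// the call outright when nothing valid remains.
function normalizeEngines(requested: string[]): string[] {
    const valid = requested
        .map((e) => e.toLowerCase()) // "Bing" -> "bing", "BAIDU" -> "baidu"
        .filter((e) => ALLOWED_ENGINES.includes(e));
    if (valid.length === 0) {
        throw new Error(`No valid engine found. Allowed: ${ALLOWED_ENGINES.join(", ")}`);
    }
    return valid;
}
```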

Files

  • src/tools/setupTools.ts — engine-name validation logic
  • src/test/test-engine-case.ts — new case-sensitivity test

5. Added search-result description truncation

Motivation: the description field of search results can be long and consume many tokens; users want to cap its length.

Approach

  • config.ts: new maxDescriptionLength option, with a global default set via the MAX_DESCRIPTION_LENGTH environment variable
  • setupTools.ts: the search tool gains an optional maxDescriptionLength parameter that overrides the global setting per call
  • Truncation rule: descriptions over the limit are cut to the first N characters with ... appended
  • Priority: call parameter > global environment variable > unlimited
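The truncation rule and its priority chain can be sketched as follows (function names are illustrative; `callLimit` models the per-call maxDescriptionLength parameter and `envLimit` models MAX_DESCRIPTION_LENGTH, with undefined meaning "not set"):

```typescript
// Priority: call parameter > global env var > unlimited (undefined).
function resolveLimit(callLimit?: number, envLimit?: number): number | undefined {
    return callLimit ?? envLimit;
}

// Cut to the first `limit` characters and append "..."; note the result
// is then limit + 3 chars long, matching the behavior described in the PR.
function truncateDescription(desc: string, limit?: number): string {
    if (limit === undefined || desc.length <= limit) return desc;
    return desc.slice(0, limit) + "...";
}
```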

Files

  • src/config.ts — new maxDescriptionLength option
  • src/tools/setupTools.ts — new parameter and truncation logic
  • src/test/test-description-length.ts — new unit + integration tests for truncation

6. Globally shared browser architecture

The new src/engines/shared/browser.ts module centralizes browser lifecycle management:

  • Singleton: all search engines share one browser instance, reducing resource overhead
  • Auto-discovery: searches for Edge → Chrome in priority order (Windows/Linux/macOS)
  • Reliable launch: CreateProcessW on Windows (escapes VS Code's Job Object); detached spawn on Linux/macOS
  • Window hiding: Win32 CreateDesktop invisible desktop + CreateProcessW + DuplicateHandle keep-alive; cross-platform fallback: off-screen positioning
  • Graceful cleanup: closes the browser and removes temp directories on process exit
  • Auto-recovery: relaunches the browser automatically after a crash

7. README fixes

  • Made the Chinese README the default (README.md); moved the English one to README-en.md
  • Corrected the PROXY_URL default to match the code (http://127.0.0.1:10809)
  • Documented the MAX_DESCRIPTION_LENGTH environment variable
  • Documented the search tool's maxDescriptionLength parameter
  • Corrected the tool count (four → five)
  • Restored the linuxdo engine (removed the strikethrough)
  • Synced the Docker deployment environment-variable table
  • Kept the Chinese and English READMEs in sync

New files

File                                   Description
src/engines/shared/browser.ts          Globally shared Puppeteer browser-management module
src/test/test-bing-quality.ts          Bing result-quality test (random suffix to defeat caching)
src/test/test-baidu-availability.ts    Baidu availability test
src/test/test-engine-case.ts           Engine-name case-compatibility test
src/test/test-search-relevance.ts      Search-result relevance test
src/test/test-description-length.ts    Description-truncation test
src/test/fetchLinuxDoArticleTests.ts   Linux.do article-fetch test
README-en.md                           English README (renamed from README-zh.md)

Modified files

File                                        Description
src/engines/bing/bing.ts                    Rewritten with Puppeteer + direct URL navigation
src/engines/baidu/baidu.ts                  Rewritten with Puppeteer
src/tools/setupTools.ts                     Engine-name case normalization
src/config.ts                               Configuration changes
src/engines/juejin/juejin.ts                Minor fixes
src/engines/linuxdo/fetchLinuxDoArticle.ts  Minor fixes
README.md                                   Chinese README (rewritten; the original English version moved to README-en.md)
package.json                                Added the puppeteer-core dependency

Deleted files

File          Description
README-zh.md  Promoted to the default README.md

Tests

All tests pass:

✅ Bing quality test — 10/10 MCP-relevant results (random suffix)
✅ Baidu availability test — results returned normally, no security verification
✅ Engine-name case test — Bing/BAIDU/DuckDuckGo etc. are all recognized
✅ Search-result relevance test
✅ MCP tool-call test — verified by invoking the search tool over the MCP protocol

Why Puppeteer rather than Playwright

Both Puppeteer and Playwright were tested against Bing's anti-automation checks during development; they behave identically (both are detected in headless mode, both work in GUI mode). puppeteer-core was chosen because:

  1. Lighter weight: puppeteer-core is only ~2 MB and bundles no browser; Playwright installs dedicated browser binaries (~400 MB+)
  2. Reuses the system browser: puppeteer-core natively supports connect() to attach to an existing browser process, a good fit for this project's shared-browser architecture; Playwright's connectOverCDP() was added later and its API is less mature
  3. No extra runtime: Playwright downloads and manages its own browser copies, which adds needless complexity for an MCP server that users expect to start with a single npx command
  4. Ecosystem: Puppeteer is the de facto standard client for the Chrome DevTools Protocol, matching this project's approach of talking to the browser directly over CDP

Change statistics

20 files changed, 2568 insertions(+), 945 deletions(-)

埃博拉酱 added 9 commits January 4, 2026 12:19
- The original implementation sent requests directly via axios and suffered from expired cookies, wrong selectors, and missing URL decoding
- The new implementation searches through a system Edge/Chrome browser driven by puppeteer-core
- Added a session warm-up step to work around cn.bing.com's anti-bot checks on multi-word Chinese queries
- Spawn the browser process manually and connect over WebSocket to avoid clashing with the MCP server's stdio
- Added a decodeBingUrl function to decode Bing redirect URLs
- Each search uses a unique temp directory that is cleaned up automatically afterwards
- On Windows, create the browser process via WMI so VS Code's Job Object cannot terminate the child process
- Use detached spawn on Linux/macOS
- Cache the browser session and reuse it across searches instead of creating a new one each time
- Close the browser and remove temp directories automatically on process exit
- [Copilot review #3] Added an AbortController timeout to the fetch polling
- [Copilot review Aas-ee#6] Added an error event listener to the Linux spawn
- [Copilot review Aas-ee#7] Clean up resources when puppeteer.connect fails
- [Copilot review #1,#2,Aas-ee#4,Aas-ee#5] Not applicable / already resolved; reasons noted in comments
Issue Aas-ee#29 — Baidu searches blocked:
- Switched the Baidu engine from axios HTTP to the shared puppeteer browser
- Added anti-automation countermeasures (disable-blink-features, hiding webdriver via CDP)
- Popular keywords (weather forecast, Python tutorial, etc.) now return results normally

Issue Aas-ee#19 — case-sensitive engine names:
- Removed the z.enum() constraint in favor of z.string() + transform normalization
- Input is accepted in any case (e.g. Bing/BAIDU/DuckDuckGo)
- Capitalized engine names from MCPO proxies and the like no longer fail

Architecture:
- New globally shared browser module (engines/shared/browser.ts)
- Bing and Baidu share one browser instance, reducing resource overhead
- The Bing engine shrank from ~330 lines to ~110

Tests:
- New Baidu availability test (test-baidu-availability)
- New engine-name case test (test-engine-case)
- New search-result relevance test (test-search-relevance)
- Corrected the PROXY_URL default to http://127.0.0.1:10809 (matching the code)
- Documented the MAX_DESCRIPTION_LENGTH environment variable
- Documented the search tool's maxDescriptionLength parameter
- Added linuxdo to the DEFAULT_SEARCH_ENGINE options
- Corrected the tool count: four → five (added fetchJuejinArticle)
- Listed the sites supported by article fetching: juejin, linux.do
- Synced the Docker deployment environment-variable table (added missing variables and engines)
- Documented the limit parameter range (1-50)
- Noted that the engines parameter is case-insensitive
- Kept the Chinese and English READMEs in sync
- Rewrote bing.ts to submit queries through the search box (#sb_form_q) rather than direct URL navigation (Bing returned degraded results for direct URLs)
- Switched the browser from headless to GUI mode (Bing detects headless server-side)
- On Windows, hide the browser window via WMI Win32_ProcessStartup ShowWindow=0 (SW_HIDE)
- Added test-bing-quality.ts, which uses the random-suffix method to verify result relevance
Copilot AI review requested due to automatic review settings February 25, 2026 12:16

Copilot AI left a comment


Pull request overview

This PR is a comprehensive overhaul of the search engine architecture that addresses critical functionality issues with Bing and Baidu search engines. The changes replace the axios-based HTTP scraping approach with a Puppeteer-based browser automation solution to bypass anti-bot measures.

Changes:

  • Rewrote Bing and Baidu search engines using Puppeteer-driven browser automation
  • Implemented shared browser management module to reduce resource overhead
  • Fixed case-insensitive engine name validation issue (#19)
  • Added configurable description length truncation feature
  • Updated documentation to reflect code changes and fix inconsistencies

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/engines/shared/browser.ts New shared Puppeteer browser lifecycle management with Windows WMI process spawning
src/engines/bing/bing.ts Complete rewrite using Puppeteer, search box submission, and Bing URL decoding
src/engines/baidu/baidu.ts Rewrite using Puppeteer with CDP script injection to hide automation features
src/tools/setupTools.ts Added case-insensitive engine name handling and description truncation logic
src/config.ts Added maxDescriptionLength configuration option
src/engines/linuxdo/fetchLinuxDoArticle.ts Updated URL regex to support both /topic/ and /t/ formats
src/engines/juejin/juejin.ts Changed console.log to console.error for MCP compatibility
package.json Added puppeteer-core and zod dependencies
README.md Comprehensive documentation update with corrected default values
Test files Added comprehensive test coverage for new features
Comments suppressed due to low confidence (10)

package.json:34

  • The package-lock.json shows that zod is marked as "peer: true" (line 7643), but it's listed as a regular dependency in package.json. This inconsistency could lead to installation issues. The zod package should either be a regular dependency (not peer) or moved to peerDependencies in package.json if it's expected to be provided by the consuming application.
    "zod": "^3.23.0"

src/engines/baidu/baidu.ts:29

  • There's a potential resource leak in the searchBaidu function. If an error occurs after creating a page but before it's closed (e.g., during page.goto or cheerio operations), the page will never be closed, leading to memory leaks. The page.close() call should be wrapped in a try-finally block or moved to a finally block to ensure it always executes.
            const page = await browser.newPage();

            // Hide webdriver/automation fingerprints via CDP to bypass Baidu's anti-bot checks
            const client = await page.createCDPSession();
            await client.send('Page.addScriptToEvaluateOnNewDocument', {
                source: `
                    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
                    delete navigator.__proto__.webdriver;
                `
            });

            const searchUrl = `https://www.baidu.com/s?wd=${encodeURIComponent(query)}&pn=${pn}&ie=utf-8`;

            await page.goto(searchUrl, { waitUntil: 'networkidle2', timeout: 15000 });
            await new Promise(r => setTimeout(r, 1000));

            const html = await page.content();
            await page.close();

src/tools/setupTools.ts:218

  • The description truncation implementation adds '...' unconditionally when truncating, which makes the final length descLimit + 3 characters instead of exactly descLimit characters. This could be misleading to users who expect the maxDescriptionLength to be the actual maximum length. Consider either including the ellipsis in the limit (e.g., r.description.slice(0, descLimit - 3) + '...') or documenting that the actual length may exceed the limit by 3 characters.
                    ? results.map(r => ({
                        ...r,
                        description: r.description.length > descLimit
                            ? r.description.slice(0, descLimit) + '...'
                            : r.description

src/test/test-engine-case.ts:18

  • The test file test-engine-case.ts doesn't match the actual implementation in setupTools.ts. The test uses a simple transform to normalize and filter engines, but setupTools.ts uses z.string().transform(s => s.toLowerCase()).pipe(getEnginesEnum()) which validates against the enum AFTER lowercasing. The test should replicate this exact behavior to accurately test the implementation.
const enginesSchema = z.array(z.string())
    .min(1).default(['bing'])
    .transform(requestedEngines => {
        const allowed = getAllowedEngines();
        const normalized = requestedEngines.map(e => e.toLowerCase());
        const valid = normalized.filter(e => allowed.includes(e));
        if (valid.length === 0) {
            throw new Error(`No valid engine found. Allowed engines: ${allowed.join(', ')}`);
        }
        return valid;
    });

src/engines/shared/browser.ts:38

  • The path concatenation on lines 30-38 uses string concatenation instead of path.join(), which could lead to incorrect paths on some systems. While this works for Windows paths, it's better practice to use path.join() or template literals with proper path separators for cross-platform compatibility and readability.
        candidates.push(pf86 + '\\Microsoft\\Edge\\Application\\msedge.exe');
        candidates.push(pf86 + '\\Google\\Chrome\\Application\\chrome.exe');
    }
    if (pf) {
        candidates.push(pf + '\\Microsoft\\Edge\\Application\\msedge.exe');
        candidates.push(pf + '\\Google\\Chrome\\Application\\chrome.exe');
    }
    if (localAppData) {
        candidates.push(localAppData + '\\Google\\Chrome\\Application\\chrome.exe');

src/engines/linuxdo/fetchLinuxDoArticle.ts:6

  • The regex pattern for validating Linux.do URLs has an issue. The pattern /(?:\/topic\/|\/)t\/(?:[^\/]+\/)?(\d+)/ will incorrectly match URLs like /t/123 at the root level due to the alternation (?:\/topic\/|\/). The second alternative \/ matches any single slash, so it will match the slash before t in /t/123. This should be /(?:\/topic|\/t)\/(?:[^\/]+\/)?(\d+)/ to match either /topic/ or /t/ but not just any slash.
    const match = url.match(/(?:\/topic\/|\/)t\/(?:[^\/]+\/)?(\d+)/) || url.match(/\/topic\/(\d+)/);

src/tools/setupTools.ts:190

  • The engines parameter uses a complex chained transformation: z.array(z.string().transform(s => s.toLowerCase()).pipe(getEnginesEnum())). This validates each string after lowercasing it against the enum. However, if an invalid engine name is provided (e.g., "InvalidEngine"), the error message will show the lowercased version "invalidengine" rather than the original input, which could be confusing to users. Consider validating before transforming or providing custom error messages that reference the original input.
            engines: z.array(z.string().transform(s => s.toLowerCase()).pipe(getEnginesEnum())).min(1).default([config.defaultSearchEngine])

src/engines/bing/bing.ts:19

  • The decodeBingUrl function assumes the encoded URL parameter always starts with 'a1' and strips the first 2 characters. However, there's no validation that the parameter actually starts with 'a1' or has at least 2 characters. If Bing changes this format or if the parameter has a different prefix, the function will silently decode invalid data or fail. Consider adding validation: if (!encodedUrl.startsWith('a1') || encodedUrl.length < 3) return bingUrl;
        const base64Part = encodedUrl.substring(2);
        const decodedUrl = Buffer.from(base64Part, 'base64').toString('utf-8');

src/tools/setupTools.ts:90

  • The validation regex in setupTools.ts has the same issue as fetchLinuxDoArticle.ts. The pattern /\/t(opic)?\//.test(url) will match /topic/ or /t/ correctly, but there's an inconsistency: fetchLinuxDoArticle.ts expects both /topic/123 and /t/slug/123 formats, while the validation only checks for the path pattern without ensuring the full format matches what the fetch function expects.
                return urlObj.hostname === 'linux.do' && /\/t(opic)?\//.test(url);

src/engines/shared/browser.ts:123

  • The parseInt call on line 123 doesn't specify a radix. While this works in most cases, it's a best practice to always specify the radix (base 10) to avoid potential issues with leading zeros being interpreted as octal. Change to parseInt(output.trim(), 10).
            browserPid = parseInt(output.trim());


埃博拉酱 added 2 commits February 25, 2026 21:02
- Windows: launch the browser on an invisible desktop via CreateDesktop + CreateProcessW;
  child processes inherit the desktop automatically, so nothing pops up on the user's desktop
- DuplicateHandle copies the desktop handle into the browser process so the desktop is not destroyed after the launcher script exits
- Removed the Baidu CDP webdriver property override (GUI mode + AutomationControlled is sufficient)
埃博拉酱 added 6 commits February 25, 2026 21:16
Verified that in GUI mode, direct URL navigation returns exactly the same results as search-box submission;
the only decisive factor is GUI vs. headless, independent of submission method.
Removed the submitSearchViaSearchBox function and its dependencies (the Page type import).
- Zhihu/LinuxDo: replaced axios HTTP requests with puppeteer browser searches, fixing irrelevant results caused by the broken site: operator
- Baidu: fixed navigation failures when searching special terms such as weather (the results iframe was destroyed); fell back to domcontentloaded
- Bing: fixed a waitForNavigation race on pagination; now compatible with AJAX pagination
- CSDN: fixed a crash when the digest field of library-type content is undefined
- Security: removed the npx dependency, eliminating 65 security vulnerabilities
- Reworked browser shutdown: graceful browser.close() with a force-kill fallback; setTimeout plus unref so process exit is not blocked
- Listen for the MCP server's onclose event; when stdin closes, proactively clean up the browser and exit, fixing slow server shutdown
- Removed the ws/@types/ws dependencies (UnrefTransport is no longer used)
- executeSearch now returns a {results, errors} structure; when every engine fails, isError:true is reported to the agent
- Partial engine failures attach error details in warnings
- DuckDuckGo sub-method errors are no longer swallowed and now propagate correctly
- New getErrorMessage() helper handling empty messages on AxiosError/AggregateError
- New DuckDuckGo errMsg() helper for empty error messages
- All console.error calls now emit concise one-line logs instead of dumping whole error objects
- New global requestTimeout setting (REQUEST_TIMEOUT environment variable, default 30000 ms)
- Replaced hard-coded timeouts in all engines with config.requestTimeout
- Fixed a Juejin search crash caused by a null result_model
- Test scripts now call destroySharedBrowser() for cleanup
- Distinguish total from partial failure: only errors.length === engines.length counts as total failure
- When there are 0 results and some engines failed, report both the failed engines' errors and the successful engines' 0-result status
- Prevents a DuckDuckGo timeout from masking Bing's 0-result response
- Bing: wait for the results DOM with waitForSelector instead of a fixed 500 ms delay
- Bing: automatically retry when the first parse comes back empty, accommodating cn.bing.com's async rendering delay
- Bing: use waitForSelector after pagination as well, instead of a fixed wait
- DuckDuckGo: warn (in Chinese) when both the preload and HTML methods return empty results
- Search tool: new branch for the all-engines-succeeded-but-0-results case, returning an informational message in Chinese
- All newly added log and prompt text is in Chinese