[integrations] Chrome extension — Gemini bulk history sync (Phase B/C)#21

Open
alanshurafa wants to merge 25 commits into main from contrib/alanshurafa/chrome-capture-gemini-history

@alanshurafa
Owner

Stacks on NateBJones-Projects#214 (chrome-capture-extension). Adds batchexecute-based history backfill (Phase B) and Sync All UI (Phase C). Pre-review fork PR; upstream PR will follow after cross-AI review.

Summary

Adds Gemini bulk history sync to the Chrome capture extension. Google does not expose a public conversation API for Gemini, so the implementation observes Gemini's own internal batchexecute history-load RPC via chrome.debugger (Phase B) and adds a Sync All / Sync New orchestrator that walks the sidebar and drives the debugger through each conversation (Phase C).

  • Phase B: chrome.debugger attaches to gemini.google.com tabs and watches exactly one URL pattern: batchexecute requests with rpcids=hNvQHb. On loadingFinished the service worker reads the response body, parses Gemini's framed positional-array envelope, and funnels every user+assistant turn through the existing processCaptureRequest pipeline. No DOM scraping, no parallel ingest path.
  • Phase C — pure state-machine helper (lib/gemini-sync-state.js) plus orchestrator (background/gemini-sync.js). Sync All enumerates the sidebar, opens a hidden tab, and drives it through each conversation one at a time while Phase B observes the history-load RPC. Sync New filters against lifetime synced IDs. Auto-sync (4h, opt-in) runs incremental. Resumable across MV3 service-worker restarts via chrome.storage.local.
  • Anti-bot throttle — 4–12 s jittered per-conversation delay (sub-millisecond precision) plus a longer "reading pause" every 10 conversations. If Gemini redirects the sync tab to a CAPTCHA the orchestrator pauses gracefully and the Sync All button relabels to "Resume Sync" so the user can pick up where it left off.
  • 35 unit tests (node --test) cover state transitions, ID deduplication and cap enforcement, completion bookkeeping, progress summary, and the waiter registry.
  • Manifest adds debugger and scripting permissions; version 0.4.0 → 0.5.0. No other permission changes.
  • Metadata bumps to 1.1.0, adds the gemini-bulk-sync tag. License unchanged (FSL-1.1-MIT). No new runtime deps, no binary blobs, no telemetry.

Test plan

  • node --test integrations/chrome-capture-extension/lib/__tests__/gemini-sync-state.test.js → 35/35 pass
  • node --check on every new/modified JS file — clean
  • markdownlint-cli2 on the updated README — 0 errors
  • check-jsonschema against .github/metadata.schema.json — valid
  • Load unpacked in Chrome 120+, sign into Gemini, click Sync All History in the Sync tab — observe hidden tab navigating through conversations and captured-count rising
  • Toggle Gemini capture off in Settings — debugger detaches from all Gemini tabs
  • Click Sync New with everSyncedIds populated — only new conversations navigate
  • Toggle Auto-sync every 4 hours on — alarm fires and runs incremental
  • Cancel mid-run — state flips to canceled, button relabels to Resume Sync

Stacks on

This PR stacks on top of NateBJones-Projects#214 (chrome-capture-extension). It adds files under integrations/chrome-capture-extension/; merging NateBJones-Projects#214 first, then this PR, applies both layers of the extension cleanly.

🤖 Generated with Claude Code

alanshurafa and others added 18 commits April 17, 2026 23:39
Chrome MV3 extension that captures AI conversations into Open Brain via the REST API. First-run config screen collects API URL and key (stored in chrome.storage.local). All ExoCortex-specific hardcoded Supabase project URLs removed — extension is fully configurable. Runtime host permissions model documented.
….debugger

Adds the Phase B foundation for Gemini bulk backfill. Gemini exposes no
public conversation API, so the extension attaches chrome.debugger to
gemini.google.com tabs and watches for the one internal RPC that Gemini
itself uses to load conversation history (batchexecute rpcids=hNvQHb).

On loadingFinished the service worker reads the response body via the
debugger protocol, parses Gemini's framed positional-array envelope,
and yields one normalized turn per user/assistant exchange. All turns
are funneled through the existing processCaptureRequest pipeline so
they inherit retry queue, sensitivity filter, fingerprint dedup, and
session metrics — no parallel /ingest path.
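
A minimal sketch of the "watch exactly one RPC" filter described above, not the extension's actual code. The function name `isHistoryRpcUrl` and the `/batchexecute` path-suffix check are assumptions for illustration; only the host and the `rpcids=hNvQHb` value come from this commit message.

```javascript
// Sketch only: decide whether a request URL is the Gemini history-load RPC.
// HISTORY_RPC_ID comes from the commit message; everything else is assumed.
const HISTORY_RPC_ID = 'hNvQHb';

function isHistoryRpcUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.hostname !== 'gemini.google.com') return false;
  if (!url.pathname.endsWith('/batchexecute')) return false;
  // Assumption: rpcids may hold a comma-separated list, so match one member exactly.
  const rpcIds = (url.searchParams.get('rpcids') || '').split(',');
  return rpcIds.includes(HISTORY_RPC_ID);
}
```

Matching on an exact `rpcids` member rather than a substring keeps unrelated RPCs whose IDs merely contain the same characters out of the pipeline.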

This commit is debugger infrastructure only. Phase C (the Sync All
orchestrator that drives per-conversation navigation) lands in a
follow-up commit. Live StreamGenerate / ambient capture is NOT ported
— the extension's public release deliberately dropped ambient capture,
and this port preserves that policy.

Manifest adds the minimum permissions needed: `debugger` (attach only
to gemini.google.com, observe one RPC pattern) and `scripting` (for
the Phase C sidebar enumerator). Version bumps 0.4.0 → 0.5.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the Phase C orchestrator that drives full-history backfill. The
state machine lives in a pure helper (lib/gemini-sync-state.js) so it
can be unit-tested under node --test without a Chrome stub; the
orchestrator (background/gemini-sync.js) owns all chrome.* calls.

Flow:
  1. Enumerate the Gemini sidebar via chrome.scripting.executeScript
     (scrolls the list until IDs stop growing).
  2. Open one background sync tab, drive it through each conversation.
  3. Per conversation: register a waiter keyed by the conversation ID,
     navigate the tab, wait for Phase B to notify via
     notifyHistoryCaptured(id, totals).
  4. Fingerprint dedup at the ingest layer guarantees re-runs are safe
     (duplicate turns return duplicate_fingerprint / existing).
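
Step 1's "scroll until IDs stop growing" loop can be sketched as a pure helper. This is a simplified, synchronous illustration: `enumerateUntilStable`, `readIds`, `scrollMore`, and the `maxRounds` guard are hypothetical names, and the real enumerator runs inside the page via chrome.scripting.executeScript.

```javascript
// Sketch only: collect conversation IDs until a scroll yields nothing new.
// readIds() returns the IDs currently rendered; scrollMore() loads more rows.
function enumerateUntilStable(readIds, scrollMore, maxRounds = 50) {
  const seen = new Set();
  for (let round = 0; round < maxRounds; round++) {
    const before = seen.size;
    for (const id of readIds()) seen.add(id);
    // Stop once a full round adds no new IDs (the first round always scrolls).
    if (round > 0 && seen.size === before) break;
    scrollMore();
  }
  return [...seen];
}
```

The `maxRounds` cap is a defensive assumption so a sidebar that keeps reporting new IDs cannot spin forever.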

Resilience:
  - Resumable across MV3 SW restarts via chrome.storage.local state.
  - User-cancelable at any time; Sync All button relabels to Resume
    Sync when a run was paused mid-flight.
  - Tab-health check before every navigation catches Google bot
    challenges (CAPTCHA / login prompts that redirect off Gemini) and
    transitions gracefully to a CANCELED paused state instead of
    burning through the queue with silent timeouts.
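
The tab-health check can be pictured as a hostname test on the sync tab's current URL. The sketch below is an assumption about its shape, not the orchestrator's code: `checkSyncTabUrl` is a hypothetical name, and only the `{ healthy, reason }` result shape mirrors what the PR's diff shows.

```javascript
// Sketch only: a CAPTCHA or login prompt redirects the tab off gemini.google.com,
// so any other hostname is treated as "paused, needs the user".
function checkSyncTabUrl(tabUrl) {
  let url;
  try {
    url = new URL(tabUrl);
  } catch {
    return { healthy: false, reason: 'sync tab has no readable URL' };
  }
  if (url.hostname !== 'gemini.google.com') {
    return { healthy: false, reason: `sync tab redirected to ${url.hostname}` };
  }
  return { healthy: true };
}
```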

Anti-bot throttle:
  - 4–12 s jittered delay between conversations (sub-millisecond
    precision so whole-second clusters don't fingerprint as a bot).
  - 20–35 s "reading pause" every 10 conversations to break cadence.
  Tuned to stay under Google's challenge threshold (earlier uniform
  4 s cadence tripped the challenge around conversation 21).
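
A rough sketch of the throttle, using the ranges quoted above. `nextDelayMs` and its injectable `random` parameter are hypothetical; the delay keeps its fractional milliseconds so values do not cluster on whole seconds.

```javascript
// Sketch only: 4-12 s jitter between conversations, 20-35 s pause every 10th.
function nextDelayMs(completedCount, random = Math.random) {
  const between = (minMs, maxMs) => minMs + random() * (maxMs - minMs);
  const isReadingPause = completedCount > 0 && completedCount % 10 === 0;
  return isReadingPause ? between(20000, 35000) : between(4000, 12000);
}
```

Passing `random` in makes the helper deterministic under node --test while production callers fall back to Math.random.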

Incremental path:
  - syncIncremental() filters the sidebar against lifetime everSyncedIds
    and navigates only the delta. Capped at 20 per run so scheduled
    use stays quiet.
  - Auto-sync (4-hour cadence, opt-in) drives incremental sync.
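
The delta filter might look roughly like this. `filterToNewIds` is named elsewhere in this PR, but the signature, the `cap` parameter, and the in-list deduplication are assumptions for illustration.

```javascript
// Sketch only: keep sidebar IDs never synced before, capped per run.
const INCREMENTAL_CAP = 20; // cap quoted in the commit message

function filterToNewIds(sidebarIds, everSyncedIds, cap = INCREMENTAL_CAP) {
  const known = new Set(everSyncedIds);
  const fresh = [];
  for (const id of sidebarIds) {
    if (known.has(id)) continue;
    known.add(id); // also drops duplicates within the sidebar list itself
    fresh.push(id);
    if (fresh.length >= cap) break;
  }
  return fresh;
}
```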

Tests (35 cases) cover state transitions, pendingIds deduplication and
cap enforcement, completion/failure bookkeeping, progress summary, and
the waiter registry's resolve/abort/abortAll semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Loads the Phase B debugger and Phase C orchestrator via importScripts,
adds GEMINI_SYNC_* message handlers (start, cancel, resume, incremental,
status, auto-sync toggle), and drives the Gemini auto-sync alarm on
install/startup.

Exposes processCaptureRequest on the service-worker global so the Gemini
debugger module can funnel history turns through the same ingest pipeline
as manual capture. Classic-script function declarations are already global
in the SW scope, but the explicit assignment pins the cross-module
contract so it doesn't quietly break if the function is ever rewrapped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the "Gemini bulk sync not supported" hint with the full Sync All
History / Sync New / Cancel button group plus a 4-hour auto-sync toggle.
The button auto-labels as "Resume Sync" when a run was paused mid-flight
(Google bot challenge, SW restart) so the user can pick up where the state
machine left off without re-enumerating.

Progress surfaces via a 2-second polling loop that reads
GEMINI_SYNC_STATUS, renders percent complete and the captured/dedup
counters, and stops the poll as soon as the state machine leaves
enumerating/syncing. Paused state shows the failure reason and — when
the reason looks like a CAPTCHA — prompts the user to solve it first
and then click Resume.

The existing shared sync-progress-area (used by Claude/ChatGPT full
sync) remains in place unchanged; Gemini's progress lives in its own
gemini-sync-progress block so the two flows don't clobber each other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README: adds the "Gemini bulk history sync (Phase B/C)" section covering
how the debugger-based capture works end-to-end, the "Debugging this
browser" banner users will see while syncing, the anti-bot throttling
strategy, and how to pause the flow (toggle Gemini off or dismiss the
debugger banner). Updates the Supported Sites table row for Gemini, the
Usage paragraph, and the Chrome Web Store permission justifications for
`debugger` and `scripting`.

metadata.json: bumps version 1.0.0 → 1.1.0, adds the `gemini-bulk-sync`
tag, updates the `updated` date, and drops the `_todo` field so the
current (stricter) metadata schema validates. The TODO it referenced
(PR NateBJones-Projects#201 slug) is still explained in the README alongside the
prerequisite link.

License remains FSL-1.1-MIT. No new runtime dependencies, no binary
blobs, no telemetry or third-party hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 770594fe1e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if (!over) return;

const thoughtId = active.id as number;
const newStatus = over.id as string;

P1: Derive destination status from column, not over.id

When a card is dropped over another card, dnd-kit sets over.id to that card's numeric thought ID, not a kanban status string; this value is then sent as status, which fails server validation (VALID_STATUSES) and causes every such drag to revert. In practice this breaks moving items into populated columns because drops frequently land on cards rather than empty column space.
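
One way to picture the suggested fix, as a plain JavaScript sketch rather than the project's actual handler: `resolveDropStatus` is a hypothetical helper, the status strings are invented placeholders (the real VALID_STATUSES list is not shown here), and each card is assumed to carry `id` and `status` fields.

```javascript
// Sketch only: placeholder statuses, not the server's real list.
const VALID_STATUSES = ['todo', 'in_progress', 'done'];

// overId is either a column's status string or the numeric ID of the card dropped on.
function resolveDropStatus(overId, thoughts) {
  if (typeof overId === 'string' && VALID_STATUSES.includes(overId)) {
    return overId; // dropped on empty column space
  }
  const target = thoughts.find((t) => t.id === overId); // dropped on another card
  return target && VALID_STATUSES.includes(target.status) ? target.status : null;
}
```

Returning `null` for an unresolvable target lets the caller skip the update instead of sending an invalid status to the server.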


try {
const result = await driveConversation(record.syncTabId, conversationId);
await updateState((rec) => stateMod.recordCompletion(rec, conversationId, result));

P1: Only mark Gemini conversations synced after successful ingest

This path records every driven conversation as a completion regardless of capture outcome, and recordCompletion adds the ID to the lifetime everSyncedIds set. Statuses such as disabled_platform and other non-ingest outcomes still flow through here, so incremental sync later skips those conversations permanently via filterToNewIds even though nothing was actually captured.
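
A hedged sketch of what gating on the ingest outcome could look like. The record fields follow names used in this PR (pendingIds, completedIds, failedIds, everSyncedIds), but the `result.status` field, the `'captured'` status, and the INGESTED_STATUSES set are assumptions; only duplicate_fingerprint, existing, and disabled_platform appear in the PR text.

```javascript
// Sketch only: statuses assumed to mean the turns reached the ingest layer.
const INGESTED_STATUSES = new Set(['captured', 'duplicate_fingerprint', 'existing']);

function recordCompletion(record, conversationId, result) {
  const ingested = INGESTED_STATUSES.has(result.status);
  const next = {
    ...record,
    pendingIds: record.pendingIds.filter((id) => id !== conversationId),
    completedIds: [...record.completedIds],
    failedIds: [...record.failedIds],
    everSyncedIds: [...record.everSyncedIds],
  };
  // Non-ingest outcomes land in failedIds so a later run can retry them.
  const bucket = ingested ? next.completedIds : next.failedIds;
  if (!bucket.includes(conversationId)) bucket.push(conversationId);
  if (ingested && !next.everSyncedIds.includes(conversationId)) {
    next.everSyncedIds.push(conversationId);
  }
  return next;
}
```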

Comment on lines +973 to +974
if (attachedTabs && typeof attachedTabs.has === 'function' && !attachedTabs.has(syncTabId)) {
return { healthy: false, reason: 'debugger not attached to sync tab' };

P2: Avoid canceling sync before debugger attach settles

The health check treats "debugger not attached to sync tab" as fatal, but immediately after creating/updating the sync tab there is an async attach race (tabs.onUpdated in the debugger module) where _attachedTabs can still be empty. In that window, the loop cancels the run before first capture; this should wait for attach (or defer this specific check) instead of hard-failing.

alanshurafa and others added 5 commits April 21, 2026 16:49
… false-negatives on debugger attach race

On fresh sync, ensureSyncTab creates the Gemini tab synchronously but Phase B's
chrome.debugger.attach runs async via chrome.tabs.onUpdated. mainLoop's first
iteration called checkSyncTabHealthy immediately and could false-flag the run
as paused with "debugger not attached" before attach won the race.

Move attach awareness entirely into driveConversation (which already does it
via waitForDebuggerAttach with a tolerant 2s budget), so the health check
focuses on the real CAPTCHA/navigation-away signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…meIfInterrupted

resumeIfInterrupted() is called at module load and was setting syncInFlight=true
only AFTER an await loadState(). A popup-triggered startSync/resumeSync or
alarm-triggered syncIncremental arriving during that async gap would observe
syncInFlight=false, slip past the guard, and double-enter mainLoop against the
same persisted queue.

Claim the lock synchronously at the function top before any await. Release it
explicitly when we short-circuit out (no-op / stale-reset paths); the happy
path hands off to the existing finally block that already clears the lock
after mainLoop completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…etion totals double-count

Two pure-helper fixes with test coverage:

1. waiter register() silently overwrote an existing slot, so a reentrant
   driveConversation or stale slot that outlived its setTimeout left the old
   promise hanging until abortAll. Worse, the old timeout's abort-by-id call
   could then reject the NEW waiter. Now reject the prior waiter with a
   clear reason before replacing.

2. recordCompletion added the result totals every call even when the id was
   already in completedIds. A duplicate notify (retry, late-resolve, or
   re-sync of the same conversation) inflated captured/dedup/turn counts.
   Gate the totals fold on first-time completion only.

Adds two tests (36, 37) that guard both behaviors. Full suite 37/37 pass.
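
The first fix can be illustrated with a small registry sketch. This is not the code in lib/gemini-sync-state.js: `createWaiterRegistry` and `size` are hypothetical names, and only the register/resolve/abort/abortAll verbs come from this PR.

```javascript
// Sketch only: one pending waiter per conversation ID.
function createWaiterRegistry() {
  const waiters = new Map(); // conversationId -> { resolve, reject }

  function register(id) {
    const prior = waiters.get(id);
    // Reject the old waiter instead of silently overwriting its slot.
    if (prior) prior.reject(new Error(`waiter for ${id} replaced by a newer register()`));
    return new Promise((resolve, reject) => {
      waiters.set(id, { resolve, reject });
    });
  }

  function resolve(id, totals) {
    const slot = waiters.get(id);
    if (!slot) return false;
    waiters.delete(id);
    slot.resolve(totals);
    return true;
  }

  function abort(id, reason) {
    const slot = waiters.get(id);
    if (!slot) return false;
    waiters.delete(id);
    slot.reject(new Error(reason));
    return true;
  }

  function abortAll(reason) {
    for (const id of [...waiters.keys()]) abort(id, reason);
  }

  return { register, resolve, abort, abortAll, size: () => waiters.size };
}
```

Rejecting the prior promise at replacement time means a stale caller learns immediately that it lost the slot, rather than hanging until abortAll.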

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… alarm + popup

Four related fixes across the Phase B/C surface:

1. gemini-debugger.js — reject base64Encoded response bodies (the hNvQHb RPC
   is always text; base64 means wrong content type or URL misidentification)
   and cap parseable bodies at 8 MB so a malformed payload can't OOM the
   service worker.

2. gemini-debugger.js — on empty/malformed extractor output, recover the
   conversation id from the sync tab's /app/<id> URL and resolve the
   waiter with zero-result counts. Previously the orchestrator blocked on
   the full 15s capture timeout even when the response was visibly empty.

3. service-worker.js — Gemini auto-sync alarm now reads status first and
   skips syncIncremental when the prior run is paused on a Google
   challenge (state=canceled with pendingIds). Prevents alarm-driven
   re-runs from re-triggering the CAPTCHA until the user resumes.

4. popup.js — resume-vs-start click dispatch keys off a data-mode attribute
   set by renderGeminiProgress instead of string-matching the button text.
   Robust to future copy or localization changes.

Tests 37/37 pass.
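
Fix 1 amounts to a guard on the object the debugger protocol's Network.getResponseBody returns, which carries `body` and `base64Encoded`. The sketch below is illustrative: `acceptResponseBody` is a hypothetical name, and measuring size by string length (UTF-16 code units) is an assumed approximation of the 8 MB cap.

```javascript
// Sketch only: reject bodies the history parser should never see.
const MAX_BODY_LENGTH = 8 * 1024 * 1024; // 8 MB cap from the commit message

function acceptResponseBody(response) {
  if (!response || typeof response.body !== 'string') {
    return { ok: false, reason: 'missing body' };
  }
  if (response.base64Encoded) {
    return { ok: false, reason: 'base64-encoded body (expected text)' };
  }
  if (response.body.length > MAX_BODY_LENGTH) {
    return { ok: false, reason: 'body exceeds 8 MB cap' };
  }
  return { ok: true, body: response.body };
}
```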

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…execute parsing

Adds 13 tests that exercise extractGeminiHistory against synthesized
batchexecute payloads. Guards the shape layers codex called out as
under-tested:

- XSSI prefix handling and whitespace tolerance
- empty / non-history / non-wrb.fr frames return null cleanly
- single-turn and multi-turn payloads decode in historyOrder
- turns with missing ids / empty prompts drop without aborting the batch
- historic timestamp decoding from [seconds, nanos] pair
- parseAdaptive length-prefix drift tolerance (+/- 5 bytes)
- fuzz: extractor never throws on 7 varieties of garbage input

Test total now 50 (was 37). All pass.
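
Two of the shape layers listed above can be sketched in isolation. These helpers are illustrative, not extractGeminiHistory itself: the names are hypothetical, and treating `)]}'` as the XSSI prefix is an assumption based on the guard Google commonly prepends to JSON responses.

```javascript
// Sketch only: strip the anti-XSSI guard, tolerating surrounding whitespace.
const XSSI_PREFIX = ")]}'";

function stripXssiPrefix(text) {
  const trimmed = text.trimStart();
  return trimmed.startsWith(XSSI_PREFIX)
    ? trimmed.slice(XSSI_PREFIX.length).trimStart()
    : trimmed;
}

// Sketch only: turn a [seconds, nanos] pair into a Date, or null if malformed.
function decodeTimestamp(pair) {
  if (!Array.isArray(pair) || pair.length < 2) return null;
  const [seconds, nanos] = pair;
  if (!Number.isFinite(seconds) || !Number.isFinite(nanos)) return null;
  return new Date(seconds * 1000 + Math.floor(nanos / 1e6));
}
```

Returning `null` instead of throwing matches the stated goal that malformed input never aborts the batch.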

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… + resumeIfInterrupted lock leak

Round 2 review caught two bugs I introduced in the round 1 fix batch.

1. gemini-debugger.js — the empty-payload notifyHistoryCaptured
   (captured:0, total:0) path marked the conversation as completed and
   added it to everSyncedIds, silently skipping it forever on future
   incremental syncs. If the payload was just a transient parse failure
   or future Gemini format change, the user loses that conversation.
   Revert to letting the orchestrator's 15s capture timeout fire
   naturally, which routes the id to failedIds instead — user can retry
   via Sync All after an extractor fix ships.

2. gemini-sync.js — resumeIfInterrupted's finally block checked the
   local record.state (still SYNCING/ENUMERATING) to decide whether to
   release syncInFlight. The stale-heartbeat branch updated persisted
   state but not the local copy, so the finally wrongly assumed we were
   handing off to the happy path and never cleared the lock. Result:
   after one stale-interrupted detection, every subsequent startSync,
   resumeSync, and syncIncremental returned 'sync already running'
   until the SW terminated.
   Replace the state-based heuristic with an explicit
   handedOffToMainLoop flag set only when we actually proceed.

Tests 50/50 still pass.
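
The locking pattern from fix 2, combined with the earlier "claim before any await" change, can be sketched as follows. This is a simplified stand-in for gemini-sync.js: the injected `loadState` and `mainLoop` dependencies, the `'syncing'` state string, and the return values other than 'sync already running' are assumptions.

```javascript
// Sketch only: module-level lock shared by every entry point.
let syncInFlight = false;

async function resumeIfInterrupted({ loadState, mainLoop }) {
  if (syncInFlight) return 'sync already running';
  syncInFlight = true; // claimed synchronously, before the first await
  let handedOffToMainLoop = false;
  try {
    const record = await loadState();
    if (!record || record.state !== 'syncing') return 'nothing to resume';
    handedOffToMainLoop = true; // set only when we actually proceed
    try {
      await mainLoop(record);
    } finally {
      syncInFlight = false; // happy path releases once mainLoop settles
    }
    return 'resumed';
  } finally {
    // Short-circuit and error paths release here, independent of record.state.
    if (!handedOffToMainLoop) syncInFlight = false;
  }
}
```

Because an async function runs synchronously up to its first await, a second caller arriving during `loadState()` already sees the lock held; the explicit flag then decides who releases it, so a stale local copy of the record cannot leak the lock.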

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alanshurafa
Owner Author

Refreshing checks after markdownlint cleanup merged into fork main.

@alanshurafa alanshurafa reopened this Apr 22, 2026
@alanshurafa
Owner Author

Refreshing checks after fork markdownlint workflow fix.

@alanshurafa alanshurafa reopened this Apr 22, 2026