Rebuild Gmail job-email pipeline with deterministic gating and guarded DB updates #449

Open
carkod wants to merge 1 commit into master from codex/build-email-processing-pipeline-for-job-applications

Conversation

carkod (Owner) commented Apr 22, 2026

Motivation

  • Replace an ad-hoc, streaming AI/email flow with a strict, auditable pipeline to reduce false positives (shopping/marketing emails) and avoid AI-driven unsafe DB writes.
  • Support incremental Gmail ingestion using watch/history IDs and Pub/Sub payloads so the system processes only mailbox deltas.
  • Use the skill-creator guidance to structure a two-stage AI interaction (classification → extraction) with strict JSON schemas and deterministic guards.

Description

  • Reworked Gmail integration in back/src/services/GmailApi.js to add startWatch(...), fetchHistory(...), decodePubSubMessage(...), and extractMessageIdsFromHistory(...) helpers and centralized header construction.
  • Replaced the streaming/function-call Gemini approach with a single-response JSON model in back/src/services/GeminiApi.js, introducing CLASSIFICATION_SCHEMA and EXTRACTION_SCHEMA, a requestJson(...) helper, region rotation, and raw-AI output logging for debuggability.
  • Implemented a strict staged pipeline in back/src/services/emailParser.js that includes deterministic pre-filter scoring (positive job signals and negative commerce signals), an AI classification gate (confidence >= 0.8), AI extraction, deterministic DB matching order (threadId → exact company → exact title → fuzzy title similarity), and a guarded update that only writes when extraction.confidence >= 0.85 and match.confidence >= 0.9, otherwise queuing the item for review.
  • Updated the API in back/src/Tracker.js for /api/applications/scan to accept an optional lastHistoryId and return pipeline results (processed items, lastHistoryId, and reviewQueue) instead of immediately returning a full application query.
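The incremental ingestion helpers named above can be sketched as follows. This is a minimal illustration, not the code from back/src/services/GmailApi.js, which is not shown in this thread. It assumes the Pub/Sub push envelope carries a base64-encoded JSON payload with emailAddress and historyId in message.data, and that the history response lists new mail under history[].messagesAdded[].message.id. The function bodies and return shapes are assumptions.

```javascript
// Hypothetical stand-ins for the PR's helpers; the real signatures may differ.

// Decode a Pub/Sub push envelope into the mailbox address and history ID.
// Node's "base64" decoder also accepts the URL-safe alphabet.
function decodePubSubMessage(envelope) {
  const data = envelope && envelope.message && envelope.message.data;
  if (!data) return null; // nothing to process
  const json = Buffer.from(data, "base64").toString("utf8");
  const { emailAddress, historyId } = JSON.parse(json);
  return { emailAddress, historyId: String(historyId) };
}

// Collect the unique IDs of messages added since the last known history ID.
// Records without messagesAdded (label changes, deletions) are skipped.
function extractMessageIdsFromHistory(historyResponse) {
  const ids = new Set();
  for (const record of historyResponse.history || []) {
    for (const added of record.messagesAdded || []) {
      if (added.message && added.message.id) ids.add(added.message.id);
    }
  }
  return [...ids];
}
```

Deduplicating through a Set matters because the same message can appear in more than one history record, and the pipeline should process each delta only once.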

Testing

  • Built the backend JS artifacts with yarn run build-js, which completed successfully (Babel compiled the files).
  • Confirmed the code compiles and the updated pipeline files were included in the build output.

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 773be3b0a8

}

async guardedUpdate({ extraction, match, emailInput, emailDate }) {
if (extraction.confidence >= 0.85 && match.confidence >= 0.9) {

P1: Permit new records when no application match exists

The write gate in guardedUpdate() requires match.confidence >= 0.9 before entering either update or create paths, but matchApplication() returns { applicationId: null, confidence: 0 } when no prior record matches. That makes the new-record branch effectively unreachable, so genuinely new job-application emails are always sent to review and never inserted automatically, even when extraction confidence is high.
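One possible way to make the new-record branch reachable is to decide the action before checking match confidence. The sketch below is an assumption about how the gate could be restructured, not the PR's code. decideWrite and its return values are hypothetical names. The thresholds (0.85 and 0.9) and the no-match shape ({ applicationId: null, confidence: 0 }) come from this thread.

```javascript
const EXTRACTION_MIN = 0.85; // extraction threshold stated in the PR
const MATCH_MIN = 0.9;       // match threshold stated in the PR

// Hypothetical helper: returns "create", "update", or "review".
function decideWrite({ extraction, match }) {
  // Low-confidence extractions never write automatically.
  if (extraction.confidence < EXTRACTION_MIN) return "review";
  // No candidate record at all: treat as a new application.
  if (match.applicationId === null) return "create";
  // A candidate exists: update only when the match itself is strong.
  return match.confidence >= MATCH_MIN ? "update" : "review";
}
```

Under this ordering a weak match (a candidate record with confidence below 0.9) still goes to review rather than creating a record, which avoids inserting duplicates when the matcher is unsure.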

if (extraction.confidence >= 0.85 && match.confidence >= 0.9) {
const updatePayload = {
role: extraction.job_title || undefined,
company: extraction.company || "Unknown",

P1: Keep existing company instead of overwriting with "Unknown"

For matched updates, company is always set to extraction.company || "Unknown". In common follow-up emails where the extractor omits company but matching succeeds via threadId/title, this overwrites a valid stored company with "Unknown", which is data loss and can trigger unique-index conflicts because company is unique in ApplicationSchema.
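A hedged sketch of one fix is to build the update payload from only the fields the extractor actually returned, so that missing values leave the stored record untouched. buildUpdatePayload is a hypothetical helper, and the sketch assumes the update is applied so that absent keys are not written (for example a partial or $set-style update). The field names role, company, and job_title are taken from the diff excerpt above.

```javascript
// Hypothetical helper: include a key only when the extractor supplied a value.
function buildUpdatePayload(extraction) {
  const payload = {};
  if (extraction.job_title) payload.role = extraction.job_title;
  if (extraction.company) payload.company = extraction.company;
  return payload;
}

// A follow-up email that names the role but omits the company
// produces a payload with no company key at all.
const followUp = buildUpdatePayload({ job_title: "Backend Engineer", company: "" });
console.log(Object.keys(followUp));
```

A fallback such as "Unknown" could still be applied on the create path, where there is no stored value to lose.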
