Translation workflow: generate glossary term map from Notion and inject into LLM translation context #138

@luandro

Context

We translate CoMapeo docs by fetching English content from Notion, converting to Markdown, and translating using OpenAI models.

We also maintain a Notion glossary database: “Awana Digital Glossary (EN)”, where:

  • Term = English canonical term (title)
  • Glossary (PT) / Glossary (ES) = relation columns to PT/ES glossary pages
  • We only need page titles (term translations), not definitions.

Database schema confirms:

  • Term is a title field
  • Glossary (PT) and Glossary (ES) are relation fields

Problem

Our translation LLM sometimes translates key product terms inconsistently (or incorrectly). We already have canonical translations in the glossary database, but we’re not using them as context during translation.

Goal

  1. Generate a glossary mapping file from Notion:
    • en_term → pt_term, es_term, … (based on related page titles in each language glossary)
  2. Inject this mapping into every translation task as additional context, so the model prefers the canonical terminology.

Non-goals

  • Not translating glossary definitions.
  • Not auto-editing docs after the fact to “fix” terms via string replacement (unless we add that later).
  • Not building a full termbase/QA tool in this ticket.

Specification

Data extraction rules

Source: Notion database ID b4e92786e7ce4db68a43e83ddeb069e7 (Awana Digital Glossary EN).

For each row:

  • Read English term from the page title (Term).

  • For each language relation property (at minimum Glossary (PT), Glossary (ES)):

    • Follow the relation(s) and read the related page title(s).
    • Use related page title as the canonical translation term.

Notes:

  • If relation contains multiple pages:

    • either choose the first (deterministic sort), or store as list. (Recommend list to avoid losing info.)
  • If a translation is missing for a language, omit it (don’t invent).
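
The extraction rules above could be sketched roughly as follows. This is only an illustration: the type names, `LANGUAGE_RELATIONS`, `readTitle`, and `extractTerm` are hypothetical, and the sketch assumes the rows and the titles of the related pages have already been fetched (the `titleById` map), so no Notion client calls appear here. The property shapes follow the `title` and `relation` property values found on Notion page objects.

```typescript
// Minimal shapes for the parts of a Notion page object used in this sketch.
type RichText = { plain_text: string };
type NotionProperty =
  | { type: "title"; title: RichText[] }
  | { type: "relation"; relation: { id: string }[] };
type NotionPage = {
  id: string;
  properties: Record<string, NotionProperty | undefined>;
};

type Lang = "pt" | "es";
type GlossaryTerm = { en: string } & Partial<Record<Lang, string[]>>;

// Hypothetical mapping from relation property name to language code.
const LANGUAGE_RELATIONS: Record<string, Lang> = {
  "Glossary (PT)": "pt",
  "Glossary (ES)": "es",
};

// Join the rich-text fragments of a title property into one string.
function readTitle(page: NotionPage, propertyName: string): string {
  const prop = page.properties[propertyName];
  if (!prop || prop.type !== "title") return "";
  return prop.title.map((t) => t.plain_text).join("").trim();
}

// Build one glossary entry from an EN row. `titleById` maps the id of a
// related PT/ES glossary page to its title (assumed to be fetched elsewhere).
function extractTerm(
  row: NotionPage,
  titleById: Map<string, string>,
): GlossaryTerm | null {
  const en = readTitle(row, "Term");
  if (!en) return null; // rows without an English term are skipped

  const term: GlossaryTerm = { en };
  for (const [propertyName, lang] of Object.entries(LANGUAGE_RELATIONS)) {
    const prop = row.properties[propertyName];
    if (!prop || prop.type !== "relation") continue;

    const titles = prop.relation
      .map((ref) => titleById.get(ref.id)?.trim())
      .filter((title): title is string => Boolean(title))
      // Code-unit comparison keeps the order identical across environments.
      .sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));

    // Missing translations are omitted rather than invented.
    if (titles.length > 0) term[lang] = titles;
  }
  return term;
}
```

Storing the sorted list keeps every related title, and picking the first element later still gives a deterministic single choice.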

Output file format (recommended)

Generate a single JSON file (easy to load in Node):

i18n/glossary/terms.json

{
  "version": "notion:<last_edited_time_or_hash>",
  "generatedAt": "2026-02-12T00:00:00Z",
  "terms": [
    {
      "en": "Observation",
      "pt": ["Observação"],
      "es": ["Observación"]
    }
  ]
}

Also optionally generate per-language “flat maps” for prompt compactness:

  • i18n/glossary/pt.json : { "Observation": "Observação", ... }
  • i18n/glossary/es.json : { "Observation": "Observación", ... }
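
As a sketch of how the per-language flat maps might be derived from the combined file: `GlossaryFile` and `toFlatMap` are hypothetical names, and taking the first list entry as the single translation is an assumption, not something this issue prescribes.

```typescript
type Lang = "pt" | "es";
type GlossaryTerm = { en: string } & Partial<Record<Lang, string[]>>;

// Shape of i18n/glossary/terms.json as proposed above.
type GlossaryFile = {
  version: string;
  generatedAt: string;
  terms: GlossaryTerm[];
};

// Collapse the combined file into { "<EN term>": "<translation>" } for one language.
// Assumption: when a term has several translations, the first one is used.
function toFlatMap(file: GlossaryFile, lang: Lang): Record<string, string> {
  const flat: Record<string, string> = {};
  for (const term of file.terms) {
    const first = term[lang]?.[0];
    if (first) flat[term.en] = first; // terms without a translation are left out
  }
  return flat;
}
```

For example, `toFlatMap(file, "pt")` on the sample file above would yield `{ "Observation": "Observação" }`.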

Integration into translation prompts

In the translation code (likely scripts/notion-translate/translateFrontMatter.ts or wherever translateText() assembles prompts), prepend a glossary section like:

Glossary (use these exact translations for key terms):

  • Observation → PT: Observação | ES: Observación
  • Project invite → PT: <…> | ES: <…>

Rules for the model:

  • Prefer these exact translations when they appear as product concepts.
  • Do not translate the “EN term” into something else if a glossary translation exists.
  • Preserve Markdown structure; do not alter code blocks/links/paths.
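
One possible way to assemble that section, as a sketch only: `buildGlossarySection` and `withGlossary` are hypothetical helpers, and how the result is wired into translateText() is left open because the prompt-assembly code is not shown in this issue.

```typescript
// Build the plain-text glossary block for a single target language.
// `langLabel` is the label shown in the prompt, e.g. "PT".
function buildGlossarySection(
  flatMap: Record<string, string>,
  langLabel: string,
): string {
  const entries = Object.entries(flatMap).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0,
  );
  if (entries.length === 0) return ""; // nothing to inject

  const lines = entries.map(
    ([en, translated]) => `- ${en} → ${langLabel}: ${translated}`,
  );
  return [
    "Glossary (use these exact translations for key terms):",
    ...lines,
    "",
    "Rules:",
    "- Prefer these exact translations when the terms appear as product concepts.",
    "- Do not translate an EN term differently if a glossary translation exists.",
    "- Preserve Markdown structure; do not alter code blocks, links, or paths.",
  ].join("\n");
}

// Prepend the glossary block to an existing prompt, if there is one.
function withGlossary(basePrompt: string, glossarySection: string): string {
  return glossarySection ? `${glossarySection}\n\n${basePrompt}` : basePrompt;
}
```

Sorting the entries keeps the prompt stable between runs, which makes diffs and cached completions easier to reason about.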

Prompt size safeguards

  • Cap glossary to relevant languages for the current translation run (e.g., PT-only when translating PT).
  • Cap number of entries included if needed (e.g., top N used terms later). For now include all, but log token size and warn if exceeding a threshold.
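
The logging safeguard might look like the sketch below. The names are hypothetical, and the characters-per-token ratio is a rough heuristic assumed for illustration, not a real tokenizer; an exact count would need the tokenizer that matches the model in use.

```typescript
// Assumption: roughly 4 characters per token. This is a coarse estimate only.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// Return the estimated token count and warn when it exceeds `maxTokens`.
// `warn` is injectable so the check can be tested without touching the console.
function checkGlossarySize(
  glossarySection: string,
  maxTokens: number,
  warn: (message: string) => void = console.warn,
): number {
  const tokens = estimateTokens(glossarySection);
  if (tokens > maxTokens) {
    warn(
      `Glossary section is ~${tokens} tokens, above the ${maxTokens}-token threshold.`,
    );
  }
  return tokens;
}
```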

When to regenerate glossary

  • On every translation run, fetch glossary and regenerate file, OR
  • Cache it using last_edited_time and only regenerate if changed.

(Recommend caching: stable + faster.)
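
A minimal sketch of the caching check, assuming the newest `last_edited_time` among the glossary rows is used as the version string (`RowMeta`, `computeVersion`, and `needsRegeneration` are hypothetical names):

```typescript
// Only the timestamp is needed; Notion page objects carry it as an ISO 8601 string.
type RowMeta = { last_edited_time: string };

// Derive a version string from the most recently edited row.
function computeVersion(rows: RowMeta[]): string {
  if (rows.length === 0) return "notion:empty";
  let latest = rows[0].last_edited_time;
  for (const row of rows) {
    if (Date.parse(row.last_edited_time) > Date.parse(latest)) {
      latest = row.last_edited_time;
    }
  }
  return `notion:${latest}`;
}

// Regenerate when there is no cached file or its version differs.
function needsRegeneration(
  cachedVersion: string | null,
  rows: RowMeta[],
): boolean {
  return cachedVersion !== computeVersion(rows);
}
```

One caveat with this scheme: a version based only on the EN rows may not change when an older row is deleted or when a related PT/ES page is renamed, so a content hash of the generated terms could be a safer version value.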


Acceptance Criteria

  • A script exists to generate glossary mapping JSON from the Notion glossary database.
  • The file includes EN term titles mapped to PT/ES (and any other language relations present).
  • Translation workflow loads the glossary map and injects it into the model context for every page translation.
  • Smoke test: run translation on at least one doc containing known glossary terms; output uses canonical glossary terms.
  • If Notion glossary is missing a translation for a term/language, translation still succeeds (no crash), and logs missing entries.
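
For the last criterion, missing entries could be collected and logged without failing the run. A sketch with hypothetical names (`findMissingTranslations`, `logMissing`):

```typescript
type Lang = "pt" | "es";
type GlossaryTerm = { en: string } & Partial<Record<Lang, string[]>>;
type MissingEntry = { en: string; lang: Lang };

// List every (term, language) pair that has no translation in the glossary.
function findMissingTranslations(
  terms: GlossaryTerm[],
  langs: Lang[],
): MissingEntry[] {
  const missing: MissingEntry[] = [];
  for (const term of terms) {
    for (const lang of langs) {
      const translations = term[lang];
      if (!translations || translations.length === 0) {
        missing.push({ en: term.en, lang });
      }
    }
  }
  return missing;
}

// Log the gaps as warnings; the translation run continues regardless.
function logMissing(
  missing: MissingEntry[],
  warn: (message: string) => void = console.warn,
): void {
  for (const entry of missing) {
    warn(`Glossary: no ${entry.lang.toUpperCase()} translation for "${entry.en}"`);
  }
}
```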

Suggested tasks

  1. Implement scripts/notion-translate/generateGlossary.ts

  2. Write output to i18n/glossary/terms.json (+ optional per-language maps)

  3. Hook into translation workflow:

    • ensure glossary is generated/loaded before translating pages
  4. Add basic unit test(s) for:

    • relation traversal and title extraction
    • output schema stability
  5. Add prompt-injection-safe formatting for glossary content (plain text list, no user-controlled instructions)
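
For task 5, one option is to normalize every glossary string before it is placed in the prompt. The sketch below is an assumption about what "safe formatting" could mean here: `sanitizeGlossaryValue` is a hypothetical helper, and the 80-character limit is an arbitrary illustrative default.

```typescript
// Normalize a glossary string so it stays on one short, plain-text line.
function sanitizeGlossaryValue(value: string, maxLength = 80): string {
  return value
    .replace(/[\u0000-\u001f\u007f]+/g, " ") // control characters, including newlines
    .replace(/`/g, "") // backticks could open a code span or fence in the prompt
    .replace(/\s+/g, " ") // collapse whitespace runs
    .slice(0, maxLength) // glossary terms are short; cap anything unusually long
    .trim();
}
```

This keeps each entry to a single bounded line so a page title cannot add new lines or sections to the prompt. It cannot judge what the text means, so it complements review of the glossary content rather than replacing it.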

Metadata

Labels: High Priority (should have priority in solving)
