Description
Context
We translate CoMapeo docs by fetching English content from Notion, converting to Markdown, and translating using OpenAI models.
We also maintain a Notion glossary database: “Awana Digital Glossary (EN)”, where:
- Term = English canonical term (title)
- Glossary (PT) / Glossary (ES) = relation columns to PT/ES glossary pages
- We only need page titles (term translations), not definitions.
Database schema confirms:
- Term is a title field
- Glossary (PT) and Glossary (ES) are relation fields
Problem
Our translation LLM sometimes translates key product terms inconsistently (or incorrectly). We already have canonical translations in the glossary database, but we’re not using them as context during translation.
Goal
- Generate a glossary mapping file from Notion:
en_term → pt_term, es_term, … (based on related page titles in each language glossary)
- Inject this mapping into every translation task as additional context, so the model prefers the canonical terminology.
Non-goals
- Not translating glossary definitions.
- Not auto-editing docs after the fact to “fix” terms via string replacement (unless we add that later).
- Not building a full termbase/QA tool in this ticket.
Specification
Data extraction rules
Source: Notion database ID b4e92786e7ce4db68a43e83ddeb069e7 (Awana Digital Glossary EN).
For each row:
- Read the English term from the page title (Term).
- For each language relation property (at minimum Glossary (PT), Glossary (ES)):
  - Follow the relation(s) and read the related page title(s).
  - Use the related page title as the canonical translation term.
Notes:
- If a relation contains multiple pages:
  - either choose the first (deterministic sort), or store as a list. (Recommend list to avoid losing info.)
- If a translation is missing for a language, omit it (don’t invent).
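The extraction rules above can be sketched as a pure function that runs after the Notion data has been fetched. This is a minimal sketch under assumptions: the `GlossaryRow` shape, the `titleById` lookup, and the name `buildTermEntry` are hypothetical simplifications, not the real Notion API response types.

```typescript
// Hypothetical, simplified input: the real Notion response is richer.
interface GlossaryRow {
  term: string; // plain text of the "Term" title property
  relations: Record<string, string[]>; // language code -> related page IDs
}

type TermEntry = { en: string; [lang: string]: string | string[] };

// Builds one entry for terms.json and reports languages with no translation.
function buildTermEntry(
  row: GlossaryRow,
  titleById: Map<string, string>, // related page ID -> page title (assumed prefetched)
): { entry: TermEntry; missing: string[] } {
  const entry: TermEntry = { en: row.term.trim() };
  const missing: string[] = [];

  for (const [lang, pageIds] of Object.entries(row.relations)) {
    const titles = pageIds
      .map((id) => titleById.get(id)?.trim())
      .filter((t): t is string => Boolean(t));

    if (titles.length > 0) {
      entry[lang] = titles; // keep every related title as a list
    } else {
      missing.push(lang); // omit the language; never invent a translation
    }
  }
  return { entry, missing };
}
```

The `missing` list is returned rather than thrown so the caller can log gaps and continue, which matches the requirement that a missing translation must not fail the run.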
Output file format (recommended)
Generate a single JSON file (easy to load in Node):
i18n/glossary/terms.json
{
  "version": "notion:<last_edited_time_or_hash>",
  "generatedAt": "2026-02-12T00:00:00Z",
  "terms": [
    {
      "en": "Observation",
      "pt": ["Observação"],
      "es": ["Observación"]
    }
  ]
}
Also optionally generate per-language “flat maps” for prompt compactness:
- i18n/glossary/pt.json: { "Observation": "Observação", ... }
- i18n/glossary/es.json: { "Observation": "Observación", ... }
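One possible way to derive a per-language flat map from the `terms` array is sketched below. The function name `toFlatMap` is hypothetical. Taking the first title when a term has several translations is an assumption, since the spec leaves open how a list collapses to a single string.

```typescript
type TermEntry = { en: string; [lang: string]: string | string[] };

// Collapses the list-valued entries of terms.json into { enTerm: translation }.
function toFlatMap(terms: TermEntry[], lang: string): Record<string, string> {
  const flat: Record<string, string> = {};
  for (const entry of terms) {
    const value = entry[lang];
    // Assumption: the first related title is the preferred translation.
    const first = Array.isArray(value) ? value[0] : value;
    if (first) flat[entry.en] = first; // skip terms with no translation
  }
  return flat;
}
```

If the first-title choice is used, the order of related titles in `terms.json` needs to be deterministic (for example, sorted), otherwise the flat map could change between runs.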
Integration into translation prompts
In the translation code (likely scripts/notion-translate/translateFrontMatter.ts or wherever translateText() assembles prompts), prepend a glossary section like:
Glossary (use these exact translations for key terms):
- Observation → PT: Observação | ES: Observación
- Project invite → PT: <…> | ES: <…>
…
Rules for the model:
- Prefer these exact translations when they appear as product concepts.
- Do not translate the “EN term” into something else if a glossary translation exists.
- Preserve Markdown structure; do not alter code blocks/links/paths.
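A minimal sketch of assembling that glossary section from the `terms` array follows. `buildGlossarySection` is a hypothetical helper, not an existing function in the translation scripts, and joining multiple translations with " / " is an assumption.

```typescript
type TermEntry = { en: string; [lang: string]: string | string[] };

// Renders the glossary as a plain-text list for the requested languages only.
function buildGlossarySection(terms: TermEntry[], langs: string[]): string {
  const lines: string[] = [];
  for (const entry of terms) {
    const parts: string[] = [];
    for (const lang of langs) {
      const value = entry[lang];
      if (Array.isArray(value) && value.length > 0) {
        parts.push(`${lang.toUpperCase()}: ${value.join(" / ")}`);
      }
    }
    // Terms with no translation in any requested language are left out.
    if (parts.length > 0) lines.push(`- ${entry.en} → ${parts.join(" | ")}`);
  }
  if (lines.length === 0) return "";
  return ["Glossary (use these exact translations for key terms):", ...lines].join("\n");
}
```

Passing `["pt"]` when translating to Portuguese yields a PT-only list, so the same helper can serve single-language runs.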
Prompt size safeguards
- Cap glossary to relevant languages for the current translation run (e.g., PT-only when translating PT).
- Cap number of entries included if needed (e.g., top N used terms later). For now include all, but log token size and warn if exceeding a threshold.
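The size check could look roughly like the sketch below. It assumes a crude heuristic of about four characters per token rather than a real tokenizer, and both the 2000-token default and the name `checkGlossarySize` are placeholders.

```typescript
// Rough size check for the glossary section before it is added to a prompt.
function checkGlossarySize(
  section: string,
  maxTokens = 2000, // placeholder threshold, not a measured limit
): { estimatedTokens: number; overBudget: boolean } {
  // Assumption: ~4 characters per token; swap in a real tokenizer if accuracy matters.
  const estimatedTokens = Math.ceil(section.length / 4);
  const overBudget = estimatedTokens > maxTokens;
  if (overBudget) {
    console.warn(
      `Glossary section is ~${estimatedTokens} tokens (threshold ${maxTokens}).`,
    );
  }
  return { estimatedTokens, overBudget };
}
```

The function only warns and reports; deciding whether to trim entries is left to the caller, in line with "for now include all".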
When to regenerate glossary
- On every translation run, fetch glossary and regenerate file, OR
- Cache it using last_edited_time and only regenerate if changed.
(Recommend caching: stable + faster.)
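The caching option might be sketched as follows, comparing the `version` stored in the file against the newest `last_edited_time` among the glossary rows. The helper names are hypothetical, and the sketch assumes every timestamp uses the same ISO 8601 UTC format, so plain string comparison orders them correctly.

```typescript
interface GlossaryFile {
  version: string;
  generatedAt: string;
  terms: unknown[];
}

// Derives a version string from the most recent edit among the glossary rows.
function versionFor(lastEditedTimes: string[]): string {
  // Same-format ISO 8601 UTC strings sort chronologically as plain strings.
  const latest = lastEditedTimes.reduce((a, b) => (a > b ? a : b), "");
  return `notion:${latest}`;
}

// True when there is no cached file or its version no longer matches Notion.
function needsRegeneration(
  cached: GlossaryFile | null,
  lastEditedTimes: string[],
): boolean {
  return cached === null || cached.version !== versionFor(lastEditedTimes);
}
```

A caveat of this approach: deleting a row does not change the newest timestamp, so a hash over row IDs and timestamps may be the more robust choice for `version`.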
Acceptance Criteria
- A script exists to generate glossary mapping JSON from the Notion glossary database.
- The file includes EN term titles mapped to PT/ES (and any other language relations present).
- Translation workflow loads the glossary map and injects it into the model context for every page translation.
- Smoke test: run translation on at least one doc containing known glossary terms; output uses canonical glossary terms.
- If Notion glossary is missing a translation for a term/language, translation still succeeds (no crash), and logs missing entries.
Suggested tasks
- Implement scripts/notion-translate/generateGlossary.ts
- Write output to i18n/glossary/terms.json (+ optional per-language maps)
- Hook into translation workflow:
  - ensure glossary is generated/loaded before translating pages
- Add basic unit test(s) for:
  - relation traversal and title extraction
  - output schema stability
- Add prompt-injection-safe formatting for glossary content (plain text list, no user-controlled instructions)
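For the last task, one conservative approach is to normalize each glossary title before it enters the prompt, as sketched below. `sanitizeTerm` is a hypothetical helper. The stripped characters and the 80-character cap are assumptions, chosen so that a title cannot span several lines or imitate the list separators.

```typescript
// Reduces a Notion page title to a single short line of plain text.
function sanitizeTerm(raw: string, maxLength = 80): string {
  return raw
    .replace(/[\u0000-\u001f\u007f]+/g, " ") // control characters, including newlines
    .replace(/[`|→]/g, "") // backticks and the separators used in the glossary list
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim()
    .slice(0, maxLength) // assumed cap; real terms are expected to be short
    .trimEnd();
}
```

This keeps each term on one list line but does not judge its content, so framing the glossary as reference data in the prompt is still worthwhile.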