Notion translation pipeline: replace expiring Notion/S3 image URLs with canonical /images/... paths (reuse EN images) #137

@luandro

Problem

Our Notion translation workflow (scripts/notion-translate/index.ts) converts Notion pages to Markdown via n2m.pageToMarkdown() and translates using OpenAI. The resulting translated Markdown often contains expiring Notion/S3 image URLs, which break over time.

Meanwhile, the Notion fetch pipeline (scripts/notion-fetch/imageProcessing.ts and scripts/notion-fetch/imageReplacer.ts) already downloads images and rewrites their URLs to the canonical Docusaurus location:

  • disk: static/images/
  • web path: /images/<filename>

Translation does not reuse this image rewrite step, so translated pages keep the expiring URLs.

Goal

All translated Markdown should reference the same stable canonical images as English by using /images/... paths (backed by static/images/). Translations must not include expiring Notion/S3 URLs.

Proposed Solution

Integrate the existing image replacement pipeline into the translation flow:

  1. In scripts/notion-translate/index.ts, after:

     const markdownContent = await convertPageToMarkdown(englishPage.id);

     run processAndReplaceImages(markdownContent, safeFilename) from scripts/notion-fetch/imageReplacer.ts.

     This will:

       • detect Notion/S3 image URLs in the markdown (and <img> tags),
       • download images into static/images/ (or reuse cache),
       • rewrite URLs to /images/<filename>.
  2. Translate the image-stabilized markdown (already using /images/...), and ensure translation does not mutate those paths.
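
The ordering above could look roughly like the sketch below. The dependencies are injected so the flow can be exercised without Notion or OpenAI. The PipelineDeps interface, the translate wrapper, and the { markdown } return shape of processAndReplaceImages are assumptions made for illustration; the real signatures in scripts/notion-fetch/imageReplacer.ts may differ.

```typescript
// Sketch only: every signature here is an assumption, not the repo's actual types.
interface PipelineDeps {
  convertPageToMarkdown: (pageId: string) => Promise<string>;
  // Assumed to resolve to an object carrying the rewritten markdown.
  processAndReplaceImages: (
    markdown: string,
    safeFilename: string,
  ) => Promise<{ markdown: string }>;
  // Hypothetical wrapper around the existing OpenAI translation call.
  translate: (markdown: string, targetLanguage: string) => Promise<string>;
}

async function translateWithStableImages(
  deps: PipelineDeps,
  pageId: string,
  safeFilename: string,
  targetLanguage: string,
): Promise<string> {
  // 1. Notion page -> Markdown (may still contain expiring Notion/S3 URLs).
  const markdownContent = await deps.convertPageToMarkdown(pageId);

  // 2. Download or reuse images and rewrite URLs to /images/<filename>
  //    before any text is sent for translation.
  const stabilized = await deps.processAndReplaceImages(markdownContent, safeFilename);

  // 3. Translate the already-stabilized markdown.
  return deps.translate(stabilized.markdown, targetLanguage);
}
```

Running the image step before translation means the translator only ever sees /images/... paths, so every language ends up pointing at the same files as English.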

Implementation Details

  • Import:

    • processAndReplaceImages (and optionally validateAndFixRemainingImages) from scripts/notion-fetch/imageReplacer.ts.
  • Use the same “safe filename” slug (title → slug) that saveTranslatedContentToDisk() already computes, and pass it as safeFilename so image filenames remain deterministic.

  • Ensure /images/... links remain unchanged through translation:

    • Add a strong instruction in the translation prompt: “Do not modify URLs or paths, especially anything starting with /images/.”
    • Optionally add a post-translation guard: run validateAndFixRemainingImages(translatedMarkdown, safeFilename).
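
A lightweight check that translation left the /images/... paths alone could be sketched as follows. It would sit alongside validateAndFixRemainingImages rather than replace it. The helper names extractImagePaths and imagePathsPreserved are hypothetical and do not exist in the repo.

```typescript
// Hypothetical helper: collect every /images/... path referenced through
// Markdown image syntax or an inline <img src="..."> tag.
function extractImagePaths(markdown: string): string[] {
  const pattern =
    /!\[[^\]]*\]\((\/images\/[^)\s]+)|<img\b[^>]*?\bsrc=["'](\/images\/[^"']+)["']/gi;
  const paths: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    // Group 1 is set for Markdown syntax, group 2 for <img> tags.
    paths.push(match[1] || match[2]);
  }
  return paths.sort();
}

// Hypothetical guard: true when the translated markdown references exactly
// the same /images/... paths as the stabilized English markdown.
function imagePathsPreserved(before: string, after: string): boolean {
  const expected = extractImagePaths(before);
  const actual = extractImagePaths(after);
  return (
    expected.length === actual.length &&
    expected.every((path, index) => path === actual[index])
  );
}
```

If the guard returns false, the translate script could fail the page or retry it. Which of the two is preferable is left open here.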

Acceptance Criteria

  • After running bun scripts/notion-translate, translated markdown contains zero URLs matching any of:

    • secure.notion-static.com
    • notion-static.com
    • amazonaws.com
    • X-Amz- params
    • www.notion.so/image/
  • Images in translated pages reference /images/<filename> and resolve at build time (Docusaurus static).

  • Images are not duplicated per language (translations reuse the shared /images assets).

  • Works for both:

    • Markdown image syntax: ![alt](...)
    • Inline HTML: <img src="...">
  • Idempotent: re-running translation does not cause churn in image links.
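
The "zero expiring URLs" criterion could be checked mechanically with something like the sketch below. The pattern list mirrors the bullets above. The function name findExpiringUrls and the URL-extraction regex are assumptions for illustration, not existing repo code.

```typescript
// Patterns taken from the acceptance criteria list above.
const EXPIRING_URL_PATTERNS: RegExp[] = [
  /secure\.notion-static\.com/i,
  /notion-static\.com/i,
  /amazonaws\.com/i,
  /X-Amz-/i,
  /www\.notion\.so\/image\//i,
];

// Hypothetical helper: return every absolute URL in the markdown that matches
// one of the expiring patterns. Scanning all http(s) URLs covers both
// ![alt](...) syntax and <img src="..."> tags.
function findExpiringUrls(markdown: string): string[] {
  const urls = markdown.match(/https?:\/\/[^\s)"'<>]+/gi) ?? [];
  return urls.filter((url) =>
    EXPIRING_URL_PATTERNS.some((pattern) => pattern.test(url)),
  );
}
```

An empty result for every translated file would satisfy the first criterion, and the same helper could back the tests proposed in this issue.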

Tests

Add tests under scripts/notion-translate/__tests__/ (or extend existing notion-fetch tests) that verify:

  • markdown with Notion/S3 image URLs is rewritten to /images/...
  • translated output contains no Notion/S3 URLs
  • /images/... is preserved

Notes

This repo already has detailed handling for expiring Notion URLs and image caching under scripts/notion-fetch/ — translation should reuse that instead of introducing a new mapping system.

Metadata

Labels: High Priority (should have priority in solving)