Inconsistent handling of Markdown syntax #68

@Fevol

Originally opened in the Firefox Translations repo as issue 499

Describe the bug
Markdown formatting will in most cases not survive the translation process, either being mangled, closed improperly, mapped to other characters or outright omitted in the final translation.

This is mostly a nice-to-have; some parts of the syntax will probably never translate properly (e.g. code blocks and quotes). I also fully realize that this is a rather niche use case, and that supporting it might degrade overall translation performance.

Related issues
mozilla/firefox-translations#486

Potential solution
Include Markdown syntax as part of the training pipeline, similar to what was mentioned in the issue above
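
To make the idea concrete, here is a minimal sketch of what such an augmentation step could look like: it wraps one aligned word pair in the same inline marker on both sides of a parallel sentence pair. The function name `add_markdown_pair`, the whitespace tokenization, and the hand-written alignment are assumptions made for illustration only; this is not how the actual training pipeline works.

```python
import random

# Inline markers to inject; chosen for illustration only.
MARKERS = ["*", "**", "_"]


def add_markdown_pair(src, tgt, alignment, rng):
    """Wrap one aligned word pair in the same Markdown marker on both sides.

    src, tgt: whitespace-tokenized sentences.
    alignment: list of (src_index, tgt_index) word pairs. This is a
    hypothetical input, e.g. the output of a word aligner.
    """
    if not alignment:
        # Nothing to anchor the marker to, so leave the pair untouched.
        return src, tgt
    src_tokens = src.split()
    tgt_tokens = tgt.split()
    i, j = rng.choice(alignment)
    marker = rng.choice(MARKERS)
    src_tokens[i] = f"{marker}{src_tokens[i]}{marker}"
    tgt_tokens[j] = f"{marker}{tgt_tokens[j]}{marker}"
    return " ".join(src_tokens), " ".join(tgt_tokens)


rng = random.Random(0)
src, tgt = add_markdown_pair("This is a test", "Dit is een test", [(3, 3)], rng)
print(src)
print(tgt)
```

Block-level syntax (headings, lists, fences) would need a different treatment, since it applies to whole lines rather than to aligned words.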


Example

Test environment

Translations were run with a simplified BergamotWorker script and WASM, without postprocessing
Firefox Translation Models (version 0.3.3) were used as the translation models

Example

Markdown snippet used for testing, covering most aspects of the syntax

--- 
# Markdown test
This is a *test* to see how **well** Bergamot _handles_ the [Markdown](https://www.markdownguide.org/) syntax. 

1. The **bergamot orange**, is a fragrant citrus fruit the size of an orange
2. Has a *yellow* or *green* color similar to a lime, depending on ripeness

- The word bergamot is derived from the Italian word _bergamotto_
	- It is a small tree that blossoms during the winter

\```js
variable = 10
if (variable == "10")
	variable = "10" + 1
\```

> “Beware of bugs in the above code; I have only proved it correct, not tried it.”
> — Donald E. Knuth.
--- 

Some failing examples

French:

  • # turns into -
  • ** disappears or gets changed into a single quote

Dutch:

  • --- turns into -- ---
  • # turns into
  • Numbering gets repeated

German:

  • # turns into -
  • _ turns into "
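
Failures like these can be spotted automatically by comparing how often each marker occurs before and after translation. The following is a rough sketch under stated assumptions: the helper names `count_markers` and `lost_markers` are made up for this example, the marker set is deliberately small, and the "translated" strings are hand-written samples modelled on the failures listed above, not actual model output.

```python
import re
from collections import Counter

# Longer alternatives come first so "**" is not counted as two "*".
MARKER_RE = re.compile(r"\*\*|---|`{3}|[*_#>]")


def count_markers(text):
    """Count occurrences of each Markdown marker in the text."""
    return Counter(MARKER_RE.findall(text))


def lost_markers(source, translated):
    """Return markers that occur less often in the translation than in the source."""
    # Counter subtraction keeps only the positive differences.
    return dict(count_markers(source) - count_markers(translated))


# Hand-written sample pairs (not real translations).
print(lost_markers("# Markdown test", "- Markdown-Test"))
print(lost_markers("a *test* of how **well**", "een test van hoe 'goed'"))
```

This only counts characters, so it will also pick up `_` or `>` in ordinary text and cannot tell whether a surviving marker ended up around the right words.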
