Skip to content

Mengfei-Li/document-equation-migration

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Document Equation Migration

Research-grade tools for converting MathType OLE-embedded equations in Word .docx files into MathML, OMML, and editable Word equations.

This project is Windows-first because the OMML conversion path uses Microsoft Office's MML2OMML.XSL, and the optional PDF validation path uses Word COM automation.

Status

This repository is a research preview, not a guaranteed lossless converter.

The current pipeline is designed for documents whose formulas are stored as MathType OLE objects such as word/embeddings/oleObject1.bin. It can batch-convert many OLE-embedded formulas, but complex formulas should still be reviewed before production use.

What It Does

  • Extracts oleObject*.bin files from a Word .docx.
  • Converts MathType OLE / MTEF content to intermediate XML.
  • Converts the intermediate XML to MathML.
  • Normalizes common MathML defects found in MathType-to-MathML conversion output.
  • Converts MathML to OMML with Office's MML2OMML.XSL.
  • Replaces OLE formula objects in a copy of the original .docx.
  • Produces a LaTeX validation preview and risk classification output.

What It Does Not Promise

  • It does not guarantee pixel-identical layout after conversion.
  • It does not guarantee semantic equivalence for every possible MathType equation.
  • It does not include proprietary or third-party sample documents.
  • It does not vendor a JDK or large third-party runtime binaries.
  • It does not replace legal review for documents that you do not own or cannot redistribute.

Pipeline

DOCX
  -> word/embeddings/oleObject*.bin
  -> MathType / MTEF XML
  -> MathML
  -> normalized MathML
  -> OMML
  -> new DOCX with editable Word math

Requirements

  • Windows.
  • Python 3.11 or newer.
  • Full Java JDK 17 or newer with javac.exe and jdk.charsets. A JRE or stripped runtime image is not enough for first-run compilation or JRuby/Nokogiri extraction.
  • Microsoft Office with MML2OMML.XSL.
  • Optional: pandoc for LaTeX validation previews.
  • Optional: Microsoft Word desktop for PDF export validation.

Python packages:

python -m pip install -r requirements.txt

Optional visual PDF comparison packages:

python -m pip install -r requirements-visual.txt

Prepare third-party converter sources:

powershell -ExecutionPolicy Bypass -File .\scripts\bootstrap_third_party.ps1

The bootstrap script clones:

  • transpect/mathtype-extension
  • jure/mathtype_to_mathml

It also applies the local quality patch in patches/mathtype_to_mathml-quality-fixes.patch. You must comply with the licenses of those projects and their dependencies.

For the full external-tool requirements, known Java charset failure mode, and troubleshooting guidance, see Dependencies.

Quick Start

python -m venv .venv
.\.venv\Scripts\python -m pip install -r requirements.txt
powershell -ExecutionPolicy Bypass -File .\scripts\bootstrap_third_party.ps1

powershell -ExecutionPolicy Bypass -File .\run_docx_open_source_pipeline.ps1 `
  -InputDocx .\input.docx `
  -OutputDir .\out `
  -MathtypeExtensionDir .\third_party\mathtype-extension `
  -MathTypeToMathMlDir .\third_party\mathtype_to_mathml `
  -Mml2OmmlXsl "C:\Program Files\Microsoft Office\root\Office16\MML2OMML.XSL"

If you do not have pandoc installed and only need the converted .docx, add -SkipLatexPreview to the pipeline command.

Main output:

  • out\<input-name>.omml.docx: converted Word document.
  • out\pipeline_summary.txt: conversion counts.
  • out\converted\summary.csv: per-equation conversion summary.
  • out\<input-name>.omml.validation.tex: LaTeX validation preview, unless -SkipLatexPreview is used.
  • out\<input-name>.omml.ole_map.json: mapping between formulas and document context.

Detector-First MVP Core

The repository now also contains an experimental shared core package for source detection and manifest generation:

  • src/document_equation_migration/source_taxonomy.py
  • src/document_equation_migration/manifest.py
  • src/document_equation_migration/container_scan.py
  • src/document_equation_migration/detectors/base.py
  • src/document_equation_migration/detectors/registry.py
  • src/document_equation_migration/cli.py

This shared core does not replace the existing MathType conversion scripts. It establishes a detector-first entry point that inventories formula sources before routing them to source-specific conversion paths.

Install the package locally for development:

python -m pip install -e ".[test]"

Scan a document and write a manifest, routing report, execution plan, plus a human-readable summary:

dem scan .\input.docx --output .\out\manifest.json --routing .\out\routing.json --execution-plan .\out\execution-plan.json --summary .\out\summary.txt

Equivalent module invocation:

python -m document_equation_migration.cli scan .\input.docx --output .\out\manifest.json --routing .\out\routing.json --execution-plan .\out\execution-plan.json

Supported detector-first source families currently include:

  • mathtype-ole
  • omml-native
  • equation-editor-3-ole
  • axmath-ole
  • odf-native
  • libreoffice-transformed

The detector-first CLI identifies formula sources and writes a manifest; it does not yet perform full MathML / OMML / LaTeX conversion for every source family.

routing.json is a document-level route decision artifact. It includes:

  • recommended_sequence: source families ordered by route priority
  • route_plan: next action per source family
  • manual_review_required and manual_review_reasons

execution-plan.json is a converter-oriented plan generated from routing.json. It includes:

  • steps: source-family execution steps with provider name and ordered actions
  • manual_review_required: aggregated gate for downstream validation

Preview the current execution plan without executing any converter commands:

dem run-plan .\out\execution-plan.json --dry-run --output .\out\execution-report.json

execution-report.json is a dry-run executor report. In the current milestone:

  • mathtype and omml expose concrete dry-run command bindings
  • other providers remain explicit manual gates until their executor bindings are added

Run the currently supported execution bindings:

dem run-plan .\out\execution-plan.json --execute --output-dir .\out\execution --output .\out\execution-report.json

In the current milestone:

  • omml can execute a native-preserving execution slice that extracts OMML XML fragments, writes a manifest, performs a deterministic packaging pass, and records execution metadata
  • mathtype is wired to the existing PowerShell/Python document pipeline, but external tools are blocked unless you explicitly pass --allow-external-tools; Word validation remains a separate gate
  • equation3 provides an Equation Editor 3.0 evidence/probe skeleton only; conversion and Word roundtrip stay manual/review gated until fixture coverage is stronger
  • axmath is export-assisted and stays behind external export / validation gates; the project does not claim a native static AxMath parser
  • odf-native can execute a native MathML extraction slice from ODF/FODT content, while libreoffice-transformed remains a bridge provenance review gate
  • render parity, Word opening, and PDF export are still validation gates; an execution report alone is not proof of deliverable Word output

Execute-mode provider outputs are evidence-oriented:

  • each provider output root should contain either validation-evidence.json or blocker-record.json
  • validation-plan.json can exist as a supporting artifact, but it does not replace the evidence/blocker contract on its own
  • validation-gated and review-gated statuses mean the slice produced traceable evidence or a review gate, not that deliverable conversion is complete

Only allow external MathType tools after the dry-run report has been inspected and Java / Office XSL / local script dependencies are ready:

dem run-plan .\out\execution-plan.json --execute --allow-external-tools --output-dir .\out\execution --output .\out\execution-report.json

For MathType live conversion, verify that JAVA_EXE / JAVAC_EXE point to a full JDK and that MML2OMML_XSL points to an Office-provided MML2OMML.XSL. A runtime missing jdk.charsets can fail during extraction with UnsupportedCharsetException: ISO-2022-JP.

Validate a target DOCX and write a reusable validation report artifact:

dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native

If an execute output already wrote execution metadata or validation evidence with a packaged validation target, resolve the DOCX directly from that JSON instead of reconstructing the path manually:

dem validate-docx --target-from-metadata .\out\execution\omml-native\package\execution-metadata.json --output-dir .\out\validation --provider omml --source-family omml-native

For deliverable-oriented Word validation, allow Word PDF export:

dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native --allow-word-export

If you also have a reference PDF and the optional visual dependencies installed, run visual comparison:

dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native --allow-word-export --reference-pdf .\out\reference.pdf --visual-compare

You can tighten or relax the shared visual gate explicitly:

dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native --allow-word-export --reference-pdf .\out\reference.pdf --visual-compare --visual-max-changed-ratio-per-page 0.02 --visual-max-unmatched-pages 0

validation-report.json distinguishes:

  • deliverable-ready: target DOCX exists and Word PDF export passed
  • review-gated: Word PDF export passed and visual compare ran, but the current visual gate threshold was exceeded
  • research-only: structural evidence exists, but Word deliverability was not yet validated
  • blocked: target file is missing, Word export failed, or requested visual comparison failed

Important: visual_compare = passed now means both "the compare pipeline ran" and "the current visual gate thresholds were met". If the compare pipeline runs but page-count mismatch or changed-ratio exceeds threshold, the visual check becomes review-gated instead of passed.

First-Release Review Gate

For the research-preview release, MathType conversion results should be interpreted conservatively:

  • deliverable-ready is an automated candidate only when Word export passes, conversion/replacement counts are complete, and the configured visual gate passes.
  • review-gated can be a manual-review candidate when Word export passes, conversion/replacement counts are complete, source and converted page counts match, unmatched pages are zero, and the visual drift is documented for human review.
  • blocked means the output should not be presented as a usable converted document until the failed or missing gate is resolved.

Current real MathType evidence supports the guarded layout-preservation path as a manual-review candidate, not as a pixel-identical or lossless converter. The guarded layout option remains opt-in because its current factor is sample-derived and requires broader validation.

For a structured statement of the current claim boundary, evidence classes, and manual-review gate, see MathType evidence pack.

Before using a review-gated output in production, review the generated PDF, inspect changed pages, spot-check high-risk formulas, and keep the source document available for comparison.

Run the current test gate:

python -m pytest tests -q

Risk Analysis

After generating summary.csv and an OLE map, classify equations with:

python .\analyze_formula_risks.py `
  .\out\converted\summary.csv `
  .\out\input.omml.ole_map.json `
  .\out\risk_analysis.json `
  .\out\risk_analysis.txt

The categories are:

  • auto_replace: simple formulas that did not trigger known risk rules.
  • spot_check: complex formulas that deserve sampling.
  • manual_review: formulas that match patterns associated with likely conversion defects.

Risk analysis is most useful when LaTeX previews are available, so QA runs should keep LaTeX previews enabled when possible.

Validation

Optional PDF validation requires Microsoft Word desktop:

powershell -ExecutionPolicy Bypass -File .\export_word_pdf.ps1 `
  -InputDocx .\out\input.omml.docx `
  -OutputPdf .\out\converted.pdf

Visual PDF comparison uses PyMuPDF and Pillow:

python -m pip install -r requirements-visual.txt
python .\compare_pdf_visual.py .\original.pdf .\converted.pdf .\out\visual_compare

Documentation

License

This repository's original code is licensed under the MIT License. Third-party tools referenced by this project keep their own licenses.

About

Research-preview toolkit for detecting and migrating document equation sources.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors