Research-grade tools for converting MathType OLE-embedded equations in Word .docx files into MathML, OMML, and editable Word equations.
This project is Windows-first because the OMML conversion path uses Microsoft Office's MML2OMML.XSL, and the optional PDF validation path uses Word COM automation.
This repository is a research preview, not a guaranteed lossless converter.
The current pipeline is designed for documents whose formulas are stored as MathType OLE objects such as word/embeddings/oleObject1.bin. It can batch-convert many OLE-embedded formulas, but complex formulas should still be reviewed before production use.
- Extracts oleObject*.bin files from a Word .docx.
- Converts MathType OLE / MTEF content to intermediate XML.
- Converts the intermediate XML to MathML.
- Normalizes common MathML defects found in MathType-to-MathML conversion output.
- Converts MathML to OMML with Office's MML2OMML.XSL.
- Replaces OLE formula objects in a copy of the original .docx.
- Produces a LaTeX validation preview and risk classification output.
- It does not guarantee pixel-identical layout after conversion.
- It does not guarantee semantic equivalence for every possible MathType equation.
- It does not include proprietary or third-party sample documents.
- It does not vendor a JDK or large third-party runtime binaries.
- It does not replace legal review for documents that you do not own or cannot redistribute.
DOCX
-> word/embeddings/oleObject*.bin
-> MathType / MTEF XML
-> MathML
-> normalized MathML
-> OMML
-> new DOCX with editable Word math
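The first stage of the flow above (DOCX to word/embeddings/oleObject*.bin) can be sketched with the Python standard library, since a .docx file is a ZIP archive. This is a minimal illustration, not the repository's extraction code: the function names `list_ole_members` and `extract_ole_objects` are hypothetical, and the sketch does not check whether an extracted object is actually a MathType equation.

```python
import fnmatch
import zipfile
from pathlib import Path, PurePosixPath

# Archive member pattern taken from the pipeline description above.
OLE_PATTERN = "word/embeddings/oleObject*.bin"


def list_ole_members(names):
    """Return the archive member names that match the OLE pattern, sorted by name."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, OLE_PATTERN))


def extract_ole_objects(docx_path, out_dir):
    """Copy every matching OLE member of the .docx into out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in list_ole_members(archive.namelist()):
            target = out / PurePosixPath(name).name
            target.write_bytes(archive.read(name))
            written.append(target)
    return written
```

Sorting is lexicographic, so oleObject10.bin sorts before oleObject2.bin; anything that depends on document order should read the relationships in the document instead of relying on file names.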
- Windows.
- Python 3.11 or newer.
- Full Java JDK 17 or newer with javac.exe and jdk.charsets. A JRE or stripped runtime image is not enough for first-run compilation or JRuby/Nokogiri extraction.
- Microsoft Office with MML2OMML.XSL.
- Optional: pandoc for LaTeX validation previews.
- Optional: Microsoft Word desktop for PDF export validation.
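A quick preflight check for the requirements above might look like the following. It is only a sketch under stated assumptions: `check_prerequisites` is a hypothetical helper, it assumes javac and pandoc are discoverable on PATH, and the XSL location must be supplied by the caller because it depends on the Office installation.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(xsl_path, which=shutil.which, python_version=sys.version_info):
    """Return a mapping of requirement name to True/False (hypothetical helper).

    `which` and `python_version` are injectable so the check can be tested
    without touching the real environment.
    """
    return {
        "python>=3.11": tuple(python_version[:2]) >= (3, 11),
        "javac": which("javac") is not None,            # a JRE alone has no javac
        "MML2OMML.XSL": Path(xsl_path).is_file(),       # shipped with Microsoft Office
        "pandoc (optional)": which("pandoc") is not None,
    }


if __name__ == "__main__":
    xsl = sys.argv[1] if len(sys.argv) > 1 else "MML2OMML.XSL"
    for name, ok in check_prerequisites(xsl).items():
        print(f"{'ok     ' if ok else 'missing'}  {name}")
```

Finding javac on PATH does not prove that the runtime includes jdk.charsets, so this check complements the dependency guidance rather than replacing it.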
Python packages:
python -m pip install -r requirements.txt
Optional visual PDF comparison packages:
python -m pip install -r requirements-visual.txt
Prepare third-party converter sources:
powershell -ExecutionPolicy Bypass -File .\scripts\bootstrap_third_party.ps1
The bootstrap script clones:
- transpect/mathtype-extension
- jure/mathtype_to_mathml
It also applies the local quality patch in patches/mathtype_to_mathml-quality-fixes.patch. You must comply with the licenses of those projects and their dependencies.
For the full external-tool requirements, known Java charset failure mode, and troubleshooting guidance, see Dependencies.
python -m venv .venv
.\.venv\Scripts\python -m pip install -r requirements.txt
powershell -ExecutionPolicy Bypass -File .\scripts\bootstrap_third_party.ps1
powershell -ExecutionPolicy Bypass -File .\run_docx_open_source_pipeline.ps1 `
-InputDocx .\input.docx `
-OutputDir .\out `
-MathtypeExtensionDir .\third_party\mathtype-extension `
-MathTypeToMathMlDir .\third_party\mathtype_to_mathml `
-Mml2OmmlXsl "C:\Program Files\Microsoft Office\root\Office16\MML2OMML.XSL"
If you do not have pandoc installed and only need the converted .docx, add -SkipLatexPreview to the pipeline command.
Main output:
- out\<input-name>.omml.docx: converted Word document.
- out\pipeline_summary.txt: conversion counts.
- out\converted\summary.csv: per-equation conversion summary.
- out\<input-name>.omml.validation.tex: LaTeX validation preview, unless -SkipLatexPreview is used.
- out\<input-name>.omml.ole_map.json: mapping between formulas and document context.
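To confirm that a run produced the files listed above, a small check along these lines may help. It is a sketch, not part of the repository: `expected_outputs` and `missing_outputs` are hypothetical names, and it assumes that `<input-name>` is the input file name without its .docx extension.

```python
from pathlib import Path


def expected_outputs(out_dir, input_docx, skip_latex_preview=False):
    """Build the output paths listed above for one input document (hypothetical helper)."""
    out = Path(out_dir)
    stem = Path(input_docx).stem  # assumption: <input-name> is the file stem
    paths = {
        "converted_docx": out / f"{stem}.omml.docx",
        "pipeline_summary": out / "pipeline_summary.txt",
        "equation_summary": out / "converted" / "summary.csv",
        "ole_map": out / f"{stem}.omml.ole_map.json",
    }
    if not skip_latex_preview:
        # Only written when -SkipLatexPreview is not used.
        paths["latex_preview"] = out / f"{stem}.omml.validation.tex"
    return paths


def missing_outputs(out_dir, input_docx, skip_latex_preview=False):
    """Return the labels of expected outputs that do not exist on disk."""
    paths = expected_outputs(out_dir, input_docx, skip_latex_preview)
    return [label for label, path in paths.items() if not path.is_file()]
```

For example, `missing_outputs("out", "input.docx")` returns an empty list only when all five files are present.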
The repository now also contains an experimental shared core package for source detection and manifest generation:
- src/document_equation_migration/source_taxonomy.py
- src/document_equation_migration/manifest.py
- src/document_equation_migration/container_scan.py
- src/document_equation_migration/detectors/base.py
- src/document_equation_migration/detectors/registry.py
- src/document_equation_migration/cli.py
This shared core does not replace the existing MathType conversion scripts. It establishes a detector-first entry point that inventories formula sources before routing them to source-specific conversion paths.
Install the package locally for development:
python -m pip install -e ".[test]"
Scan a document and write a manifest, routing report, execution plan, plus a human-readable summary:
dem scan .\input.docx --output .\out\manifest.json --routing .\out\routing.json --execution-plan .\out\execution-plan.json --summary .\out\summary.txt
Equivalent module invocation:
python -m document_equation_migration.cli scan .\input.docx --output .\out\manifest.json --routing .\out\routing.json --execution-plan .\out\execution-plan.json
Supported detector-first source families currently include:
- mathtype-ole
- omml-native
- equation-editor-3-ole
- axmath-ole
- odf-native
- libreoffice-transformed
The detector-first CLI identifies formula sources and writes a manifest; it does not yet perform full MathML / OMML / LaTeX conversion for every source family.
routing.json is a document-level route decision artifact. It includes:
- recommended_sequence: source families ordered by route priority
- route_plan: next action per source family
- manual_review_required and manual_review_reasons
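A routing report with the fields above could be summarized as follows. This is a hedged sketch: the field names come from the list above, but their value types are assumptions (a list for `recommended_sequence`, a boolean for `manual_review_required`, a list of strings for `manual_review_reasons`), and `summarize_routing` is a hypothetical helper rather than part of the `dem` CLI.

```python
import json
import sys


def summarize_routing(routing):
    """Turn a parsed routing.json mapping into human-readable lines (hypothetical helper)."""
    sequence = routing.get("recommended_sequence", [])
    lines = ["Route order: " + (" -> ".join(str(item) for item in sequence) or "(none)")]
    if routing.get("manual_review_required"):
        lines.append("Manual review required:")
        lines.extend(f"  - {reason}" for reason in routing.get("manual_review_reasons", []))
    else:
        lines.append("Manual review not flagged.")
    return lines


if __name__ == "__main__":
    # Usage: python summarize_routing.py out/routing.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        print("\n".join(summarize_routing(json.load(handle))))
```

The sketch leaves `route_plan` out because its structure is not described here.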
execution-plan.json is a converter-oriented plan generated from routing.json. It includes:
- steps: source-family execution steps with provider name and ordered actions
- manual_review_required: aggregated gate for downstream validation
Preview the current execution plan without executing any converter commands:
dem run-plan .\out\execution-plan.json --dry-run --output .\out\execution-report.json
execution-report.json is a dry-run executor report. In the current milestone:
- mathtype and omml expose concrete dry-run command bindings
- other providers remain explicit manual gates until their executor bindings are added
Run the currently supported execution bindings:
dem run-plan .\out\execution-plan.json --execute --output-dir .\out\execution --output .\out\execution-report.json
In the current milestone:
- omml can execute a native-preserving execution slice that extracts OMML XML fragments, writes a manifest, performs a deterministic packaging pass, and records execution metadata
- mathtype is wired to the existing PowerShell/Python document pipeline, but external tools are blocked unless you explicitly pass --allow-external-tools; Word validation remains a separate gate
- equation3 provides an Equation Editor 3.0 evidence/probe skeleton only; conversion and Word roundtrip stay manual/review gated until fixture coverage is stronger
- axmath is export-assisted and stays behind external export / validation gates; the project does not claim a native static AxMath parser
- odf-native can execute a native MathML extraction slice from ODF/FODT content, while libreoffice-transformed remains a bridge provenance review gate
- render parity, Word opening, and PDF export are still validation gates; an execution report alone is not proof of deliverable Word output
Execute-mode provider outputs are evidence-oriented:
- each provider output root should contain either validation-evidence.json or blocker-record.json
- validation-plan.json can exist as a supporting artifact, but it does not replace the evidence/blocker contract on its own
- validation-gated and review-gated statuses mean the slice produced traceable evidence or a review gate, not that deliverable conversion is complete
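The evidence/blocker contract above can be checked mechanically. The sketch below assumes that each provider output root is an immediate subdirectory of the directory passed as --output-dir (for example out\execution\omml-native). That layout is inferred from the example paths in this README, not confirmed, and both function names are hypothetical.

```python
from pathlib import Path

# File names taken from the contract described above.
CONTRACT_FILES = ("validation-evidence.json", "blocker-record.json")


def contract_files_present(provider_root):
    """Return which contract files exist directly inside one provider output root."""
    root = Path(provider_root)
    return [name for name in CONTRACT_FILES if (root / name).is_file()]


def check_execution_root(execution_dir):
    """Map each immediate subdirectory to the contract files found in it.

    An empty list means the provider root violates the contract, even if
    a validation-plan.json is present.
    """
    return {
        child.name: contract_files_present(child)
        for child in sorted(Path(execution_dir).iterdir())
        if child.is_dir()
    }
```

A provider root that holds only validation-plan.json maps to an empty list, which matches the rule that the plan does not satisfy the contract on its own.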
Only allow external MathType tools after the dry-run report has been inspected and Java / Office XSL / local script dependencies are ready:
dem run-plan .\out\execution-plan.json --execute --allow-external-tools --output-dir .\out\execution --output .\out\execution-report.json
For MathType live conversion, verify that JAVA_EXE / JAVAC_EXE point to a full JDK and that MML2OMML_XSL points to an Office-provided MML2OMML.XSL. A runtime missing jdk.charsets can fail during extraction with UnsupportedCharsetException: ISO-2022-JP.
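One way to verify these settings before passing --allow-external-tools is sketched below. It assumes the three environment variables hold full file paths, and it uses the Java launcher's `--list-modules` option (Java 9 and newer) to see whether jdk.charsets is present. The helper names are hypothetical and the check is illustrative, not the project's own gate.

```python
import os
import subprocess
from pathlib import Path


def has_jdk_charsets(list_modules_output):
    """True if `java --list-modules` output names the jdk.charsets module."""
    for line in list_modules_output.splitlines():
        # Lines look like "jdk.charsets@17.0.2"; drop the version suffix.
        if line.strip().split("@")[0] == "jdk.charsets":
            return True
    return False


def check_mathtype_environment(env=os.environ):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for var in ("JAVA_EXE", "JAVAC_EXE", "MML2OMML_XSL"):
        value = env.get(var)
        if not value:
            problems.append(f"{var} is not set")
        elif not Path(value).is_file():
            problems.append(f"{var} does not point to a file: {value}")
    java = env.get("JAVA_EXE")
    if java and Path(java).is_file():
        result = subprocess.run(
            [java, "--list-modules"], capture_output=True, text=True, check=False
        )
        if not has_jdk_charsets(result.stdout):
            problems.append("jdk.charsets is not listed by java --list-modules")
    return problems
```

A missing jdk.charsets module is the condition associated with the UnsupportedCharsetException: ISO-2022-JP failure described above.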
Validate a target DOCX and write a reusable validation report artifact:
dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native
If an execute output already wrote execution metadata or validation evidence with a packaged validation target, resolve the DOCX directly from that JSON instead of reconstructing the path manually:
dem validate-docx --target-from-metadata .\out\execution\omml-native\package\execution-metadata.json --output-dir .\out\validation --provider omml --source-family omml-native
For deliverable-oriented Word validation, allow Word PDF export:
dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native --allow-word-export
If you also have a reference PDF and the optional visual dependencies installed, run visual comparison:
dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native --allow-word-export --reference-pdf .\out\reference.pdf --visual-compare
You can tighten or relax the shared visual gate explicitly:
dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native --allow-word-export --reference-pdf .\out\reference.pdf --visual-compare --visual-max-changed-ratio-per-page 0.02 --visual-max-unmatched-pages 0
validation-report.json distinguishes:
- deliverable-ready: target DOCX exists and Word PDF export passed
- review-gated: Word PDF export passed and visual compare ran, but the current visual gate threshold was exceeded
- research-only: structural evidence exists, but Word deliverability was not yet validated
- blocked: target file is missing, Word export failed, or requested visual comparison failed
Important: visual_compare = passed now means both "the compare pipeline ran" and "the current visual gate thresholds were met". If the compare pipeline runs but page-count mismatch or changed-ratio exceeds threshold, the visual check becomes review-gated instead of passed.
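The status rules above can be read as a small decision procedure. The sketch below is one interpretation, not the project's implementation: the function names and argument shapes are hypothetical, "exceeds threshold" is taken to mean strictly greater than, and a page-count mismatch is modelled through the unmatched-page count.

```python
def visual_gate(changed_ratio_per_page, unmatched_pages, *,
                max_changed_ratio_per_page, max_unmatched_pages):
    """Return 'passed' or 'review-gated' for a compare run that completed."""
    if unmatched_pages > max_unmatched_pages:
        return "review-gated"
    if any(ratio > max_changed_ratio_per_page for ratio in changed_ratio_per_page):
        return "review-gated"
    return "passed"


def report_status(target_exists, word_export, visual_compare=None):
    """Classify a validation run.

    word_export: 'passed', 'failed', or None when Word export was not attempted.
    visual_compare: 'passed', 'review-gated', 'failed', or None when not requested.
    """
    if not target_exists or word_export == "failed" or visual_compare == "failed":
        return "blocked"
    if word_export is None:
        return "research-only"      # structural evidence only
    if visual_compare == "review-gated":
        return "review-gated"       # export passed, visual gate exceeded
    return "deliverable-ready"


if __name__ == "__main__":
    gate = visual_gate([0.01, 0.05], 0,
                       max_changed_ratio_per_page=0.02, max_unmatched_pages=0)
    print(gate, report_status(True, "passed", gate))
```

With the thresholds from the command above (0.02 and 0), a single page whose changed ratio is 0.05 is enough to move the run from passed to review-gated.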
For the research-preview release, MathType conversion results should be interpreted conservatively:
- deliverable-ready is an automated candidate only when Word export passes, conversion/replacement counts are complete, and the configured visual gate passes.
- review-gated can be a manual-review candidate when Word export passes, conversion/replacement counts are complete, source and converted page counts match, unmatched pages are zero, and the visual drift is documented for human review.
- blocked means the output should not be presented as a usable converted document until the failed or missing gate is resolved.
Current real MathType evidence supports the guarded layout-preservation path as a manual-review candidate, not as a pixel-identical or lossless converter. The guarded layout option remains opt-in because its current factor is sample-derived and requires broader validation.
For a structured statement of the current claim boundary, evidence classes, and manual-review gate, see MathType evidence pack.
Before using a review-gated output in production, review the generated PDF, inspect changed pages, spot-check high-risk formulas, and keep the source document available for comparison.
Run the current test gate:
python -m pytest tests -q
After generating summary.csv and an OLE map, classify equations with:
python .\analyze_formula_risks.py `
.\out\converted\summary.csv `
.\out\input.omml.ole_map.json `
.\out\risk_analysis.json `
.\out\risk_analysis.txt
The categories are:
- auto_replace: simple formulas that did not trigger known risk rules.
- spot_check: complex formulas that deserve sampling.
- manual_review: formulas that match patterns associated with likely conversion defects.
Risk analysis is most useful when LaTeX previews are available, so QA runs should keep LaTeX previews enabled when possible.
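The three categories suggest a simple review queue: every manual_review formula, a sample of spot_check formulas, and no auto_replace formulas. The sketch below illustrates that idea only. The schema of risk_analysis.json is not described here, so the input is a plain mapping from a hypothetical equation identifier to its category, and the sampling fraction is an arbitrary example value.

```python
import random


def build_review_queue(categories, spot_check_fraction=0.2, seed=0):
    """Select equations for human review (hypothetical helper).

    categories: mapping of equation id -> 'auto_replace' | 'spot_check' | 'manual_review'.
    Returns all manual_review ids followed by a reproducible sample of spot_check ids.
    """
    manual = sorted(k for k, v in categories.items() if v == "manual_review")
    spot = sorted(k for k, v in categories.items() if v == "spot_check")
    sample_size = 0
    if spot:
        # Always sample at least one spot_check formula when any exist.
        sample_size = min(len(spot), max(1, round(len(spot) * spot_check_fraction)))
    sampled = sorted(random.Random(seed).sample(spot, sample_size))
    return manual + sampled
```

A fixed seed keeps the sample reproducible between QA runs, which makes it easier to compare findings across pipeline versions.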
Optional PDF validation requires Microsoft Word desktop:
powershell -ExecutionPolicy Bypass -File .\export_word_pdf.ps1 `
-InputDocx .\out\input.omml.docx `
-OutputPdf .\out\converted.pdf
Visual PDF comparison uses PyMuPDF and Pillow:
python -m pip install -r requirements-visual.txt
python .\compare_pdf_visual.py .\original.pdf .\converted.pdf .\out\visual_compare
This repository's original code is licensed under the MIT License. Third-party tools referenced by this project keep their own licenses.