HTML mapper enforcement feature and USFM footnote update #785

Open
Fikitti wants to merge 9 commits into dev from 746-we-need-to-optionally-alert-users-that-the-structure-of-the-source-is-not-matched

@Fikitti Fikitti commented Mar 23, 2026

Adds an option to enforce the HTML structure mapper for certain (round-trip) files.

  • Introduced a new utility for comparing HTML structures, including functions to extract HTML skeletons and identify mismatches.
  • Implemented checks for HTML structure mismatches in the export process, providing warnings for discrepancies.
  • Added a user interface component to enable/disable HTML structure enforcement during round-trip exports.
  • Enhanced the CodexCellEditor to display HTML structure errors and allow users to resolve them interactively.
  • Created comprehensive tests for the new HTML structure utilities to ensure reliability and correctness.
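
As a rough illustration of the skeleton comparison described in the list above, here is a minimal sketch. The names `extractHtmlSkeleton` and `findStructureMismatches` are hypothetical and are not taken from this PR; the regex-based tag scan is a simplifying assumption (it ignores comments and attribute values that contain `>`), so the actual utility may work differently.

```typescript
// Hypothetical sketch, not the PR's implementation.
// Reduce an HTML string to its ordered list of tag names,
// e.g. "<p><b>x</b></p>" becomes ["p", "b", "/b", "/p"].
function extractHtmlSkeleton(html: string): string[] {
    const skeleton: string[] = [];
    const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)\b[^>]*?>/g;
    let match: RegExpExecArray | null;
    while ((match = tagPattern.exec(html)) !== null) {
        const name = match[2].toLowerCase();
        skeleton.push(match[1] === "/" ? `/${name}` : name);
    }
    return skeleton;
}

// Count how often each tag appears in a skeleton.
function countTags(skeleton: string[]): Map<string, number> {
    const counts = new Map<string, number>();
    for (const tag of skeleton) {
        counts.set(tag, (counts.get(tag) ?? 0) + 1);
    }
    return counts;
}

// Report tags that the target has fewer of ("missing") or more of ("extra")
// than the source. This compares counts only, not nesting order.
function findStructureMismatches(
    sourceHtml: string,
    targetHtml: string
): { missing: string[]; extra: string[] } {
    const source = countTags(extractHtmlSkeleton(sourceHtml));
    const target = countTags(extractHtmlSkeleton(targetHtml));
    const missing: string[] = [];
    const extra: string[] = [];
    source.forEach((count, tag) => {
        if ((target.get(tag) ?? 0) < count) missing.push(tag);
    });
    target.forEach((count, tag) => {
        if ((source.get(tag) ?? 0) < count) extra.push(tag);
    });
    return { missing, extra };
}

const result = findStructureMismatches(
    "<p><b>In the beginning</b> God</p>",
    "<p>Au commencement Dieu</p>"
);
console.log(result.missing); // [ 'b', '/b' ]
```

A count-based comparison like this is cheap and tolerant of reordered inline tags; a stricter check could compare the two skeleton arrays position by position instead.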

Also reworks how footnotes are handled when importing and exporting USFM files, and how the footnote's HTML is preserved inside the cell.

Should resolve issue #751 with the ETEN Bible translations for future imports and translations. Existing projects that are reportedly broken would probably still need to be updated or migrated.


Note

Medium Risk
Adds new structure validation and LLM-assisted auto-fix flows that can affect round-trip export correctness and editor saves, plus substantial changes to USFM inline footnote parsing/export.

Overview
Adds an optional enforceHtmlStructure flag for round-trip content that compares translated cell HTML tag structure against the source and surfaces mismatches.

When enabled, the editor highlights mismatched cells and offers a Resolve action that calls an LLM to insert missing structural elements; project export also runs a preflight check for rebuild-export and prompts before continuing, with a UI popup summary.
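
The export preflight described above could look roughly like the following sketch. Only the `enforceHtmlStructure` flag name comes from this PR; the `RoundTripCell` shape, `tagSequence`, and `preflightStructureCheck` are hypothetical names introduced for illustration, and the real check may compare structure differently.

```typescript
// Hypothetical sketch of a rebuild-export preflight; not the PR's code.
interface RoundTripCell {
    id: string;
    sourceHtml: string;
    translatedHtml: string;
}

interface PreflightResult {
    ok: boolean;
    mismatchedCellIds: string[];
    summary: string;
}

// Flatten HTML into a space-separated sequence of tag names ("p b /b /p").
function tagSequence(html: string): string {
    const tags = html.match(/<\/?[a-zA-Z][a-zA-Z0-9-]*/g) ?? [];
    return tags.map((tag) => tag.slice(1).toLowerCase()).join(" ");
}

// When enforcement is off, every export passes. When it is on, collect the
// cells whose translated tag sequence differs from the source so the caller
// can prompt the user before continuing.
function preflightStructureCheck(
    cells: RoundTripCell[],
    enforceHtmlStructure: boolean
): PreflightResult {
    if (!enforceHtmlStructure) {
        return { ok: true, mismatchedCellIds: [], summary: "" };
    }
    const mismatchedCellIds = cells
        .filter((cell) => tagSequence(cell.sourceHtml) !== tagSequence(cell.translatedHtml))
        .map((cell) => cell.id);
    const ok = mismatchedCellIds.length === 0;
    const summary = ok
        ? ""
        : `${mismatchedCellIds.length} cell(s) do not match the source HTML structure: ${mismatchedCellIds.join(", ")}`;
    return { ok, mismatchedCellIds, summary };
}
```

Returning a result object rather than throwing leaves the decision to continue with the caller, which matches the "prompts before continuing" behavior in the summary.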

Updates USFM inline footnote handling to use a bracket-based representation and improves export stripping/decoding of bracketed markers (including legacy/unescaped cases), with new tests; also promotes the DOCX round-trip importer/exporter paths from experiment/ to the main docx/ implementation and updates integration tests accordingly.

Written by Cursor Bugbot for commit aa97e7c. This will update automatically on new commits.

Fikitti and others added 7 commits March 14, 2026 18:10
Refactor mergeOriginalFilesHashes function and add comprehensive unit tests

- Changed the mergeOriginalFilesHashes function to be exported for external use.
- Introduced a new test suite for mergeOriginalFilesHashes, covering various scenarios including handling undefined inputs, merging entries with different hashes, and deduplicating referencedBy and originalNames.
- Ensured that version numbers are correctly managed and that fileNameToHash is built accurately from merged entries.
…-mergeoriginalfileshashes

Added a resolver test for merging Original file hashes
Add option to enforce html structure mapper for certain (round-trip) files.
- Introduced a new utility for comparing HTML structures, including functions to extract HTML skeletons and identify mismatches.
- Implemented checks for HTML structure mismatches in the export process, providing warnings for discrepancies.
- Added a user interface component to enable/disable HTML structure enforcement during round-trip exports.
- Enhanced the CodexCellEditor to display HTML structure errors and allow users to resolve them interactively.
- Created comprehensive tests for the new HTML structure utilities to ensure reliability and correctness.

Also redefined the way we handle footnotes with USFM file importing and exporting and also the HTML preservation of the actual footnote inside the cell.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Circular import between index.tsx and index.ts in same directory
    • I removed the unused self-referential import and made the re-export path explicit to ./index.ts so resolution cannot loop back to index.tsx.
  • ✅ Fixed: Bracket-stripping regex may corrupt non-footnote USFM content
    • I replaced the broad bracket-stripping regex with a guarded helper that only unwraps bracketed segments matching expected USFM marker patterns.

Or push these changes by commenting:

@cursor push accb09f3e8
Preview (accb09f3e8)
diff --git a/src/exportHandler/usfmExporter.ts b/src/exportHandler/usfmExporter.ts
--- a/src/exportHandler/usfmExporter.ts
+++ b/src/exportHandler/usfmExporter.ts
@@ -109,6 +109,27 @@
     return bookCodeToName[upperCode] || bookCode;
 }
 
+const BRACKETED_USFM_MARKER_PATTERN = /\\(?:\+?[a-z0-9]+)\*?/i;
+const BRACKETED_USFM_START_PATTERN = /^\s*\\/;
+const BRACKETED_USFM_END_PATTERN =
+    /^[\d:;,\-.\s]+\s*\\(?:\+?[a-z0-9]+)\*?\s*$/i;
+
+function stripBracketedUsfmMarkers(content: string): string {
+    return content.replace(/<([^>]+)>/g, (match, innerContent: string) => {
+        const normalizedContent = innerContent.trim();
+        if (!BRACKETED_USFM_MARKER_PATTERN.test(normalizedContent)) {
+            return match;
+        }
+        if (
+            BRACKETED_USFM_START_PATTERN.test(normalizedContent) ||
+            BRACKETED_USFM_END_PATTERN.test(normalizedContent)
+        ) {
+            return innerContent;
+        }
+        return match;
+    });
+}
+
 function convertHtmlToUsfm(html: string): string {
     if (!html) return "";
 
@@ -145,7 +166,7 @@
     });
 
     // Strip bracket-format footnotes (literal angle brackets) before HTML tag cleanup
-    content = content.replace(/<([^>]*\\[^>]*)>/g, "$1");
+    content = stripBracketedUsfmMarkers(content);
 
     content = content.replace(/<[^>]*>/g, "");
     content = content.replace(/&nbsp;/g, " ");
@@ -156,7 +177,7 @@
     content = content.replace(/&apos;/g, "'");
 
     // Strip entity-encoded bracket-format footnotes too
-    content = content.replace(/<([^>]*\\[^>]*)>/g, "$1");
+    content = stripBracketedUsfmMarkers(content);
 
     return content;
 }

diff --git a/webviews/codex-webviews/src/NewSourceUploader/importers/docx/index.tsx b/webviews/codex-webviews/src/NewSourceUploader/importers/docx/index.tsx
--- a/webviews/codex-webviews/src/NewSourceUploader/importers/docx/index.tsx
+++ b/webviews/codex-webviews/src/NewSourceUploader/importers/docx/index.tsx
@@ -1,10 +1,9 @@
 import { ImporterPlugin } from "../../types/plugin";
 import { FileText } from "lucide-react";
 import { DocxImporterForm } from "./DocxImporterForm";
-import { validateFile, parseFile } from "./index";
 
 // Re-export for convenience
-export { validateFile, parseFile, docxImporter } from "./index";
+export { validateFile, parseFile, docxImporter } from "./index.ts";
 
 export const docxRoundtripImporterPlugin: ImporterPlugin = {
     id: "docx",

// Re-export the parsing functions from the existing index.ts
export { validateFile, parseFile } from "./index";
// Re-export for convenience
export { validateFile, parseFile, docxImporter } from "./index";

Circular import between index.tsx and index.ts in same directory

Medium Severity

index.tsx imports from "./index" on line 4 and re-exports from "./index" on line 7. Since both index.ts and index.tsx coexist in the same directory, module resolution depends on bundler configuration. The webpack config for the extension bundle resolves [".ts", ".js", ".mjs"] (no .tsx), but the webview Vite/esbuild bundler may resolve ./index to index.tsx itself, creating a circular import. The line 4 import is also unused — only the re-export on line 7 matters.

});

// Strip bracket-format footnotes (literal angle brackets) before HTML tag cleanup
content = content.replace(/<([^>]*\\[^>]*)>/g, "$1");

Bracket-stripping regex may corrupt non-footnote USFM content

Medium Severity

The regex /<([^>]*\\[^>]*)>/g at line 148 strips angle brackets from any tag-like content containing a backslash. This runs on all USFM exports, not just those using the new bracket-format footnotes. If any content happens to contain a backslash inside angle brackets for other reasons (e.g., user-entered content, or edge cases from other importers), this regex will silently corrupt it by removing the surrounding brackets.
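
To make the concern concrete, the sketch below applies the same regex to two inputs. The sample strings and the `stripBroad` wrapper are invented for illustration; only the pattern itself is quoted from the reviewed code.

```typescript
// The pattern under review: any <...> span that contains a backslash.
const BROAD_BRACKET_PATTERN = /<([^>]*\\[^>]*)>/g;

// Hypothetical wrapper that mirrors the reviewed replace call.
function stripBroad(content: string): string {
    return content.replace(BROAD_BRACKET_PATTERN, "$1");
}

// Intended case: a bracketed USFM footnote is unwrapped.
const footnote = "In the beginning<\\f + \\ft A note\\f*> God";
console.log(stripBroad(footnote));
// In the beginning\f + \ft A note\f* God

// Unintended case: unrelated bracketed text that happens to contain
// a backslash silently loses its brackets as well.
const unrelated = "see <C:\\temp> for logs";
console.log(stripBroad(unrelated));
// see C:\temp for logs

// Ordinary HTML tags contain no backslash, so they are left alone.
console.log(stripBroad("<b>bold</b>"));
// <b>bold</b>
```

The second case is the silent corruption this comment warns about: nothing distinguishes a footnote marker from any other backslash inside angle brackets.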

@Fikitti Fikitti changed the base branch from main to dev March 24, 2026 13:52
…into 746-we-need-to-optionally-alert-users-that-the-structure-of-the-source-is-not-matched

# Conflicts:
#	src/activationHelpers/contextAware/contentIndexes/indexes/sqliteIndex.ts
@BenjaminScholtens BenjaminScholtens changed the base branch from dev-old to dev March 27, 2026 16:32
Development

Successfully merging this pull request may close these issues.

We need to optionally alert users that the structure of the source is not matched

3 participants