HTML mapper enforcement feature and USFM footnote update#785
Refactor mergeOriginalFilesHashes function and add comprehensive unit tests
- Changed the mergeOriginalFilesHashes function to be exported for external use.
- Introduced a new test suite for mergeOriginalFilesHashes, covering various scenarios including handling undefined inputs, merging entries with different hashes, and deduplicating referencedBy and originalNames.
- Ensured that version numbers are correctly managed and that fileNameToHash is built accurately from merged entries.
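To make the scenarios in that test suite concrete, here is a minimal sketch of the kind of merge being described. Every type, field layout, and function name below (`HashRegistry`, `OriginalFileEntry`, `mergeHashRegistriesSketch`, taking the larger version number) is an assumption for illustration; the real mergeOriginalFilesHashes signature and data shape are not shown in this PR conversation and may differ.

```typescript
// Hypothetical shapes: only the field names referencedBy, originalNames,
// version and fileNameToHash come from the commit message above.
interface OriginalFileEntry {
    hash: string;
    fileName: string;
    originalNames: string[];
    referencedBy: string[];
}

interface HashRegistry {
    version: number;
    files: Record<string, OriginalFileEntry>; // keyed by content hash (assumed)
    fileNameToHash: Record<string, string>;
}

// Deduplicate while keeping first-seen order.
const unique = (values: string[]): string[] => Array.from(new Set(values));

function mergeHashRegistriesSketch(a?: HashRegistry, b?: HashRegistry): HashRegistry {
    const files: Record<string, OriginalFileEntry> = {};

    // Undefined inputs are treated as empty registries.
    for (const registry of [a, b]) {
        for (const [hash, entry] of Object.entries(registry?.files ?? {})) {
            const existing = files[hash];
            files[hash] = {
                ...(existing ?? entry),
                originalNames: unique([...(existing?.originalNames ?? []), ...entry.originalNames]),
                referencedBy: unique([...(existing?.referencedBy ?? []), ...entry.referencedBy]),
            };
        }
    }

    // Rebuild the reverse lookup from the merged entries rather than trusting either input.
    const fileNameToHash: Record<string, string> = {};
    for (const [hash, entry] of Object.entries(files)) {
        fileNameToHash[entry.fileName] = hash;
    }

    // Assumption: the merged registry keeps the larger of the two version numbers.
    const version = Math.max(a?.version ?? 1, b?.version ?? 1);
    return { version, files, fileNameToHash };
}
```

Entries with different hashes are simply kept side by side, while entries sharing a hash have their list fields unioned.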
…-mergeoriginalfileshashes Added a resolver test for merging Original file hashes
Add option to enforce HTML structure mapper for certain (round-trip) files.
- Introduced a new utility for comparing HTML structures, including functions to extract HTML skeletons and identify mismatches.
- Implemented checks for HTML structure mismatches in the export process, providing warnings for discrepancies.
- Added a user interface component to enable/disable HTML structure enforcement during round-trip exports.
- Enhanced the CodexCellEditor to display HTML structure errors and allow users to resolve them interactively.
- Created comprehensive tests for the new HTML structure utilities to ensure reliability and correctness.
Also reworked how footnotes are handled during USFM import and export, and how the footnote's HTML is preserved inside the cell.
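As a rough illustration of what "extracting an HTML skeleton and identifying mismatches" can look like, the sketch below reduces each cell's HTML to its ordered tag sequence and reports the first position where source and translation diverge. The names `extractHtmlSkeleton`, `findFirstMismatch`, and `SkeletonMismatch` are hypothetical, and the regex-based tag scan is a simplification (it would misread attribute values that contain `>`); the utility added in this PR may work differently.

```typescript
// Hypothetical helper names; not taken from the PR's actual utility.

/** Void elements have no closing tag, so a stray closing form is ignored. */
const VOID_TAGS = new Set(["br", "hr", "img", "input", "meta", "link", "wbr"]);

/** Reduce HTML to its ordered tag sequence, dropping text and attributes. */
function extractHtmlSkeleton(html: string): string[] {
    const skeleton: string[] = [];
    const tagPattern = /<\s*(\/?)\s*([a-zA-Z][a-zA-Z0-9-]*)[^>]*>/g;
    let match: RegExpExecArray | null;
    while ((match = tagPattern.exec(html)) !== null) {
        const isClosing = match[1] === "/";
        const name = match[2].toLowerCase();
        if (isClosing && VOID_TAGS.has(name)) continue;
        skeleton.push(isClosing ? `</${name}>` : `<${name}>`);
    }
    return skeleton;
}

interface SkeletonMismatch {
    index: number;                 // position in the tag sequence
    expected: string | undefined;  // tag from the source cell
    actual: string | undefined;    // tag from the translated cell
}

/** Return the first structural difference, or null when the skeletons match. */
function findFirstMismatch(sourceHtml: string, targetHtml: string): SkeletonMismatch | null {
    const expected = extractHtmlSkeleton(sourceHtml);
    const actual = extractHtmlSkeleton(targetHtml);
    const length = Math.max(expected.length, actual.length);
    for (let i = 0; i < length; i++) {
        if (expected[i] !== actual[i]) {
            return { index: i, expected: expected[i], actual: actual[i] };
        }
    }
    return null;
}
```

For example, comparing `<p class="a">Hello <b>world</b></p>` against `<p>Hola mundo</p>` reports a mismatch at index 1, where the source has `<b>` and the translation already closes the paragraph.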
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Circular import between index.tsx and index.ts in same directory
- I removed the unused self-referential import and made the re-export path explicit to ./index.ts so resolution cannot loop back to index.tsx.
- ✅ Fixed: Bracket-stripping regex may corrupt non-footnote USFM content
- I replaced the broad bracket-stripping regex with a guarded helper that only unwraps bracketed segments matching expected USFM marker patterns.
Or push these changes by commenting:
@cursor push accb09f3e8
Preview (accb09f3e8)
diff --git a/src/exportHandler/usfmExporter.ts b/src/exportHandler/usfmExporter.ts
--- a/src/exportHandler/usfmExporter.ts
+++ b/src/exportHandler/usfmExporter.ts
@@ -109,6 +109,27 @@
return bookCodeToName[upperCode] || bookCode;
}
+const BRACKETED_USFM_MARKER_PATTERN = /\\(?:\+?[a-z0-9]+)\*?/i;
+const BRACKETED_USFM_START_PATTERN = /^\s*\\/;
+const BRACKETED_USFM_END_PATTERN =
+ /^[\d:;,\-.\s]+\s*\\(?:\+?[a-z0-9]+)\*?\s*$/i;
+
+function stripBracketedUsfmMarkers(content: string): string {
+ return content.replace(/<([^>]+)>/g, (match, innerContent: string) => {
+ const normalizedContent = innerContent.trim();
+ if (!BRACKETED_USFM_MARKER_PATTERN.test(normalizedContent)) {
+ return match;
+ }
+ if (
+ BRACKETED_USFM_START_PATTERN.test(normalizedContent) ||
+ BRACKETED_USFM_END_PATTERN.test(normalizedContent)
+ ) {
+ return innerContent;
+ }
+ return match;
+ });
+}
+
function convertHtmlToUsfm(html: string): string {
if (!html) return "";
@@ -145,7 +166,7 @@
});
// Strip bracket-format footnotes (literal angle brackets) before HTML tag cleanup
- content = content.replace(/<([^>]*\\[^>]*)>/g, "$1");
+ content = stripBracketedUsfmMarkers(content);
content = content.replace(/<[^>]*>/g, "");
content = content.replace(/ /g, " ");
@@ -156,7 +177,7 @@
content = content.replace(/'/g, "'");
// Strip entity-encoded bracket-format footnotes too
- content = content.replace(/<([^>]*\\[^>]*)>/g, "$1");
+ content = stripBracketedUsfmMarkers(content);
return content;
}
diff --git a/webviews/codex-webviews/src/NewSourceUploader/importers/docx/index.tsx b/webviews/codex-webviews/src/NewSourceUploader/importers/docx/index.tsx
--- a/webviews/codex-webviews/src/NewSourceUploader/importers/docx/index.tsx
+++ b/webviews/codex-webviews/src/NewSourceUploader/importers/docx/index.tsx
@@ -1,10 +1,9 @@
import { ImporterPlugin } from "../../types/plugin";
import { FileText } from "lucide-react";
import { DocxImporterForm } from "./DocxImporterForm";
-import { validateFile, parseFile } from "./index";
// Re-export for convenience
-export { validateFile, parseFile, docxImporter } from "./index";
+export { validateFile, parseFile, docxImporter } from "./index.ts";
export const docxRoundtripImporterPlugin: ImporterPlugin = {
id: "docx",
// Re-export the parsing functions from the existing index.ts
export { validateFile, parseFile } from "./index";
// Re-export for convenience
export { validateFile, parseFile, docxImporter } from "./index";
Circular import between index.tsx and index.ts in same directory
Medium Severity
index.tsx imports from "./index" on line 4 and re-exports from "./index" on line 7. Since both index.ts and index.tsx coexist in the same directory, module resolution depends on bundler configuration. The webpack config for the extension bundle resolves [".ts", ".js", ".mjs"] (no .tsx), but the webview Vite/esbuild bundler may resolve ./index to index.tsx itself, creating a circular import. The line 4 import is also unused — only the re-export on line 7 matters.
});
// Strip bracket-format footnotes (literal angle brackets) before HTML tag cleanup
content = content.replace(/<([^>]*\\[^>]*)>/g, "$1");
Bracket-stripping regex may corrupt non-footnote USFM content
Medium Severity
The regex /<([^>]*\\[^>]*)>/g at line 148 strips angle brackets from any tag-like content containing a backslash. This runs on all USFM exports, not just those using the new bracket-format footnotes. If any content happens to contain a backslash inside angle brackets for other reasons (e.g., user-entered content, or edge cases from other importers), this regex will silently corrupt it by removing the surrounding brackets.
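The difference between the broad regex and the guarded helper from the autofix diff can be seen on two sample strings. The patterns and `stripBracketedUsfmMarkers` are copied from the diff in this conversation; `broadStrip` is just a hypothetical wrapper around the original one-liner, and both sample inputs are invented for illustration rather than taken from real project data.

```typescript
// Patterns and helper as they appear in the Bugbot autofix diff.
const BRACKETED_USFM_MARKER_PATTERN = /\\(?:\+?[a-z0-9]+)\*?/i;
const BRACKETED_USFM_START_PATTERN = /^\s*\\/;
const BRACKETED_USFM_END_PATTERN = /^[\d:;,\-.\s]+\s*\\(?:\+?[a-z0-9]+)\*?\s*$/i;

function stripBracketedUsfmMarkers(content: string): string {
    return content.replace(/<([^>]+)>/g, (match, innerContent: string) => {
        const normalizedContent = innerContent.trim();
        // No USFM-looking marker at all: leave the bracketed segment alone.
        if (!BRACKETED_USFM_MARKER_PATTERN.test(normalizedContent)) {
            return match;
        }
        // Unwrap only when the segment starts with a marker, or is a
        // reference-like prefix followed by a single marker.
        if (
            BRACKETED_USFM_START_PATTERN.test(normalizedContent) ||
            BRACKETED_USFM_END_PATTERN.test(normalizedContent)
        ) {
            return innerContent;
        }
        return match;
    });
}

// Hypothetical wrapper around the original broad regex, for comparison only.
const broadStrip = (content: string): string =>
    content.replace(/<([^>]*\\[^>]*)>/g, "$1");

// Invented sample inputs (note the escaped backslashes in the string literals).
const footnoteLike = "Text <\\f + \\fr 1:1 \\ft note \\f*> more";
const unrelated = "path <C:\\temp> here";

console.log(broadStrip(footnoteLike));                // brackets removed
console.log(stripBracketedUsfmMarkers(footnoteLike)); // brackets removed
console.log(broadStrip(unrelated));                   // brackets removed (the reported risk)
console.log(stripBracketedUsfmMarkers(unrelated));    // brackets preserved
```

Both versions unwrap the footnote-like segment, but only the broad regex also unwraps `<C:\temp>`, which is the silent corruption the review comment warns about.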
…into 746-we-need-to-optionally-alert-users-that-the-structure-of-the-source-is-not-matched # Conflicts: # src/activationHelpers/contextAware/contentIndexes/indexes/sqliteIndex.ts



Add option to enforce html structure mapper for certain (round-trip) files.
Also reworked how footnotes are handled during USFM import and export, and how the footnote's HTML is preserved inside the cell.
Should resolve issue #751 with ETEN Bible translations for future imports and translations, but I think we would still have to update or migrate their existing content that is reportedly broken.
Note
Medium Risk
Adds new structure validation and LLM-assisted auto-fix flows that can affect round-trip export correctness and editor saves, plus substantial changes to USFM inline footnote parsing/export.
Overview
Adds an optional enforceHtmlStructure flag for round-trip content that compares translated cell HTML tag structure against the source and surfaces mismatches.
When enabled, the editor highlights mismatched cells and offers a Resolve action that calls an LLM to insert missing structural elements; project export also runs a preflight check for rebuild-export and prompts before continuing, with a UI popup summary.
Updates USFM inline footnote handling to use a bracket-based representation and improves export stripping/decoding of bracketed markers (including legacy/unescaped cases), with new tests; also promotes the DOCX round-trip importer/exporter paths from experiment/ to the main docx/ implementation and updates integration tests accordingly.
Written by Cursor Bugbot for commit aa97e7c. This will update automatically on new commits.