Group orphan raw-HTML opener/closer blocks around their markdown content #16

Merged
srid merged 8 commits into master from group-orphan-rawhtml-blocks
Apr 29, 2026

srid commented Apr 29, 2026

Pandoc splits CommonMark "type 6" raw-HTML blocks at the next blank line, so markdown like

<details>

**bold** content

</details>

reaches the renderer as three separate blocks: an opener RawBlock carrying "<details>\n", a Para, and a closer RawBlock carrying "</details>\n". Today's rawNode wraps each raw blob in its own <rawhtml> element to keep xmlhtml's parser from mangling the bytes; the side effect is that the <details> open and close tags get trapped inside their wrappers, so the markdown paragraph ends up a sibling of the (now empty) details element rather than its child. See srid/emanote#433.

This PR adds a new Heist.Extra.Splices.Pandoc.RawHtmlGroup module exporting one function:

groupRawHtmlBlocks :: [B.Block] -> [B.Block]

It walks a block list and, when it sees an unbalanced opening tag followed downstream by a matching closing tag (depth counted only against opens of the same tag name), replaces that span with a B.Div carrying the tag in a "tag" directive attribute. The renderer's Div arm already turns Div with a "tag" attr into the named element via divTag, and now also calls stripTagDirective so the directive doesn't survive into serialised HTML as a literal tag="…". renderPandocWith applies the pass via Text.Pandoc.Walk.walk, so nested block lists (BlockQuote, list items, etc.) are covered too.

How the AST changes

Before: [Para "aaaa", RawBlock "<details>\n", Para "bbb",
         RawBlock "</details>\n", Para "eee"]

After:  [Para "aaaa",
         Div ("",[],[("tag","details")]) [Para "bbb"],
         Para "eee"]
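
The rewrite above can be sketched in a few lines. This is a minimal illustration, not the code from the PR: `Block` and `Attr` are simplified stand-ins for the pandoc-types definitions, `bareTag` is a deliberately naive recogniser (no attributes, case-sensitive), and recursing into the grouped span is an assumption made here so same-tag nesting shows up in the demo.

```haskell
import Data.Char (isAlphaNum, isSpace)
import Data.List (dropWhileEnd, stripPrefix)

-- Simplified stand-ins for pandoc-types' Attr and Block (assumption).
type Attr = (String, [String], [(String, String)])

data Block
  = Para String
  | RawBlock String
  | Div Attr [Block]
  deriving (Eq, Show)

-- Naive recogniser, just enough for the demo: with prefix "<" it accepts
-- "<name>", with prefix "</" it accepts "</name>"; trailing whitespace is
-- ignored, attributes are not supported, matching is case-sensitive.
bareTag :: String -> Block -> Maybe String
bareTag prefix (RawBlock raw) =
  case stripPrefix prefix (dropWhileEnd isSpace raw) of
    Just rest
      | length rest > 1, last rest == '>', all isNameChar (init rest) ->
          Just (init rest)
    _ -> Nothing
  where
    isNameChar c = isAlphaNum c || c == '-'
bareTag _ _ = Nothing

-- Replace each opener..closer span with a Div carrying the "tag" directive.
-- Openers without a matching closer are left untouched.
groupBlocks :: [Block] -> [Block]
groupBlocks [] = []
groupBlocks (b : rest) =
  case bareTag "<" b of
    Just tag
      | Just (inner, after) <- splitAtCloser tag rest ->
          Div ("", [], [("tag", tag)]) (groupBlocks inner) : groupBlocks after
    _ -> b : groupBlocks rest

-- Find the closer for an already-consumed opener, counting depth only
-- against openers of the same tag name.
splitAtCloser :: String -> [Block] -> Maybe ([Block], [Block])
splitAtCloser tag = go (0 :: Int) []
  where
    go _ _ [] = Nothing
    go depth acc (b : bs)
      | isTag "</" b =
          if depth == 0
            then Just (reverse acc, bs)
            else go (depth - 1) (b : acc) bs
      | isTag "<" b = go (depth + 1) (b : acc) bs
      | otherwise = go depth (b : acc) bs
    isTag prefix b = bareTag prefix b == Just tag

main :: IO ()
main = print (groupBlocks before == after)
  where
    before =
      [ Para "aaaa", RawBlock "<details>\n", Para "bbb"
      , RawBlock "</details>\n", Para "eee" ]
    after =
      [ Para "aaaa"
      , Div ("", [], [("tag", "details")]) [Para "bbb"]
      , Para "eee" ]
```

The accumulator is built in reverse and flipped once at the closer, which keeps the scan linear in the number of blocks.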

Decisions worth surfacing

| Question | Decision |
| --- | --- |
| Tags without a matching closer | Left as raw blocks; emitting a synthetic close would change the input's semantics |
| Tag names | Compared case-insensitively to follow HTML's own rules |
| Opener attributes (e.g. `<details open>`) | Accepted by the parser, dropped on the produced Div; deliberate scope until a real case demands otherwise |
| Self-closing forms (`<br />`) | Rejected by the opener parser so void elements never start a group |
| Single-block balanced fragments (`<span>x</span>`) | Left alone; the renderer already handles those correctly |
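
To make these decisions concrete, here is a hedged sketch of an opener recogniser that follows them. The name `openerTag` appears in the commit messages, but this body is an assumption written for illustration (on `String` rather than `Text`), and it does not handle attribute values that themselves contain `>`.

```haskell
import Data.Char (isAlphaNum, isSpace, toLower)
import Data.List (dropWhileEnd, stripPrefix)

-- Recognise a raw blob that is exactly one opening tag and return the
-- lower-cased tag name. Attributes are tolerated but discarded.
openerTag :: String -> Maybe String
openerTag raw = do
  rest <- stripPrefix "<" raw
  let (name, afterName) = span isNameChar rest
      (attrs, fromGt) = break (== '>') afterName
  trailing <- stripPrefix ">" fromGt -- a missing '>' fails here
  if null name || not (attrsOk attrs) || not (all isSpace trailing)
    then Nothing
    else Just (map toLower name)
  where
    isNameChar c = isAlphaNum c || c == '-'
    -- Attributes must be separated from the name by whitespace, and the
    -- tag must not end in '/', which would make it self-closing.
    attrsOk attrs = startsOk attrs && not (endsWithSlash attrs)
    startsOk [] = True
    startsOk (c : _) = isSpace c
    endsWithSlash attrs =
      case dropWhileEnd isSpace attrs of
        [] -> False
        trimmed -> last trimmed == '/'

main :: IO ()
main = mapM_ report samples
  where
    report s = putStrLn (show s ++ " -> " ++ show (openerTag s))
    samples =
      [ "<details>\n"      -- plain opener
      , "<DETAILS>\n"      -- case-insensitive name
      , "<details open>\n" -- attributes accepted, then dropped
      , "<my-widget>\n"    -- hyphenated custom element
      , "<br />\n"         -- self-closing: rejected
      , "<span>x</span>\n" -- balanced in one block: rejected
      , "</details>\n"     -- a closer is not an opener
      ]
```

Returning only the tag name is what makes the "attributes dropped" row of the table visible in the types: nothing downstream could preserve them without changing this signature.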

Where the volatility lives

RawHtmlGroup owns the directive scheme end-to-end: tagDirectiveKey, divTag, and stripTagDirective all live there, and Render imports them as a consumer. Future evolution (alternative wrapping strategies, attribute preservation, void-element awareness, smarter nesting heuristics) starts in RawHtmlGroup so the renderer's interface stays stable.
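
A rough sketch of that directive scheme follows. The three names come from the PR; the bodies are assumptions reconstructed from the commit messages (`"div"` baked in as the default, plain `lookup` on the attribute list), and `Attr` is a `String`-based stand-in for `B.Attr`.

```haskell
import Data.Maybe (fromMaybe)

-- Stand-in for pandoc-types' Attr: (identifier, classes, key-value pairs).
type Attr = (String, [String], [(String, String)])

-- Single source of truth for the directive key.
tagDirectiveKey :: String
tagDirectiveKey = "tag"

-- Element name to render a Div as: the directive's value, else "div".
divTag :: Attr -> String
divTag (_, _, kvs) = fromMaybe "div" (lookup tagDirectiveKey kvs)

-- Remove the directive so it is not serialised as a literal tag="..."
-- attribute on the rendered element.
stripTagDirective :: Attr -> Attr
stripTagDirective (ident, classes, kvs) =
  (ident, classes, filter ((/= tagDirectiveKey) . fst) kvs)

main :: IO ()
main = do
  let attr = ("intro", ["note"], [("tag", "details"), ("data-x", "1")])
  putStrLn (divTag attr)
  print (stripTagDirective attr)
  putStrLn (divTag ("", [], []))
```

A renderer arm would call `divTag` to pick the element name and pass `stripTagDirective attr` to whatever serialises the remaining attributes.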

Test coverage

13 unit tests pin the AST transformation: the issue's exact example, empty-group case, orphan opener/closer, consecutive pairs, same-tag nesting via depth counting, self-closing rejection, balanced-in-one-block rejection, case-insensitive matching, attribute tolerance, hyphenated custom-element names, mismatched-tag orphan, and the malformed-closer (missing >) regression. One end-to-end integration test through renderPandocWith asserts the markdown paragraph lands inside <details> with no <rawhtml> wrapper. A further regression test pins that the tag directive doesn't leak as a literal HTML attribute.

Closes srid/emanote#433. Emanote PR bumping the pin: srid/emanote#686.

Generated by /do on Claude Code (model claude-opus-4-7).

srid added 8 commits April 28, 2026 21:09
CommonMark "type 6" HTML blocks end at the next blank line, so markdown
like

    <details>

    **bold** content

    </details>

reaches the renderer as three blocks: an opener `RawBlock` carrying
`"<details>\n"`, a `Para`, and a closer `RawBlock` with `"</details>\n"`.
The current rawNode wraps each raw blob in its own `<rawhtml>` element
to keep xmlhtml from mangling the bytes; the side effect is that the
`<details>` open and close tags get trapped inside their wrappers and
the markdown paragraph ends up a sibling of the (empty) details element
rather than its child.

New `Heist.Extra.Splices.Pandoc.RawHtmlGroup.groupRawHtmlBlocks` pass
walks the AST (via `Text.Pandoc.Walk`, so nested block lists are covered
too) and rewrites those orphan triplets into a `B.Div` carrying the tag
in its `"tag"` attribute. The `Div` arm of `rpBlock'` already turns that
into the named element. It now also strips the `"tag"` directive before
serialising attributes so the override doesn't leak as a literal
`tag="…"` attribute on the rendered element.

Tests: 12 unit tests pin the parser and grouping behaviour (issue
example, empty group, orphan open/close, consecutive pairs, same-tag
nesting via depth counting, self-closing rejection, balanced-in-one-block
rejection, case-insensitive matching, attribute tolerance, hyphenated
custom-element names, mismatched-tag orphan), plus an end-to-end
integration test through `renderPandocWith` that asserts the markdown
paragraph lands inside `<details>` with no `<rawhtml>` wrapper.

Closes srid/emanote#433.

The "tag" attribute on a B.Div is a directive used by groupRawHtmlBlocks
(in RawHtmlGroup) to override the rendered element name. The directive's
key was scattered as a string literal across three sites — the producer
in RawHtmlGroup and two consumers (getTag, dropTagAttr) buried in a
where-clause inside rpBlock'.

Promote the protocol to first-class names in Render.Internal:

- tagDirectiveKey :: Text — the single source of truth for the key.
- divTag :: Text -> B.Attr -> Text — formerly the local getTag.
- stripTagDirective :: B.Attr -> B.Attr — formerly the local dropTagAttr.

RawHtmlGroup now imports tagDirectiveKey when constructing the Div, and
the Div arm of rpBlock' calls divTag/stripTagDirective directly. The
where-clause in rpBlock' loses two helpers and the Map import; everything
else is unchanged.

Addresses Hickey #2 (named protocol over implicit convention) and
Lowy #2/#4 (extract tagDirectiveKey + named helper for dropTagAttr).

openerTag and closerTag both stripped a prefix, parsed a tag-name span,
and verified that nothing but whitespace followed the closing '>'. The
opener has one extra check (reject self-closing); other than that the
two parsers were the same shape. Extract the shared work into one
parseTagAfterPrefix and let openerTag layer the void-element rejection
on top.

…unused defaultTag

The tag-directive scheme (key + resolver + stripper) was sitting in
Render.Internal — a module whose docstring scopes it to "pure helpers
extracted from Render.hs", i.e. table rendering. The producer of the
directive is RawHtmlGroup, and the volatility lives there: any future
change to the wire format starts at the module that decides what shape
to emit. Move the three helpers to RawHtmlGroup (the producer) and have
Render import from there. Render.Internal is back to its original
table-helpers scope.

While moving, also drop the unused defaultTag parameter on divTag —
every call site passes "div"; bake it in. Switch divTag from
Map.fromList+Map.lookup to plain Data.List.lookup since attr lists are
flat assoc lists with no duplicates in practice (no behaviour change for
real input). This saves a Map allocation per Div on the rendering hot path.

- RenderSpec: comment said 'getTag' but the helper was renamed to divTag.
- RawHtmlGroup module docstring: trim the Public-surface paragraph from
  three sentences to one — the rest narrated cabal config that is one
  grep away.
- RawHtmlGroupSpec: drop the 'Block helpers' header comment that
  narrated what the next three lines obviously are.

srid commented Apr 29, 2026

Hickey/Lowy Analysis

| # | Lens | Finding | Disposition |
| --- | --- | --- | --- |
| 1 | Hickey | closerTag silently matched when > is absent | Fixed in this PR |
| 2 | Hickey | getTag/dropTagAttr buried in a where clause as ad-hoc protocol | Fixed in this PR |
| 3 | Hickey | Opener/closer parsing asymmetry (inline vs where-bound) | No-op |
| 4 | Hickey | reverse acc accumulator pattern in splitAtMatchingCloser | No-op |
| 5 | Hickey | B.Figure arm doesn't strip the tag directive | No-op |
| 6 | Lowy | Module header should state its volatility axis | Fixed in this PR |
| 7 | Lowy | tag attribute protocol scattered as string literals across three sites | Fixed in this PR |
| 8 | Lowy | RawHtmlGroup exposed without an explicit reuse contract | Fixed in this PR |
| 9 | Lowy | dropTagAttr strip was anonymous; no named helper | Fixed in this PR |
| 10 | Lowy | Module name is functional, not volatility-named | No-op |

Hickey rationale

Five findings on the diff. The one real correctness item — closerTag accepting a malformed </details (no >) because T.drop 1 (T.dropWhile (/= '>') after) would silently advance past a missing terminator — got fixed by mirroring the opener's T.stripPrefix ">" guard. Two non-issues sat on the structural side: the parseOpener/closerTag asymmetry is justified (opener is tag-agnostic, closer is tag-parameterised) and the reverse acc accumulator is the standard tail-recursive idiom. The getTag/dropTagAttr "buried in a where clause" item became Lowy's #7+#9 and got addressed there.
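
The malformed-closer bug is easy to reproduce in isolation. The sketch below uses `String` equivalents of the `Text` calls quoted above; `closerLenient` and `closerStrict` are hypothetical names for the before and after shapes, and both ignore case-insensitivity to keep the example short.

```haskell
import Data.Char (isSpace)
import Data.List (stripPrefix)

-- Before: skip to '>' and drop one character. When '>' is absent,
-- dropWhile consumes everything, drop 1 is a no-op on the empty string,
-- and the "only whitespace follows" check passes vacuously.
closerLenient :: String -> String -> Bool
closerLenient tag raw =
  case stripPrefix ("</" ++ tag) raw of
    Nothing -> False
    Just after -> all isSpace (drop 1 (dropWhile (/= '>') after))

-- After: require the '>' explicitly; a missing terminator fails the match.
closerStrict :: String -> String -> Bool
closerStrict tag raw =
  case stripPrefix ("</" ++ tag) raw of
    Nothing -> False
    Just after ->
      case stripPrefix ">" (dropWhile isSpace after) of
        Nothing -> False
        Just trailing -> all isSpace trailing

main :: IO ()
main = do
  print (closerLenient "details" "</details\n") -- malformed, yet accepted
  print (closerStrict "details" "</details\n")  -- malformed, rejected
  print (closerStrict "details" "</details>\n") -- well-formed, accepted
```

Pattern-matching on the result of `stripPrefix` turns "the terminator is present" into something the code has to prove before it can look at the trailing text.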

Lowy rationale

Five findings on the boundary. The boundary itself passes the "almost expendable" test — if Pandoc stops splitting orphan-HTML blocks, the entire RawHtmlGroup module is trivial to remove. What needed cleanup was the directive scheme leaking through three sites (RawHtmlGroup produces, Render consumes via two helpers buried in a where clause, and the test code hardcodes the key). Resolution: the "tag" key + resolver + stripper now all live in RawHtmlGroup (the producer that owns the wire format's volatility), and Render imports divTag/stripTagDirective as a consumer. The module header gained a Volatility & boundary section that names the axis of change so future maintainers know the boundary is load-bearing, plus a Public surface paragraph that mirrors the Render.Internal arrangement.

The one "No-op" item — RawHtmlGroup is a functional name, not a volatility-named one (OrphanHtmlRebalancer would have been more Lowy-aligned) — is noise: the boundary is correct, the doc comment explains the context, and renaming a public module name carries deprecation cost that doesn't earn its keep.

@srid srid marked this pull request as ready for review April 29, 2026 18:52
@srid srid merged commit f496d0c into master Apr 29, 2026
@srid srid deleted the group-orphan-rawhtml-blocks branch April 29, 2026 18:52
srid added a commit to srid/emanote that referenced this pull request Apr 29, 2026
…686)

**Markdown content between two blank-line-separated raw HTML tags now
nests inside the surrounding element instead of escaping to be its
sibling.** Pandoc emits CommonMark "type 6" HTML blocks like `<details>`
… markdown … `</details>` as three separate AST blocks (opener
`RawBlock`, `Para`, closer `RawBlock`); without grouping, each raw blob
ends up wrapped in its own element and the markdown paragraph drifts
outside.

The actual fix is upstream in
srid/heist-extra#16,
which adds a `groupRawHtmlBlocks` AST-preprocess pass that rewrites
those orphan triplets into a `B.Div` carrying the tag in a directive
attribute — the existing `Div` renderer turns it into the named element
with the markdown content as a real DOM child.

This PR pins emanote to that branch and notes the fix under the `Bug
fixes` section of the unreleased changelog.

> Closes #433.

### Try it locally

```sh
nix run github:srid/emanote/empty-group
```

_Generated by [`/do`](https://github.com/srid/agency) on Claude Code
(model `claude-opus-4-7`)._
srid added a commit that referenced this pull request Apr 29, 2026
Brings in #16 (group orphan raw-HTML opener/closer blocks) so this PR
sits on top of latest master.

# Conflicts:
#	CHANGELOG.md
#	heist-extra.cabal
#	test/Spec.hs

Development

Successfully merging this pull request may close these issues.

Wrong rendering of HTML blocks
