Short-circuit inline parsing plain text#149

Merged
dillonkearns merged 4 commits into master from shortcircuit-inline-parsing
Feb 6, 2026

Conversation

@dillonkearns
Owner

@dillonkearns dillonkearns commented Feb 5, 2026

Summary

This PR adds an early-exit optimization to the inline parser that skips expensive tokenization when text contains no special markdown characters. This speeds up parsing for plain text content.

Changes

  • hasAnyTokenChar uses String.any to efficiently check if text contains any markdown-relevant characters (`, *, _, ~, [, ], <, >, \n)
  • When no special characters are present, the tokenizer returns [] immediately, avoiding 8+ separate regex/pattern scans
  • For text with special characters, individual String.contains checks gate each tokenizer
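Since Elm's String operations compile down to native JavaScript string methods, the gating logic can be sketched in JavaScript. This is an illustrative analogue only, not the PR's actual Elm code; the names hasAnyTokenChar, tokenize, and the tokenizer stubs are hypothetical stand-ins:

```javascript
// Characters that can begin an inline markdown construct.
const TOKEN_CHARS = ['`', '*', '_', '~', '[', ']', '<', '>', '\n'];

// Analogue of the Elm hasAnyTokenChar helper: Array.some short-circuits
// on the first match, just like String.any.
function hasAnyTokenChar(text) {
  return TOKEN_CHARS.some((c) => text.includes(c));
}

// Hypothetical sketch of the gated tokenizer.
function tokenize(text) {
  // Early exit: plain text yields no inline tokens at all.
  if (!hasAnyTokenChar(text)) {
    return [];
  }
  const tokens = [];
  // Each expensive tokenizer is gated by a cheap native contains check.
  if (text.includes('`')) {
    // ... run the code-span tokenizer and push its tokens ...
  }
  if (text.includes('*') || text.includes('_')) {
    // ... run the emphasis tokenizer and push its tokens ...
  }
  return tokens;
}
```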

Why multiple String.contains calls instead of a single pass?

I explored two alternative "cleaner" approaches and benchmarked them:

Alternative 1: Single-pass String.foldl

Build all character flags in one pass instead of multiple String.contains calls:

detectTokenChars : String -> TokenFlags
detectTokenChars str =
    String.foldl (\c flags -> case c of ...) emptyFlags str

Result: Slightly slower (0.080ms vs 0.078ms for plain text)

Alternative 2: Single-pass recursive tokenizer

Replace all regex-based tokenizers with a single character-by-character scan:

tokenizeLoop : String -> Int -> TokenizeState -> TokenizeState
tokenizeLoop rawText index state = ...

Result: Significantly slower (0.258ms vs 0.078ms for plain text — 3.3x regression)

Why native operations win

Multiple String.contains calls are actually faster than manual single-pass approaches because:

  1. String.contains compiles to JavaScript's native indexOf, which is heavily optimized at the engine level
  2. Regex.find similarly uses the browser's native regex engine
  3. Manual character iteration in Elm incurs function call overhead for each character
  4. Native string operations can short-circuit early when a match is found

The "inelegant" multiple-pass approach leverages these native optimizations, making it faster than conceptually cleaner single-pass alternatives.

Performance Impact

Plain text (no formatting): ~1.5x faster
Long unformatted lines: ~2x faster
Formatted content: No regression

Benchmarking

You can verify the results by running the benchmark script:

cd spec-tests
npx elm make OutputMarkdownHtml.elm --optimize --output elm.js
node benchmark.js

Sample results (on my machine):

| Test Case | Before | After | Speedup |
| --- | --- | --- | --- |
| Plain text, no formatting (1400 chars) | 0.114ms | 0.078ms | 1.5x |
| Plain text with newlines (1410 chars) | 0.132ms | 0.100ms | 1.3x |
| Long line, no formatting (10k chars) | 0.502ms | 0.244ms | 2.1x |
| Typical README (595 chars) | 0.278ms | 0.274ms | ~same |
| README x10 (5950 chars) | 2.376ms | 2.290ms | ~same |
| Large table (250 cells) | 1.708ms | 1.352ms | ~same |

dillonkearns and others added 4 commits February 4, 2026 15:47
…d expensive parsing for stretches of plain text.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dillonkearns dillonkearns merged commit 6b8d7e5 into master Feb 6, 2026
6 checks passed