An educational example demonstrating the architecture of txt2tex.
This minimal implementation converts Reverse Polish Notation (RPN) expressions to LaTeX math mode, showcasing the same pipeline architecture used in the full txt2tex system.
This project helps you understand the txt2tex codebase by providing:
- A simplified implementation of the same concepts
- A hands-on exercise to extend the functionality
- Clear documentation explaining each component
rpn2tex follows the same pipeline as txt2tex:
Input Text → Lexer → Tokens → Parser → AST → Generator → LaTeX
| rpn2tex File | txt2tex File | Purpose |
|---|---|---|
| tokens.py | tokens.py | Token types and Token dataclass |
| lexer.py | lexer.py | Tokenizes input text |
| ast_nodes.py | ast_nodes.py | AST node definitions |
| parser.py | parser.py | Builds AST from tokens |
| latex_gen.py | latex_gen.py | Converts AST to LaTeX |
| errors.py | errors.py | Error formatting |
| cli.py | cli.py | Command-line interface |
| Aspect | rpn2tex | txt2tex |
|---|---|---|
| Parser type | Stack-based (RPN) | Recursive descent (infix) |
| Token types | ~6 | 100+ |
| AST nodes | 2 | 30+ |
| Features | Basic arithmetic | Full Z notation |
```bash
# From the rpn2tex directory
pip install -e .

# Or use hatch for development
cd rpn2tex
hatch shell
```

```bash
# Convert an RPN file to LaTeX
rpn2tex examples/basic.rpn
# Output: $5 + 3$

rpn2tex examples/complex.rpn
# Output: $( 2 + 3 ) \times 4$

# Save to a file
rpn2tex examples/basic.rpn -o output.tex

# Read from stdin
echo "10 2 / 5 +" | rpn2tex -
```

```python
from rpn2tex.lexer import Lexer
from rpn2tex.parser import Parser
from rpn2tex.latex_gen import LaTeXGenerator

# Pipeline: text → tokens → AST → LaTeX
text = "5 3 + 2 *"
tokens = Lexer(text).tokenize()
ast = Parser(tokens).parse()
latex = LaTeXGenerator().generate(ast)
print(latex)  # $( 5 + 3 ) \times 2$
```

Reverse Polish Notation places operators after their operands:
| Infix (Standard) | RPN | LaTeX Output |
|---|---|---|
| 5 + 3 | 5 3 + | $5 + 3$ |
| 5 - 3 | 5 3 - | $5 - 3$ |
| 5 * 3 | 5 3 * | $5 \times 3$ |
| 6 / 2 | 6 2 / | $6 \div 2$ |
| (2 + 3) * 4 | 2 3 + 4 * | $( 2 + 3 ) \times 4$ |
| 2 + 3 * 4 | 2 3 4 * + | $2 + 3 \times 4$ |
RPN uses a stack to evaluate expressions:
```
Input:  2 3 + 4 *

Step 1: Push 2       Stack: [2]
Step 2: Push 3       Stack: [2, 3]
Step 3: + pops 2,3   Stack: [(2+3)]
Step 4: Push 4       Stack: [(2+3), 4]
Step 5: * pops all   Stack: [((2+3)*4)]

Result: (2 + 3) * 4
```
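The stack discipline above can be sketched in a few lines of Python. This standalone evaluator computes a numeric result for illustration; rpn2tex itself builds an AST instead of evaluating:

```python
def eval_rpn(text: str) -> float:
    """Evaluate an RPN expression numerically using a stack."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    stack: list[float] = []
    for word in text.split():
        if word in ops:
            right = stack.pop()        # Top of stack is the right operand
            left = stack.pop()
            stack.append(ops[word](left, right))
        else:
            stack.append(float(word))  # Numeric literal
    return stack[0]

print(eval_rpn("2 3 + 4 *"))  # 20.0
```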
Purpose: Define the vocabulary of the language.
The lexer produces tokens, which are the atomic units of the language.
```python
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    NUMBER = auto()  # Numeric literals
    PLUS = auto()    # +
    MINUS = auto()   # -
    MULT = auto()    # *
    DIV = auto()     # /
    EOF = auto()     # End marker

@dataclass(frozen=True)
class Token:
    type: TokenType
    value: str
    line: int
    column: int
```

What does `auto()` do?
`auto()` is a helper from Python's `enum` module that automatically generates unique values:

```python
NUMBER = auto()  # Gets value 1
PLUS = auto()    # Gets value 2
MINUS = auto()   # Gets value 3
# ... etc.
```

The actual numeric values don't matter: you never compare `TokenType.PLUS == 2`, only `token.type == TokenType.PLUS`. Using `auto()` prevents duplicate values and makes it easy to add, remove, or reorder token types.
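A quick self-contained check (toy enum, not the project's full `TokenType`) shows why the underlying numbers are irrelevant:

```python
from enum import Enum, auto

class TokenType(Enum):
    NUMBER = auto()
    PLUS = auto()

# Members are compared to each other, never to their numeric values:
print(TokenType.PLUS == TokenType.PLUS)  # True
print(TokenType.PLUS == 2)               # False: Enum members never equal plain ints
```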
Purpose: Convert raw text into a stream of tokens.
Input: String of characters
Output: List of Token objects
```
"5 3 +"
   ↓ Lexer
[Token(NUMBER, "5"), Token(NUMBER, "3"), Token(PLUS, "+"), Token(EOF, "")]
```

The lexer:
- Scans character by character
- Groups characters into meaningful units (tokens)
- Tracks position (line, column) for error reporting
- Ignores whitespace
```python
class Lexer:
    def tokenize(self) -> list[Token]:
        """Convert input text to a token stream."""
        tokens: list[Token] = []
        while not self._at_end():
            self._skip_whitespace()
            if self._at_end():
                break
            tokens.append(self._scan_token())
        tokens.append(Token(TokenType.EOF, "", self.line, self.column))
        return tokens
```

Key point: the lexer doesn't validate syntax. It just produces tokens. `"5 +"` tokenizes successfully; the parser catches the missing operand.
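The helper methods referenced above (`_at_end`, `_skip_whitespace`, `_scan_token`) might look like the following. This is a self-contained sketch: the `_advance` helper, the `_SINGLE` operator table, and the digit-scanning details are assumptions for illustration, not the module's actual implementation:

```python
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    NUMBER = auto()
    PLUS = auto()
    MINUS = auto()
    MULT = auto()
    DIV = auto()
    EOF = auto()

@dataclass(frozen=True)
class Token:
    type: TokenType
    value: str
    line: int
    column: int

# Single-character operators (hypothetical table name)
_SINGLE = {"+": TokenType.PLUS, "-": TokenType.MINUS,
           "*": TokenType.MULT, "/": TokenType.DIV}

class Lexer:
    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0
        self.line = 1
        self.column = 1

    def _at_end(self) -> bool:
        return self.pos >= len(self.text)

    def _advance(self) -> str:
        """Consume one character, tracking line/column for error reporting."""
        ch = self.text[self.pos]
        self.pos += 1
        if ch == "\n":
            self.line += 1
            self.column = 1
        else:
            self.column += 1
        return ch

    def _skip_whitespace(self) -> None:
        while not self._at_end() and self.text[self.pos].isspace():
            self._advance()

    def _scan_token(self) -> Token:
        line, column = self.line, self.column
        ch = self._advance()
        if ch in _SINGLE:
            return Token(_SINGLE[ch], ch, line, column)
        if ch.isdigit():
            start = self.pos - 1
            while not self._at_end() and self.text[self.pos].isdigit():
                self._advance()
            return Token(TokenType.NUMBER, self.text[start:self.pos], line, column)
        raise SyntaxError(f"Unexpected character {ch!r} at {line}:{column}")

    def tokenize(self) -> list[Token]:
        tokens: list[Token] = []
        while not self._at_end():
            self._skip_whitespace()
            if self._at_end():
                break
            tokens.append(self._scan_token())
        tokens.append(Token(TokenType.EOF, "", self.line, self.column))
        return tokens

for tok in Lexer("5 3 +").tokenize():
    print(tok.type.name, repr(tok.value))
```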
Purpose: Define the tree structure that represents parsed expressions.
The parser produces AST nodes. AST = Abstract Syntax Tree.
```python
@dataclass(frozen=True)
class Number(ASTNode):
    value: str

@dataclass(frozen=True)
class BinaryOp(ASTNode):
    operator: str
    left: Expr   # Child node
    right: Expr  # Child node
```

For input `2 3 + 4 *` (meaning `(2 + 3) * 4`), the AST is:

```
        BinaryOp(*)      ← root
       /           \
  BinaryOp(+)   Number(4)
   /       \
Number(2) Number(3)
```
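You can build that tree by hand to see the structure. A self-contained sketch with minimal stand-in node definitions:

```python
from dataclasses import dataclass

class ASTNode:
    """Minimal stand-in base class for this sketch."""

@dataclass(frozen=True)
class Number(ASTNode):
    value: str

@dataclass(frozen=True)
class BinaryOp(ASTNode):
    operator: str
    left: ASTNode
    right: ASTNode

# The AST for "2 3 + 4 *", i.e. (2 + 3) * 4:
ast = BinaryOp("*", BinaryOp("+", Number("2"), Number("3")), Number("4"))
print(ast.left.operator)  # +
print(ast.right.value)    # 4
```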
Purpose: Build an AST from tokens.
Input: List of Token objects
Output: AST root node (Expr)
```
[Token(NUMBER, "5"), Token(NUMBER, "3"), Token(PLUS, "+")]
   ↓ Parser
BinaryOp(operator="+", left=Number("5"), right=Number("3"))
```

The parser:
- Consumes tokens according to grammar rules
- Builds a tree structure representing the expression
- Validates syntax (e.g., `"5 +"` is invalid: missing operand)
```python
class Parser:
    def parse(self) -> Expr:
        """Convert tokens to an AST using stack-based RPN parsing."""
        stack: list[Expr] = []
        for token in self.tokens:  # Tokens supplied to the constructor
            if token.type == TokenType.NUMBER:
                stack.append(Number(token.value))
            elif token.type in (TokenType.PLUS, TokenType.MINUS,
                                TokenType.MULT, TokenType.DIV):
                right = stack.pop()  # Top of stack is the right operand
                left = stack.pop()
                stack.append(BinaryOp(token.value, left, right))
        return stack[0]
```

Why separate the lexer and parser?
Separation of concerns:
- Lexer handles lexical structure (what are the words?)
- Parser handles syntactic structure (how do words combine?)
Purpose: Convert AST to LaTeX output.
Input: AST root node Output: LaTeX string
Uses the visitor pattern with Python's `@singledispatchmethod`:

```python
from functools import singledispatchmethod

class LaTeXGenerator:
    # Operator-to-LaTeX table (an assumption for this excerpt; the real
    # module may structure the lookup differently)
    _OPERATORS = {"+": "+", "-": "-", "*": r"\times", "/": r"\div"}

    @singledispatchmethod
    def _visit(self, node: Expr) -> str:
        """Base case: raises an error for unknown node types."""
        raise NotImplementedError(f"No visitor for {type(node).__name__}")

    @_visit.register
    def _visit_number(self, node: Number) -> str:
        """Called when node is a Number."""
        return node.value

    @_visit.register
    def _visit_binary_op(self, node: BinaryOp) -> str:
        """Called when node is a BinaryOp."""
        left = self._visit(node.left)   # Recursively visit children
        right = self._visit(node.right)
        op_latex = self._OPERATORS[node.operator]
        return f"{left} {op_latex} {right}"
```

How the visitor pattern works:
- `@singledispatchmethod` marks `_visit` as the dispatcher
- `@_visit.register` registers handlers for specific types
- Python dispatches based on the argument's type at runtime:
  - `_visit(Number(...))` → calls `_visit_number`
  - `_visit(BinaryOp(...))` → calls `_visit_binary_op`
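A runnable demonstration of the dispatch mechanism, using a toy `Printer` class and stand-in node types rather than the real generator:

```python
from dataclasses import dataclass
from functools import singledispatchmethod

@dataclass(frozen=True)
class Number:
    value: str

@dataclass(frozen=True)
class BinaryOp:
    operator: str
    left: object
    right: object

class Printer:
    @singledispatchmethod
    def visit(self, node) -> str:
        raise NotImplementedError(type(node).__name__)

    @visit.register
    def _(self, node: Number) -> str:
        # Selected automatically when visit() receives a Number
        return node.value

    @visit.register
    def _(self, node: BinaryOp) -> str:
        # Selected automatically when visit() receives a BinaryOp
        return f"({self.visit(node.left)} {node.operator} {self.visit(node.right)})"

p = Printer()
print(p.visit(Number("5")))                              # 5
print(p.visit(BinaryOp("+", Number("2"), Number("3"))))  # (2 + 3)
```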
Why use the visitor pattern?
Without it (bad: violates the Open/Closed Principle):

```python
def generate(self, node):
    if isinstance(node, Number):
        return node.value
    elif isinstance(node, BinaryOp):
        ...
    elif isinstance(node, Exponent):  # Must modify for each new type!
        ...
```

With the visitor pattern (good: extensible):

```python
@_visit.register
def _visit_exponent(self, node: Exponent) -> str:  # Just add a new method!
    return f"{self._visit(node.base)}^{{{self._visit(node.exponent)}}}"
```

Depth-first traversal:
The visitor traverses the AST depth-first (children before parent):
```
_visit(BinaryOp(*))                ← Start at root
├─ _visit(BinaryOp(+))             ← Go left first (depth-first)
│   ├─ _visit(Number(2)) → "2"
│   ├─ _visit(Number(3)) → "3"
│   └─ returns "2 + 3"
├─ _visit(Number(4)) → "4"         ← Then right
└─ returns "( 2 + 3 ) \times 4"    ← Process the parent last
```
This is post-order traversal: children are processed before parents. This is natural for code generation because you need the child strings before you can combine them.
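The same post-order recursion works for any bottom-up computation over the tree. For instance, evaluating the AST numerically (a sketch with minimal stand-in node classes):

```python
import operator
from dataclasses import dataclass

@dataclass(frozen=True)
class Number:
    value: str

@dataclass(frozen=True)
class BinaryOp:
    operator: str
    left: object
    right: object

_OPS = {"+": operator.add, "-": operator.sub,
        "*": operator.mul, "/": operator.truediv}

def evaluate(node) -> float:
    """Post-order: evaluate children first, then combine at the parent."""
    if isinstance(node, Number):
        return float(node.value)
    left = evaluate(node.left)    # Children before parent
    right = evaluate(node.right)
    return _OPS[node.operator](left, right)

ast = BinaryOp("*", BinaryOp("+", Number("2"), Number("3")), Number("4"))
print(evaluate(ast))  # 20.0
```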
Purpose: Provide context-aware error messages with source location.
```
Error: Unexpected character '@'

1 | 5 3 @
        ^
```
Shows:
- Line number
- Source context
- Caret (^) pointing to exact error column
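A sketch of how such a message can be assembled. The `format_error` helper and its signature are hypothetical; errors.py's actual API may differ:

```python
def format_error(message: str, source: str, line: int, column: int) -> str:
    """Render an error with the offending source line and a caret."""
    src_line = source.splitlines()[line - 1]
    prefix = f"{line} | "
    # Caret column accounts for the "N | " gutter before the source text
    caret = " " * (len(prefix) + column - 1) + "^"
    return f"Error: {message}\n{prefix}{src_line}\n{caret}"

print(format_error("Unexpected character '@'", "5 3 @", 1, 5))
```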
Purpose: Orchestrate the pipeline and handle I/O.
```python
def main():
    text = read_input()
    tokens = Lexer(text).tokenize()          # Text → Tokens
    ast = Parser(tokens).parse()             # Tokens → AST
    latex = LaTeXGenerator().generate(ast)   # AST → LaTeX
    write_output(latex)
```

This project follows txt2tex's type annotation conventions:
- Function signatures: Always fully annotated (parameters + return type)
- Collection variables: Explicit annotations (`tokens: list[Token] = []`)
- Simple locals: Use inference when the type is clear from the return type (`tokens = lexer.tokenize()`)
The type checkers (mypy, pyright) infer types from return annotations, so explicit local annotations are only needed when they add clarity.
```bash
hatch run test
hatch run check    # lint + type + test
hatch run lint     # ruff check
hatch run format   # ruff format
hatch run type     # mypy strict mode
```

The following features are intentionally not implemented. Use these exercises to practice extending the rpn2tex codebase, which will prepare you for contributing to txt2tex.
For each exercise: Write unit tests in tests/test_rpn2tex.py to verify your implementation. Test each layer (lexer, parser, generator) independently, then add integration tests for the full pipeline. Run tests with hatch run test.
Goal: Support 2 3 ^ → $2^{3}$
Difficulty: Easy
Files to modify: tokens.py, lexer.py, ast_nodes.py, parser.py, latex_gen.py
Hints:
- Follow the pattern of existing binary operators for the token and lexer
- Exponentiation needs its own AST node type (not `BinaryOp`) because the LaTeX syntax differs
- LaTeX exponents use `^{...}` syntax

Test Cases:
- `2 3 ^` → `$2^{3}$`
- `2 10 ^` → `$2^{10}$`
- `2 3 4 + ^` → `$2^{( 3 + 4 )}$` (consider: when does an exponent need parentheses?)
Goal: Support 9 sqrt → $\sqrt{9}$
Difficulty: Medium (introduces unary operators)
Files to modify: tokens.py, lexer.py, ast_nodes.py, parser.py, latex_gen.py
Hints:
- Unlike `+` or `^`, this operator takes only one operand
- The lexer needs to recognize multi-character keywords, not just single characters
- LaTeX uses `\sqrt{...}` syntax

Test Cases:
- `9 sqrt` → `$\sqrt{9}$`
- `2 3 + sqrt` → `$\sqrt{( 2 + 3 )}$`
Goal: Support 8 3 root → $\sqrt[3]{8}$
Difficulty: Hard (two operands, special LaTeX syntax)
Files to modify: tokens.py, lexer.py, ast_nodes.py, parser.py, latex_gen.py
Hints:
- This takes two operands but isn't like the other binary operators
- Think carefully about stack order: which operand is the radicand and which is the index?
- LaTeX uses `\sqrt[index]{radicand}` syntax

Test Cases:
- `8 3 root` → `$\sqrt[3]{8}$`
- `16 4 root` → `$\sqrt[4]{16}$`
- `2 3 + 2 root` → `$\sqrt[2]{( 2 + 3 )}$`
After completing these exercises, you're ready to explore txt2tex:
- Compare implementations: See how txt2tex handles the same concepts at scale
- Study the parser: txt2tex uses recursive descent instead of stack-based parsing
- Explore AST nodes: txt2tex has 30+ node types for Z notation
- Read DESIGN.md: Understand the architectural decisions
MIT - Same as txt2tex