rpn2tex

An educational example demonstrating the architecture of txt2tex.

This minimal implementation converts Reverse Polish Notation (RPN) expressions to LaTeX math mode, showcasing the same pipeline architecture used in the full txt2tex system.

Purpose

This project helps you understand the txt2tex codebase by providing:

  1. A simplified implementation of the same concepts
  2. A hands-on exercise to extend the functionality
  3. Clear documentation explaining each component

Architecture Overview

rpn2tex follows the same pipeline as txt2tex:

Input Text → Lexer → Tokens → Parser → AST → Generator → LaTeX

File Correspondence

rpn2tex File   txt2tex File   Purpose
------------   ------------   -------
tokens.py      tokens.py      Token types and Token dataclass
lexer.py       lexer.py       Tokenizes input text
ast_nodes.py   ast_nodes.py   AST node definitions
parser.py      parser.py      Builds AST from tokens
latex_gen.py   latex_gen.py   Converts AST to LaTeX
errors.py      errors.py      Error formatting
cli.py         cli.py         Command-line interface

Key Differences

Aspect        rpn2tex             txt2tex
------        -------             -------
Parser type   Stack-based (RPN)   Recursive descent (infix)
Token types   ~6                  ~100+
AST nodes     2                   30+
Features      Basic arithmetic    Full Z notation

Installation

# From the rpn2tex directory
pip install -e .

# Or use hatch for development
cd rpn2tex
hatch shell

Quick Start

Command Line

# Convert RPN file to LaTeX
rpn2tex examples/basic.rpn

# Output: $5 + 3$

rpn2tex examples/complex.rpn

# Output: $( 2 + 3 ) \times 4$

# Save to file
rpn2tex examples/basic.rpn -o output.tex

# Read from stdin
echo "10 2 / 5 +" | rpn2tex -

Python API

from rpn2tex.lexer import Lexer
from rpn2tex.parser import Parser
from rpn2tex.latex_gen import LaTeXGenerator

# Pipeline: text → tokens → AST → LaTeX
text = "5 3 + 2 *"
tokens = Lexer(text).tokenize()
ast = Parser(tokens).parse()
latex = LaTeXGenerator().generate(ast)

print(latex)  # $( 5 + 3 ) \times 2$

RPN Syntax Reference

Reverse Polish Notation places operators after their operands:

Infix (Standard)   RPN         LaTeX Output
----------------   ---         ------------
5 + 3              5 3 +       $5 + 3$
5 - 3              5 3 -       $5 - 3$
5 * 3              5 3 *       $5 \times 3$
6 / 2              6 2 /       $6 \div 2$
(2 + 3) * 4        2 3 + 4 *   $( 2 + 3 ) \times 4$
2 + 3 * 4          3 4 * 2 +   $2 + 3 \times 4$

How RPN Works

RPN uses a stack to evaluate expressions:

Input: 2 3 + 4 *

Step 1: Push 2           Stack: [2]
Step 2: Push 3           Stack: [2, 3]
Step 3: + pops 3, 2      Stack: [(2+3)]
Step 4: Push 4           Stack: [(2+3), 4]
Step 5: * pops 4, (2+3)  Stack: [((2+3)*4)]

Result: (2 + 3) * 4
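
The same stack algorithm as a standalone numeric evaluator, for illustration only (rpn2tex builds AST nodes instead of computing values; see the parser below):

import operator

# Map operator symbols to their numeric implementations
_OPS = {"+": operator.add, "-": operator.sub,
        "*": operator.mul, "/": operator.truediv}

def eval_rpn(expression: str) -> float:
    """Evaluate a numeric RPN expression with a stack."""
    stack: list[float] = []
    for word in expression.split():
        if word in _OPS:
            right = stack.pop()  # popped first: the right operand
            left = stack.pop()
            stack.append(_OPS[word](left, right))
        else:
            stack.append(float(word))
    return stack[0]

print(eval_rpn("2 3 + 4 *"))  # 20.0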

Component Walkthrough

1. Tokens (tokens.py)

Purpose: Define the vocabulary of the language.

The lexer produces tokens, which are the atomic units of the language.

class TokenType(Enum):
    NUMBER = auto()   # Numeric literals
    PLUS = auto()     # +
    MINUS = auto()    # -
    MULT = auto()     # *
    DIV = auto()      # /
    EOF = auto()      # End marker

@dataclass(frozen=True)
class Token:
    type: TokenType
    value: str
    line: int
    column: int

What does auto() do?

auto() is a helper from Python's enum module that automatically generates unique values:

NUMBER = auto()  # Gets value 1
PLUS = auto()    # Gets value 2
MINUS = auto()   # Gets value 3
# ... etc

The actual numeric values don't matter—you never compare TokenType.PLUS == 2. You only compare token.type == TokenType.PLUS. Using auto() prevents duplicate values and makes it easy to add/remove/reorder token types.
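
For example, a snippet using the definitions above:

token = Token(TokenType.PLUS, "+", line=1, column=5)

# Compare against the enum member, never its auto-generated number
if token.type is TokenType.PLUS:
    print("addition operator")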

2. Lexer (lexer.py)

Purpose: Convert raw text into a stream of tokens.

Input: String of characters
Output: List of Token objects

"5 3 +"Lexer
[Token(NUMBER, "5"), Token(NUMBER, "3"), Token(PLUS, "+"), Token(EOF, "")]

The lexer:

  • Scans character by character
  • Groups characters into meaningful units (tokens)
  • Tracks position (line, column) for error reporting
  • Ignores whitespace

class Lexer:
    def tokenize(self) -> list[Token]:
        """Convert input text to token stream."""
        tokens: list[Token] = []
        while not self._at_end():
            self._skip_whitespace()
            if self._at_end():
                break
            tokens.append(self._scan_token())
        tokens.append(Token(TokenType.EOF, "", self.line, self.column))
        return tokens

Key point: The lexer doesn't validate syntax. It just produces tokens. "5 +" tokenizes successfully—the parser catches the missing operand.
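
The tokenize loop relies on helpers (_skip_whitespace, _at_end, _scan_token) that are not shown here. A minimal sketch of _scan_token, assuming hypothetical _advance/_peek helpers and a LexerError exception (the names in the real lexer.py may differ):

# Single-character operator table (module level)
_SINGLE_CHAR = {
    "+": TokenType.PLUS,
    "-": TokenType.MINUS,
    "*": TokenType.MULT,
    "/": TokenType.DIV,
}

def _scan_token(self) -> Token:
    start_col = self.column
    ch = self._advance()  # hypothetical: consume one char, update line/column
    if ch in _SINGLE_CHAR:
        return Token(_SINGLE_CHAR[ch], ch, self.line, start_col)
    if ch.isdigit():
        lexeme = ch
        while not self._at_end() and self._peek().isdigit():
            lexeme += self._advance()
        return Token(TokenType.NUMBER, lexeme, self.line, start_col)
    raise LexerError(f"Unexpected character {ch!r}", self.line, start_col)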

3. AST Nodes (ast_nodes.py)

Purpose: Define the tree structure that represents parsed expressions.

The parser produces AST (Abstract Syntax Tree) nodes.

@dataclass(frozen=True)
class Number(ASTNode):
    value: str

@dataclass(frozen=True)
class BinaryOp(ASTNode):
    operator: str
    left: Expr      # Child node
    right: Expr     # Child node

For input 2 3 + 4 * (meaning (2 + 3) * 4), the AST is:

        BinaryOp(*)        ← root
        /        \
   BinaryOp(+)    Number(4)
   /        \
Number(2)  Number(3)
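
Built by hand with the dataclasses above, that tree is:

ast = BinaryOp(
    operator="*",
    left=BinaryOp(operator="+", left=Number("2"), right=Number("3")),
    right=Number("4"),
)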

4. Parser (parser.py)

Purpose: Build an AST from tokens.

Input: List of Token objects
Output: AST root node (Expr)

[Token(NUMBER, "5"), Token(NUMBER, "3"), Token(PLUS, "+")]
    ↓ Parser
BinaryOp(operator="+", left=Number("5"), right=Number("3"))

The parser:

  • Consumes tokens according to grammar rules
  • Builds tree structure representing the expression
  • Validates syntax (e.g., "5 +" is invalid—missing operand)

class Parser:
    def parse(self) -> Expr:
        """Convert tokens to AST using stack-based RPN parsing."""
        stack: list[Expr] = []
        for token in self._tokens:
            if token.type == TokenType.NUMBER:
                stack.append(Number(token.value))
            elif token.type in (TokenType.PLUS, TokenType.MINUS,
                                TokenType.MULT, TokenType.DIV):
                right = stack.pop()  # popped first: the right operand
                left = stack.pop()
                stack.append(BinaryOp(token.value, left, right))
        return stack[0]
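
The excerpt above is trimmed and omits the validation mentioned in the bullets. A sketch of parse with the missing checks, assuming a ParseError exception that accepts the offending token (the real errors.py may define this differently):

def parse(self) -> Expr:
    stack: list[Expr] = []
    operators = (TokenType.PLUS, TokenType.MINUS,
                 TokenType.MULT, TokenType.DIV)
    for token in self._tokens:
        if token.type == TokenType.NUMBER:
            stack.append(Number(token.value))
        elif token.type in operators:
            if len(stack) < 2:
                # "5 +" fails here: only one operand on the stack
                raise ParseError(
                    f"operator {token.value!r} needs two operands", token)
            right = stack.pop()
            left = stack.pop()
            stack.append(BinaryOp(token.value, left, right))
    if len(stack) != 1:
        # "5 3" fails here: two operands but no operator to combine them
        raise ParseError("expression does not reduce to a single result",
                         self._tokens[-1])
    return stack[0]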

Why separate lexer and parser?

Separation of concerns:

  • Lexer handles lexical structure (what are the words?)
  • Parser handles syntactic structure (how do words combine?)

5. LaTeX Generator (latex_gen.py)

Purpose: Convert AST to LaTeX output.

Input: AST root node
Output: LaTeX string

Uses the visitor pattern with Python's @singledispatchmethod (from functools):

class LaTeXGenerator:
    @singledispatchmethod
    def _visit(self, node: Expr) -> str:
        """Base case—raises error for unknown node types."""
        raise NotImplementedError(f"No visitor for {type(node).__name__}")

    @_visit.register
    def _visit_number(self, node: Number) -> str:
        """Called when node is a Number."""
        return node.value

    @_visit.register
    def _visit_binary_op(self, node: BinaryOp) -> str:
        """Called when node is a BinaryOp."""
        op_latex = {"+": "+", "-": "-", "*": r"\times", "/": r"\div"}[node.operator]
        left = self._visit(node.left)    # Recursively visit children
        right = self._visit(node.right)
        # (parenthesization of nested sub-expressions is omitted in this excerpt)
        return f"{left} {op_latex} {right}"

How the visitor pattern works:

  1. @singledispatchmethod marks _visit as the dispatcher
  2. @_visit.register registers handlers for specific types
  3. Python dispatches based on the argument's type at runtime:
    • _visit(Number(...)) → calls _visit_number
    • _visit(BinaryOp(...)) → calls _visit_binary_op

Why use the visitor pattern?

Without it (bad—violates Open/Closed Principle):

def generate(self, node):
    if isinstance(node, Number):
        return node.value
    elif isinstance(node, BinaryOp):
        ...
    elif isinstance(node, Exponent):  # Must modify for each new type!
        ...

With visitor pattern (good—extensible):

@_visit.register
def _visit_exponent(self, node: Exponent) -> str:  # Just add new method!
    return f"{self._visit(node.base)}^{{{self._visit(node.exponent)}}}"

Depth-first traversal:

The visitor traverses the AST depth-first (children before parent):

_visit(BinaryOp(*))                      ← Start at root
  ├─ _visit(BinaryOp(+))                 ← Go left first (depth-first)
  │    ├─ _visit(Number(2)) → "2"
  │    ├─ _visit(Number(3)) → "3"
  │    └─ returns "2 + 3"
  ├─ _visit(Number(4)) → "4"             ← Then right
  └─ returns "( 2 + 3 ) \times 4"        ← Process parent last

This is post-order traversal: children are processed before parents. This is natural for code generation because you need the child strings before you can combine them.

6. Error Formatter (errors.py)

Purpose: Provide context-aware error messages with source location.

Error: Unexpected character '@'

1 | 5 3 @
        ^

Shows:

  • Line number
  • Source context
  • Caret (^) pointing to exact error column
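
A formatter producing that shape can be small. A sketch with hypothetical names (see the real errors.py):

def format_error(message: str, source_line: str, line: int, column: int) -> str:
    """Render an error with the offending line and a caret under the column."""
    prefix = f"{line} | "
    pointer = " " * (len(prefix) + column - 1) + "^"  # column is 1-based
    return f"Error: {message}\n\n{prefix}{source_line}\n{pointer}"

print(format_error("Unexpected character '@'", "5 3 @", line=1, column=5))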

7. CLI (cli.py)

Purpose: Orchestrate the pipeline and handle I/O.

def main():
    text = read_input()
    tokens = Lexer(text).tokenize()      # Text → Tokens
    ast = Parser(tokens).parse()          # Tokens → AST
    latex = LaTeXGenerator().generate(ast)  # AST → LaTeX
    write_output(latex)
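
One plausible wiring of read_input and write_output, supporting the -o flag and "-" for stdin from the Quick Start (a sketch; the real cli.py may differ):

import argparse
import sys
from pathlib import Path

from rpn2tex.lexer import Lexer
from rpn2tex.parser import Parser
from rpn2tex.latex_gen import LaTeXGenerator

def main() -> None:
    ap = argparse.ArgumentParser(prog="rpn2tex")
    ap.add_argument("source", help="input .rpn file, or '-' for stdin")
    ap.add_argument("-o", "--output", help="output .tex file (default: stdout)")
    args = ap.parse_args()

    # "-" means read the RPN text from stdin
    text = sys.stdin.read() if args.source == "-" else Path(args.source).read_text()

    tokens = Lexer(text).tokenize()
    ast = Parser(tokens).parse()
    latex = LaTeXGenerator().generate(ast)

    if args.output:
        Path(args.output).write_text(latex + "\n")
    else:
        print(latex)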

Code Style

This project follows txt2tex's type annotation conventions:

  • Function signatures: Always fully annotated (parameters + return type)
  • Collection variables: Explicit annotations (tokens: list[Token] = [])
  • Simple locals: Use inference when type is clear from return type (tokens = lexer.tokenize())

The type checkers (mypy, pyright) infer types from return annotations, so explicit local annotations are only needed when they add clarity.
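
Concretely (count_numbers is a hypothetical illustration):

def count_numbers(tokens: list[Token]) -> int:   # signature: fully annotated
    return sum(t.type is TokenType.NUMBER for t in tokens)

numbers: list[Token] = []          # empty collection: annotate explicitly
tokens = Lexer(text).tokenize()    # simple local: inferred from the return type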

Development

Running Tests

hatch run test

Running Quality Checks

hatch run check       # lint + type + test
hatch run lint        # ruff check
hatch run format      # ruff format
hatch run type        # mypy strict mode

Programming Exercises

The following features are intentionally not implemented. Use these exercises to practice extending the rpn2tex codebase, which will prepare you for contributing to txt2tex.

For each exercise: Write unit tests in tests/test_rpn2tex.py to verify your implementation. Test each layer (lexer, parser, generator) independently, then add integration tests for the full pipeline. Run tests with hatch run test.

Exercise 1: Add Exponentiation (^)

Goal: Support 2 3 ^ → $2^{3}$

Difficulty: Easy

Files to modify: tokens.py, lexer.py, ast_nodes.py, parser.py, latex_gen.py

Hints:

  • Follow the pattern of existing binary operators for the token and lexer
  • Exponentiation needs its own AST node type (not BinaryOp) because LaTeX syntax differs
  • LaTeX exponents use ^{...} syntax

Test Cases:

  • 2 3 ^ → $2^{3}$
  • 2 10 ^ → $2^{10}$
  • 2 3 4 + ^ → $2^{( 3 + 4 )}$ (consider: when does an exponent need parentheses?)

Exercise 2: Add Square Root (sqrt)

Goal: Support 9 sqrt → $\sqrt{9}$

Difficulty: Medium (introduces unary operators)

Files to modify: tokens.py, lexer.py, ast_nodes.py, parser.py, latex_gen.py

Hints:

  • Unlike + or ^, this operator takes only one operand
  • The lexer needs to recognize multi-character keywords, not just single characters
  • LaTeX uses \sqrt{...} syntax

Test Cases:

  • 9 sqrt → $\sqrt{9}$
  • 2 3 + sqrt → $\sqrt{( 2 + 3 )}$

Exercise 3: Add Nth Root (root)

Goal: Support 8 3 root → $\sqrt[3]{8}$

Difficulty: Hard (two operands, special LaTeX syntax)

Files to modify: tokens.py, lexer.py, ast_nodes.py, parser.py, latex_gen.py

Hints:

  • This takes two operands but isn't like other binary operators
  • Think carefully about stack order: which operand is the radicand vs the index?
  • LaTeX uses \sqrt[index]{radicand} syntax

Test Cases:

  • 8 3 root → $\sqrt[3]{8}$
  • 16 4 root → $\sqrt[4]{16}$
  • 2 3 + 2 root → $\sqrt[2]{( 2 + 3 )}$

Next Steps

After completing these exercises, you're ready to explore txt2tex:

  1. Compare implementations: See how txt2tex handles the same concepts at scale
  2. Study the parser: txt2tex uses recursive descent instead of stack-based parsing
  3. Explore AST nodes: txt2tex has 30+ node types for Z notation
  4. Read DESIGN.md: Understand the architectural decisions

License

MIT - Same as txt2tex
