Extract structured data from messy Markdown into Pydantic v2 models.
Built for resilience against common LLM output quirks: triple-backtick wrappers, trailing prose, incomplete tables, malformed JSON, and more. One line of code turns chaotic Markdown into validated, typed Python objects.
- One-liner API --
MDConverter(Model).parse_tables(md)gets you started in one line - Markdown tables -- pipe-delimited tables become lists of Pydantic models
- JSON blocks -- fenced and inline JSON, with recovery for trailing commas, single quotes, unquoted keys, and truncated output
- YAML blocks -- fenced YAML code blocks (requires
pyyaml) - Auto-detect --
parse()tries code blocks first, then tables - Yes/No bool coercion --
"Yes","No","Y","N","true","false","on","off"all map tobool - Null sentinel handling -- empty cells,
"N/A","NA","null","-","—"becomeNonefor optional fields - Table selection -- filter tables by heading or index in multi-table documents
- LLM-resilient -- handles unclosed code fences, trailing prose, extra backticks, and nested structures
- Pydantic v2 native -- leverages Pydantic's own type coercion (str to int, str to float, str to datetime, etc.)
- Lightweight -- only dependency is
pydantic>=2.0.0
pip install md2pydanticOr with uv:
uv add md2pydanticOptional extras:
pip install md2pydantic[yaml] # YAML block support (pyyaml)
pip install md2pydantic[pandas] # DataFrame conversion (pandas)Requires Python 3.10+.
from pydantic import BaseModel
from md2pydantic import MDConverter
class Product(BaseModel):
name: str
price: float
in_stock: bool
markdown = """
Here are the products currently available:
| name | price | in_stock |
|------------|-------|----------|
| Widget | 9.99 | Yes |
| Gadget | 24.50 | No |
"""
products = MDConverter(Product).parse_tables(markdown)
# [Product(name='Widget', price=9.99, in_stock=True),
# Product(name='Gadget', price=24.5, in_stock=False)]Pydantic handles the str to float coercion. md2pydantic handles "Yes" / "No" to bool.
from pydantic import BaseModel
from md2pydantic import MDConverter
class ServerConfig(BaseModel):
host: str
port: int
debug: bool
markdown = '''Sure! Here is the server configuration:
```json
{
"host": "localhost",
"port": 8080,
"debug": true,
}Let me know if you need anything else! '''
config = MDConverter(ServerConfig).parse_json(markdown)
Notice the trailing comma after `true` -- md2pydantic fixes that automatically.
### Parse a YAML Block
```python
from pydantic import BaseModel
from md2pydantic import MDConverter
class ServerConfig(BaseModel):
host: str
port: int
debug: bool
markdown = '''Here is your config:
```yaml
host: api.example.com
port: 443
debug: false
'''
config = MDConverter(ServerConfig).parse_yaml(markdown)
Requires `pyyaml`: install with `pip install md2pydantic[yaml]`.
### Auto-Detect Format
```python
from md2pydantic import MDConverter
# parse() tries JSON/YAML code blocks first, then falls back to tables
result = MDConverter(ServerConfig).parse(markdown)
Returns a single model instance for code blocks, or a list for tables and JSON arrays.
When a document contains multiple tables, filter by the preceding Markdown heading:
from pydantic import BaseModel
from md2pydantic import MDConverter
class User(BaseModel):
name: str
age: int
active: bool
markdown = """
## Current Staff
| name | age | active |
|-------|-----|--------|
| Alice | 30 | Yes |
## Former Staff
| name | age | active |
|-------|-----|--------|
| Bob | 25 | No |
| Eve | 35 | No |
"""
current = MDConverter(User).parse_tables(markdown, heading="Current Staff")
# [User(name='Alice', age=30, active=True)]
former = MDConverter(User).parse_tables(markdown, heading="Former Staff")
# [User(name='Bob', age=25, active=False), User(name='Eve', age=35, active=False)]Heading matching is case-insensitive and supports substring matches. You can also select by index with index=0.
Empty cells and common null placeholders become None for optional fields:
class Employee(BaseModel):
name: str
department: str
salary: float | None = None
markdown = """
| name | department | salary |
|-------|-------------|--------|
| Alice | Engineering | 95000 |
| Bob | Marketing | N/A |
| Carol | Sales | - |
"""
employees = MDConverter(Employee).parse_tables(markdown)
# employees[0].salary == 95000.0
# employees[1].salary is None (from "N/A")
# employees[2].salary is None (from "-")Recognized null sentinels: "" (empty), "N/A", "NA", "null", "-", "—". Matching is case-insensitive.
from md2pydantic import MDConverter, ExtractionError
try:
result = MDConverter(MyModel).parse_tables("no tables here")
except ExtractionError as e:
print(e) # Human-readable summary with line numbers
print(e.errors) # List of typed error detailsExtractionError is raised when:
- No structured data is found in the input
- Structured data is found but none of it validates against the model
Each error in .errors is either a TransformError (parsing failed) or ModelValidationError (Pydantic rejected the data), both with source location info.
When parsing tables with mixed valid/invalid rows, use partial=True to get both:
from md2pydantic import MDConverter, PartialResult
result = MDConverter(User).parse_tables(markdown, partial=True)
# result.data → list of valid User instances
# result.errors → list of typed errors with row locations
# result.has_errors → True if any rows failedExtractionError inherits from MD2PydanticError, so you can catch either.
| Format | Method | Fenced | Inline | Recovery |
|---|---|---|---|---|
| Markdown tables | parse_tables() |
-- | Yes | Padded/truncated columns |
| JSON | parse_json() |
Yes | Yes | Trailing commas, single quotes, unquoted keys, truncated JSON |
| YAML | parse_yaml() |
Yes | -- | -- |
| Auto-detect | parse() |
Yes | Yes | All of the above |
Create a converter bound to a Pydantic v2 BaseModel subclass.
converter = MDConverter(MyModel)Extract Markdown tables and return validated model instances (one per row).
index-- only parse the table at this 0-based position (applied after heading filter)heading-- only parse tables under headings matching this substring (case-insensitive)- Raises
ExtractionErrorif no tables are found or no rows validate
Extract a JSON code block and return a single validated model instance. Tries each JSON block in document order, returning the first that validates.
- Raises
ExtractionErrorif no JSON blocks are found or none validate
Extract a YAML code block and return a single validated model instance.
- Raises
ExtractionErrorif no YAML blocks are found or none validate - Requires
pyyaml(pip install md2pydantic[yaml])
Auto-detect format. Tries code blocks (JSON/YAML) first, then tables.
- Raises
ExtractionErrorif no structured data is found or none validates
| Exception | Parent | Description |
|---|---|---|
MD2PydanticError |
Exception |
Base exception for the library |
ExtractionError |
MD2PydanticError |
No data found or validation failed. Has .errors attribute. |
md2pydantic follows a Seek, Clean, Validate pipeline:
-
Scanner -- Uses regex and heuristics to identify candidate blocks (JSON, YAML, Markdown tables) within the input. Handles triple-backtick enclosures, unclosed fences, and trailing prose.
-
Transformer -- Converts raw blocks into Python dictionaries. Fixes malformed JSON (trailing commas, single quotes, unquoted keys, truncated output). Converts table rows into dicts using headers as keys.
-
Validator -- Passes dictionaries to your Pydantic model. Pre-processes Yes/No booleans and null sentinels before handing off to Pydantic's native coercion engine.
git clone https://github.com/FelipeMorandini/md2pydantic.git
cd md2pydantic
uv sync --extra dev
uv run pytest # run tests
uv run ruff check . # lint
uv run ruff format . # format
uv run mypy src/md2pydantic # type checkSee CONTRIBUTING.md for more details.