md2pydantic

Extract structured data from messy Markdown into Pydantic v2 models.

Built for resilience against common LLM output quirks: triple-backtick wrappers, trailing prose, incomplete tables, malformed JSON, and more. One line of code turns chaotic Markdown into validated, typed Python objects.

Features

One-liner API -- MDConverter(Model).parse_tables(md) returns validated models in a single call
Markdown tables -- pipe-delimited tables become lists of Pydantic models
JSON blocks -- fenced and inline JSON, with recovery for trailing commas, single quotes, unquoted keys, and truncated output
YAML blocks -- fenced YAML code blocks (requires pyyaml)
Auto-detect -- parse() tries code blocks first, then tables
Yes/No bool coercion -- "Yes", "No", "Y", "N", "true", "false", "on", "off" all map to bool
Null sentinel handling -- empty cells, "N/A", "NA", "null", "-", "—" become None for optional fields
Table selection -- filter tables by heading or index in multi-table documents
LLM-resilient -- handles unclosed code fences, trailing prose, extra backticks, and nested structures
Pydantic v2 native -- leverages Pydantic's own type coercion (str to int, str to float, str to datetime, etc.)
Lightweight -- only dependency is pydantic>=2.0.0

Installation

pip install md2pydantic

Or with uv:

uv add md2pydantic

Optional extras:

pip install md2pydantic[yaml]    # YAML block support (pyyaml)
pip install md2pydantic[pandas]  # DataFrame conversion (pandas)

Requires Python 3.10+.

Quick Start

Parse a Markdown Table

from pydantic import BaseModel
from md2pydantic import MDConverter

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

markdown = """
Here are the products currently available:

| name       | price | in_stock |
|------------|-------|----------|
| Widget     | 9.99  | Yes      |
| Gadget     | 24.50 | No       |
"""

products = MDConverter(Product).parse_tables(markdown)
# [Product(name='Widget', price=9.99, in_stock=True),
#  Product(name='Gadget', price=24.5, in_stock=False)]

Pydantic handles the str-to-float coercion; md2pydantic converts "Yes" / "No" to bool before validation.
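To make that division of labor concrete, here is a minimal stdlib-only sketch of how a pipe-delimited table could be turned into row dictionaries, with Yes/No words mapped to booleans before validation. It is not md2pydantic's actual implementation; `parse_pipe_table`, `split_row`, and `coerce_bool` are hypothetical helper names used only for illustration.

```python
# Illustrative sketch only; not md2pydantic's internals.
BOOL_WORDS = {
    "yes": True, "y": True, "true": True, "on": True,
    "no": False, "n": False, "false": False, "off": False,
}


def coerce_bool(cell: str) -> object:
    """Map Yes/No-style words to bool; return any other cell unchanged."""
    return BOOL_WORDS.get(cell.strip().lower(), cell)


def split_row(line: str) -> list:
    """Split '| a | b |' into ['a', 'b']."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_pipe_table(markdown: str) -> list:
    """Return one dict per data row of the pipe-table lines in the text."""
    lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    headers = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(headers, split_row(ln))) for ln in lines[2:]]


table = """
| name   | price | in_stock |
|--------|-------|----------|
| Widget | 9.99  | Yes      |
| Gadget | 24.50 | No       |
"""

rows = parse_pipe_table(table)
for row in rows:
    row["in_stock"] = coerce_bool(row["in_stock"])
```

In this sketch `price` is still the string "9.99" after parsing; converting it to a float is the part left to Pydantic.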

Parse a JSON Block

from pydantic import BaseModel
from md2pydantic import MDConverter

class ServerConfig(BaseModel):
    host: str
    port: int
    debug: bool

markdown = '''Sure! Here is the server configuration:

```json
{
    "host": "localhost",
    "port": 8080,
    "debug": true,
}
```

Let me know if you need anything else!
'''

config = MDConverter(ServerConfig).parse_json(markdown)
# ServerConfig(host='localhost', port=8080, debug=True)

Notice the trailing comma after `true` -- md2pydantic fixes that automatically.

### Parse a YAML Block

````python
from pydantic import BaseModel
from md2pydantic import MDConverter

class ServerConfig(BaseModel):
    host: str
    port: int
    debug: bool

markdown = '''Here is your config:

```yaml
host: api.example.com
port: 443
debug: false
```
'''

config = MDConverter(ServerConfig).parse_yaml(markdown)
# ServerConfig(host='api.example.com', port=443, debug=False)
````

Requires `pyyaml`: install with `pip install md2pydantic[yaml]`.

### Auto-Detect Format

```python
from md2pydantic import MDConverter

# Reuses ServerConfig and markdown from the YAML example above.
# parse() tries JSON/YAML code blocks first, then falls back to tables
result = MDConverter(ServerConfig).parse(markdown)
```

parse() returns a single model instance for code blocks, or a list for tables and JSON arrays.
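The detection order described here (fenced code blocks first, then tables) can be sketched with two regular expressions. This is a simplified illustration under stated assumptions, not md2pydantic's scanner: `detect_format` is a hypothetical function, and the patterns only cover fenced JSON/YAML blocks and pipe-table rows.

```python
import re

# Build the fence marker programmatically so this example nests cleanly.
FENCE = "`" * 3

# A fenced json/yaml block; \Z lets an unclosed fence still match.
FENCE_RE = re.compile(
    FENCE + r"(json|yaml)\s*\n(.*?)(?:" + FENCE + r"|\Z)", re.DOTALL
)
# Any line that starts and ends with a pipe counts as a table row.
TABLE_ROW_RE = re.compile(r"^\s*\|.*\|\s*$", re.MULTILINE)


def detect_format(markdown: str) -> str:
    """Return 'json', 'yaml', 'table', or 'none' for the given text."""
    fence = FENCE_RE.search(markdown)
    if fence:
        return fence.group(1)  # fenced code blocks take priority
    if TABLE_ROW_RE.search(markdown):
        return "table"
    return "none"


mixed = "Intro\n" + FENCE + 'json\n{"a": 1}\n' + FENCE + "\n| a |\n|---|\n| 1 |\n"
unclosed = FENCE + "yaml\nhost: example"
```

With these inputs, `detect_format(mixed)` picks the JSON block even though a table follows it, and `detect_format(unclosed)` still reports YAML despite the missing closing fence.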

Select Tables by Heading

When a document contains multiple tables, filter by the preceding Markdown heading:

from pydantic import BaseModel
from md2pydantic import MDConverter

class User(BaseModel):
    name: str
    age: int
    active: bool

markdown = """
## Current Staff

| name  | age | active |
|-------|-----|--------|
| Alice | 30  | Yes    |

## Former Staff

| name  | age | active |
|-------|-----|--------|
| Bob   | 25  | No     |
| Eve   | 35  | No     |
"""

current = MDConverter(User).parse_tables(markdown, heading="Current Staff")
# [User(name='Alice', age=30, active=True)]

former = MDConverter(User).parse_tables(markdown, heading="Former Staff")
# [User(name='Bob', age=25, active=False), User(name='Eve', age=35, active=False)]

Heading matching is case-insensitive and supports substring matches. You can also select by index with index=0.
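As a rough illustration of case-insensitive substring matching on headings, the sketch below groups pipe-table lines by the heading they appear under. It is an assumption-laden approximation rather than the library's code; `tables_by_heading` is a hypothetical helper.

```python
import re

HEADING_RE = re.compile(r"^#{1,6}\s+(.*)$")


def tables_by_heading(markdown: str, heading: str) -> list:
    """Collect pipe-table lines found under headings containing `heading`."""
    needle = heading.lower()
    tables = []
    current = []
    matching = False
    for line in markdown.splitlines():
        stripped = line.strip()
        m = HEADING_RE.match(stripped)
        if matching and m is None and stripped.startswith("|"):
            current.append(stripped)
            continue
        if current:  # any non-row line ends the table in progress
            tables.append(current)
            current = []
        if m:
            # Case-insensitive substring match against the heading text.
            matching = needle in m.group(1).lower()
    if current:
        tables.append(current)
    return tables


doc = """
## Current Staff

| name  | age |
|-------|-----|
| Alice | 30  |

## Former Staff

| name | age |
|------|-----|
| Bob  | 25  |
"""

current_staff = tables_by_heading(doc, "current")
all_staff = tables_by_heading(doc, "STAFF")
```

Selecting by position then amounts to indexing the filtered list, for example `all_staff[1]` for the second matching table.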

Handle Null Sentinels

Empty cells and common null placeholders become None for optional fields:

from pydantic import BaseModel
from md2pydantic import MDConverter

class Employee(BaseModel):
    name: str
    department: str
    salary: float | None = None

markdown = """
| name  | department  | salary |
|-------|-------------|--------|
| Alice | Engineering | 95000  |
| Bob   | Marketing   | N/A    |
| Carol | Sales       | -      |
"""

employees = MDConverter(Employee).parse_tables(markdown)
# employees[0].salary == 95000.0
# employees[1].salary is None  (from "N/A")
# employees[2].salary is None  (from "-")

Recognized null sentinels: "" (empty), "N/A", "NA", "null", "-", "—". Matching is case-insensitive.
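The sentinel list above maps naturally onto a small lookup. The following is a minimal sketch of that idea, not the library's implementation: `normalize_cell` is a hypothetical helper, and unlike md2pydantic it does not check whether the target field is optional.

```python
from typing import Optional

# Lowercased sentinels from the list above; "\u2014" is the em dash.
NULL_SENTINELS = {"", "n/a", "na", "null", "-", "\u2014"}


def normalize_cell(cell: str) -> Optional[str]:
    """Return None for recognized null placeholders, else the stripped cell."""
    text = cell.strip()
    return None if text.lower() in NULL_SENTINELS else text


salaries = [normalize_cell(c) for c in ["95000", " N/A ", "-", "NULL", ""]]
```

Lowercasing before the lookup is what makes "NULL", "Null", and "null" behave the same way.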

Error Handling

from pydantic import BaseModel
from md2pydantic import MDConverter, ExtractionError

class MyModel(BaseModel):  # placeholder model for illustration
    name: str

try:
    result = MDConverter(MyModel).parse_tables("no tables here")
except ExtractionError as e:
    print(e)            # Human-readable summary with line numbers
    print(e.errors)     # List of typed error details

ExtractionError is raised when:

No structured data is found in the input
Structured data is found but none of it validates against the model

Each error in .errors is either a TransformError (parsing failed) or ModelValidationError (Pydantic rejected the data), both with source location info.

Partial Results

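When a table mixes valid and invalid rows, pass partial=True to get the valid instances and the errors together:

from md2pydantic import MDConverter, PartialResult

# User and markdown are defined in "Select Tables by Heading" above.
result = MDConverter(User).parse_tables(markdown, partial=True)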
# result.data → list of valid User instances
# result.errors → list of typed errors with row locations
# result.has_errors → True if any rows failed

ExtractionError inherits from MD2PydanticError, so you can catch either.
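The idea behind partial results, keeping good rows while recording why the bad ones failed, can be sketched without the library. Everything here is hypothetical: `RowOutcome` is a stand-in that only mirrors the three attributes named above and is not md2pydantic's `PartialResult`, and `validate_rows` takes any callable in place of a Pydantic model.

```python
from dataclasses import dataclass, field


@dataclass
class RowOutcome:
    """Hypothetical stand-in exposing data, errors, and has_errors."""
    data: list = field(default_factory=list)
    errors: list = field(default_factory=list)

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)


def validate_rows(rows, validate) -> RowOutcome:
    """Apply `validate` to each row, keeping successes and failures apart."""
    outcome = RowOutcome()
    for index, row in enumerate(rows):
        try:
            outcome.data.append(validate(row))
        except (ValueError, TypeError) as exc:
            # Record the row position so the caller can locate the failure.
            outcome.errors.append((index, str(exc)))
    return outcome


def to_user(row: dict) -> dict:
    """Toy validator: age must parse as an integer."""
    return {"name": row["name"], "age": int(row["age"])}


raw_rows = [{"name": "Alice", "age": "30"}, {"name": "Bob", "age": "unknown"}]
outcome = validate_rows(raw_rows, to_user)
```

Here the first row validates and the second is reported with its index, so one bad row does not discard the rest of the table.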

Supported Formats

| Format | Method | Fenced | Inline | Recovery |
|--------|--------|--------|--------|----------|
| Markdown tables | `parse_tables()` | -- | Yes | Padded/truncated columns |
| JSON | `parse_json()` | Yes | Yes | Trailing commas, single quotes, unquoted keys, truncated JSON |
| YAML | `parse_yaml()` | Yes | -- | -- |
| Auto-detect | `parse()` | Yes | Yes | All of the above |
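The "Padded/truncated columns" recovery is not spelled out further here. One plausible reading, assumed for this sketch, is that rows with too few cells are padded with empty strings and rows with too many are cut to the header width. `fit_row` is a hypothetical helper, not part of md2pydantic.

```python
def fit_row(cells: list, width: int) -> list:
    """Pad short rows with empty strings and drop cells beyond `width`."""
    padding = [""] * (width - len(cells))  # empty when the row is long enough
    return (cells + padding)[:width]


headers = ["name", "price", "in_stock"]
short_row = fit_row(["Widget"], len(headers))
long_row = fit_row(["Gadget", "24.50", "No", "extra"], len(headers))
```

Under this reading, a padded empty cell would then be treated like any other empty cell, for example as a null sentinel on an optional field.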

API Reference

`MDConverter(model)`

Create a converter bound to a Pydantic v2 BaseModel subclass.

converter = MDConverter(MyModel)

`converter.parse_tables(markdown, *, index=None, heading=None) -> list[T]`

Extract Markdown tables and return validated model instances (one per row).

index -- only parse the table at this 0-based position (applied after heading filter)
heading -- only parse tables under headings matching this substring (case-insensitive)
Raises ExtractionError if no tables are found or no rows validate

`converter.parse_json(markdown) -> T`

Extract a JSON code block and return a single validated model instance. Tries each JSON block in document order, returning the first that validates.

Raises ExtractionError if no JSON blocks are found or none validate

`converter.parse_yaml(markdown) -> T`

Extract a YAML code block and return a single validated model instance.

Raises ExtractionError if no YAML blocks are found or none validate
Requires pyyaml (pip install md2pydantic[yaml])

`converter.parse(markdown) -> T | list[T]`

Auto-detect format. Tries code blocks (JSON/YAML) first, then tables.

Raises ExtractionError if no structured data is found or none validates

Exceptions

| Exception | Parent | Description |
|-----------|--------|-------------|
| `MD2PydanticError` | `Exception` | Base exception for the library |
| `ExtractionError` | `MD2PydanticError` | No data found or validation failed. Has `.errors` attribute. |

How It Works

md2pydantic follows a Seek, Clean, Validate pipeline:

Scanner -- Uses regex and heuristics to identify candidate blocks (JSON, YAML, Markdown tables) within the input. Handles triple-backtick enclosures, unclosed fences, and trailing prose.
Transformer -- Converts raw blocks into Python dictionaries. Fixes malformed JSON (trailing commas, single quotes, unquoted keys, truncated output). Converts table rows into dicts using headers as keys.
Validator -- Passes dictionaries to your Pydantic model. Pre-processes Yes/No booleans and null sentinels before handing off to Pydantic's native coercion engine.
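As an illustration of the Transformer stage, the sketch below repairs one of the listed problems, trailing commas, using only the standard library. It is a deliberately naive approximation and not md2pydantic's repair logic: `loads_lenient` is a hypothetical helper, and the regex would also alter a comma followed by a closing bracket inside a string value.

```python
import json
import re

# A comma followed only by whitespace and a closing brace or bracket.
TRAILING_COMMA_RE = re.compile(r",\s*([}\]])")


def loads_lenient(text: str):
    """Parse JSON, retrying once with trailing commas removed."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Naive repair: drop the comma and keep the closing delimiter.
        return json.loads(TRAILING_COMMA_RE.sub(r"\1", text))


raw = '{"host": "localhost", "port": 8080, "debug": true,}'
config = loads_lenient(raw)
```

Trying the strict parser first means well-formed input is never rewritten; the repair only runs after a decode failure.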

Development

git clone https://github.com/FelipeMorandini/md2pydantic.git
cd md2pydantic
uv sync --extra dev

uv run pytest              # run tests
uv run ruff check .        # lint
uv run ruff format .       # format
uv run mypy src/md2pydantic  # type check

See CONTRIBUTING.md for more details.

License

MIT
