Merged
168 changes: 98 additions & 70 deletions README.md
@@ -1,10 +1,16 @@
# mdvs — Markdown Validation & Search

<div align="center">

[![CI](https://github.com/edochi/mdvs/actions/workflows/ci.yml/badge.svg)](https://github.com/edochi/mdvs/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/mdvs)](https://crates.io/crates/mdvs)
[![downloads](https://img.shields.io/crates/d/mdvs)](https://crates.io/crates/mdvs)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Rust](https://img.shields.io/badge/rust-2021-orange.svg)](https://www.rust-lang.org/)
[![Docs](https://img.shields.io/badge/docs-mdBook-green.svg)](https://edochi.github.io/mdvs/)

</div>

<div align="center">

:x: A Document Database
@@ -13,7 +19,27 @@

</div>

mdvs infers a schema from your frontmatter, validates it, and gives you semantic search with SQL filtering. Single binary, no cloud, no setup.
Schema inference, frontmatter validation, and semantic search for markdown directories. Single binary, no cloud, no setup.

## Why mdvs?

Markdown files can have a YAML block at the top called **frontmatter** — structured fields that describe the document:

```markdown
---
title: Rust Tips
tags: [rust, programming]
draft: false
---

# Rust Tips

Your content here...
```

`title`, `tags`, and `draft` are frontmatter fields. Most tools treat these as flat text or ignore them entirely. mdvs sees structure — your directories, your fields, your types. It infers which fields belong in which directories, validates that they're consistent, and lets you search everything with natural language and SQL.

No config to write. No schema to define. Point it at a directory and it figures it out.
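
As a rough illustration of what "extracting frontmatter" involves, here is a minimal Python sketch that separates the leading `---` block from the body. The function name `split_frontmatter` is hypothetical and is not part of mdvs; a real tool would also parse the block as YAML, which this sketch leaves out.

```python
def split_frontmatter(text: str):
    """Return (raw_frontmatter, body); raw_frontmatter is None if there is no block."""
    lines = text.splitlines()
    # Frontmatter must start on the very first line with a `---` delimiter.
    if not lines or lines[0].strip() != "---":
        return None, text
    # Look for the closing `---` delimiter.
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    # No closing delimiter: treat the whole file as body.
    return None, text


doc = """---
title: Rust Tips
tags: [rust, programming]
draft: false
---

# Rust Tips

Your content here...
"""

raw, body = split_frontmatter(doc)
print(raw)
```

A file with no opening `---` comes back as `(None, text)`, so callers can treat "no frontmatter" and "empty frontmatter" differently.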

## Install

@@ -37,118 +63,120 @@ cd mdvs
cargo install --path .
```

## Quick Start

```bash
# Initialize: scans your files, infers a schema, builds a search index
mdvs init ~/notes
## How it works

# Search with natural language
mdvs search "how to handle errors in rust"
mdvs treats your markdown directory as a database — and your directory structure as part of the schema.

# Filter results with SQL on frontmatter fields
mdvs search "async patterns" --where "draft = false" --limit 5
Consider a simple knowledge base:

# Validate frontmatter against the inferred schema
mdvs check
```
notes/
├── blog/
│   ├── rust-tips.md        ← title, tags, draft
│   └── half-baked-idea.md  ← title, draft
├── team/
│   ├── alice.md            ← title, role, email
│   └── bob.md              ← title, role
└── meetings/
    └── weekly.md           ← title, date, attendees
```

That's it. No config files to write, no models to download manually, no services to start.

## Features

### Schema inference
Different directories, different fields. mdvs sees this.

mdvs scans your markdown files and infers a typed schema from frontmatter — field names, types (boolean, integer, float, string, arrays, nested objects), which directories they appear in, and which ones are required. The schema is written to `mdvs.toml` and can be customized.
### Infer

```bash
mdvs init ~/notes
# Discovered 10 fields across 496 files
# tags String[] (required in ["**"])
# draft Boolean (allowed in ["blog/**"])
# year Integer (required in ["articles/**"])
# ...
mdvs init notes/
```

### Frontmatter validation

Check your files against the schema — catch missing required fields, wrong types, and fields that appear where they shouldn't.
mdvs scans every file, extracts frontmatter, and infers which fields belong where:

```bash
mdvs check
# blog/draft.md: missing required field 'tags'
# blog/old-post.md: field 'year' expected Integer, got String
```
Initialized 5 files — 7 field(s)

"title"      String    5/5  required everywhere
"draft"      Boolean   2/5  only in blog/
"tags"       String[]  1/5  only in blog/
"role"       String    2/5  required in team/
"email"      String    1/5  only in team/
"date"       String    1/5  only in meetings/
"attendees"  String[]  1/5  only in meetings/
```

### Semantic search
`draft` belongs in `blog/`. `role` belongs in `team/`. The directory structure is the schema.

Instant vector search using lightweight static embeddings ([Model2Vec](https://github.com/MinishLab/model2vec)). The default model is 8MB — no GPU, no API keys, no network access needed at query time.
### Validate

```bash
mdvs search "distributed consensus algorithms"
0.72 notes/raft.md
0.68 notes/paxos.md
0.61 blog/distributed-systems.md
```
Two new files appear — both without `role`:

All commands support `--output json` for scripting and pipelines:
```
notes/
├── blog/
│   └── new-post.md  ← title, draft (no role)
├── team/
│   └── charlie.md   ← title (no role)
└── ...
```

```bash
mdvs search "distributed consensus" --output json
mdvs check notes/
```

```json
{
"hits": [
{ "filename": "notes/raft.md", "score": 0.72 },
{ "filename": "notes/paxos.md", "score": 0.68 },
{ "filename": "blog/distributed-systems.md", "score": 0.61 }
]
}
```
1 violation — "role" MissingRequired in team/charlie.md
```

### SQL filtering
`charlie.md` is missing `role` — but `new-post.md` isn't flagged. mdvs knows `role` belongs in `team/`, not in `blog/`.

Filter search results on any frontmatter field using SQL syntax, powered by [DataFusion](https://datafusion.apache.org/).
### Search

```bash
mdvs search "rust" --where "draft = false AND year >= 2024"
mdvs search "recipes" --where "tags IS NOT NULL" --limit 5
mdvs search "weekly sync" notes/
```

### Incremental builds
```
1 meetings/weekly.md 0.82
2 team/alice.md 0.45
```

Only changed files are re-embedded. Unchanged files keep their existing chunks and embeddings. If nothing changed, the model isn't even loaded.
Filter with SQL on frontmatter fields:

```bash
mdvs build
# Built index: 3 new, 1 edited, 492 unchanged, 0 removed (4 files embedded)
mdvs search "rust" notes/ --where "draft = false"
```

No config files to write. No models to download manually. No services to start.

> **Try it yourself!** Clone the repo and explore a richer example — 43 files across 8 directories, with type widening, nullable fields, nested objects, and deliberate edge cases:
> ```bash
> git clone https://github.com/edochi/mdvs.git
> cd mdvs
> mdvs init example_kb/
> mdvs search "experiment" example_kb/
> ```

## Features

- **Schema inference** — types (boolean, integer, float, string, arrays, nested objects), path constraints (allowed/required per directory), nullable detection. All automatic.
- **Frontmatter validation** — wrong types, disallowed fields, missing required fields, null violations. Four independent checks, path-aware.
- **Semantic search** — instant vector search using lightweight [Model2Vec](https://minish.ai/) static embeddings. Default model is ~30MB. No GPU, no API keys.
- **SQL filtering** — `--where` clauses on any frontmatter field, powered by [DataFusion](https://datafusion.apache.org/). Arrays, nested objects, LIKE, IS NULL — full SQL.
- **Incremental builds** — only changed files are re-embedded. Unchanged files keep their chunks. If nothing changed, the model isn't even loaded.
- **Auto pipeline** — `search` auto-builds the index. `build` auto-updates the schema. One command does everything: `mdvs search "query"`.
- **JSON output** — all commands support `--output json` for scripting and CI.
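
The incremental-build idea in the list above can be sketched in a few lines of Python. This is an assumption-laden illustration: it detects changes with a SHA-256 content hash, which may not be how mdvs does it, and `plan_build` is a hypothetical name. The four buckets mirror the `new` / `edited` / `unchanged` / `removed` counts that `mdvs build` reports.

```python
import hashlib


def content_hash(text: str) -> str:
    # Assumption: a content hash decides whether a file changed.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_build(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """previous maps path -> stored hash; current maps path -> file text."""
    plan = {"new": [], "edited": [], "unchanged": [], "removed": []}
    for path, text in sorted(current.items()):
        if path not in previous:
            plan["new"].append(path)
        elif previous[path] != content_hash(text):
            plan["edited"].append(path)
        else:
            plan["unchanged"].append(path)
    # Files that were indexed before but no longer exist on disk.
    plan["removed"] = sorted(set(previous) - set(current))
    return plan


def needs_model(plan: dict[str, list[str]]) -> bool:
    # Only new or edited files need embedding, so the model can stay unloaded otherwise.
    return bool(plan["new"] or plan["edited"])
```

Only the `new` and `edited` buckets would be re-embedded; if both are empty, the embedding step can be skipped entirely.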

## Commands

| Command | Description |
|---------|-------------|
| `init` | Scan files, infer schema, write `mdvs.toml`, optionally build index |
| `init` | Scan files, infer schema, write `mdvs.toml` |
| `check` | Validate frontmatter against schema |
| `update` | Re-scan and update field definitions |
| `build` | Validate + embed + write search index |
| `search` | Semantic search with optional SQL filtering |
| `info` | Show config and index status |
| `clean` | Delete search index |

## How it works

mdvs treats your markdown directory like a database:

- **`init`** scans your files and infers a schema from frontmatter — like `CREATE TABLE`
- **`check`** validates every file against that schema — like constraint checking
- **`update`** detects new fields as your files evolve — like `ALTER TABLE`
- **`build`** chunks and embeds your content into a local Parquet index
- **`search`** queries that index with SQL filtering on metadata — like `SELECT ... WHERE ... ORDER BY similarity`
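
To make the `SELECT ... WHERE ... ORDER BY similarity` analogy concrete, here is a small Python sketch of filtered similarity ranking. It is illustrative only: mdvs stores its index in Parquet and filters with DataFusion SQL, whereas this sketch uses an in-memory list and a plain Python predicate. The row layout (`path`, `meta`, `vector`) and the `search` function are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, where=lambda meta: True, limit=10):
    """Filter rows on metadata (the WHERE step), then rank by similarity."""
    hits = [
        (cosine(query_vec, row["vector"]), row["path"])
        for row in index
        if where(row["meta"])
    ]
    hits.sort(key=lambda hit: hit[0], reverse=True)  # ORDER BY similarity DESC
    return hits[:limit]


# Toy index with made-up two-dimensional vectors.
INDEX = [
    {"path": "blog/a.md", "meta": {"draft": False}, "vector": [1.0, 0.0]},
    {"path": "blog/b.md", "meta": {"draft": True}, "vector": [0.6, 0.8]},
    {"path": "blog/c.md", "meta": {"draft": False}, "vector": [0.0, 1.0]},
]

for score, path in search(INDEX, [1.0, 0.0], where=lambda m: m["draft"] is False):
    print(f"{score:.2f} {path}")
```

The predicate plays the role of `--where "draft = false"`: rows that fail it never reach the ranking step.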

Two artifacts: `mdvs.toml` (committed, your schema) and `.mdvs/` (gitignored, the search index).

## Documentation

Full documentation at [edochi.github.io/mdvs](https://edochi.github.io/mdvs/).
17 changes: 16 additions & 1 deletion book/src/getting-started.md
@@ -87,6 +87,21 @@ That command did three things:
2. **Inferred** 37 typed fields — strings, integers, floats, booleans, arrays, even a nested object (`calibration`)
3. **Wrote** `mdvs.toml` with the inferred schema

Notice the third column: `draft` appears in 8/43 files — all in `blog/`. `sensor_type` in 3/43 — all in `projects/alpha/notes/`. mdvs captured not just the types, but *where* each field belongs. Run `mdvs init example_kb -v` to see the full path patterns.

Here's what a field definition looks like in `mdvs.toml`:

```toml
[[fields.field]]
name = "sensor_type"
type = "String"
allowed = ["projects/alpha/notes/**"]
required = ["projects/alpha/notes/**"]
nullable = false
```

This means `sensor_type` is allowed only in experiment notes, and required there. If it appears in a blog post, `check` will flag it. If it's missing from an experiment note, `check` will flag that too.
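
The two path-aware checks described above can be sketched in Python as follows. This is a simplified model, not mdvs's implementation: it uses `fnmatch` as a stand-in for the real glob matcher (here `*` also matches `/`, so `**` matches any depth), and `FieldDef` and `check_file` are hypothetical names. The violation labels `Disallowed` and `MissingRequired` are the ones mdvs prints.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class FieldDef:
    name: str
    allowed: list[str]
    required: list[str]


def matches(path: str, patterns: list[str]) -> bool:
    # Assumption: fnmatch-style matching approximates the schema's glob patterns.
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def check_file(path: str, present: set[str], schema: list[FieldDef]) -> list[tuple[str, str]]:
    """Return (field, violation) pairs for one file, given its frontmatter field names."""
    violations = []
    for field_def in schema:
        has_field = field_def.name in present
        if has_field and not matches(path, field_def.allowed):
            violations.append((field_def.name, "Disallowed"))
        if not has_field and matches(path, field_def.required):
            violations.append((field_def.name, "MissingRequired"))
    return violations


SENSOR = FieldDef(
    name="sensor_type",
    allowed=["projects/alpha/notes/**"],
    required=["projects/alpha/notes/**"],
)

print(check_file("blog/post.md", {"title", "sensor_type"}, [SENSOR]))
print(check_file("projects/alpha/notes/exp1.md", {"title"}, [SENSOR]))
```

The type and nullability checks are left out; they would sit alongside these two as independent passes over the same field list.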

One artifact is created by `init`: **`mdvs.toml`** — the schema file. Commit this to version control. The `.mdvs/` directory (search index) is created later on first `build` or `search`.

## Validate
@@ -101,7 +116,7 @@ mdvs check example_kb
Checked 43 files — no violations
```

Since `mdvs init` just inferred the schema from these same files, everything passes. The power of `check` comes after you tighten the schema.
Since `mdvs init` just inferred the schema from these same files, everything passes. The power of `check` comes after you tighten the schema — or when files drift from it. Try adding `sensor_type: SPR-A1` to a blog post — mdvs will flag it as `Disallowed` because that field doesn't belong there.

### What violations look like

19 changes: 16 additions & 3 deletions book/src/introduction.md
@@ -32,17 +32,29 @@ tags: # String[]

mdvs recognizes these types automatically. When it scans your files, it infers the type of each field from the values it finds — no configuration needed.

## Directory-aware schema

mdvs infers a three-dimensional schema from your files:

- **Types** — boolean, integer, float, string, arrays, nested objects. Inferred automatically, with widening when files disagree.
- **Paths** — which fields belong in which directories. `draft` only in `blog/`, `sensor_type` only in `projects/alpha/notes/`. Captured as `allowed` and `required` glob patterns.
- **Nullability** — whether a field can be null. Tracked per field.

This means different directories can have different fields with different constraints — all inferred automatically from your existing files.

> **Tightest fit:** `mdvs init` infers the strictest schema that's consistent with your existing files. A field is inferred as *allowed* in a directory if at least one file there has it. It's inferred as *required* if every file there has it. These rules propagate up — if every subdirectory requires a field, the parent directory does too. The result is the tightest set of constraints where `check` still returns zero violations. You can always loosen them later.
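
The tightest-fit rule can be sketched in Python. This is a simplified reading of the rule, not the actual inference code: `infer_paths` is a hypothetical name, "allowed" is taken to mean directories that directly contain a file with the field (plus every required directory), and "required" means every file beneath the directory has it, which gives the upward propagation for free. How mdvs itself collapses patterns may differ.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def _glob(directory: str) -> str:
    return f"{directory}/**" if directory else "**"


def _topmost(dirs: set[str]) -> list[str]:
    """Drop any directory that already has an ancestor in the set."""
    def covered(d: str) -> bool:
        return any(a == "" or d.startswith(a + "/") for a in dirs if a != d)
    return sorted(d for d in dirs if not covered(d))


def infer_paths(files: dict[str, set[str]]) -> dict[str, dict[str, list[str]]]:
    """files maps a relative path to the set of frontmatter field names it contains."""
    direct = defaultdict(list)  # directory -> field sets of files directly inside it
    nested = defaultdict(list)  # directory -> field sets of all files beneath it
    for path, fields in files.items():
        parents = ["" if str(p) == "." else str(p) for p in PurePosixPath(path).parents]
        direct[parents[0]].append(fields)
        for d in parents:
            nested[d].append(fields)

    schema = {}
    for name in sorted(set().union(*files.values())):
        # Required: every file under the directory has the field.
        required = {d for d, sets in nested.items() if all(name in s for s in sets)}
        # Allowed: at least one file directly in the directory has the field.
        allowed = {d for d, sets in direct.items() if any(name in s for s in sets)}
        schema[name] = {
            "allowed": [_glob(d) for d in _topmost(allowed | required)],
            "required": [_glob(d) for d in _topmost(required)],
        }
    return schema


NOTES = {
    "blog/rust-tips.md": {"title", "tags", "draft"},
    "blog/half-baked-idea.md": {"title", "draft"},
    "team/alice.md": {"title", "role", "email"},
    "team/bob.md": {"title", "role"},
    "meetings/weekly.md": {"title", "date", "attendees"},
}

for name, rules in infer_paths(NOTES).items():
    print(name, rules)
```

Because the schema is derived from the same files it will later validate, a check against those files finds no missing-required or disallowed fields by construction.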

## Two layers

mdvs has two distinct capabilities that work independently:

**Validation** — Scan your files, infer what frontmatter fields exist, where they appear, and what types they have. Write the result to `mdvs.toml`. Then validate files against that schema. No model, no index, nothing to download.
**Validation** — Scan your files, infer what frontmatter fields exist, which directories they appear in, and what types they have. Write the result to `mdvs.toml`. Then validate files against that schema. No model, no index, nothing to download.

**Search** — Chunk your markdown, embed it with a lightweight local model, store the vectors in Parquet files in `.mdvs/`, and query with natural language. Filter results on any frontmatter field using standard SQL.

You need validation without search? Run `mdvs init --suppress-auto-build`, customise the fields in `mdvs.toml`, and run `mdvs check` to validate your files.
You need validation without search? Run `mdvs init`, customize the fields in `mdvs.toml`, and run `mdvs check`.

You want search without validation? Just run `mdvs init` and `mdvs search` to get going. The inferred schema is used to extract metadata for search results, but you don't have to worry about it if you don't want to.
You want search without validation? Just run `mdvs init` and `mdvs search`. The inferred schema is used to extract metadata for search results, but you don't have to worry about it if you don't want to.

Use them together for the best experience, or separately if that's what you need.

@@ -53,6 +65,7 @@ You can think of mdvs as a layer on top of your markdown files that gives you da
| Concept | Database | mdvs |
|---|---|---|
| Define structure | `CREATE TABLE` | `mdvs init` |
| Per-table columns | Different columns per table | Per-directory fields via `allowed`/`required` globs |
| Enforce constraints | Constraint validation | `mdvs check` |
| Evolve structure | `ALTER TABLE` | `mdvs update` |
| Create an index | `CREATE INDEX` | `mdvs build` |
70 changes: 70 additions & 0 deletions docs/spec/todos/TODO-0118.md
@@ -0,0 +1,70 @@
---
id: 118
title: "Rework README and book intro to show directory-aware schema"
status: todo
priority: high
created: 2026-03-17
depends_on: []
blocks: []
---

# TODO-0118: Rework README and book intro to show directory-aware schema

## Summary

The README and book introduction don't surface mdvs's core differentiator: it infers schema constraints from directory structure — different fields, different requirements, different types per directory. This is unique among all competitors and should be immediately visible.

## Problem

The current README Quick Start uses generic examples (`mdvs init ~/notes`) that don't show path-awareness. The check output (`missing required field 'tags'`) could be any flat linter. Nothing shows "this field isn't allowed in this directory" or "this field is required only in this path."

The competitive analysis (March 2026) confirmed: no other tool combines schema inference + path-based validation + search. But the README reads like a feature list, not a demonstration.

## Approach: show, don't tell

Instead of a "Why mdvs?" sales pitch, demonstrate the three dimensions in action with a tiny, self-explanatory example.

### README

Create a minimal example structure (5-6 files, 3 directories) that makes the point instantly:

```
notes/
├── blog/
│   ├── my-post.md  ← title, tags, draft
│   └── idea.md     ← title, draft
├── people/
│   ├── alice.md    ← title, role, email
│   └── bob.md      ← title, role
└── meetings/
    └── standup.md  ← title, date, attendees
```

Three commands, three "aha" moments:
1. **Init** — output shows `draft` only in `blog/`, `role` only in `people/`, `attendees` only in `meetings/`. User sees: "it understood my structure."
2. **Check** — catches a structural mistake (e.g., `draft` leaked into `people/`, or missing `role`). User sees: "it caught something a flat linter wouldn't."
3. **Search** — finds relevant content. User sees: "it works."

This replaces the current Quick Start section. Keep it tight — the whole example should be scrollable without pausing.

### Book intro / getting-started

The full `example_kb` walkthrough goes deeper:
- Show the complete directory tree
- Show the inferred schema with all path patterns
- Walk through type widening, nullable, nested objects
- Show violations that only make sense with path awareness
- Show search with `--where` filtering by path and metadata

### Considerations

- The README example should be a real, runnable mini-vault (maybe ship it as `examples/quick-start/` in the repo, or generate it in the Quick Start instructions)
- The output examples must be real captured output, not pseudocode
- The book already has `example_kb` — it needs better framing ("notice how `drift_rate` is required only in `projects/alpha/notes/`"), not new content

## Files

- `README.md` — rework Quick Start section
- `book/src/getting-started.md` — reframe around directory-aware schema
- `book/src/introduction.md` — update pitch if needed
- Possibly create `examples/quick-start/` with the mini-vault
1 change: 1 addition & 0 deletions docs/spec/todos/index.md
@@ -119,3 +119,4 @@
| [0115](TODO-0115.md) | Embed asciinema recordings in mdBook for interactive demos | todo | medium | 2026-03-17 |
| [0116](TODO-0116.md) | Trim DataFusion default features to reduce binary size | todo | medium | 2026-03-17 |
| [0117](TODO-0117.md) | Fix null values skipping Disallowed and NullNotAllowed checks | done | high | 2026-03-17 |
| [0118](TODO-0118.md) | Rework README and book intro to show directory-aware schema | todo | high | 2026-03-17 |