diff --git a/README.md b/README.md
index cc8f206..8f514eb 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,16 @@
 # mdvs — Markdown Validation & Search
+
+
 [![CI](https://github.com/edochi/mdvs/actions/workflows/ci.yml/badge.svg)](https://github.com/edochi/mdvs/actions/workflows/ci.yml)
+[![crates.io](https://img.shields.io/crates/v/mdvs)](https://crates.io/crates/mdvs)
+[![downloads](https://img.shields.io/crates/d/mdvs)](https://crates.io/crates/mdvs)
 [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
 [![Rust](https://img.shields.io/badge/rust-2021-orange.svg)](https://www.rust-lang.org/)
 [![Docs](https://img.shields.io/badge/docs-mdBook-green.svg)](https://edochi.github.io/mdvs/)
+
+
 :x: A Document Database
@@ -13,7 +19,27 @@
-mdvs infers a schema from your frontmatter, validates it, and gives you semantic search with SQL filtering. Single binary, no cloud, no setup.
+Schema inference, frontmatter validation, and semantic search for markdown directories. Single binary, no cloud, no setup.
+
+## Why mdvs?
+
+Markdown files can have a YAML block at the top called **frontmatter** — structured fields that describe the document:
+
+```markdown
+---
+title: Rust Tips
+tags: [rust, programming]
+draft: false
+---
+
+# Rust Tips
+
+Your content here...
+```
+
+`title`, `tags`, and `draft` are frontmatter fields. Most tools treat these as flat text or ignore them entirely. mdvs sees structure — your directories, your fields, your types. It infers which fields belong in which directories, validates that they're consistent, and lets you search everything with natural language and SQL.
+
+No config to write. No schema to define. Point it at a directory and it figures it out.
 
 ## Install
 
@@ -37,99 +63,113 @@ cd mdvs
 cargo install --path .
 ```
 
-## Quick Start
-
-```bash
-# Initialize: scans your files, infers a schema, builds a search index
-mdvs init ~/notes
+## How it works
 
-# Search with natural language
-mdvs search "how to handle errors in rust"
+mdvs treats your markdown directory as a database — and your directory structure as part of the schema.
 
-# Filter results with SQL on frontmatter fields
-mdvs search "async patterns" --where "draft = false" --limit 5
+Consider a simple knowledge base:
 
-# Validate frontmatter against the inferred schema
-mdvs check
+```
+notes/
+├── blog/
+│   ├── rust-tips.md ← title, tags, draft
+│   └── half-baked-idea.md ← title, draft
+├── team/
+│   ├── alice.md ← title, role, email
+│   └── bob.md ← title, role
+└── meetings/
+    └── weekly.md ← title, date, attendees
 ```
 
-That's it. No config files to write, no models to download manually, no services to start.
-
-## Features
-
-### Schema inference
+Different directories, different fields. mdvs sees this.
 
-mdvs scans your markdown files and infers a typed schema from frontmatter — field names, types (boolean, integer, float, string, arrays, nested objects), which directories they appear in, and which ones are required. The schema is written to `mdvs.toml` and can be customized.
+### Infer
 
 ```bash
-mdvs init ~/notes
-# Discovered 10 fields across 496 files
-# tags String[] (required in ["**"])
-# draft Boolean (allowed in ["blog/**"])
-# year Integer (required in ["articles/**"])
-# ...
+mdvs init notes/
 ```
 
-### Frontmatter validation
-
-Check your files against the schema — catch missing required fields, wrong types, and fields that appear where they shouldn't.
+mdvs scans every file, extracts frontmatter, and infers which fields belong where:
 
-```bash
-mdvs check
-# blog/draft.md: missing required field 'tags'
-# blog/old-post.md: field 'year' expected Integer, got String
+```
+Initialized 5 files — 7 field(s)
+
+  "title"      String    5/5  required everywhere
+  "draft"      Boolean   2/5  only in blog/
+  "tags"       String[]  1/5  only in blog/
+  "role"       String    2/5  required in team/
+  "email"      String    1/5  only in team/
+  "date"       String    1/5  only in meetings/
+  "attendees"  String[]  1/5  only in meetings/
 ```
 
-### Semantic search
+`draft` belongs in `blog/`. `role` belongs in `team/`. The directory structure is the schema.
 
-Instant vector search using lightweight static embeddings ([Model2Vec](https://github.com/MinishLab/model2vec)). The default model is 8MB — no GPU, no API keys, no network access needed at query time.
+### Validate
 
-```bash
-mdvs search "distributed consensus algorithms"
-0.72 notes/raft.md
-0.68 notes/paxos.md
-0.61 blog/distributed-systems.md
-```
+Two new files appear — both without `role`:
 
-All commands support `--output json` for scripting and pipelines:
+```
+notes/
+├── blog/
+│   └── new-post.md ← title, draft (no role)
+├── team/
+│   └── charlie.md ← title (no role)
+└── ...
+```
 
 ```bash
-mdvs search "distributed consensus" --output json
+mdvs check notes/
 ```
 
-```json
-{
-  "hits": [
-    { "filename": "notes/raft.md", "score": 0.72 },
-    { "filename": "notes/paxos.md", "score": 0.68 },
-    { "filename": "blog/distributed-systems.md", "score": 0.61 }
-  ]
-}
+```
+1 violation — "role" MissingRequired in team/charlie.md
 ```
 
-### SQL filtering
+`charlie.md` is missing `role` — but `new-post.md` isn't flagged. mdvs knows `role` belongs in `team/`, not in `blog/`.
 
-Filter search results on any frontmatter field using SQL syntax, powered by [DataFusion](https://datafusion.apache.org/).
+### Search
 
 ```bash
-mdvs search "rust" --where "draft = false AND year >= 2024"
-mdvs search "recipes" --where "tags IS NOT NULL" --limit 5
+mdvs search "weekly sync" notes/
 ```
 
-### Incremental builds
+```
+1 meetings/weekly.md 0.82
+2 team/alice.md 0.45
+```
 
-Only changed files are re-embedded. Unchanged files keep their existing chunks and embeddings. If nothing changed, the model isn't even loaded.
+Filter with SQL on frontmatter fields:
 
 ```bash
-mdvs build
-# Built index: 3 new, 1 edited, 492 unchanged, 0 removed (4 files embedded)
+mdvs search "rust" notes/ --where "draft = false"
 ```
 
+No config files to write. No models to download manually. No services to start.
+
+> **Try it yourself!** Clone the repo and explore a richer example — 43 files across 8 directories, with type widening, nullable fields, nested objects, and deliberate edge cases:
+> ```bash
+> git clone https://github.com/edochi/mdvs.git
+> cd mdvs
+> mdvs init example_kb/
+> mdvs search "experiment" example_kb/
+> ```
+
+## Features
+
+- **Schema inference** — types (boolean, integer, float, string, arrays, nested objects), path constraints (allowed/required per directory), nullable detection. All automatic.
+- **Frontmatter validation** — wrong types, disallowed fields, missing required fields, null violations. Four independent checks, path-aware.
+- **Semantic search** — instant vector search using lightweight [Model2Vec](https://minish.ai/) static embeddings. Default model is ~30MB. No GPU, no API keys.
+- **SQL filtering** — `--where` clauses on any frontmatter field, powered by [DataFusion](https://datafusion.apache.org/). Arrays, nested objects, LIKE, IS NULL — full SQL.
+- **Incremental builds** — only changed files are re-embedded. Unchanged files keep their chunks. If nothing changed, the model isn't even loaded.
+- **Auto pipeline** — `search` auto-builds the index. `build` auto-updates the schema. One command does everything: `mdvs search "query"`.
+- **JSON output** — all commands support `--output json` for scripting and CI.
+
 ## Commands
 
 | Command | Description |
 |---------|-------------|
-| `init` | Scan files, infer schema, write `mdvs.toml`, optionally build index |
+| `init` | Scan files, infer schema, write `mdvs.toml` |
 | `check` | Validate frontmatter against schema |
 | `update` | Re-scan and update field definitions |
 | `build` | Validate + embed + write search index |
@@ -137,18 +177,6 @@ mdvs build
 | `info` | Show config and index status |
 | `clean` | Delete search index |
 
-## How it works
-
-mdvs treats your markdown directory like a database:
-
-- **`init`** scans your files and infers a schema from frontmatter — like `CREATE TABLE`
-- **`check`** validates every file against that schema — like constraint checking
-- **`update`** detects new fields as your files evolve — like `ALTER TABLE`
-- **`build`** chunks and embeds your content into a local Parquet index
-- **`search`** queries that index with SQL filtering on metadata — like `SELECT ... WHERE ... ORDER BY similarity`
-
-Two artifacts: `mdvs.toml` (committed, your schema) and `.mdvs/` (gitignored, the search index).
-
 ## Documentation
 
 Full documentation at [edochi.github.io/mdvs](https://edochi.github.io/mdvs/).
diff --git a/book/src/getting-started.md b/book/src/getting-started.md
index 415e003..7f5c5b8 100644
--- a/book/src/getting-started.md
+++ b/book/src/getting-started.md
@@ -87,6 +87,21 @@ That command did three things:
 2. **Inferred** 37 typed fields — strings, integers, floats, booleans, arrays, even a nested object (`calibration`)
 3. **Wrote** `mdvs.toml` with the inferred schema
 
+Notice the third column: `draft` appears in 8/43 files — all in `blog/`. `sensor_type` in 3/43 — all in `projects/alpha/notes/`. mdvs captured not just the types, but *where* each field belongs. Run `mdvs init example_kb -v` to see the full path patterns.
+
+Here's what a field definition looks like in `mdvs.toml`:
+
+```toml
+[[fields.field]]
+name = "sensor_type"
+type = "String"
+allowed = ["projects/alpha/notes/**"]
+required = ["projects/alpha/notes/**"]
+nullable = false
+```
+
+This means `sensor_type` is allowed only in experiment notes, and required there. If it appears in a blog post, `check` will flag it. If it's missing from an experiment note, `check` will flag that too.
+
 One artifact is created by `init`: **`mdvs.toml`** — the schema file. Commit this to version control. The `.mdvs/` directory (search index) is created later on first `build` or `search`.
 
 ## Validate
@@ -101,7 +116,7 @@ mdvs check example_kb
 Checked 43 files — no violations
 ```
 
-Since `mdvs init` just inferred the schema from these same files, everything passes. The power of `check` comes after you tighten the schema.
+Since `mdvs init` just inferred the schema from these same files, everything passes. The power of `check` comes after you tighten the schema — or when files drift from it. Try adding `sensor_type: SPR-A1` to a blog post — mdvs will flag it as `Disallowed` because that field doesn't belong there.
 
 ### What violations look like
 
diff --git a/book/src/introduction.md b/book/src/introduction.md
index 144f889..81d0b67 100644
--- a/book/src/introduction.md
+++ b/book/src/introduction.md
@@ -32,17 +32,29 @@ tags: # String[]
 
 mdvs recognizes these types automatically. When it scans your files, it infers the type of each field from the values it finds — no configuration needed.
 
+## Directory-aware schema
+
+mdvs infers a three-dimensional schema from your files:
+
+- **Types** — boolean, integer, float, string, arrays, nested objects. Inferred automatically, with widening when files disagree.
+- **Paths** — which fields belong in which directories. `draft` only in `blog/`, `sensor_type` only in `projects/alpha/notes/`. Captured as `allowed` and `required` glob patterns.
+- **Nullability** — whether a field can be null. Tracked per field.
+
+This means different directories can have different fields with different constraints — all inferred automatically from your existing files.
+
+> **Tightest fit:** `mdvs init` infers the strictest schema that's consistent with your existing files. A field is inferred as *allowed* in a directory if at least one file there has it. It's inferred as *required* if every file there has it. These rules propagate up — if every subdirectory requires a field, the parent directory does too. The result is the tightest set of constraints where `check` still returns zero violations. You can always loosen them later.
+
 ## Two layers
 
 mdvs has two distinct capabilities that work independently:
 
-**Validation** — Scan your files, infer what frontmatter fields exist, where they appear, and what types they have. Write the result to `mdvs.toml`. Then validate files against that schema. No model, no index, nothing to download.
+**Validation** — Scan your files, infer what frontmatter fields exist, which directories they appear in, and what types they have. Write the result to `mdvs.toml`. Then validate files against that schema. No model, no index, nothing to download.
 
 **Search** — Chunk your markdown, embed it with a lightweight local model, store the vectors in Parquet files in `.mdvs/`, and query with natural language. Filter results on any frontmatter field using standard SQL.
 
-You need validation without search? Run `mdvs init --suppress-auto-build`, customise the fields in `mdvs.toml`, and run `mdvs check` to validate your files.
+You need validation without search? Run `mdvs init`, customize the fields in `mdvs.toml`, and run `mdvs check`.
 
-You want search without validation? Just run `mdvs init` and `mdvs search` to get going. The inferred schema is used to extract metadata for search results, but you don't have to worry about it if you don't want to.
+You want search without validation? Just run `mdvs init` and `mdvs search`. The inferred schema is used to extract metadata for search results, but you don't have to worry about it if you don't want to.
 
 Use them together for the best experience, or separately if that's what you need.
 
@@ -53,6 +65,7 @@ You can think of mdvs as a layer on top of your markdown files that gives you da
 | Concept | Database | mdvs |
 |---|---|---|
 | Define structure | `CREATE TABLE` | `mdvs init` |
+| Per-table columns | Different columns per table | Per-directory fields via `allowed`/`required` globs |
 | Enforce constraints | Constraint validation | `mdvs check` |
 | Evolve structure | `ALTER TABLE` | `mdvs update` |
 | Create an index | `CREATE INDEX` | `mdvs build` |
diff --git a/docs/spec/todos/TODO-0118.md b/docs/spec/todos/TODO-0118.md
new file mode 100644
index 0000000..f4eaab5
--- /dev/null
+++ b/docs/spec/todos/TODO-0118.md
@@ -0,0 +1,70 @@
+---
+id: 118
+title: "Rework README and book intro to show directory-aware schema"
+status: todo
+priority: high
+created: 2026-03-17
+depends_on: []
+blocks: []
+---
+
+# TODO-0118: Rework README and book intro to show directory-aware schema
+
+## Summary
+
+The README and book introduction don't surface mdvs's core differentiator: it infers schema constraints from directory structure — different fields, different requirements, different types per directory. This is unique among all competitors and should be immediately visible.
+
+## Problem
+
+The current README Quick Start uses generic examples (`mdvs init ~/notes`) that don't show path-awareness. The check output (`missing required field 'tags'`) could be any flat linter. Nothing shows "this field isn't allowed in this directory" or "this field is required only in this path."
+
+The competitive analysis (March 2026) confirmed: no other tool combines schema inference + path-based validation + search. But the README reads like a feature list, not a demonstration.
+
+## Approach: show, don't tell
+
+Instead of a "Why mdvs?" sales pitch, demonstrate the three dimensions in action with a tiny, self-explanatory example.
+
+### README
+
+Create a minimal example structure (5-6 files, 3 directories) that makes the point instantly:
+
+```
+notes/
+├── blog/
+│   ├── my-post.md ← title, tags, draft
+│   └── idea.md ← title, draft
+├── people/
+│   ├── alice.md ← title, role, email
+│   └── bob.md ← title, role
+└── meetings/
+    └── standup.md ← title, date, attendees
+```
+
+Three commands, three "aha" moments:
+1. **Init** — output shows `draft` only in `blog/`, `role` only in `people/`, `attendees` only in `meetings/`. User sees: "it understood my structure."
+2. **Check** — catches a structural mistake (e.g., `draft` leaked into `people/`, or missing `role`). User sees: "it caught something a flat linter wouldn't."
+3. **Search** — finds relevant content. User sees: "it works."
+
+This replaces the current Quick Start section. Keep it tight — the whole example should be scrollable without pausing.
+
+### Book intro / getting-started
+
+The full `example_kb` walkthrough goes deeper:
+- Show the complete directory tree
+- Show the inferred schema with all path patterns
+- Walk through type widening, nullable, nested objects
+- Show violations that only make sense with path awareness
+- Show search with `--where` filtering by path and metadata
+
+### Considerations
+
+- The README example should be a real, runnable mini-vault (maybe ship it as `examples/quick-start/` in the repo, or generate it in the Quick Start instructions)
+- The output examples must be real captured output, not pseudocode
+- The book already has `example_kb` — it needs better framing ("notice how `drift_rate` is required only in `projects/alpha/notes/`"), not new content
+
+## Files
+
+- `README.md` — rework Quick Start section
+- `book/src/getting-started.md` — reframe around directory-aware schema
+- `book/src/introduction.md` — update pitch if needed
+- Possibly create `examples/quick-start/` with the mini-vault
diff --git a/docs/spec/todos/index.md b/docs/spec/todos/index.md
index f735be5..30ef699 100644
--- a/docs/spec/todos/index.md
+++ b/docs/spec/todos/index.md
@@ -119,3 +119,4 @@
 | [0115](TODO-0115.md) | Embed asciinema recordings in mdBook for interactive demos | todo | medium | 2026-03-17 |
 | [0116](TODO-0116.md) | Trim DataFusion default features to reduce binary size | todo | medium | 2026-03-17 |
 | [0117](TODO-0117.md) | Fix null values skipping Disallowed and NullNotAllowed checks | done | high | 2026-03-17 |
+| [0118](TODO-0118.md) | Rework README and book intro to show directory-aware schema | todo | high | 2026-03-17 |