bo-sql

A read-optimized, in-memory, single-threaded SQL engine in C++.

Goal

Build a read-optimized, in-memory, single-threaded SQL engine that:

Loads a few CSV tables into a columnar store (no updates after load).
Supports a minimal SQL subset: SELECT … FROM … WHERE …, INNER JOIN (equi-join), GROUP BY (SUM, COUNT, AVG), ORDER BY, LIMIT, and basic expressions.
Types: INT64, DOUBLE, STRING (dictionary-encoded), DATE (as int32 yyyymmdd).
Deterministic execution with a small cost-based heuristic (or rule-based) optimizer.
Vectorized execution (batch size configurable, default 4096 rows).
CLI REPL: load tables, run SQL, pretty-print results, EXPLAIN.

Non-goals: No transactions, WAL, MVCC, indexes, deletes/updates, subqueries, window functions, outer joins, UDFs, spill-to-disk.

Why?

Why not?
Learn how a SQL engine works end-to-end.
Practice C++20, modern build systems (Meson), testing (Catch2), CI (GitHub Actions).
Explore design tradeoffs: columnar vs row, vectorized vs tuple

Dependencies

C++20 compiler (e.g., clang or gcc)
Meson build system
Ninja build tool
Catch2 (included as subproject)

Getting Started

Building

Install dependencies:
- Meson: pip install meson
- Ninja: brew install ninja (macOS) or apt install ninja-build (Ubuntu)

Clone the repository and build:

git clone <repo-url>
cd bo-sql
meson setup build
ninja -C build

Running

After building, run the CLI:

./build/bq [csvfile] [--sql] [--output-format <csv|md>]

Modes:

Interactive REPL: ./build/bq or ./build/bo-sql csvfile (loads CSV if provided)
SQL from stdin: ./build/bq [csvfile] --sql (loads CSV from file or stdin, reads SQL from stdin)
--output-format: Stub for future output formatting (currently ignored)

Commands in REPL:

LOAD TABLE name FROM 'file.csv';
SHOW TABLES;
DESCRIBE table_name;
EXPLAIN SELECT ...;
SELECT ...;

Examples:

Interactive REPL with CSV loaded:

./build/bq test.csv
> SELECT * FROM table LIMIT 5;

Execute SQL on piped CSV:

cat test.csv | ./build/bq --sql
SELECT COUNT(*) FROM table;

Load CSV and execute piped SQL:

echo "SELECT * FROM table WHERE id > 10;" | ./build/bq test.csv --sql

Development Milestones

M0 – Skeleton (½–1 day)

Meson/CMake project; engine/, parser/, planner/, exec/, storage/, catalog/, cli/, tests/.
Common types (i64, f64, StrId, Date32) and ColumnVector.

M1 – Storage + Catalog (1–2 days)

LOAD CSV into columnar, dictionary-encoded table.
Catalog tracks tables, columns, types, stats (min/max, NDV estimate, row count).

M2 – Parser + AST (1–2 days)

Handwritten recursive-descent for the subset.
AST nodes for SelectStmt, TableRef, Join, Expr.

M3 – Logical algebra IR (1 day)

Nodes: LogicalScan, LogicalFilter, LogicalProject, LogicalHashJoin, LogicalAggregate, LogicalOrder, LogicalLimit.

M4 – Physical plan + Vectorized exec (3–4 days)

Vectorized Volcano operators: Scan, Selection, Project, HashJoin, HashAggregate, OrderBy, Limit.

M5 – Optimizer (2–3 days)

Rule-based: Predicate pushdown, projection pruning, join reordering.

M6 – REPL + EXPLAIN + Tests (2–3 days)

CLI REPL, EXPLAIN, unit tests.

Sample Queries

Q1: Revenue by day

SELECT order_date, SUM(total) AS rev
FROM orders
WHERE status = 'COMPLETE' AND order_date BETWEEN 20240101 AND 20240131
GROUP BY order_date
ORDER BY order_date;

Q2: Top SKUs by revenue

SELECT l.sku, SUM(l.qty * l.price) AS rev
FROM lineitem l JOIN orders o ON l.order_id = o.order_id
WHERE o.status = 'COMPLETE'
GROUP BY l.sku
ORDER BY rev DESC
LIMIT 20;

Design Choices

Columnar + vectorized: shows memory bandwidth, CPU cache, SIMDable loops, simple predicate pushdown.
Dictionary encoding for strings: design ID vs payload columns and equality predicate acceleration.
Rule-based optimizer: implement real rewrites and simple stats; enough to compare plans.
Hash join + hash agg only: one great path > many mediocre ones.
In-memory only: ignore buffer manager & logging; still real for analytics

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.github/workflows		.github/workflows
.vscode		.vscode
docs		docs
include		include
src		src
subprojects		subprojects
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
meson.build		meson.build
smoke.sh		smoke.sh
test.csv		test.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

bo-sql

Goal

Why?

Dependencies

Getting Started

Building

Running

Development Milestones

M0 – Skeleton (½–1 day)

M1 – Storage + Catalog (1–2 days)

M2 – Parser + AST (1–2 days)

M3 – Logical algebra IR (1 day)

M4 – Physical plan + Vectorized exec (3–4 days)

M5 – Optimizer (2–3 days)

M6 – REPL + EXPLAIN + Tests (2–3 days)

Sample Queries

Design Choices

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

bo-sql

Goal

Why?

Dependencies

Getting Started

Building

Running

Development Milestones

M0 – Skeleton (½–1 day)

M1 – Storage + Catalog (1–2 days)

M2 – Parser + AST (1–2 days)

M3 – Logical algebra IR (1 day)

M4 – Physical plan + Vectorized exec (3–4 days)

M5 – Optimizer (2–3 days)

M6 – REPL + EXPLAIN + Tests (2–3 days)

Sample Queries

Design Choices

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages