A read-optimized, in-memory, single-threaded SQL engine in C++.

The goal is an engine that:
- Loads a few CSV tables into a columnar store (no updates after load).
- Supports a minimal SQL subset: SELECT … FROM … WHERE …, INNER JOIN (equi-join), GROUP BY (SUM, COUNT, AVG), ORDER BY, LIMIT, and basic expressions.
- Types: INT64, DOUBLE, STRING (dictionary-encoded), DATE (as int32 yyyymmdd).
- Deterministic execution with a small cost-based heuristic (or rule-based) optimizer.
- Vectorized execution (batch size configurable, default 4096 rows).
- CLI REPL: load tables, run SQL, pretty-print results, EXPLAIN.
Non-goals: No transactions, WAL, MVCC, indexes, deletes/updates, subqueries, window functions, outer joins, UDFs, spill-to-disk.
Why?
- Learn how a SQL engine works end-to-end.
- Practice C++20, modern build systems (Meson), testing (Catch2), CI (GitHub Actions).
- Explore design tradeoffs: columnar vs row storage, vectorized vs tuple-at-a-time execution.
Requirements:
- C++20 compiler (e.g., clang or gcc)
- Meson build system
- Ninja build tool
- Catch2 (included as subproject)
Install dependencies:
- Meson: `pip install meson`
- Ninja: `brew install ninja` (macOS) or `apt install ninja-build` (Ubuntu)
Clone the repository and build:
```
git clone <repo-url>
cd bo-sql
meson setup build
ninja -C build
```
After building, run the CLI:
```
./build/bq [csvfile] [--sql] [--output-format <csv|md>]
```

Modes:
- Interactive REPL: `./build/bq` or `./build/bo-sql csvfile` (loads CSV if provided)
- SQL from stdin: `./build/bq [csvfile] --sql` (loads CSV from file or stdin, reads SQL from stdin)
- `--output-format`: stub for future output formatting (currently ignored)
Commands in REPL:
```
LOAD TABLE name FROM 'file.csv';
SHOW TABLES;
DESCRIBE table_name;
EXPLAIN SELECT ...;
SELECT ...;
```
Examples:
- Interactive REPL with CSV loaded:

  ```
  ./build/bq test.csv
  > SELECT * FROM table LIMIT 5;
  ```

- Execute SQL on piped CSV:

  ```
  cat test.csv | ./build/bq --sql
  SELECT COUNT(*) FROM table;
  ```

- Load CSV and execute piped SQL:

  ```
  echo "SELECT * FROM table WHERE id > 10;" | ./build/bq test.csv --sql
  ```
- Meson/CMake project; engine/, parser/, planner/, exec/, storage/, catalog/, cli/, tests/.
- Common types (i64, f64, StrId, Date32) and ColumnVector.
- LOAD CSV into columnar, dictionary-encoded table.
- Catalog tracks tables, columns, types, stats (min/max, NDV estimate, row count).
- Handwritten recursive-descent for the subset.
- AST nodes for SelectStmt, TableRef, Join, Expr.
- Nodes: LogicalScan, LogicalFilter, LogicalProject, LogicalHashJoin, LogicalAggregate, LogicalOrder, LogicalLimit.
- Vectorized Volcano operators: Scan, Selection, Project, HashJoin, HashAggregate, OrderBy, Limit.
- Rule-based: Predicate pushdown, projection pruning, join reordering.
- CLI REPL, EXPLAIN, unit tests.
Q1: Revenue by day

```sql
SELECT order_date, SUM(total) AS rev
FROM orders
WHERE status = 'COMPLETE' AND order_date BETWEEN 20240101 AND 20240131
GROUP BY order_date
ORDER BY order_date;
```

Q2: Top SKUs by revenue

```sql
SELECT l.sku, SUM(l.qty * l.price) AS rev
FROM lineitem l JOIN orders o ON l.order_id = o.order_id
WHERE o.status = 'COMPLETE'
GROUP BY l.sku
ORDER BY rev DESC
LIMIT 20;
```

- Columnar + vectorized: shows memory bandwidth, CPU cache, SIMDable loops, simple predicate pushdown.
- Dictionary encoding for strings: design ID vs payload columns and equality predicate acceleration.
- Rule-based optimizer: implement real rewrites and simple stats; enough to compare plans.
- Hash join + hash agg only: one great path > many mediocre ones.
- In-memory only: ignore buffer manager & logging; still real for analytics.