
Add tokens and lexer!#10

Merged
ujaandas merged 59 commits into main from feat/lang/lexer on Apr 23, 2026

@ujaandas
Owner

This PR adds the first piece of the foundation for the invariants lang: the lexer! The lexer is responsible for converting raw invariants source code into a sequence of tokens.

The set of tokens consists of all symbols, literals and keywords of importance, such as spec, ==, +, Boolean, etc. A literal, in this context, is the value associated with a token. It is one of: std::monostate, for tokens with no literal value (e.g. operators, keywords, punctuation, nulls); std::string, for identifiers and string-type values; int, for integers; double, for "numbers" (as per the OpenAPI spec); or bool, for booleans. Some helpers, such as operator<< for std::ostream and equality checks, have also been implemented. For the most part, this is a fairly straightforward representation of data.
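As a rough illustration of the literal representation described above, here is a minimal sketch. The Literal alias follows the alternatives listed in this description; kindOf is a hypothetical helper added only for this example and is not part of the PR.

```cpp
#include <string>
#include <variant>

// Same alternatives as listed in the description above.
using Literal = std::variant<std::monostate,  // no literal value
                             std::string,     // identifiers + strings
                             int,             // integers
                             double,          // numbers
                             bool>;           // booleans

// Hypothetical helper (not from the PR): names the alternative a Literal holds.
std::string kindOf(const Literal& lit) {
    if (std::holds_alternative<std::monostate>(lit)) return "none";
    if (std::holds_alternative<std::string>(lit)) return "string";
    if (std::holds_alternative<int>(lit)) return "int";
    if (std::holds_alternative<double>(lit)) return "double";
    return "bool";
}
```

When building a string literal, wrapping the text in std::string (for example Literal(std::string("age"))) makes the intended alternative explicit rather than relying on implicit conversions from const char*.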

The lexer class is more interesting, as it scans our text and converts it into a stream of tokens. The implementation is fairly straightforward: thanks to our TokenType being backed by uint8, we can use a switch/case table, which should compile down to a jump table. I imagine this gives a significant performance boost over using a map to match characters to tokens. We look ahead at most 2 characters.
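The switch-based approach can be sketched roughly as follows. This is not the PR's Lexer: Tok is a hypothetical, heavily reduced token set, scanOperators is an illustrative name, and only one character of lookahead is shown. Whether a compiler actually emits a jump table for the switch depends on the compiler and optimization level.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical reduced token set; the PR's TokenType has many more members.
enum class Tok : std::uint8_t {
    LBrace, RBrace, Colon, Plus, Greater, GreaterEq, EqEq, Unknown
};

// Scans punctuation and operators only, skipping whitespace.
std::vector<Tok> scanOperators(std::string_view src) {
    std::vector<Tok> out;
    std::size_t i = 0;

    // Consumes the next character only if it equals `expected`.
    auto match = [&](char expected) {
        if (i < src.size() && src[i] == expected) {
            ++i;
            return true;
        }
        return false;
    };

    while (i < src.size()) {
        const char c = src[i++];
        switch (c) {
            case ' ': case '\t': case '\r': case '\n': break;
            case '{': out.push_back(Tok::LBrace); break;
            case '}': out.push_back(Tok::RBrace); break;
            case ':': out.push_back(Tok::Colon); break;
            case '+': out.push_back(Tok::Plus); break;
            // Two-character operators peek at the following character.
            case '>': out.push_back(match('=') ? Tok::GreaterEq : Tok::Greater); break;
            case '=': out.push_back(match('=') ? Tok::EqEq : Tok::Unknown); break;
            default: out.push_back(Tok::Unknown); break;
        }
    }
    return out;
}
```

Each case handles a single leading character, so deciding between > and >= only requires checking one more character before emitting the token.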

A very basic supported (scannable) input might look like this:

spec User {
   field age: Integer
   check age >= 18
}
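For an input like the one above, each word has to be classified as either a keyword or an identifier. A minimal sketch of that lookup, assuming a keyword table keyed by std::string_view, might look like this. Kind and classify are hypothetical names for illustration, and the table holds only the keywords that appear in the sample.

```cpp
#include <string_view>
#include <unordered_map>

// Hypothetical reduced classification; the PR uses TokenType for this.
enum class Kind { KwSpec, KwField, KwCheck, KwInteger, Identifier };

// Returns the keyword kind for `word`, or Identifier if it is not a keyword.
Kind classify(std::string_view word) {
    // Keys are string literals, so the views stay valid for the program's lifetime.
    static const std::unordered_map<std::string_view, Kind> keywords{
        {"spec", Kind::KwSpec},
        {"field", Kind::KwField},
        {"check", Kind::KwCheck},
        {"Integer", Kind::KwInteger},
    };
    const auto it = keywords.find(word);
    return it != keywords.end() ? it->second : Kind::Identifier;
}
```

With this table, spec, field, check and Integer resolve to keywords, while User and age fall through to identifiers. The lookup is case-sensitive, so integer would be treated as an identifier.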

Also, a very, very, very basic interpreter has been added. Some more work needs to be done on the actual CLI and "framework" of the language, such as adding a REPL, proper error types, etc., but that is out of scope for this branch and can be tackled much later, as those are more niceties than anything.

We also make extensive use of GTest and (ideally) have tested the most common paths.

Copilot AI review requested due to automatic review settings April 23, 2026 22:08
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces the initial lexer/token infrastructure (plus a minimal interpreter) for the invariants language, along with a Nix/prek workflow to build and run tests.

Changes:

  • Added TokenType/Token model with printing, equality, and literal support.
  • Implemented a Lexer that scans source text into a token stream and added GTest coverage for tokens/lexing.
  • Added a minimal Interpreter that runs the lexer and prints tokens; wired new libs/tests into CMake and added nix run .#test.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| prek.toml | Switches to local prek hooks and adds Nix-based configure/format/tidy/cppcheck/test hooks. |
| flake.nix | Adds dev-test app and exposes nix run .#test. |
| README.md | Documents running the test suite via Nix app. |
| lang/src/lexer/token.hpp | Adds token types, literal variant, and token API. |
| lang/src/lexer/token.cpp | Implements token formatting and comparison. |
| lang/src/lexer/lexer.hpp | Declares lexer and keyword table. |
| lang/src/lexer/lexer.cpp | Implements scanning logic for tokens, literals, comments, whitespace. |
| lang/src/interp/interpreter.hpp | Declares minimal interpreter skeleton. |
| lang/src/interp/interpreter.cpp | Implements run() by lexing and printing tokens. |
| lang/src/**/CMakeLists.txt | Builds lexer/interpreter libs and wires them into the main executable. |
| lang/tests/** | Adds GTest suites for token/lexer and a basic interpreter test; updates test CMake structure. |


Comment thread lang/src/lexer/token.cpp
#include "token.hpp"

#include <ostream>
#include <string>
Comment thread lang/src/lexer/lexer.hpp
Comment on lines +1 to +5
#pragma once

#include <string>
#include <unordered_map>
#include <vector>
Comment thread lang/src/lexer/lexer.cpp Outdated
Comment thread lang/src/lexer/lexer.cpp
Comment on lines +192 to +196
addToken(type, false);
return;
}

addToken(type);
Comment thread lang/src/interp/interpreter.hpp Outdated
Comment on lines +3 to +12
#include <cstddef>
#include <string_view>

namespace invariants::interpreter {

class Interpreter {
private:
bool hadErr = false;
void report(std::size_t line, std::string_view where, std::string_view msg);

Comment thread lang/src/interp/interpreter.cpp Outdated
Comment on lines +5 to +15
// #include <string_view>

#include "lexer.hpp"

namespace invariants::interpreter {

// void Interpreter::report(std::size_t line, std::string_view where,
// std::string_view msg) {
// std::println("[line %d] Error %s : %s", line, where, msg);
// hadErr = true;
// }
Comment thread lang/src/lexer/token.cpp Outdated
Comment on lines +59 to +60
return !(this->type == other.type && this->lexeme == other.lexeme &&
this->literal == other.literal && this->line == other.line);
Comment thread lang/src/lexer/token.hpp
Comment on lines +88 to +92
bool operator!=(const Token& other) const;
std::string toString() const;

friend std::ostream& operator<<(std::ostream& os, const Token& token);
};
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 23, 2026 22:14
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces the initial C++ implementation of the invariants language lexer/token model (plus a minimal interpreter stub) and wires them into the build/test + developer tooling.

Changes:

  • Add TokenType/Token (with literals, formatting, and equality) and a Lexer that scans source text into tokens.
  • Add initial Interpreter::run() that lexes input and prints the token stream.
  • Add GoogleTest coverage for token and lexer behavior; update CMake/Nix/prek tooling to build and run tests.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| prek.toml | Switch hooks to local/system hooks invoking Nix commands (configure/format/tidy/cppcheck/test). |
| lang/src/lexer/token.hpp | Introduces token types, literal representation, and Token API. |
| lang/src/lexer/token.cpp | Implements token formatting and equality helpers. |
| lang/src/lexer/lexer.hpp | Declares lexer and keyword table. |
| lang/src/lexer/lexer.cpp | Implements scanning logic for punctuation/operators/literals/keywords/comments. |
| lang/src/lexer/CMakeLists.txt | Builds invariants_lexer library. |
| lang/src/interp/interpreter.hpp / interpreter.cpp | Adds minimal interpreter that runs the lexer and prints tokens. |
| lang/src/interp/CMakeLists.txt | Builds invariants_interp library. |
| lang/src/CMakeLists.txt / lang/CMakeLists.txt | Adds subdirectories and links hello_world to lexer. |
| lang/tests/lexer/* + lang/tests/interp/* + lang/tests/CMakeLists.txt | Adds/organizes lexer/token/interpreter tests and discovery. |
| flake.nix | Adds nix run .#test app and wires it into flake outputs. |
| README.md | Documents nix run .#test and notes about impure scripts. |


Comment thread lang/src/lexer/token.cpp
Comment on lines +10 to +25
std::string literalToString(const Literal& lit) {
return std::visit(
[](const auto& value) -> std::string {
using T = std::decay_t<decltype(value)>;

if constexpr (std::is_same_v<T, std::monostate>) {
return "null";
} else if constexpr (std::is_same_v<T, std::string>) {
return value;
} else if constexpr (std::is_same_v<T, double>) {
return std::to_string(value);
} else if constexpr (std::is_same_v<T, int>) {
return std::to_string(value);
} else if constexpr (std::is_same_v<T, bool>) {
return value ? "true" : "false";
}
Comment thread lang/src/lexer/lexer.hpp
Comment on lines +13 to +44
const std::string source;
std::vector<Token> tokens;
inline static const std::unordered_map<std::string_view, TokenType> keywords{
{"spec", TokenType::KW_SPEC},
{"field", TokenType::KW_FIELD},
{"check", TokenType::KW_CHECK},
{"invariant", TokenType::KW_INVARIANT},
{"Boolean", TokenType::KW_BOOLEAN},
{"true", TokenType::LIT_BOOLEAN_T},
{"false", TokenType::LIT_BOOLEAN_F},
{"Array", TokenType::KW_ARRAY},
{"Null", TokenType::KW_NULL},
{"null", TokenType::LIT_NULL},
{"String", TokenType::KW_STRING},
{"Number", TokenType::KW_NUMBER},
{"Integer", TokenType::KW_INTEGER},
{"IN", TokenType::KW_IN},
{"NIN", TokenType::KW_NOT_IN},
{"NI", TokenType::KW_CONTAINS},
};
size_t start = 0;
size_t curr = 0;
size_t line = 1;

void scanToken();
char advance();
void addToken(TokenType type);
void addToken(TokenType type, Literal literal);

public:
explicit Lexer(std::string_view source);
std::vector<Token> scanTokens();
Comment thread lang/src/lexer/lexer.cpp
Comment on lines +179 to +197
std::string text = source.substr(start, curr - start);

auto it = keywords.find(text);
TokenType type =
(it != keywords.end()) ? it->second : TokenType::LIT_IDENTIFIER;

// Check if boolean and if so, add relevant literals
if (type == TokenType::LIT_BOOLEAN_T) {
addToken(type, true);
return;
}

if (type == TokenType::LIT_BOOLEAN_F) {
addToken(type, false);
return;
}

addToken(type);
}
Comment thread flake.nix
Comment on lines 18 to 29
dev-configure = pkgs.writeShellApplication {
name = "dev-configure";
meta.description = "Configure clangd environment.";
runtimeInputs = with pkgs; [
clang
cmake
ninja
];
text = ''
set -euo pipefail
cmake -S lang -B .nix-dev/build
'';
Comment thread flake.nix
Comment on lines +32 to +44
dev-test = pkgs.writeShellApplication {
name = "dev-test";
meta.description = "Run test suite.";
runtimeInputs = with pkgs; [
cmake
ninja
];
text = ''
set -euo pipefail
cmake -S lang -B .nix-dev/build
cmake --build .nix-dev/build
ctest --test-dir .nix-dev/build --output-on-failure
'';
Comment thread lang/src/lexer/token.hpp
Comment on lines +90 to +92

friend std::ostream& operator<<(std::ostream& os, const Token& token);
};
Comment thread lang/src/lexer/token.hpp
Comment on lines +70 to +75
using Literal = std::variant<std::monostate, // null
std::string, // identifiers + strings
int, // integers
double, // numbers
bool // booleans
>;
@ujaandas ujaandas merged commit 2d728ab into main Apr 23, 2026
1 check passed
@ujaandas ujaandas deleted the feat/lang/lexer branch April 23, 2026 22:30
