This document is the authoritative design for NotPlusPlus, a source-code interpreter for a small, explicit, well-defined subset of real C++.
NotPlusPlus is not a new language with C++-like syntax. It is a system that accepts actual C++ source text and interprets programs whose constructs fall entirely within a supported subset of the C++ language. Programs outside that subset are rejected with precise diagnostics.
This document defines:
- the product goal
- the supported language subset
- semantic rules
- architecture
- internal representations
- execution model
- diagnostic behavior
- implementation plan
- testing strategy
- explicit non-goals
- future evolution constraints
NotPlusPlus
NotPlusPlus shall:
- accept source code written in genuine C++ syntax
- lex and parse it according to a subset-compatible grammar
- resolve declarations and names according to subset-compatible C++ rules
- type-check the program according to the subset’s semantics
- interpret the program directly without producing native machine code
- execute a well-formed `main` function and produce observable program output
NotPlusPlus is a subset C++ interpreter, not a compiler, transpiler, static analyzer, or language invention exercise.
The correct mental model is:
“Interpret actual C++ source that belongs to a strict, documented subset of ISO C++.”
This distinction is critical. The parser and semantic rules must align with real C++ wherever the subset overlaps the language, rather than inventing alternate rules for convenience.
A user writes a small C++ program using only supported constructs, for example:
```cpp
int add(int a, int b) {
    return a + b;
}

int main() {
    int x = add(2, 3);
    if (x > 4) {
        print(x);
    }
    return 0;
}
```

NotPlusPlus parses and interprets this program according to the documented subset semantics.
NotPlusPlus shall not attempt to “mostly parse C++” or “best-effort emulate unsupported constructs.” Unsupported features are not partially recognized and ignored. They are rejected.
This project succeeds by being:
- strict
- explicit
- deterministic
- semantically coherent
- faithful where supported
If a construct is supported, its spelling and semantics must correspond to real C++ as closely as practical within the subset.
Every accepted construct must be documented. Everything else is unsupported.
Unsupported constructs shall not be reinterpreted under custom rules.
Bad:
- accepting `std::cout << x;` and secretly treating it as `print(x);`
Good:
- either support real semantics for a narrow form of expression involving `std::cout`, or reject it
For version 1, the recommended design is to reject std::cout entirely and provide a built-in function print(...) defined as part of the interpreter’s runtime environment, because a normal function call is valid C++ syntax.
Subset selection must deliberately avoid notorious C++ ambiguities and front-end complexity where possible.
Given the same source and interpreter build, behavior must be stable and reproducible.
A rejected program must receive actionable diagnostics with source locations and stable error categories.
The implementation shall be staged:
- source management
- lexing
- preprocessing boundary handling
- parsing
- AST formation
- semantic analysis
- interpretation
No phase may embed undocumented behavior from a later phase unless explicitly designed.
The initial supported subset shall include:
- a single translation unit
- zero or more function definitions
- optional function declarations
- no separate compilation
- no headers beyond a limited interpreter-provided prelude model
- `int`
- `bool`
- `void`
- fixed-size one-dimensional arrays of supported element type, optionally deferred to phase 2
- integer literals
- boolean literals `true` and `false`
- identifier references
- parenthesized expressions
- unary operators: `+`, `-`, `!`
- binary arithmetic: `+`, `-`, `*`, `/`, `%`
- binary comparison: `<`, `<=`, `>`, `>=`, `==`, `!=`
- logical operators: `&&`, `||`
- assignment: `=`
- compound assignment: `+=`, `-=`, `*=`, `/=`, `%=`
- function call
- array indexing if arrays are enabled
- comma operator is unsupported in expressions except where grammar requires comma separators
- expression statement
- declaration statement
- compound statement / block
- `if`, `if`/`else`
- `while`
- `for`
- `return`
- `break`
- `continue`
- local variable declarations
- function declarations and definitions
- block scope
- function parameter declarations
- optional local array declarations
- built-in function `print(int)`
- built-in function `print(bool)` (optional)
- built-in function `println(int)` (optional)
- built-ins shall be ordinary function names in the global namespace from the program's perspective
- interpret starting from `int main()`
- allow `int main()` and possibly `int main(int, bool)` only if intentionally designed; default is only `int main()`
Excluded from the initial subset:
- preprocessing beyond minimal policy handling
- macros
- includes with real header loading
- namespaces
- classes
- structs
- enums
- references
- pointers
- dynamic allocation
- strings
- floating point
- character types
- casts
- overload resolution beyond built-in support model
- templates
- exceptions
- function pointers
- lambdas
- recursion limits beyond implementation-defined stack protection
- user-defined operators
- declarations requiring full declarator complexity
- `switch`
- `do`-`while`
- `const`, `constexpr`, `static`, `extern`, `volatile`, `mutable`
- global variables, at least in version 1 baseline
This section is normative.
The input to NotPlusPlus is a UTF-8 text file treated as a single C++ source file.
The implementation may restrict accepted characters to ASCII plus standard whitespace for version 1.
Line endings:
- `\n`: mandatory support
- `\r\n`: normalized support recommended
A program consists of a sequence of top-level declarations. In version 1:
Allowed top-level declarations:
- function declaration
- function definition
- optional built-in declaration injection, performed by interpreter before semantic analysis
Disallowed at top level:
- variable definitions
- namespace declarations
- type definitions
- using directives
- class/struct definitions
- templates
- include directives unless a special preprocessing policy is adopted
The supported keyword set includes:
`int`, `bool`, `void`, `if`, `else`, `while`, `for`, `return`, `true`, `false`
All other C++ keywords are lexed as keywords if the lexer supports them globally, but any occurrence in syntax outside the supported grammar is rejected as unsupported.
Support:
- `//` line comments
- `/* ... */` block comments
Nested block comments are not supported, matching C/C++ behavior.
Supported literals:
- decimal integer literals, non-suffixed
- `true` and `false`
Unsupported:
- hexadecimal
- binary
- octal
- digit separators
- integer suffixes
- character literals
- string literals
- floating literals
- user-defined literals
The integer literal domain shall be bounded by the interpreter integer representation. Recommended baseline: signed 32-bit two’s-complement semantics.
Supported:
- `int`
- `bool`
- `void`
Optional in version 1, but strongly recommended only after core stability.
Supported form:
```cpp
int a[5];
bool flags[10];
```
Constraints:
- one-dimensional only
- size must be a positive integer literal
- no variable-length arrays
- no array parameters by special adjustment unless explicitly modeled
- no decay to pointer semantics
- no initializer lists in version 1 baseline
- `void` may only appear as a function return type
- variables and parameters may not have type `void`
- arrays may not have element type `void`
- if arrays are supported, assignment between arrays is disallowed
To avoid full C++ declarator complexity, the subset supports only a reduced set of declarator forms.
Allowed:
- `int x;`
- `int x = 5;`
- `bool done = false;`
- `int arr[5];` (if arrays enabled)
Disallowed:
- multiple declarators in one declaration, e.g. `int a, b;`
- pointer declarators
- reference declarators
- parenthesized declarators
- initialized arrays
- declarators with qualifiers
Allowed:
- `int f(int a, int b);`
- `int f(int a, int b) { ... }`
- `void g();`
- `bool h(bool x) { ... }`
Disallowed:
- default parameters
- variadic parameters
- function overloading
- member functions
- trailing return types
- noexcept, attributes, requires clauses
- cv/ref qualifiers
- templates
Supported:
```cpp
{
    int x = 1;
    x = x + 1;
}
```

Each block creates a new lexical scope.
A declaration statement is a supported local variable declaration followed by ;.
Any supported expression followed by ;.
Supported:
```cpp
if (cond) stmt
if (cond) stmt else stmt
```

The condition must be of type `bool`, or `int` if integer-to-bool contextual conversion is allowed by policy. Recommended baseline: allow both, matching C++ contextual conversion to `bool`.
Supported:
```cpp
while (cond) stmt
```

Supported:

```cpp
for (init; cond; step) stmt
```

Version 1 recommended support:

- `init` may be empty, an expression statement (without the trailing semicolon inside the `for` syntax), or a single supported variable declaration
- `cond` may be empty or a supported expression
- `step` may be empty or a supported expression

Examples:

```cpp
for (int i = 0; i < 10; i = i + 1) { ... }
for (; x < 10; ) { ... }
```

Supported:

```cpp
return;
return expr;
```

Rules:

- `return;` only valid in `void` functions
- `return expr;` required for non-void functions
- expression type must be convertible to the function return type according to subset rules
Supported:
- identifier
- integer literal
- `true` and `false`
- parenthesized expression
- function call
- array subscript if arrays enabled
Supported:
- `+expr`
- `-expr`
- `!expr`

Unsupported:

- `++`, `--`
- `*`, `&`
- `sizeof`
- `new`, `delete`
- `static_cast`
- C-style cast
- `~`
Supported:
- multiplicative: `*` `/` `%`
- additive: `+` `-`
- relational: `<` `<=` `>` `>=`
- equality: `==` `!=`
- logical and: `&&`
- logical or: `||`
- assignment: `=`
- compound assignment: `+=` `-=` `*=` `/=` `%=`
Unsupported:
- bitwise operators
- shifts
- comma operator
- member access
- pointer-to-member
- spaceship operator
Assignment is supported only for assignable lvalues:
- variable reference
- array element if arrays enabled
Unsupported:
- chained assignment (e.g. `a = b = 1`): the parser naturally accepts it via right associativity, so it is allowed only if the semantic rules support it; recommended baseline: support it, because it follows the normal assignment-expression grammar, but do not advertise it as a primary feature
Calls to declared functions are supported.
Rules:
- exact arity match required
- argument types must be compatible
- no overload resolution
- no implicit function declarations
&& and || must short-circuit exactly as in C++.
This is semantically important and non-negotiable.
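As an illustration of how an AST-walking evaluator guarantees this, here is a minimal Rust sketch over a hypothetical, simplified expression model (not the real NotPlusPlus node types), counting leaf evaluations so the short-circuit is observable:

```rust
// Hypothetical, simplified expression model for illustration only.
enum Expr {
    Bool(bool),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

// Evaluate with C++ short-circuit semantics, counting how many leaf
// expressions were actually evaluated so the short-circuit is observable.
fn eval(e: &Expr, evals: &mut u32) -> bool {
    match e {
        Expr::Bool(b) => { *evals += 1; *b }
        // Right operand is evaluated only when the left does not decide it.
        Expr::And(l, r) => { if !eval(l, evals) { false } else { eval(r, evals) } }
        Expr::Or(l, r) => { if eval(l, evals) { true } else { eval(r, evals) } }
    }
}

fn main() {
    // true || false: the right operand must not be evaluated.
    let e = Expr::Or(Box::new(Expr::Bool(true)), Box::new(Expr::Bool(false)));
    let mut evals = 0;
    assert!(eval(&e, &mut evals));
    assert_eq!(evals, 1); // only the left leaf ran
}
```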
This section defines the runtime-visible and compile-time-visible semantics of the supported subset.
The interpreter shall model at least:
- global function scope
- function parameter scope
- block scope
- `for`-init scope if the declaration form is used
Variables are resolved lexically, innermost scope first.
Shadowing is allowed across nested scopes.
Example:
```cpp
int main() {
    int x = 1;
    {
        int x = 2;
        print(x); // 2
    }
    print(x); // 1
    return 0;
}
```

Two variables with the same name in the same scope are rejected.
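The innermost-first lookup and shadowing rules can be sketched with a parent-linked scope chain (hypothetical, simplified types; a real symbol table would carry spans, types, and slot indices rather than plain values):

```rust
// Simplified scope chain for illustration only.
struct Scope {
    parent: Option<Box<Scope>>,
    vars: Vec<(String, i32)>, // name -> value stand-in
}

impl Scope {
    fn child(self) -> Scope {
        Scope { parent: Some(Box::new(self)), vars: Vec::new() }
    }
    // Same-scope redeclaration is an error; shadowing an outer scope is fine.
    fn declare(&mut self, name: &str, v: i32) -> Result<(), String> {
        if self.vars.iter().any(|(n, _)| n == name) {
            return Err(format!("redeclaration of '{}'", name));
        }
        self.vars.push((name.to_string(), v));
        Ok(())
    }
    // Innermost scope first, then walk outward through parents.
    fn lookup(&self, name: &str) -> Option<i32> {
        self.vars.iter().rev().find(|(n, _)| n == name).map(|(_, v)| *v)
            .or_else(|| self.parent.as_ref().and_then(|p| p.lookup(name)))
    }
}

fn main() {
    let mut outer = Scope { parent: None, vars: Vec::new() };
    outer.declare("x", 1).unwrap();
    let mut inner = outer.child();
    inner.declare("x", 2).unwrap(); // shadows outer x
    assert_eq!(inner.lookup("x"), Some(2));
}
```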
Function declarations may be repeated only if identical in signature and kind. Because overloading is unsupported, any differing function signature with the same name is an error.
The subset is statically typed. Types are determined during semantic analysis.
There is no dynamic typing or value-tag-based operator selection beyond what is already statically determined.
A policy choice is required here. The recommended baseline is:
Allowed:
- `int` to `bool` in contextual conversions only, such as conditions and logical operators where C++ would require a bool-like condition
- `bool` to `int` in arithmetic contexts, only if explicitly aligned with C++ integral promotion semantics
However, to simplify implementation while staying faithful enough, version 1 should adopt:
- `int` and `bool` are distinct types
- arithmetic operators require `int`
- comparison operators on `int` produce `bool`
- equality operators support `int == int` and `bool == bool`
- logical operators require operands contextually convertible to `bool`
- conditions (`if`, `while`, `for`) require an expression contextually convertible to `bool`
- assignment requires exact type match, except possibly `bool = int` and `int = bool` if a limited conversion matrix is adopted
For design clarity, the strictest consistent version is preferred:
- exact-type assignment only
- contextual `bool` conversion allowed from `bool` and `int`
- no other implicit conversions
This gives useful C++ fidelity without opening a large conversion lattice.
Supported:
- default initialization without initializer
- copy initialization with `= expr`
For simplicity and defined behavior, the interpreter should not mimic uninitialized local scalar UB in version 1. Instead choose one of:
- strict C++-style UB model for uninitialized reads, detected dynamically
- explicit interpreter rule: reading an uninitialized variable is a runtime error
Recommended:
- every variable has an initialized flag
- declarations without initializer create uninitialized storage
- reading before initialization is a runtime error with source location
This is closer to a practical interpreter and still semantically honest.
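A minimal sketch of this recommendation, using a hypothetical `Slot` type where `None` stands for declared-but-uninitialized storage:

```rust
// Hypothetical slot model: every local carries an initialized state,
// and reading before the first write is a reported runtime error.
#[derive(Clone, Copy)]
enum Value { Int(i32), Bool(bool) }

struct Slot {
    value: Option<Value>, // None = declared but uninitialized
}

impl Slot {
    fn new() -> Slot { Slot { value: None } }
    fn store(&mut self, v: Value) { self.value = Some(v); }
    fn load(&self, name: &str) -> Result<Value, String> {
        self.value
            .ok_or_else(|| format!("read of uninitialized variable '{}'", name))
    }
}

fn main() {
    let mut x = Slot::new();          // int x;
    assert!(x.load("x").is_err());    // print(x); -> runtime error
    x.store(Value::Int(5));           // x = 5;
    assert!(x.load("x").is_ok());     // print(x); -> fine
}
```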
If supported:
- elements default to uninitialized
- element read before write is a runtime error
- zero-initialization syntax is unsupported in version 1
Operations are performed on interpreter integers. Recommended baseline semantics:
- 32-bit signed range
- overflow is a runtime error, or implementation-defined wraparound
This needs a deliberate choice because real signed overflow in C++ is UB.
Recommended v1 choice:
- detect overflow and raise runtime error
Rationale:
- deterministic
- easier to debug
- safer
- acceptable for interpreter-defined handling of UB-like conditions
Document this explicitly:
NotPlusPlus does not reproduce all undefined behavior of full C++. Certain UB-prone operations are trapped deterministically at runtime.
This is acceptable because the subset is “real C++ syntax and semantics” only within a constrained executable model; UB emulation is not required.
Division by zero and modulo by zero are runtime errors.
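Rust's checked arithmetic makes the trap-on-UB policy straightforward to sketch (hypothetical helper functions, not the project's real evaluator):

```rust
// Sketch of the recommended policy: overflow, division by zero, and
// modulo by zero become deterministic runtime errors instead of C++ UB.
fn add(a: i32, b: i32) -> Result<i32, String> {
    a.checked_add(b).ok_or_else(|| "integer overflow in '+'".to_string())
}

fn div(a: i32, b: i32) -> Result<i32, String> {
    // checked_div returns None for b == 0 and for i32::MIN / -1 (overflow).
    a.checked_div(b)
        .ok_or_else(|| "division by zero or overflow in '/'".to_string())
}

fn main() {
    assert_eq!(add(2, 3), Ok(5));
    assert!(add(i32::MAX, 1).is_err()); // trapped, not wrapped
    assert!(div(1, 0).is_err());        // runtime error
    assert_eq!(div(7, 2), Ok(3));       // truncation toward zero, as in C++
}
```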
Operands are contextually converted to bool. Evaluation short-circuits.
- `int` relational comparison is supported
- `bool` relational comparison is unsupported unless explicitly added; recommended baseline: disallow except equality
- equality on same-type operands is supported
Assignment evaluates RHS, converts if allowed, stores value, and yields the assigned value if assignment expressions are expressions in the grammar.
Condition evaluated once. Then branch chosen accordingly.
Standard loop semantics.
Equivalent to C++ subset semantics, not a purely internal custom loop type. The interpreter may implement by direct execution or desugaring.
If desugared, it must preserve:
- init scope
- condition evaluation timing
- step evaluation timing
- block scoping behavior
A return transfers control immediately to the caller.
Reaching end of function:
- for a `void` function: allowed, implicit return
- for a non-void function other than `main`: semantic error or runtime error
Recommended:
- semantic analysis requires that every control path of a non-void function contains a return, but only if control-flow analysis is implemented
- otherwise, reaching end of a non-void function at runtime is a runtime error
For v1, do both:
- conservative static check when trivially obvious
- definitive runtime check at function end
For main, reaching end may return 0 in full C++, but to keep rules simple:
- require explicit `return 0;` in version 1, or
- allow implicit `return 0;` for `main`
Recommended baseline:
- allow an implicit `return 0` at the end of `int main()`
Functions may be declared and later defined, or directly defined.
Arguments are evaluated left-to-right as a deliberate subset policy. Full C++ has historically complex sequencing rules. Since this project is an interpreter for a subset, choose a fixed order and document it.
Recommended:
- evaluate function arguments left-to-right
This is slightly stricter than some historical C++ behavior, but deterministic and implementable.
Direct and indirect recursion are supported unless explicitly disabled. Recommended: support recursion.
Implementation shall provide:
- configurable max call depth
- runtime error on stack depth exhaustion
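A minimal sketch of the configurable depth guard (hypothetical `CallStack` type; real frames would carry slots and a function id):

```rust
// Sketch of a configurable call-depth guard for the interpreter.
struct CallStack {
    depth: usize,
    max_depth: usize,
}

impl CallStack {
    fn push(&mut self) -> Result<(), String> {
        if self.depth >= self.max_depth {
            return Err("call depth exceeded".to_string()); // runtime error
        }
        self.depth += 1;
        Ok(())
    }
    fn pop(&mut self) {
        self.depth -= 1;
    }
}

fn main() {
    let mut stack = CallStack { depth: 0, max_depth: 2 };
    assert!(stack.push().is_ok());
    assert!(stack.push().is_ok());
    assert!(stack.push().is_err()); // third nested call rejected
    stack.pop();
    assert!(stack.push().is_ok());  // depth freed on return
}
```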
Built-ins are represented as ordinary callable global functions with interpreter-native implementations.
Version 1 required:
- `void print(int)`
- optional: `void print(bool)`
Because overload resolution is unsupported, there are two implementation options:
1. A single polymorphic built-in outside the user function model: simpler runtime, less C++-faithful.
2. Allow limited built-in overloads only, while forbidding user-defined overloads.
Recommended:
- built-ins may have a small internal overload set
- user-defined overloads remain unsupported
This should be explicitly documented as a runtime privilege, not general language support.
The parser need not implement full ISO grammar. It shall implement a reduced grammar that accepts exactly the subset.
The grammar below is normative at the subset level, though implementation may refactor it.
identifier ::= letter (letter | digit | "_")*
int_literal ::= digit+

keywords: "int" "bool" "void" "if" "else" "while" "for" "return" "true" "false"

punctuators: "(" ")" "{" "}" "[" "]" ";" "," "="
"+" "-" "*" "/" "%" "!" "&&" "||"
"==" "!=" "<" "<=" ">" ">="

translation_unit
::= top_level_decl*
top_level_decl
::= function_decl
| function_def
function_decl
::= type identifier "(" parameter_list_opt ")" ";"
function_def
::= type identifier "(" parameter_list_opt ")" compound_stmt
parameter_list_opt
::= /* empty */
| parameter_list
parameter_list
::= parameter ("," parameter)*
parameter
::= type identifier
| type identifier "[" int_literal "]" // only if array params are supported, otherwise omit
type
::= "int"
| "bool"
| "void"
compound_stmt
::= "{" stmt* "}"
stmt
::= compound_stmt
| decl_stmt
| expr_stmt
| if_stmt
| while_stmt
| for_stmt
| return_stmt
decl_stmt
::= local_var_decl ";"
local_var_decl
::= type identifier
| type identifier "=" expr
| type identifier "[" int_literal "]" // if arrays enabled
expr_stmt
::= expr_opt ";"
expr_opt
::= /* empty */
| expr
if_stmt
::= "if" "(" expr ")" stmt ("else" stmt)?
while_stmt
::= "while" "(" expr ")" stmt
for_stmt
::= "for" "(" for_init ";" expr_opt ";" expr_opt ")" stmt
for_init
::= /* empty */
| expr
| local_var_decl
return_stmt
::= "return" expr_opt ";"
expr
::= assignment_expr
assignment_expr
::= logical_or_expr
| unary_lvalue "=" assignment_expr
logical_or_expr
::= logical_and_expr ("||" logical_and_expr)*
logical_and_expr
::= equality_expr ("&&" equality_expr)*
equality_expr
::= relational_expr (("==" | "!=") relational_expr)*
relational_expr
::= additive_expr (("<" | "<=" | ">" | ">=") additive_expr)*
additive_expr
::= multiplicative_expr (("+" | "-") multiplicative_expr)*
multiplicative_expr
::= unary_expr (("*" | "/" | "%") unary_expr)*
unary_expr
::= primary_expr
| "+" unary_expr
| "-" unary_expr
| "!" unary_expr
primary_expr
::= identifier
| int_literal
| "true"
| "false"
| "(" expr ")"
| call_expr
| array_subscript
call_expr
::= identifier "(" argument_list_opt ")"
argument_list_opt
::= /* empty */
| argument_list
argument_list
::= expr ("," expr)*
array_subscript
::= identifier "[" expr "]"
unary_lvalue
::= identifier
| array_subscript

- No expression may start with a type name; this eliminates cast ambiguity in version 1.
- No declaration/expression ambiguity beyond the `for` init should remain.
- Multiple declarators are excluded to simplify the grammar and semantics.
This section is crucial because “actual C++ source text” intersects with the preprocessor.
Recommended baseline:
- NotPlusPlus does not implement the C preprocessor
- Source files containing preprocessing directives are rejected, except optionally a tiny whitelist for built-in headers that are semantically ignored
This is the cleanest design.
The goal is to interpret actual C++ syntax and semantics for a subset. The preprocessor is not part of the core expression/statement/declaration grammar and introduces a separate textual transformation language. Supporting C++ source text does not require full preprocessor support in v1.
If desired, support exactly:
- `#include <npp>` or `#include "npp.hpp"`
Semantics:
- no real file loading
- interpreter injects declarations for built-ins such as `print`
But this should only be added if there is a strong UX reason. Otherwise, it is simpler to treat built-ins as always available.
- `#define`
- `#if`, `#ifdef`, etc.
- `#include` of arbitrary headers
- `#pragma`
- `#line`
Diagnostic category:
unsupported_preprocessor_directive
NotPlusPlus shall be implemented as a staged front-end plus interpreter:
- Source Manager
- Lexer
- Parser
- AST Builder
- Semantic Analyzer
- Lowered Semantic IR or Direct Annotated AST
- Interpreter Runtime
- Diagnostic Engine
Two viable approaches:
Interpret directly over the AST with semantic annotations.
Pros:
- simpler initial implementation
- fewer intermediate representations
- easier source-location propagation
Cons:
- semantic analysis and runtime concerns may get mixed
- harder to optimize later
Parse to AST, analyze semantically, then lower to a small control-flow/statement IR for interpretation.
Pros:
- cleaner separation
- easier execution engine
- easier constant folding, debugging, tracing
- better long-term maintainability
Cons:
- more engineering upfront
Recommended architecture: hybrid.
- Parse into a high-level AST
- Perform semantic analysis on AST and produce a resolved, typed semantic model
- Lower expressions/statements/functions into a typed executable IR for interpretation after semantic analysis succeeds
This keeps the parser close to source while giving the runtime a cleaner structure.
Responsibilities:
- own file contents
- map byte offsets to line/column
- produce source spans
- provide excerpt rendering for diagnostics
Responsibilities:
- tokenize source
- skip comments and whitespace
- recognize keywords and operators
- report invalid tokens
- attach source spans to tokens
Responsibilities:
- consume token stream
- build AST
- distinguish declaration forms from expression forms within subset grammar
- recover from syntax errors where practical
Responsibilities:
- preserve source structure and spans
- represent declarations, statements, expressions, and types
- remain syntax-level, not runtime-level
Responsibilities:
- symbol table construction
- declaration validation
- name resolution
- type checking
- lvalue/rvalue classification
- function signature registration
- built-in injection
- subset rule enforcement
- unsupported construct detection
Responsibilities:
- transform semantically valid AST into execution-friendly nodes
- eliminate parse-only artifacts
- make control flow explicit
- store resolved declaration IDs and type IDs
Responsibilities:
- manage call stack
- manage variable storage
- evaluate expressions
- execute statements
- invoke built-ins
- detect runtime errors
Responsibilities:
- collect and render compile-time diagnostics
- report runtime errors with stack trace and source spans
- provide stable error codes
Every token and AST node shall carry a source span:
- file id
- start offset
- end offset
Derived on demand:
- line
- column
Each token:
- kind
- lexeme slice or interned content
- source span
Token kinds include:
- identifiers
- literals
- keywords
- punctuators
- eof
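One possible Rust shape for these fields (illustrative only; names are assumptions, not the project's final API):

```rust
// Hypothetical token shape matching the fields above; spans are byte
// offsets into the source buffer owned by the source manager.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Identifier,
    IntLiteral,
    Keyword,
    Punctuator,
    Eof,
}

#[derive(Debug, PartialEq)]
struct Span {
    file_id: u32,
    start: u32, // byte offset, line/column derived on demand
    end: u32,
}

#[derive(Debug, PartialEq)]
struct Token {
    kind: TokenKind,
    lexeme: String, // or an interned symbol id
    span: Span,
}

fn main() {
    let t = Token {
        kind: TokenKind::Keyword,
        lexeme: "int".to_string(),
        span: Span { file_id: 0, start: 0, end: 3 },
    };
    assert_eq!(t.kind, TokenKind::Keyword);
    assert_eq!(t.lexeme, "int");
}
```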
A representative AST model:
- list of top-level declarations
- declarations: `FunctionDecl`, `FunctionDef`, `ParamDecl`, `VarDecl`
- statements: `CompoundStmt`, `DeclStmt`, `ExprStmt`, `IfStmt`, `WhileStmt`, `ForStmt`, `ReturnStmt`
- expressions: `IntLiteralExpr`, `BoolLiteralExpr`, `NameExpr`, `UnaryExpr`, `BinaryExpr`, `AssignExpr`, `CallExpr`, `SubscriptExpr`, `ParenExpr`
- types: `BuiltinType(Int | Bool | Void)`, `ArrayType(element_type, size)`
Every expression node shall later carry:
- resolved type
- value category: lvalue or rvalue
- maybe constant-value metadata if constant folding is added
Fields:
- name
- return type
- parameter types
- declaration span
- definition pointer if defined
- builtin flag
- builtin handler id if builtin
Fields:
- name
- type
- scope id
- declaration span
- storage class category: local / parameter
- runtime slot index
Represent types structurally:
- `Int`
- `Bool`
- `Void`
- `Array(TypeId element, uint32 size)`
Intern types in a central table for canonical equality.
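A minimal interning sketch (hypothetical types; a real table would likely use a hash map rather than a linear scan):

```rust
// Sketch of structural type interning: equal types get equal ids, so
// type equality elsewhere in the pipeline is an integer comparison.
#[derive(Clone, PartialEq)]
enum Type {
    Int,
    Bool,
    Void,
    Array(TypeId, u32), // element type id, fixed size
}

#[derive(Clone, Copy, PartialEq, Debug)]
struct TypeId(usize);

struct TypeTable {
    types: Vec<Type>,
}

impl TypeTable {
    fn intern(&mut self, ty: Type) -> TypeId {
        // Canonicalize: reuse an existing entry if one matches structurally.
        if let Some(i) = self.types.iter().position(|t| *t == ty) {
            return TypeId(i);
        }
        self.types.push(ty);
        TypeId(self.types.len() - 1)
    }
}

fn main() {
    let mut table = TypeTable { types: Vec::new() };
    let int_id = table.intern(Type::Int);
    let a1 = table.intern(Type::Array(int_id, 5));
    let a2 = table.intern(Type::Array(int_id, 5));
    assert_eq!(a1, a2); // int[5] interned exactly once
}
```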
Recommended IR granularity:

Function:
- name
- return type
- parameter slots
- body block

Statements:
- block
- local declaration
- store
- if
- while
- for or lowered-for
- return
- expr statement

Expressions:
- literal
- load local
- unary op
- binary op
- short-circuit logical
- call resolved function id
- subscript load/store address form if arrays enabled
Important: use separate lvalue-capable nodes or addressable references for assignable expressions.
Use recursive descent.
This is the correct choice for the subset because:
- grammar is controlled
- precedence handling is straightforward
- diagnostics are readable
- implementation is easy to maintain
Top-level parse logic:
- parse type
- parse identifier
- if the next token is `(`, parse a function declaration/definition
- otherwise reject, because top-level non-function declarations are unsupported
Local scope parse logic:
- if token begins a supported type specifier, parse declaration statement
- else parse expression statement
Because casts, user-defined types, and elaborate declarators are excluded, this remains unambiguous.
Use precedence climbing or hand-written precedence functions. Recommended:
- dedicated functions per precedence level
This makes associativity clear:
- assignment right-associative
- others left-associative
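As an illustration, a dedicated function per precedence level looks like this (hypothetical token and parser types; the sketch folds integer values instead of building AST nodes, to stay short):

```rust
// Minimal illustration of one function per precedence level for
// "+ -" over "* /". A real parser would return AST nodes, not i32.
#[derive(Clone, Copy)]
enum Tok { Num(i32), Plus, Minus, Star, Slash }

struct Parser<'a> { toks: &'a [Tok], pos: usize }

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<Tok> { self.toks.get(self.pos).copied() }
    fn bump(&mut self) -> Option<Tok> { let t = self.peek(); self.pos += 1; t }

    // additive ::= multiplicative (("+" | "-") multiplicative)*
    fn additive(&mut self) -> i32 {
        let mut lhs = self.multiplicative();
        loop {
            match self.peek() {
                Some(Tok::Plus) => { self.bump(); lhs += self.multiplicative(); }
                Some(Tok::Minus) => { self.bump(); lhs -= self.multiplicative(); }
                _ => return lhs, // left-associative by loop construction
            }
        }
    }

    // multiplicative ::= primary (("*" | "/") primary)*
    fn multiplicative(&mut self) -> i32 {
        let mut lhs = self.primary();
        loop {
            match self.peek() {
                Some(Tok::Star) => { self.bump(); lhs *= self.primary(); }
                Some(Tok::Slash) => { self.bump(); lhs /= self.primary(); }
                _ => return lhs,
            }
        }
    }

    fn primary(&mut self) -> i32 {
        match self.bump() {
            Some(Tok::Num(n)) => n,
            _ => panic!("expected integer literal"),
        }
    }
}

fn main() {
    // 2 + 3 * 4 binds as 2 + (3 * 4) = 14 because multiplicative
    // is parsed by the tighter-binding level.
    let toks = [Tok::Num(2), Tok::Plus, Tok::Num(3), Tok::Star, Tok::Num(4)];
    let mut p = Parser { toks: &toks, pos: 0 };
    assert_eq!(p.additive(), 14);
}
```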
Parser should recover at:
- `;`
- `}`
- top-level declaration boundaries
Recovery is important for multi-error reporting in source files.
Semantic analysis should be split into at least three passes.
- collect all top-level function declarations and definitions
- register built-ins
- detect duplicate function names/signatures
For each function:
- establish parameter scope
- analyze statements and expressions
- resolve identifiers
- check types
- assign local storage slots
- verify `main` exists with a valid signature
- verify every non-builtin called function exists
- verify definitions for declared-but-called functions
- verify no unsupported unresolved forms remain
Use nested scope tables:
- each scope has parent
- variables inserted locally
- functions stored globally
Implementation detail:
- do not store functions in the same namespace structure as variables unless later needed for shadowing/lookup fidelity
- since local functions are unsupported, a separate global function table is simpler
When encountering an identifier expression:
- search local scope chain for variable
- if expression form is a call, resolve as function in global function table
- otherwise error if no variable found
A bare function name as value is unsupported because function pointers are unsupported.
Representative rules:

- unary `+` and `-`: operand must be `int`, result `int`
- unary `!`: operand must be contextually convertible to `bool`, result `bool`
- arithmetic `+ - * / %`: both operands `int`, result `int`
- relational `< <= > >=`: both operands `int`, result `bool`
- equality `== !=`: both operands same supported scalar type, result `bool`
- logical `&&` and `||`: operands contextually convertible to `bool`, result `bool`
- assignment: LHS must be an assignable lvalue; RHS must be the same type or an explicitly allowed conversion; result type is the LHS type
- call: function must exist; arity must match; each argument type must match the parameter type
- subscript: base must be an array lvalue; index must be `int`; result is an lvalue of the element type
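These rules can be sketched as a single dispatch function (hypothetical, simplified types, covering one operator per category):

```rust
// Sketch of binary-operator type checking; real code would also carry
// spans and emit structured diagnostics instead of strings.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Ty { Int, Bool }

enum BinOp { Add, Less, Eq, And }

fn check_binary(op: &BinOp, lhs: Ty, rhs: Ty) -> Result<Ty, String> {
    match op {
        // arithmetic: both operands int, result int
        BinOp::Add if lhs == Ty::Int && rhs == Ty::Int => Ok(Ty::Int),
        // relational: both operands int, result bool
        BinOp::Less if lhs == Ty::Int && rhs == Ty::Int => Ok(Ty::Bool),
        // equality: same scalar type, result bool
        BinOp::Eq if lhs == rhs => Ok(Ty::Bool),
        // logical: both int and bool are contextually convertible to bool
        BinOp::And => Ok(Ty::Bool),
        _ => Err("type mismatch".to_string()),
    }
}

fn main() {
    assert_eq!(check_binary(&BinOp::Add, Ty::Int, Ty::Int), Ok(Ty::Int));
    assert_eq!(check_binary(&BinOp::Less, Ty::Int, Ty::Int), Ok(Ty::Bool));
    assert!(check_binary(&BinOp::Add, Ty::Bool, Ty::Int).is_err());
}
```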
Need explicit value category classification.
Lvalues:
- variable references
- array element expressions
Rvalues:
- literals
- arithmetic expressions
- comparison expressions
- function calls returning non-array scalar values
- parenthesized lvalues may preserve lvalue if desired, but for v1 this can be simplified only if parser/analysis tracks it properly
Recommended:
- preserve lvalue-ness through parentheses
Full control-flow analysis is not required for v1. Provide:
- simple structural check where obvious
- runtime guard on falling off end of non-void function
Example runtime guard:
- if the function body completes without a `Return` outcome, emit the runtime error "control reached end of non-void function"
The parser and semantic analyzer must produce specific diagnostics when unsupported but recognizable constructs are used.
Examples:
- `const int x = 1;` → unsupported type qualifier
- `int* p;` → unsupported pointer declarator
- `namespace std {}` → unsupported namespace declaration
- `x++;` → unsupported operator
This is better than generic parse failure when the construct is lexically recognizable.
NotPlusPlus interprets one executable function at a time using a call stack.
Execution starts at main.
Recommended runtime value enum:
- `Int(i32)`
- `Bool(bool)`
- `Array(ArrayObjectId)` or an inline array storage reference
Avoid boxing every scalar if performance matters, but correctness is primary.
Each function activation record has local storage slots. Each local variable symbol is assigned a slot index during semantic analysis or IR lowering.
A frame contains:
- function id
- slots vector
- maybe scope metadata if block-lifetime destruction ever matters
Each slot contains:
- type id
- initialized flag
- value or array object reference
Because variable lifetime is lexical and there are no destructors in v1, there are two implementation strategies:
1. Push/pop runtime maps for each block.
2. Assign each declaration a unique frame slot, valid for the lifetime of the frame; use scope metadata only to block illegal access at compile time.
Recommended:
- fixed slot frame
Rationale:
- simpler runtime
- faster access
- no need to allocate/deallocate per block
- lexical rules already enforced statically
Arrays live in their variable slots.
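A minimal sketch of the fixed-slot frame (hypothetical types; slot indices are assumed to be assigned during semantic analysis or IR lowering):

```rust
// Fixed-slot frame: every local, including shadowed locals in inner
// blocks, gets its own slot index at analysis time.
#[derive(Clone, Copy)]
enum Value { Int(i32), Bool(bool) }

struct Frame {
    function_id: usize,
    slots: Vec<Option<Value>>, // None = uninitialized slot
}

impl Frame {
    fn new(function_id: usize, slot_count: usize) -> Frame {
        Frame { function_id, slots: vec![None; slot_count] }
    }
}

fn main() {
    // int main() { int x = 1; { int x = 2; } }
    // -> two slots, one per declaration, both live for the whole frame.
    let mut f = Frame::new(0, 2);
    f.slots[0] = Some(Value::Int(1)); // outer x
    f.slots[1] = Some(Value::Int(2)); // inner shadowing x
    assert_eq!(f.slots.len(), 2);
    assert!(f.slots[0].is_some());
}
```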
Call procedure:
- evaluate arguments left-to-right
- create new frame
- initialize parameter slots with argument values
- mark non-parameter locals uninitialized
- execute body
- on return, validate return type and yield value
- pop frame
Use an explicit control-flow result type:
ExecOutcome =
Normal
Break
Continue
Return(Value)
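A sketch of how the outcome type threads through statement execution (hypothetical, simplified statement model): compound statements execute children in order and stop at the first non-`Normal` outcome, which unwinds to the enclosing loop or call.

```rust
// Sketch of outcome propagation through statement execution.
enum ExecOutcome {
    Normal,
    Break,
    Continue,
    Return(i32), // stand-in for Return(Value)
}

enum Stmt {
    Nop,
    Return(i32),
    Block(Vec<Stmt>),
}

fn exec(stmt: &Stmt) -> ExecOutcome {
    match stmt {
        Stmt::Nop => ExecOutcome::Normal,
        Stmt::Return(v) => ExecOutcome::Return(*v),
        Stmt::Block(stmts) => {
            for s in stmts {
                match exec(s) {
                    ExecOutcome::Normal => {} // keep executing
                    other => return other,    // unwind break/continue/return
                }
            }
            ExecOutcome::Normal
        }
    }
}

fn main() {
    // { ; return 0; ; } -> the trailing statement is never executed.
    let body = Stmt::Block(vec![Stmt::Nop, Stmt::Return(0), Stmt::Nop]);
    assert!(matches!(exec(&body), ExecOutcome::Return(0)));
}
```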
If arrays are supported:
Each array variable slot contains:
- element type
- fixed size
- element storage array
- initialized bitset per element
- indexing performs bounds check
- out-of-range access is runtime error
- array expression does not decay to pointer
- array value passing is unsupported unless array parameters are explicitly modeled
Built-ins are dispatched by function symbol or handler id.
Example:
- `print(int)` writes the decimal integer to stdout or the interpreter output sink
- `print(bool)` writes `true` or `false`
The runtime must abstract output through an interface for testability:
- real stdout sink
- capture sink for unit tests
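A minimal sketch of such an interface (hypothetical trait and type names):

```rust
// Abstract output sink so print() output can be captured in tests.
trait OutputSink {
    fn write_line(&mut self, s: &str);
}

struct StdoutSink;
impl OutputSink for StdoutSink {
    fn write_line(&mut self, s: &str) { println!("{}", s); }
}

#[derive(Default)]
struct CaptureSink { lines: Vec<String> }
impl OutputSink for CaptureSink {
    fn write_line(&mut self, s: &str) { self.lines.push(s.to_string()); }
}

// The print(int) built-in dispatches through the sink, never
// directly to stdout.
fn builtin_print_int(sink: &mut dyn OutputSink, v: i32) {
    sink.write_line(&v.to_string());
}

fn main() {
    let mut sink = CaptureSink::default();
    builtin_print_int(&mut sink, 42);
    assert_eq!(sink.lines, vec!["42".to_string()]);
}
```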
- invalid character
- malformed token
- unterminated block comment
- unexpected token
- expected token
- malformed declaration
- malformed expression
- unknown identifier
- redeclaration
- type mismatch
- invalid assignment target
- wrong argument count
- wrong argument type
- missing `main`
- invalid `main` signature
- unsupported construct
- division by zero
- modulo by zero
- integer overflow if trapped
- uninitialized read
- array bounds violation
- call depth exceeded
- missing return at runtime
- internal interpreter fault
Recommended structure:
- severity
- error code
- primary source span
- human-readable message
- optional notes
- optional related spans
Example:
error[NPP2004]: use of undeclared identifier 'x'
--> sample.cpp:4:12
|
4 | y = x + 1;
| ^
note: no local variable or parameter named 'x' is visible in this scope
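One possible Rust shape for this record (illustrative; only the header line of the rendering above is sketched, the full excerpt rendering would live in the diagnostic engine):

```rust
// Hypothetical diagnostic record matching the recommended structure.
enum Severity { Error, Warning, Note }

struct Span { file_id: u32, start: u32, end: u32 }

struct Diagnostic {
    severity: Severity,
    code: String,       // e.g. "NPP2004"
    primary_span: Span,
    message: String,
    notes: Vec<String>, // optional notes
}

// Render just the header line, e.g. "error[NPP2004]: ...".
fn header(d: &Diagnostic) -> String {
    let sev = match d.severity {
        Severity::Error => "error",
        Severity::Warning => "warning",
        Severity::Note => "note",
    };
    format!("{}[{}]: {}", sev, d.code, d.message)
}

fn main() {
    let d = Diagnostic {
        severity: Severity::Error,
        code: "NPP2004".to_string(),
        primary_span: Span { file_id: 0, start: 40, end: 41 },
        message: "use of undeclared identifier 'x'".to_string(),
        notes: vec![],
    };
    assert_eq!(header(&d), "error[NPP2004]: use of undeclared identifier 'x'");
}
```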
Recommended:
- `NPP1xxx` lexical
- `NPP2xxx` syntax
- `NPP3xxx` semantic
- `NPP4xxx` runtime
- `NPP9xxx` internal
Runtime errors should emit:
- message
- source span of failing expression/statement
- call stack with function names and call sites where available
Built-ins must be valid C++ function calls, not pseudo-syntax.
This preserves the design goal of accepting real C++ source syntax.
Minimum:
```cpp
void print(int);
void print(bool);
```

If overload support for built-ins is undesirable, alternative names:

```cpp
void print_int(int);
void print_bool(bool);
```

However, `print` overloads are a better user experience and still manageable if isolated to built-ins.
The interpreter internally injects declarations equivalent to:
```cpp
void print(int);
void print(bool);
```

These declarations are reserved. User code may not redefine them.
- `int`: decimal
- `bool`: `true` or `false`
- no automatic newline unless `println` built-ins are added
Exactly one valid definition of `int main()` is required.

Recommended baseline:

- this is the only valid entry signature in version 1

Disallowed:

- `void main()`
- parameterized `main`
- overloaded `main`

Semantics:

- explicit `return int_expr;` supported
- reaching end of `main` returns `0`
A recommended Rust implementation layout:
notplusplus/
src/
main.rs
source/
mod.rs # source_manager + span
lex/
mod.rs # lexer entry point
token.rs
lexer.rs
parse/
mod.rs # parser entry point
ast.rs
parser.rs
sema/
mod.rs
types.rs
symbols.rs
scope.rs
sema.rs
ir/
mod.rs
ir.rs
lower.rs
interp/
mod.rs
value.rs
frame.rs
runtime.rs
builtins.rs
diag/
mod.rs
diagnostic.rs
engine.rs
support/
mod.rs
intern.rs
tests/
lexer/
parser/
sema/
runtime/
integration/
docs/
design.md
Cargo.toml
Each directory is a Rust module rooted at mod.rs. Visibility is controlled via pub and pub(crate) — prefer pub(crate) for cross-module interfaces that are not part of any public API. There is no public library crate surface in v1; the binary is the product.
Rust is the primary implementation language for NotPlusPlus.
Reasoning:
- Rust's enum-based ADTs and exhaustive pattern matching map directly and naturally onto the AST, IR, and value representation. Every node kind, every value variant, and every diagnostic category becomes a type-checked variant. Adding or removing a variant produces compile errors at every unhandled match site, which enforces consistency across the pipeline automatically.
- The ownership model eliminates a class of bugs common in hand-written interpreters: use-after-free in value frames, dangling references into scope stacks, and double-free in runtime environments. These are precisely the failure modes that matter in an interpreter managing its own call stack and variable storage.
- Rust has no garbage collector. The interpreter controls its own memory layout for call frames and runtime values, which is preferable for a system that tracks initialization state per variable slot and enforces configurable call-depth limits.
- The `Result` and `Option` types enforce explicit error handling throughout the pipeline. Diagnostic emission cannot be accidentally silenced; every fallible operation must be handled at the call site.
- The Rust ecosystem provides mature support for the diagnostic infrastructure this project requires. Crates such as `miette` and `codespan-reporting` provide span-aware, terminal-formatted error output without bespoke implementation effort.
- Rust's test infrastructure (`#[test]`, `#[cfg(test)]`, and the integration test convention under `tests/`) maps directly onto the layered test strategy defined in §22 without any additional tooling.
The pipeline stages defined in §3.7 translate to Rust modules as follows. The lexer produces a flat Vec<Token> with span metadata. The parser consumes tokens and produces an owned AST using Box<Expr> and Vec<Stmt> for recursive structure. The semantic analyzer walks the AST and produces a resolved, typed semantic model with symbol tables, scopes, and expression annotations. IR lowering then translates that validated semantic model into executable IR. The interpreter walks the IR using a call stack of Frame values, each holding a slot array for local variables. All inter-stage errors are returned as structured Diagnostic values accumulated in a shared engine rather than raised as panics or printed inline.
panic! is reserved for genuinely impossible internal states — conditions that represent interpreter bugs, not user program errors. All user-facing failures travel through the diagnostic engine.
C++ was considered for its symbolic symmetry with the project's subject matter. It is rejected because building a correct, safe interpreter runtime in C++ requires disciplined manual memory management that adds implementation risk without design benefit. The project's value is in its semantic correctness, not its implementation language irony.
Python is suitable for early prototyping but is not appropriate as the final implementation language. The absence of static types across the pipeline makes it harder to enforce the invariants that the design depends on — particularly around type-checking, IR lowering, and frame management.
Version 1 shall define a deterministic evaluation order even where the full C++ rules are historically unspecified or subtle.
Recommended:
- binary operator operands evaluated left-to-right except short-circuit forms, which obey short-circuit
- function call arguments evaluated left-to-right
- assignment evaluates RHS after LHS addressability check but before store
- subscript evaluates base then index
This is a conscious simplification. It must be documented as a subset semantic choice.
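The left-to-right argument rule above falls out naturally if the interpreter evaluates argument expressions in source order. A simplified sketch; the `Expr` and `Value` types here are placeholders, not the real IR:

```rust
// Placeholder value and expression types for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Value { Int(i64), Bool(bool) }

enum Expr {
    Lit(i64),
    Effect(i64), // stands in for any side-effecting subexpression
}

// Evaluate one expression, recording side effects in `log`.
fn eval(e: &Expr, log: &mut Vec<i64>) -> Value {
    match e {
        Expr::Lit(n) => Value::Int(*n),
        Expr::Effect(n) => {
            log.push(*n); // observable side effect
            Value::Int(*n)
        }
    }
}

// Left-to-right: each argument is fully evaluated before the next begins.
fn eval_args(args: &[Expr], log: &mut Vec<i64>) -> Vec<Value> {
    args.iter().map(|a| eval(a, log)).collect()
}
```

Because `eval_args` iterates the slice in source order, each argument's side effects complete before the next argument is evaluated, making call-site behavior deterministic by construction.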
NotPlusPlus is not required to emulate all undefined behavior of ISO C++. For supported constructs:
- some UB-like conditions are rejected statically where possible
- some are trapped dynamically with deterministic runtime errors
Examples:
- uninitialized read → runtime error
- signed overflow → runtime error if checked arithmetic chosen
- division by zero → runtime error
This is acceptable because the subset contract is explicit and the interpreter semantics are deterministic.
- allocate or locate variable slot
- if initializer present: evaluate, type-check, store, mark initialized
- else mark uninitialized
- evaluate for side effects
- discard result
- evaluate condition
- contextually convert to bool
- execute one branch
- reevaluate condition before every iteration
- short-circuit semantics inside condition preserved
Logical execution model:
- execute init if present
- if cond present, test it; else treat as true
- execute body
- execute step if present
- repeat
If init is a declaration, its scope includes cond, step, and body, and ends after the loop.
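The logical model above can be sketched as a generic loop driver, with closures standing in for the lowered cond/step/body. This is illustrative only; `break`/`continue` and diagnostics are omitted, and the state-threading style is not the required interpreter shape:

```rust
// Sketch of the for-loop logical model. The init clause, if present,
// runs once before this driver is entered; its scope covers cond,
// step, and body, and ends after the loop.
fn exec_for<S>(
    state: &mut S,
    cond: impl Fn(&S) -> bool,  // an absent cond is modeled as |_| true
    step: impl Fn(&mut S),
    body: impl Fn(&mut S),
) {
    while cond(state) { // if cond present, test it; else treat as true
        body(state);    // execute body
        step(state);    // execute step if present
    }                   // repeat: re-test the condition
}
```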
This section is normative if arrays are in scope for v1; otherwise it is phase-2 design.
- local fixed-size arrays only
- element types: `int`, `bool`
- one-dimensional only
int a[5];
bool seen[10];
a[0] = 42;
print(a[0]);

- storage duration: function activation/frame lifetime
- indexing requires integer index
- bounds checked
- no array-to-pointer decay
- arrays are not first-class assignable values
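A bounds-checked local array slot might be sketched as follows. The `RuntimeError` shape is illustrative, and per-element initialization tracking is omitted for brevity:

```rust
// Illustrative runtime error for a trapped bounds violation.
#[derive(Debug, PartialEq)]
enum RuntimeError {
    BoundsViolation { index: i64, len: usize },
}

// A local fixed-size int array living in a frame.
struct IntArray {
    elems: Vec<i64>,
}

impl IntArray {
    fn new(len: usize) -> Self {
        IntArray { elems: vec![0; len] } // init tracking omitted in this sketch
    }

    // Negative and past-the-end indices both trap deterministically; there
    // is no decay to a pointer and no undefined behavior.
    fn checked_index(&self, index: i64) -> Result<usize, RuntimeError> {
        if index < 0 || index as usize >= self.elems.len() {
            Err(RuntimeError::BoundsViolation { index, len: self.elems.len() })
        } else {
            Ok(index as usize)
        }
    }

    fn get(&self, index: i64) -> Result<i64, RuntimeError> {
        self.checked_index(index).map(|i| self.elems[i])
    }

    fn set(&mut self, index: i64, v: i64) -> Result<(), RuntimeError> {
        let i = self.checked_index(index)?;
        self.elems[i] = v;
        Ok(())
    }
}
```

Routing every subscript through a single `checked_index` helper keeps the bounds rule in one place for both loads and stores.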
Strong recommendation for version 1:
- do not support array parameters
Rationale:
- real C++ adjusts array parameters to pointers
- pointers are out of scope
- modeling this faithfully without pointers is awkward
So:
- array types allowed only for local variables
This section must remain explicit for product integrity.
- preprocessing directives
- namespace qualifications like `std::x`
- member access `a.b`
- stream insertion `<<`
- type qualifiers and storage specifiers
- advanced declarators
- pointers/references
- classes/structs
- initializer lists
- string and char literals
- templates
- exceptions
- overloading for user-defined functions
- implicit declarations
- aggregate initialization
The system shall reject unsupported constructs with targeted diagnostics wherever practical.
Example:
std::cout << x;

Preferred diagnostic:
error[NPP3018]: stream insertion expressions are unsupported
not merely:
error: expected ';'
- tokenization correctness
- comment handling
- integer literal scanning
- operator scanning
- source span correctness
- function declarations and definitions
- expression precedence
- statement forms
- syntax error recovery
- scope resolution
- shadowing
- redeclaration errors
- type mismatch errors
- return validity
- call resolution
- unsupported construct rejection
- arithmetic
- conditions
- loops
- recursion
- builtin output
- runtime errors
- end-to-end source file execution
- output capture comparison
- diagnostic golden files
Use golden files for:
- diagnostics
- stack traces
- program output
At minimum:
int main() {
int x = 2 + 3 * 4;
print(x);
return 0;
}

int main() {
int x = 1;
{
int x = 2;
print(x);
}
print(x);
return 0;
}

int add(int a, int b) {
return a + b;
}
int main() {
print(add(10, 20));
return 0;
}

int main() {
int i = 0;
while (i < 3) {
print(i);
i = i + 1;
}
return 0;
}

int main() {
for (int i = 0; i < 3; i = i + 1) {
print(i);
}
return 0;
}

int fact(int n) {
if (n == 0) {
return 1;
}
return n * fact(n - 1);
}
int main() {
print(fact(5));
return 0;
}

int main() {
int x;
print(x);
return 0;
}

int main() {
int* p;
return 0;
}

Deliverables:
- project scaffolding
- source manager
- diagnostics base
- token definitions
Exit criteria:
- build system works
- diagnostic rendering works
Deliverables:
- comments
- identifiers
- literals
- punctuation/operators
- keyword recognition
Exit criteria:
- lexer golden tests pass
Deliverables:
- function parse
- statement parse
- expression precedence parse
- AST generation
Exit criteria:
- parser accepts basic programs
- syntax errors reported correctly
Deliverables:
- function table
- variable scopes
- type checking
- `main` validation
- analyzed AST / semantic model for validated programs
Exit criteria:
- semantic test corpus passes
- unsupported constructs rejected accurately
Deliverables:
- typed IR lowering from the analyzed AST / semantic model
- scalar runtime values
- statements and expressions
- function call stack
- returns
- built-ins
Exit criteria:
- arithmetic, control flow, functions work end-to-end
Deliverables:
- `for`
- recursion
- stack traces
- runtime error reporting
Exit criteria:
- integration tests stable
Deliverables:
- array declaration
- indexing
- bounds checks
- initialization tracking
Exit criteria:
- array tests stable
Deliverables:
- improved diagnostics
- CLI options
- trace mode or debug dump mode
- documentation synchronization
Exit criteria:
- v1 release candidate
npp program.cpp

Supported flags:
- `--dump-tokens`
- `--dump-ast`
- `--dump-sema`
- `--dump-ir`
- `--trace-exec`
- `--no-color`
- `--max-call-depth=N`
- `0`: program ran successfully and returned 0
- non-zero program return code may map to process exit code if desired
- dedicated interpreter failure codes for diagnostics/runtime failures
Recommended:
- compilation/semantic failure → exit 2
- runtime failure → exit 3
- internal failure → exit 4
- successful program execution → program return code modulo process constraints
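The recommended mapping might be sketched as a single match. The `Outcome` variants are illustrative, and the `& 0xFF` clamp reflects typical Unix process exit-status constraints:

```rust
// Illustrative run outcomes; not the required interpreter API.
enum Outcome {
    ProgramReturned(i32), // value returned from main
    SemanticFailure,      // lexical/syntax/semantic diagnostics emitted
    RuntimeFailure,       // trapped runtime error
    InternalFailure,      // interpreter bug surfaced safely
}

fn process_exit_code(outcome: Outcome) -> i32 {
    match outcome {
        // Program return code, modulo what the host process can report
        // (Unix truncates exit statuses to the low 8 bits).
        Outcome::ProgramReturned(code) => code & 0xFF,
        Outcome::SemanticFailure => 2,
        Outcome::RuntimeFailure => 3,
        Outcome::InternalFailure => 4,
    }
}
```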
NotPlusPlus shall avoid behavior depending on:
- unordered container iteration
- host integer overflow semantics
- locale-dependent formatting
- platform-specific newline handling beyond normalized I/O
All runtime-visible semantics must be deterministic.
Performance is secondary to correctness for v1.
Expected scale:
- single-file programs
- tens to low hundreds of functions
- small recursion depths
- small arrays
- low-latency interpretation for educational/demo/scripting workloads
No optimization pipeline is required.
Source code is untrusted input. The interpreter must guard against:
- infinite recursion causing host stack overflow
- pathological parse recursion where practical
- excessive memory allocation from huge array sizes
- integer overflow in internal indexing
Configurable limits:
- maximum source size
- maximum call depth
- maximum array size
- maximum total allocated runtime storage
Use assertions for impossible states, but surface recoverable user-facing failures as diagnostics or runtime errors.
Risk:
- adding just one more feature like pointers or references causes cascading design complexity
Mitigation:
- freeze v1 subset
- require explicit design amendment for each feature addition
Risk:
- parser convenience may accidentally accept syntax that is not C++
Mitigation:
- every grammar addition must map to real C++ syntax
- no custom statements or operators
Risk:
- too many magic functions create non-C++ semantics
Mitigation:
- keep built-ins minimal
- model them as ordinary global functions
Risk:
- partial C++ conversion rules become inconsistent
Mitigation:
- keep conversion lattice deliberately tiny and documented
- prefer strict exact-match rules except contextual bool conversion
A source program is accepted by NotPlusPlus v1 if and only if:
- it consists of supported top-level function declarations/definitions
- every declaration, statement, and expression belongs to the supported subset
- type checking succeeds under the subset rules
- exactly one valid entry point `int main()` exists
- all runtime operations stay within defined execution constraints
A program outside that contract is rejected.
This is the strongest recommended baseline for a coherent first release.
- single translation unit
- comments
- `int`, `bool`, `void`
- function declarations and definitions
- local variables
- block scope
- integer/boolean literals
- arithmetic/comparison/logical expressions
- assignment
- `if`, `while`, `for`
- `return`
- `int main()`
- built-in `print(int)` and `print(bool)`
- recursion
- semantic diagnostics
- runtime error handling
- deterministic evaluation order
- array local variables with indexing and bounds checks
- AST and IR dump modes
- stack traces for runtime errors
- fixed slot allocation per frame
- macros
- headers
- namespaces
- pointers/references
- user overloads
- strings
- classes
- templates
- exceptions
int main() {
int x = 10;
int y = 20;
print(x + y);
return 0;
}

bool gt(int a, int b) {
return a > b;
}
int main() {
if (gt(5, 3)) {
print(true);
} else {
print(false);
}
return 0;
}

int main() {
int x = 0;
for (int i = 0; i < 3; i = i + 1) {
int x = i;
print(x);
}
print(x);
return 0;
}

int main() {
int a[3];
a[0] = 4;
a[1] = 5;
a[2] = a[0] + a[1];
print(a[2]);
return 0;
}

int main() {
int* p;
return 0;
}

Reason: pointers unsupported.
int main() {
print("hello");
return 0;
}

Reason: string literals unsupported.
int main() {
std::cout << 1;
return 0;
}

Reason: namespaces and stream insertion unsupported.
int main() {
int a = 1, b = 2;
return 0;
}

Reason: multi-declarator declarations unsupported in v1.
This document is the design contract for v1. Any feature addition must be recorded as an amendment specifying:
- syntax accepted
- semantics
- diagnostics
- runtime representation impact
- interaction with existing features
- migration impact on tests and docs
No feature should be added informally.
Milestone 3 ends after semantic analysis produces a resolved, typed semantic model over the AST. This model includes function symbols, variable scopes, name-resolution results, and expression type/lvalue annotations.
Typed executable IR lowering is deferred to milestone 4. Milestone 3 therefore validates programs semantically but does not yet require executable IR construction.
The semantic analyzer's variable binding table shall be keyed by ScopeId, not by positional index into a parallel vector. The binding structure shall be a map from ScopeId to a map from name to VarId:
bindings: HashMap<ScopeId, HashMap<String, VarId>>
Using a positional Vec indexed by ScopeId ordinal couples two independent allocation sequences: scope creation in the ScopeTree and entry creation in the bindings table. Any code path that creates a scope without a corresponding bindings push — or vice versa — produces silent index misalignment or a panic. A HashMap<ScopeId, ...> structure makes the association explicit and eliminates this coupling.
Variable lookup (§12.2) and shadowing (§6.1.3) behavior are unchanged. The scope tree's parent chain remains the authority for lexical lookup order. Only the internal storage representation changes.
All semantic analysis code that indexes into self.bindings[scope.0] must be replaced with keyed access. No test semantics change; only the internal data structure changes.
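A sketch of the keyed binding table together with parent-chain lookup; `String` keys stand in for interned names, and the surrounding types are simplified:

```rust
use std::collections::HashMap;

// Simplified identifiers; real code would use interned names.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ScopeId(u32);
#[derive(Debug, Clone, Copy, PartialEq)]
struct VarId(u32);

// Parent links: None marks the root (function body) scope.
struct ScopeTree {
    parent: HashMap<ScopeId, Option<ScopeId>>,
}

// Keyed by ScopeId, not by positional index: creating a scope and
// creating its binding entry can never fall out of alignment.
struct Bindings {
    bindings: HashMap<ScopeId, HashMap<String, VarId>>,
}

impl Bindings {
    fn declare(&mut self, scope: ScopeId, name: &str, var: VarId) {
        self.bindings.entry(scope).or_default().insert(name.to_string(), var);
    }

    // Walk the parent chain: the innermost declaration wins (shadowing).
    // Scopes that are not ancestors of `scope` are never consulted.
    fn lookup(&self, tree: &ScopeTree, mut scope: ScopeId, name: &str) -> Option<VarId> {
        loop {
            if let Some(v) = self.bindings.get(&scope).and_then(|m| m.get(name)) {
                return Some(*v);
            }
            match tree.parent.get(&scope).copied().flatten() {
                Some(p) => scope = p,
                None => return None,
            }
        }
    }
}
```

Because lookup walks only the parent chain, stale entries from exited sibling scopes are unreachable, which is exactly the invariant the no-pop amendment below relies on.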
Source files containing preprocessing directives shall be rejected with a targeted diagnostic of category unsupported_preprocessor_directive, not with a generic invalid-character error for #.
A line whose first non-whitespace character is # shall be recognized by the lexer as a preprocessing directive line.
The lexer shall emit a diagnostic with a dedicated code in the NPP1xxx range when encountering a #-prefixed directive. The diagnostic message shall identify the specific directive where recognizable (e.g., #include, #define, #ifdef, #pragma) and fall back to a generic "preprocessing directives are unsupported" message otherwise.
Example:
error[NPP1004]: preprocessing directive '#include' is unsupported
--> sample.cpp:1:1
|
1 | #include <iostream>
| ^^^^^^^^^
No preprocessing is performed. The directive line is consumed and skipped after diagnostic emission to allow continued lexing of subsequent source.
This amends §8 (Preprocessing Policy) by specifying the diagnostic mechanism. The lexer's existing invalid-character path for # is superseded by this targeted recognition.
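Directive recognition in the lexer might be sketched as follows; the directive list and message wording are illustrative, and the real lexer would attach spans and an NPP1xxx code rather than return a bare string:

```rust
// Sketch: given a source line, decide whether it is a '#'-prefixed
// directive line and build the diagnostic message text for it.
// Returns None for non-directive lines.
fn directive_message(line: &str) -> Option<String> {
    let trimmed = line.trim_start();
    if !trimmed.starts_with('#') {
        return None; // first non-whitespace character is not '#'
    }
    // Extract the directive name after '#', tolerating "# include" spacing.
    let name: String = trimmed[1..]
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
        .collect();
    // Recognizable directives get a specific message (illustrative list).
    let known = ["include", "define", "ifdef", "ifndef", "if", "else", "endif", "undef", "pragma"];
    if known.contains(&name.as_str()) {
        Some(format!("preprocessing directive '#{}' is unsupported", name))
    } else {
        Some("preprocessing directives are unsupported".to_string())
    }
}
```

The caller would emit the diagnostic, then consume the rest of the line so lexing continues with the following source line.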
The AnalyzedExprKind::Paren variant shall hold an owned inner expression, not a clone of a separately stored expression. The semantic model shall avoid cloning expression trees for parenthesized expressions.
Paren(Box<AnalyzedExpr>)
The inner expression is moved into the Paren wrapper. No separate copy exists.
Parenthesized expressions preserve the type and lvalue status of the inner expression exactly. This is unchanged from the existing rule (§12.5).
IR lowering already erases parentheses by recursing through Paren nodes. The semantic model representation change does not affect lowered IR or runtime behavior.
Scope binding entries created during semantic analysis persist for the duration of analysis. There is no requirement to deallocate or "pop" binding entries when leaving a scope.
Because ScopeId values are unique and monotonically assigned, stale binding entries from exited scopes are unreachable through the lookup chain (which walks parent links from the current scope). Retaining them is harmless and simplifies the analyzer.
This amendment makes the existing behavior an explicit design choice rather than an accidental omission.
The lookup procedure (§6.1.2, §12.3) shall never consult a scope that is not an ancestor of the current scope. This invariant is enforced by walking the ScopeTree parent chain and is independent of whether binding entries for unrelated scopes exist.
break and continue are supported statements in version 1.
break_stmt ::= "break" ";"
continue_stmt ::= "continue" ";"
break and continue are added to the supported keyword set (§5.3).
break immediately exits the innermost enclosing while or for loop. continue skips the remainder of the current iteration and proceeds to the loop's condition re-evaluation (for while) or step expression followed by condition re-evaluation (for for).
Both are semantic errors if used outside a loop body.
- `NPP3010`: `break` or `continue` used outside of a loop body.
The executable IR includes Break(Span) and Continue(Span) statement variants. The interpreter's execution flow enum includes Break and Continue variants alongside Normal and Return.
break and continue interact with for and while loops. They do not interact with if or compound statements beyond propagating through them. A break or continue that escapes a function body is an internal error.
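The flow enum and its interaction with a `while` loop might be sketched as follows; the state-threading closure style is illustrative, not the required interpreter shape:

```rust
// Execution-flow result of running a statement or block.
#[derive(Debug, PartialEq)]
enum Flow {
    Normal,
    Break,
    Continue,
    Return(i64), // simplified: real code would carry a Value
}

// Sketch of while-loop execution: the loop consumes Break and Continue
// and propagates Return to the enclosing function.
fn exec_while<S>(
    state: &mut S,
    cond: impl Fn(&S) -> bool,
    body: impl Fn(&mut S) -> Flow,
) -> Flow {
    while cond(state) {
        match body(state) {
            // For while, both Normal and Continue re-test the condition.
            Flow::Normal | Flow::Continue => {}
            // Break exits the loop; the loop statement completes normally.
            Flow::Break => return Flow::Normal,
            // Return propagates past the loop to the function boundary.
            ret @ Flow::Return(_) => return ret,
        }
    }
    Flow::Normal
}
```

A `for` loop differs only in running the step expression before re-testing on `Continue`, matching the semantics stated above.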
A function that is declared but never defined and never called is not an error. The semantic analyzer shall only emit an error for a declared-but-undefined function when it is referenced in a call expression.
This matches C++ behavior where forward declarations without definitions are permitted as long as no definition is required by the linker. Since NotPlusPlus has no separate compilation, the analogue is call-site usage.
No diagnostic is emitted for unused forward declarations. The existing NPP3012 diagnostic ("function '...' is declared but never defined") is emitted only when a call to such a function is encountered.