Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ build/
# Language specific
__pycache__/
*.pyc
.venv/
node_modules/
target/ # Rust
vendor/ # Go
Expand Down
245 changes: 245 additions & 0 deletions docs/reports/PRODUCT_REPORT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,245 @@
# Betti‑RDL Validation & Product Report

Date: 2025‑12‑15

## Executive summary

Betti‑RDL is presented as a deterministic, event‑driven runtime that maps computation onto a fixed 3‑torus lattice to avoid stack growth (“recursion as replacement”) and to enable highly parallel workloads.

In this repo’s current **prototype** implementation, the core “compute” kernel is built on STL containers (`std::priority_queue`, `std::unordered_map`) plus a global `operator new` hook used only for coarse memory accounting. As shipped, the design intent (bounded memory, parallel isolation) is compelling, but the implementation is not yet a strict, mechanically‑enforced O(1) allocator/scheduler.

This ticket validated:
- The C++ Release build and benchmark executables run successfully.
- The “Mega Demo” scenarios execute end‑to‑end with measurable throughput.
- Python (pybind11) and Node.js (N‑API) bindings compile and run end‑to‑end.
- Benchmark claims were compared against measured results on the provided VM.

All raw outputs are saved under `docs/reports/*.txt`.

## Test environment

See `docs/reports/env.txt`.

Highlights:
- CPU: Intel Xeon Platinum 8581C @ 2.10GHz
- Cores/threads available in VM: 3 (single thread per core)
- RAM: ~10 GiB

This is important for interpreting scaling claims that reference 16 threads.

## What was required to make benchmarks meaningful

During validation, two correctness issues were found that made published benchmark numbers misleading:

1. `run(max_events)` semantics in `BettiRDLCompute` / `BettiRDLKernel` were implemented as “run until total events_processed reaches max_events”, which caused repeated `run()` calls to do no work after the first batch.
2. The C API header used `size_t` without including `<stddef.h>` and the CMake project did not enable C, preventing the C API test from compiling on Linux.

These were fixed so that:
- `run(n)` processes up to **n additional** events.
- The deep recursion benchmark actually executes the requested number of steps.

## Objective 1 — Reproduce core benchmarks

### 1) Mega demo (“killer app” scenarios)
Command:
```bash
cd src/cpp_kernel
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./mega_demo
```
Raw output: `docs/reports/mega_demo.txt`

Measured results:

| Scenario | Claimed in README | Measured (this VM) | Notes |
|---|---:|---:|---|
| Logistics swarm (1,000,000 deliveries) | 2.4M deliveries/sec | 4.26M deliveries/sec (235ms) | Implemented as batched inject+run; measures event processing throughput more than a realistic routing model. |
| Silicon cortex (500,000 spikes) | 2.4M spikes/sec | 7.69M spikes/sec (65ms) | Batched inject+run; not a biophysically accurate SNN model yet. |
| Contagion (1,000,000 infection steps) | “0 bytes memory growth” | +24 bytes (1311076B → 1311100B) | Uses a single recursive chain to avoid queue growth; demonstrates “infinite steps without storing 1M events”. |

### 2) Stress test suite
Command:
```bash
./stress_test
```
Raw output: `docs/reports/stress_test.txt`

Measured results:

| Test | Measured result | Repo claim comparison |
|---|---:|---|
| Firehose throughput (5,000,000 events) | 35.7M events/sec (0.14s) | README claims 4.33M EPS peak; measured is higher on this VM, but the “compute” per event is still lightweight. |
| Deep Dive recursion (100,000 dependent events) | 100,000 events processed; +380 bytes net tracked | README claims “0 bytes growth” at scale; this prototype shows small fixed overhead. The memory tracker is not OS RSS; it is a global counter in `Allocator.h`. |
| Swarm (16 threads × 100,000 events) | 133M EPS aggregate (time rounded to 0.01s) | This VM has 3 cores; 16 threads is oversubscribed. Also output interleaves across threads. |

### 3) Parallel scaling efficiency
Command:
```bash
./parallel_scaling_test_v2
```
Raw output: `docs/reports/parallel_scaling_test.txt`

Measured results (1,000,000 events per instance):

| Instances | Throughput (EPS) | Speedup | Efficiency |
|---:|---:|---:|---:|
| 1 | 12.96M | 1.00x | 100% |
| 2 | 24.48M | 1.89x | 94% |
| 4 | 28.98M | 2.24x | 56% |
| 8 | 24.50M | 1.89x | 24% |
| 16 | 12.37M | 0.95x | 6% |

Interpretation:
- Scaling is close to linear up to the **available core count** (here: ~2× is good on a 3‑core VM).
- Above that, oversubscription dominates and throughput falls.
- The current implementation also relies on STL containers and a global allocator hook (`g_memory_used`) that is **not thread‑safe**, which can distort parallel measurements and must be addressed before making strong scaling claims.

## Objective 2 — Test language bindings

### Python (pybind11)
Steps executed:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e python
python python/example.py
```
Raw output: `docs/reports/python_example.txt`

Status: Works end‑to‑end (spawn, inject, run, read counters).

Limitations observed:
- The Python binding compiles C++ sources directly and does not link against the built `libbetti_rdl_c.so`; packaging/versioning across languages will be harder until a single shared core library is used.
- The prototype overrides global `operator new` (via `Allocator.h`) inside the extension module, which is risky in real Python processes.

### Node.js (N‑API)
Steps executed:
```bash
cd nodejs
npm install
node example.js
```
Raw output: `docs/reports/node_example.txt`

Status: Works end‑to‑end (spawn, inject, run, read counters).

Limitations observed:
- Like Python, the addon compiles C++ directly rather than consuming a stable C ABI library.
- Native addon distribution requires toolchains per platform (typical for N‑API addons but relevant for product packaging).

## Objective 3 — Product angle evaluation

### 1) Agent‑Based Simulation (drones, logistics, trading)
**Strengths**
- Deterministic discrete‑event execution is a strong fit for ABM.
- The contagion demo pattern (drive many steps from a small state footprint) is useful for “simulate huge populations without materializing all agents”, if generalized.

**Realistic use cases**
- Epidemic spread where most agents are homogeneous and can be represented as counters/compartments.
- Logistics / order routing / inventory flow models where event scheduling dominates.
- Market microstructure simulations where determinism and reproducibility matter.

**Performance characteristics**
- Very high single‑instance event throughput in this prototype (tens of M EPS).
- Scaling is good up to available cores; beyond that, oversubscription and current implementation details reduce efficiency.

**Competitive context**
- Many established ABM frameworks exist (Mesa, Repast, MASON, GAMA, AnyLogic, FLAME GPU).
- Differentiation must be: (1) determinism, (2) bounded‑memory recursion/event processing, (3) “fast enough in Python” via a C++ core.

**Challenges / limitations**
- The current data structures are STL‑based and do not enforce bounded memory.
- ToroidalSpace uses a string key map, which is not suitable for a performance‑critical core.

**Feasibility**: High (as a library/runtime for simulation).

### 2) Neuromorphic AI / SNNs
**Strengths**
- Event‑driven runtimes map naturally to spike processing.

**Realistic use cases**
- Research simulators, small‑to‑medium networks, event‑driven inference.

**Competitive context**
- Strong incumbents: Brian2, Nengo, Norse, Lava, SpikingJelly/snnTorch.

**Challenges**
- Needs real neuron/synapse models, plasticity rules, GPU/vectorization, and interoperability with ML tooling.

**Feasibility**: Medium (longer R&D cycle).

### 3) Serverless backend (Node.js, Python services)
**Strengths**
- Determinism and bounded memory are attractive in multi‑tenant environments.

**Competitive context**
- Extremely competitive: V8 isolates, WASM runtimes (Wasmtime), Cloudflare Workers, AWS Lambda, etc.

**Challenges**
- Requires sandboxing, isolation, billing/metering, multi‑tenant scheduling, security hardening, observability.

**Feasibility**: Low in the short term.

### 4) Scientific computing (massive recursion / recursive algorithms)
**Strengths**
- The “Deep Dive” pattern is a clear wedge: run extremely deep iterative/recursive workflows without stack growth.

**Realistic use cases**
- Backtracking search, constraint solving, symbolic execution, tree/graph traversal with bounded memory.
- Deterministic replayable simulations for research.

**Competitive context**
- Many languages mitigate recursion via TCO/trampolines, but general “bounded memory recursion runtime” is uncommon as a drop‑in library.

**Challenges**
- Must prove correctness on real algorithms (DFS, SAT‑like workloads) and provide ergonomic APIs.

**Feasibility**: Medium‑high (library product, but needs a clearer API and examples).

## Primary recommendation

**Primary product angle: Agent‑based / discrete‑event simulation core (Python‑first), positioned as a deterministic high‑throughput event engine with bounded‑memory execution patterns.**

Why this is the best immediate opportunity:
- Fastest time‑to‑market: the demos and bindings already point in this direction.
- Clear buyer/user: simulation engineers, researchers, ops/logistics analysts.
- Value proposition is easy to communicate: reproducibility + high event throughput + bounded memory patterns.
- Lower competitive risk than “serverless platform”; more direct than “neuromorphic AI” which requires heavy domain R&D.

## Secondary recommendations

1. **Scientific recursion/search kernel** as a specialized library layer on top of the same runtime (DFS/backtracking examples, constraint solving).
2. **Neuromorphic/SNN simulation** as a longer‑term vertical once the core scheduling/allocator story is hardened.

## Technical debt / improvements needed (to support the recommendation)

Highest‑impact items:
1. Replace STL containers in the hot path with bounded / preallocated structures (ring buffers, fixed heaps) and/or `std::pmr` backed by a custom arena.
2. Remove or isolate the global `operator new` override; make memory tracking thread‑safe and measure RSS/peak RSS in benchmarks.
3. Make the kernel thread‑safe (or explicitly single‑threaded) and provide a clear concurrency model.
4. Replace `ToroidalSpace` string keys with a flat index (`idx = x + W*(y + H*z)`) and fixed arrays.
5. Provide benchmark CLI options (event counts, thread counts) and report percentile latencies, not just average EPS.
6. Unify bindings around the C API shared library (`libbetti_rdl_c`) so Python/Node/Rust/Go all consume the same core binary.

## Suggested next steps

1. Create a “benchmark harness” executable that runs:
- throughput, latency percentiles, memory peak
- scaling tests up to physical core count
2. Implement a real ABM reference model (e.g., SIR epidemic with parameter sweeps) and publish reproducible results.
3. Package Python wheels (manylinux) and prebuilt Node binaries for key platforms.
4. Add CI tests that run:
- `stress_test` at smaller sizes
- Python and Node example smoke tests

---

### Appendix: raw outputs
- `docs/reports/env.txt`
- `docs/reports/mega_demo.txt`
- `docs/reports/stress_test.txt`
- `docs/reports/parallel_scaling_test.txt`
- `docs/reports/python_example.txt`
- `docs/reports/node_example.txt`
47 changes: 47 additions & 0 deletions docs/reports/env.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
Linux engine-0e638352-d8c0-4f3b-9c39-03ec4ee91cbf-66d97b68-xm9kh 6.12.60 #1 SMP Thu Dec 4 16:27:11 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 3
On-line CPU(s) list: 0-2
Vendor ID: GenuineIntel
Model name: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz
CPU family: 6
Model: 207
Thread(s) per core: 1
Core(s) per socket: 3
Socket(s): 1
Stepping: 2
BogoMIPS: 4200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk avx512_fp16 arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 96 KiB (2 instances)
L1i cache: 64 KiB (2 instances)
L2 cache: 4 MiB (2 instances)
L3 cache: 260 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-2
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Vulnerable; BHI: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Not affected

total used free shared buff/cache available
Mem: 9.7Gi 767Mi 8.8Gi 10Mi 244Mi 8.9Gi
Swap: 0B 0B 0B
42 changes: 42 additions & 0 deletions docs/reports/mega_demo.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
Betti-RDL Scale Demos
Simulating massive agent-based workloads...

=================================================
DEMO 1: LOGISTICS SWARM (Smart City)
=================================================
Scenario: 1000000 autonomous drones delivering packages.
Goal: Route around congestion using adaptive RDL delays.
[Metal] ToroidalSpace <32x32x32> Init.
[COMPUTE] Initializing Betti-RDL with real computation...
[SETUP] Initializing 32x32x32 city grid...
[ACTION] Deploying 1000000 drones...
[RESULT] All packages delivered in 235ms.
[METRIC] 4.25532e+06 Deliveries/Sec
[STATUS] Network adapted to congestion continuously.

=================================================
DEMO 2: SILICON CORTEX (Spiking Neural Net)
=================================================
Scenario: 32768 neurons in a 3D lattice.
Goal: Process sensory input spikes via Hebbian learning.
[Metal] ToroidalSpace <32x32x32> Init.
[COMPUTE] Initializing Betti-RDL with real computation...
[SETUP] Growing neural lattice...
[ACTION] Injecting 500000 sensory spikes...
[RESULT] Cortex processed sensory stream in 65ms.
[METRIC] 7.69231e+06 Spikes/Sec
[STATUS] O(1) Memory maintained despite massive firing cascade.

=================================================
DEMO 3: GLOBAL CONTAGION (Patient Zero)
=================================================
Scenario: 1000000 people interacting in tight network.
Goal: Track recursive virus spread without memory explosion.
[Metal] ToroidalSpace <32x32x32> Init.
[COMPUTE] Initializing Betti-RDL with real computation...
[SETUP] Populating world...
[ACTION] Patient Zero infected. Spreading...
[RESULT] Virus spread to 1000000 hosts in 11ms.
[METRIC] 9.09091e+07 Infection-Steps/Sec
[MEMORY] Start: 1311076B -> End: 1311100B
[STATUS] Zero memory growth observed during recursive spread.
23 changes: 23 additions & 0 deletions docs/reports/node_example.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
==================================================
BETTI-RDL NODE.JS EXAMPLE
==================================================

[SETUP] Creating Betti-RDL kernel...
[Metal] ToroidalSpace <32x32x32> Init.
[COMPUTE] Initializing Betti-RDL with real computation...
[SETUP] Spawning 10 processes...
[INJECT] Sending events with values 1, 2, 3...

[COMPUTE] Running distributed counter...

[RESULTS]
Events processed: 3
Current time: 0
Active processes: 10

[VALIDATION]
[OK] O(1) memory maintained
[OK] Real computation performed
[OK] Deterministic execution

==================================================
Loading