Route AI training data like network packets. Zero-copy. GPU-direct. 5.55× faster training throughput.
Training AI models wastes GPU cycles on data plumbing. Before a single gradient is computed, your data passes through 6 software layers: JSON parsing → tokenization → padding → tensor creation → device transfer → batching. Physical network routers move packets at line speed using binary headers and CRC checksums. What if we applied the same router protocol design to AI data pipelines?
We built NDP and tested it. It works.
```
TRADITIONAL (6 layers): Disk (JSON) → json.loads → tokenizer.encode → pad/truncate → torch.tensor → .cuda() → GPU
NDP (3 layers):         Disk (.qubgpu, mmap) → struct.unpack header → GPU tensor
```
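The 3-layer NDP path can be sketched in a few lines of Python. The file layout used here (a magic byte plus a uint32 record count, then fixed-length big-endian int32 token rows) is an illustrative assumption for the sketch, not the actual `.qubgpu` format defined in `protocol/qubgpu.py`:

```python
import mmap
import struct

import numpy as np

MAGIC = 0xAA  # start byte, mirroring the NDP packet header


def write_toy_file(path, tokens):
    """Write a toy token file: magic byte, uint32 record count, then
    big-endian int32 tokens. Assumed layout for illustration only."""
    count, seq_len = tokens.shape
    with open(path, "wb") as f:
        f.write(struct.pack(">BI", MAGIC, count))
        f.write(np.asarray(tokens, dtype=">i4").tobytes())


def load_zero_copy(path, seq_len):
    """The 3-layer path: mmap the file, unpack a tiny binary header,
    and view the mapped bytes as a token matrix -- no JSON, no tokenizer."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic, count = struct.unpack_from(">BI", mm, 0)
    if magic != MAGIC:
        raise ValueError("not a token file")
    # View the mapped region in place; no per-sample Python objects are built.
    tokens = np.frombuffer(mm, dtype=">i4", offset=5, count=count * seq_len)
    # torch.from_numpy(tokens.astype(np.int32)).cuda() would complete the path;
    # the big-endian -> native-order cast is the only copy made.
    return tokens.reshape(count, seq_len)
```

Because the tokens are already padded and encoded on disk, the per-step cost collapses to a header read and a device transfer.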
Identical conditions: Same model (Qwen2.5-Coder-1.5B), same data, same hyperparameters, same hardware. Only the data pipeline changes.
| Metric | Traditional JSONL | NDP Zero-Copy | Speedup |
|---|---|---|---|
| Samples/sec | 799 | 2,390 | 3.0× |
| Tokens/sec | 90,591 | 1,221,702 | 13.5× |
| Load Time | 0.50s | 0.17s | 2.9× |

End-to-end training run (200 steps per pipeline, identical conditions):

| Metric | JSONL (tokenize on-the-fly) | NDP (pre-tokenized mmap) | NDP + CRC-Drop 5% |
|---|---|---|---|
| Throughput | 454 tok/s | 2,518 tok/s | 2,698 tok/s |
| Final Loss | 1.2439 | 0.6627 | 0.6123 |
| Training Time | 737.1s | 650.8s | 605.5s |
| Data I/O % | 0.6% | 0.3% | 0.4% |
| Model Quality | 0.967 | 1.000 | 1.000 |
- 5.55× training throughput — NDP sustains 2,518 tok/s vs 454 tok/s for JSONL (2,698 tok/s, or 5.9×, with CRC-Drop)
- ~50% lower final loss — 0.6627 for NDP and 0.6123 for NDP + CRC-Drop vs 1.2439 for JSONL, in the same number of steps
- 12–18% faster wall-clock time — NDP finishes 200 steps in 651s (605s with CRC-Drop) vs 737s for JSONL
- Perfect model quality — the NDP-trained model scores 1.000 on all 6 coding benchmarks
| Model | Quality | Avg Response Time | Speed |
|---|---|---|---|
| Base (no adapter) | 0.933 | 17.3s | 6.5 tok/s |
| JSONL-trained (200 steps) | 1.000 | 30.7s | 2.7 tok/s |
| NDP-trained (200 steps) | 1.000 | 17.7s | 3.0 tok/s |
| NDP+CRC-trained (200 steps) | 0.950 | 16.0s | 3.2 tok/s |
Dropping 5% of packets with corrupted CRC checksums during training acts as natural data augmentation, similar to dropout regularization — but operating on the input data stream rather than neural activations. This achieves the lowest loss (0.6123) and fastest throughput (2,698 tok/s) simultaneously.
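The mechanism can be sketched as a filter over the packet stream. `crc16_ccitt` below is a plain bitwise CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF), and `iter_valid_packets` is a hypothetical helper written for this sketch, not the library's API:

```python
import random


def crc16_ccitt(data: bytes, poly: int = 0x1021, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16/CCITT-FALSE, the checksum family named in the packet spec."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def iter_valid_packets(packets, drop_rate=0.05, rng=random):
    """Yield (payload, crc) pairs that pass the checksum, and additionally
    drop a random drop_rate fraction -- input-stream regularization,
    analogous to dropout but applied before the model sees the data."""
    for payload, crc in packets:
        if crc16_ccitt(payload) != crc:
            continue  # genuinely corrupted: silently dropped, like UDP
        if rng.random() < drop_rate:
            continue  # intentional drop: the CRC-Drop augmentation
        yield payload, crc
```

Each epoch the model sees a slightly different 95% subsample of the stream, which is where the dropout-like regularization effect comes from.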
```bash
git clone https://github.com/qubitpage/qubgpu.git
cd qubgpu
pip install -r requirements.txt
```

Convert a JSONL dataset to the .qubgpu format:

```bash
python training/convert.py \
    data/my_dataset.jsonl \
    data/my_dataset.qubgpu \
    --tokenizer Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --max-tokens 512
```

Load it with the zero-copy mmap dataset:

```python
from training.dataloader import QubGPUDataset
from torch.utils.data import DataLoader

dataset = QubGPUDataset(
    "data/my_dataset.qubgpu",
    max_seq_length=512,
    pad_token_id=0,
    crc_drop_rate=0.05,  # Enable CRC-Drop augmentation
)
loader = DataLoader(dataset, batch_size=2, pin_memory=True)

for batch in loader:
    input_ids = batch["input_ids"].cuda()
    labels = batch["labels"].cuda()
    # ... your training loop
```

Start the web dashboard:

```bash
python -m uvicorn api.server:app --host 0.0.0.0 --port 8080
# Open http://localhost:8080
```

Run the full A/B benchmark:

```bash
python benchmarks/real_ab_training.py
# Trains 200 steps × 3 pipelines, measures everything, saves results
```

Like Ethernet frames but for neural data:
```
┌──────┬──────┬──────┬──────┬────────┬──────────────┬────────┐
│ 0xAA │ SRC  │ DST  │ TYP  │ LEN    │ PAYLOAD      │ CRC-16 │
│ 1B   │ 1B   │ 1B   │ 1B   │ 2B BE  │ variable     │ 2B BE  │
└──────┴──────┴──────┴──────┴────────┴──────────────┴────────┘
```

- SRC: Source type (TEXT_CORPUS=0x01, CODE_REPO=0x02, GRADIENT=0x05, ...)
- DST: Target neural layer (EMBEDDING=0x01, ATTENTION=0x02, FFN=0x03, LORA=0x05, ...)
- TYP: Payload encoding (TOKEN_IDS=0x01, FLOAT16=0x03, BFLOAT16=0x06, ...)
- CRC: CRC-16/CCITT — corrupted packets are silently dropped (like UDP)
Fixed overhead: 8 bytes per packet. Token storage: int32 big-endian (supports 150K+ vocabularies).
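The 8 bytes of fixed overhead map directly onto Python's `struct` module. This sketch frames and parses one packet per the diagram; whether the CRC covers only the payload or the header as well is not stated here, so the checksum is taken as an opaque value:

```python
import struct

# 6-byte header + 2-byte CRC trailer = the 8 bytes of fixed overhead per packet.
HEADER = struct.Struct(">BBBBH")  # magic 0xAA, SRC, DST, TYP, LEN (big-endian)
TRAILER = struct.Struct(">H")     # CRC-16 (big-endian)


def pack_packet(src: int, dst: int, typ: int, payload: bytes, crc: int) -> bytes:
    """Frame a payload: header, payload bytes, then the CRC-16 trailer."""
    return HEADER.pack(0xAA, src, dst, typ, len(payload)) + payload + TRAILER.pack(crc)


def unpack_packet(frame: bytes):
    """Parse one frame back into its fields, validating the start byte."""
    magic, src, dst, typ, length = HEADER.unpack_from(frame, 0)
    if magic != 0xAA:
        raise ValueError("missing 0xAA start byte")
    payload = frame[HEADER.size:HEADER.size + length]
    (crc,) = TRAILER.unpack_from(frame, HEADER.size + length)
    return src, dst, typ, payload, crc
```

The 2-byte big-endian LEN field caps a single payload at 65,535 bytes, which at 4 bytes per int32 token is over 16K tokens per packet.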
```
qubgpu/
├── protocol/                        # Core binary protocol
│   ├── qubgpu.py                    # KnowledgePacket, QubGPUFile, CRC-16
│   └── ndp_v2.py                    # 5-layer protocol stack (NDP v2)
├── training/                        # Training pipeline
│   ├── convert.py                   # JSONL → .qubgpu converter
│   ├── dataloader.py                # Zero-copy mmap PyTorch Dataset
│   └── download_datasets.py         # HuggingFace dataset downloader
├── engine/                          # Neural Router engine
│   └── neural_router.py             # Direct weight injection (Hopfield, ROME)
├── api/                             # FastAPI backend (22+ endpoints)
│   └── server.py
├── web/                             # Web dashboard SPA
│   ├── index.html                   # 9-page dashboard
│   └── whitepaper.html
├── benchmarks/                      # Real benchmark results
│   ├── real_ab_training.py          # A/B training comparison script
│   ├── real_benchmark_results.json  # Training benchmark data
│   └── chat_model_comparison.json   # Chat inference comparison
└── tests/
    ├── test_protocol.py             # 42 protocol tests
    └── test_chat_models.py          # Chat model comparison test
```
```bash
python tests/test_protocol.py          # 42 protocol tests
python benchmarks/real_ab_training.py  # Full A/B benchmark on GPU
python tests/test_chat_models.py       # Chat quality comparison
```

Full setup from scratch:

```bash
git clone https://github.com/qubitpage/qubgpu.git && cd qubgpu
pip install -r requirements.txt
mkdir -p models datasets logs benchmarks

# Download datasets
python training/download_datasets.py

# Convert to NDP format
python training/convert.py datasets/large_coding_dataset.jsonl datasets/large_coding_dataset.qubgpu

# Start platform
python -m uvicorn api.server:app --host 0.0.0.0 --port 8080
```

qubgpu.qubitpage.com — Chat, benchmarks, protocol lab, whitepaper
Full technical paper: NDP Whitepaper
MIT License — see LICENSE
QubitPage Research — qubitpage.com