Core serialization library for crowd-pilot's IDE interaction data. Converts IDE events (tab switches, edits, terminal commands, etc.) into conversation format for training language models.
This is a Rust library with:
- Node.js/TypeScript bindings (via napi-rs) - for the VS Code extension (uses character approximation for token counting)
- CLI binary - for batch preprocessing (uses HuggingFace tokenizer via embedded Python)
The serialization logic is the single source of truth, ensuring consistency between runtime inference and training data preprocessing.
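The character approximation used on the Node.js side can be pictured with a minimal sketch. The ratio of 4 characters per token and the names `ASSUMED_CHARS_PER_TOKEN`, `approximateTokenCount`, and `fitsTokenBudget` are assumptions made for illustration; this README does not state the ratio the library actually uses.

```typescript
// Hypothetical sketch: estimate token counts from character length.
// ASSUMPTION: about 4 characters per token. The real ratio is not documented here.
const ASSUMED_CHARS_PER_TOKEN = 4;

// Round up so that any non-empty text counts as at least one token.
function approximateTokenCount(text: string): number {
  return Math.ceil(text.length / ASSUMED_CHARS_PER_TOKEN);
}

// Check a piece of text against a budget such as maxTokensPerMessage.
function fitsTokenBudget(text: string, maxTokens: number): boolean {
  return approximateTokenCount(text) <= maxTokens;
}
```

An estimate like this avoids loading a tokenizer inside the editor, at the cost of drifting from the exact counts the CLI gets from the HuggingFace tokenizer.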
- `crates/core` - Core serialization logic
- `crates/napi` - Node.js bindings (`@crowd-pilot/serializer` npm package)
- `crates/cli` - CLI binary for preprocessing (`crowd-pilot-serialize`)
- Rust 1.70+
- Node.js 18+ (for napi bindings)
- Python 3.9+ with `transformers` installed (for CLI tokenizer)
cargo build --release

cd crates/napi
npm install
npm run build

cargo build --release -p crowd-pilot-serialize

import { ConversationStateManager } from '@crowd-pilot/serializer';
const manager = new ConversationStateManager({
viewportRadius: 10,
coalesceRadius: 5,
maxTokensPerMessage: 2048,
maxTokensPerTerminalOutput: 256,
});
manager.handleTabEvent('/path/to/file.ts', 'file contents...');
manager.handleContentEvent('/path/to/file.ts', 10, 0, 'inserted text');
const messages = manager.finalizeForModel();

crowd-pilot-serialize \
--output-format sed \
--csv-root ./data/sessions \
--output-dir ./output \
--tokenizer "Qwen/Qwen2-7B" \
--chat-template qwen3 \
--max-tokens-per-conversation 8192 \
--max-tokens-per-message 2048 \
--val-ratio 0.1

This uses shared CSV coalescing and outputs `training.jsonl` and `validation.jsonl` in SED conversation format.
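Because the validation ratio is a fraction of sessions, whole sessions land on one side of the split. A minimal sketch of such a split follows; the selection rule (sort the session IDs, take the first rounded share as validation) and the names `SessionSplit` and `splitSessions` are assumptions, since this README does not describe how the CLI picks validation sessions.

```typescript
// Hypothetical sketch of a session-level train/validation split.
// ASSUMPTION: sessions are sorted by ID and the first share goes to validation.
interface SessionSplit {
  training: string[];
  validation: string[];
}

function splitSessions(sessionIds: string[], valRatio: number): SessionSplit {
  if (valRatio < 0 || valRatio > 1) {
    throw new RangeError("valRatio must be between 0 and 1");
  }
  // Sort a copy so the split is deterministic and the input is not mutated.
  const sorted = [...sessionIds].sort();
  const valCount = Math.round(sorted.length * valRatio);
  return {
    validation: sorted.slice(0, valCount),
    training: sorted.slice(valCount),
  };
}
```

Splitting by session rather than by example keeps every conversation from one recording in the same file, so validation data never overlaps with a training session.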
crowd-pilot-serialize \
--output-format zeta \
--csv-root ./data/sessions \
--output-dir ./output-zeta \
--tokenizer "Qwen/Qwen2-7B" \
--chat-template qwen3 \
--zeta-max-editable-tokens 180 \
--zeta-max-context-tokens 350 \
--zeta-diff-context-lines 3

This uses shared CSV coalescing and converts each CSV transition into a Zeta training example.
crowd-pilot-serialize \
--output-format sweep \
--csv-root ./data/sessions \
--output-dir ./output-sweep \
--tokenizer "Qwen/Qwen2-7B" \
--chat-template qwen3 \
--sweep-viewport-lines 21 \
--sweep-opened-file-context full \
--sweep-history-center changed

This uses shared CSV coalescing and converts each CSV transition into a Sweep-style next-edit training example.
Sweep applies a hard --max-tokens-per-conversation budget by trimming oldest history first; if even zero-history context does not fit, the sample is dropped.
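The trimming rule above can be sketched as follows. The function name `trimHistoryToBudget`, its parameters, and the way token costs are supplied are hypothetical; only the policy (drop oldest history first, drop the sample if the zero-history context is already over budget) comes from the description.

```typescript
// Hypothetical sketch of the Sweep budget policy described above.
// Returns the history entries that survive, or null if the sample is dropped.
function trimHistoryToBudget<T>(
  baseTokens: number,                 // tokens of the context with no history
  history: T[],                       // ordered oldest first
  countTokens: (entry: T) => number,  // caller-supplied token counter
  maxTokens: number,
): T[] | null {
  // Even the zero-history context does not fit: drop the sample.
  if (baseTokens > maxTokens) return null;

  const costs = history.map(countTokens);
  let total = baseTokens;
  for (const cost of costs) total += cost;

  // Remove entries from the oldest end until the total fits.
  let start = 0;
  for (const cost of costs) {
    if (total <= maxTokens) break;
    total -= cost;
    start += 1;
  }
  return history.slice(start);
}
```

For instance, with a 50-token base, history costs of 30, 20, and 10 tokens, and a budget of 85, only the oldest 30-token entry is removed.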
| Option | Default | Description |
|---|---|---|
| `--output-format` | `sed` | Output format: sed, zeta, or sweep |
| `--csv-root` | required | Root directory containing per-session CSV files |
| `--output-dir` | required | Output directory for JSONL files |
| `--tokenizer` | required | HuggingFace tokenizer name or path |
| `--chat-template` | required | Chat template: qwen3 or glm45 |
| `--max-tokens-per-conversation` | 8192 | Maximum tokens per conversation chunk (SED/Sweep) |
| `--max-tokens-per-message` | 2048 | Maximum tokens per message (SED only) |
| `--min-conversation-messages` | 5 | Minimum messages to keep a conversation (SED only) |
| `--viewport-radius` | 10 | Lines above/below cursor to show (SED only) |
| `--system-prompt` | format default | Custom system prompt (SED/Sweep only) |
| `--coalesce-radius` | 5 | Radius for grouping nearby edits |
| `--val-ratio` | 0.10 | Fraction of sessions for validation |
| `--zeta-max-editable-tokens` | 180 | Editable-region token budget (Zeta only) |
| `--zeta-max-context-tokens` | 350 | Context-region token budget (Zeta only) |
| `--zeta-diff-context-lines` | 3 | Unified-diff context lines (Zeta only) |
| `--sweep-viewport-lines` | 21 | Fixed viewport lines for target/history windows (Sweep only) |
| `--sweep-opened-file-context` | `full` | Opened-file context mode: full or viewport (Sweep only) |
| `--sweep-history-center` | `changed` | History viewport centering: changed or cursor (Sweep only) |
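To make the coalesce radius concrete, here is a minimal sketch of grouping nearby edits. The grouping rule (an edit joins the current group when it is in the same file and within the radius of the previous edit's line) and the names `EditEvent` and `coalesceEdits` are assumptions for illustration; the library's actual coalescing logic may differ.

```typescript
// Hypothetical sketch of radius-based edit coalescing.
// ASSUMPTION: distance is measured in lines from the previous edit in the group.
interface EditEvent {
  file: string;
  line: number;
}

function coalesceEdits(edits: EditEvent[], radius: number): EditEvent[][] {
  const groups: EditEvent[][] = [];
  let current: EditEvent[] = [];

  for (const edit of edits) {
    const last = current[current.length - 1];
    const isNearby =
      last !== undefined &&
      last.file === edit.file &&
      Math.abs(edit.line - last.line) <= radius;

    if (isNearby) {
      current.push(edit);
    } else {
      // Close the running group (if any) and start a new one.
      if (current.length > 0) groups.push(current);
      current = [edit];
    }
  }
  if (current.length > 0) groups.push(current);
  return groups;
}
```

With a radius of 5, edits on lines 10 and 12 of the same file would form one group, while a later edit on line 30 would start a new one.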
crowd-pilot-replay \
--session ./recordings \
--terminal-cols 120 \
--terminal-rows 30 \
--delay-ms 100

This replays crowd-code 2.0 recordings and renders editor and terminal viewports. The `--session` path must be either a single `source_part_*.tar.gz` file or a directory containing those `.tar.gz` parts. Terminal output is VT-rendered from terminal bytestream events when available; otherwise it falls back to the recorded terminal viewport lines.
Apache 2.0