Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
f700cd7
feat: add EvolveGenerator training entrypoint
joyemang33 Mar 1, 2026
03f8809
feat: modify max_advisor_context_iters
joyemang33 Mar 3, 2026
7a484dc
fix: add PYTHONPATH, n_samples_per_prompt=8, max_advisor_context_iter…
joyemang33 Mar 7, 2026
28f6090
fix: use 2 GPUs, Qwen3.5-9B local path, correct model name handling
joyemang33 Mar 7, 2026
29fd618
fix: add vllm tool/attention flags to match serving setup
joyemang33 Mar 7, 2026
21392e1
fix: change HTTP endpoint port to 8002 (8000 and 8001 occupied)
joyemang33 Mar 7, 2026
87c8ea3
fix: use 4 GPUs
joyemang33 Mar 7, 2026
3ed697e
fix: set UV_CACHE_DIR to /data/qmang/uv_cache
joyemang33 Mar 7, 2026
1c75eec
fix: redirect all outputs/caches to /data/qmang
joyemang33 Mar 7, 2026
8ba21ea
fix: use UV_PROJECT_ENVIRONMENT instead of symlink for venv on /data
joyemang33 Mar 7, 2026
1825858
update training script
joyemang33 Mar 7, 2026
04eb4e7
go
joyemang33 Mar 8, 2026
29a9dc6
111
joyemang33 Mar 8, 2026
56687fd
good
joyemang33 Mar 9, 2026
36d5570
feat: separate solver and advisor
joyemang33 Mar 9, 2026
b4e1a57
clean up unused things
CharlieFRuan Mar 9, 2026
e9d22e0
feat: add logging dir
joyemang33 Mar 9, 2026
e5ef2b5
remove jinja unused
CharlieFRuan Mar 9, 2026
f1f3819
Make script runnable with Qwen3 and non-QMang machine
CharlieFRuan Mar 9, 2026
38883bd
train-train-train
joyemang33 Mar 9, 2026
a78df21
feat:
joyemang33 Mar 9, 2026
bcfec44
update training script
joyemang33 Mar 10, 2026
a6a2f83
update
CharlieFRuan Mar 11, 2026
c09eb7a
runnable on b200 e2e
zwcolin Mar 11, 2026
19f1350
nonfrozen-training
joyemang33 Mar 12, 2026
972be5e
feat: H100 e2e training config with frozen GPT-5 solver and jemalloc/…
joyemang33 Mar 17, 2026
97fec32
[WIP] Set up Qwen3.5
CharlieFRuan Mar 19, 2026
4bfaa9e
more changes for qwen3.5
CharlieFRuan Mar 19, 2026
06ef790
remove claude debug things that broke weight sync
CharlieFRuan Mar 19, 2026
f9ab318
stepwise H100
joyemang33 Mar 21, 2026
d75ab45
log prompt
joyemang33 Mar 22, 2026
59123ef
trivial
joyemang33 Mar 24, 2026
0ef87df
Update script to use step-wise
CharlieFRuan Mar 24, 2026
9ecfa07
Charlie pass on knobs
joyemang33 Mar 24, 2026
396d825
add ckpt interval
joyemang33 Mar 24, 2026
b63bec2
reduce context length in RL training
joyemang33 Mar 25, 2026
c9ae082
finalize H100 recipe
joyemang33 Mar 26, 2026
eba3cae
update recipe
joyemang33 Mar 30, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
253 changes: 253 additions & 0 deletions SETUP_NOTES_0310_colin.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,253 @@
# Running SkyRL on hyperbolic-14

Machine: `hyperbolic-14` — 8x NVIDIA B200 (183GB each), NFS home at `/mnt/nfs/`, fast local storage at `/data/users/zwcolin/`.

## Key Issues & Solutions

### 1. NFS is too slow for everything

`/mnt/nfs/` has severe performance issues for I/O-heavy workloads (venv creation, package installs, model downloads). All project files, caches, and virtual environments should live on `/data/users/zwcolin/`.

**Project location:** `/data/users/zwcolin/med_agent/`

### 2. Cache directories must point to `/data`

All cache env vars in `~/.bashrc` are togglable via a `CACHE_STORAGE` variable:

```bash
# In ~/.bashrc — set to "local" for /data, "nfs" for /mnt
CACHE_STORAGE="local"
```

This controls `UV_CACHE_DIR`, `HF_HOME`, `TRANSFORMERS_CACHE`, `TMPDIR`, `TRITON_CACHE_DIR`, `FLASHINFER_CACHE_DIR`, and others. After changing, run `source ~/.bashrc`.

**Important:** New shell sessions spawned by Cursor do NOT source `.bashrc`. You must explicitly export all cache vars in any training script or command. See the `train_evolve.sh` zwcolin block for a complete list.

### 3. NFS does not support Unix domain sockets (Ray crash)

Ray creates Unix domain sockets in `TMPDIR`. If `TMPDIR` points to NFS, Ray crashes with:

```
Failed to connect to socket at address:/mnt/nfs/home/zwcolin/.cache_tmp/tmp/ray/session_.../sockets/raylet
```

**Fix:** `TMPDIR` and `RAY_TMPDIR` must point to a local filesystem:
```bash
export TMPDIR=/data/users/zwcolin/.cache/tmp
export RAY_TMPDIR=/data/users/zwcolin/.cache/tmp
```

### 4. B200 GPUs require CUDA 13.0 (cu130) stack

B200 (Blackwell) GPUs have new tensor core architectures. PyTorch/vLLM wheels compiled with CUDA 12.8 (cu128) cause cuBLAS errors:

```
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmEx(...)
```

This manifests during vLLM's model profile run at `qwen3_5.py:164 in_proj_qkvz(hidden_states)`.

**Fix:** All CUDA packages must use cu130 builds. In `pyproject.toml`:

```toml
[[tool.uv.index]]
name = "pytorch-cu130"
url = "https://download.pytorch.org/whl/cu130"
explicit = true

[[tool.uv.index]]
name = "flashinfer-cu130"
url = "https://flashinfer.ai/whl/cu130"
explicit = true

[tool.uv.sources]
torch = [
{ index = "pytorch-cu130", marker = "sys_platform == 'linux'" },
{ index = "pytorch-cpu", marker = "sys_platform == 'darwin'" },
]
flash-attn = { url = "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu130torch2.10-cp312-cp312-linux_x86_64.whl", marker = "sys_platform == 'linux'" }
flashinfer-jit-cache = { index = "flashinfer-cu130", marker = "sys_platform == 'linux'" }
```

Verify after install:
```bash
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Expected: 2.10.0+cu130 13.0
```

### 5. Other users' GPU processes

Other users may have long-running processes (e.g. `sglang::scheduler`) occupying GPU memory. Check with `nvidia-smi` before running. These processes are owned by other users and require `sudo` to kill:

```bash
sudo pkill -9 -f sglang
```

### 6. Script defaults point to NFS `$HOME`

`run_gsm8k.sh` and `run_gsm8k_qwen3_5.sh` default several paths to `$HOME/...` which resolves to `/mnt/nfs/home/zwcolin/`. Override them:

| Variable | Default (NFS) | Override (local) |
|---|---|---|
| `DATA_DIR` | `$HOME/data/gsm8k` | `/data/users/zwcolin/med_agent/data/gsm8k` |
| `trainer.ckpt_path` | `$HOME/ckpts/...` | `/data/users/zwcolin/med_agent/ckpts/...` |
| `trainer.export_path` | `$HOME/exports/` | `/data/users/zwcolin/med_agent/exports/` |

### 7. Wandb not configured

The script defaults to `LOGGER=wandb`, which requires `WANDB_API_KEY`. For quick testing, override with `LOGGER=console`.

---

## Qwen3.5 Support

### Required dependency changes (`pyproject.toml`)

The cherry-picked `update` commit bumps versions for Qwen3.5 compatibility. The full set of changes:

| Package | Before | After | Reason |
|---------|--------|-------|--------|
| vllm | 0.16.0 | 0.17.0 | Native Qwen3.5 model support |
| torch | 2.9.1 | 2.10.0 | Required by vllm 0.17.0 |
| transformers | >=4.56.1,<5 | >=5.3.0 | Native `qwen3_5` model type for FSDP |
| flashinfer-python | 0.6.3 | 0.6.4 | Required by vllm 0.17.0 |
| flashinfer-jit-cache | 0.6.3 | 0.6.4 | Matches flashinfer-python |
| accelerate | (unpinned) | >=1.13.0 | Fixes `_is_hf_initialized` TypeError with transformers 5.x |
| requires-python | >=3.11 | >=3.12,<3.13 | flash-attn wheel is cp312-only |

Both `fsdp` AND `megatron` extras must be updated (they each independently pin torch, vllm, flashinfer).

The `flashrl` extra also pins torch — must be bumped to 2.10.0 to match the cu130 index.

An override is needed to force transformers >=5.3.0 past vllm's cap:
```toml
# In override-dependencies:
"transformers>=5.3.0",
```

### flash-attn pre-built wheel

flash-attn has no official wheel for torch 2.10.0. Use community wheels from [mjun0812/flash-attention-prebuild-wheels](https://github.com/mjun0812/flash-attention-prebuild-wheels):

```toml
# In [tool.uv.sources]:
flash-attn = { url = "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu130torch2.10-cp312-cp312-linux_x86_64.whl", marker = "sys_platform == 'linux'" }
```

### flash-linear-attention (fla) — fast path for Qwen3.5

Qwen3.5 has hybrid attention: standard attention layers + **gated delta rule** linear attention layers. Without fast kernels, transformers falls back to pure-Python loops that are ~10x slower.

The fast path requires both:
1. **`flash-linear-attention`** (fla) — Triton kernels for `chunk_gated_delta_rule`
2. **`causal-conv1d`** — CUDA kernels for causal 1D convolution

**Installing flash-linear-attention:**
```bash
# Pure Python/Triton — installs cleanly
uv pip install flash-linear-attention
```

Add to `fsdp` extra in pyproject.toml:
```toml
"flash-linear-attention>=0.4.1; sys_platform == 'linux'",
```

**Installing causal-conv1d:**

`causal-conv1d` requires CUDA compilation (~8 min build). It was originally blocked by a `sys_platform == 'never'` override in `pyproject.toml` to suppress transitive resolution issues.

To enable it:
1. Change the override from `"causal-conv1d; sys_platform == 'never'"` to `"causal-conv1d>=1.4.0; sys_platform == 'linux'"`
2. Add to `fsdp` extra: `"causal-conv1d>=1.4.0; sys_platform == 'linux'"`
3. Run `uv lock && uv sync --extra fsdp` (the CUDA build takes ~8 minutes)

Verify fast path:
```python
python -c "from transformers.models.qwen3_5.modeling_qwen3_5 import is_fast_path_available; print(is_fast_path_available)"
# Expected: True
```

The `model_wrapper.py` will also log:
```
INFO | Qwen3.5 fast path is ENABLED (causal-conv1d + flash-linear-attention)
```

### Code patches (already applied in cherry-pick)

1. **`return_dict=False`** on `apply_chat_template(tokenize=True)` calls — transformers 5.x returns `BatchEncoding` instead of `list[int]`
2. **`vllm_worker.py`** — weight sync layer naming fix for `ForConditionalGeneration` → adds `language_model.` prefix
3. **`model_wrapper.py`** — Qwen3.5 monkey-patches:
- 3D `position_ids` fix (upstream: huggingface/transformers#44399)
- CPU tensor creation fix for `torch_chunk_gated_delta_rule` (commented out since fast path avoids it)

### Debug `import fla` in `main_base.py`

The cherry-picked commit left a debug `import fla` at the top of `main_base.py`. This was removed since `fla` is optional and causes `ModuleNotFoundError` if not installed.

---

## Running GSM8K with Qwen3.5

```bash
cd /data/users/zwcolin/med_agent/radiology/SkyRL

# Prepare data (one-time)
uv run python examples/train/gsm8k/gsm8k_dataset.py \
--output_dir /data/users/zwcolin/med_agent/data/gsm8k

# Run training (all env vars explicit — don't rely on .bashrc)
export UV_CACHE_DIR=/data/users/zwcolin/.cache/uv
export HF_HOME=/data/users/zwcolin/.cache/huggingface
export TMPDIR=/data/users/zwcolin/.cache/tmp
export RAY_TMPDIR=/data/users/zwcolin/.cache/tmp
export TRITON_CACHE_DIR=/data/users/zwcolin/.cache/triton
export FLASHINFER_CACHE_DIR=/data/users/zwcolin/.cache/flashinfer
export TORCH_EXTENSIONS_DIR=/data/users/zwcolin/.cache/torch_extensions
export DEEP_GEMM_CACHE_DIR=/data/users/zwcolin/.cache/deep_gemm

DATA_DIR=/data/users/zwcolin/med_agent/data/gsm8k \
NUM_GPUS=8 \
LOGGER=console \
VLLM_USE_V1=1 \
bash examples/train/gsm8k/run_gsm8k_qwen3_5.sh \
trainer.ckpt_path=/data/users/zwcolin/med_agent/ckpts/gsm8k_qwen3.5_2B_ckpt \
trainer.epochs=1
```

Key differences from the Qwen2.5 `run_gsm8k.sh`:
- Uses `Qwen/Qwen3.5-2B` model
- Sets `Qwen3_5DecoderLayer` for FSDP wrap policy
- Sets `language_model_only=true` (Qwen3.5 uses VL wrapper architecture in vLLM)
- `VLLM_USE_V1=1` works with cu130 stack (was broken with cu128)

---

## Running Evolve Training

The `train_evolve.sh` script has user-specific env var blocks. The zwcolin block (lines 90-105) exports all cache dirs to `/data/users/zwcolin/.cache/...`.

```bash
cd /data/users/zwcolin/med_agent/radiology
bash SkyRL/examples/train/evolve/train_evolve.sh
```

The script:
1. Starts a frozen solver vLLM server on GPUs 4-7
2. Waits for it to be healthy (up to 3 min)
3. Runs advisor FSDP training on GPUs 0-3 with colocated vLLM

---

## Earlier Results (Qwen2.5-1.5B GSM8K)

Training ran successfully with GRPO on Qwen2.5-1.5B-Instruct:

| Metric | Step 1 | Step 2 |
|---|---|---|
| `reward/avg_raw_reward` | 0.137 | 0.193 |
| `reward/avg_pass_at_2` | 0.250 | 0.345 |
| `policy/policy_kl` | 0.0004 | 0.0029 |
| `timing/step` | 46.5s | 22.7s |

Eval at step 0 (pre-training baseline): `pass_at_1 = 0.091`.
Binary file added docs/gsm8k_qwen3.5_2B_metrics.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
124 changes: 124 additions & 0 deletions docs/qwen3_5_fast_path_setup.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
# Qwen3.5 Fast Path Setup for FSDP Training

## Background

Qwen3.5 is a hybrid architecture with both standard attention layers and **linear attention** (gated delta rule) layers. The linear attention layers require two specialized libraries for fast forward and backward passes:

1. **`flash-linear-attention` (fla)** - Triton kernels for `chunk_gated_delta_rule` and `fused_recurrent_gated_delta_rule`
2. **`causal-conv1d`** - CUDA kernels for causal 1D convolution (`causal_conv1d_fn`, `causal_conv1d_bwd`)

Without both libraries, HuggingFace transformers falls back to pure PyTorch implementations (`torch_chunk_gated_delta_rule` and `F.silu(self.conv1d(...))`) which are **significantly slower** for both forward and backward passes. The check is at:

```python
# transformers/models/qwen3_5/modeling_qwen3_5.py:295
is_fast_path_available = all(
(causal_conv1d_fn, causal_conv1d_update, chunk_gated_delta_rule, fused_recurrent_gated_delta_rule)
)
```

## Current State

- `flash-linear-attention` is vendored at `./flash-linear-attention/` and properly wired as a uv source in `pyproject.toml`. It installs cleanly (pure Python/Triton, no CUDA compilation).
- `causal-conv1d` is vendored at `./causal-conv1d/` but is **blocked** from installation by an override in `pyproject.toml`:
```toml
"causal-conv1d; sys_platform == 'never'",
```
This was originally added to suppress transitive resolution from `flash-linear-attention[conv1d]`.

## The Problem with `causal-conv1d`

`causal-conv1d` requires CUDA compilation (it has `.cu` files in `csrc/`). This creates issues with uv + Ray:

1. **uv build isolation**: `causal-conv1d`'s `pyproject.toml` only declares `setuptools`, `wheel`, `torch` as build requirements, but with `match-runtime = true` for torch, uv can't resolve because `causal-conv1d` doesn't declare static metadata.

2. **`no-build-isolation`**: Works locally but fails on Ray workers because they create fresh venvs that lack setuptools.

3. **Local wheel path**: Ray's working directory packaging doesn't include the 160MB wheel file (excluded by `.gitignore` or size limits), so `path = "./dist/..."` fails on workers.

## Solution: Pre-built Wheel via URL

The correct approach (matching how `flash-attn` is handled) is to host a pre-built wheel and reference it by URL.

### Step 1: Build the wheel locally

```bash
# From repo root, with the venv that has torch installed:
uv build ./causal-conv1d --no-build-isolation --python .venv/bin/python --wheel --out-dir ./dist/
```

This produces `dist/causal_conv1d-1.6.1-cp312-cp312-linux_x86_64.whl` (~160MB).

### Step 2: Host the wheel

Upload to a URL accessible by Ray workers. Options:
- GitHub release (like flash-attn uses `mjun0812/flash-attention-prebuild-wheels`)
- S3/GCS bucket
- Any HTTP server accessible from the cluster

Example: if uploaded to `https://github.com/YOUR_ORG/prebuild-wheels/releases/download/v1.0/causal_conv1d-1.6.1-cp312-cp312-linux_x86_64.whl`

### Step 3: Update `pyproject.toml`

```toml
# In [project.optional-dependencies]
fsdp = [
# ... existing deps ...
"causal-conv1d>=1.4.0; sys_platform == 'linux'",
"flash-linear-attention>=0.4.2; sys_platform == 'linux'",
]

# In override-dependencies: REMOVE the sys_platform == 'never' line for causal-conv1d
# Replace with a real constraint:
"causal-conv1d>=1.4.0; sys_platform == 'linux'",

# In [tool.uv.sources]
causal-conv1d = { url = "https://YOUR_URL/causal_conv1d-1.6.1-cp312-cp312-linux_x86_64.whl", marker = "sys_platform == 'linux'" }
flash-linear-attention = { path = "./flash-linear-attention", editable = true }
```

### Step 4: Verify

```python
python -c "
from transformers.models.qwen3_5.modeling_qwen3_5 import is_fast_path_available
print('is_fast_path_available:', is_fast_path_available)
"
# Should print: is_fast_path_available: True
```

The model_wrapper.py logging (added in this branch) will also confirm:
```
INFO | Qwen3.5 fast path is ENABLED (causal-conv1d + flash-linear-attention)
```

## Alternative: Docker Image

For production, the cleanest approach is to include `causal-conv1d` in the Docker image:

```dockerfile
# In docker/Dockerfile, after CUDA toolkit is installed:
COPY causal-conv1d /tmp/causal-conv1d
RUN cd /tmp/causal-conv1d && pip install --no-build-isolation . && rm -rf /tmp/causal-conv1d
```

This avoids wheel hosting entirely since all Ray workers share the same Docker image.

## Quick Local-Only Test (no Ray workers)

If you just want to verify the fast path works locally (single-node, no Ray worker venvs):

```bash
# Install directly into the venv
uv pip install -e ./causal-conv1d --python .venv/bin/python --no-build-isolation

# Verify
.venv/bin/python -c "from transformers.models.qwen3_5.modeling_qwen3_5 import is_fast_path_available; print(is_fast_path_available)"
# True
```

## Notes

- `causal-conv1d` v1.6.1 does have a `CachedWheelsCommand` that auto-downloads prebuilt wheels from [Dao-AILab/causal-conv1d releases](https://github.com/Dao-AILab/causal-conv1d/releases), but as of March 2026, no wheels exist for torch 2.10.0 + CUDA 12.8.
- The `flash-linear-attention` fla package lists `causal-conv1d>=1.4.0` as an optional dependency under the `conv1d` extra. The `sys_platform == 'never'` override in SkyRL's pyproject.toml suppresses this.
- Prime-RL does not install either `fla` or `causal-conv1d` — they don't have Qwen3.5 linear attention support.
- The torch fallback `torch_chunk_gated_delta_rule` has a known issue with CPU tensor creation during gradient checkpointing recomputation (see commented-out Patch 2 in `model_wrapper.py`). The fla fast path avoids this entirely.
Loading