This repository contains model training and baselines for CellARC:
- Dataset repo: https://github.com/mireklzicar/cellarc
- Website: https://cellarc.mireklzicar.com/
| Project | Size | Training Mode | W&B URL |
|---|---|---|---|
| cellarc100k_50e_embedding_small | small | embedding | https://wandb.ai/lzicar/cellarc100k_50e_embedding_small |
| cellarc100k_50e_embedding_medium | medium | embedding | https://wandb.ai/lzicar/cellarc100k_50e_embedding_medium |
| cellarc100k_50e_embedding_large | large | embedding | https://wandb.ai/lzicar/cellarc100k_50e_embedding_large |
| cellarc100k_50e_incontext_small | small | incontext | https://wandb.ai/lzicar/cellarc100k_50e_incontext_small |
| cellarc100k_50e_incontext_medium | medium | incontext | https://wandb.ai/lzicar/cellarc100k_50e_incontext_medium |
| cellarc100k_50e_incontext_large | large | incontext | https://wandb.ai/lzicar/cellarc100k_50e_incontext_large |
- Use `python scripts/train.py --config-name train/default` with Hydra overrides for architecture/size/mode.
- Example (tiny_recursive large embedding with W&B logging):

  ```bash
  python scripts/train.py \
    --config-name train/default \
    model.architecture=tiny_recursive \
    model/size=large \
    training.mode=embedding \
    trainer.checkpoints.enabled=true \
    logging.wandb.enabled=true \
    logging.wandb.project=cellarc100k_50e_embedding_large \
    logging.wandb.group=mode_embedding \
    logging.wandb.name=tiny_recursive_large_embedding_single
  ```
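- To sweep several sizes in one invocation, Hydra's standard multirun mode can be used. The snippet below is a minimal sketch, assuming the training entry point is a regular Hydra app and nothing in the repo's launcher conflicts with multirun:

  ```bash
  # --multirun launches one job per value in the comma-separated sweep.
  python scripts/train.py --multirun \
    --config-name train/default \
    model.architecture=tiny_recursive \
    model/size=small,medium,large \
    training.mode=embedding
  ```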
- Smoke test: swap to the lightweight `train/smoke` config for a 5-step sanity check before long runs:

  ```bash
  python scripts/train.py \
    --config-name train/smoke \
    model.architecture=tiny_recursive \
    model/size=small \
    training.mode=incontext \
    trainer.checkpoints.enabled=false \
    logging.wandb.enabled=false
  ```
- `scripts/train_all_tmux.sh` splits independent runs across GPUs via tmux workers.
- Launch a curated subset on GPUs 0–3:

  ```bash
  bash scripts/train_all_tmux.sh --gpus 0,1,2,3 \
    --run transformer_act:large:embedding \
    --run tiny_recursive:large:embedding \
    --run hrm:large:embedding \
    --run transformer:large:incontext
  ```
- Status lives under `outputs/tmux_runs/<timestamp>`; attach to sessions with `tmux attach -t train_all_gpu0`.
- Smoke test: `bash scripts/train_all_smoke_test.sh small` runs every architecture in both modes with the `train/smoke` config (5 optimizer steps) before you queue the larger tmux batch; pass `small medium` to limit sizes.
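- A quick way to check on the workers from a plain shell (a sketch; session names follow the `train_all_gpu<N>` pattern mentioned above, and the exact contents of the status directory may differ):

  ```bash
  # List the tmux sessions started by the launcher.
  tmux ls | grep train_all

  # Show the most recently created status directory.
  ls -dt outputs/tmux_runs/*/ | head -n 1

  # Attach to the worker pinned to GPU 0 (detach again with Ctrl-b d).
  tmux attach -t train_all_gpu0
  ```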
- For data-parallel training of one embedding run across 4 GPUs, use `torchrun` (rank 0 handles logging):

  ```bash
  PYTORCH_CUDA_ALLOC_CONF=expandable_segments:true \
  torchrun --nproc-per-node=4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 \
    -m scripts.train --config-name train/default \
    model.architecture=tiny_recursive \
    model/size=large \
    training.mode=embedding \
    data.batch_size=96 \
    trainer.gradient_accumulation=2 \
    trainer.checkpoints.enabled=true \
    logging.wandb.enabled=true \
    logging.wandb.project=cellarc100k_50e_embedding_large \
    logging.wandb.group=embedding_ddp \
    logging.wandb.name=tiny_recursive_large_embedding_ddp
  ```
- Per-rank batch size plus gradient accumulation controls the memory footprint; run `wandb login` once before launching.
- Smoke test: keep the same `torchrun` invocation but point it at `train/smoke` so the job exits after a handful of steps:

  ```bash
  PYTORCH_CUDA_ALLOC_CONF=expandable_segments:true \
  torchrun --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 \
    -m scripts.train --config-name train/smoke \
    model.architecture=tiny_recursive \
    model/size=small \
    training.mode=embedding \
    logging.wandb.enabled=false \
    trainer.checkpoints.enabled=false
  ```
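- For reference, the effective optimizer batch of the 4-GPU command above works out as follows (a sketch, assuming `data.batch_size` is the per-rank batch, as the bullet above implies):

  ```bash
  # per-rank batch x world size x gradient accumulation steps
  # 96 x 4 x 2 = 768 samples per optimizer step
  echo $((96 * 4 * 2))
  ```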
- Run all bundled symbolic solvers with a single command; results land in `outputs/symbolic/`:

  ```bash
  bash scripts/run_symbolic_baselines.sh
  ```
- To smoke-test or target one solver, call the Hydra entry point directly and bound the episode count:

  ```bash
  python scripts/eval_symbolic.py \
    baseline.name=copycat \
    eval.max_episodes=16 \
    eval.progress_bar=true
  ```
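- To run a few solvers back to back with the same episode bound, a simple shell loop works (a sketch; `copycat` is the only solver name confirmed above, and the other entries are hypothetical placeholders, so check the baseline configs for the real names):

  ```bash
  # Replace the placeholder names with solvers that actually exist in the repo.
  for solver in copycat placeholder_solver_a placeholder_solver_b; do
    python scripts/eval_symbolic.py \
      baseline.name="$solver" \
      eval.max_episodes=16 \
      eval.progress_bar=true
  done
  ```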
- Evaluate GPT-based baselines (defaults to `gpt-5-2025-08-07`) over the 100-episode HF splits; ensure your OpenAI credentials are exported before running:

  ```bash
  bash scripts/run_gpt_eval.sh
  ```
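- Exporting credentials typically means setting the standard OpenAI environment variable (an assumption; if the script reads a different variable, follow its own documentation):

  ```bash
  # Standard variable read by OpenAI client libraries.
  export OPENAI_API_KEY="sk-..."
  bash scripts/run_gpt_eval.sh
  ```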
- Smoke test: limit to 10 episodes and write to a `_smoke` prediction log by toggling the environment flag:

  ```bash
  SMOKE_TEST=true bash scripts/run_gpt_eval.sh
  ```
- Batch multiple hosted LLMs via Hydra overrides (or reuse `scripts/run_llm_baselines.sh`) when you need to sweep model names, as sketched below.
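- One possible pattern for such a sweep (a sketch only; the entry point name `scripts/eval_llm.py` and the `llm.model` override key are hypothetical, so adapt them to whatever `scripts/run_llm_baselines.sh` actually calls):

  ```bash
  # Hypothetical sweep: substitute the real eval entry point and override key.
  for model in gpt-5-2025-08-07 hypothetical-model-b; do
    python scripts/eval_llm.py llm.model="$model"
  done
  ```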