Autonomous research system for optimizing models that detect human events (falls, eating, aggression, unstable gait, etc.) from 17-point pose estimation sequences.
Adapted from: Karpathy's autoresearch
Input: YOLO26 17-keypoint pose sequences
Output: Multi-class event classification
Give an AI agent a pose-based event detection model and let it experiment autonomously overnight. It modifies the model architecture, trains for 5 minutes, checks if accuracy improved, keeps or discards the change, and repeats. You wake up to a log of experiments and (hopefully) a better model.
Key difference from original autoresearch: Instead of training a GPT on text tokens, we train a temporal model (CNN+LSTM/Transformer) on pose keypoint sequences.
The model detects 7 types of human events:
- Fall - Person falling down
- Eating - Hand-to-mouth feeding motion
- Working Together - Multiple people collaborating
- Aggression - Physical altercation or aggressive behavior
- Unstable Gait - Walking with balance issues
- Wandering - Aimless movement patterns
- Sitting/Standing - Posture transitions
YOLO26 17-Keypoint Pose:
1. Nose, 2. Left Eye, 3. Right Eye, 4. Left Ear, 5. Right Ear,
6. Left Shoulder, 7. Right Shoulder, 8. Left Elbow, 9. Right Elbow,
10. Left Wrist, 11. Right Wrist, 12. Left Hip, 13. Right Hip,
14. Left Knee, 15. Right Knee, 16. Left Ankle, 17. Right Ankle
Each keypoint has 3 values: (x, y, confidence)
Per-frame input: 17 keypoints × 3 = 51 floats
Temporal sequence: N frames × 51 floats (typically N=30 for 1 second at 30fps)
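As a concrete check of these shapes, a batch of raw keypoints can be flattened with NumPy (frame count follows the 30 fps example above; the random data is only a placeholder):

```python
import numpy as np

SEQ_LEN = 30        # frames per sequence (1 second at 30 fps)
NUM_KEYPOINTS = 17  # YOLO pose keypoints
VALUES_PER_KP = 3   # (x, y, confidence)

# One sequence of raw keypoints: (frames, keypoints, values)
keypoints = np.random.rand(SEQ_LEN, NUM_KEYPOINTS, VALUES_PER_KP)

# Flatten each frame into a 51-float vector for the model
sequence = keypoints.reshape(SEQ_LEN, NUM_KEYPOINTS * VALUES_PER_KP)
print(sequence.shape)  # (30, 51)
```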
```bash
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project dependencies
uv sync
```

Option A: Use synthetic data (quick start)

```bash
# Downloads NTU RGB+D or Kinetics-Skeleton dataset
uv run prepare.py --dataset ntu
```

Option B: Extract poses from your videos
```bash
# Install YOLO
pip install ultralytics

# Extract poses
python scripts/extract_poses.py --video data/raw/*.mp4 --output data/processed/
```

```bash
# One 5-minute training run
uv run train.py
```

Expected output:

```
Training for 5 minutes...
Validation Accuracy: 0.6543
```
Point your coding agent (Claude, Codex, etc.) at this repo with no permissions and prompt:
> Read program.md and let's kick off a new experiment!
The agent will:
- Read `program.md` (your research directives)
- Modify `train.py` (architecture, hyperparameters)
- Run an experiment (5 minutes)
- Log results to `results.tsv`
- Repeat overnight (~100 experiments)
```
pose-autoresearch/
├── README.md              # This file
├── program.md             # Agent instructions (human edits)
├── train.py               # Model under test (agent edits)
├── prepare.py             # Data loading (fixed, not modified)
├── pyproject.toml         # Dependencies
├── data/
│   ├── raw/               # Raw videos or YOLO outputs
│   ├── processed/         # Preprocessed pose sequences
│   └── labels.csv         # Ground truth labels
├── scripts/
│   ├── extract_poses.py   # Extract poses from videos
│   ├── label_tool.py      # Manual labeling interface
│   └── visualize.py       # Visualize pose sequences
├── results.tsv            # Experiment log (created by agent)
└── checkpoints/           # Saved model weights
```
`train.py` contains the full model, optimizer, and training loop. Everything in it is fair game:
- Model architecture (CNN, LSTM, Transformer, attention, etc.)
- Hyperparameters (hidden dims, num layers, dropout, etc.)
- Optimizer (AdamW, SGD, learning rate, weight decay)
- Data augmentation
- Training schedule
Current baseline: CNN + LSTM
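A minimal PyTorch sketch of this baseline, assuming the layer sizes listed in this README (the real `train.py` may differ in kernel sizes and pooling):

```python
import torch
import torch.nn as nn

class PoseEventClassifier(nn.Module):
    """CNN front-end over time, LSTM for temporal context, linear head for 7 classes."""

    def __init__(self, in_dim=51, num_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(256, 256, num_layers=2, batch_first=True)
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):                      # x: (batch, seq_len, 51)
        x = self.conv(x.transpose(1, 2))       # conv over time: (batch, 256, seq_len)
        out, _ = self.lstm(x.transpose(1, 2))  # (batch, seq_len, 256)
        return self.head(out[:, -1])           # logits from the last time step

logits = PoseEventClassifier()(torch.randn(4, 30, 51))
print(logits.shape)  # torch.Size([4, 7])
```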
```
Conv1d(51 → 128) → Conv1d(128 → 256) → LSTM(256, 2 layers) → Linear(256 → 7)
```

`prepare.py` holds constants, data loading, and evaluation. It is not modified by the agent:
- Loads pose sequences from `data/processed/`
- Creates train/val/test splits
- Defines the evaluation metric (accuracy)
- Fixes the 5-minute time budget
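The train/val/test split could be sketched like this (the ratios and seed here are assumptions, not necessarily what `prepare.py` uses):

```python
import random

def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle sample indices and partition them into train/val/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed keeps splits reproducible
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    val, test, train = idx[:n_val], idx[n_val:n_val + n_test], idx[n_val + n_test:]
    return train, val, test

train, val, test = split_indices(3500)
print(len(train), len(val), len(test))  # 2800 350 350
```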
`program.md` holds instructions for the autonomous research agent. You iterate on this file to improve the research strategy.
Primary metric: Validation accuracy
Training always runs for exactly 5 minutes (wall clock), regardless of model size or batch size. This means:
- ~12 experiments per hour
- ~100 experiments overnight (8 hours)
- All experiments are directly comparable
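The fixed wall-clock budget can be enforced with a simple time check in the training loop. This is a sketch; `train_step` is a stand-in for the real per-batch step:

```python
import time

TIME_BUDGET_S = 5 * 60  # 5 minutes of wall-clock training

def train_for_budget(train_step, budget_s=TIME_BUDGET_S):
    """Run training steps until the wall-clock budget is exhausted."""
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        train_step()
        steps += 1
    return steps

# Example with a trivial stand-in step and a tiny budget
steps = train_for_budget(lambda: time.sleep(0.01), budget_s=0.1)
```

Because the budget is wall-clock, bigger models simply take fewer steps, which is what keeps experiments comparable.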
Higher validation accuracy is better. The agent hill-climbs on this metric.
Use existing pose datasets:
- NTU RGB+D 120 - 120 action classes with skeleton data
- Kinetics-Skeleton - Large-scale video dataset with pose annotations
Map their actions to our event categories.
1. Deploy YOLO26 on camera feeds:

```python
from ultralytics import YOLO

model = YOLO("yolo26n-pose.pt")
results = model(video_stream)
```

2. Collect pose sequences (30 frames = 1 second at 30fps)

3. Label manually using the provided tool:

```bash
python scripts/label_tool.py --data data/raw/
```

Recommended: 500+ examples per event class (3,500+ total)
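Collecting the 30-frame sequences can be done with a rolling buffer over the per-frame keypoints. A sketch using a deque, with the frame stream simulated:

```python
from collections import deque

SEQ_LEN = 30  # frames per sequence (1 second at 30 fps)

def windows(frames, seq_len=SEQ_LEN):
    """Yield every full seq_len-frame window from a stream of per-frame keypoint vectors."""
    buf = deque(maxlen=seq_len)  # oldest frame drops off automatically
    for frame in frames:
        buf.append(frame)
        if len(buf) == seq_len:
            yield list(buf)

# Simulated stream: 45 frames of 51 zeros each
stream = ([0.0] * 51 for _ in range(45))
seqs = list(windows(stream))
print(len(seqs))  # 16 overlapping windows (45 - 30 + 1)
```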
The CNN+LSTM baseline is a deliberately simple starting point:
- Fast experiments (~5 min/run)
- Proven for temporal sequence classification
- Easy for agent to modify
- Temporal Transformer - Better long-range dependencies
- Bidirectional LSTM - Look ahead and behind
- Attention Mechanisms - Focus on important keypoints
- Graph Neural Networks - Respect skeleton structure
The agent will autonomously explore these through modification of train.py.
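As an example of how small some of these variants are, the bidirectional LSTM is roughly a one-argument change in PyTorch (layer sizes assumed; note the doubled output width, which the classifier head must absorb):

```python
import torch
import torch.nn as nn

# Baseline-style unidirectional LSTM
uni = nn.LSTM(256, 256, num_layers=2, batch_first=True)

# Bidirectional variant: same input size, doubled output width
bi = nn.LSTM(256, 256, num_layers=2, batch_first=True, bidirectional=True)

x = torch.randn(4, 30, 256)
out_uni, _ = uni(x)
out_bi, _ = bi(x)
print(out_uni.shape, out_bi.shape)  # (4, 30, 256) vs (4, 30, 512)
```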
After one night of autonomous research:
| commit  | val_acc | description             | status  |
|---------|---------|-------------------------|---------|
| a3f91d2 | 0.6543  | Baseline CNN+LSTM       | keep    |
| b82c4e1 | 0.6891  | 2x hidden dim (256→512) | keep    |
| c71f3a9 | 0.6823  | Add dropout 0.3         | discard |
| d92e5b2 | 0.7124  | Bidirectional LSTM      | keep    |
| e13a7c4 | 0.7456  | Add attention layer     | keep    |
| ...     |         |                         |         |

This system is the next evolution of the Vistarra fall detection project:
Current Vistarra:
- YOLO pose → Claude Vision API → fall/no fall classification
- Works but slow and expensive per frame
This System:
- YOLO pose → Trained Model → 7 event types
- 100x faster, no API costs, more events
Migration path:
- Extract pose data from Vistarra deployments
- Label events
- Train model with autoresearch
- Replace Claude Vision with trained model
- Deploy to edge devices (NVIDIA Jetson)
- Python: 3.10+
- GPU: NVIDIA GPU recommended (H100 ideal, but works on smaller GPUs)
- Disk: ~10GB for datasets
- RAM: 16GB+
Platform support: Currently requires NVIDIA GPU. CPU/MPS support possible but would require modifications to train.py (or have the agent figure it out!).
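One way to relax the NVIDIA-only assumption is a device fallback at the top of train.py. A sketch (the real script may hard-code CUDA):

```python
import torch

def pick_device():
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # guard for older torch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
```

Model and batches would then be moved with `.to(device)` instead of `.cuda()`.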
Edit prepare.py to define your own event taxonomy:
```python
EVENT_CLASSES = [
    "custom_event_1",
    "custom_event_2",
    # ... etc
]
```

Modify train.py to handle multiple people per frame:

```python
# Input: (batch, num_people, seq_len, 51)
# Process each person separately, then aggregate
```

Export trained model:
```python
import torch

model = PoseEventClassifier()
model.load_state_dict(torch.load("best_model.pt"))

# Dummy input matching the training shape: (batch, seq_len, 51)
dummy_input = torch.randn(1, 30, 51)
torch.onnx.export(model, dummy_input, "pose_model.onnx")
```

Deploy to edge device:

```bash
# On NVIDIA Jetson
trtexec --onnx=pose_model.onnx --saveEngine=pose_model.trt
```

This work adapts Karpathy's autoresearch framework:
```bibtex
@misc{karpathy2026autoresearch,
  author = {Karpathy, Andrej},
  title  = {autoresearch: AI agents running research on single-GPU nanochat training automatically},
  year   = {2026},
  url    = {https://github.com/karpathy/autoresearch}
}
```

License: MIT
Contributions welcome! Areas of interest:
- Additional event classes
- GNN architecture implementations
- Multi-person tracking
- Real-time deployment tools
- Labeling interface improvements
Creator: Sam Fuller
GitHub: @sfuller3
Project: https://github.com/sfuller3/pose-autoresearch
- Andrej Karpathy for the autoresearch framework
- Ultralytics for YOLO26 pose estimation
- NTU RGB+D and Kinetics teams for pose datasets