mjdrone is a standalone multimodal drone RL project built on top of the mjlab repository.
The project now has three defining characteristics:
- a shared attention-fusion policy backbone that combines front-camera vision with IMU and task-state inputs,
- randomized visual environments so the policy learns across many layouts instead of memorizing one scene,
- a staged curriculum that starts with hover, moves to waypoint tracking, and is intended to extend to gate flight.
The current vision backbone is no longer trained from scratch. The policy now uses a pretrained ResNet image encoder with selective fine-tuning, then fuses those visual features with IMU/state inputs through the same attention-fusion control head.
Today the project contains two tasks:
- `hover`: lift off from the ground and stabilize near a low hover target,
- `waypoint`: lift off from the ground and intercept a distinct target tank while avoiding all non-target contacts.
The near-term goal is not gate flight yet. The goal is to build the shared vehicle, sensing, training, and policy stack that gate flight will later depend on.
There is also a play-only inspection mode:
- `test`: loads the shared environment for viewing and disables rotor thrust so the drone stays inert on the ground.
mjdrone/
├── README.md
├── pyproject.toml
└── src/
    └── mjdrone/
        ├── cli.py
        ├── assets/
        │   ├── __init__.py
        │   ├── landmarks.py
        │   └── quadcopter.py
        ├── models/
        │   ├── __init__.py
        │   └── attention_fusion.py
        └── tasks/
            ├── __init__.py
            ├── hover/
            │   ├── __init__.py
            │   ├── env_cfg.py
            │   ├── mdp.py
            │   └── rl_cfg.py
            └── waypoint/
                ├── __init__.py
                ├── env_cfg.py
                ├── mdp.py
                └── rl_cfg.py
These parts are shared across tasks and are the real core of the project.
The quadcopter asset is defined in src/mjdrone/assets/quadcopter.py.
It includes:
- a floating-base quadcopter body,
- four rotor thrust actuators,
- an `imu_site` used for acceleration, velocity, and gyro sensing,
- a front-facing RGB camera,
- visual geometry that is hidden from the onboard camera so the camera does not see the drone itself.
The default onboard camera now uses a landscape image shape of 160 x 96, a corrected roll orientation, and a narrower field of view than before, so target vehicles occupy more useful pixels and the feed displays with the expected horizon orientation.
The default action space is 4D, one normalized thrust command per rotor.
Each task builds three observation groups:
- `actor`
- `critic`
- `camera`
actor and critic contain the low-dimensional state:
- IMU linear acceleration,
- IMU linear velocity,
- IMU angular velocity,
- projected gravity,
- target error terms,
- previous action.
camera contains the front RGB image.
The runner config resolves observations as:
- `actor = ("actor", "camera")`
- `critic = ("critic", "camera")`
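As a rough dimension check on the low-dimensional groups, the sketch below budgets the 21D state vector used by both tasks. The per-term sizes are assumptions (three axes for each IMU term and for projected gravity, four values for the previous action to match the 4D action space). The source only gives the total, so the target-error size here is simply whatever remains.

```python
# Only the 21D total and the 4D action space come from the text;
# the per-term sizes below are assumptions for illustration.
STATE_DIM = 21
ASSUMED_TERM_SIZES = {
    "imu_lin_acc": 3,
    "imu_lin_vel": 3,
    "imu_ang_vel": 3,
    "projected_gravity": 3,
    "previous_action": 4,
}

# Runner-side mapping as quoted above: each network sees its state group plus the camera.
OBS_GROUPS = {
    "actor": ("actor", "camera"),
    "critic": ("critic", "camera"),
}


def remaining_target_error_dim(total=STATE_DIM, sizes=ASSUMED_TERM_SIZES):
    """Return how many dimensions are left for target-error terms."""
    used = sum(sizes.values())
    if used > total:
        raise ValueError(f"assumed terms use {used} dims, more than {total}")
    return total - used


target_error_dim = remaining_target_error_dim()
```

If the real observation terms differ in size, only `ASSUMED_TERM_SIZES` needs to change; the check still guards against a budget that exceeds the encoder input width.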
Both hover and waypoint use the same multimodal attention-fusion backbone implemented in src/mjdrone/models/attention_fusion.py.
The model is intentionally split into a fast control/state path and a visual context path:
- the state path handles IMU and task state,
- the vision path handles the camera image through a pretrained ResNet backbone,
- the fusion block lets state decide what visual evidence matters.
The vision branch now uses:
- `torchvision` ResNet-18 pretrained on ImageNet,
- ImageNet input normalization,
- feature extraction up to `layer3`,
- only the later visual stage (`layer3`) left trainable by default,
- shared visual encoder weights between actor and critic during PPO.
The IMU is not a side input. It is part of the main control pathway.
The low-dimensional state branch contains:
- linear acceleration,
- linear velocity,
- angular velocity,
- projected gravity,
- target error,
- previous action.
That branch gives the policy direct information about motion, tilt, drift, and recent control effort. Vision provides scene context; the IMU/state branch provides fast stabilization context.
High-level data flow:
camera image
-> pretrained ResNet-18 feature extractor
-> visual tokens + 2D positional encoding
normalized IMU/task state
-> state encoder MLP
-> encoded state latent
-> 4 state-derived query tokens
state-derived query tokens + visual tokens
-> cross-attention
-> attended visual
encoded state latent + attended visual
-> state-conditioned gated fusion
-> fused latent
-> actor head or critic head
Stage-by-stage view:
| Stage | Input | Operation | Output |
|---|---|---|---|
| Vision encoder | 3 x 96 x 160 RGB image | pretrained ResNet-18 up to `layer3` | visual feature map |
| Vision tokenization | feature map | flatten spatial dimensions | N x 96 visual tokens |
| Positional encoding | token grid | learned 2D position projection | position-aware tokens |
| State encoder | 21D IMU/task vector | MLP 21 -> 64 -> 96 with normalization | 96D state latent |
| Query generation | 96D state latent | linear projection | 4 x 96 query tokens |
| Cross-attention | state queries + visual tokens | multi-head attention | attended visual summary |
| Fusion | state latent + attended visual + pooled visual context | gated fusion block | fused visual latent |
| Final latent | state latent + fused visual | concatenation | 192D policy/value latent |
| Output head | 192D latent | MLP 512 -> 384 -> 256 | actor actions or critic value |
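The token count N in the table depends on the image size. A minimal sketch of that arithmetic follows, assuming the standard torchvision ResNet-18 layout, where `conv1`, `maxpool`, `layer2`, and `layer3` each halve the spatial size; `token_grid` is a hypothetical helper, not part of the project. Standard ResNet-18 `layer3` features have 256 channels, so the 96-wide tokens in the table presumably come from a projection that the text does not describe.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a strided conv/pool along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) of the downsampling ops in torchvision's
# ResNet-18 up to layer3. layer1 keeps the resolution, so it is omitted.
RESNET18_TO_LAYER3 = [
    (7, 2, 3),  # conv1
    (3, 2, 1),  # maxpool
    (3, 2, 1),  # first conv of layer2
    (3, 2, 1),  # first conv of layer3
]


def token_grid(height, width):
    """Return the (rows, cols) of the layer3 feature map for an input image."""
    for kernel, stride, padding in RESNET18_TO_LAYER3:
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
    return height, width


rows, cols = token_grid(96, 160)   # default camera shape
num_tokens = rows * cols           # N in the table above
```

Under these assumptions the default 96 x 160 image yields a 6 x 10 grid, so N = 60, and a smaller image shrinks the token sequence accordingly.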
Current default dimensions for both tasks:
- image input: `3 x 96 x 160`
- low-dimensional state input: `21`
- state encoder: `21 -> 64 -> 96`
- attention width: `96`
- attention heads: `8`
- query tokens: `4`
- final fused latent: `192`
- actor hidden layers: `512 -> 384 -> 256 -> 4`
- critic hidden layers: `512 -> 384 -> 256 -> 1`
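These defaults constrain each other, which a small config object can make explicit. The sketch below is illustrative only: `FusionDims` and its field names are hypothetical and do not mirror the project's actual config classes, and it assumes the 192D latent is the 96D state latent concatenated with a 96D fused visual vector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionDims:
    """Hypothetical container for the default dimensions listed above."""
    state_in: int = 21
    state_hidden: int = 64
    attn_width: int = 96
    attn_heads: int = 8
    query_tokens: int = 4
    head_hidden: tuple = (512, 384, 256)
    action_dim: int = 4

    def __post_init__(self):
        # Multi-head attention splits the width evenly across heads.
        if self.attn_width % self.attn_heads != 0:
            raise ValueError("attn_width must be divisible by attn_heads")

    @property
    def per_head_width(self):
        return self.attn_width // self.attn_heads

    @property
    def fused_latent(self):
        # Assumption: state latent and fused visual vector share attn_width.
        return 2 * self.attn_width

    def mlp_sizes(self, out_dim):
        """Layer widths of an output head, from fused latent to out_dim."""
        return (self.fused_latent, *self.head_hidden, out_dim)


dims = FusionDims()
actor_sizes = dims.mlp_sizes(dims.action_dim)   # actor head
critic_sizes = dims.mlp_sizes(1)                # critic head
```

With the defaults, each of the 8 heads works on 12 channels, and changing the attention width to a value that 8 does not divide fails at construction time.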
If the pretrained ResNet weights are not already cached on your machine, the first training or play run that constructs the model will trigger a torchvision weight download.
That means:
- `torchvision` must be installed,
- the machine must have internet access for the first pretrained run,
- after the weights are cached locally, later runs reuse them.
Design intent:
- the state branch stays separate until late fusion,
- state-derived queries attend over image tokens, not the reverse,
- the encoded state has a direct path to the output head,
- the policy can stabilize from IMU/state even when vision is noisy,
- vision provides external context for drift correction, orientation references, and later target tracking.
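The direction of attention in the design intent above (state-derived queries attend over image tokens) can be illustrated with a dependency-free sketch. This is a simplified single-head version with no learned projections, not the project's 8-head module; the random vectors stand in for real query and visual tokens, and the token count of 60 is an assumption about the feature-map size.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Single-head scaled dot-product attention.

    Each query (from the state branch) produces one weighted average of the
    tokens (from the vision branch). Tokens act as both keys and values.
    """
    width = len(queries[0])
    scale = 1.0 / math.sqrt(width)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        w = softmax(scores)
        out = [sum(wj * tok[c] for wj, tok in zip(w, tokens)) for c in range(width)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


rng = random.Random(0)
WIDTH, NUM_QUERIES, NUM_TOKENS = 96, 4, 60  # NUM_TOKENS is an assumed value
queries = [[rng.gauss(0, 1) for _ in range(WIDTH)] for _ in range(NUM_QUERIES)]
tokens = [[rng.gauss(0, 1) for _ in range(WIDTH)] for _ in range(NUM_TOKENS)]
attended, attn = cross_attention(queries, tokens)
```

The output has one row per query, so its size is fixed by the state branch (4 x 96) regardless of how many visual tokens the image produces.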
The current project includes visual scene randomization, not full dynamics/domain randomization.
Current visual randomization changes the rendered scene so the camera policy must generalize across layouts. It does not yet change the underlying flight dynamics.
The reusable landmark assets are defined in src/mjdrone/assets/landmarks.py. On environment reset, task code randomizes:
- a multi-block branching road network around the drone,
- ground and road-like coloring,
- tree positions, yaw, and canopy colors,
- car positions, yaw, and body colors,
- billboard positions, yaw, and panel colors.
This makes different parallel environments look different and causes each episode to present a new scene.
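A minimal sketch of reset-time scene sampling follows, to show how per-environment and per-episode variation can be made reproducible. The `Landmark` type, the counts, the placement range, and the seeding scheme are all hypothetical; the real logic lives in the task code and `landmarks.py` and is not reproduced here.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Landmark:
    kind: str      # "tree", "car", or "billboard"
    x: float       # planar position in metres (assumed units)
    y: float
    yaw: float     # heading in radians
    rgb: tuple     # canopy, body, or panel color


def sample_scene(rng, half_extent=10.0, counts=None):
    """Draw one randomized layout from the given random generator."""
    counts = counts or {"tree": 6, "car": 4, "billboard": 2}
    scene = []
    for kind, n in counts.items():
        for _ in range(n):
            scene.append(Landmark(
                kind=kind,
                x=rng.uniform(-half_extent, half_extent),
                y=rng.uniform(-half_extent, half_extent),
                yaw=rng.uniform(-math.pi, math.pi),
                rgb=tuple(rng.random() for _ in range(3)),
            ))
    return scene


def reset_scenes(base_seed, episode, num_envs):
    """One independent scene per environment, re-drawn every episode."""
    return [
        sample_scene(random.Random(f"{base_seed}-{episode}-{env_id}"))
        for env_id in range(num_envs)
    ]


scenes = reset_scenes(base_seed=42, episode=0, num_envs=4)
```

Seeding from the (seed, episode, environment) triple means parallel environments differ from each other, each episode differs from the last, and a run can still be replayed exactly.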
Future dynamics/domain randomization is still separate work. That future stage would include things like:
- wind disturbances,
- mass or inertia variation,
- motor lag,
- sensor noise tuning,
- sim-to-real parameter randomization.
Hover is the stabilization task and the base of the curriculum.
The hover task:
- starts with the drone on the ground instead of already airborne,
- samples a local target inside a small 3D region,
- keeps that target low enough to require liftoff without pushing the drone above the scene,
- rewards planar position and altitude tracking,
- rewards staying upright,
- penalizes post-liftoff contact so crashing is explicitly bad instead of only terminating the episode,
- penalizes linear velocity, angular velocity, action rate, and action magnitude,
- terminates on time-out, excessive tilt, post-liftoff contact, or leaving the allowed flight region.
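The reward and termination terms listed above can be sketched as two small functions. Everything numeric here is an assumption chosen for illustration (weights, kernel widths, tilt limit), and `HoverStep` is a hypothetical summary of one simulation step rather than the project's MDP interface.

```python
import math
from dataclasses import dataclass


@dataclass
class HoverStep:
    pos_error_xy: float   # planar distance to target
    alt_error: float      # altitude error
    tilt: float           # angle from upright, radians
    lin_speed: float
    ang_speed: float
    action_rate: float    # change in action since the previous step
    action_norm: float    # magnitude of the current action
    lifted_off: bool
    in_contact: bool
    in_bounds: bool
    timed_out: bool


def hover_reward(s):
    """Tracking and upright bonuses minus smoothness and crash penalties."""
    tracking = math.exp(-s.pos_error_xy ** 2 / 0.25) + math.exp(-s.alt_error ** 2 / 0.25)
    upright = 0.5 * math.cos(s.tilt)
    smoothness = (0.05 * s.lin_speed + 0.05 * s.ang_speed
                  + 0.01 * s.action_rate + 0.01 * s.action_norm)
    # Contact only counts as a crash once the drone has left the ground.
    crash = 5.0 if (s.lifted_off and s.in_contact) else 0.0
    return tracking + upright - smoothness - crash


def hover_done(s, max_tilt=1.0):
    """Terminate on time-out, excessive tilt, post-liftoff contact, or leaving bounds."""
    crashed = s.lifted_off and s.in_contact
    return s.timed_out or abs(s.tilt) > max_tilt or crashed or not s.in_bounds
```

Gating the contact penalty on `lifted_off` is what lets the episode begin on the ground without being scored as a crash.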
This task is designed to teach basic visual-inertial stabilization before navigation is added.
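Main files:
- `src/mjdrone/tasks/hover/env_cfg.py`
- `src/mjdrone/tasks/hover/mdp.py`
- `src/mjdrone/tasks/hover/rl_cfg.py`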
Waypoint builds on the same vehicle, sensors, randomization, and policy backbone as hover.
What changes relative to hover:
- the drone also starts on the ground and must lift off before navigating,
- the target is a single tan tank-like vehicle that is visually distinct from the ordinary cars in the scene,
- the actor no longer receives the true target offset directly and instead must use the camera to identify and pursue the target tank,
- the tank is spawned anywhere in the environment and the drone is initialized facing it so it starts in frame,
- the reward emphasizes progress and fast interception of the target vehicle,
- successful target contact gets a large positive bonus,
- non-target post-liftoff contact gets an explicit negative penalty,
- colliding with the target tank is success,
- any post-liftoff contact with the ground or other obstacles is failure.
So hover teaches the drone to lift off and remain stable; waypoint teaches it to lift off, visually acquire the target tank, and move through clutter without touching anything else.
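The success and failure rules above reduce to a small classification of which bodies the drone is touching. In this sketch the body names (`target_tank`, `ground`, `car_3`) are hypothetical, and letting target contact win when the drone touches the target and something else in the same step is an assumption the text does not settle.

```python
from enum import Enum


class Outcome(Enum):
    CONTINUE = "continue"
    SUCCESS = "success"
    FAILURE = "failure"


def classify_contact(lifted_off, contact_bodies, target="target_tank"):
    """Map the current contacts to an episode outcome.

    contact_bodies is the set of body names the drone touches this step.
    """
    if not lifted_off:
        # Resting on the ground before takeoff is expected, not a crash.
        return Outcome.CONTINUE
    if target in contact_bodies:
        # Assumed tie-break: target contact wins over simultaneous contacts.
        return Outcome.SUCCESS
    if contact_bodies:
        return Outcome.FAILURE
    return Outcome.CONTINUE
```

For example, `classify_contact(True, {"car_3"})` is a failure, while `classify_contact(True, {"target_tank"})` is a success that would earn the interception bonus.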
Main files:
- `src/mjdrone/tasks/waypoint/env_cfg.py`
- `src/mjdrone/tasks/waypoint/mdp.py`
- `src/mjdrone/tasks/waypoint/rl_cfg.py`
Training uses:
- `mjlab` environment managers,
- `RslRlVecEnvWrapper`,
- PPO through `MjlabOnPolicyRunner`.
Runs are written under:
logs/rsl_rl/<experiment_name>/<timestamp>[_run_name]/
Each run stores:
- checkpoints,
- `params/env.yaml`,
- `params/agent.yaml`,
- optional training videos when `--video` is enabled.
Model checkpoints are saved as:
logs/rsl_rl/<experiment_name>/<timestamp>[_run_name]/model_<iteration>.pt
Current experiment directories by task:
- hover: `logs/rsl_rl/mjdrone_hover_pretrained_vision/`
- waypoint: `logs/rsl_rl/mjdrone_waypoint_pretrained_vision/`
Examples:
/home/oorischubert/mjdrone/logs/rsl_rl/mjdrone_hover_pretrained_vision/<timestamp>/model_<iteration>.pt
/home/oorischubert/mjdrone/logs/rsl_rl/mjdrone_waypoint_pretrained_vision/<timestamp>/model_<iteration>.pt
If you want the latest checkpoint for a task, look inside the newest timestamped run directory under that experiment.
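That lookup can also be scripted. The helper below is a sketch, not part of the `mjdrone` CLI: `latest_checkpoint` is a hypothetical name, and it relies only on the directory layout described above, where run directories start with a sortable timestamp and checkpoints are named `model_<iteration>.pt`.

```python
import re
from pathlib import Path

_MODEL_RE = re.compile(r"^model_(\d+)\.pt$")


def latest_checkpoint(experiment_dir):
    """Return the highest-iteration checkpoint in the newest run, or None.

    Run directory names begin with a YYYY-MM-DD_HH-MM-SS timestamp, so a
    plain name sort is chronological. Iterations are compared as integers
    so that model_1000.pt ranks above model_500.pt.
    """
    experiment_dir = Path(experiment_dir)
    if not experiment_dir.is_dir():
        return None
    runs = sorted(p for p in experiment_dir.iterdir() if p.is_dir())
    for run in reversed(runs):
        found = []
        for path in run.glob("model_*.pt"):
            match = _MODEL_RE.match(path.name)
            if match:
                found.append((int(match.group(1)), path))
        if found:
            return max(found, key=lambda item: item[0])[1]
    return None


# Example (path taken from the experiment directories listed above):
# latest_checkpoint("logs/rsl_rl/mjdrone_waypoint_pretrained_vision")
```

The returned path can be passed to `--checkpoint` during playback. Runs that contain no checkpoints are skipped in favor of the next most recent run.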
The currently retained waypoint run is:
/home/oorischubert/mjdrone/logs/rsl_rl/mjdrone_waypoint_pretrained_vision/2026-03-11_13-02-32/model_3999.pt
Useful commands:
ls /home/oorischubert/mjdrone/logs/rsl_rl
ls /home/oorischubert/mjdrone/logs/rsl_rl/mjdrone_hover_pretrained_vision
ls /home/oorischubert/mjdrone/logs/rsl_rl/mjdrone_waypoint_pretrained_vision
find /home/oorischubert/mjdrone/logs/rsl_rl -name 'model_*.pt' | sort

Interactive playback is controlled through `mjdrone play`.
Default behavior:
- `--headless` defaults to false,
- if `--headless` is omitted, playback opens a viewer,
- if `--headless` is passed, playback runs without a viewer.
Viewer selection:
- `--viewer auto`: prefer `viser` for camera-based tasks,
- `--viewer native`: force the MuJoCo native viewer,
- `--viewer viser`: force the browser-based Viser viewer.
For camera-based inspection, viser is the better option because it shows the onboard camera feed in Camera Feeds. The native viewer only shows the external scene.
From mjdrone/:
uv sync

This installs the published mjlab==1.2.0 dependency automatically. A local mjlab checkout is not required.

uv run mjdrone train

Train hover explicitly:

uv run mjdrone train --task hover

Train waypoint explicitly:

uv run mjdrone train --task waypoint

Useful training flags:
- `--task hover|waypoint`
- `--device cuda:0`
- `--num-envs 512`
- `--seed 42`
- `--max-iterations N`
- `--num-steps-per-env N`
- `--run-name name`
- `--log-root /custom/path`
- `--image-width 64`
- `--image-height 48`
- `--video`
- `--video-length 200`
- `--video-interval 2000`
If --max-iterations or --num-steps-per-env are omitted, task defaults from the runner config are used:
- hover: `3000` iterations, `32` steps per environment
- waypoint: `4000` iterations, `40` steps per environment
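The fallback behavior amounts to "a value given on the command line wins, otherwise the task default applies". A minimal sketch, using the default values listed above; the function and dictionary names are hypothetical and do not reflect how `cli.py` is actually written.

```python
# Task defaults as stated above (from the runner config).
TASK_DEFAULTS = {
    "hover": {"max_iterations": 3000, "num_steps_per_env": 32},
    "waypoint": {"max_iterations": 4000, "num_steps_per_env": 40},
}


def resolve_run_length(task, max_iterations=None, num_steps_per_env=None):
    """Use explicit values when given, otherwise fall back to task defaults."""
    defaults = TASK_DEFAULTS[task]
    return {
        "max_iterations": (defaults["max_iterations"]
                           if max_iterations is None else max_iterations),
        "num_steps_per_env": (defaults["num_steps_per_env"]
                              if num_steps_per_env is None else num_steps_per_env),
    }


def transitions_per_iteration(num_envs, num_steps_per_env):
    """Environment steps collected before each PPO update."""
    return num_envs * num_steps_per_env


hover_cfg = resolve_run_length("hover")
hover_batch = transitions_per_iteration(512, hover_cfg["num_steps_per_env"])
```

With 512 environments and the hover default of 32 steps, each iteration collects 16384 transitions, which is a useful number to keep in mind when changing `--num-envs`.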
Example:
uv run mjdrone train --task hover --device cuda:0 --num-envs 512 --run-name hover_rgb_imu --video

Because `--headless` defaults to false, this opens a viewer:

uv run mjdrone play

Play waypoint:

uv run mjdrone play --task waypoint

In waypoint play, the environment spawns one visually distinct tan target tank at reset. The drone should lift off, find that vehicle in the camera feed, and collide with it without touching anything else first.
Show the onboard camera feed explicitly:
uv run mjdrone play --viewer viser

Useful playback flags:
- `--task hover|waypoint|test`
- `--agent trained|random|zero`
- `--checkpoint /path/to/model.pt`
- `--device cuda:0`
- `--num-envs 1`
- `--viewer auto|native|viser`
- `--headless`
- `--steps 1000`
- `--video`
- `--video-length 300`
- `--image-width 160`
- `--image-height 96`
Examples:
uv run mjdrone play --agent random
uv run mjdrone play --viewer native
uv run mjdrone play --viewer viser
uv run mjdrone play --task test
uv run mjdrone play --headless --steps 1500

To save a video during headless playback:

uv run mjdrone play --headless --video --video-length 300

The trees, parked cars, billboards, and ground markings serve two roles:
- they provide visual structure for the front camera,
- and in waypoint they also act as physical obstacles that the drone must avoid after liftoff.
They exist because camera-based hover and waypoint learning are poorly conditioned if the image mostly contains blank ground and sky. The landmarks provide:
- stable texture,
- depth and parallax cues,
- orientation references,
- better visual evidence for drift and yaw correction.
With reset-time randomization, they are also part of the generalization strategy rather than fixed decoration. The target tank is randomized in the same shared scene generator, but it is kept in a forward spawn corridor so the drone can see it immediately at the start of each episode.
The intended progression is:
- Hover
- Waypoint tracking
- Single gate flight
- Multi-gate racing
That order is deliberate. Gate flight should build on a stable hover and navigation stack, not replace it.
This is still an early project.
It does not yet include:
- gate geometry or gate-passage rewards,
- ordered waypoint sequences beyond single active target vehicles,
- dynamics/domain randomization such as wind or motor lag,
- sim-to-real tuning,
- multi-camera perception,
- higher-level controller abstractions above raw rotor thrust,
- temporal memory for deliberate target-search behavior when the target leaves the camera frustum.