Update to New RSL-RL - Enable multi gpu #31

alessandroassirelli98 wants to merge 7 commits into NVlabs:main
Conversation
Commits:
- fix missing attribute and log path
- add player for new rsl_rl
- remove checkpoint abspath. Updated doc
- running formatter
- adding student
- add check for multi gpu and set right device
- delete dependency
- remove install_deps.sh file
- remove custom rsl_rl
- fix deprecated attribute
- save log in robot subfolder
- configured training for student with new rsl_rl
- change observations teacher_policy to teacher for rsl_rl compatibility
- get right checkpoints in the player
- update code for eval
- run formatter
- fix eval script. updated doc
- add my isaaclab path to intellisense
- specify python package
Thanks for the MR! We really appreciate the contribution. It might take us a bit of time to review the changes thoroughly and test them out before we can merge it in. Thanks for your patience!
huihuaNvidia2023 left a comment
First pass done, need to verify locally.
This is nice, but it assumes a fixed path to IsaacLab. A better way would be to use the environment variable ISAACLAB_PATH; VS Code provides utilities for referencing environment variables.
Yeah I was using this for debugging, I think it would be best to just remove it
We can't host this dataset due to license issues.
```diff
-num_steps_per_env: int = field(
-    default=24, metadata={"description": "Number of steps per environment per iteration."}
-)
+num_steps_per_env = 24
+max_iterations = 10000000
```
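For context, this swaps the `dataclasses.field()` style (with a metadata description) for the plain class-attribute style used by the newer RSL-RL runner configs. A minimal sketch of the two styles side by side; the class names are assumed for illustration and are not from the PR:

```python
from dataclasses import dataclass, field


@dataclass
class OldRunnerCfg:
    # Old style: field() with a metadata description string.
    num_steps_per_env: int = field(
        default=24,
        metadata={"description": "Number of steps per environment per iteration."},
    )


@dataclass
class NewRunnerCfg:
    # New style: plain defaults, as in the upstream RSL-RL configs.
    num_steps_per_env: int = 24
    max_iterations: int = 100000  # the reviewer suggests ~50k-100k rather than 10M
```

Both styles produce the same runtime defaults; the `field()` variant only adds the self-documenting metadata.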
nit: this number is too big. Normally 50k or 80k can give a good result, maybe 100k is enough?
Thanks a lot for the comprehensive changes, your contribution is much appreciated!
Actually, the latest IsaacLab 2.1 comes with the version of RSL-RL that has distillation and a bunch of other new features.
If we upgrade to v2.1.0, we could potentially make it even nicer. The jump from v2.0.0 to v2.1.0 requires minimal changes; only a few data names need to be updated.
It may be worth giving it a shot.
For this I kept the same iteration number as the original version, but I can reduce it no problem.
Yeah, so this PR already uses the version of RSL-RL that has distillation. The only reason I did not use it before, and was installing rsl_rl separately, is that the distillation algorithm in rsl_rl was not using a gradient cap, which was used in the original implementation shipped with HOVER. rsl_rl now has it, but as far as I know these changes have not been reflected in IsaacLab, which still uses rsl-rl-lib 2.3.1.
But this PR was mostly about including the new library so that we can use the multi-GPU support. I know it's not super clean, but before spending too much time on it I wanted to know if you have any advice on how to proceed.
I'm very open to suggestions and I can work on it, so if you have ideas please share✌️
Thanks for mentioning the detail. Yeah, this works for me. Your PR is still the main piece. We can worry about the Lab conversion later.
It is possible to train on a node with multiple GPUs, or on multiple nodes each with multiple GPUs.
To train the teacher on a machine with 4 GPUs:

```bash
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 scripts/rsl_rl/train_teacher.py --num_envs=1024 --headless --distributed
```
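When launched this way, `torch.distributed.run` sets a `LOCAL_RANK` environment variable for each spawned worker, and the trainer must bind each process to the matching GPU so the four workers don't all land on `cuda:0`. A minimal sketch of such a device check (the function name is hypothetical, not from the PR):

```python
import os


def resolve_device(distributed: bool, default: str = "cuda:0") -> str:
    """Pick the per-process device for a torch.distributed.run launch.

    torch.distributed.run sets LOCAL_RANK for each spawned worker;
    mapping it to cuda:<rank> keeps exactly one process per GPU.
    """
    if distributed:
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        return f"cuda:{local_rank}"
    return default
```

In a single-process run the default device is used unchanged; only the `--distributed` path consults `LOCAL_RANK`.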
`train_teacher.py` -> `train_teacher_policy.py`
Can you please also update to use `joint_friction_coeff` here?
On another note, several unit tests are failing in the core, inference_env and isaac_lab_wrapper folders. Could you please take a look?
```diff
-# Add max_grad_norm to the configuration
-# As it has not been added to isaaclab yet
-max_grad_norm: float = MISSING
+pass
```
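The gradient cap under discussion is a global-norm clip applied to the gradients before the optimizer step. A plain-Python sketch of the idea, mirroring what `torch.nn.utils.clip_grad_norm_` does (the helper name and flat-list representation are illustrative, not the library's API):

```python
import math


def clip_grad_norm(grads, max_grad_norm):
    """Scale a flat list of gradient values so their global L2 norm
    does not exceed max_grad_norm.

    Returns (clipped_grads, total_norm), where total_norm is the norm
    measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_grad_norm:
        # Uniformly rescale so the clipped norm is ~max_grad_norm;
        # the small epsilon guards against division issues near zero.
        scale = max_grad_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total_norm
```

Gradients whose global norm is already below the cap pass through unchanged, which is why the cap stabilizes distillation without biasing small updates.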
Hey there! I checked the tests.
Description
This pull request introduces significant improvements to the project by fully integrating the latest configuration of the RSL-RL library. The update removes the need for a custom version, enabling a more streamlined and modular workflow.
Summary of Changes
🆕 New Files

- `scripts/rsl_rl/cli_args.py` and `scripts/rsl_rl/student_policy_cfg.py` to support the new configuration workflow.

✏️ Modified Files

- `README.md`: Updated to reflect the latest usage instructions and improvements.
- `neural_wbc/`: Modified multiple components, including environment wrappers, inference environments, and policy trainers, to align with the updated RSL-RL interface.
- `scripts/rsl_rl/`: Revised training and evaluation scripts to support the latest features of the library.
- `pyproject.toml` & `requirements.txt`: Updated dependencies to match the new setup.

🗑️ Removed Files

- `third_party/rsl_rl/`, as the project now uses the upstream RSL-RL library directly.

Purpose

The main goals of this update are:
Testing
Let me know if there are any concerns or suggestions for further improvement.