# Scripts Library

Ready-to-use scripts for the Euler cluster, organized by workflow section. All scripts have been tested on Euler with the RSL group allocation.

## 📁 Scripts Organization

```
scripts/
├── getting-started/        # Initial setup scripts
├── data-management/        # Storage and quota management
├── python-environments/    # ML training examples
├── computing-guide/        # Job submission templates
└── container-workflow/     # Container deployment scripts
```

## 🚀 Getting Started Scripts

### Setup Verification
**[test_group_membership.sh](scripts/getting-started/test_group_membership.sh)**

Verifies RSL group membership and creates all necessary directories:
```bash
wget https://raw.githubusercontent.com/leggedrobotics/euler-cluster-guide/main/docs/scripts/getting-started/test_group_membership.sh
bash test_group_membership.sh
```

## 💾 Data Management Scripts

### Storage Quota Check
**[test_storage_quotas.sh](scripts/data-management/test_storage_quotas.sh)**

Comprehensive storage verification script that:
- Checks all storage paths and creates missing directories
- Displays current usage and quotas
- Tests `$TMPDIR` functionality in job context
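
For a quick manual check of the same information, a few standard commands suffice (a minimal sketch of what the script automates; paths follow the RSL storage conventions used throughout this guide):

```bash
# Quota summary for your home and scratch
echo "=== Quota overview ==="
lquota

# Usage per RSL storage area
echo -e "\n=== Usage per storage area ==="
du -sh /cluster/scratch/$USER 2>/dev/null
du -sh /cluster/project/rsl/$USER 2>/dev/null
du -sh /cluster/work/rsl/$USER 2>/dev/null
```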

## 🐍 Python & ML Training Scripts

### ML Training Example
**[fake_train.py](scripts/python-environments/fake_train.py)** | **[test_full_training_job.sh](scripts/python-environments/test_full_training_job.sh)**

Complete ML training workflow example including:
- Simulated training with checkpointing
- Progress tracking and logging
- Resource monitoring
- Proper use of local scratch for data
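
To try the example end to end, fetch both files and submit the job script (the raw URLs below are assumed to follow the repository layout used elsewhere on this page):

```bash
# Download the training script and its job submission wrapper
wget https://raw.githubusercontent.com/leggedrobotics/euler-cluster-guide/main/docs/scripts/python-environments/fake_train.py
wget https://raw.githubusercontent.com/leggedrobotics/euler-cluster-guide/main/docs/scripts/python-environments/test_full_training_job.sh

# Submit and watch for the job to start
sbatch test_full_training_job.sh
squeue -u $USER
```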

## 💻 Computing Scripts

### Basic Job Templates

- **[test_cpu_job.sh](scripts/computing-guide/test_cpu_job.sh)** - Basic CPU job submission
- **[test_gpu_job.sh](scripts/computing-guide/test_gpu_job.sh)** - GPU allocation test
- **[test_gpu_specific.sh](scripts/computing-guide/test_gpu_specific.sh)** - Request specific GPU type (RTX 4090)
- **[test_array_job.sh](scripts/computing-guide/test_array_job.sh)** - Array job for parameter sweeps
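
For reference, a minimal job in the same style as these templates looks like this (a sketch, not the verbatim contents of `test_gpu_job.sh`):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --output=gpu_test_%j.out
#SBATCH --time=0:10:00
#SBATCH --gpus=1
#SBATCH --mem-per-cpu=4G

# Report which GPU was allocated
nvidia-smi
```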

### Advanced Templates

#### Multi-GPU Training
```bash
#!/bin/bash
#SBATCH --job-name=multi-gpu-train
#SBATCH --gpus=4            # resource requests are representative; adjust to your job
#SBATCH --mem-per-cpu=8G
#SBATCH --tmp=100G
#SBATCH --time=24:00:00

module load eth_proxy

# Extract container to local scratch
tar -xf /cluster/work/rsl/$USER/containers/training.tar -C $TMPDIR

# Run distributed training (launcher shown is representative)
singularity exec --nv \
    --bind /cluster/project/rsl/$USER:/project \
    --bind /cluster/scratch/$USER:/data \
    $TMPDIR/training.sif \
    torchrun --nproc_per_node=4 \
    train.py --distributed
```

#### Interactive Development Session
```bash
# Request interactive GPU session
srun --gpus=1 --mem=32G --tmp=50G --time=2:00:00 --pty bash

# In the session, extract and use container
tar -xf /cluster/work/rsl/$USER/containers/dev.tar -C $TMPDIR

singularity shell --nv \
    --bind /cluster/project/rsl/$USER:/project \
    --bind /cluster/scratch/$USER:/data \
    $TMPDIR/dev.sif
```

## 📦 Container Workflow Scripts

### Container Test Suite
- **[Dockerfile](scripts/container-workflow/Dockerfile)** - GPU-enabled Docker image with CUDA 11.8
- **[hello_cluster.py](scripts/container-workflow/hello_cluster.py)** - GPU functionality test
- **[test_job_project.sh](scripts/container-workflow/test_job_project.sh)** - Complete container job
- **[test_container_extraction.sh](scripts/container-workflow/test_container_extraction.sh)** - Extraction timing test
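
The flow these scripts exercise, condensed into one job (a sketch assembled from the patterns on this page, not the verbatim contents of `test_job_project.sh`; the container name `test.tar` is hypothetical):

```bash
#!/bin/bash
#SBATCH --job-name=container-test
#SBATCH --gpus=1
#SBATCH --tmp=50G
#SBATCH --time=1:00:00

module load eth_proxy

# Extract the container image to fast local scratch
tar -xf /cluster/work/rsl/$USER/containers/test.tar -C $TMPDIR

# Run the GPU test inside the container, writing results to the project partition
singularity exec --nv \
    --bind /cluster/project/rsl/$USER:/output \
    $TMPDIR/test.sif \
    python3 hello_cluster.py
```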

### Build and Deploy Helper
```bash
#!/bin/bash
# build_and_deploy.sh
# Image name and tag below are illustrative defaults; override via arguments
IMAGE_NAME=${1:-my-image}
VERSION=${2:-v1}

# Build the Docker image locally
echo "Building Docker image..."
docker build -t ${IMAGE_NAME}:${VERSION} .

# Convert to Singularity
echo "Converting to Singularity..."
apptainer build --sandbox --fakeroot \
    ${IMAGE_NAME}-${VERSION}.sif \
    docker-daemon://${IMAGE_NAME}:${VERSION}

# Compress and upload (destination path is illustrative)
echo "Uploading to Euler..."
tar -czf ${IMAGE_NAME}-${VERSION}.tar.gz ${IMAGE_NAME}-${VERSION}.sif
scp ${IMAGE_NAME}-${VERSION}.tar.gz \
    $USER@euler.ethz.ch:/cluster/work/rsl/$USER/containers/

echo "Done! Container available as ${IMAGE_NAME}-${VERSION}.tar.gz"
```

## 🔧 Utility Scripts

### Job Resource Monitor
```bash
#!/bin/bash
# monitor_job.sh - poll a running job's status and resource usage
# Usage: ./monitor_job.sh <job_id>
JOB_ID=$1

while true; do
    echo "=== Job Status ==="
    squeue -j $JOB_ID

    echo -e "\n=== Resource Usage ==="
    sstat -j $JOB_ID --format=JobID,MaxRSS,MaxDiskRead,MaxDiskWrite

    # Get node name and check GPU
    NODE=$(squeue -j $JOB_ID -h -o %N)
    if [ ! -z "$NODE" ]; then
        echo -e "\n=== GPU Usage on $NODE ==="
        ssh $NODE nvidia-smi    # works while you have a job running on the node
    fi

    sleep 30
done
```

### Batch Job Status Check
```bash
#!/bin/bash
# check_jobs.sh

echo "=== Your Current Jobs ==="
squeue -u $USER --format="%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

echo -e "\n=== Recently Completed Jobs ==="
sacct -u $USER --starttime=$(date -d '1 day ago' +%Y-%m-%d) \
    --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS

echo -e "\n=== Storage Usage ==="
lquota
```

## 📥 Download Scripts

Clone the entire repository to get all scripts:

```bash
git clone https://github.com/leggedrobotics/euler-cluster-guide.git
cd euler-cluster-guide/docs/scripts

# Make all scripts executable
find . -name "*.sh" -type f -exec chmod +x {} \;
```

Or download individual scripts:
```bash
# Example: download the GPU test job
wget https://raw.githubusercontent.com/leggedrobotics/euler-cluster-guide/main/docs/scripts/computing-guide/test_gpu_job.sh
```

---

[Back to Home](/) | [Computing Guide](/computing-guide) | [Container Workflow](/container-workflow) | [Troubleshooting](/troubleshooting)