Commit 271cd39

author
Idate96
committed
Update scripts.md with new section-based organization
1 parent 1ae413a commit 271cd39

1 file changed

+85
-63
lines changed

docs/scripts.md

Lines changed: 85 additions & 63 deletions
Original file line number | Diff line number | Diff line change
@@ -1,44 +1,62 @@
11
# Scripts Library
22

3-
This page contains ready-to-use scripts for the Euler cluster container workflow.
3+
Ready-to-use scripts for the Euler cluster, organized by workflow section. All scripts have been tested on the Euler cluster with the RSL group allocation.
44

5-
## Test Scripts
5+
## 📁 Scripts Organization
66

7-
All scripts have been tested on the Euler cluster with the RSL group allocation.
7+
```
8+
scripts/
9+
├── getting-started/ # Initial setup scripts
10+
├── data-management/ # Storage and quota management
11+
├── python-environments/ # ML training examples
12+
├── computing-guide/ # Job submission templates
13+
└── container-workflow/ # Container deployment scripts
14+
```
815

9-
### Python Test Script
16+
## 🚀 Getting Started Scripts
1017

11-
**[hello_cluster.py](scripts/hello_cluster.py)**
18+
### Setup Verification
19+
**[test_group_membership.sh](scripts/getting-started/test_group_membership.sh)**
1220

13-
A comprehensive GPU test script that:
14-
- Detects available GPUs and CUDA version
15-
- Performs matrix multiplication on GPU
16-
- Saves results to output directory
17-
- Reports system information
21+
Verifies RSL group membership and creates all necessary directories:
22+
```bash
23+
wget https://raw.githubusercontent.com/leggedrobotics/euler-cluster-guide/main/docs/scripts/getting-started/test_group_membership.sh
24+
bash test_group_membership.sh
25+
```
1826

19-
### Docker Configuration
27+
## 💾 Data Management Scripts
2028

21-
**[Dockerfile](scripts/Dockerfile)**
29+
### Storage Quota Check
30+
**[test_storage_quotas.sh](scripts/data-management/test_storage_quotas.sh)**
2231

23-
A minimal GPU-enabled Docker image with:
24-
- CUDA 11.8 runtime
25-
- PyTorch 2.0.1 with CUDA support
26-
- Python 3.10
32+
Comprehensive storage verification script that:
33+
- Checks all storage paths and creates missing directories
34+
- Displays current usage and quotas
35+
- Tests `$TMPDIR` functionality in job context
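
The checks listed above could be sketched roughly as follows. This is an illustrative sketch, not the contents of `test_storage_quotas.sh`: the helper names `ensure_dir` and `check_writable` are hypothetical, and the directories to check are passed as arguments rather than hard-coded.

```bash
#!/bin/bash
# Illustrative sketch only; not the contents of test_storage_quotas.sh.
# ensure_dir and check_writable are hypothetical helper names.

# Report whether a directory exists, creating it if it is missing.
ensure_dir() {
    local dir="$1"
    if [ -d "$dir" ]; then
        echo "OK $dir"
    elif mkdir -p "$dir" 2>/dev/null; then
        echo "CREATED $dir"
    else
        echo "FAILED $dir" >&2
        return 1
    fi
}

# Return success if a temporary file can be created (and removed) in the directory.
check_writable() {
    local dir="$1" probe
    probe=$(mktemp "$dir/probe.XXXXXX" 2>/dev/null) || return 1
    rm -f "$probe"
}

# Check every directory passed on the command line.
for dir in "$@"; do
    ensure_dir "$dir"
done

# Inside a job, $TMPDIR is expected to point at local scratch; test it when set.
if [ -n "${TMPDIR:-}" ]; then
    if check_writable "$TMPDIR"; then
        echo "TMPDIR writable: $TMPDIR"
    else
        echo "TMPDIR not writable: $TMPDIR" >&2
    fi
fi

# lquota is the quota command used elsewhere on this page; skip it when unavailable.
if command -v lquota >/dev/null 2>&1; then
    lquota
fi
```

Saved under an assumed name such as `check_storage.sh`, it might be invoked as `bash check_storage.sh /cluster/project/rsl/$USER /cluster/work/rsl/$USER /cluster/scratch/$USER`, using the storage paths that appear in the job templates on this page.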
2736

28-
### SLURM Job Script
37+
## 🐍 Python & ML Training Scripts
2938

30-
**[test_job_project.sh](scripts/test_job_project.sh)**
39+
### ML Training Example
40+
**[fake_train.py](scripts/python-environments/fake_train.py)** | **[test_full_training_job.sh](scripts/python-environments/test_full_training_job.sh)**
3141

32-
Optimized job submission script that:
33-
- Extracts container to local scratch for performance
34-
- Allocates GPU resources
35-
- Saves results to project partition
36-
- Reports timing information
42+
Complete ML training workflow example including:
43+
- Simulated training with checkpointing
44+
- Progress tracking and logging
45+
- Resource monitoring
46+
- Proper use of local scratch for data
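
As a rough illustration of the checkpointing idea, the sketch below simulates a training loop that resumes from the last completed epoch. It is not the contents of `fake_train.py` or `test_full_training_job.sh`: the `last_epoch` and `train` functions, the `last_epoch` checkpoint file, and the `CKPT_DIR` and `TOTAL_EPOCHS` variables are all assumptions made for the example.

```bash
#!/bin/bash
# Illustrative sketch of checkpoint-and-resume; not the contents of fake_train.py.
# CKPT_DIR, TOTAL_EPOCHS, and the last_epoch file name are assumptions.

CKPT_DIR="${CKPT_DIR:-./checkpoints}"
TOTAL_EPOCHS="${TOTAL_EPOCHS:-5}"

# Print the last completed epoch recorded in the checkpoint directory, or 0 if none.
last_epoch() {
    if [ -f "$1/last_epoch" ]; then
        cat "$1/last_epoch"
    else
        echo 0
    fi
}

# Run the remaining epochs, recording a checkpoint after each one.
train() {
    local dir="$1" total="$2" epoch
    mkdir -p "$dir" || return 1
    epoch=$(last_epoch "$dir")
    while [ "$epoch" -lt "$total" ]; do
        epoch=$((epoch + 1))
        echo "epoch $epoch/$total"
        # Write to a temporary file, then rename, so an interrupted job
        # never leaves a partially written checkpoint behind.
        echo "$epoch" > "$dir/last_epoch.tmp" || return 1
        mv "$dir/last_epoch.tmp" "$dir/last_epoch" || return 1
    done
}

train "$CKPT_DIR" "$TOTAL_EPOCHS"
```

Running the script a second time with the same `CKPT_DIR` prints nothing, because every epoch is already recorded. In a real job the checkpoint directory would presumably live on persistent storage rather than in `$TMPDIR`, so that it survives the end of the job.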
3747

38-
## Additional Examples
48+
## 💻 Computing Scripts
3949

40-
### Multi-GPU Training Script
50+
### Basic Job Templates
4151

52+
- **[test_cpu_job.sh](scripts/computing-guide/test_cpu_job.sh)** - Basic CPU job submission
53+
- **[test_gpu_job.sh](scripts/computing-guide/test_gpu_job.sh)** - GPU allocation test
54+
- **[test_gpu_specific.sh](scripts/computing-guide/test_gpu_specific.sh)** - Request specific GPU type (RTX 4090)
55+
- **[test_array_job.sh](scripts/computing-guide/test_array_job.sh)** - Array job for parameter sweeps
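
For the parameter-sweep case, one common pattern is to map `SLURM_ARRAY_TASK_ID` onto a grid of values. The sketch below is a minimal illustration under stated assumptions, not the contents of `test_array_job.sh`: the sweep values, the `params_for_task` helper, and the 0-5 array range are hypothetical.

```bash
#!/bin/bash
#SBATCH --array=0-5
#SBATCH --time=1:00:00

# Illustrative sketch only; not the contents of test_array_job.sh.
# The sweep values below are hypothetical; 3 x 2 = 6 combinations match --array=0-5.
LEARNING_RATES=(0.1 0.01 0.001)
BATCH_SIZES=(32 64)

# Print "<learning rate> <batch size>" for a zero-based task index.
params_for_task() {
    local idx="$1"
    local n_bs=${#BATCH_SIZES[@]}
    echo "${LEARNING_RATES[idx / n_bs]} ${BATCH_SIZES[idx % n_bs]}"
}

# Fall back to index 0 when run outside an array job.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"
set -- $(params_for_task "$TASK_ID")
LR="$1"
BS="$2"
echo "Task $TASK_ID: lr=$LR batch_size=$BS"
```

Each array task then sees a distinct combination, so the training command at the end of the job can take `$LR` and `$BS` as arguments.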
56+
57+
### Advanced Templates
58+
59+
#### Multi-GPU Training
4260
```bash
4361
#!/bin/bash
4462
#SBATCH --job-name=multi-gpu-train
@@ -53,7 +71,7 @@ Optimized job submission script that:
5371

5472
module load eth_proxy
5573

56-
# Extract container
74+
# Extract container to local scratch
5775
tar -xf /cluster/work/rsl/$USER/containers/training.tar -C $TMPDIR
5876

5977
# Run distributed training
@@ -67,51 +85,29 @@ singularity exec \
6785
train.py --distributed
6886
```
6987

70-
### Interactive Development Session
71-
88+
#### Interactive Development Session
7289
```bash
7390
# Request interactive GPU session
7491
srun --gpus=1 --mem=32G --tmp=50G --time=2:00:00 --pty bash
7592

76-
# Extract container
93+
# In the session, extract and use container
7794
tar -xf /cluster/work/rsl/$USER/containers/dev.tar -C $TMPDIR
7895

79-
# Enter container shell
8096
singularity shell --nv \
8197
--bind /cluster/project/rsl/$USER:/project \
8298
--bind /cluster/scratch/$USER:/data \
8399
$TMPDIR/dev.sif
84100
```
85101

86-
### Batch Processing Script
102+
## 📦 Container Workflow Scripts
87103

88-
```bash
89-
#!/bin/bash
90-
#SBATCH --array=1-100
91-
#SBATCH --job-name=batch-process
92-
#SBATCH --output=logs/job_%A_%a.out
93-
#SBATCH --error=logs/job_%A_%a.err
94-
#SBATCH --time=1:00:00
95-
#SBATCH --gpus=1
96-
#SBATCH --tmp=50G
97-
98-
module load eth_proxy
99-
100-
# Extract container once
101-
tar -xf /cluster/work/rsl/$USER/containers/processor.tar -C $TMPDIR
102-
103-
# Process specific file based on array index
104-
singularity exec --nv \
105-
--bind /cluster/scratch/$USER/input:/input:ro \
106-
--bind /cluster/project/rsl/$USER/output:/output \
107-
$TMPDIR/processor.sif \
108-
python3 process.py --file /input/data_${SLURM_ARRAY_TASK_ID}.txt
109-
```
110-
111-
## Helper Scripts
112-
113-
### Container Build and Deploy
104+
### Container Test Suite
105+
- **[Dockerfile](scripts/container-workflow/Dockerfile)** - GPU-enabled Docker image with CUDA 11.8
106+
- **[hello_cluster.py](scripts/container-workflow/hello_cluster.py)** - GPU functionality test
107+
- **[test_job_project.sh](scripts/container-workflow/test_job_project.sh)** - Complete container job
108+
- **[test_container_extraction.sh](scripts/container-workflow/test_container_extraction.sh)** - Extraction timing test
114109

110+
### Build and Deploy Helper
115111
```bash
116112
#!/bin/bash
117113
# build_and_deploy.sh
@@ -128,7 +124,7 @@ docker build -t ${IMAGE_NAME}:${VERSION} .
128124

129125
# Convert to Singularity
130126
echo "Converting to Singularity..."
131-
APPTAINER_NOHTTPS=1 apptainer build --sandbox --fakeroot \
127+
apptainer build --sandbox --fakeroot \
132128
${IMAGE_NAME}-${VERSION}.sif \
133129
docker-daemon://${IMAGE_NAME}:${VERSION}
134130

@@ -144,8 +140,9 @@ scp ${IMAGE_NAME}-${VERSION}.tar.gz \
144140
echo "Done! Container available as ${IMAGE_NAME}-${VERSION}.tar.gz"
145141
```
146142

147-
### Resource Monitor
143+
## 🔧 Utility Scripts
148144

145+
### Job Resource Monitor
149146
```bash
150147
#!/bin/bash
151148
# monitor_job.sh
@@ -165,7 +162,7 @@ while true; do
165162
echo -e "\n=== Resource Usage ==="
166163
sstat -j $JOB_ID --format=JobID,MaxRSS,MaxDiskRead,MaxDiskWrite
167164

168-
# Get node name
165+
# Get node name and check GPU
169166
NODE=$(squeue -j $JOB_ID -h -o %N)
170167
if [ ! -z "$NODE" ]; then
171168
echo -e "\n=== GPU Usage on $NODE ==="
@@ -176,15 +173,40 @@ while true; do
176173
done
177174
```
178175

179-
## Download All Scripts
176+
### Batch Job Status Check
177+
```bash
178+
#!/bin/bash
179+
# check_jobs.sh
180180

181-
You can download all scripts as a ZIP file or clone the repository:
181+
echo "=== Your Current Jobs ==="
182+
squeue -u $USER --format="%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"
183+
184+
echo -e "\n=== Recently Completed Jobs ==="
185+
sacct -u $USER --starttime=$(date -d '1 day ago' +%Y-%m-%d) \
186+
--format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
187+
188+
echo -e "\n=== Storage Usage ==="
189+
lquota
190+
```
191+
192+
## 📥 Download Scripts
193+
194+
Clone the entire repository to get all scripts:
182195

183196
```bash
184197
git clone https://github.com/leggedrobotics/euler-cluster-guide.git
185198
cd euler-cluster-guide/docs/scripts
199+
200+
# Make all scripts executable
201+
find . -name "*.sh" -type f -exec chmod +x {} \;
202+
```
203+
204+
Or download individual scripts:
205+
```bash
206+
# Example: Download the GPU test job
207+
wget https://raw.githubusercontent.com/leggedrobotics/euler-cluster-guide/main/docs/scripts/computing-guide/test_gpu_job.sh
186208
```
187209

188210
---
189211

190-
[Back to Home](/) | [Container Workflow](/container-workflow) | [Troubleshooting](/troubleshooting)
212+
[Back to Home](/) | [Computing Guide](/computing-guide) | [Container Workflow](/container-workflow) | [Troubleshooting](/troubleshooting)
