
Commit b9e0a91

Idate96 committed

Split complete guide into separate sections with improved navigation

- Created getting-started.md (sections 1-3)
- Created data-management.md (section 4)
- Created python-environments.md (sections 5 & 8)
- Created computing-guide.md (sections 6-7)
- Updated mkdocs.yml with hierarchical navigation
- Redesigned index.md with quick access to all sections

1 parent 6ef1da3, commit b9e0a91

13 files changed: +2800 / -23 lines

docs/complete-guide.md

Lines changed: 1331 additions & 3 deletions
Large diffs are not rendered by default.

docs/computing-guide.md

Lines changed: 622 additions & 0 deletions
Large diffs are not rendered by default.

docs/data-management.md

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# Data Management on Euler

Effective data management is critical when working on the Euler Cluster, particularly for machine learning workflows that involve large datasets and model outputs. This section explains the available storage options and their proper usage.

---

## 📁 Home Directory (`/cluster/home/$USER`)

- **Quota**: 45 GB
- **Inodes**: ~450,000 files
- **Persistence**: Permanent (not purged)
- **Use Case**: Ideal for storing source code, small configuration files, scripts, and lightweight development tools.

---
## ⚡ Scratch Directory (`/cluster/scratch/$USER` or `$SCRATCH`)

- **Quota**: 2.5 TB
- **Inodes**: 1 M
- **Persistence**: Temporary (data is deleted if not accessed for ~15 days)
- **Use Case**: For storing datasets and temporary training outputs.
- **Recommended dataset storage format**: Use **tar/zip/[HDF5](https://www.hdfgroup.org/solutions/hdf5/)/[WebDataset](https://github.com/webdataset/webdataset)** so that a dataset occupies only a few inodes (see the packing sketch below).
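
For example, many small files can be packed into a single archive before being copied to `$SCRATCH`. A minimal sketch, assuming a local dataset directory named `my_dataset/` (the directory name and username are placeholders):

```bash
# Pack the dataset into one archive so it occupies a single inode on scratch
tar -czf my_dataset.tar.gz my_dataset/

# Copy the archive to your scratch space on Euler
scp my_dataset.tar.gz <your_nethz_username>@euler.ethz.ch:/cluster/scratch/<your_nethz_username>/
```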
---
## 📦 Project Directory (`/cluster/project/rsl/$USER`)

- **Quota**: ≤ 75 GB
- **Inodes**: ~300,000
- **Use Case**: Conda environments, software packages

---

## 📂 Work Directory (`/cluster/work/rsl/$USER`)

- **Quota**: ≤ 150 GB
- **Inodes**: ~30,000
- **Use Case**: Saving results, large output files, tar files, Singularity images. Avoid storing too many small files.

> In exceptional cases we can approve more storage space. For this, ask your supervisor to contact `patelm@ethz.ch`.

## 📂 Local Scratch Directory (`$TMPDIR`)

- **Quota**: up to 800 GB
- **Inodes**: Very high
- **Use Case**: Datasets and containers for a training run.

## ❗ Quota Violations

- You will receive an email if you exceed any of the above limits.
- You can type `lquota` in the terminal to check your used storage space for the `Home` and `Scratch` directories.
- To check your usage of the `Project` and `Work` directories, run:

```bash
# Print the report header plus the line for your own user
(head -n 5 && grep -w $USER) < /cluster/work/rsl/.rsl_user_data_usage.txt
(head -n 5 && grep -w $USER) < /cluster/project/rsl/.rsl_user_data_usage.txt
```

Note: this does not show the per-user quota limits enforced by RSL! Refer to the table below for those limits.

### 🎯 FAQ: What is the difference between the `Project` and `Work` directories, and why is it necessary to use both?

Both `Project` and `Work` are persistent storage (the data is not deleted automatically), but their use cases differ. When you have many small files, for example conda environments, store them in the `Project` directory, which allows a higher number of inodes. When you have larger files, such as model checkpoints, Singularity containers, and results, store them in the `Work` directory, which offers more storage capacity.
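
As an illustration only (a minimal sketch; the [Python Environments](python-environments.md) section describes the recommended setup, and the environment name `myenv` is a placeholder), a conda environment can be placed under `Project` with the `--prefix` option:

```bash
# Create the environment under the Project directory instead of the home directory
conda create --prefix /cluster/project/rsl/$USER/envs/myenv python=3.10

# Activate it by its full path
conda activate /cluster/project/rsl/$USER/envs/myenv
```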

### 🎯 FAQ: What is the Local Scratch Directory (`$TMPDIR`)?

Whenever you run a compute job, you can also request a certain amount of local scratch space (`$TMPDIR`), which allocates space on a local hard drive. The main advantage of local scratch is that it sits directly inside the compute node rather than being attached via the network. It is therefore highly recommended to copy your Singularity container and datasets to `$TMPDIR` and use them from there during training. Detailed training workflows are provided later in this guide.
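
As a rough illustration only (the training workflows later in this guide are the reference; the archive name, resource values, and `train.py` are placeholders), a batch job can request local scratch with SLURM's `--tmp` option and stage its data there before training:

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH --time=04:00:00
#SBATCH --tmp=100G   # request ~100 GB of node-local scratch, exposed as $TMPDIR

# Stage the dataset from network scratch to the node-local disk
cp $SCRATCH/my_dataset.tar.gz $TMPDIR/
tar -xzf $TMPDIR/my_dataset.tar.gz -C $TMPDIR

# Train while reading data from the fast local disk
python train.py --data_dir $TMPDIR/my_dataset
```

Such a script would be submitted with `sbatch`; the [Computing guide](computing-guide.md) covers GPU selection and complete job scripts.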
---

## 📊 Summary Table of Storage Locations

| Storage Location | Max Inodes | Max Size per User | Purged | Recommended Use Case |
|------------------|------------|-------------------|--------|----------------------|
| `/cluster/home/$USER` | ~450,000 | 45 GB | No | Code, config, small files |
| `/cluster/scratch/$USER` | 1 M | 2.5 TB | Yes (not accessed for ~15 days) | Datasets, training data, temporary usage |
| `/cluster/project/rsl/$USER` | ~300,000 | 75 GB | No | Conda envs, software packages |
| `/cluster/work/rsl/$USER` | ~30,000 | 150 GB | No | Large result files, model checkpoints, Singularity containers |
| `$TMPDIR` | Very high | Up to 800 GB | Yes (at end of job) | Training datasets, Singularity images |

---

## 💡 Best Practices

1. **Use the right storage for the right purpose** - Don't waste home directory space on large files
2. **Compress datasets** - Use tar/zip to reduce inode usage
3. **Clean up regularly** - Remove old data from scratch before it's auto-deleted (see the sketch below)
4. **Monitor your usage** - Check quotas regularly with `lquota`
5. **Use `$TMPDIR` for active jobs** - Copy data to local scratch for faster I/O during computation
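
A small sketch for points 3 and 4, assuming the ~15-day purge window described above (the 10-day threshold is only an example):

```bash
# Check your current home and scratch usage against the quotas
lquota

# List files on scratch that have not been accessed for more than 10 days
# (candidates for archiving or deletion before the ~15-day purge)
find $SCRATCH -type f -atime +10
```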

docs/getting-started.md

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
# Getting Started with Euler

This guide helps new users access and begin working on the **Euler Cluster** at ETH Zurich, specifically for members of the **RSL group (es_hutter)**.

## 📌 Table of Contents

1. [Access Requirements](#access-requirements)
2. [Connecting to Euler via SSH](#connecting-to-euler-via-ssh)
    - [Basic Login](#basic-login)
    - [Setting Up SSH Keys](#setting-up-ssh-keys-recommended)
    - [Using an SSH Config File](#using-an-ssh-config-file)
3. [Verifying Access to the RSL Shareholder Group](#verifying-access-to-the-rsl-shareholder-group)

---

## ✅ Access Requirements

To get access to the cluster, please fill out the following [form](https://forms.gle/UsiGkXUmo9YyNHsH8). If you are a member of RSL, message Manthan Patel directly to be added to the cluster. Access is approved twice a week (on Tuesdays and Fridays).

Before proceeding, make sure you have:

- A valid **nethz username and password** (ETH Zurich credentials)
- Access to a **terminal** (Linux/macOS or Git Bash on Windows)
- (Optional) Some familiarity with command-line tools

---

## 🔐 Connecting to Euler via SSH

You'll connect to Euler using the Secure Shell (SSH) protocol. This allows you to log into a remote machine securely from your local computer.

---

### Basic Login

To log into the Euler cluster, open a terminal and type:

```bash
ssh <your_nethz_username>@euler.ethz.ch
```

Replace `<your_nethz_username>` with your actual ETH Zurich login.

You will be asked to enter your ETH Zurich password. If the login is successful, you'll be connected to a login node on the Euler cluster.

---

### Setting Up SSH Keys (Recommended)

To avoid typing your password every time and to increase security, it is recommended to use SSH key-based authentication.

#### Step-by-Step Instructions:

1. **Generate an SSH key pair** on your local machine (if not already created):

    ```bash
    ssh-keygen -t ed25519 -C "<your_email>@ethz.ch"
    ```

    - Press Enter to accept the default file location (usually `~/.ssh/id_ed25519`).
    - When prompted for a passphrase, you can choose to set one or leave it empty (if you set one, see the ssh-agent sketch after these steps).

2. **Copy your public key to Euler** using this command:

    ```bash
    ssh-copy-id <your_nethz_username>@euler.ethz.ch
    ```

    - You'll be asked to enter your ETH password one last time.
    - This command installs your public key in the `~/.ssh/authorized_keys` file on Euler.

Now you should be able to log in without typing your password.
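
If you did set a passphrase, an SSH agent saves you from retyping it on every connection. A minimal sketch using standard OpenSSH commands (adjust the key path if you chose a different one):

```bash
# Start an agent for the current shell session (if one is not already running)
eval "$(ssh-agent -s)"

# Add your key once; the passphrase is cached for the rest of the session
ssh-add ~/.ssh/id_ed25519
```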
---

### Using an SSH Config File

To make your SSH workflow easier, especially if you frequently access Euler, create or edit the `~/.ssh/config` file on your local machine.

#### Example Configuration:

```sshconfig
Host euler
    HostName euler.ethz.ch
    User <your_nethz_username>
    Compression yes
    ForwardX11 yes
    IdentityFile ~/.ssh/id_ed25519
```

- Replace `<your_nethz_username>` with your actual ETH username.
- Save and close the file.

Now, instead of typing the full SSH command, you can simply connect using:

```bash
ssh euler
```

---

## 🧾 Verifying Access to the RSL Shareholder Group

Once you are logged into the Euler cluster, it's important to confirm that you have been added to the appropriate shareholder group. This ensures you can access the computing resources allocated to your research group (in this case, the RSL group).

---

### 🔍 How to Check Your Group Membership

1. While connected to Euler (after logging in via SSH), run the following command in the terminal:

    ```bash
    my_share_info
    ```

2. If everything is correctly set up, you should see output similar to the following:

    ```
    You are a member of the es_hutter shareholder group on Euler.
    ```

3. This message confirms that you are part of the `es_hutter` group, which is the shareholder group for the RSL lab.

4. Create your user directories for storage using the following commands:

    ```bash
    mkdir /cluster/project/rsl/$USER
    mkdir /cluster/work/rsl/$USER
    ```

---

### ❗ If You Do NOT See This Message:

- Double-check with your supervisor whether you've been added to the group.
- It may take a few hours after being added for the change to propagate.

---

## Next Steps

Once you have verified your access:

- Learn about [Data Management](data-management.md) on Euler
- Set up [Python Environments](python-environments.md)
- Start [Computing](computing-guide.md) with interactive sessions or batch jobs

docs/index.md

Lines changed: 34 additions & 19 deletions
@@ -1,24 +1,39 @@
**Unchanged:**

# RSL Euler Cluster Guide

**Removed:**

## 📚 Documentation

### Complete Guide
**[Open Complete Guide →](complete-guide.md)**

The complete guide contains:
- Access Requirements
- Connecting to Euler via SSH
- Verifying RSL Group Membership
- Data Management on Euler
- Setting Up Miniconda Environments
- Interactive Sessions
- Sample Sbatch Scripts
- Sample Training Workflow
- Container Workflow
- Useful Links

### Other Resources
- **[Container Workflow](container-workflow.md)** - Docker/Singularity detailed guide

**Added:**

## 🚀 Quick Access to All Sections

### 1. Getting Started
**[Access Requirements, SSH Setup, Verification →](getting-started.md)**
- Getting cluster access
- Setting up SSH connection
- Verifying RSL group membership

### 2. Data Management
**[Storage Locations and Quotas →](data-management.md)**
- Home, Scratch, Project, Work directories
- Storage quotas and best practices
- Using local scratch ($TMPDIR)

### 3. Python Environments & ML Training
**[Miniconda Setup and Training Workflows →](python-environments.md)**
- Installing and managing Miniconda
- Creating conda environments
- Complete ML training workflow

### 4. Computing on Euler
**[Interactive Sessions and Batch Jobs →](computing-guide.md)**
- Requesting interactive sessions
- Writing and submitting SLURM job scripts
- GPU selection and multi-GPU training

### 5. Container Workflow
**[Docker/Singularity Guide →](container-workflow.md)**
- Building Docker containers
- Converting to Singularity
- Running containerized jobs

### 📚 Additional Resources
- **[Complete Reference Guide](complete-guide.md)** - All sections in one document

**Unchanged:**

- **[Scripts Library](scripts.md)** - Ready-to-use job scripts
- **[Troubleshooting](troubleshooting.md)** - Common issues and solutions
