
L46 Lab 1: Profiling and Optimizing PyTorch Training

Important: Students must not use the AI assistant from Lightning AI.

Overview

This lab focuses on profiling and optimizing a PyTorch training script for a small language model (SmolLM-360M). You will learn to:

  • Use TensorBoard to profile PyTorch training loops
  • Identify performance bottlenecks in deep learning code
  • Optimize training code to improve throughput and efficiency
  • Document your findings with metrics and visualizations

Listen

First, simply listen to the instructions! No code, no web browsing yet!

Setup

1. Lightning AI Account and Session

  1. Create a Lightning AI account and log into your account.
  2. Create a new Studio using an L4 (24GB GPU) instance.
    ⚠️ Make sure you are not running a more expensive instance.
  3. Connect to your Studio and open a terminal (top-left → Terminal).
  4. In the terminal, clone the repository:
    git clone https://github.com/camlsys/L46_lab1_profiler.git
  5. Move into the repository:
    cd L46_lab1_profiler

2. Running the Baseline Script

The training script (train.py) fine-tunes SmolLM-360M on a subset of the Smol-Smoltalk dataset.

# Install dependencies
pip install -r requirements.txt

# Run the baseline training script
python train.py

Several arguments let you change the duration of profiling or training. Please refer to train.py for more information.

3. TensorBoard Documentation

The teacher will give you a very short live introduction to the TensorBoard profiler. The official TensorBoard profiler documentation covers everything else you need.

4. Viewing TensorBoard Profiling

After the script finishes, a tb_profiler/ directory containing profiling traces will be created.

  1. In your Studio, open the Port Viewer from the right-side panel.
  2. Click + New Port (top right).
  3. Set:
    • Name: tensorboard
    • Port: 6006
  4. Click Display, then return to your Studio (VS Code icon).
  5. Start TensorBoard in the terminal:
    tensorboard --logdir tb_profiler
  6. Go back to the Port Viewer and open the port you created in a new browser tab.

Some notes:

  • When switching between TensorBoard views (e.g., Overview, Trace, GPU), it may take a few seconds for the page to fully load — this is normal.
  • If TensorBoard does not load after ~30 seconds, stop the command and restart it.

Part 1: Profiling and Identifying Issues (45 minutes)

Objectives

  1. Collect a trace and verify TensorBoard is working correctly
  2. Define relevant metrics. You will need to document why and how they are relevant when reporting your results later on.
  3. Identify performance bottlenecks in the trace. Look for:
    • Host-device synchronization (CPU-GPU stalls)
    • Precision and dtype issues (unnecessary float32 operations) - are Tensor Cores properly used?
    • Memory copies (unnecessary data transfers)
    • Unfused operations (vectorization opportunities) - This is the biggest slowdown
    • I/O stalls (data loading bottlenecks)
    • (optional) Flash attention - is it used?
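The traces you will inspect are produced by PyTorch's built-in profiler. As a point of reference, a minimal sketch of the kind of setup train.py likely uses (the model, step counts, and schedule values here are illustrative, not copied from the script):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

# Skip 1 step, warm up for 1 step, then actively record 3 steps.
prof_schedule = schedule(wait=1, warmup=1, active=3)

with profile(
    activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on a GPU instance
    schedule=prof_schedule,
    on_trace_ready=tensorboard_trace_handler("tb_profiler"),
    record_shapes=True,
) as prof:
    for step in range(5):
        x = torch.randn(64, 512)
        loss = model(x).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
        prof.step()  # tell the profiler one training step has finished
```

Running this writes a trace under tb_profiler/, which is exactly what TensorBoard's PyTorch Profiler plugin reads.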

Deliverables

Document each problem you find with:

  • Screenshot of the relevant trace section
  • Explanation of what the issue is
  • Impact on metrics

Part 2: Optimization (45 minutes)

Optimization Tasks

Fix the issues that you have found. We do not expect you to find and fix all of them; select the important ones and address the underlying problem.

Usually, fusion, precision, copy, and synchronization problems are the most impactful issues.
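As a hedged illustration of the kinds of fixes these categories call for (the choice of bfloat16 and the use of torch.compile are assumptions; the right fix depends on what your trace actually shows):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(64, 512, device=device)

# Precision: run matmuls in reduced precision so Tensor Cores can engage.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)  # y.dtype is torch.bfloat16 inside the autocast region

# Copies / synchronization: pin host memory and copy asynchronously so
# host-to-device transfers can overlap with compute.
batch = torch.randn(64, 512)
if device == "cuda":
    batch = batch.pin_memory()
batch = batch.to(device, non_blocking=True)

# Fusion: let the compiler fuse chains of elementwise ops into fewer kernels.
fast_model = torch.compile(model)  # compiles lazily on the first call
```

Measure before and after each change; an optimization that does not move your chosen metrics is not worth keeping in the report.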

Deliverables

For each optimization:

  • Before/after metrics (throughput, timing, VRAM, etc.)
  • Screenshot of the trace showing the improvement
  • Explanation of what changed and why it improved performance

Part 3: Report Writing (15 minutes)

Report Structure

  1. Methodology:
    • Metrics you chose and why they're relevant
  2. Findings:
    • List of identified bottlenecks with screenshots
    • Impact quantification for each issue
  3. Optimizations:
    • Description of each optimization
    • Before/after comparison with metrics
    • Screenshots showing improvements
  4. Conclusion: Summary of improvements.

Side Quest

You can structure your optimization section such that it answers specific questions like:

  • Why is vectorization so important for performance?
  • What are the trade-offs of mixed precision training?
  • How does batch size affect Tensor Core utilization?
  • When is it beneficial to pre-load data to VRAM?
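To build intuition for the vectorization question, a tiny CPU-side comparison (timings vary by machine; this is illustrative only and unrelated to the lab's actual code):

```python
import time
import torch

x = torch.randn(10_000)

# Unvectorized: one Python-level operation per element.
t0 = time.perf_counter()
slow = torch.empty_like(x)
for i in range(x.numel()):
    slow[i] = x[i] * 2.0 + 1.0
t_loop = time.perf_counter() - t0

# Vectorized: a single expression over the whole tensor.
t0 = time.perf_counter()
fast = x * 2.0 + 1.0
t_vec = time.perf_counter() - t0

assert torch.allclose(slow, fast)
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.6f}s")
```

The same effect, amplified, is what you see on the GPU: many tiny kernel launches versus one large fused kernel.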

Command-Line Arguments

The training script supports the following arguments:

python train.py \
    --num_epochs 1 \
    --max_train_steps 200 \
    --nb_profile_steps 3 \
    --warmup_steps 50 \
    --batch_size 8 \
    --max_examples 10000
  • --num_epochs: Number of training epochs
  • --max_train_steps: Maximum number of training steps
  • --nb_profile_steps: Number of steps to actively profile
  • --warmup_steps: Number of warmup steps before profiling
  • --batch_size: Training batch size
  • --max_examples: Maximum number of training examples to load
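A sketch of how such a command line is typically wired up with argparse (the flag names match the list above, but the defaults are illustrative; check train.py for the real ones):

```python
import argparse

parser = argparse.ArgumentParser(description="Profile and fine-tune SmolLM-360M")
parser.add_argument("--num_epochs", type=int, default=1)
parser.add_argument("--max_train_steps", type=int, default=200)
parser.add_argument("--nb_profile_steps", type=int, default=3)
parser.add_argument("--warmup_steps", type=int, default=50)
parser.add_argument("--batch_size", type=int, default=8)
parser.add_argument("--max_examples", type=int, default=10000)

# Example invocation: override just the batch size.
args = parser.parse_args(["--batch_size", "16"])
print(args.batch_size)  # 16
```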

Tips for Using TensorBoard

  • Trace View: Use the timeline to see when operations occur
  • Operator View: See which operations take the most time
  • GPU Kernel View: Check Tensor Core usage and kernel efficiency
  • Memory View: Analyze VRAM usage patterns
  • Zoom and Pan: Use these to focus on specific time ranges

Files in This Repository

  • train.py: Main training script with profiling integration
  • modeling.py: Local implementation of Llama model (contains intentional slowdowns)
  • utils.py: Dataset loading and utility functions
  • requirements.txt: Python dependencies

Getting Help

  • Ask your teacher for hints (not solutions!) if you're stuck; the lab is very short, so don't stay blocked for long!
  • Focus on understanding why optimizations work, not just implementing them
  • Use TensorBoard's documentation if needed

Good luck with the lab!
