
L46 Lab 1: Profiling and Optimizing PyTorch Training

Important: Students must not use the AI assistant from Lightning AI.

Overview

This lab focuses on profiling and optimizing a PyTorch training script for a small language model (SmolLM-360M). You will learn to:

  • Use TensorBoard to profile PyTorch training loops
  • Identify performance bottlenecks in deep learning code
  • Optimize training code to improve throughput and efficiency
  • Document your findings with metrics and visualizations

Listen

First, simply listen to the instructions! No code, no web browsing yet!

Setup

1. Lightning AI Account and Session

  1. Create a Lightning AI account and log into your account.
  2. Create a new Studio using an L4 (24GB GPU) instance.
    ⚠️ Make sure you are not running a more expensive instance.
  3. Connect to your Studio and open a terminal (top-left → Terminal).
  4. In the terminal, clone the repository:
    git clone https://github.com/camlsys/L46_lab1_profiler.git
  5. Move into the repository:
    cd L46_lab1_profiler

2. Running the Baseline Script

The training script (train.py) fine-tunes SmolLM-360M on a subset of the Smol-Smoltalk dataset.

# Install dependencies
pip install -r requirements.txt

# Run the baseline training script
python train.py

Several arguments let you change the duration of profiling or training. Please refer to train.py for more information.

3. TensorBoard Documentation

The teacher will give you a very short live introduction to the TensorBoard profiler. The official TensorBoard profiler documentation covers everything else you need.

4. Viewing TensorBoard Profiling

After the script finishes, a tb_profiler/ directory containing profiling traces will be created.

  1. In your Studio, open the Port Viewer from the right-side panel.
  2. Click + New Port (top right).
  3. Set:
    • Name: tensorboard
    • Port: 6006
  4. Click Display, then return to your Studio (VS Code icon).
  5. Start TensorBoard in the terminal:
    tensorboard --logdir tb_profiler
  6. Go back to the Port Viewer and open the port you created in a new browser tab.

Some notes:

  • When switching between TensorBoard views (e.g., Overview, Trace, GPU), it may take a few seconds for the page to fully load — this is normal.
  • If TensorBoard does not load after ~30 seconds, stop the command and restart it.

Part 1: Profiling and Identifying Issues (45 minutes)

Objectives

  1. Collect a trace and verify TensorBoard is working correctly
  2. Define relevant metrics. You will need to document why and how they are relevant when reporting your results later on.
  3. Identify performance bottlenecks in the trace. Look for:
    • Host-device synchronization (CPU-GPU stalls)
    • Precision and dtype issues (unnecessary float32 operations) - are Tensor Cores properly used?
    • Memory copies (unnecessary data transfers)
    • Unfused operations (vectorization opportunities) - This is the biggest slowdown
    • I/O stalls (data loading bottlenecks)
    • (optional) Flash attention - is it used?
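The traces you will inspect are produced by PyTorch's built-in profiler. As a point of reference, a minimal sketch of the kind of setup train.py likely uses (the model, step counts, and schedule values here are illustrative, not copied from the script):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

# Skip 1 step, warm up for 1 step, then actively record 3 steps.
prof_schedule = schedule(wait=1, warmup=1, active=3)

with profile(
    activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on a GPU instance
    schedule=prof_schedule,
    on_trace_ready=tensorboard_trace_handler("tb_profiler"),
    record_shapes=True,
) as prof:
    for step in range(5):
        x = torch.randn(64, 512)
        loss = model(x).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
        prof.step()  # tell the profiler one training step has finished
```

Running this writes a trace under tb_profiler/, which is exactly what TensorBoard's PyTorch Profiler plugin reads.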

Deliverables

Document each problem you find with:

  • Screenshot of the relevant trace section
  • Explanation of what the issue is
  • Impact on metrics

Part 2: Optimization (45 minutes)

Optimization Tasks

Fix the issues that you have found. We do not expect you to find and fix all of them; select the important ones and address the underlying problem.

Usually, fusion, precision, copy, and synchronization problems are the most impactful issues.
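As a hedged illustration of the kinds of fixes these categories call for (the choice of bfloat16 and the use of torch.compile are assumptions; the right fix depends on what your trace actually shows):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(64, 512, device=device)

# Precision: run matmuls in reduced precision so Tensor Cores can engage.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)  # y.dtype is torch.bfloat16 inside the autocast region

# Copies / synchronization: pin host memory and copy asynchronously so
# host-to-device transfers can overlap with compute.
batch = torch.randn(64, 512)
if device == "cuda":
    batch = batch.pin_memory()
batch = batch.to(device, non_blocking=True)

# Fusion: let the compiler fuse chains of elementwise ops into fewer kernels.
fast_model = torch.compile(model)  # compiles lazily on the first call
```

Measure before and after each change; an optimization that does not move your chosen metrics is not worth keeping in the report.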

Deliverables

For each optimization:

  • Before/after metrics (throughput, timing, VRAM, etc.)
  • Screenshot of the trace showing the improvement
  • Explanation of what changed and why it improved performance

Part 3: Report Writing (15 minutes)

Report Structure

  1. Methodology:
    • Metrics you chose and why they're relevant
  2. Findings:
    • List of identified bottlenecks with screenshots
    • Impact quantification for each issue
  3. Optimizations:
    • Description of each optimization
    • Before/after comparison with metrics
    • Screenshots showing improvements
  4. Conclusion: Summary of improvements.

Side Quest

You can structure your optimization section such that it answers specific questions like:

  • Why is vectorization so important for performance?
  • What are the trade-offs of mixed precision training?
  • How does batch size affect Tensor Core utilization?
  • When is it beneficial to pre-load data to VRAM?
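To build intuition for the vectorization question, a tiny CPU-side comparison (timings vary by machine; this is illustrative only and unrelated to the lab's actual code):

```python
import time
import torch

x = torch.randn(10_000)

# Unvectorized: one Python-level operation per element.
t0 = time.perf_counter()
slow = torch.empty_like(x)
for i in range(x.numel()):
    slow[i] = x[i] * 2.0 + 1.0
t_loop = time.perf_counter() - t0

# Vectorized: a single expression over the whole tensor.
t0 = time.perf_counter()
fast = x * 2.0 + 1.0
t_vec = time.perf_counter() - t0

assert torch.allclose(slow, fast)
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.6f}s")
```

The same effect, amplified, is what you see on the GPU: many tiny kernel launches versus one large fused kernel.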

Command-Line Arguments

The training script supports the following arguments:

python train.py \
    --num_epochs 1 \
    --max_train_steps 200 \
    --nb_profile_steps 3 \
    --warmup_steps 50 \
    --batch_size 8 \
    --max_examples 10000
  • --num_epochs: Number of training epochs
  • --max_train_steps: Maximum number of training steps
  • --nb_profile_steps: Number of steps to actively profile
  • --warmup_steps: Number of warmup steps before profiling
  • --batch_size: Training batch size
  • --max_examples: Maximum number of training examples to load
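A sketch of how such a command line is typically wired up with argparse (the flag names match the list above, but the defaults are illustrative; check train.py for the real ones):

```python
import argparse

parser = argparse.ArgumentParser(description="Profile and fine-tune SmolLM-360M")
parser.add_argument("--num_epochs", type=int, default=1)
parser.add_argument("--max_train_steps", type=int, default=200)
parser.add_argument("--nb_profile_steps", type=int, default=3)
parser.add_argument("--warmup_steps", type=int, default=50)
parser.add_argument("--batch_size", type=int, default=8)
parser.add_argument("--max_examples", type=int, default=10000)

# Example invocation: override just the batch size.
args = parser.parse_args(["--batch_size", "16"])
print(args.batch_size)  # 16
```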

Tips for Using TensorBoard

  • Trace View: Use the timeline to see when operations occur
  • Operator View: See which operations take the most time
  • GPU Kernel View: Check Tensor Core usage and kernel efficiency
  • Memory View: Analyze VRAM usage patterns
  • Zoom and Pan: Use these to focus on specific time ranges

Files in This Repository

  • train.py: Main training script with profiling integration
  • modeling.py: Local implementation of Llama model (contains intentional slowdowns)
  • utils.py: Dataset loading and utility functions
  • requirements.txt: Python dependencies

Getting Help

  • Ask your teacher for hints (not solutions!) if you're stuck; the lab is very short, so don't stay blocked for long!
  • Focus on understanding why optimizations work, not just implementing them
  • Use TensorBoard's documentation if needed

Good luck with the lab!
