diff --git a/README.md b/README.md
index cfabcbeb..d4789238 100644
--- a/README.md
+++ b/README.md
@@ -14,8 +14,7 @@ Module 3 focuses on **optimizing tensor operations** through parallel computing
 - **CPU Parallelization**: Implement parallel tensor operations with Numba
 - **GPU Programming**: Write CUDA kernels for tensor operations
 - **Performance Optimization**: Achieve significant speedup through hardware acceleration
-- **Matrix Multiplication**: Optimize the most computationally intensive operations
-- **Backend Architecture**: Build multiple computational backends for flexible performance
+- **Matrix Multiplication**: Optimize the most computationally intensive operations with operator fusion
 
 ## Tasks Overview
 
@@ -27,15 +26,15 @@ Feel free to use numpy functions like `np.array_equal()` and `np.zeros()`.
 File to edit: `minitorch/fast_ops.py`
 Implement optimized batched matrix multiplication with parallel outer loops.
 
-**Task 3.3**: GPU Operations
+**Task 3.3**: GPU Operations (requires GPU)
 File to edit: `minitorch/cuda_ops.py`
 Implement CUDA kernels for tensor map, zip, and reduce operations.
 
-**Task 3.4**: GPU Matrix Multiplication
+**Task 3.4**: GPU Matrix Multiplication (requires GPU)
 File to edit: `minitorch/cuda_ops.py`
 Implement CUDA matrix multiplication with shared memory optimization for maximum performance.
 
-**Task 3.5**: Training
+**Task 3.5**: Training (requires GPU)
 File to edit: `project/run_fast_tensor.py`
 Implement missing functions and train models on all datasets to demonstrate performance improvements.
 
@@ -44,95 +43,12 @@ Implement missing functions and train models on all datasets to demonstrate perf
 - **[Installation Guide](installation.md)** - Setup instructions including GPU configuration
 - **[Testing Guide](testing.md)** - How to run tests locally and handle GPU requirements
 
-## Quick Start
-
-### 1. Environment Setup
-```bash
-# Clone and navigate to your assignment
-git clone <your-repo-url>
-cd <repo-directory>
-
-# Create virtual environment (recommended)
-conda create --name minitorch python
-conda activate minitorch
-
-# Install dependencies
-pip install -e ".[dev,extra]"
-```
-
-### 2. Sync Previous Module Files
-```bash
-# Sync required files from your Module 2 solution
-python sync_previous_module.py <path-to-module-2> .
-
-# Example:
-python sync_previous_module.py ../Module-2 .
-```
-
-### 3. Run Tests
-```bash
-# CPU tasks (run anywhere)
-pytest -m task3_1  # CPU parallel operations
-pytest -m task3_2  # CPU matrix multiplication
-
-# GPU tasks (require CUDA-compatible GPU)
-pytest -m task3_3  # GPU operations
-pytest -m task3_4  # GPU matrix multiplication
-
-# Style checks
-pre-commit run --all-files
-```
-
 ## GPU Setup
 
-### Option 1: Google Colab (Recommended)
-Most students should use Google Colab for GPU tasks:
-
-1. Upload assignment files to Colab
-2. Change runtime to GPU (Runtime → Change runtime type → GPU)
-3. Install packages:
-   ```python
-   !pip install -e ".[dev,extra]"
-   !python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
-   ```
-
-### Option 2: Local GPU (If you have NVIDIA GPU)
-For students with NVIDIA GPUs and CUDA-compatible hardware:
-
-```bash
-# Install CUDA toolkit
-# Visit: https://developer.nvidia.com/cuda-downloads
-
-# Install GPU packages
-pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
-pip install numba[cuda]
-
-# Verify GPU support
-python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
-```
-
-## Testing Strategy
-
-### CI/CD (GitHub Actions)
-- **Task 3.1**: CPU parallel operations
-- **Task 3.2**: CPU matrix multiplication
-- **Style Check**: Code quality and formatting
-
-### GPU Testing (Colab/Local GPU)
-- **Task 3.3**: GPU operations (use Colab or local NVIDIA GPU)
-- **Task 3.4**: GPU matrix multiplication (use Colab or local NVIDIA GPU)
-
-### Performance Validation
-```bash
-# Compare backend performance
-python project/run_fast_tensor.py  # Optimized backends
-python project/run_tensor.py       # Basic tensor backend
-python project/run_scalar.py       # Scalar baseline
-```
+Follow this [link](https://colab.research.google.com/drive/1gyUFUrCXdlIBz9DYItH9YN3gQ2DvUMsI?usp=sharing): open the Colab notebook, save a copy to your Drive, select the T4 GPU runtime, and follow the instructions.
 
 ## Development Tools
-
-### Code Quality
+## Code Quality
 ```bash
 # Automatic style checking
 pre-commit install
@@ -156,25 +72,9 @@ NUMBA_CUDA_DEBUG=1 pytest -m task3_3 -v
 nvidia-smi -l 1  # Update every second
 ```
 
-## Implementation Focus
-
-### Task 3.1 & 3.2 (CPU Optimization)
-- Implement `tensor_map`, `tensor_zip`, `tensor_reduce` with Numba parallel loops
-- Optimize matrix multiplication with efficient loop ordering
-- Focus on cache locality and parallel execution patterns
-
-### Task 3.3 & 3.4 (GPU Acceleration)
-- Write CUDA kernels for element-wise operations
-- Implement efficient GPU matrix multiplication with shared memory
-- Optimize thread block organization and memory coalescing
-
-## Task 3.5 Training Results
-
-### Performance Targets
-- **CPU Backend**: Below 2 seconds per epoch
-- **GPU Backend**: Below 1 second per epoch (on standard Colab GPU)
-
 ### Training Commands
+
+#### Local Environment
 ```bash
 # CPU Backend
 python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET simple --RATE 0.05
@@ -187,6 +87,14 @@ python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
 python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
 ```
 
+#### Google Colab (Recommended)
+```bash
+# GPU Backend examples
+!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05
+!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
+!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
+```
+
 ### Student Results
 
 **TODO: Add your training results here**
@@ -201,10 +109,3 @@ python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
 #### XOR Dataset
 - CPU Backend: [Add time per epoch and accuracy]
 - GPU Backend: [Add time per epoch and accuracy]
-
-## Important Notes
-
-- **GPU Limitations**: Tasks 3.3 and 3.4 cannot run in GitHub CI due to hardware requirements
-- **GPU Testing**: Use Google Colab (recommended) or local NVIDIA GPU for GPU tasks
-- **Performance Critical**: Implementations must show measurable speedup over sequential versions
-- **Memory Management**: Be careful with GPU memory allocation and deallocation
diff --git a/installation.md b/installation.md
index d8924157..4baf55de 100644
--- a/installation.md
+++ b/installation.md
@@ -83,60 +83,6 @@ Install all packages in your virtual environment:
 
 ## GPU Setup (Required for Tasks 3.3 and 3.4)
 
-Tasks 3.3 and 3.4 require GPU support and won't run on GitHub CI.
+Tasks 3.3 and 3.4 require GPU support. Use Google Colab for GPU access (sign up for the student version).
 
-### Option 1: Google Colab (Recommended)
-
-Most students should use Google Colab as it provides free GPU access:
-
-1. Upload your assignment files to Colab
-2. Change runtime to GPU (Runtime → Change runtime type → GPU)
-3. Install packages in Colab:
-   ```python
-   !pip install -e ".[dev,extra]"
-   !python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
-   ```
-
-### Option 2: Local GPU Setup (If you have NVIDIA GPU)
-
-For students with NVIDIA GPUs and CUDA-compatible hardware:
-
-1. **Install CUDA Toolkit**
-   ```bash
-   # Visit: https://developer.nvidia.com/cuda-downloads
-   # Follow instructions for your OS
-   ```
-
-2. **Verify CUDA Installation**
-   ```bash
-   >>> nvcc --version
-   >>> nvidia-smi
-   ```
-
-3. **Install GPU-compatible packages**
-   ```bash
-   >>> pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
-   >>> pip install numba[cuda]
-   ```
-
-## Verification
-
-Make sure everything is installed by running:
-
-```bash
->>> python -c "import minitorch; print('Success!')"
-```
-
-Verify that the tensor functionality is available:
-
-```bash
->>> python -c "from minitorch import tensor; print('Module 3 ready!')"
-```
-
-Check if CUDA support is available (for GPU tasks):
-
-```bash
->>> python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
-```
-
-You're ready to start Module 3!
\ No newline at end of file
+Follow this [Google Colab link](https://colab.research.google.com/drive/1gyUFUrCXdlIBz9DYItH9YN3gQ2DvUMsI?usp=sharing), save the file to your Drive, select the T4 GPU runtime, and follow the instructions in the notebook.
\ No newline at end of file
diff --git a/pyproject.toml b/pyproject.toml
index 7be5e21d..6e3e566d 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -10,8 +10,8 @@ requires-python = ">=3.8"
 dependencies = [
     "colorama==0.4.6",
     "hypothesis==6.138.2",
-    "numba==0.61.2",
-    "numpy>=1.24,<2.3",
+    "numba-cuda[cu12]>=0.4.0",  # cu12 is for CUDA 12.x; use cu13 for CUDA 13.x
+    "numpy<2.0",
     "pytest==8.4.1",
     "pytest-env==1.1.5",
     "typing_extensions",
diff --git a/testing.md b/testing.md
index 08bd3d26..37466863 100644
--- a/testing.md
+++ b/testing.md
@@ -5,11 +5,9 @@ This project uses pytest for testing.
 Tests are organized by task:
 
 ```bash
-# Run all tests for a specific task
+# CPU Tasks (3.1 & 3.2) - Run locally
 pytest -m task3_1  # CPU parallel operations
 pytest -m task3_2  # CPU matrix multiplication
-pytest -m task3_3  # GPU operations (requires CUDA)
-pytest -m task3_4  # GPU matrix multiplication (requires CUDA)
 
 # Run all tests
 pytest
@@ -31,26 +29,12 @@ pytest tests/test_tensor_general.py::test_matrix_multiply
 
 - GitHub Actions CI only runs tasks 3.1 and 3.2 (CPU only)
 - Tasks 3.3 and 3.4 require local GPU or Google Colab
 
-**Option 1: Google Colab Testing (Recommended):**
-```python
-# In Colab notebook
-!pip install -e ".[dev,extra]"
-!python -m pytest -m task3_3 -v
-!python -m pytest -m task3_4 -v
-!python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
-```
+**GPU Tasks (3.3 & 3.4) - Google Colab (Recommended):**
 
-**Option 2: Local GPU Testing (If you have NVIDIA GPU):**
+Follow the instructions at the [Google Colab link](https://colab.research.google.com/drive/1gyUFUrCXdlIBz9DYItH9YN3gQ2DvUMsI?usp=sharing) and run the tests like this:
 ```bash
-# Verify CUDA is available
-python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
-
-# Test GPU tasks locally
-pytest -m task3_3  # GPU operations
-pytest -m task3_4  # GPU matrix multiplication
-
-# Debug GPU issues
-NUMBA_DISABLE_JIT=1 pytest -m task3_3 -v  # Disable JIT for debugging
+!cd $DIR; python3.11 -m pytest -m task3_3 -v
+!cd $DIR; python3.11 -m pytest -m task3_4 -v
 ```
 
 ### Style and Code Quality Checks
@@ -67,18 +51,6 @@
 ruff format .  # Code formatting
 pyright .      # Type checking
 ```
 
-### Task 3.5 - Performance Evaluation
-
-**Training Scripts:**
-```bash
-# Run optimized training (CPU parallel)
-python project/run_fast_tensor.py
-
-# Compare with previous implementations
-python project/run_tensor.py  # Basic tensor implementation
-python project/run_scalar.py  # Scalar implementation
-```
-
 ### Parallel Diagnostics (Tasks 3.1 & 3.2)
 
 **Running Parallel Check:**
@@ -87,20 +59,6 @@ python project/run_scalar.py  # Scalar implementation
 python project/parallel_check.py
 ```
 
-**Expected Output for Task 3.1:**
-- **MAP**: Should show parallel loops for both fast path and general case with allocation hoisting for `np.zeros()` calls
-- **ZIP**: Should show parallel loops for both fast path and general case with optimized memory allocations
-- **REDUCE**: Should show main parallel loop with proper allocation hoisting
-
-**Expected Output for Task 3.2:**
-- **MATRIX MULTIPLY**: Should show nested parallel loops for batch and row dimensions with no allocation hoisting (since no index buffers are used)
-
-**Key Success Indicators:**
-- Parallel loops detected with `prange()`
-- Memory allocations hoisted out of parallel regions
-- Loop optimizations applied by Numba
-- No unexpected function calls in critical paths
-
 ### Pre-commit Hooks (Automatic Style Checking)
 
 The project uses pre-commit hooks that run automatically before each commit:
@@ -111,41 +69,4 @@
 pre-commit install
 
 # Now style checks run automatically on every commit
 git commit -m "your message"  # Will run style checks first
-```
-
-### Debugging Tools
-
-**Numba Debugging:**
-```bash
-# Disable JIT compilation for debugging
-NUMBA_DISABLE_JIT=1 pytest -m task3_1 -v
-
-# Enable Numba debugging output
-NUMBA_DEBUG=1 python project/run_fast_tensor.py
-```
-
-**CUDA Debugging:**
-```bash
-# Check CUDA device properties
-python -c "import numba.cuda; print(numba.cuda.gpus)"
-
-# Monitor GPU memory usage
-nvidia-smi -l 1  # Update every second
-
-# Debug CUDA kernel launches
-NUMBA_CUDA_DEBUG=1 python -m pytest -m task3_3 -v
-```
-
-**Performance Profiling:**
-```bash
-# Time specific operations
-python -c "
-import time
-import minitorch
-backend = minitorch.TensorBackend(minitorch.FastOps)
-# Time your operations here
-"
-
-# Profile memory usage
-python -m memory_profiler project/run_fast_tensor.py
-```
+```
\ No newline at end of file
diff --git a/tests/test_tensor_general.py b/tests/test_tensor_general.py
index 6594ab9f..4b1d9cbe 100644
--- a/tests/test_tensor_general.py
+++ b/tests/test_tensor_general.py
@@ -15,7 +15,8 @@
 
 one_arg, two_arg, red_arg = MathTestVariable._comp_testing()
 
-
+from numba import config
+config.CUDA_ENABLE_PYNVJITLINK = 1  # enable pynvjitlink so numba-cuda can link on CUDA 12+
 
 # The tests in this file only run the main mathematical functions.
 # The difference is that they run with different tensor ops backends.
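The README hunk above removes the implementation notes for Tasks 3.1 and 3.2 (`tensor_map`, `tensor_zip`, `tensor_reduce` with Numba parallel loops). For context, here is a rough sequential sketch of the stride arithmetic those kernels build on; the helper names are illustrative and may not match the exact signatures in `minitorch/fast_ops.py`:

```python
# Sequential sketch of a stride-based map kernel (illustrative names only).
# Broadcasting is omitted: input and output are assumed to share a shape.

def to_index(ordinal, shape):
    """Convert a flat ordinal into a multidimensional index for `shape`."""
    index = [0] * len(shape)
    for dim in range(len(shape) - 1, -1, -1):
        index[dim] = ordinal % shape[dim]
        ordinal //= shape[dim]
    return index

def index_to_position(index, strides):
    """Convert a multidimensional index into an offset in flat storage."""
    return sum(i * s for i, s in zip(index, strides))

def tensor_map_seq(fn, out, out_shape, out_strides, in_storage, in_shape, in_strides):
    """Apply `fn` elementwise over strided storage, writing into `out`."""
    size = 1
    for d in out_shape:
        size *= d
    for ordinal in range(size):  # numba.prange(size) in the parallel version
        index = to_index(ordinal, out_shape)
        out_pos = index_to_position(index, out_strides)
        in_pos = index_to_position(index, in_strides)
        out[out_pos] = fn(in_storage[in_pos])

# Example: negate a 2x3 row-major tensor.
out = [0.0] * 6
tensor_map_seq(lambda x: -x, out, (2, 3), (3, 1),
               [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3), (3, 1))
```

Replacing `range` with `numba.prange` under `@njit(parallel=True)` gives the parallel fast path; `tensor_zip` and `tensor_reduce` follow the same pattern with one index buffer per input, plus a broadcasting step for mismatched shapes.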