diff --git a/README.md b/README.md
index b589efa..cfabcbe 100644
--- a/README.md
+++ b/README.md
@@ -19,13 +19,25 @@ Module 3 focuses on **optimizing tensor operations** through parallel computing
 
 ## Tasks Overview
 
-| Task | Description
-|---------|-------------
-| **3.1** | CPU Parallel Operations (`fast_ops.py`)
-| **3.2** | CPU Matrix Multiplication (`fast_ops.py`)
-| **3.3** | GPU Operations (`cuda_ops.py`)
-| **3.4** | GPU Matrix Multiplication (`cuda_ops.py`)
-| **3.5** | Performance Evaluation (`run_fast_tensor.py`)
+**Task 3.1**: CPU Parallel Operations
+File to edit: `minitorch/fast_ops.py`
+Feel free to use numpy functions like `np.array_equal()` and `np.zeros()`.
+
+**Task 3.2**: CPU Matrix Multiplication
+File to edit: `minitorch/fast_ops.py`
+Implement optimized batched matrix multiplication with parallel outer loops.
+
+**Task 3.3**: GPU Operations
+File to edit: `minitorch/cuda_ops.py`
+Implement CUDA kernels for tensor map, zip, and reduce operations.
+
+**Task 3.4**: GPU Matrix Multiplication
+File to edit: `minitorch/cuda_ops.py`
+Implement CUDA matrix multiplication with shared memory optimization for maximum performance.
+
+**Task 3.5**: Training
+File to edit: `project/run_fast_tensor.py`
+Implement missing functions and train models on all datasets to demonstrate performance improvements.
 
 ## Documentation
 
@@ -156,6 +168,40 @@ nvidia-smi -l 1  # Update every second
 
 - Implement efficient GPU matrix multiplication with shared memory
 - Optimize thread block organization and memory coalescing
+
+## Task 3.5 Training Results
+
+### Performance Targets
+- **CPU Backend**: Below 2 seconds per epoch
+- **GPU Backend**: Below 1 second per epoch (on a standard Colab GPU)
+
+### Training Commands
+```bash
+# CPU Backend
+python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET simple --RATE 0.05
+python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET split --RATE 0.05
+python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET xor --RATE 0.05
+
+# GPU Backend
+python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05
+python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
+python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
+```
+
+### Student Results
+**TODO: Add your training results here**
+
+#### Simple Dataset
+- CPU Backend: [Add time per epoch and accuracy]
+- GPU Backend: [Add time per epoch and accuracy]
+
+#### Split Dataset
+- CPU Backend: [Add time per epoch and accuracy]
+- GPU Backend: [Add time per epoch and accuracy]
+
+#### XOR Dataset
+- CPU Backend: [Add time per epoch and accuracy]
+- GPU Backend: [Add time per epoch and accuracy]
+
 ## Important Notes
 
 - **GPU Limitations**: Tasks 3.3 and 3.4 cannot run in GitHub CI due to hardware requirements
diff --git a/testing.md b/testing.md
index 5f9ecff..08bd3d2 100644
--- a/testing.md
+++ b/testing.md
@@ -79,6 +79,28 @@ python project/run_tensor.py  # Basic tensor implementation
 python project/run_scalar.py  # Scalar implementation
 ```
 
+### Parallel Diagnostics (Tasks 3.1 & 3.2)
+
+**Running Parallel Check:**
+```bash
+# Verify your parallel implementations
+python project/parallel_check.py
+```
+
+**Expected Output for Task 3.1:**
+- **MAP**: Should show parallel loops for both the fast path and the general case, with allocation hoisting for the `np.zeros()` calls
+- **ZIP**: Should show parallel loops for both the fast path and the general case, with optimized memory allocations
+- **REDUCE**: Should show the main parallel loop with proper allocation hoisting
+
+**Expected Output for Task 3.2:**
+- **MATRIX MULTIPLY**: Should show nested parallel loops over the batch and row dimensions, with no allocation hoisting (since no index buffers are used)
+
+**Key Success Indicators:**
+- Parallel loops detected via `prange()`
+- Memory allocations hoisted out of parallel regions
+- Loop optimizations applied by Numba
+- No unexpected function calls in critical paths
+
 ### Pre-commit Hooks (Automatic Style Checking)
 
 The project uses pre-commit hooks that run automatically before each commit: