60 changes: 53 additions & 7 deletions README.md
Module 3 focuses on **optimizing tensor operations** through parallel computing.

## Tasks Overview

| Task | Description |
|------|-------------|
| **3.1** | CPU Parallel Operations (`fast_ops.py`) |
| **3.2** | CPU Matrix Multiplication (`fast_ops.py`) |
| **3.3** | GPU Operations (`cuda_ops.py`) |
| **3.4** | GPU Matrix Multiplication (`cuda_ops.py`) |
| **3.5** | Performance Evaluation (`run_fast_tensor.py`) |

**Task 3.1**: CPU Parallel Operations
File to edit: `minitorch/fast_ops.py`
Feel free to use NumPy functions like `np.array_equal()` and `np.zeros()`.
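
To make the structure concrete, here is a minimal pure-Python sketch of a `tensor_map` with both the stride-aligned fast path and the general indexed path. The helper names `to_index` and `index_to_position` follow MiniTorch's conventions but treat this as a sketch, not the assignment solution: the real version wraps these loops in `numba.njit(parallel=True)` with `prange`, and broadcasting is omitted here (shapes are assumed equal).

```python
import numpy as np

def to_index(ordinal, shape, out_index):
    # Convert a flat position into a multidimensional index (row-major).
    for i in range(len(shape) - 1, -1, -1):
        out_index[i] = ordinal % shape[i]
        ordinal //= shape[i]

def index_to_position(index, strides):
    # Turn a multidimensional index into a position in flat storage.
    return int((index * strides).sum())

def tensor_map(fn, out, out_shape, out_strides, in_storage, in_shape, in_strides):
    # Fast path: identical shapes and strides mean flat storage lines up 1:1,
    # so no index math is needed at all.
    if np.array_equal(out_strides, in_strides) and np.array_equal(out_shape, in_shape):
        for i in range(len(out)):              # prange(len(out)) in the Numba version
            out[i] = fn(in_storage[i])
        return
    # General case: per-iteration index buffers. Under njit(parallel=True),
    # Numba hoists the np.zeros allocation out of the prange body.
    for i in range(len(out)):                  # prange(len(out)) in the Numba version
        out_index = np.zeros(len(out_shape), dtype=np.int32)
        to_index(i, out_shape, out_index)
        o = index_to_position(out_index, out_strides)
        j = index_to_position(out_index, in_strides)  # shapes assumed equal here
        out[o] = fn(in_storage[j])
```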

**Task 3.2**: CPU Matrix Multiplication
File to edit: `minitorch/fast_ops.py`
Implement optimized batched matrix multiplication with parallel outer loops.
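
The loop structure can be sketched in plain Python as below; the real implementation runs the outer loops with `numba.prange` so each `(batch, row)` pair is an independent unit of parallel work, and accumulates into a local variable so the inner loop does no global writes. Function name and NumPy indexing are illustrative, not the required storage/strides interface.

```python
import numpy as np

def batched_matmul(a, b):
    """C[n, i, j] = sum_k A[n, i, k] * B[n, k, j]."""
    batch, rows, inner = a.shape
    cols = b.shape[2]
    out = np.zeros((batch, rows, cols))
    for n in range(batch):          # prange over batches in the Numba version
        for i in range(rows):       # rows are independent too
            for j in range(cols):
                acc = 0.0           # accumulate locally, then one write to out
                for k in range(inner):
                    acc += a[n, i, k] * b[n, k, j]
                out[n, i, j] = acc
    return out
```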

**Task 3.3**: GPU Operations
File to edit: `minitorch/cuda_ops.py`
Implement CUDA kernels for tensor map, zip, and reduce operations.
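
Of the three, reduce has the least obvious kernel shape. The following is a CPU simulation (so it runs anywhere) of the per-block shared-memory tree reduction a CUDA sum-reduce typically uses; `BLOCK`, the function name, and the loop over threads are stand-ins for what the kernel expresses with `cuda.shared.array`, the thread index, and `cuda.syncthreads()`.

```python
import numpy as np

BLOCK = 8  # threads per block (illustrative; real kernels often use 1024)

def block_sum_reduce(values):
    """Simulate the per-block shared-memory tree reduction of a CUDA reduce kernel."""
    n = len(values)
    n_blocks = (n + BLOCK - 1) // BLOCK
    out = np.zeros(n_blocks)
    for block in range(n_blocks):         # each block runs independently on the GPU
        shared = np.zeros(BLOCK)          # cuda.shared.array(BLOCK, ...) in the kernel
        for t in range(BLOCK):            # each t is one thread loading one element
            i = block * BLOCK + t
            shared[t] = values[i] if i < n else 0.0
        stride = BLOCK // 2
        while stride > 0:                 # tree reduction: halve the active threads
            for t in range(stride):       # threads t < stride add shared[t + stride]
                shared[t] += shared[t + stride]
            stride //= 2                  # cuda.syncthreads() separates the rounds
        out[block] = shared[0]            # thread 0 writes the block's partial sum
    return out
```

Repeated passes (or a second small kernel launch) reduce the `n_blocks` partial sums down to a single value.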

**Task 3.4**: GPU Matrix Multiplication
File to edit: `minitorch/cuda_ops.py`
Implement CUDA matrix multiplication with shared memory optimization for maximum performance.
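
The idea behind the shared-memory optimization can be simulated on the CPU: each block loads a `TILE x TILE` sub-block of each input once, then every thread in the block reuses those values from fast shared memory instead of re-reading global memory. The tile size and function name here are illustrative; the real kernel stages tiles in `cuda.shared.array` buffers with per-thread accumulators.

```python
import numpy as np

TILE = 4  # shared-memory tile width (illustrative; 16 or 32 is common on real GPUs)

def tiled_matmul(a, b):
    """Simulate shared-memory tiling for C = A @ B."""
    n, m = a.shape
    m2, p = b.shape
    assert m == m2
    out = np.zeros((n, p))
    for bi in range(0, n, TILE):          # one (bi, bj) pair per thread block
        for bj in range(0, p, TILE):
            acc = np.zeros((min(TILE, n - bi), min(TILE, p - bj)))
            for bk in range(0, m, TILE):
                # One cooperative global-to-shared load per tile...
                a_tile = a[bi:bi + TILE, bk:bk + TILE]
                b_tile = b[bk:bk + TILE, bj:bj + TILE]
                # ...then TILE multiply-adds that touch only "shared memory".
                acc += a_tile @ b_tile
            out[bi:bi + TILE, bj:bj + TILE] = acc
    return out
```

Each input element is read from global memory once per tile instead of once per output element, which is where the speedup comes from.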

**Task 3.5**: Training
File to edit: `project/run_fast_tensor.py`
Implement missing functions and train models on all datasets to demonstrate performance improvements.

## Documentation

```bash
nvidia-smi -l 1  # Update every second
```
- Implement efficient GPU matrix multiplication with shared memory
- Optimize thread block organization and memory coalescing

## Task 3.5 Training Results

### Performance Targets
- **CPU Backend**: Below 2 seconds per epoch
- **GPU Backend**: Below 1 second per epoch (on standard Colab GPU)
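
To check your runs against these targets, a small timing wrapper like the sketch below works; `train_step` is a hypothetical stand-in for one epoch of the `run_fast_tensor.py` training loop.

```python
import time

def time_epochs(train_step, n_epochs=5):
    """Return average wall-clock seconds per epoch for a training callable."""
    start = time.time()
    for _ in range(n_epochs):
        train_step()
    return (time.time() - start) / n_epochs
```

Average over several epochs rather than timing one, since the first epoch includes Numba/CUDA compilation time.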

### Training Commands
```bash
# CPU Backend
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET simple --RATE 0.05
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET split --RATE 0.05
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET xor --RATE 0.05

# GPU Backend
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
```

### Student Results
**TODO: Add your training results here**

#### Simple Dataset
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

#### Split Dataset
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

#### XOR Dataset
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

## Important Notes

- **GPU Limitations**: Tasks 3.3 and 3.4 cannot run in GitHub CI due to hardware requirements
22 changes: 22 additions & 0 deletions testing.md
```bash
python project/run_tensor.py    # Basic tensor implementation
python project/run_scalar.py # Scalar implementation
```

### Parallel Diagnostics (Tasks 3.1 & 3.2)

**Running Parallel Check:**
```bash
# Verify your parallel implementations
python project/parallel_check.py
```

**Expected Output for Task 3.1:**
- **MAP**: Should show parallel loops for both fast path and general case with allocation hoisting for `np.zeros()` calls
- **ZIP**: Should show parallel loops for both fast path and general case with optimized memory allocations
- **REDUCE**: Should show main parallel loop with proper allocation hoisting

**Expected Output for Task 3.2:**
- **MATRIX MULTIPLY**: Should show nested parallel loops for batch and row dimensions with no allocation hoisting (since no index buffers are used)

**Key Success Indicators:**
- Parallel loops detected with `prange()`
- Memory allocations hoisted out of parallel regions
- Loop optimizations applied by Numba
- No unexpected function calls in critical paths
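
"Allocation hoisted" in the diagnostics means Numba moved a buffer allocation out of the per-iteration body of a parallel loop. The hypothetical function below shows the pattern the checker looks for: the index buffer is written with `np.zeros` inside the loop, and under `njit(parallel=True)` Numba rewrites it to one allocation per thread (here it runs as plain Python).

```python
import numpy as np

def reduce_sum_rows(a):
    """Row-sum reduce written in the style parallel_check.py diagnoses."""
    n_rows, n_cols = a.shape
    out = np.zeros(n_rows)
    for i in range(n_rows):              # prange(n_rows) under njit(parallel=True)
        # Numba hoists this allocation out of the parallel loop, reusing
        # one buffer per thread instead of allocating every iteration.
        index = np.zeros(2, dtype=np.int32)
        acc = 0.0                        # local accumulator, one write at the end
        for j in range(n_cols):
            index[0], index[1] = i, j
            acc += a[index[0], index[1]]
        out[i] = acc
    return out
```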

### Pre-commit Hooks (Automatic Style Checking)

The project uses pre-commit hooks that run automatically before each commit: