60 changes: 53 additions & 7 deletions README.md
Module 3 focuses on **optimizing tensor operations** through parallel computing.

## Tasks Overview

| Task | Description |
|------|-------------|
| **3.1** | CPU Parallel Operations (`fast_ops.py`) |
| **3.2** | CPU Matrix Multiplication (`fast_ops.py`) |
| **3.3** | GPU Operations (`cuda_ops.py`) |
| **3.4** | GPU Matrix Multiplication (`cuda_ops.py`) |
| **3.5** | Performance Evaluation (`run_fast_tensor.py`) |

**Task 3.1**: CPU Parallel Operations
File to edit: `minitorch/fast_ops.py`
Feel free to use NumPy functions like `np.array_equal()` and `np.zeros()`.
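
To make the structure concrete, here is a minimal pure-Python sketch of a `tensor_map` with both the stride-aligned fast path and the general indexed path. The helper names `to_index` and `index_to_position` follow MiniTorch's conventions but treat this as a sketch, not the assignment solution: the real version wraps these loops in `numba.njit(parallel=True)` with `prange`, and broadcasting is omitted here (shapes are assumed equal).

```python
import numpy as np

def to_index(ordinal, shape, out_index):
    # Convert a flat position into a multidimensional index (row-major).
    for i in range(len(shape) - 1, -1, -1):
        out_index[i] = ordinal % shape[i]
        ordinal //= shape[i]

def index_to_position(index, strides):
    # Turn a multidimensional index into a position in flat storage.
    return int((index * strides).sum())

def tensor_map(fn, out, out_shape, out_strides, in_storage, in_shape, in_strides):
    # Fast path: identical shapes and strides mean flat storage lines up 1:1,
    # so no index math is needed at all.
    if np.array_equal(out_strides, in_strides) and np.array_equal(out_shape, in_shape):
        for i in range(len(out)):              # prange(len(out)) in the Numba version
            out[i] = fn(in_storage[i])
        return
    # General case: per-iteration index buffers. Under njit(parallel=True),
    # Numba hoists the np.zeros allocation out of the prange body.
    for i in range(len(out)):                  # prange(len(out)) in the Numba version
        out_index = np.zeros(len(out_shape), dtype=np.int32)
        to_index(i, out_shape, out_index)
        o = index_to_position(out_index, out_strides)
        j = index_to_position(out_index, in_strides)  # shapes assumed equal here
        out[o] = fn(in_storage[j])
```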

**Task 3.2**: CPU Matrix Multiplication
File to edit: `minitorch/fast_ops.py`
Implement optimized batched matrix multiplication with parallel outer loops.
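
The loop structure can be sketched in plain Python as below; the real implementation runs the outer loops with `numba.prange` so each `(batch, row)` pair is an independent unit of parallel work, and accumulates into a local variable so the inner loop does no global writes. Function name and NumPy indexing are illustrative, not the required storage/strides interface.

```python
import numpy as np

def batched_matmul(a, b):
    """C[n, i, j] = sum_k A[n, i, k] * B[n, k, j]."""
    batch, rows, inner = a.shape
    cols = b.shape[2]
    out = np.zeros((batch, rows, cols))
    for n in range(batch):          # prange over batches in the Numba version
        for i in range(rows):       # rows are independent too
            for j in range(cols):
                acc = 0.0           # accumulate locally, then one write to out
                for k in range(inner):
                    acc += a[n, i, k] * b[n, k, j]
                out[n, i, j] = acc
    return out
```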

**Task 3.3**: GPU Operations
File to edit: `minitorch/cuda_ops.py`
Implement CUDA kernels for tensor map, zip, and reduce operations.
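
Of the three, reduce has the least obvious kernel shape. The following is a CPU simulation (so it runs anywhere) of the per-block shared-memory tree reduction a CUDA sum-reduce typically uses; `BLOCK`, the function name, and the loop over threads are stand-ins for what the kernel expresses with `cuda.shared.array`, the thread index, and `cuda.syncthreads()`.

```python
import numpy as np

BLOCK = 8  # threads per block (illustrative; real kernels often use 1024)

def block_sum_reduce(values):
    """Simulate the per-block shared-memory tree reduction of a CUDA reduce kernel."""
    n = len(values)
    n_blocks = (n + BLOCK - 1) // BLOCK
    out = np.zeros(n_blocks)
    for block in range(n_blocks):         # each block runs independently on the GPU
        shared = np.zeros(BLOCK)          # cuda.shared.array(BLOCK, ...) in the kernel
        for t in range(BLOCK):            # each t is one thread loading one element
            i = block * BLOCK + t
            shared[t] = values[i] if i < n else 0.0
        stride = BLOCK // 2
        while stride > 0:                 # tree reduction: halve the active threads
            for t in range(stride):       # threads t < stride add shared[t + stride]
                shared[t] += shared[t + stride]
            stride //= 2                  # cuda.syncthreads() separates the rounds
        out[block] = shared[0]            # thread 0 writes the block's partial sum
    return out
```

Repeated passes (or a second small kernel launch) reduce the `n_blocks` partial sums down to a single value.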

**Task 3.4**: GPU Matrix Multiplication
File to edit: `minitorch/cuda_ops.py`
Implement CUDA matrix multiplication with shared memory optimization for maximum performance.
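
The idea behind the shared-memory optimization can be simulated on the CPU: each block loads a `TILE x TILE` sub-block of each input once, then every thread in the block reuses those values from fast shared memory instead of re-reading global memory. The tile size and function name here are illustrative; the real kernel stages tiles in `cuda.shared.array` buffers with per-thread accumulators.

```python
import numpy as np

TILE = 4  # shared-memory tile width (illustrative; 16 or 32 is common on real GPUs)

def tiled_matmul(a, b):
    """Simulate shared-memory tiling for C = A @ B."""
    n, m = a.shape
    m2, p = b.shape
    assert m == m2
    out = np.zeros((n, p))
    for bi in range(0, n, TILE):          # one (bi, bj) pair per thread block
        for bj in range(0, p, TILE):
            acc = np.zeros((min(TILE, n - bi), min(TILE, p - bj)))
            for bk in range(0, m, TILE):
                # One cooperative global-to-shared load per tile...
                a_tile = a[bi:bi + TILE, bk:bk + TILE]
                b_tile = b[bk:bk + TILE, bj:bj + TILE]
                # ...then TILE multiply-adds that touch only "shared memory".
                acc += a_tile @ b_tile
            out[bi:bi + TILE, bj:bj + TILE] = acc
    return out
```

Each input element is read from global memory once per tile instead of once per output element, which is where the speedup comes from.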

**Task 3.5**: Training
File to edit: `project/run_fast_tensor.py`
Implement missing functions and train models on all datasets to demonstrate performance improvements.

## Documentation

```bash
nvidia-smi -l 1  # Update every second
```
- Implement efficient GPU matrix multiplication with shared memory
- Optimize thread block organization and memory coalescing

## Task 3.5 Training Results

### Performance Targets
- **CPU Backend**: Below 2 seconds per epoch
- **GPU Backend**: Below 1 second per epoch (on standard Colab GPU)
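
To check your runs against these targets, a small timing wrapper like the sketch below works; `train_step` is a hypothetical stand-in for one epoch of the `run_fast_tensor.py` training loop.

```python
import time

def time_epochs(train_step, n_epochs=5):
    """Return average wall-clock seconds per epoch for a training callable."""
    start = time.time()
    for _ in range(n_epochs):
        train_step()
    return (time.time() - start) / n_epochs
```

Average over several epochs rather than timing one, since the first epoch includes Numba/CUDA compilation time.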

### Training Commands
```bash
# CPU Backend
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET simple --RATE 0.05
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET split --RATE 0.05
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET xor --RATE 0.05

# GPU Backend
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
```

### Student Results
**TODO: Add your training results here**

#### Simple Dataset
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

#### Split Dataset
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

#### XOR Dataset
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

## Important Notes

- **GPU Limitations**: Tasks 3.3 and 3.4 cannot run in GitHub CI due to hardware requirements
22 changes: 22 additions & 0 deletions testing.md
```bash
python project/run_tensor.py    # Basic tensor implementation
python project/run_scalar.py # Scalar implementation
```

### Parallel Diagnostics (Tasks 3.1 & 3.2)

**Running Parallel Check:**
```bash
# Verify your parallel implementations
python project/parallel_check.py
```

**Expected Output for Task 3.1:**
- **MAP**: Should show parallel loops for both fast path and general case with allocation hoisting for `np.zeros()` calls
- **ZIP**: Should show parallel loops for both fast path and general case with optimized memory allocations
- **REDUCE**: Should show main parallel loop with proper allocation hoisting

**Expected Output for Task 3.2:**
- **MATRIX MULTIPLY**: Should show nested parallel loops for batch and row dimensions with no allocation hoisting (since no index buffers are used)

**Key Success Indicators:**
- Parallel loops detected with `prange()`
- Memory allocations hoisted out of parallel regions
- Loop optimizations applied by Numba
- No unexpected function calls in critical paths
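
"Allocation hoisted" in the diagnostics means Numba moved a buffer allocation out of the per-iteration body of a parallel loop. The hypothetical function below shows the pattern the checker looks for: the index buffer is written with `np.zeros` inside the loop, and under `njit(parallel=True)` Numba rewrites it to one allocation per thread (here it runs as plain Python).

```python
import numpy as np

def reduce_sum_rows(a):
    """Row-sum reduce written in the style parallel_check.py diagnoses."""
    n_rows, n_cols = a.shape
    out = np.zeros(n_rows)
    for i in range(n_rows):              # prange(n_rows) under njit(parallel=True)
        # Numba hoists this allocation out of the parallel loop, reusing
        # one buffer per thread instead of allocating every iteration.
        index = np.zeros(2, dtype=np.int32)
        acc = 0.0                        # local accumulator, one write at the end
        for j in range(n_cols):
            index[0], index[1] = i, j
            acc += a[index[0], index[1]]
        out[i] = acc
    return out
```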

### Pre-commit Hooks (Automatic Style Checking)

The project uses pre-commit hooks that run automatically before each commit: