## HPC exercise: training a neural network on the MNIST data set
- The exercise explores training a neural network using [the torch C++ API](https://pytorch.org/cppdocs/).

*(Figure: example handwritten digits from the MNIST data set.)*

You will learn how to train a network to recognize handwritten digits. To do so we will use the MNIST data set.
The figure above shows example images. The exercise assumes you are working on the systems at the Jülich Supercomputing Centre.
To solve this exercise, look through the files in the `source` folder. `TODO`s mark parts of the code that require your attention.
Come back to this README for additional hints.

- To get started on the JUWELS Booster, load the required modules:
```bash
module load Stages/2023 GCC/11.3.0 OpenMPI/4.1.4 CUDA/11.7 CMake PyTorch
```

- Use `mkdir build` to create your build directory. Change into the build folder and compile by running:
```bash
cmake -DCUDA_CUDA_LIB=/usr/lib64/libcuda.so -DCMAKE_PREFIX_PATH=`python -c 'import torch;print(torch.utils.cmake_prefix_path)'` ..
cmake --build . --config Release
```

- Navigate to `source/net.h` and implement the constructor for the `Net` struct.
The `Net` should implement a fully connected network

$$
 y = \ln(\sigma (W_3 f_r(W_2 f_r(W_1 x + b_1) + b_2) + b_3))
$$
| 28 | + |
| 29 | +with $W_1 \in \mathbb{R}^{h_1, n}, W_2 \in \mathbb{R}^{h_2, h_1}, W_3 \in \mathbb{R}^{m, h_2}$ |
| 30 | +and $b_1 \in \mathbb{R}^{h_1}, b_2 \in \mathbb{R}^{h_2}, b_3 \in \mathbb{R}^{m}$, where |
| 31 | +$n$ denotes the input dimension $h_1$ the number of hidden neurons in the first layer $h_2$ the number of neurons in the second layer, and $m$ the number of output neurons. |
| 32 | +Finally $\sigma$ denotes the [softmax function](https://en.wikipedia.org/wiki/Softmax_function) and $\ln$ the natural logarithm. |
Use `register_module` to add `Linear` layers to the network. Linear layers that implement $Wx + b$ are provided by `torch::nn::Linear`.
Move on to implement the forward pass. Follow the equation above, using `torch::relu` and
`torch::log_softmax`. What happens if you choose `torch::sigmoid` instead of the ReLU?
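
If you want to sanity-check your structure, a minimal sketch could look as follows. The layer names `fc1`–`fc3` and the four-argument constructor are illustrative assumptions; the `TODO`s in `source/net.h` define the actual interface.

```cpp
#include <torch/torch.h>

struct Net : torch::nn::Module {
  // n: input dimension, h1/h2: hidden layer sizes, m: number of classes.
  Net(int64_t n, int64_t h1, int64_t h2, int64_t m) {
    fc1 = register_module("fc1", torch::nn::Linear(n, h1));
    fc2 = register_module("fc2", torch::nn::Linear(h1, h2));
    fc3 = register_module("fc3", torch::nn::Linear(h2, m));
  }

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu(fc1->forward(x));  // f_r(W_1 x + b_1)
    x = torch::relu(fc2->forward(x));  // f_r(W_2 ... + b_2)
    // ln(softmax(W_3 ... + b_3)), computed in one numerically stable call.
    return torch::log_softmax(fc3->forward(x), /*dim=*/1);
  }

  torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr};
};
```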

- Before training your network, implement the `acc` function in `source/train_net.cpp`. It should compute the ratio of
correctly identified digits by comparing the `argmax` of the network output with the annotations.
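
A possible shape for this function, assuming it receives the raw network outputs and the integer labels as tensors (the actual signature in `source/train_net.cpp` may differ):

```cpp
// Sketch of an accuracy function; the signature is an assumption.
double acc(const torch::Tensor& output, const torch::Tensor& labels) {
  // argmax over the class dimension yields the predicted digit per sample.
  auto prediction = output.argmax(/*dim=*/1);
  // Ratio of correct predictions over the batch.
  auto correct = prediction.eq(labels).sum().item<int64_t>();
  return static_cast<double>(correct) / static_cast<double>(labels.size(0));
}
```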

- Torch devices are created via, e.g., `torch::Device device = torch::kCPU;`. Move computation to the GPUs by choosing `torch::kCUDA` when CUDA-capable GPUs are available.
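
A common pattern is to select the device at runtime; `net` and `data` below are placeholder names for your module and a batch tensor:

```cpp
// Fall back to the CPU when no CUDA device is visible to the process.
torch::Device device = torch::cuda::is_available() ? torch::kCUDA : torch::kCPU;
net->to(device);         // move the model parameters to the device
data = data.to(device);  // input tensors must be moved as well
```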

- Finally, iterate over the test data set and compute the test accuracy.
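
A sketch of such an evaluation loop, assuming `test_loader` was created with `torch::data::make_data_loader` over the MNIST test split using the `torch::data::transforms::Stack<>()` collation, and that `net`, `acc`, and `device` exist as above:

```cpp
torch::NoGradGuard no_grad;  // disable gradient tracking during evaluation
double acc_sum = 0.0;
int64_t batches = 0;
for (const auto& batch : *test_loader) {
  // Flatten the 1x28x28 images into 784-dimensional vectors for the MLP.
  auto images = batch.data.to(device).reshape({batch.data.size(0), 784});
  auto labels = batch.target.to(device);
  acc_sum += acc(net->forward(images), labels);
  ++batches;
}
// Averaging per-batch accuracies is exact when all batches have equal size.
std::cout << "test accuracy: " << acc_sum / batches << std::endl;
```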

- Train and test your network by executing:
```bash
./train_net
```

- When your network has converged, you should measure more than 90% accuracy on the test set.