---
title: "Parallel processing in Julia"
format:
  html:
    theme: cosmo
    css: assets/styles.css
    toc: true
    code-copy: true
    code-block-bg: true
    code-block-border-left: "#31BAE9"
    ipynb-shell-interactivity: all
    code-overflow: wrap
jupyter: julia-1.10
execute:
  freeze: auto
---

## 1 Overview
Julia provides built-in support for various kinds of parallel processing on one or more machines. This material focuses on some standard approaches that are (mostly) analogous to functionality in Python and R. However, there is other functionality available, including the ability to control tasks and send data between processes in a fine-grained way.
The second-to-last section discusses some issues related to the efficiency of for loops, in particular *fused* operations. This is not directly related to parallelization, but given the focus on loops in this document, it's useful and interesting to know about.
Finally, the last section discusses offloading computation to the GPU, i.e., massive parallelization on many GPU cores.
## 2 Threading
Threaded calculations are done in parallel on [software threads](./#31-shared-memory).
Threads share objects in memory with the parent process, which is useful for avoiding copies but raises the danger of a "race condition", where different threads modify data that other threads are using and cause errors.
### 2.1 Threaded linear algebra
As with Python and R, Julia uses BLAS, a standard library of basic linear algebra operations (written in Fortran or C). A fast BLAS can greatly speed up linear algebra relative to the default BLAS on a machine. Julia uses *OpenBLAS*, a fast, free, open source BLAS library. In addition to being fast when used on a single core, OpenBLAS is threaded: if your computer has multiple cores and there are free resources, your linear algebra will use multiple cores.
Here's an example.
```{julia}
using BenchmarkTools
using LinearAlgebra
using Distributions
n = 7000
x = rand(Uniform(0,1), n,n);
println(BLAS.get_num_threads())
```
```{julia}
function chol_xtx(x)
z = x'*x ## z is positive definite
C = cholesky(z)
end
BLAS.set_num_threads(4)
@btime chol = chol_xtx(x);
```
```{julia}
BLAS.set_num_threads(1)
@btime chol = chol_xtx(x);
```
We see that using four threads is faster than one, but in this case we don't get a four-fold speedup.
#### Number of threads
By default, Julia will set the number of threads for linear algebra equal to the number of processors on your machine.
As seen above, you can check the number of threads being used with:
```{julia}
BLAS.get_num_threads()
```
Other ways to control the number of threads used for linear algebra include:
- setting the `OMP_NUM_THREADS` environment variable in the shell before starting Julia, and
- using `BLAS.set_num_threads(n)`.
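For example, in a bash shell:

```
OMP_NUM_THREADS=4 julia
```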
### 2.2 Threaded for loops
In Julia, you can directly set up [software threads to use for parallel processing](https://docs.julialang.org/en/v1/manual/multi-threading).
Here we'll see some examples of running a for loop in parallel, both acting on a single object and used as a parallel map operation.
Here we can operate on a vector in parallel:
```{julia}
using Base.Threads
n = 50000000;
x = rand(n);
@threads for i in 1:length(x)
x[i] = exp(x[i]) + sin(x[i]);
end
```
We could also use threads to carry out a parallel map operation, implemented as a for loop.
```{julia}
n = 1000
function test(n)
x = rand(Uniform(0,1), n,n)
z = x'*x
C = cholesky(z)
return(C.U[1,1])
end
a = zeros(12)
@threads for i in 1:12
a[i] = test(n)
end
```
### 2.3 Spawning tasks on threads
You can also create (aka 'spawn') individual tasks on threads, with the tasks running in parallel.
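Here's a minimal sketch of the basic pattern, spawning a task on a thread and retrieving its result:

```{julia}
#| eval: false
t = Threads.@spawn sum(rand(10^7))  # run in a separate task on an available thread
fetch(t)                            # wait for the task and retrieve its result
```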
Let's see an example (taken from [here](http://ferestrepoca.github.io/paradigmas-de-programacion/paralela/tutoriales/julia/notebooks/parallelProgrammingApplications.html)) of sorting a vector in parallel by sorting subsets of the vector in separate threads.
```{julia}
import Base.Threads.@spawn
# sort the elements of `v` in place, from indices `lo` to `hi` inclusive
function psort!(v, lo::Int=1, hi::Int=length(v))
println(current_task(), ' ', lo, ' ', hi)
if lo >= hi # 1 or 0 elements; nothing to do
return v
end
if hi - lo < 100000 # below some cutoff, run in serial
sort!(view(v, lo:hi), alg = MergeSort)
return v
end
mid = (lo+hi)>>>1 # find the midpoint
### Set up parallelization here ###
## Sort two halves in parallel, one in current call and one in a new task
## in a separate thread:
half = @spawn psort!(v, lo, mid) # task to sort the lower half
psort!(v, mid+1, hi) # sort the upper half in the current call
wait(half) # wait for the lower half to finish
temp = v[lo:mid] # workspace for merging
i, k, j = 1, lo, mid+1 # merge the two sorted sub-arrays
@inbounds while k < j <= hi
if v[j] < temp[i]
v[k] = v[j]
j += 1
else
v[k] = temp[i]
i += 1
end
k += 1
end
@inbounds while k < j
v[k] = temp[i]
k += 1
i += 1
end
return v
end
```
How does this work? Let's consider an example where we sort a vector of length 250000.
The vector gets split into elements 1:125000 (run in task #1) and 125001:250000 (run in the main call). Then the elements 1:125000 are split into 1:62500 (run in task #2) and 62501:125000 (run in task #1), while the elements 125001:250000 are split into 125001:187500 (run in task #3) and 187501:250000 (run in the main call). No more splitting occurs because vectors of length less than 100000 are run in serial.
Assuming we have at least four threads (including the main process), each of the tasks will run in a separate thread, and all four sorts on the vector subsets will run in parallel.
```{julia}
x = rand(250000);
psort!(x);
```
We see that the output from `current_task()` shows that the task labels correspond with what I stated above.
The number of tasks running in parallel will be at most the number of threads set in the Julia session.
### 2.4 Controlling the number of threads
You can see the number of threads available:
```{julia}
Threads.nthreads()
```
You can control the number of threads used for threading in Julia (apart from linear algebra) either by:
- setting the `JULIA_NUM_THREADS` environment variable in the shell before starting Julia, or
- starting Julia with the `-t` (or `--threads`) flag, e.g.: `julia -t 4`.
Note that we can't use `OMP_NUM_THREADS` as the Julia threading is not based on OpenMP.
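For example, in a bash shell:

```
export JULIA_NUM_THREADS=4
julia
```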
## 3 Multi-process parallelization
### 3.1 Parallel map operations
We can use `pmap` to run a parallel map operation across multiple Julia processes (on one or more machines).
`pmap` is good for cases where each task takes a non-negligible amount of time, as there is overhead (latency) in starting the tasks.
Here we'll carry out multiple computationally expensive calculations in the map.
We need to import packages and create the function on each of the worker processes using `@everywhere`.
```{julia}
using Distributed
if nprocs() == 1
addprocs(4)
end
nprocs()
```
```{julia}
@everywhere begin
using Distributions
using LinearAlgebra
function test(n)
x = rand(Uniform(0,1), n,n)
z = transpose(x)*x
C = cholesky(z)
return C.U[2,3]
end
end
result = pmap(test, repeat([5000],12))
```
One can use [static allocation (prescheduling)](./#4-parallelization-strategies) with the `batch_size` argument, thereby assigning that many tasks to each worker at a time to reduce latency.
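For example, using the `test` function from above:

```{julia}
#| eval: false
## Assign tasks to the workers in batches of three.
result = pmap(test, repeat([5000], 12), batch_size = 3)
```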
### 3.2 Parallel for loops
One can execute for loops in parallel across multiple worker processes as follows. This is particularly handy for cases where one uses a reduction operator (e.g., the `+` here) so that little data needs to be copied back to the main process. (And in this case we don't copy any data to the workers either.)
Here we'll sum over a large number of random numbers with chunks done on each of the workers, comparing the time to a basic for loop.
```{julia}
function forfun(n)
sum = 0.0
for i in 1:n
sum += rand(1)[1]
end
return(sum)
end
function pforfun(n)
out = @sync @distributed (+) for i = 1:n
rand(1)[1]
end
return(out)
end
n=50000000
@time forfun(n);
```
```{julia}
@time pforfun(n);
```
The use of `@sync` causes the operation to block until the result is available so we can get the correct timing.
Without a reduction operation, one would generally end up passing a lot of data back to the main process, and this could take a lot of time. For such calculations, one would generally be better off using threaded for loops in order to take advantage of shared memory.
We'd have to look into how the random number seed is set on each worker to better understand any issues that might arise with parallel random number generation. I believe each worker gets a different seed, but note that this does not explicitly ensure that the random number streams on the workers are distinct, as would be the case if one used the L'Ecuyer algorithm.
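If one wants explicit control, here's a minimal sketch of seeding each worker separately (with the caveat just mentioned that distinct seeds do not guarantee non-overlapping streams):

```{julia}
#| eval: false
@everywhere using Random
for (i, w) in enumerate(workers())
    remotecall_fetch(Random.seed!, w, 1000 + i)  # give each worker its own seed
end
```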
### 3.3 Passing data to the workers
With multiple workers, particularly on more than one machine, one generally wants to be careful about having to copy large data objects to each worker, as that could make up a substantial portion of the time involved in the computation.
One can explicitly copy a variable to the workers in an `@everywhere` block by using Julia's interpolation syntax:
```{julia}
@everywhere begin
x = $x # copy to workers using interpolation syntax
println(pointer_from_objref(x), ' ', x[1])
end
```
We see based on `pointer_from_objref` that each copy of `x` is stored at a distinct location in memory, even when processes are on the same machine.
Also note that if one creates a variable within an `@everywhere` block, that variable is available to all tasks run on the worker, so it is 'global' with respect to those tasks. Note the repeated values in the result here.
```{julia}
@everywhere begin
x = rand(5)
function test(i)
return sum(x)
end
end
result = pmap(test, 1:12, batch_size = 3)
```
If one wants to have multiple processes all work on the same object, without copying it, one can consider using Julia's [SharedArray](https://docs.julialang.org/en/v1/stdlib/SharedArrays/#SharedArrays.SharedArray) (one machine) or [DArray from the DistributedArrays package](https://juliaparallel.org/DistributedArrays.jl/stable/) (multiple machines) types, which break up arrays into pieces, with different pieces stored locally on different processes.
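Here's a minimal sketch of using a `SharedArray` with workers on one machine; each worker fills in its piece of the array in place, with no copies sent back to the main process:

```{julia}
#| eval: false
@everywhere using SharedArrays
S = SharedArray{Float64}(100)    # backed by shared memory, visible to all local workers
@sync @distributed for i in 1:length(S)
    S[i] = i^2                   # each worker writes its own chunk in place
end
```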
### 3.4 Spawning tasks
One can use the `Distributed.@spawnat` macro to run tasks on processes, in a fashion similar to using `Threads.@spawn`. More details can be found [here](https://docs.julialang.org/en/v1/stdlib/Distributed/#Distributed.@spawnat).
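Here's a minimal sketch:

```{julia}
#| eval: false
fut = @spawnat :any rand(3, 3)   # run on whichever worker is available
fetch(fut)                       # copy the result back to the main process
```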
### 3.5 Using multiple machines
In addition to using processes on one machine, one can use processes across multiple machines. One can either start the worker processes when starting the main Julia session or start them from within the Julia session. In both cases you'll need to be able to ssh to the other machines without entering your password.
To start the processes when starting Julia, create a "machinefile" that lists the names of the machines and the number of worker processes to start on each machine.
Here's an example machinefile:
```
arwen.berkeley.edu
arwen.berkeley.edu
gandalf.berkeley.edu
gandalf.berkeley.edu
```
Note that if you're using Slurm on a Linux cluster, you could generate that file in the shell from within your Slurm allocation like this:
```
srun hostname > machines
```
Then start Julia like this:
```
julia --machine-file machines
```
From within Julia, you can add processes like this (first we'll remove the existing worker processes started using `addprocs()` previously):
```{julia}
rmprocs(workers())
addprocs([("arwen", 2), ("gandalf", 2)])
```
To check on the number of processes:
```{julia}
nprocs()
```
## 4 Loops and fused operations
Consider the following vectorized code that you might run in a variety of languages (e.g., Julia, Python, R).
```
x = tan(x) + 3*sin(x)
```
If run as vectorized code, this has downsides. First, it will use additional memory (temporary arrays will be created to store `tan(x)`, `sin(x)`, `3*sin(x)`). Second, multiple for loops will have to get executed when the vectorized code is run, looping over the elements of `x` to calculate `tan(x)`, `sin(x)`, etc. (For example in R or Python/numpy, multiple for loops would get run in the underlying C code.)
Contrast that to running directly as a for loop (e.g., in Julia or in C/C++):
```{julia}
#| eval: false
for i in 1:length(x)
x[i] = tan(x[i]) + 3*sin(x[i])
end
```
Here temporary arrays don't need to be allocated and there is only a single for loop.
Combining loops is called 'fusing' and is an [important optimization that Julia can do](https://docs.julialang.org/en/v1/manual/performance-tips/#More-dots:-Fuse-vectorized-operations). (It's also a [key optimization done by XLA](https://www.tensorflow.org/xla), a compiler used with JAX and Tensorflow.)
Of course you might ask why use vectorized code at all given that Julia will [JIT compile](https://en.wikipedia.org/wiki/Just-in-time_compilation) the for loop above and run it really quickly. That's true, but reading and writing vectorized code is easier than writing for loops.
Let's compare the speed of the following approaches. We'll put everything into functions as generally [recommended when timing Julia code](https://www.juliabloggers.com/timing-in-julia) to avoid [global variables that incur a performance penalty because their type can change](https://julialang.org/blog/2022/08/julia-1.8-highlights/#typed_globals).
First, let's find the time when directly using a for loop, as a baseline.
```{julia}
n = 50000000
y = Array{Float64}(undef, n);
x = rand(n);
```
```{julia}
function direct_for_loop_calc(x, y)
for i in 1:length(x)
y[i] = tan(x[i]) + 3 * sin(x[i])
end
end
using BenchmarkTools
@benchmark direct_for_loop_calc(x, y)
```
Notice the lack of additional memory use.
Now let's try a basic vectorized calculation (for which we need the various periods to get vectorization), without fusing. We'll reassign the result to the allocated `y` vector for comparability to the for loop implementation above.
```{julia}
function basic_vectorized_calc(x, y)
y .= tan.(x) + 3 * sin.(x)
end
using BenchmarkTools
@benchmark basic_vectorized_calc(x, y)
```
The original `x` array is 400 MB; notice the additional memory allocation and that this takes almost twice as long as the original for loop.
Here's a fused version of the vectorized calculation, where the `@.` causes the loops to be fused.
```{julia}
function fused_vectorized_calc(x, y)
y .= @. tan(x) + 3 * sin(x)
end
@benchmark fused_vectorized_calc(x, y)
```
We see that the time and (lack of) memory allocation are essentially the same as the original basic for loop.
Finally, one can achieve the same fusion by writing the function in terms of scalar quantities and then calling it in vectorized fashion via broadcasting (using `scalar_calc.()` instead of `scalar_calc()`).
```{julia}
function scalar_calc(x)
return(tan(x) + 3 * sin(x))
end
@benchmark y .= scalar_calc.(x)
```
## 5 Using the GPU - basic offloading
We can use `CUDA.jl` to offload computations to the GPU. Here we'll explore matrix multiplication and vectorized calculations. For this, Julia will take care, behind the scenes, of converting our Julia code to code that can run on the GPU.
There are a couple key things to remember about using a GPU:
- The GPU memory is separate from CPU memory, and transferring data from the CPU to GPU (or back) is often more costly than doing the computation on the GPU.
- If possible, generate the data on the GPU or keep the data on the GPU when carrying out a sequence of operations.
- By default, GPU calculations are often done using 32-bit (4-byte) floating point numbers rather than the 64-bit (8-byte) numbers that are standard on the CPU.
- This can affect speed comparisons between CPU and GPU.
Note that for this section, I'm pasting in the output when running the code separately on a machine with a GPU because this document is generated on a machine without a GPU.
### 5.1 Matrix multiplication
Let's first consider basic matrix multiplication. In this case since we generate the matrices on the CPU, they are 64-bit.
```{julia}
#| eval: false
using BenchmarkTools
using CUDA
using LinearAlgebra
function matmult(x, y)
z = x * y
return z
end
n = 7000
x = randn(n, n);
y = randn(n, n);
x_gpu = CuArray(x);
y_gpu = CuArray(y);
## These use 64-bit numbers:
typeof(x)
# Matrix{Float64} (alias for Array{Float64, 2})
typeof(x_gpu)
# CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}
LinearAlgebra.BLAS.set_num_threads(1);
@benchmark z = matmult(x, y)
# BenchmarkTools.Trial: 1 sample with 1 evaluation.
# Single result which took 17.271 s (0.00% GC) to evaluate,
# with a memory estimate of 373.84 MiB, over 2 allocations.
@benchmark CUDA.@sync z_gpu = matmult(x_gpu, y_gpu)
# BenchmarkTools.Trial: 65 samples with 1 evaluation.
# Range (min … max): 53.172 ms … 90.679 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 76.419 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 77.404 ms ± 4.092 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
```
Clearly the GPU calculation is much faster, taking about 75 milliseconds, compared to 17 seconds on the CPU (albeit using a single thread).
Let's compare that to the time of copying the data to the GPU:
```{julia}
#| eval: false
@benchmark CUDA.@sync tmp = CuArray(x)
# BenchmarkTools.Trial: 59 samples with 1 evaluation.
# Range (min … max): 83.766 ms … 137.849 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 84.684 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 85.696 ms ± 7.011 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
```
This suggests that the time in copying the data is similar to that for doing the
computation.
If we count the time of transferring the data to and from the GPU, that ends up being a substantial part of the time, compared to the 75 ms for simply doing the matrix multiplication.
```{julia}
#| eval: false
function matmult_with_transfer(x, y)
xc = CuArray(x)
yc = CuArray(y)
z = xc * yc
return Array(z)
end
@benchmark CUDA.@sync z = matmult_with_transfer(x, y)
# BenchmarkTools.Trial: 20 samples with 1 evaluation.
# Range (min … max): 251.578 ms … 258.017 ms ┊ GC (min … max): 0.00% … 0.57%
# Time (median): 253.886 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 254.228 ms ± 1.708 ms ┊ GC (mean ± σ): 4.33% ± 6.78%
```
As a sidenote, we can force use of 64-bit numbers on the GPU (in this case when generating values on the GPU) like this.
```{julia}
#| eval: false
x_gpu = CUDA.randn(Float64, n, n);
```
Finally, let's consider whether the matrix multiplication is faster using 32-bit numbers.
```{julia}
#| eval: false
x = randn(Float32, n, n);
y = randn(Float32, n, n);
x_gpu = CuArray(x);
y_gpu = CuArray(y);
typeof(x_gpu)
@benchmark z = matmult(x, y)
# BenchmarkTools.Trial: 1 sample with 1 evaluation.
# Single result which took 8.671 s (0.00% GC) to evaluate,
@benchmark CUDA.@sync z_gpu = matmult(x_gpu, y_gpu)
# BenchmarkTools.Trial: 91 samples with 1 evaluation.
# Range (min … max): 41.174 ms … 70.491 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 54.420 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 55.363 ms ± 2.912 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
```
So 32-bit is faster in both cases. The CPU implementation is about twice as fast with 32-bit numbers (which makes sense in that it works with numbers taking up half as much space), but I'm not sure why the GPU implementation doesn't achieve the same speedup (54 ms with 32-bit compared to 75 ms with 64-bit).
### 5.2 Vectorized calculations
Here we'll consider using the GPU for vectorized calculations, noting that [earlier](./#4-loops-and-fused-operations) we talked about using `.` to vectorize and `@.` to fuse loops in the context of CPU-based calculations.
```{julia}
#| eval: false
# Scalar function to do something.
function myfun(x)
y = tan(x) + 3 * sin(x)
return y
end
# Vectorized version that modifies `y` in place.
function myfun_vec(x, y)
y .= myfun.(x)
return
end
n = 250000000;
y = Vector{Float64}(undef, n);
x = rand(n);
x_gpu = CuArray(x);
y_gpu = CuArray(y);
@benchmark myfun_vec(x, y) # 3.5 sec.
@benchmark CUDA.@sync myfun_vec(x_gpu, y_gpu) # 6 ms.
```
Here we have a massive ~500x speedup: 6 ms compared to 3.5 seconds.
Of course, as in the matrix multiplication example above, if you need to copy the data to and from the GPU, that will add substantial time.
## 6 Using the GPU - writing GPU kernels
Next we'll explore [writing our own GPU kernels](https://cuda.juliagpu.org/stable/development/kernel). Kernels are functions that encode the core computational operations that are executed in parallel.
In other languages, the basic mode of operation when writing your own GPU code is to write a kernel using CUDA (basically C) code and then call the kernel in parallel via C, R, or Python code. In Julia, we can write the kernel using Julia syntax, though many operations (particularly non-numerical ones) will not run on the GPU.
### 6.1 Basic example
Here's a basic example in which we'll do a calculation in place. We run 1000 scalar calculations using 1000 threads.
We use `@cuda` to compile and run the kernel.
```{julia}
#| eval: false
function my_kernel(x)
idx = threadIdx().x; # What thread am I?
if idx <= length(x)
x[idx] = tan(x[idx]) + 3*sin(x[idx]);
end
return
end
n = 1000;
x_gpu = CUDA.randn(n);
Array(x_gpu)[n]
# -1.5321726f0
@cuda threads=n my_kernel(x_gpu);
Array(x_gpu)[n] # Check the computation was done by checking last element.
# -28.875708f0
```
There are limits on the number of threads we can use.
```{julia}
#| eval: false
n = 2000;
x_gpu = CUDA.randn(n);
@cuda threads=n my_kernel(x_gpu);
# ERROR: Number of threads in x-dimension exceeds device limit (2000 > 1024).
```
#### 6.1.1 Multiple blocks
We need at least as many threads as computations, but we can use at most 1024 threads in the x dimension, and at most 1024 threads per block on the A100 GPU we're using. So we'll need multiple blocks.
```{julia}
#| eval: false
function my_kernel(x)
i = threadIdx().x; # What thread am I within the block?
j = blockIdx().x; # What block am I in?
idx = (j-1)*blockDim().x + i;
if idx <= length(x)
x[idx] = tan(x[idx]) + 3*sin(x[idx]);
end
return
end
n = 2000;
nthreads = 1024;
x_gpu = CUDA.randn(n);
initial = Array(x_gpu)[n]
nblocks = Int(ceil(n/nthreads));
@cuda threads=nthreads blocks=nblocks my_kernel(x_gpu);
(initial, Array(x_gpu)[n]) # Check that calculation was done.
```
Let's do a smaller test run in which we can check on the thread and block indexing.
```{julia}
#| eval: false
function my_kernel_print(x)
i = threadIdx().x; # What thread am I within the block?
j = blockIdx().x; # What block am I in?
idx = (j-1)*blockDim().x + i;
if idx <= length(x)
x[idx] = tan(x[idx]) + 3*sin(x[idx]);
@cuprintln idx, i, j, blockDim().x, blockDim().y;
end
return
end
n = 200;
x_gpu = CUDA.randn(n);
nthreads = 100;
nblocks = Int(ceil(n/nthreads));
@cuda threads=nthreads blocks=nblocks my_kernel_print(x_gpu);
```
When we run this, notice the output seems to be grouped based on warps of 32 threads (apart from the last set since `n=200` is not a multiple of 32).
#### 6.1.2 Larger computations
In many cases we'll have more tasks than the total number of GPU cores. As long as we don't exceed the maximum size of a block or grid, we can just ask for as many threads as we have tasks and rely on the GPU to manage assigning the tasks to the GPU cores.
We'd want to check that the number of blocks here does not exceed the maximum grid size. I didn't do that, but it ran, so it must have been ok!
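One way to check (a sketch; I'm assuming these are the relevant CUDA.jl device attributes):

```{julia}
#| eval: false
dev = CUDA.device();
CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)  # e.g., 1024
CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_GRID_DIM_X)         # e.g., 2^31 - 1
```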
Here we'll run the computation we ran earlier when we did not write our own kernel and just relied on Julia to offload to the GPU behind the scenes.
```{julia}
#| eval: false
n = 250000000;
x_gpu = CUDA.randn(n);
nthreads = 1024;
nblocks = Int(ceil(n/nthreads));
Array(x_gpu)[n]
# Run it once to flush out any compilation/transformation time.
y_gpu = CUDA.randn(5);
CUDA.@sync @cuda threads=nthreads blocks=nblocks my_kernel(y_gpu);
CUDA.@time CUDA.@sync @cuda threads=nthreads blocks=nblocks my_kernel(x_gpu);
# 0.002003 seconds (45 CPU allocations: 2.719 KiB)
Array(x_gpu)[n]
```
The 2.0 ms is reasonably comparable to the 3.7 ms when we just had Julia run the [vectorized computation on the GPU](./#vectorized-example) (from the last time we ran it). That used 64-bit floats. When I reran the code above using 64-bit floats, the time was 5.2 ms.
### 6.2 Efficient memory access
We'll explore two final topics related to efficiently accessing data in memory: first accessing global GPU memory efficiently and second making use of shared GPU memory.
#### 6.2.1 Coalesced access to global memory
If adjacent threads in a block access adjacent memory locations, a chunk of data can be obtained in a single access to global memory.
We'll implement element-wise summing of two matrices. Obviously one can just do this directly with `CuArray`s in Julia, but implementing it ourselves illustrates that reading a matrix by column is much more efficient than reading by row. Here a thread block handles either part of a column (good) or part of a row (bad). The x-dimension of the grid then indexes the multiple thread blocks within each column (or row), and the y-dimension of the grid indexes the different columns (or rows).
```{julia}
#| eval: false
n = 10000;
X_gpu = CUDA.randn(n,n);
Y_gpu = CUDA.randn(n,n);
out_gpu = CUDA.zeros(n,n);
X_gpu_small = CUDA.randn(5,5);
Y_gpu_small = CUDA.randn(5,5);
out_gpu_small = CUDA.zeros(5,5);
# Good: Adjacent threads process elements in a column.
function kernel_sum_bycol!(X, Y, output)
row_idx = threadIdx().x + (blockIdx().x - 1)*blockDim().x;
col_idx = blockIdx().y;
if row_idx <= size(X, 1) && col_idx <= size(Y, 2)
output[row_idx, col_idx] = X[row_idx, col_idx] + Y[row_idx, col_idx]
end
return nothing
end
nthreads = 1024;
# x dim of grid is number of thread blocks in a column.
# y dim of grid is number of columns.
nblocks = (Int(ceil(n/nthreads)), n);
# Flush out any compilation time.
CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_bycol!(X_gpu_small, Y_gpu_small, out_gpu_small);
@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_bycol!(X_gpu, Y_gpu, out_gpu);
# 2.153 ms (47 allocations: 1.30 KiB)
# Bad: Adjacent threads process elements in a row.
function kernel_sum_byrow!(X, Y, output)
row_idx = blockIdx().y;
col_idx = threadIdx().x + (blockIdx().x - 1)*blockDim().x;
if row_idx <= size(X, 1) && col_idx <= size(Y, 2)
output[row_idx, col_idx] = X[row_idx, col_idx] + Y[row_idx, col_idx]
end
return nothing
end
# Flush out any compilation time.
CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_byrow!(X_gpu_small, Y_gpu_small, out_gpu_small);
@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_byrow!(X_gpu, Y_gpu, out_gpu);
# 10.500 ms (47 allocations: 1.30 KiB)
```
#### 6.2.2 Using shared memory
Accessing global GPU memory is much slower than doing computation on the GPU. So we'd like to avoid repeated access to global memory (e.g., a bad scenario would be a ratio of one arithmetic calculation per retrieval from global memory). One strategy is for multiple threads in a block to cooperate to load data from global memory into shared memory accessible by all the threads in the block. The computation can then be done on the data in shared memory.
Here's a simplified example that shows how to load the data into shared memory. There's no actual computation coded here, but one could imagine that each thread would then do a computation that uses the entire chunk of data in shared memory.
```{julia}
#| eval: false
function kernel_reader_bycol(x::CuDeviceArray{T}) where T
i = threadIdx().x; # What thread am I within the block?
j = blockIdx().x; # What block am I in?
idx = (j-1)*blockDim().x + i;
dims = size(x);
# Setup shared memory for the subset of data.
shared_data = CuDynamicSharedArray(T, (blockDim().x, dims[2]));
for chunk_start = 1:blockDim().x:dims[1]
chunk_size = min(blockDim().x, dims[1] - chunk_start + 1);
## Transfer a chunk of rows in parallel, one row per thread.
if i <= chunk_size
for col in 1:dims[2]
shared_data[i, col] = x[chunk_start + i - 1, col];
end
end
sync_threads()
# At this point we'd insert code to do the actual computation, based on `idx`.
# Each thread now has the opportunity to compute on all the data in the chunk in
# `shared_data`.
end
return
end
n = 10000000;
m = 10;
x_gpu = CUDA.randn(n, m);
x_gpu_small = CUDA.randn(5, m);
nthreads = 1024;
nblocks = 100; # This is arbitrary in this example as we are not doing an actual computation.
memsize = nthreads * m * 4;
CUDA.@sync @cuda threads=nthreads blocks=nblocks shmem=memsize kernel_reader_bycol(x_gpu_small);
CUDA.@time @cuda threads=nthreads blocks=nblocks shmem=memsize kernel_reader_bycol(x_gpu);
# 0.138480 seconds (24 CPU allocations: 752 bytes)
```
If `m` gets much bigger, we get an error "ERROR: Amount of dynamic shared memory exceeds device limit (400.000 KiB > 48.000 KiB)." So for larger `m` we'd need to rework how we manipulate the data.
Let's close by seeing if the memory access patterns make a difference in this example. Instead of accessing by column, we'll access by row, but with the matrix transposed so it is very wide instead of very long.
My initial thought was that accessing by row would be slower because adjacent threads are not reading from adjacent locations in global memory.
```{julia}
#| eval: false
function kernel_reader_byrow(x::CuDeviceArray{T}) where T
i = threadIdx().x; # What thread am I within the block?
j = blockIdx().x; # What block am I in?
idx = (j-1)*blockDim().x + i;
dims = size(x);
# Setup shared memory for the subset of data.
shared_data = CuDynamicSharedArray(T, (dims[1], blockDim().x));
for chunk_start = 1:blockDim().x:dims[2]
chunk_size = min(blockDim().x, dims[2] - chunk_start + 1);
## Transfer a chunk of columns in parallel, one column per thread.
if i <= chunk_size
for row in 1:dims[1]
shared_data[row, i] = x[row, chunk_start + i - 1];
end
end
sync_threads()
# At this point we'd insert code to do the actual computation, based on `idx`.
# Each thread now has the opportunity to compute on all the data in the chunk in
# `shared_data`.
end
return
end
n = 10000000;
m = 10;
x_gpu = CUDA.randn(m, n);
x_gpu_small = CUDA.randn(m, 5);
nthreads = 1024;
nblocks = 100; # This is arbitrary in this example as we are not doing an actual computation.
memsize = nthreads * m * 4;
CUDA.@sync @cuda threads=nthreads blocks=nblocks shmem=memsize kernel_reader_byrow(x_gpu);
CUDA.@time CUDA.@sync @cuda threads=nthreads blocks=nblocks shmem=memsize kernel_reader_byrow(x_gpu);
# 0.105434 seconds (25 CPU allocations: 1008 bytes)
```
We see that the access by row here is (a bit) faster. I think this is because the entire chunk of data in the wide matrix lives in a small area of global memory, while in the long matrix, each column in the chunk has adjacent values but separate columns are very far apart because the matrix is so long.
We might be able to improve efficiency with the wide matrix by operating by column within the wide matrix. This would involve more work to manage the indexing because we wouldn't just have each thread manage a column (unless we used very few threads, which would presumably reduce efficiency).
### 6.3 Using atomics for reduction operations
One thing we haven't seen so far is being able to have different threads write to the same memory location (e.g., to a scalar or to an element of an array). One can easily imagine needing to do this to carry out reduction operations (e.g., calculating a sum or a max or min).
The obvious danger is that two threads might write to the memory location at the same time and somehow cause the location not to be properly updated.
Suppose we want to calculate the log-likelihood (or some other loss function) across independent observations. We'd like to do the summation on the GPU to avoid passing all the log-likelihood values from GPU to CPU and then having to do the sum on the CPU.
```{julia}
#| eval: false
using BenchmarkTools
using Distributions
n = 100_000_000; # Formatted for readability.
norm_dist = Normal(0,1)
samples = rand(norm_dist, n);
function loglik_kernel(x, result)
i = threadIdx().x; # What thread am I within the block?
j = blockIdx().x; # What block am I in?
idx = (j-1)*blockDim().x + i;
if idx <= length(x)
# logpdf(norm_dist, x[idx]) # Doesn't compile.
CUDA.@atomic result[1] += -0.5*x[idx]^2; # Experimental, but nicer interface.
#CUDA.atomic_add!(pointer(result), -0.5*x[idx]^2); # Stable low-level API.
end
return
end
nthreads = 1024;
nblocks = Int(ceil(n/nthreads));
samples_gpu = CuArray(samples);
result = CUDA.zeros(typeof(samples[1]), 1);
@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks loglik_kernel(samples_gpu, result);
# 161.352 ms (34 allocations: 944 bytes)
Array(result)[1] - n*log(2*pi)/2 # Adjust for the normalizing constant as a scalar computation, not on the GPU.
@btime sum(logpdf.(norm_dist, samples))
# 1.410 s (5 allocations: 762.94 MiB)
```
So we got about a 12-fold speedup, which is less than we've been getting for some of our other comparisons.
I was curious how much time is spent handling the reduction operation (presumably there is some loss in efficiency from having all the threads write to the same memory location). When I changed `result` to be a vector of length equal to that of `samples` and just assign the individual PDF evaluations to the corresponding elements of `result` without the atomic operation, the time was 3 milliseconds (compared to 161 above), so there is a performance degradation from the atomic operation.
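For reference, here's a sketch of that variant (my reconstruction, with a hypothetical `loglik_kernel_noatomic`); each thread writes to its own element, so no atomic operation is needed, and the final sum is left as a separate reduction:

```{julia}
#| eval: false
function loglik_kernel_noatomic(x, result)
    idx = (blockIdx().x - 1)*blockDim().x + threadIdx().x;
    if idx <= length(x)
        result[idx] = -0.5*x[idx]^2;  # one memory location per thread, so no atomic needed
    end
    return
end
result_vec = CUDA.zeros(eltype(samples), n);
@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks loglik_kernel_noatomic(samples_gpu, result_vec);
# Roughly 3 ms, compared to 161 ms with the atomic operation.
```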
#### 6.3.1 Using shared memory to reduce the cost of atomic operations
One solution to the performance degradation is to not have all of the summing make use of the same location in memory to accumulate the result.
Instead we can use shared memory to more efficiently do the reduction within each thread block before doing the final reduction across blocks. Here's an approach using a tree-like operation (as suggested by a ChatBot, but requiring some debugging on my part) to compute the partial sum within each thread block before using the atomic operation to compute the sum of the partial sums:
```{julia}
#| eval: false
function loglik_kernel_shmem(x::CuDeviceArray{T}, result::CuDeviceArray{T}) where T
i = threadIdx().x; # What thread am I within the block?
j = blockIdx().x; # What block am I in?
idx = (j-1)*blockDim().x + i;
shared_data = CuDynamicSharedArray(T, (blockDim().x));
# First do the core calculation and store in shared memory.
if idx <= length(x)
shared_data[i] = -0.5*x[idx]^2;
else
shared_data[i] = 0.0;
end
sync_threads()  # Make sure all threads have written their values before the reduction reads them.
# Tree-like partial sum within the thread block,
# summing pairs until the sum within the block
# is contained in `shared_data[1]`.
s = blockDim().x ÷ 2; # `÷` ensures `s` is Int.
while s >= 1
if i <= s
shared_data[i] += shared_data[i + s];
end
sync_threads()
s ÷= 2;
end
# The first thread in the block writes the partial sum to global memory.
if i == 1
CUDA.@atomic result[1] += shared_data[1];
end
return
end
memsize = nthreads * sizeof(samples[1]);
result2 = CUDA.zeros(typeof(samples[1]), 1);
@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks shmem=memsize loglik_kernel_shmem(samples_gpu, result2);
# 6.317 ms (34 allocations: 944 bytes)
Array(result2)[1] - n*log(2*pi)/2
```
### 6.4 Debugging kernel code
It can be much harder to debug kernel code than regular Julia code. If the syntax doesn't produce valid compiled code that can run on the GPU, it may not be obvious from the error message what the problem is.
As an example, in the code in the previous section I originally had `s = blockDim().x / 2;` and `s /= 2;`. I didn't realize that even with integer inputs this produces a float type for `s`, and that as a result using `s` for indexing in `shared_data[i + s]` wouldn't work. The error message said there was a problem with the LLVM/IR code produced from the kernel, but didn't say where, and it took a binary search on my part to figure out that `shared_data[i + s]` was the problematic piece of code and that the problem was caused by `s` being a float.
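One tool that can help narrow such problems down is CUDA.jl's reflection macros; for example (a sketch, applied to the kernel launch from the previous section), `@device_code_warntype` shows the typed code for a kernel, where a variable inferred as a float when an integer is needed should stand out:

```{julia}
#| eval: false
CUDA.@device_code_warntype @cuda threads=nthreads blocks=nblocks shmem=memsize loglik_kernel_shmem(samples_gpu, result2);
```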
On an only somewhat related point, the ChatBot originally gave me `while s >= 0`, which is a bug that doesn't prevent the code from running but does give incorrect numerical results, so we still need to be careful with what we get from ChatBots.