-
Notifications
You must be signed in to change notification settings - Fork 6
Home
Overlap-and-save method of calculation linear convolution on NVIDIA GPUs
Convolution Convolution is a standard tool in signal processing. It is a linear operation where a signal s is modified by a response function r or filter. There are two fundamental ways how to calculate convolution. We can calculate convolution in time-domain using formulae for discrete convolution (more in any signal processing literature for example R. G. Lyons, Understanding digital signal processing 3rd ed, Prentice 180 Hall, 2011). We can also invoke convolution theorem and calculate convolution in frequency domain. To do this we need to perform discrete Fourier transformation first, then perform convolution in frequency-domain and finally use inverse discrete Fourier transformation to return to time-domain. The overlap-and-save method of calculating convolution is designed for special case where we have very long input signal, but relatively short response function. The disadvantage of Fourier-domain convolution method in cases such as this is that convolution theorem demands that both the signal and response function in Fourier-domain must be of the same length. This can be performance prohibitive if we are dealing with very long signals and multiple response functions. The overlap-and-save method separates input signal into smaller segments which are then independently processed, which makes this method ideal for parallel processing for example on GPUs. At the end the overlap-and-save method add all these segments together in such a way as to produce unbroken linear convolution.
We provide two implementations of overlap-and-save method, first is using vendor provided FFT library the cuFFT library for calculating necessary FFTs, the second implementation is using our custom FFT implementation of FFT algorithm. The advantage of having a custom FFT algorithm is that we can perform all necessary steps of the overlap-and-save inside one CUDA kernel thus saving costly global memory transactions thus providing significant speedup over cuFFT implementation.
Current version of the code aims to demonstrate performance gain of our implementation and for performance testing. Although it could be used as a stand alone code for calculation of the convolutions it is not intended as such. As a benchmark the code does not choose optimal segment size on its own and the segment size need to be set manually in 'params.h'.
Compilation: We have provided a make file which should take care of the compilation step. The make file assumes that environmental variable 'CUDA_HOME' is set and that it points to a folder containing installation of CUDA. The compute capability also needs to be set in the make file. By default we set it to compute capability 6.0 for NVIDIA P100. The code uses shuffle instructions to compute FFTs thus it should work from GPU of Kepler generation and higher. The code does not require any other dependencies.
Execution of the code: All of our implementations (cuFFT and custom FFT) expect same command line arguments. There are two modes of operation, one for performance testing (first argument 'r') which does not require user to provide input data in form of a file and one for processing user data provided (first argument 'f').
The arguments expected by the code depends on chosen mode of operation.
- Input type: 'r' or 'f'
'f' - file input provided by user
- Input signal file
- Input filter file
- Output signal file
- number of filters
- number of GPU kernel runs (optional) Example: CONV.exe f signal.dat filter.dat output.dat 32 10
'r' - random input generated by the code
- Signal length in number of time samples
- Filter length in samples
- Number of templates
- number of GPU kernel runs (optional) Example: CONV.exe r 2097152 192 32 10
Beware when using very long signals, signal with 2^23 time samples with 100 filter will take 6.4GB of memory.
The cuFFT implementation expects that CONV_SIZE would be #defined. This constant determines size of the segment which will be processed, best performing value is 8192 regardless of response function width. The AAFFT implementation in addition to CONV_SIZE expect CONV_HALF = CONV_SIZE/2 and FFT_EXP which is given as FFT_EXP = log2(CONV_SIZE) = log(CONV_SIZE)/log(2). In case of AAFFT implementation there is no universal segment size. We have found that for filter lengths N_r<128 the best performing is CONV_SIZE = 512, for 256<=N_r<384 we use CONV_SIZE = 1024 and for N_r>384 we use CONV_SIZE = 2048. These values of CONV_SIZE may however depend on GPU chosen.
If enabled in debug.h the code will write out timing and parameters of given run for analysis of the results. The columns of the output file are:
- number of time samples
- filter length
- number of filters
- average execution time of convolution kernel (no transfer time from/to host are included)
- number of executions from which the execution time is calculated
- limitation on number of registers if set
- CONV_SIZE
- number of filters processed per CUDA thread block
- type of the kernel used
In case of AAFFT implementation we provide two executables one CONV_1f.exe uses kernel which processes one segment per CUDA thread block and CONV_2f.exe which processes two segments per CUDA thread block. Some GPUs seems to prefer one or the other.
Structure of input files and generating example files: The structure of the input files is simple each time sample is written on individual line where real part is in the first column and imaginary part is in second column.
Structure of the output file is divided into blocks by filters and they are separated by empty line. The column in output file are as follows:
- filter index 0<=i<(number of filters)
- time sample
- real part of convolved signal
- imaginary part of convolved signal
in gnuplot one can display powers of convolved signal as: splot 'output.dat' using 1:2:($3*$3+$4*$4) palette Be advised that the code might produce very big files for long signals and large number of filters.
The example files could be generated by using 'GPU_OaS_generate_files'. The cod efor generating example files expects following arguments:
- Signal length in number of time samples (min 15000 samples)
- Template length Example:128
- Number of filters
- Name of the file to export signal to
- Name of the file to export filters to
Contributors: Karel Adamek Sofia Dimoudi Wes Armour Mike Giles