193 commits
d909529
Add GPU packing and unpacking
Nov 7, 2014
e3463fa
indexed datatype new, bonus stask support.
eddy16112 Nov 14, 2014
cf44223
RDMA send is now working.
eddy16112 Apr 9, 2015
c6a00d7
Add support for vector datatype. Add pipeline.
eddy16112 Apr 22, 2015
c10d3f4
fix gpu memory and vector datatype
eddy16112 May 1, 2015
0fda4df
unrestricted GPU. Instead of forcing everything to go on
bosilca May 7, 2015
bc0e104
Using globally defined indexes lead to several synchronization
bosilca Jun 18, 2015
6fda036
Generate the Makefile. It will now be placed in the bindir
bosilca Jun 18, 2015
742992a
This file was certainly not supposed to be here. There is NO valid
bosilca Jun 18, 2015
9c63b09
Add the capability to install the generated library and other
bosilca Jun 18, 2015
a681551
Open the datatype CUDA library from a default install location.
bosilca Jun 18, 2015
bcd77f6
Add a patch from Rolf fixing 2 issues:
bosilca Jun 30, 2015
bdfe31b
clean up code in pack and unpack
eddy16112 Aug 19, 2015
a670db4
big changes, now pack is driven by receiver by active message
eddy16112 Aug 22, 2015
42ad920
intel test working
eddy16112 Aug 31, 2015
bab3559
fix a bug when buffer is not big enough for whole ddt
eddy16112 Aug 31, 2015
29c90a0
if data in different gpu, instead of copy direct from one to the other,
eddy16112 Sep 2, 2015
44a1550
now we can use cudamemcpy2d
eddy16112 Sep 8, 2015
a67c842
enable zero copy + fix GPU buffer bug
eddy16112 Sep 8, 2015
7bd8151
put pipeline size into mca
eddy16112 Sep 14, 2015
9d10357
Upon datatype commit create a list of iovec representing a single
bosilca Sep 15, 2015
756b2af
contiguous vs non-contiguous is working
eddy16112 Sep 17, 2015
3a6bdd9
Fix pipeline bug
eddy16112 Sep 17, 2015
f86c81e
now we are able to pack directly to remote buffer if receiver is
eddy16112 Sep 18, 2015
6ae39b2
add ddt_benchmark
eddy16112 Sep 29, 2015
25ead9b
modify for matrix transpose
eddy16112 Oct 2, 2015
5e14fdd
enable vector
eddy16112 Oct 2, 2015
d03c601
receiver now will send msg back to sender for buffer reuse
eddy16112 Oct 6, 2015
c377c36
fix zerocopy
eddy16112 Oct 9, 2015
c4b5fcf
offset instead of actual addess, and lots of clean up for unused
eddy16112 Oct 22, 2015
c451a4a
rewrite pipeline
eddy16112 Oct 25, 2015
fa331f8
s up and running. PUT size in an MCA parameters.
eddy16112 Oct 26, 2015
a3f79aa
less bugs
eddy16112 Oct 27, 2015
8acf254
fix pipelining for non-contiguous to contiguous
eddy16112 Oct 27, 2015
fe88901
opal_datatype is chnaged, so we need more space
Oct 27, 2015
688a423
reorder datatypes to cache boundaries
Oct 27, 2015
04a9785
slience warnings
eddy16112 Oct 28, 2015
e05edf8
remove smcuda btl calls from pml ob1
eddy16112 Oct 28, 2015
c13df8c
this file is not used anymore
eddy16112 Oct 28, 2015
fa02341
cuda ddt support is able to turn itself off. Make it support
eddy16112 Oct 29, 2015
7258acd
fix a cuda stream bug for iov, remove some stream syncs
eddy16112 Oct 30, 2015
4271a0d
in openib, disable rdma for non-contiguous gpu data
eddy16112 Nov 4, 2015
85f6428
move ddt kernel support function pointer into opal_datatype_cuda.c
eddy16112 Nov 4, 2015
44361c0
rename some functions
eddy16112 Nov 5, 2015
6b95a38
check point
eddy16112 Nov 6, 2015
101387c
Add support for caching the unpacked datatype description
bosilca Nov 7, 2015
0379c0b
check point use raw_cached, but cuda iov caching is not enabled
eddy16112 Nov 7, 2015
4a39f49
check point, split iov into two version, non-cached and cached
eddy16112 Nov 8, 2015
d9927f4
check point iov cache
eddy16112 Nov 8, 2015
4c192e9
another checkpoint
eddy16112 Nov 9, 2015
0f24449
check point, cuda iov is cached, but not used for pack/unpack
eddy16112 Nov 9, 2015
f1f4a7d
check point, ready to use cached cuda iov
eddy16112 Nov 10, 2015
043fa9c
checkpoint, cached cuda iov is working with multiple send, but not for
eddy16112 Nov 10, 2015
5e8c77a
checkpoint, fix a bug for partial unpack
eddy16112 Nov 11, 2015
deb67ec
checkpoint, fix unpack size
eddy16112 Nov 11, 2015
56c9fa4
checkpoint, during unpack, cache the entire iov before unpack
eddy16112 Nov 11, 2015
f3e03bd
another checkpoint
eddy16112 Nov 12, 2015
f524be6
checkpoint , remove unnecessary cuda stream sync
eddy16112 Nov 12, 2015
4b76d89
use bit to replace %
eddy16112 Nov 13, 2015
ff3a896
rollback to use %, not bit, since it is faster, not sure why
eddy16112 Nov 13, 2015
141cbbf
now cuda iov is {nc_disp, c_disp}
eddy16112 Nov 13, 2015
b2f6611
clean up kernel, put variables uses multiple times into register
eddy16112 Nov 13, 2015
a59950f
another checkpoint
eddy16112 Nov 14, 2015
e3cb0ee
now convertor->count > 1 is woring
eddy16112 Nov 14, 2015
30d493b
move the cuda iov caching into a seperate function
eddy16112 Nov 16, 2015
8c830a6
these two variables are useless now
eddy16112 Nov 16, 2015
8d1db8a
fix a bug for ib, current count of convertor should be set in
eddy16112 Nov 16, 2015
754b0d0
cleanup, move cudamalloc into cache cuda iov
eddy16112 Nov 17, 2015
27e44f5
rearrange varibles
eddy16112 Nov 17, 2015
7a663f4
if cuda_iov is not big enough, use realloc. However, cudaMallocHost does
eddy16112 Nov 17, 2015
150ba7a
make sure check pointer is not NULL before free it
eddy16112 Nov 18, 2015
38ca646
checkpoint, rewrite non-cached version
eddy16112 Nov 25, 2015
ade51ba
fix for non cached iov
eddy16112 Nov 25, 2015
897ea1e
fix the non cached iov, set position should be put at first
eddy16112 Nov 25, 2015
e5d3441
move ddt iov to cuda iov into a function
eddy16112 Nov 25, 2015
d61f424
merge iov cached and non-cached
eddy16112 Nov 30, 2015
41edca1
for non cached iov, if there is no enough cuda iov space, break
eddy16112 Dec 1, 2015
5cf6dba
cached iov is working for count = 1
eddy16112 Nov 7, 2015
9386ffb
cache the entire cuda iov
eddy16112 Nov 11, 2015
540e448
now cuda iov is {nc_disp, c_disp}
eddy16112 Nov 13, 2015
180382b
clean up kernel, put variables uses multiple times into register
eddy16112 Nov 13, 2015
9ba68ca
cached cuda iov is working for count > 1
eddy16112 Nov 14, 2015
12a3ade
move the cuda iov caching into a seperate function
eddy16112 Nov 16, 2015
a39bc35
these two variables are useless now
eddy16112 Nov 16, 2015
1c3fb45
fix a bug for ib, current count of convertor should be set in
eddy16112 Nov 16, 2015
02d6560
cleanup, move cudamalloc into cache cuda iov
eddy16112 Nov 17, 2015
47cd909
rearrange varibles
eddy16112 Nov 17, 2015
c953c5b
if cuda_iov is not big enough, use realloc. However, cudaMallocHost does
eddy16112 Nov 17, 2015
5d3cca0
make sure check pointer is not NULL before free it
eddy16112 Nov 18, 2015
9517b4d
rewrite non cached iov, make it unified with cached iov
eddy16112 Nov 25, 2015
dfcab4a
Merge branch 'cuda' of https://github.com/eddy16112/ompi into cuda
eddy16112 Dec 2, 2015
d242b0c
apply loop unroll on packing kernels
eddy16112 Feb 5, 2016
2e8b414
apply unroll to unpack
eddy16112 Feb 23, 2016
0c680c2
fix a cuda event bug. cudaStreamWaitEvent is not blocking call.
eddy16112 Feb 23, 2016
ad0d5f1
Merge pull request #6 from ICLDisco/master
eddy16112 Feb 26, 2016
b6d56eb
new vector kernel
eddy16112 Feb 26, 2016
48b2d06
Add GPU packing and unpacking
Nov 7, 2014
3f3ee94
indexed datatype new, bonus stask support.
eddy16112 Nov 14, 2014
34f4a3b
RDMA send is now working.
eddy16112 Apr 9, 2015
fb10144
Add support for vector datatype. Add pipeline.
eddy16112 Apr 22, 2015
ae49135
fix gpu memory and vector datatype
eddy16112 May 1, 2015
ef54b4d
unrestricted GPU. Instead of forcing everything to go on
bosilca May 7, 2015
653f54d
Using globally defined indexes lead to several synchronization
bosilca Jun 18, 2015
10f5543
Generate the Makefile. It will now be placed in the bindir
bosilca Jun 18, 2015
96ec3c5
This file was certainly not supposed to be here. There is NO valid
bosilca Jun 18, 2015
d9ca4ae
Add the capability to install the generated library and other
bosilca Jun 18, 2015
805938d
Open the datatype CUDA library from a default install location.
bosilca Jun 18, 2015
0b4c5df
Add a patch from Rolf fixing 2 issues:
bosilca Jun 30, 2015
b74997e
clean up code in pack and unpack
eddy16112 Aug 19, 2015
c182b30
big changes, now pack is driven by receiver by active message
eddy16112 Aug 22, 2015
d131f81
intel test working
eddy16112 Aug 31, 2015
c5fb939
fix a bug when buffer is not big enough for whole ddt
eddy16112 Aug 31, 2015
bcb1e05
if data in different gpu, instead of copy direct from one to the other,
eddy16112 Sep 2, 2015
7a86b4b
now we can use cudamemcpy2d
eddy16112 Sep 8, 2015
5f2aac5
enable zero copy + fix GPU buffer bug
eddy16112 Sep 8, 2015
eee322e
put pipeline size into mca
eddy16112 Sep 14, 2015
630e831
Upon datatype commit create a list of iovec representing a single
bosilca Sep 15, 2015
c3016bc
contiguous vs non-contiguous is working
eddy16112 Sep 17, 2015
39d548a
Fix pipeline bug
eddy16112 Sep 17, 2015
817a4fc
now we are able to pack directly to remote buffer if receiver is
eddy16112 Sep 18, 2015
ea582c5
add ddt_benchmark
eddy16112 Sep 29, 2015
0a0df96
modify for matrix transpose
eddy16112 Oct 2, 2015
58371c8
enable vector
eddy16112 Oct 2, 2015
277c8bd
receiver now will send msg back to sender for buffer reuse
eddy16112 Oct 6, 2015
4591656
fix zerocopy
eddy16112 Oct 9, 2015
c5add7e
offset instead of actual addess, and lots of clean up for unused
eddy16112 Oct 22, 2015
38db0e6
rewrite pipeline
eddy16112 Oct 25, 2015
0ab564b
s up and running. PUT size in an MCA parameters.
eddy16112 Oct 26, 2015
50abfc8
less bugs
eddy16112 Oct 27, 2015
7c86f4c
fix pipelining for non-contiguous to contiguous
eddy16112 Oct 27, 2015
06a07a5
opal_datatype is chnaged, so we need more space
Oct 27, 2015
986e5c9
reorder datatypes to cache boundaries
Oct 27, 2015
08f69f6
slience warnings
eddy16112 Oct 28, 2015
de1ef4e
remove smcuda btl calls from pml ob1
eddy16112 Oct 28, 2015
b60bae5
this file is not used anymore
eddy16112 Oct 28, 2015
351bce9
cuda ddt support is able to turn itself off. Make it support
eddy16112 Oct 29, 2015
aa57116
fix a cuda stream bug for iov, remove some stream syncs
eddy16112 Oct 30, 2015
5eb7bf1
in openib, disable rdma for non-contiguous gpu data
eddy16112 Nov 4, 2015
aa24f4c
move ddt kernel support function pointer into opal_datatype_cuda.c
eddy16112 Nov 4, 2015
1b1d827
rename some functions
eddy16112 Nov 5, 2015
3235663
check point
eddy16112 Nov 6, 2015
f3d37d8
Add support for caching the unpacked datatype description
bosilca Nov 7, 2015
e4e11bc
check point use raw_cached, but cuda iov caching is not enabled
eddy16112 Nov 7, 2015
bbd221f
check point, split iov into two version, non-cached and cached
eddy16112 Nov 8, 2015
a741011
check point iov cache
eddy16112 Nov 8, 2015
7b87adb
another checkpoint
eddy16112 Nov 9, 2015
270898b
check point, cuda iov is cached, but not used for pack/unpack
eddy16112 Nov 9, 2015
ee0408f
check point, ready to use cached cuda iov
eddy16112 Nov 10, 2015
b76bf60
checkpoint, cached cuda iov is working with multiple send, but not for
eddy16112 Nov 10, 2015
a0e9493
checkpoint, fix a bug for partial unpack
eddy16112 Nov 11, 2015
c1f5959
checkpoint, fix unpack size
eddy16112 Nov 11, 2015
5b63994
checkpoint, during unpack, cache the entire iov before unpack
eddy16112 Nov 11, 2015
fb68b99
another checkpoint
eddy16112 Nov 12, 2015
64e2a62
checkpoint , remove unnecessary cuda stream sync
eddy16112 Nov 12, 2015
39de9e0
use bit to replace %
eddy16112 Nov 13, 2015
f17c5f8
rollback to use %, not bit, since it is faster, not sure why
eddy16112 Nov 13, 2015
4ea326e
now cuda iov is {nc_disp, c_disp}
eddy16112 Nov 13, 2015
491dd73
clean up kernel, put variables uses multiple times into register
eddy16112 Nov 13, 2015
6cc7ada
another checkpoint
eddy16112 Nov 14, 2015
998a072
now convertor->count > 1 is woring
eddy16112 Nov 14, 2015
83d4858
move the cuda iov caching into a seperate function
eddy16112 Nov 16, 2015
f49dae4
these two variables are useless now
eddy16112 Nov 16, 2015
ef04c97
fix a bug for ib, current count of convertor should be set in
eddy16112 Nov 16, 2015
189fa15
cleanup, move cudamalloc into cache cuda iov
eddy16112 Nov 17, 2015
56eeffb
rearrange varibles
eddy16112 Nov 17, 2015
84f7abb
if cuda_iov is not big enough, use realloc. However, cudaMallocHost does
eddy16112 Nov 17, 2015
65424d0
make sure check pointer is not NULL before free it
eddy16112 Nov 18, 2015
5d316d9
checkpoint, rewrite non-cached version
eddy16112 Nov 25, 2015
02c8b7f
fix for non cached iov
eddy16112 Nov 25, 2015
bb807fc
fix the non cached iov, set position should be put at first
eddy16112 Nov 25, 2015
842cc3f
move ddt iov to cuda iov into a function
eddy16112 Nov 25, 2015
6df01a5
merge iov cached and non-cached
eddy16112 Nov 30, 2015
da23f82
for non cached iov, if there is no enough cuda iov space, break
eddy16112 Dec 1, 2015
880a233
cached iov is working for count = 1
eddy16112 Nov 7, 2015
7b26aaa
cache the entire cuda iov
eddy16112 Nov 11, 2015
6af6658
now cuda iov is {nc_disp, c_disp}
eddy16112 Nov 13, 2015
63e148e
clean up kernel, put variables uses multiple times into register
eddy16112 Nov 13, 2015
c75393f
cached cuda iov is working for count > 1
eddy16112 Nov 14, 2015
11d4a5b
move the cuda iov caching into a seperate function
eddy16112 Nov 16, 2015
1e29fc0
these two variables are useless now
eddy16112 Nov 16, 2015
1bac78c
fix a bug for ib, current count of convertor should be set in
eddy16112 Nov 16, 2015
686c90e
cleanup, move cudamalloc into cache cuda iov
eddy16112 Nov 17, 2015
85dad6c
rearrange varibles
eddy16112 Nov 17, 2015
4c6c0e4
if cuda_iov is not big enough, use realloc. However, cudaMallocHost does
eddy16112 Nov 17, 2015
2120edd
make sure check pointer is not NULL before free it
eddy16112 Nov 18, 2015
98fc62c
rewrite non cached iov, make it unified with cached iov
eddy16112 Nov 25, 2015
eb143dc
apply loop unroll on packing kernels
eddy16112 Feb 5, 2016
b45b646
apply unroll to unpack
eddy16112 Feb 23, 2016
4037554
fix a cuda event bug. cudaStreamWaitEvent is not blocking call.
eddy16112 Feb 23, 2016
e6c765e
new vector kernel
eddy16112 Feb 26, 2016
e981580
Merge branch 'cuda' of https://github.com/eddy16112/ompi into cuda
eddy16112 Feb 26, 2016
2b0048f
fix a if CUDA_41 error
eddy16112 Feb 26, 2016
d22e54a
clean up a if
eddy16112 Mar 1, 2016
4 changes: 4 additions & 0 deletions configure.ac
@@ -1357,6 +1357,10 @@ m4_ifdef([project_oshmem],

opal_show_subtitle "Final output"

if test "$OPAL_cuda_support" != "0"; then
AC_CONFIG_FILES([opal/datatype/cuda/Makefile])
fi

AC_CONFIG_FILES([
Makefile

2 changes: 1 addition & 1 deletion ompi/datatype/ompi_datatype.h
@@ -94,7 +94,7 @@ OMPI_DECLSPEC OBJ_CLASS_DECLARATION(ompi_datatype_t);
/* Using set constant for padding of the DATATYPE handles because the size of
* base structure is very close to being the same no matter the bitness.
*/
#define PREDEFINED_DATATYPE_PAD (512)
#define PREDEFINED_DATATYPE_PAD (1024)

struct ompi_predefined_datatype_t {
struct ompi_datatype_t dt;
134 changes: 126 additions & 8 deletions ompi/mca/pml/ob1/pml_ob1_cuda.c
@@ -37,11 +37,24 @@
#include "ompi/mca/bml/base/base.h"
#include "ompi/memchecker.h"

#include "opal/datatype/opal_datatype_cuda.h"
#include "opal/mca/common/cuda/common_cuda.h"
#include "opal/mca/btl/smcuda/btl_smcuda.h"

#define CUDA_DDT_WITH_RDMA 1

size_t mca_pml_ob1_rdma_cuda_btls(
mca_bml_base_endpoint_t* bml_endpoint,
unsigned char* base,
size_t size,
mca_pml_ob1_com_btl_t* rdma_btls);

int mca_pml_ob1_rdma_cuda_btl_register_data(
mca_pml_ob1_com_btl_t* rdma_btls,
uint32_t num_btls_used,
struct opal_convertor_t *pack_convertor, uint8_t pack_required, int32_t gpu_device);

size_t mca_pml_ob1_rdma_cuda_avail(mca_bml_base_endpoint_t* bml_endpoint);

int mca_pml_ob1_cuda_need_buffers(void * rreq,
mca_btl_base_module_t* btl);
@@ -56,16 +69,18 @@ int mca_pml_ob1_send_request_start_cuda(mca_pml_ob1_send_request_t* sendreq,
mca_bml_base_btl_t* bml_btl,
size_t size) {
int rc;
#if OPAL_CUDA_GDR_SUPPORT
/* With some BTLs, switch to RNDV from RGET at large messages */
if ((sendreq->req_send.req_base.req_convertor.flags & CONVERTOR_CUDA) &&
(sendreq->req_send.req_bytes_packed > (bml_btl->btl->btl_cuda_rdma_limit - sizeof(mca_pml_ob1_hdr_t)))) {
return mca_pml_ob1_send_request_start_rndv(sendreq, bml_btl, 0, 0);
}
#endif /* OPAL_CUDA_GDR_SUPPORT */
int32_t local_device = 0;

sendreq->req_send.req_base.req_convertor.flags &= ~CONVERTOR_CUDA;
struct opal_convertor_t *convertor = &(sendreq->req_send.req_base.req_convertor);
if (opal_convertor_need_buffers(&sendreq->req_send.req_base.req_convertor) == false) {
#if OPAL_CUDA_GDR_SUPPORT
/* With some BTLs, switch to RNDV from RGET at large messages */
if ((sendreq->req_send.req_bytes_packed > (bml_btl->btl->btl_cuda_rdma_limit - sizeof(mca_pml_ob1_hdr_t)))) {
sendreq->req_send.req_base.req_convertor.flags |= CONVERTOR_CUDA;
return mca_pml_ob1_send_request_start_rndv(sendreq, bml_btl, 0, 0);
}
#endif /* OPAL_CUDA_GDR_SUPPORT */
unsigned char *base;
opal_convertor_get_current_pointer( &sendreq->req_send.req_base.req_convertor, (void**)&base );
/* Set flag back */
@@ -75,6 +90,13 @@ int mca_pml_ob1_send_request_start_cuda(mca_pml_ob1_send_request_t* sendreq,
base,
sendreq->req_send.req_bytes_packed,
sendreq->req_rdma))) {

rc = mca_common_cuda_get_device(&local_device);
if (rc != 0) {
opal_output(0, "Failed to get the GPU device ID, rc= %d\n", rc);
return rc;
}
mca_pml_ob1_rdma_cuda_btl_register_data(sendreq->req_rdma, sendreq->req_rdma_cnt, convertor, 0, local_device);
rc = mca_pml_ob1_send_request_start_rdma(sendreq, bml_btl,
sendreq->req_send.req_bytes_packed);
if( OPAL_UNLIKELY(OMPI_SUCCESS != rc) ) {
@@ -92,7 +114,48 @@ int mca_pml_ob1_send_request_start_cuda(mca_pml_ob1_send_request_t* sendreq,
/* Do not send anything with first rendezvous message as copying GPU
* memory into RNDV message is expensive. */
sendreq->req_send.req_base.req_convertor.flags |= CONVERTOR_CUDA;
rc = mca_pml_ob1_send_request_start_rndv(sendreq, bml_btl, 0, 0);
if ((mca_pml_ob1_rdma_cuda_avail(sendreq->req_endpoint) != 0) &&
(opal_datatype_cuda_kernel_support == 1) &&
(bml_btl->btl->btl_cuda_ddt_allow_rdma == 1)) {
unsigned char *base;
size_t buffer_size = 0;
if (convertor->local_size > bml_btl->btl->btl_cuda_ddt_pipeline_size) {
buffer_size = bml_btl->btl->btl_cuda_ddt_pipeline_size * bml_btl->btl->btl_cuda_ddt_pipeline_depth;
} else {
buffer_size = convertor->local_size;
}
base = opal_cuda_malloc_gpu_buffer(buffer_size, 0);
convertor->gpu_buffer_ptr = base;
convertor->gpu_buffer_size = buffer_size;
sendreq->req_send.req_bytes_packed = convertor->local_size;
opal_output(0, "malloc GPU BUFFER %p for pack, local size %lu, pipeline size %lu, depth %d\n", base, convertor->local_size, bml_btl->btl->btl_cuda_ddt_pipeline_size, bml_btl->btl->btl_cuda_ddt_pipeline_depth);
if( 0 != (sendreq->req_rdma_cnt = (uint32_t)mca_pml_ob1_rdma_cuda_btls(
sendreq->req_endpoint,
base,
sendreq->req_send.req_bytes_packed,
sendreq->req_rdma))) {

rc = mca_common_cuda_get_device(&local_device);
if (rc != 0) {
opal_output(0, "Failed to get the GPU device ID, rc=%d\n", rc);
return rc;
}
mca_pml_ob1_rdma_cuda_btl_register_data(sendreq->req_rdma, sendreq->req_rdma_cnt, convertor, 1, local_device);

rc = mca_pml_ob1_send_request_start_rdma(sendreq, bml_btl,
sendreq->req_send.req_bytes_packed);

if( OPAL_UNLIKELY(OMPI_SUCCESS != rc) ) {
mca_pml_ob1_free_rdma_resources(sendreq);
}
} else {
rc = mca_pml_ob1_send_request_start_rndv(sendreq, bml_btl, 0, 0);
}


} else {
rc = mca_pml_ob1_send_request_start_rndv(sendreq, bml_btl, 0, 0);
}
}
return rc;
}
@@ -152,6 +215,61 @@ size_t mca_pml_ob1_rdma_cuda_btls(
return num_btls_used;
}

int mca_pml_ob1_rdma_cuda_btl_register_data(
mca_pml_ob1_com_btl_t* rdma_btls,
uint32_t num_btls_used,
struct opal_convertor_t *pack_convertor, uint8_t pack_required, int32_t gpu_device)
{
uint32_t i;
for (i = 0; i < num_btls_used; i++) {
mca_btl_base_registration_handle_t *handle = rdma_btls[i].btl_reg;
mca_mpool_common_cuda_reg_t *cuda_reg = (mca_mpool_common_cuda_reg_t *)
((intptr_t) handle - offsetof (mca_mpool_common_cuda_reg_t, data));
// printf("base %p\n", cuda_reg->base.base);
// for (j = 0; j < MAX_IPC_EVENT_HANDLE; j++) {
// mca_common_cuda_geteventhandle(&convertor->pipeline_event[j], j, (mca_mpool_base_registration_t *)cuda_reg);
// // printf("event %lu, j %d\n", convertor->pipeline_event[j], j);
// }
cuda_reg->data.pack_required = pack_required;
cuda_reg->data.gpu_device = gpu_device;
cuda_reg->data.pack_convertor = pack_convertor;

}
return 0;
}

size_t mca_pml_ob1_rdma_cuda_avail(mca_bml_base_endpoint_t* bml_endpoint)
{
int num_btls = mca_bml_base_btl_array_get_size(&bml_endpoint->btl_send);
double weight_total = 0;
int num_btls_used = 0, n;

/* shortcut when there are no rdma capable btls */
if(num_btls == 0) {
return 0;
}

/* check to see if memory is registered */
for(n = 0; n < num_btls && num_btls_used < mca_pml_ob1.max_rdma_per_request;
n++) {
mca_bml_base_btl_t* bml_btl =
mca_bml_base_btl_array_get_index(&bml_endpoint->btl_send, n);

if (bml_btl->btl_flags & MCA_BTL_FLAGS_CUDA_GET) {
weight_total += bml_btl->btl_weight;
num_btls_used++;
}
}

/* if we don't use leave_pinned and all BTLs that already have this memory
 * registered amount to less than half of available bandwidth - fall back to
 * pipeline protocol */
if(0 == num_btls_used || (!mca_pml_ob1.leave_pinned && weight_total < 0.5))
return 0;

return num_btls_used;
}

int mca_pml_ob1_cuda_need_buffers(void * rreq,
mca_btl_base_module_t* btl)
{
7 changes: 5 additions & 2 deletions ompi/mca/pml/ob1/pml_ob1_recvreq.c
@@ -649,8 +649,11 @@ void mca_pml_ob1_recv_request_progress_rget( mca_pml_ob1_recv_request_t* recvreq
if (mca_pml_ob1_cuda_need_buffers(recvreq, btl))
#endif /* OPAL_CUDA_SUPPORT */
{
mca_pml_ob1_recv_request_ack(recvreq, &hdr->hdr_rndv, 0);
return;
/* need more careful check here */
if (! (recvreq->req_recv.req_base.req_convertor.flags & CONVERTOR_CUDA)) {
mca_pml_ob1_recv_request_ack(recvreq, &hdr->hdr_rndv, 0);
return;
}
}
}

18 changes: 17 additions & 1 deletion ompi/mca/pml/ob1/pml_ob1_sendreq.c
@@ -675,10 +675,26 @@ int mca_pml_ob1_send_request_start_rdma( mca_pml_ob1_send_request_t* sendreq,
MCA_PML_OB1_HDR_FLAGS_PIN);
}

#if OPAL_CUDA_SUPPORT
if ( (sendreq->req_send.req_base.req_convertor.flags & CONVERTOR_CUDA)) {
sendreq->req_send.req_base.req_convertor.flags &= ~CONVERTOR_CUDA;
if (opal_convertor_need_buffers(&sendreq->req_send.req_base.req_convertor) == true) {
data_ptr = sendreq->req_send.req_base.req_convertor.gpu_buffer_ptr;
printf("START RMDA data_ptr %p\n", data_ptr);
} else {
opal_convertor_get_current_pointer (&sendreq->req_send.req_base.req_convertor, &data_ptr);
}
/* Set flag back */
sendreq->req_send.req_base.req_convertor.flags |= CONVERTOR_CUDA;
} else {
opal_convertor_get_current_pointer (&sendreq->req_send.req_base.req_convertor, &data_ptr);
}
#else
/* at this time ob1 does not support non-contiguous gets. the convertor represents a
* contiguous block of memory */
opal_convertor_get_current_pointer (&sendreq->req_send.req_base.req_convertor, &data_ptr);

#endif

local_handle = sendreq->req_rdma[0].btl_reg;

/* allocate an rdma fragment to keep track of the request size for use in the fin message */
2 changes: 1 addition & 1 deletion opal/datatype/Makefile.am
@@ -63,7 +63,7 @@ libdatatype_la_SOURCES = \
opal_datatype_pack.c \
opal_datatype_position.c \
opal_datatype_resize.c \
opal_datatype_unpack.c
opal_datatype_unpack.c

libdatatype_la_LIBADD = libdatatype_reliable.la

60 changes: 60 additions & 0 deletions opal/datatype/cuda/Makefile.in
@@ -0,0 +1,60 @@
@SET_MAKE@

AM_CPPFLAGS = @common_cuda_CPPFLAGS@
srcdir = @srcdir@
top_builddir = @top_builddir@
top_srcdir = @top_srcdir@
VPATH = @srcdir@

NVCC = nvcc
ARCH = @AR@
ARCHFLAGS = cr
STLIB ?= opal_datatype_cuda_kernel.a
DYLIB ?= opal_datatype_cuda_kernel.so
EXTLIB = -L$(top_builddir)/opal/datatype/.libs -ldatatype -L$(top_builddir)/opal/.libs -lopen-pal -L/usr/local/cuda/lib -lcuda
subdir = opal/datatype/cuda

CC = nvcc
CFLAGS = -I$(top_builddir)/opal/include -I$(top_srcdir)/opal/include -I$(top_builddir) -I$(top_srcdir) -gencode arch=compute_35,code=sm_35 --compiler-options '-fPIC @CFLAGS@'
LDFLAGS = -shared --compiler-options '-fPIC @LDFLAGS@'

SRC := \
opal_datatype_cuda.cu \
opal_datatype_pack_cuda_kernel.cu \
opal_datatype_pack_cuda_wrapper.cu \
opal_datatype_unpack_cuda_kernel.cu \
opal_datatype_unpack_cuda_wrapper.cu

OBJ := $(SRC:.cu=.o)

.PHONY: all clean cleanall

all: Makefile $(STLIB) $(DYLIB)

Makefile: $(srcdir)/Makefile.in $(top_builddir)/config.status
@case '$?' in \
*config.status*) \
cd $(top_builddir) && $(MAKE) $(AM_MAKEFLAGS) am--refresh;; \
*) \
echo ' cd $(top_builddir) && $(SHELL) ./config.status $(subdir)/$@ $(am__depfiles_maybe)'; \
cd $(top_builddir) && $(SHELL) ./config.status $(subdir)/$@ $(am__depfiles_maybe);; \
esac;

$(STLIB): $(OBJ)
$(ARCH) $(ARCHFLAGS) $@ $(OBJ)
@RANLIB@ $@

$(DYLIB): $(OBJ)
$(NVCC) $(LDFLAGS) $(EXTLIB) -o $(DYLIB) $(OBJ)

%.o: %.cu
$(NVCC) $(CFLAGS) $(EXTLIB) $(INC) -c $< -o $@

install: $(DYLIB)
cp -f $(DYLIB) @OMPI_WRAPPER_LIBDIR@/

clean:
rm -f $(OBJ)

cleanall: clean
rm -f $(STLIB) $(DYLIB)