Tesla P4 - CUDA error cudaErrorIllegalAddress #278

@altendky

I have run bladebit CUDA on my Tesla P4 before. After noticing a few other people reporting issues with the card, I tried again and was able to consistently reproduce the crash. For this first failure I used the Ubuntu binary from https://github.com/Chia-Network/bladebit/actions/runs/4129720923/jobs/7135639600#step:3:5.

https://gist.github.com/altendky/3ad52845cbb71c106dbe276f3d95bba1

Completed table 1 in 29.27 seconds with 3429027681 / 4294803672 entries ( 79.84% ).
Compressing tables 2 and 3...
 Step 1 completed step in 4.59 seconds.
CUDA error: 700 (0x2bc) cudaErrorIllegalAddress : an illegal memory access was encountered

*** Panic!!! *** Fatal Error:  
CUDA error cudaErrorIllegalAddress : an illegal memory access was encountered.
./bladebit_cuda(+0xcf8cb)[0x564cf43288cb]
./bladebit_cuda(+0xcf0af)[0x564cf43280af]
./bladebit_cuda(+0x5217a)[0x564cf42ab17a]
./bladebit_cuda(+0x52443)[0x564cf42ab443]
./bladebit_cuda(+0x36e6d)[0x564cf428fe6d]
./bladebit_cuda(+0x2e7f0)[0x564cf42877f0]
./bladebit_cuda(+0x1c98b)[0x564cf427598b]
./bladebit_cuda(+0x18245)[0x564cf4271245]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f30b9f79083]
./bladebit_cuda(+0x1974e)[0x564cf427274e]
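
The frames in this release-build trace carry no symbol names, only `module(+0xOFFSET)[0xADDRESS]` pairs. As a minimal sketch (not part of bladebit; `Frame`, `parseFrame`, and `loadBase` are hypothetical names introduced here), the C++ below parses one such line and subtracts the offset from the absolute address to recover the load base of the module. For every `./bladebit_cuda` frame above, that difference works out to 0x564cf4259000, which suggests the `+0x...` offsets are the values one would typically pass to a symbolizer such as `addr2line -e` along with a binary that has debug info.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// One parsed backtrace frame of the form "module(+0xOFFSET)[0xADDRESS]".
// Hypothetical helper type for illustration only.
struct Frame {
    std::uint64_t offset = 0;   // offset of the return address inside the module
    std::uint64_t address = 0;  // absolute address at run time
};

// Parses a single trace line. Returns false for lines that do not use the
// "(+0x...)[0x...]" shape, e.g. frames printed as "(symbol+0x...)".
bool parseFrame(const std::string& line, Frame& out) {
    const std::string openTag = "(+0x";
    const std::string midTag = ")[0x";

    const std::size_t open = line.find(openTag);
    if (open == std::string::npos) return false;

    const std::size_t offsetStart = open + openTag.size();
    const std::size_t mid = line.find(midTag, offsetStart);
    if (mid == std::string::npos) return false;

    const std::size_t addressStart = mid + midTag.size();
    const std::size_t end = line.find(']', addressStart);
    if (end == std::string::npos) return false;

    try {
        // Both fields are hexadecimal without the "0x" prefix at this point.
        out.offset = std::stoull(line.substr(offsetStart, mid - offsetStart), nullptr, 16);
        out.address = std::stoull(line.substr(addressStart, end - addressStart), nullptr, 16);
    } catch (const std::exception&) {
        return false;  // empty or non-hex field
    }
    return true;
}

// Load base of the module: absolute address minus module-relative offset.
std::uint64_t loadBase(const Frame& frame) {
    assert(frame.address >= frame.offset);
    return frame.address - frame.offset;
}
```

Feeding each `./bladebit_cuda` line through `parseFrame` and comparing the `loadBase` results is a quick sanity check that all frames belong to the same mapping before trying to symbolize the offsets. The libc frame uses the `(symbol+0x...)` form and is skipped by this parser.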

After Harold requested debug info, I made #271 to get debug builds. The following results are from the debug build at https://github.com/Chia-Network/bladebit/actions/runs/4149269955:

https://gist.github.com/altendky/25ef339f5cfd28345dd641bdd9a1e4bb

Completed table 1 in 505.43 seconds with 3429368445 / 4294952657 entries ( 79.85% ).
Compressing tables 2 and 3...
 Step 1 completed step in 40.28 seconds.
Assertion Failed @ /home/runner/work/bladebit/bladebit/cuda/GpuStreams.cpp:571 UploadArray().
fish: “./bladebit_cuda -f b0a374845f4f…” terminated by signal SIGTRAP (Trace or breakpoint trap)

ASSERT( self->outgoingSequence - self->lockSequence < 2 );

void GpuUploadBuffer::UploadArray( const void* hostBuffer, uint32 length, uint32 elementSize, uint32 srcStride, 
                                   uint32 countStride, const uint32* counts, cudaStream_t workStream )
{
    ASSERT( hostBuffer );
    ASSERT( self->outgoingSequence - self->lockSequence < 2 );
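
To make the failing condition concrete, here is a small sketch of what `outgoingSequence - lockSequence < 2` accepts and rejects. It assumes both counters are unsigned 32-bit values (the snippet uses `uint32` for its other parameters, but the field types are not shown here), and `sequenceGapOk` is a hypothetical name, not a bladebit function. What the two counters track inside `GpuUploadBuffer` is not covered by this sketch.

```cpp
#include <cstdint>

// Mirrors the asserted condition under the assumption that both counters
// are unsigned 32-bit integers. Hypothetical helper for illustration only.
bool sequenceGapOk(std::uint32_t outgoingSequence, std::uint32_t lockSequence) {
    // Unsigned subtraction wraps around, so the check passes only when
    // outgoingSequence is equal to lockSequence or exactly one ahead of it.
    const std::uint32_t gap =
        static_cast<std::uint32_t>(outgoingSequence - lockSequence);
    return gap < 2;
}
```

Under that assumption the assertion can trip in two ways: `outgoingSequence` running two or more ahead of `lockSequence`, or `lockSequence` getting ahead of `outgoingSequence`, in which case the difference wraps to a very large value. The log above does not say which of the two happened.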
