parallel processing - Moving elements between arrays in a CUDA kernel
I am stuck on a simple thing and need an opinion. I have a simple CUDA kernel that copies elements between two arrays (there is a reason I want to do it this way):
    __global__ void kernelexample( float* A, float* B, float* C, int rows, int cols )
    {
        int r = blockIdx.y * blockDim.y + threadIdx.y; // vertical dim in block
        int c = blockIdx.x * blockDim.x + threadIdx.x; // horizontal dim in block

        if ( r < rows && c < cols) {
            // row-major order
            C[ c + r*cols ] = A[ c + r*cols ];
        }
        //__syncthreads();
    }

I am getting unsatisfying results. Any suggestions, please?
The kernel is called like this:
    int numelements = rows * cols;
    int threadsperblock = 256;
    int blockspergrid = ceil( (double) numelements / threadsperblock);
    kernelexample<<<blockspergrid , threadsperblock >>>( d_a, d_b, d_c, rows, cols );

Updated (after Eric's help):
    int numelements = rows * cols;
    int threadsperblock = 32; // talonmies' comment
    int blockspergrid = ceil( (double) numelements / threadsperblock);
    dim3 dimblock( threadsperblock,threadsperblock );
    dim3 dimgrid( blockspergrid,blockspergrid );
    kernelexample<<<dimblock, dimblock>>>( d_a, d_b, d_c, rows, cols );

For example, given the matrix
A = [ 0 1 2 1 0 2 0 0 2 0 0 1 2 1 2 2 2 2 0 0 2 1 2 2 3 1 2 2 2 2 ]

the returned matrix C is

C = [ 0 1 2 1 0 2 0 0 2 0 0 1 2 1 2 2 2 2 0 0 2 1 2 2 3 1 2 2 2 2 ]
C/C++ uses 0-based indexing by default.
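To make the 0-based point concrete, here is a small host-side sketch of the index check. The helper names (in_bounds, in_bounds_off_by_one, row_major_index) and the 5 x 6 shape in the comments are my own assumptions for illustration; they are not from the original post. With 0-based indexing the valid rows are 0..rows-1 and the valid columns are 0..cols-1, so a guard written with <= lets one extra row and column through.

```cpp
#include <cassert>

// Hypothetical helper: the guard a thread should apply before touching
// element (r, c). With 0-based indexing the last valid row is rows - 1
// and the last valid column is cols - 1.
bool in_bounds(int r, int c, int rows, int cols) {
    return r >= 0 && r < rows && c >= 0 && c < cols;
}

// The faulty variant for comparison: it also accepts r == rows and
// c == cols, which lie one past the end of each dimension.
bool in_bounds_off_by_one(int r, int c, int rows, int cols) {
    return r <= rows && c <= cols;
}

// Row-major offset of element (r, c); the same expression the kernel uses.
// For an assumed 5 x 6 matrix the largest valid offset is 5 + 4*6 = 29,
// whereas (5, 6), accepted by the faulty guard, would map to 6 + 5*6 = 36.
int row_major_index(int r, int c, int rows, int cols) {
    assert(in_bounds(r, c, rows, cols));
    return c + r * cols;
}
```

For a 5 x 6 matrix, in_bounds(4, 5, 5, 6) is true and in_bounds(5, 6, 5, 6) is false, while the off-by-one version accepts both.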
Try the following:

1) Change from

    if ( r <= rows && c <= cols) {

to

    if ( r < rows && c < cols) {

2) Delete __syncthreads(); since you don't share data between threads.

3) Correct the block and grid settings from 1-D to 2-D, since you use both .x and .y in the kernel.

4) Remove the unused parameter float* B if you don't need it.

These changes should solve the problem.
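As a rough illustration of point 3, the sketch below sizes a 2-D grid per dimension and replays the kernel's indexing on the CPU. Everything named here (Dim2 as a stand-in for dim3, ceil_div, grid_for, emulate_copy) is hypothetical and only meant to show the arithmetic; it is not the asker's code and not a CUDA API. The idea is that the x dimension has to cover the columns and the y dimension the rows, each with its own ceiling division, rather than dividing rows * cols by the block size once and reusing that number for both dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for CUDA's dim3 so the sketch compiles as plain C++.
struct Dim2 { int x; int y; };

// Ceiling division: how many blocks of size `block` are needed to cover n items.
int ceil_div(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// Grid for a 2-D launch: x covers the columns and y covers the rows, matching
// a kernel that derives c from the .x components and r from the .y components.
Dim2 grid_for(int rows, int cols, Dim2 block) {
    return Dim2{ceil_div(cols, block.x), ceil_div(rows, block.y)};
}

// CPU replay of the copy kernel over every (block, thread) pair.
// Returns how many times each element of `out` was written.
std::vector<int> emulate_copy(const std::vector<float>& a, std::vector<float>& out,
                              int rows, int cols, Dim2 grid, Dim2 block) {
    assert(a.size() == static_cast<std::size_t>(rows * cols));
    assert(out.size() == a.size());
    std::vector<int> writes(a.size(), 0);
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    int r = by * block.y + ty;  // blockIdx.y * blockDim.y + threadIdx.y
                    int c = bx * block.x + tx;  // blockIdx.x * blockDim.x + threadIdx.x
                    if (r < rows && c < cols) {
                        out[c + r * cols] = a[c + r * cols];  // row-major order
                        ++writes[c + r * cols];
                    }
                }
    return writes;
}
```

In the real launch the same numbers would go into dim3 values, for example dim3 dimblock( 32, 32 ); dim3 dimgrid( ceil_div(cols, 32), ceil_div(rows, 32) ); followed by kernelexample<<<dimgrid, dimblock>>>( ... ). Note that the grid comes first in the <<< >>> configuration; the updated launch in the question passes dimblock in both positions.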
For more information, see the copy() kernel in the following file of the CUDA sample code:
$CUDA_HOME/samples/6_Advanced/transpose/transpose.cu