Parallel processing - Moving elements between arrays in a CUDA kernel
I am stuck on a simple thing and need an opinion. I have a simple CUDA kernel that copies elements between two arrays (there is a reason I want to do it this way):
// A: source matrix, C: destination matrix (B is not used here)
__global__ void kernelexample( float* A, float* B, float* C, int rows, int cols )
{
    int r = blockIdx.y * blockDim.y + threadIdx.y; // global row index of this thread
    int c = blockIdx.x * blockDim.x + threadIdx.x; // global column index of this thread

    if ( r < rows && c < cols) {
        // row-major order
        C[ c + r*cols ] = A[ c + r*cols ];
    }
    //__syncthreads();
}
I am getting unsatisfying results. Any suggestions, please?
The kernel is called like this:
int numelements = rows * cols;
int threadsperblock = 256;
int blockspergrid = ceil( (double) numelements / threadsperblock);
kernelexample<<<blockspergrid , threadsperblock >>>( d_a, d_b, d_c, rows, cols );
Updated (after Eric's help):
int numelements = rows * cols;
int threadsperblock = 32; // talonmies' comment
int blockspergrid = ceil( (double) numelements / threadsperblock);
dim3 dimblock( threadsperblock,threadsperblock );
dim3 dimgrid( blockspergrid,blockspergrid );
kernelexample<<<dimblock, dimblock>>>( d_a, d_b, d_c, rows, cols );
For example, given the matrix
a = [ 0 1 2 1 0 2 0 0 2 0 0 1 2 1 2 2 2 2 0 0 2 1 2 2 3 1 2 2 2 2 ]
the returned matrix c is
c = [ 0 1 2 1 0 2 0 0 2 0 0 1 2 1 2 2 2 2 0 0 2 1 2 2 3 1 2 2 2 2 ]
C/C++ uses 0-based indexing by default.
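To make the 0-based point concrete, here is a small host-side C++ sketch (plain C++, not device code) of the row-major index arithmetic the kernel uses. The helper names `flat_index`, `in_bounds` and `in_bounds_inclusive` are made up for illustration and are not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>

// Row-major offset of element (r, c) in a matrix with `cols` columns.
// Hypothetical helper, same arithmetic as `c + r*cols` in the kernel.
inline std::size_t flat_index(int r, int c, int cols) {
    assert(r >= 0 && c >= 0 && cols > 0);
    return static_cast<std::size_t>(r) * cols + c;
}

// Exclusive guard: valid rows are 0..rows-1 and valid columns are 0..cols-1.
inline bool in_bounds(int r, int c, int rows, int cols) {
    return r < rows && c < cols;
}

// Inclusive guard, kept only to show what it lets through.
inline bool in_bounds_inclusive(int r, int c, int rows, int cols) {
    return r <= rows && c <= cols;
}
```

With a 2 x 3 matrix (6 elements), the inclusive guard accepts (r, c) = (2, 3), whose offset is 9 and lies outside the buffer. It also accepts (0, 3), whose offset 3 is the same as that of the valid element (1, 0), so one thread would touch another thread's element.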
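Try the following to solve the problem:
1) Change the bounds check from
if ( r <= rows && c <= cols) {
to
if ( r < rows && c < cols) {
so that threads with r == rows or c == cols are excluded.
2) Delete __syncthreads(); since you don't share data between threads.
3) Change the block and grid settings from 1-D to 2-D, since the kernel uses both .x and .y.
4) Remove the second pointer argument (the float* for b) if you don't use it.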
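As a rough illustration of point 3, the sketch below sizes a 2-D grid per dimension (x covers columns, y covers rows) and replays the copy kernel's index math on the CPU. It is plain host C++ written under assumptions, not the actual launch code: `Dim2` is a hypothetical stand-in for the x and y fields of CUDA's `dim3`, and `ceil_div`, `grid_for` and `emulate_copy` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the x/y fields of CUDA's dim3.
struct Dim2 { int x; int y; };

// Integer ceiling division: smallest n such that n * b >= a.
inline int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Grid size for a 2-D launch: blocks along x cover the columns,
// blocks along y cover the rows.
inline Dim2 grid_for(int rows, int cols, Dim2 block) {
    return Dim2{ceil_div(cols, block.x), ceil_div(rows, block.y)};
}

// CPU emulation of the copy kernel: visits every (block, thread) pair
// and applies the same guard and row-major indexing as the kernel.
// Returns how many times each destination element was written.
inline std::vector<int> emulate_copy(const std::vector<float>& src,
                                     std::vector<float>& dst,
                                     int rows, int cols, Dim2 grid, Dim2 block) {
    assert(src.size() == static_cast<std::size_t>(rows) * cols);
    assert(dst.size() == src.size());
    std::vector<int> writes(src.size(), 0);
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    int r = by * block.y + ty;  // blockIdx.y * blockDim.y + threadIdx.y
                    int c = bx * block.x + tx;  // blockIdx.x * blockDim.x + threadIdx.x
                    if (r < rows && c < cols) {
                        dst[c + r * cols] = src[c + r * cols];
                        ++writes[c + r * cols];
                    }
                }
    return writes;
}
```

For an assumed 5 x 6 matrix and a 4 x 4 block, `grid_for` gives a 2 x 2 grid, and the emulation writes each of the 30 elements exactly once. In the real launch the same two sizes would be passed as `<<<grid, block>>>`, with the grid first and the block second.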
For more info, see the copy() kernel in the following file of the CUDA sample code:
$CUDA_HOME/samples/6_Advanced/transpose/transpose.cu