c++ - CUDA Constant Memory Error


I am trying out a sample code for constant memory with CUDA 5.5. I have 2 constant arrays of size 3000 each, and a global array x of size N. I want to compute

y[tid] = x[tid]*a[tid%3000] + b[tid%3000] 

Here is the code.

#include <iostream>
#include <stdio.h>
using namespace std;

#include <cuda.h>

__device__ __constant__ int a[3000];
__device__ __constant__ int b[3000];

__global__ void kernel( int *dc_a, int *dc_b, int *x, int *out, int n)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<n )
    {
        out[tid] = dc_a[tid%3000]*x[tid] + dc_b[tid%3000];
    }
}

int main()
{
    int n = 100000;

    // set affine constants on the host
    int *h_a, *h_b; // host vectors
    h_a = (int*) malloc( 3000*sizeof(int) );
    h_b = (int*) malloc( 3000*sizeof(int) );
    for( int i=0 ; i<3000 ; i++ )
    {
        h_a[i] = (int) (drand48() * 10);
        h_b[i] = (int) (drand48() * 10);
    }

    // set x, y on the host
    int *h_x = (int*) malloc( n*sizeof(int) );
    int *h_out = (int*) malloc( n*sizeof(int) );
    for( int i=0 ; i<n ; i++ )
    {
        h_x[i] = i;
        h_out[i] = 0;
    }

    // copy a, b, x, y to the device
    int *d_x, *d_out;
    cudaMemcpyToSymbol( a, h_a, 3000 * sizeof(int) );
    cudaMemcpyToSymbol( b, h_b, 3000 * sizeof(int) );

    cudaMalloc( (void**)&d_x, n*sizeof(int) );
    cudaMemcpy( d_x, h_x, n*sizeof(int), cudaMemcpyHostToDevice );
    cudaMalloc( (void**)&d_out, n*sizeof(int) );

    // call the kernel
    kernel<<< (n+1024)/1024, 1024 >>>(a, b, d_x, d_out, n);
    cudaPeekAtLastError();
    cudaDeviceSynchronize();

    // device --> host
    cudaMemcpy( h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost );

    free(h_a);
    free(h_b);

    return 0;
}

I tried running the debugger over this code to analyze it. It turns out that on the line which copies to constant memory, the debugger gives the following output:

Coalescing of the CUDA commands output is off.
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5c5b700 (LWP 31200)]

Can someone please help me out with constant memory?

There are several problems here. It is easier to start by showing the "correct" way to use those two constant arrays, then explain why what you did doesn't work. The kernel should look like this:

__global__ void kernel(int *x, int *out, int n)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<n )
    {
        out[tid] = a[tid%3000]*x[tid] + b[tid%3000];
    }
}
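With that kernel, the host side just fills the __constant__ symbols and launches without passing them. A minimal sketch, assuming the same h_a, h_b, d_x and d_out as in your code:

// Sketch only: fill the __constant__ symbols, then launch the kernel
// without passing a or b as arguments.
cudaMemcpyToSymbol( a, h_a, 3000 * sizeof(int) );
cudaMemcpyToSymbol( b, h_b, 3000 * sizeof(int) );
kernel<<< (n+1023)/1024, 1024 >>>(d_x, d_out, n);
cudaPeekAtLastError();
cudaDeviceSynchronize();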

That is, don't try passing a and b to the kernel. The reasons are as follows:

  1. Somewhat confusingly, a and b in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel - if you want their device memory address, you must use cudaGetSymbolAddress to retrieve it at runtime (see the sketch after this list).
  2. Even if you did call cudaGetSymbolAddress and retrieve the symbols' device addresses in constant memory, you shouldn't pass them to a kernel as arguments, because doing so would not yield uniform memory access in the running kernel. Correct use of constant memory requires the compiler to emit special PTX instructions, and the compiler will only do that when it knows that a particular global memory location is in constant memory. If you pass a constant memory address by value as an argument, the __constant__ property is lost and the compiler can't know to produce the correct load instructions.
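For reference, this is roughly what retrieving a symbol's device address (point 1) looks like. It is a sketch for illustration only; for constant memory you shouldn't actually do this, for the reasons given in point 2:

// Sketch: fetch the device address behind the __constant__ symbol a.
// Passing this pointer into a kernel loses the constant-cache benefits
// described in point 2, so this is shown only for illustration.
int *d_a_ptr = NULL;
cudaGetSymbolAddress( (void**)&d_a_ptr, a );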

Once you get this working, you will find it terribly slow, and if you profile it you will find that there are high degrees of instruction replay and serialization. The whole idea of using constant memory is that you can exploit the constant cache broadcast mechanism in cases where every thread in a warp accesses the same value in constant memory. Your example is the complete opposite of that - every thread accesses a different value. Regular global memory will be faster in such a use case. Also be aware that the performance of the modulo operator on current GPUs is poor, and you should avoid it wherever possible.
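To illustrate that last point, here is a sketch (not part of the original answer) of the same computation using ordinary global memory arrays, where d_a and d_b are assumed to be plain cudaMalloc'd and cudaMemcpy'd copies of the coefficient arrays:

// Sketch: same computation using ordinary global memory arrays instead of
// __constant__ memory, since each thread reads a different element.
__global__ void kernel_global(const int *d_a, const int *d_b,
                              const int *x, int *out, int n)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<n )
    {
        int j = tid % 3000;   // modulo is still expensive; avoid it if the
                              // indexing can be restructured
        out[tid] = d_a[j]*x[tid] + d_b[j];
    }
}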

