c++ - CUB (CUDA UnBound) equivalent of thrust::gather


Due to some performance issues with the Thrust libraries (see this page for more details), I am planning on re-factoring a CUDA application to use CUB instead of Thrust. Specifically, I want to replace the thrust::sort_by_key and thrust::inclusive_scan calls. At a particular point in my application I need to sort 3 arrays by key. This is how I did it with Thrust:

    thrust::sort_by_key(key_iter, key_iter + numkeys, indices);
    thrust::gather_wrapper(indices, indices + numkeys,
        thrust::make_zip_iterator(thrust::make_tuple(values1ptr, values2ptr, values3ptr)),
        thrust::make_zip_iterator(thrust::make_tuple(valuesout1ptr, valuesout2ptr, valuesout3ptr))
    );

where

  • key_iter is a thrust::device_ptr that points to the keys I want to sort by
  • indices points to a sequence (from 0 to numkeys-1) in device memory
  • values{1,2,3}ptr are device_ptrs to the values I want to sort
  • valuesout{1,2,3}ptr are device_ptrs to the sorted values

With the CUB SortPairs function I can sort a single value buffer, but not all 3 in one shot. The problem is that I don't see any CUB "gather-like" utilities. Any suggestions?

Edit:

I suppose I could implement my own gather kernel, but is there a better way to do this other than:

    template <typename Index, typename Value>
    __global__ void gather_kernel(const unsigned int n, const Index * map,
                                  const Value * src, Value * dst)
    {
        // One thread per output element.
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
        {
            dst[i] = src[map[i]];
        }
    }

The non-coalesced loads and stores make me cringe, but they are probably unavoidable without a known structure on map.
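One small refinement, offered as a sketch rather than a measured improvement: because all three value arrays are permuted by the same map, a single kernel can load map[i] once and reuse it for the three gathers. The writes and the map read stay coalesced; the reads from the source arrays are still scattered. The names gather3_kernel and gather3 are hypothetical, and thrust::device_vector is assumed here only as a convenient way to manage device memory.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

// Hypothetical fused gather: map[i] is loaded once and reused for all three arrays.
template <typename Index, typename Value>
__global__ void gather3_kernel(const unsigned int n, const Index* map,
                               const Value* src1, const Value* src2, const Value* src3,
                               Value* dst1, Value* dst2, Value* dst3)
{
    const unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
    {
        const Index j = map[i];  // coalesced read of the map
        dst1[i] = src1[j];       // scattered reads, coalesced writes
        dst2[i] = src2[j];
        dst3[i] = src3[j];
    }
}

// Hypothetical host wrapper; the destination vectors must already hold map.size() elements.
void gather3(const thrust::device_vector<int>& map,
             const thrust::device_vector<float>& s1,
             const thrust::device_vector<float>& s2,
             const thrust::device_vector<float>& s3,
             thrust::device_vector<float>& d1,
             thrust::device_vector<float>& d2,
             thrust::device_vector<float>& d3)
{
    const unsigned int n = static_cast<unsigned int>(map.size());
    if (n == 0) return;
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    gather3_kernel<<<blocks, threads>>>(n,
        thrust::raw_pointer_cast(map.data()),
        thrust::raw_pointer_cast(s1.data()),
        thrust::raw_pointer_cast(s2.data()),
        thrust::raw_pointer_cast(s3.data()),
        thrust::raw_pointer_cast(d1.data()),
        thrust::raw_pointer_cast(d2.data()),
        thrust::raw_pointer_cast(d3.data()));
    cudaDeviceSynchronize();  // wait so the caller can read the results
}
```

Here map would be the index buffer produced by sorting (key, index) pairs, for example the value output of cub::DeviceRadixSort::SortPairs.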

It seems that what you want to achieve depends on thrust::zip_iterator. You could either

  1. only replace thrust::sort_by_key with cub::DeviceRadixSort::SortPairs and keep thrust::gather, or
  2. zip values{1,2,3} into an array of structures before using cub::DeviceRadixSort::SortPairs
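As a rough sketch of the second option, assuming float values and int keys: the three arrays are zipped into an array of structures, sorted as the value buffer of cub::DeviceRadixSort::SortPairs, then unzipped. Triple, zip3, unzip3 and sort3_by_key_aos are hypothetical names, and thrust::device_vector is used only for memory management. The two-call pattern (the first call with a null temporary pointer only reports the required size) is the usual CUB device-wide convention.

```cuda
#include <cub/cub.cuh>
#include <thrust/device_vector.h>

struct Triple { float a, b, c; };  // hypothetical array-of-structures element

__global__ void zip3(int n, const float* v1, const float* v2, const float* v3, Triple* out)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = Triple{v1[i], v2[i], v3[i]};
}

__global__ void unzip3(int n, const Triple* in, float* v1, float* v2, float* v3)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { v1[i] = in[i].a; v2[i] = in[i].b; v3[i] = in[i].c; }
}

// Hypothetical helper: sorts keys ascending and permutes v1..v3 the same way.
void sort3_by_key_aos(thrust::device_vector<int>& keys,
                      thrust::device_vector<float>& v1,
                      thrust::device_vector<float>& v2,
                      thrust::device_vector<float>& v3)
{
    const int n = static_cast<int>(keys.size());
    if (n == 0) return;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    thrust::device_vector<int> keys_out(n);
    thrust::device_vector<Triple> zipped(n), zipped_out(n);
    float* p1 = thrust::raw_pointer_cast(v1.data());
    float* p2 = thrust::raw_pointer_cast(v2.data());
    float* p3 = thrust::raw_pointer_cast(v3.data());
    int* k_in = thrust::raw_pointer_cast(keys.data());
    int* k_out = thrust::raw_pointer_cast(keys_out.data());
    Triple* z_in = thrust::raw_pointer_cast(zipped.data());
    Triple* z_out = thrust::raw_pointer_cast(zipped_out.data());

    zip3<<<blocks, threads>>>(n, p1, p2, p3, z_in);

    // First call only queries the temporary storage size.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes, k_in, k_out, z_in, z_out, n);
    cudaMalloc(&d_temp, temp_bytes);
    // Second call performs the sort; each Triple travels with its key.
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes, k_in, k_out, z_in, z_out, n);

    unzip3<<<blocks, threads>>>(n, z_out, p1, p2, p3);
    cudaDeviceSynchronize();
    cudaFree(d_temp);
    keys.swap(keys_out);  // expose the sorted keys to the caller
}
```

This avoids the scattered gather entirely, but the sort now moves a 12-byte payload instead of a 4-byte index, so whether it is faster than sorting indices and gathering would need to be measured for the actual data sizes.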

Update

After reading the implementation of thrust::gather,

$CUDA_HOME/include/thrust/system/detail/generic/gather.inl 

you can see that it is only a naive kernel like

    __global__ void gather(int* index, float* in, float* out, int len) {
        int i = ...;  // global thread index
        if (i < len) { out[i] = in[index[i]]; }
    }

Then I think your code above can be replaced by a single kernel without too much effort.

In this kernel, you could first use the CUB block-wise primitive cub::BlockRadixSort<...>::SortBlockedToStriped to get the sorted indices stored in registers, and then perform a naive re-order copy, as thrust::gather does, to fill values{1,2,3}out.

Using SortBlockedToStriped rather than Sort allows coalesced writes (though not coalesced reads) when copying the values.
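A minimal sketch of that idea follows, under loud assumptions: int keys all smaller than INT_MAX (used here as padding), float values, and an input small enough to fit in one thread block's tile (BLOCK_THREADS * ITEMS_PER_THREAD items). A block-wide sort only orders the items held by one block, so larger inputs would still need a device-wide sort. The names block_sort_gather3 and sort3_one_block are hypothetical, and thrust::device_vector is used only for memory management.

```cuda
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <climits>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void block_sort_gather3(int n, const int* keys_in, int* keys_out,
                                   const float* v1, const float* v2, const float* v3,
                                   float* o1, float* o2, float* o3)
{
    using BlockSort = cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD, int>;
    __shared__ typename BlockSort::TempStorage temp_storage;

    int keys[ITEMS_PER_THREAD];
    int idx[ITEMS_PER_THREAD];

    // Blocked arrangement: thread t owns consecutive items starting at t * ITEMS_PER_THREAD.
    #pragma unroll
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        const int g = threadIdx.x * ITEMS_PER_THREAD + i;
        idx[i] = g;
        keys[i] = (g < n) ? keys_in[g] : INT_MAX;  // padding sorts to the end
    }

    // Sort (key, index) pairs; the result is left in a striped arrangement.
    BlockSort(temp_storage).SortBlockedToStriped(keys, idx);

    // Striped arrangement: item i of thread t has rank i * BLOCK_THREADS + t,
    // so consecutive threads write consecutive addresses (coalesced stores).
    #pragma unroll
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        const int r = i * BLOCK_THREADS + threadIdx.x;
        const int s = idx[i];  // original position of the r-th smallest key
        if (r < n && s < n) {
            keys_out[r] = keys[i];
            o1[r] = v1[s];     // scattered loads
            o2[r] = v2[s];
            o3[r] = v3[s];
        }
    }
}

// Hypothetical host wrapper; returns false if the input does not fit in one tile.
bool sort3_one_block(thrust::device_vector<int>& keys,
                     thrust::device_vector<float>& v1,
                     thrust::device_vector<float>& v2,
                     thrust::device_vector<float>& v3)
{
    constexpr int THREADS = 128, ITEMS = 4;
    const int n = static_cast<int>(keys.size());
    if (n > THREADS * ITEMS) return false;
    if (n == 0) return true;
    thrust::device_vector<int> k_out(n);
    thrust::device_vector<float> o1(n), o2(n), o3(n);
    block_sort_gather3<THREADS, ITEMS><<<1, THREADS>>>(n,
        thrust::raw_pointer_cast(keys.data()), thrust::raw_pointer_cast(k_out.data()),
        thrust::raw_pointer_cast(v1.data()), thrust::raw_pointer_cast(v2.data()),
        thrust::raw_pointer_cast(v3.data()), thrust::raw_pointer_cast(o1.data()),
        thrust::raw_pointer_cast(o2.data()), thrust::raw_pointer_cast(o3.data()));
    cudaDeviceSynchronize();
    keys.swap(k_out); v1.swap(o1); v2.swap(o2); v3.swap(o3);
    return true;
}
```

With the plain Sort variant the output would stay in a blocked arrangement, so each thread would write ITEMS_PER_THREAD consecutive elements and neighbouring threads would write addresses that far apart; the striped variant is what makes the stores line up across the warp.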

