c++ - CUB (CUDA UnBound) equivalent of thrust::gather
Due to performance issues with the Thrust libraries (see this page for more details), I am planning on re-factoring a CUDA application to use CUB instead of Thrust. Specifically, I want to replace the thrust::sort_by_key and thrust::inclusive_scan calls. At one particular point in my application I need to sort 3 arrays by key. This is how I did it with Thrust:
thrust::sort_by_key(key_iter, key_iter + numkeys, indices);
thrust::gather(indices, indices + numkeys,
    thrust::make_zip_iterator(thrust::make_tuple(values1ptr, values2ptr, values3ptr)),
    thrust::make_zip_iterator(thrust::make_tuple(valuesout1ptr, valuesout2ptr, valuesout3ptr)));
where
- key_iter is a thrust::device_ptr that points to the keys I want to sort by
- indices points to a sequence (from 0 to numkeys-1) in device memory
- values{1,2,3}ptr are device_ptrs to the values I want to sort
- valuesout{1,2,3}ptr are device_ptrs to the sorted values
With CUB's SortPairs function I can sort a single value buffer, but not all 3 in one shot. The problem is that I don't see any CUB "gather-like" utilities. Any suggestions?
Edit:
I suppose I could implement my own gather kernel, but is there a better way to do this other than:
template <typename index, typename value>
__global__ void gather_kernel(const unsigned int n, const index* map, const value* src, value* dst)
{
    // One thread per output element.
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        // Copy the element selected by map[i] into position i.
        dst[i] = src[map[i]];
    }
}
The non-coalesced loads and stores make me cringe, but they are unavoidable without a known structure on map.
It seems that what you want to achieve depends on thrust::zip_iterator. You could either
- only replace thrust::sort_by_key with cub::DeviceRadixSort::SortPairs, and keep thrust::gather, or
- zip values{1,2,3} into an array of structures before using cub::DeviceRadixSort::SortPairs
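The first option can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's actual code: the helper name sort3_by_key, the Sorted3 struct, the int key type and the float value type are all hypothetical. It sorts (key, index) pairs with cub::DeviceRadixSort::SortPairs and then permutes the three value arrays with a single thrust::gather over zip iterators.

```cuda
#include <cassert>
#include <vector>
#include <cub/cub.cuh>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/tuple.h>

// Hypothetical result container, introduced only for this example.
struct Sorted3 {
    std::vector<float> v1, v2, v3;
};

// Hypothetical helper: sort three value arrays by a common key array.
// Assumes int keys, float values, and a non-empty input.
Sorted3 sort3_by_key(const std::vector<int>& keys,
                     const std::vector<float>& a,
                     const std::vector<float>& b,
                     const std::vector<float>& c)
{
    const int n = static_cast<int>(keys.size());
    assert(n > 0 && a.size() == keys.size() &&
           b.size() == keys.size() && c.size() == keys.size());

    thrust::device_vector<int> keys_in(keys.begin(), keys.end()), keys_out(n);
    thrust::device_vector<int> idx_in(n), idx_out(n);
    thrust::sequence(idx_in.begin(), idx_in.end());  // 0, 1, ..., n-1
    thrust::device_vector<float> d_a(a.begin(), a.end());
    thrust::device_vector<float> d_b(b.begin(), b.end());
    thrust::device_vector<float> d_c(c.begin(), c.end());
    thrust::device_vector<float> o_a(n), o_b(n), o_c(n);

    const int* k_in = thrust::raw_pointer_cast(keys_in.data());
    int* k_out = thrust::raw_pointer_cast(keys_out.data());
    const int* i_in = thrust::raw_pointer_cast(idx_in.data());
    int* i_out = thrust::raw_pointer_cast(idx_out.data());

    // First call with a null buffer only reports the scratch size needed.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes, k_in, k_out, i_in, i_out, n);
    thrust::device_vector<unsigned char> temp(temp_bytes);
    d_temp = thrust::raw_pointer_cast(temp.data());
    // Second call performs the sort: idx_out becomes the gather map.
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes, k_in, k_out, i_in, i_out, n);

    // One gather moves all three value arrays through the zip iterators.
    thrust::gather(idx_out.begin(), idx_out.end(),
        thrust::make_zip_iterator(thrust::make_tuple(d_a.begin(), d_b.begin(), d_c.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(o_a.begin(), o_b.begin(), o_c.begin())));

    Sorted3 r{std::vector<float>(n), std::vector<float>(n), std::vector<float>(n)};
    thrust::copy(o_a.begin(), o_a.end(), r.v1.begin());
    thrust::copy(o_b.begin(), o_b.end(), r.v2.begin());
    thrust::copy(o_c.begin(), o_c.end(), r.v3.begin());
    return r;
}
```

Sorting an index sequence instead of the values themselves keeps the CUB call to a single pass over small pairs. The price is the data-dependent reads in the gather step that the question already points out.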
Update
After reading the implementation of thrust::gather in
$CUDA_HOME/include/thrust/system/detail/generic/gather.inl
you can see that it is only a naive kernel like
__global__ void gather(const int* index, const float* in, float* out, int len)
{
    int i = ...;  // global thread index (elided in the original)
    if (i < len) {
        out[i] = in[index[i]];
    }
}
So I think your code above can be replaced by a single kernel without too much effort.
In this kernel, you could first use the CUB block-wise primitive cub::BlockRadixSort<...>::SortBlockedToStriped to get the sorted indices stored in registers, and then perform a naive re-order copy, as thrust::gather does, to fill values{1,2,3}out.
Using SortBlockedToStriped rather than Sort gives you coalesced writing (but not reading) when copying the values.
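A minimal sketch of such a fused kernel is shown below. It makes a strong assumption that the answer does not state: the whole input fits in one thread block (here a hypothetical tile of 128 threads with 4 items each, 512 elements in total), because cub::BlockRadixSort only sorts within a block. The names block_sort_gather3, sort_tile_and_gather, Gathered3, kThreads, kItems and kTile are made up for the example, and the int key and float value types are assumptions as well.

```cuda
#include <cassert>
#include <vector>
#include <cub/cub.cuh>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

constexpr int kThreads = 128;            // threads per block (assumed)
constexpr int kItems = 4;                // items per thread (assumed)
constexpr int kTile = kThreads * kItems; // elements handled by the single block

// Sorts one tile of keys and gathers three value arrays in the same kernel.
__global__ void block_sort_gather3(const int* keys,
                                   const float* in1, const float* in2, const float* in3,
                                   float* out1, float* out2, float* out3)
{
    using BlockSort = cub::BlockRadixSort<int, kThreads, kItems, int>;
    __shared__ typename BlockSort::TempStorage temp_storage;

    int thread_keys[kItems];
    int thread_idx[kItems];
    // Blocked arrangement: thread t owns positions t*kItems .. t*kItems+kItems-1.
    for (int k = 0; k < kItems; ++k) {
        const int pos = threadIdx.x * kItems + k;
        thread_keys[k] = keys[pos];
        thread_idx[k] = pos;  // original position travels with the key
    }

    // After this call, item k of thread t holds sorted rank k*kThreads + t.
    BlockSort(temp_storage).SortBlockedToStriped(thread_keys, thread_idx);

    for (int k = 0; k < kItems; ++k) {
        const int dst = k * kThreads + threadIdx.x;  // consecutive across threads
        const int src = thread_idx[k];               // data-dependent read
        out1[dst] = in1[src];
        out2[dst] = in2[src];
        out3[dst] = in3[src];
    }
}

// Hypothetical host-side wrapper and result type, for illustration only.
struct Gathered3 {
    std::vector<float> v1, v2, v3;
};

Gathered3 sort_tile_and_gather(const std::vector<int>& keys,
                               const std::vector<float>& a,
                               const std::vector<float>& b,
                               const std::vector<float>& c)
{
    assert(keys.size() == kTile && a.size() == kTile &&
           b.size() == kTile && c.size() == kTile);
    thrust::device_vector<int> d_keys(keys.begin(), keys.end());
    thrust::device_vector<float> d_a(a.begin(), a.end());
    thrust::device_vector<float> d_b(b.begin(), b.end());
    thrust::device_vector<float> d_c(c.begin(), c.end());
    thrust::device_vector<float> o_a(kTile), o_b(kTile), o_c(kTile);

    block_sort_gather3<<<1, kThreads>>>(
        thrust::raw_pointer_cast(d_keys.data()),
        thrust::raw_pointer_cast(d_a.data()),
        thrust::raw_pointer_cast(d_b.data()),
        thrust::raw_pointer_cast(d_c.data()),
        thrust::raw_pointer_cast(o_a.data()),
        thrust::raw_pointer_cast(o_b.data()),
        thrust::raw_pointer_cast(o_c.data()));
    cudaDeviceSynchronize();

    Gathered3 r{std::vector<float>(kTile), std::vector<float>(kTile), std::vector<float>(kTile)};
    thrust::copy(o_a.begin(), o_a.end(), r.v1.begin());
    thrust::copy(o_b.begin(), o_b.end(), r.v2.begin());
    thrust::copy(o_c.begin(), o_c.end(), r.v3.begin());
    return r;
}
```

For inputs larger than one tile this sketch is not sufficient on its own. A device-wide sort such as cub::DeviceRadixSort::SortPairs would still be needed to order the keys across blocks.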