c++ - CUB (CUDA UnBound) equivalent of thrust::gather
Due to performance issues with the Thrust libraries (see this page for more details), I am planning on refactoring a CUDA application to use CUB instead of Thrust, specifically to replace the thrust::sort_by_key and thrust::inclusive_scan calls. At one particular point in my application I need to sort 3 arrays by key. This is how I did it with Thrust:
    thrust::sort_by_key(key_iter, key_iter + numkeys, indices);
    thrust::gather_wrapper(indices, indices + numkeys,
        thrust::make_zip_iterator(thrust::make_tuple(values1ptr, values2ptr, values3ptr)),
        thrust::make_zip_iterator(thrust::make_tuple(valuesout1ptr, valuesout2ptr, valuesout3ptr))
    );

where

- key_iter is a thrust::device_ptr that points to the keys I want to sort by
- indices points to a sequence (from 0 to numkeys-1) in device memory
- values{1,2,3}ptr are device_ptrs to the values I want to sort
- values{1,2,3}outptr are device_ptrs to the sorted values
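For reference, the sort-then-gather pattern described above can be sketched on the host in plain C++. This is only an illustration of the semantics, not the Thrust or CUB implementation; the helper names sorted_permutation and gather are hypothetical, and int keys are assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side reference only. Both helper names are hypothetical.

// Returns the permutation that sorts `keys` in ascending order (stable).
// It plays the role of the `indices` array after the sort-by-key step.
std::vector<std::size_t> sorted_permutation(const std::vector<int>& keys) {
    std::vector<std::size_t> indices(keys.size());
    std::iota(indices.begin(), indices.end(), std::size_t{0});  // 0, 1, ..., numkeys-1
    std::stable_sort(indices.begin(), indices.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    return indices;
}

// Computes dst[i] = src[map[i]], the same operation as the gather step.
template <typename Value>
std::vector<Value> gather(const std::vector<std::size_t>& map,
                          const std::vector<Value>& src) {
    assert(map.size() == src.size());
    std::vector<Value> dst(map.size());
    for (std::size_t i = 0; i < map.size(); ++i) {
        dst[i] = src[map[i]];
    }
    return dst;
}
```

Calling gather once per value array with the same permutation reorders all three arrays consistently, which is what the zip iterator achieves in a single call.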
With the CUB SortPairs function I can sort a single value buffer, but not all 3 in one shot. The problem is that I don't see any CUB "gather-like" utilities. Any suggestions?
Edit:

I suppose I could implement my own gather kernel, but is there a better way to do this other than:

    template <typename index, typename value>
    __global__ void gather_kernel(const unsigned int n, const index * map, const value * src, value * dst)
    {
        // One thread per output element.
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
        {
            dst[i] = src[map[i]];
        }
    }

The non-coalesced loads and stores make me cringe, but it is probably unavoidable without a known structure on map.
It seems that what you want to achieve depends on thrust::zip_iterator. You could either

- only replace thrust::sort_by_key with cub::DeviceRadixSort::SortPairs, and keep thrust::gather, or
- zip values{1,2,3} into an array of structures before using cub::DeviceRadixSort::SortPairs
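The second option can be sketched on the host as follows. This is a minimal illustration under assumptions: the Values3 struct and the zip3 and sort_pairs helpers are hypothetical names, float values and int keys are assumed, and std::stable_sort stands in for the device-side pair sort. On the GPU the packed struct would be passed as the value type of the pair sort instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical packed value type: one struct per element instead of three arrays.
struct Values3 {
    float v1;
    float v2;
    float v3;
};

// Zips three value arrays into a single array of structures.
std::vector<Values3> zip3(const std::vector<float>& a,
                          const std::vector<float>& b,
                          const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<Values3> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = Values3{a[i], b[i], c[i]};
    }
    return out;
}

// Host stand-in for a key-value pair sort: keys and packed values move together,
// so a single sort reorders all three value arrays at once.
void sort_pairs(std::vector<int>& keys, std::vector<Values3>& values) {
    assert(keys.size() == values.size());
    std::vector<std::pair<int, Values3>> pairs(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        pairs[i] = std::make_pair(keys[i], values[i]);
    }
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const std::pair<int, Values3>& x,
                        const std::pair<int, Values3>& y) { return x.first < y.first; });
    for (std::size_t i = 0; i < keys.size(); ++i) {
        keys[i] = pairs[i].first;
        values[i] = pairs[i].second;
    }
}
```

The trade-off is an extra pass to pack (and possibly unpack) the values, in exchange for dropping the separate gather step.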
Update

After reading the implementation of thrust::gather,

$CUDA_HOME/include/thrust/system/detail/generic/gather.inl

you can see it is only a naive kernel like

    __global__ void gather(int* index, float* in, float* out, int len) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
        if (i < len) {
            out[i] = in[index[i]];
        }
    }

so I think your code above can be replaced by a single kernel without too much effort.
In this kernel, you could first use the CUB block-wise primitive cub::BlockRadixSort<...>::SortBlockedToStriped to get the sorted indices stored in registers, and then perform a naive re-order copy, as thrust::gather does, to fill values{1,2,3}out.

Using SortBlockedToStriped rather than Sort gives coalesced writes (though not coalesced reads) when copying the values.
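To see why the striped arrangement helps the writes, here is a small host-side sketch of how a (thread, item) slot maps to a position in the block's tile under each arrangement. The function names are hypothetical; the mappings follow the usual definitions of blocked and striped arrangements.

```cpp
#include <cassert>
#include <cstddef>

// Tile position of the item held in slot `item` of thread `thread`.

// Blocked arrangement: each thread owns a contiguous run of items,
// so neighbouring threads are `items_per_thread` positions apart.
std::size_t blocked_position(std::size_t thread, std::size_t item,
                             std::size_t items_per_thread) {
    assert(item < items_per_thread);
    return thread * items_per_thread + item;
}

// Striped arrangement: for a given slot, consecutive threads own
// consecutive positions, so their writes land on adjacent addresses.
std::size_t striped_position(std::size_t thread, std::size_t item,
                             std::size_t block_threads) {
    assert(thread < block_threads);
    return item * block_threads + thread;
}
```

With the striped mapping, threads 0, 1, 2, ... write out[0], out[1], out[2], ... for the first slot, which is the access pattern that coalesces. The reads through the sorted indices remain scattered either way.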