Why is parallel execution slower than sequential for this code?
#pragma omp parallel
{
    for (i = 1; i < 1024; i++)
        #pragma omp for
        for (j = 1; j < 1024; j++)
            a[i][j] = 2 * a[i-1][j];
}
I'm using 12 threads to execute this code. Any suggestions on how to speed it up?
Assuming a's element type is smaller than 64 bytes, trying to parallelize the inner loop this way will cause false sharing in cache lines.
Say a is an aligned array of 4-byte ints; then a[i][0] through a[i][15] reside in the same cache line. This means that 12 threads will attempt to read that line simultaneously, each needing the part it works on. That alone would merely leave the line shared between multiple cores, but the threads then try to write back, which leads each core to attempt to take ownership of the line in order to modify it.
CPU caches are usually based on MESI-like protocols, which make a store attempt issue a read-for-ownership that invalidates the line in every core except the requester. Issuing 12 such requests in parallel (or rather 6, if you have 6 cores with 2 hardware threads each) results in a race where the first winner of the line may even have it snooped away before it had a chance to modify it (although that's not likely). The result is quite messy, and it may take a while before the line travels to each core in turn, gets modified, and gets snooped out to the next core. The same then recurs for each of the next consecutive groups of 16 elements (again, assuming int).
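To see the effect in isolation, here is a minimal, hypothetical sketch (not from the question): two threads incrementing two adjacent longs that share a cache line, versus two longs padded onto separate lines. Both variants compute identical sums; only the line ping-ponging between cores differs. All names here are illustrative.

```c
/* False-sharing demo: same results, very different cache traffic. */
static struct { long a, b; } same_line;                 /* one 64-byte line  */
static struct { long a; char pad[64]; long b; } padded; /* separate lines    */

static void bump(volatile long *x, long iters) {
    for (long i = 0; i < iters; i++)
        (*x)++;   /* each store issues a read-for-ownership on x's line */
}

void run_demo(long iters) {
    /* false-sharing case: a and b share one cache line */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        bump(&same_line.a, iters);
        #pragma omp section
        bump(&same_line.b, iters);
    }
    /* padded case: a and b live on separate lines */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        bump(&padded.a, iters);
        #pragma omp section
        bump(&padded.b, iters);
    }
}
```

Timing the two regions (e.g. with omp_get_wtime) typically shows the padded variant running several times faster, even though the work is identical.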
What you might do is:
- Make sure each individual thread works on its own cache line, by adding an internal loop that runs over the required number of elements per line, and parallelizing the loop that skips over the lines in strides of that many elements.
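A minimal sketch of this idea, assuming 4-byte elements and 64-byte lines (so 16 elements per line): handing out the j loop in 16-element chunks via the schedule clause has the same effect as the explicit internal loop, giving each thread whole cache lines to write. The function name and the use of unsigned (so the repeated doubling wraps without undefined behavior) are my additions.

```c
#define N 1024
#define LINE_ELEMS 16                /* 64-byte line / 4-byte element */

static unsigned a[N][N];

/* Each thread gets whole 16-element chunks of j, so (up to the chunk
 * starting at j = 1 rather than 0) no two threads write the same line. */
void doubling_pass_chunked(void) {
    for (int i = 1; i < N; i++) {
        #pragma omp parallel for schedule(static, LINE_ELEMS)
        for (int j = 1; j < N; j++)
            a[i][j] = 2 * a[i-1][j];
    }
}
```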
This, however, would prevent you from reaching the full potential of the CPU, as you lose the spatial locality and the streaming property of the code. Instead, you may:
- Parallelize the outer loop so that each thread works on a few lines, thereby owning entire consecutive streams of memory. However, since you need ordering between the lines, you may have to do a little tweaking here (e.g. a transpose).
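One way to read the transpose suggestion, as a sketch: store the array transposed, so t[j][i] holds the original a[i][j]. The dependence then runs along each row of t, the outer loop over j carries no dependence, and each thread streams through its own contiguous rows. The name t and the function name are assumptions for illustration.

```c
#define N 1024

static unsigned t[N][N];   /* transposed storage: t[j][i] = original a[i][j] */

/* Rows of t are independent of each other, so the outer loop can be
 * parallelized; each thread owns a few entire consecutive rows. */
void doubling_pass_transposed(void) {
    #pragma omp parallel for
    for (int j = 1; j < N; j++)        /* original column index */
        for (int i = 1; i < N; i++)    /* original row index    */
            t[j][i] = 2 * t[j][i-1];
}
```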
There's still a downside here: if a thread works on too many streams, it might lose track of them. The third approach is, therefore:
- Tile the array: break it into sets of, say, 48 lines, distribute them between the threads so that each runs on a few full lines (the transpose trick still applies here, btw), and then continue to the next group.
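A sketch of one reading of this, building on the transposed layout above: process the independent rows of t in groups of 48 (e.g. 4 rows per thread on 12 threads), so each thread touches only a few streams at a time before the team moves on to the next group. The group size and names are assumptions.

```c
#define N 1024
#define GROUP 48   /* rows per group: e.g. 4 rows per thread on 12 threads */

static unsigned t[N][N];   /* transposed storage: t[j][i] = original a[i][j] */

void doubling_pass_tiled(void) {
    for (int g = 1; g < N; g += GROUP) {
        int end = g + GROUP < N ? g + GROUP : N;
        /* each thread takes a few consecutive rows of this group */
        #pragma omp parallel for schedule(static)
        for (int j = g; j < end; j++)
            for (int i = 1; i < N; i++)
                t[j][i] = 2 * t[j][i-1];
    }
}
```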