Performance of Floyd-Warshall in Haskell – Fixing a space leak
I wanted to write an efficient implementation of the Floyd-Warshall all-pairs shortest path algorithm in Haskell, using Vectors to hopefully get good performance.

The implementation is quite straightforward, but instead of using a 3-dimensional |V|×|V|×|V| matrix, a 2-dimensional vector is used, since step k only ever reads values from the previous step k-1. Thus, the algorithm is really a series of steps where a 2D vector is passed in and a new 2D vector is generated. The final 2D vector contains the shortest paths between all pairs of nodes (i,j).
My intuition told me that it would be important to make sure the previous 2D vector is evaluated before each step, so I used BangPatterns on the prev argument of the fw function, and the strict foldl':
{-# LANGUAGE BangPatterns #-}

import Control.DeepSeq
import Control.Monad (forM_)
import Data.List (foldl')
import qualified Data.Map.Strict as M
import Data.Vector (Vector, (!), (//))
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as V hiding (length, replicate, take)

type Graph = Vector (M.Map Int Double)
type TwoDVector = Vector (Vector Double)

infinity :: Double
infinity = 1/0

-- Calculate the shortest path between all pairs in the given graph; if there
-- are negative cycles, return Nothing.
allPairsShortestPaths :: Graph -> Int -> Maybe TwoDVector
allPairsShortestPaths g v =
  let initial = fw g v V.empty 0
      results = foldl' (fw g v) initial [1..v]
  in if negCycle results
       then Nothing
       else Just results
  where
    -- check for negative elements along the diagonal
    negCycle a = not . and $ map (\i -> a ! i ! i >= 0) [0..(V.length a - 1)]

-- One step of the Floyd-Warshall algorithm.
fw :: Graph -> Int -> TwoDVector -> Int -> TwoDVector
fw g v !prev k = V.create $ do                -- ← bang
  curr <- V.new v
  forM_ [0..(v-1)] $ \i ->
    V.write curr i $ V.create $ do
      ivec <- V.new v
      forM_ [0..(v-1)] $ \j -> do
        let d = distance g prev i j k
        V.write ivec j d
      return ivec
  return curr

distance :: Graph -> TwoDVector -> Int -> Int -> Int -> Double
distance g _ i j 0 -- base case; 0 if it is the same vertex, edge weight if neighbours
  | i == j    = 0.0
  | otherwise = M.findWithDefault infinity j (g ! i)
distance _ a i j k =
  let c1 = a ! i ! j
      c2 = (a ! i ! (k-1)) + (a ! (k-1) ! j)
  in min c1 c2
However, when running this program on a 1000-node graph with 47978 edges, things did not look good at all. The memory usage was very high and the program took way too long to run. The program was compiled with ghc -O2.
I rebuilt the program with profiling enabled, and limited the number of iterations to 50:
results = foldl' (fw g v) initial [1..50]
I then ran the program with +RTS -p -hc and with +RTS -p -hd:
This is... interesting, I guess. It's showing that the program is accumulating tonnes of thunks. Not good.
OK, after a few shots in the dark, I added a deepseq in fw to make sure prev really is evaluated:
        let d = prev `deepseq` distance g prev i j k
Now things look better, and I can actually run the program to completion with constant memory usage. It's obvious that the bang on the prev argument was not enough.
For comparison with the previous graphs, here is the memory usage for 50 iterations after adding the deepseq:
OK, so things look better, but I still have some questions:
1. Is this the correct solution for this space leak? Am I wrong in feeling that inserting a deepseq is a bit ugly?
2. Is my usage of Vectors here idiomatic/correct? I'm building a completely new vector for every iteration and hoping that the garbage collector will delete the old Vectors.
3. Are there any other things I could do to make this run faster with this approach?
For reference, here is graph.txt:
http://sebsauvage.net/paste/?45147f7caf8c5f29#7ticipovphwrm1xnvrsb/znl3ujf3xb3yehrxhedvww=
And here is main:
main :: IO ()
main = do
  ls <- fmap lines $ readFile "graph.txt"
  let numVerts = head . map read . words . head $ ls
  let edges = map (map read . words) (tail ls)
  let g = V.create $ do
        g' <- V.new numVerts
        forM_ [0..(numVerts-1)] (\idx -> V.write g' idx M.empty)
        forM_ edges $ \[f,t,w] -> do
          -- subtract one from the vertex IDs so we can index directly
          curr <- V.read g' (f-1)
          V.write g' (f-1) $ M.insert (t-1) (fromIntegral w) curr
        return g'
  let a = allPairsShortestPaths g numVerts
  case a of
    Nothing -> putStrLn "Negative cycle detected."
    Just a' -> putStrLn $ "The shortest, shortest path has length "
                          ++ show ((V.minimum . V.map V.minimum) a')
First, some general code cleanup:

In your fw function, you explicitly allocate and fill mutable vectors. However, there is a premade function for exactly that purpose, namely generate. fw can therefore be rewritten as
V.generate v (\i -> V.generate v (\j -> distance g prev i j k))
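Spelled out as a full definition (a sketch, reusing the type signature and the distance helper from the question):

-- One step of Floyd-Warshall built with V.generate instead of hand-filled
-- mutable vectors. (As discussed under question 1 below, this boxed generate
-- version cannot easily be made strict in its elements; the unboxed variant
-- sketched later handles that.)
fw :: Graph -> Int -> TwoDVector -> Int -> TwoDVector
fw g v prev k = V.generate v (\i -> V.generate v (\j -> distance g prev i j k))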
Similarly, the graph generation code can be replaced with replicate and accum:
  let parsedEdges = map (\[f,t,w] -> (f - 1, (t - 1, fromIntegral w))) edges
  let g = V.accum (flip (uncurry M.insert)) (V.replicate numVerts M.empty) parsedEdges
Note that this totally removes the need for mutation, without losing any performance.
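For intuition, V.accum applies the combining function at every index that appears in the update list; a small (hypothetical) GHCi session shows the idea:

-- V.accum :: (a -> b -> a) -> V.Vector a -> [(Int, b)] -> V.Vector a
ghci> V.accum (+) (V.fromList [0,0,0 :: Int]) [(1,10),(1,5),(2,7)]
[0,15,7]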
Now, for the actual questions:
1. In my experience, deepseq is useful, but only as a quick fix for space leaks like this one. The fundamental problem is not that you need to force the results after you've produced them; rather, the use of deepseq implies that you should have been building the structure more strictly in the first place. In fact, if you add a bang pattern in your vector creation code, like so:

        let !d = distance g prev i j k

then the problem is fixed without deepseq (a sketch of how this fits into the original fw is given below). Note that this doesn't work with the generate code because, for some reason (I might create a feature request for this), vector does not provide strict construction functions for boxed vectors. However, the unboxed vectors used in the answer to question 3 are strict in their elements, so there both approaches work without any strictness annotations.

2. As far as I know, the pattern of repeatedly generating new vectors is idiomatic. The only thing that is not idiomatic is the use of mutability: except when it is strictly necessary, mutable vectors are generally discouraged.
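For concreteness, this is a sketch of the question's mutable-vector fw with that strict binding in place (everything else unchanged):

fw :: Graph -> Int -> TwoDVector -> Int -> TwoDVector
fw g v !prev k = V.create $ do
  curr <- V.new v
  forM_ [0..(v-1)] $ \i ->
    V.write curr i $ V.create $ do
      ivec <- V.new v
      forM_ [0..(v-1)] $ \j -> do
        -- forcing d here means no thunk that holds on to prev is ever
        -- stored in the result vector
        let !d = distance g prev i j k
        V.write ivec j d
      return ivec
  return curr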
3. There are a couple of things you could do:

- Most simply, you can replace Map Int with IntMap. As that isn't really the slow point of the function, it doesn't matter too much, but IntMap can be much faster for heavy workloads. (A sketch combining this with the unboxed vectors below is given after this list.)
- You can switch to using unboxed vectors. Although the outer vector has to remain boxed, since vectors of vectors can't be unboxed, the inner vectors can be. This also solves the strictness problem: because unboxed vectors are strict in their elements, you don't get a space leak. On my machine, this improves the performance from 4.1 seconds to 1.3 seconds, so the unboxing is very helpful (see the sketch after this list).
- You can flatten the vector into a single one and use multiplication and division to switch between two-dimensional and one-dimensional indices. I don't recommend this: it is a bit involved, quite ugly, and, due to the division, it actually slows down the code on my machine.
- You can use repa. This has the huge advantage of automatically parallelizing your code. Note that, since repa flattens its arrays and apparently doesn't get rid of the divisions needed to fill them nicely (it's possible to do the nested loops, but I think it uses a single loop and a division), it has the same performance penalty mentioned above, bringing the runtime from 1.3 seconds to 1.8. However, if you enable parallelism and use a multicore machine, you start seeing some benefits. Unfortunately, the current test case is too tiny to see much benefit, so on my 6-core machine I only see it drop back down to 1.2 seconds. If I set the size back to [1..v] instead of [1..50], the parallelism brings it from 32 seconds to 13. Presumably, if you gave this program a larger input, you might see more benefit. If you're interested, I've posted my repa-ified version here.
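As a sketch of the first two suggestions above (the type names and the revised functions are my own assumptions based on the code in the question, not the poster's exact code):

import qualified Data.IntMap.Strict as IM
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U

-- IntMap instead of Map Int for the adjacency lists, and an unboxed inner
-- vector for the distance rows (the outer vector stays boxed).
type Graph = V.Vector (IM.IntMap Double)
type TwoDVector = V.Vector (U.Vector Double)

infinity :: Double
infinity = 1/0

-- The step function only changes in which generate/indexing functions it
-- uses; unboxed vectors are strict in their elements, so no bang pattern or
-- deepseq is needed here.
fw :: Graph -> Int -> TwoDVector -> Int -> TwoDVector
fw g v prev k = V.generate v (\i -> U.generate v (\j -> distance g prev i j k))

distance :: Graph -> TwoDVector -> Int -> Int -> Int -> Double
distance g _ i j 0
  | i == j    = 0.0
  | otherwise = IM.findWithDefault infinity j (g V.! i)
distance _ a i j k =
  let c1 = a V.! i U.! j
      c2 = (a V.! i U.! (k-1)) + (a V.! (k-1) U.! j)
  in min c1 c2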
Edit: Use -fllvm. Testing on my computer, using repa, I get 14.7 seconds without parallelism, which is almost as good as without -fllvm but with parallelism. In general, LLVM can handle array-based code like this very well.