Performance of Floyd-Warshall in Haskell – Fixing a space leak -


i wanted write efficient implementation of floyd-warshall pairs shortest path algorithm in haskell using vectors performance.

the implementation quite straight-forward, instead of using 3-dimensional |v|×|v|×|v| matrix, 2-dimensional vector used, since ever read previous k value.

thus, algorithm series of steps 2d vector passed in, , new 2d vector generated. final 2d vector contains shortest paths between nodes (i,j).

my intuition told me important make sure previous 2d vector evaluated before each step, used bangpatterns on prev argument fw function , strict foldl':

{-# language bangpatterns #-}  import           control.deepseq import           control.monad       (form_) import           data.list           (foldl') import qualified data.map.strict     m import           data.vector         (vector, (!), (//)) import qualified data.vector         v import qualified data.vector.mutable v hiding (length, replicate, take)  type graph = vector (m.map int double) type twodvector = vector (vector double)  infinity :: double infinity = 1/0  -- calculate shortest path between pairs in given graph, if there -- negative cycles, return nothing allpairsshortestpaths :: graph -> int -> maybe twodvector allpairsshortestpaths g v =   let initial = fw g v v.empty 0       results = foldl' (fw g v) initial [1..v]   in if negcycle results         nothing         else results   -- check negative elements along diagonal         negcycle = not $ map (\i -> ! ! >= 0) [0..(v.length a-1)]  -- 1 step of floyd-warshall algorithm fw :: graph -> int -> twodvector -> int -> twodvector fw g v !prev k = v.create $                                           -- ← bang   curr <- v.new v   form_ [0..(v-1)] $ \i ->     v.write curr $ v.create $       ivec <- v.new v       form_ [0..(v-1)] $ \j ->         let d = distance g prev j k         v.write ivec j d       return ivec   return curr  distance :: graph -> twodvector -> int -> int -> int -> double distance g _ j 0 -- base case; 0 if same vertex, edge weight if neighbours   | == j    = 0.0   | otherwise = m.findwithdefault infinity j (g ! i) distance _ j k = let c1 = ! ! j                         c2 = (a ! ! (k-1))+(a ! (k-1) ! j)                         in min c1 c2 

however, when running program 1000-node graph 47978 edges, things not @ all. memory usage high , program takes way long run. program compiled ghc -o2.

i rebuilt program profiling, , limited number of iterations 50:

 results = foldl' (fw g v) initial [1..50] 

i ran program +rts -p -hc , +rts -p -hd:

this is... interesting, guess it's showing it's accumulating tonnes of thunks. not good.

ok, after few shots in dark, added deepseq in fw make sure prev really evaluted:

let d = prev `deepseq` distance g prev j k 

now things better, , can run program completion constant memory usage. it's obvious bang on prev argument not enough.

for comparison previous graphs, here memory usage 50 iterations after adding deepseq:

ok, things better, still have questions:

  1. is correct solution space leak? wrong in feeling inserting deepseq bit ugly?
  2. is usage of vectors here idiomatic/correct? i'm building new vector every iteration , hoping garbage collector delete old vectors.
  3. is there other things make run faster approach?

for references, here graph.txt: http://sebsauvage.net/paste/?45147f7caf8c5f29#7ticipovphwrm1xnvrsb/znl3ujf3xb3yehrxhedvww=

here main:

main =   ls <- fmap lines $ readfile "graph.txt"   let numverts = head . map read . words . head $ ls   let edges = map (map read . words) (tail ls)   let g = v.create $         g' <- v.new numverts         form_ [0..(numverts-1)] (\idx -> v.write g' idx m.empty)         form_ edges $ \[f,t,w] ->           -- subtract 1 vertex ids can index directly           curr <- v.read g' (f-1)           v.write g' (f-1) $ m.insert (t-1) (fromintegral w) curr         return g'   let = allpairsshortestpaths g numverts   case of     nothing -> putstrln "negative cycle detected."     a' ->       putstrln  $ "the shortest, shortest path has length "               ++ show ((v.minimum . v.map v.minimum) a') 

first, general code cleanup:

in fw function, explicitly allocate , fill mutable vectors. however, there premade function exact purpose, namely generate. fw can therefore rewritten as

v.generate v (\i -> v.generate v (\j -> distance g prev j k)) 

similarly, graph generation code can replaced replicate , accum:

let parsededges = map (\[f,t,w] -> (f - 1, (t - 1, fromintegral w))) edges let g = v.accum (flip (uncurry m.insert)) (v.replicate numverts m.empty) parsededges 

note totally removes need mutation, without losing performance.

now, actual questions:

  1. in experience, deepseq useful, quick fix space leaks one. fundamental problem not need force results after you've produced them. instead, use of deepseq implies should have been building structure more strictly in first place. in fact, if add bang pattern in vector creation code so:

    let !d = distance g prev j k 

    then problem fixed without deepseq. note doesn't work generate code, because, reason (i might create feature request this), vector not provide strict functions boxed vectors. however, when unboxed vectors in answer question 3, strict, both approaches work without strictness annotations.

  2. as far know, pattern of repeatedly generating new vectors idiomatic. thing not idiomatic use of mutability - except when strictly necessary, mutable vectors discouraged.

  3. there couple of things do:

    • most simply, can replace map int intmap. isn't slow point of function, doesn't matter much, intmap can faster heavy workloads.

    • you can switch using unboxed vectors. although outer vector has remain boxed, vectors of vectors can't unboxed, inner vector can be. solves strictness problem - because unboxed vectors strict in elements, don't space leak. note on machine, improves performance 4.1 seconds 1.3 seconds, unboxing helpful.

    • you can flatten vector single 1 , use multiplication , division switch between 2 dimensional indicies , 1 dimentional indicies. don't recommend this, bit involved, quite ugly, and, due division, slows down code on machine.

    • you can use repa. has huge advantage of automatically parallelizing code. note that, since repa flattens arrays , apparently doesn't rid of divisions needed fill nicely (it's possible nested loops, think uses single loop , division), has same performance penalty mentioned above, bringing runtime 1.3 seconds 1.8. however, if enable parallelism , use multicore machine, start seeing benifits. unfortunately, current test case tiny see benifit, so, on 6 core machine, see drop down 1.2 seconds. if size [1..v] instead of [1..50], parallelism brings 32 seconds 13. presumably, if give program larger input, might see more benifit.

      if you're interested, i've posted repa-ified version here.

    • edit: use -fllvm. testing on computer, using repa, 14.7 seconds without parallelism, without -fllvm , parallelism. in general, llvm can handle array based code well.


Comments

Popular posts from this blog

java.util.scanner - How to read and add only numbers to array from a text file -

rewrite - Trouble with Wordpress multiple custom querystrings -