Importing big Tecplot block files in Python as fast as possible


I want to import ASCII files in Python (Tecplot, a CFD post-processing software). The rules for these files are (at least, for the ones I need to import):

  • The file is divided into several sections.

Each section has a 2-line header, like:

variables = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roe" "m" "p" "pi" "tsta" "tgen"
zone t="window(s) : e_w_block0002_all",  i=29,  j=17,  k=25, f=block
  • Each section has the set of variables given in the first line. When a section ends, a new section starts with 2 similar lines.
  • For each variable there are i*j*k values.
  • Each variable is a contiguous block of values.
  • There is a fixed number of values per row (6).
  • When a variable ends, the next one starts on a new line.
  • The variables are "IJK-ordered data": the i-index varies fastest, the j-index next fastest, and the k-index slowest. The i-index should be the inner loop, the k-index the outer loop, and the j-index the loop in between (see the sketch after the example data below).

Here is an example of the data:

variables = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roe" "m" "p" "pi" "tsta" "tgen"
zone t="window(s) : e_w_block0002_all",  i=29,  j=17,  k=25, f=block
-3.9999999e+00 -3.3327306e+00 -2.7760824e+00 -2.3117116e+00 -1.9243209e+00 -1.6011492e+00
[...]
 0.0000000e+00 #end of first variable
-4.3532482e-02 -4.3584235e-02 -4.3627592e-02 -4.3663762e-02 -4.3693815e-02 -4.3718831e-02 #second variable, 'y'
[...]
 1.0738781e-01 #end of second variable
[...]
[...]
variables = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roe" "m" "p" "pi" "tsta" "tgen" #next zone
zone t="window(s) : e_w_block0003_all",  i=17,  j=17,  k=25, f=block
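To make the ordering concrete: "i fastest, k slowest" is exactly NumPy's Fortran memory order, which is what the reshape trick further down relies on. A minimal sketch with a made-up 2x3x4 block instead of real Tecplot data:

import numpy as np

# made-up 2 x 3 x 4 block: the values in "file order" are 0, 1, 2, ...
I, J, K = 2, 3, 4
flat = np.arange(I * J * K, dtype=float)

# "i fastest, k slowest" corresponds to Fortran ('F') order
arr = flat.reshape((I, J, K), order='F')

assert arr[1, 0, 0] == 1.0        # stepping i moves 1 value in the file
assert arr[0, 1, 0] == I          # stepping j moves I values
assert arr[0, 0, 1] == I * J      # stepping k moves I*J values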

I am quite new to Python and have written code that imports the data into a dictionary, storing the variables as 3D numpy.arrays. The files are big (up to a GB). How can I make this code faster? (Or, more generally, how can I import such files as fast as possible?)

import re
from numpy import zeros, array, prod

def vectorr(I, J, K):
    """Return the [i, j, k] index triples in file order
    (i fastest, j in between, k slowest)."""
    vect = []
    for k in range(0, K):
        for j in range(0, J):
            for i in range(0, I):
                vect.append([i, j, k])
    return vect

a = open(r'e:\u.dat')   # raw string, so the backslash is not treated as an escape

filelist = a.readlines()

numbercol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0

while count < leng:
    # header line 1: variable names
    variables = re.findall(r'"(.*?)"', filelist[count])
    countzone = countzone + 1
    data[countzone] = {key: [] for key in variables}
    count = count + 1
    # header line 2: zone dimensions i, j, k
    strI = re.findall(r'i=....', filelist[count])
    I = int(re.findall(r'\d+', strI[0])[0])
    ##
    strJ = re.findall(r'j=....', filelist[count])
    J = int(re.findall(r'\d+', strJ[0])[0])
    ##
    strK = re.findall(r'k=....', filelist[count])
    K = int(re.findall(r'\d+', strK[0])[0])
    data[countzone]['indmax'] = array([I, J, K])
    pr = prod(data[countzone]['indmax'])
    lin = pr // numbercol            # number of data lines per variable
    if pr % numbercol != 0:
        lin = lin + 1
    vect = vectorr(I, J, K)
    for key in variables:
        init = zeros((I, J, K))
        for ii in range(0, lin):
            count = count + 1
            temp = list(map(float, filelist[count].split()))
            for iii in range(0, len(temp)):
                init.itemset(tuple(vect[ii*numbercol + iii]), temp[iii])
        data[countzone][key] = init
    count = count + 1
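With this structure, a single zone variable is then accessed like this (the printed values are hypothetical, just to illustrate the layout):

# zone 1, variable "ro"; 'indmax' stores the zone dimensions
print(data[1]['indmax'])      # e.g. array([29, 17, 25])
ro = data[1]['ro']            # 3D array of shape (I, J, K)
print(ro[3, 0, 7])            # value at grid point i=3, j=0, k=7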

PS. Pure Python only, no Cython or other languages.

Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here, maybe changing it to the following gives a sufficient speedup:

# add this line to your imports
from numpy import fromstring

# replace the nested for-loop with:
count += 1
for key in variables:
    str_vector = ' '.join(filelist[count:count+lin])
    ar = fromstring(str_vector, sep=' ')
    ar = ar.reshape((I, J, K), order='F')   # Fortran order: i varies fastest

    data[countzone][key] = ar

    count += lin

# note: the trailing 'count = count + 1' at the end of the while-loop is
# no longer needed, since count already points at the next header line

Unfortunately, at the moment I only have access to my smartphone (no PC), so I can't test how fast this is, or whether it works correctly, or at all!


Update

I finally got around to doing some testing:

  • My code contained a small error, but it does seem to work correctly now.
  • The code with the proposed changes runs about 4 times faster than the original.
  • Your code spends most of its time on ndarray.itemset, and probably on loop overhead and float conversion. Unfortunately, cProfile doesn't show this in much detail.
  • The improved code spends 70% of its time in numpy.fromstring, which, in my view, indicates that this method is reasonably close to the fastest you can achieve with Python/NumPy.
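For reference, timings like the ones above can be gathered with cProfile; load_tecplot below is a hypothetical wrapper function around either version of the loader:

import cProfile
import pstats

# profile one run and show the 10 most expensive calls by cumulative time
cProfile.run("load_tecplot('e:/u.dat')", 'loadstats')
pstats.Stats('loadstats').sort_stats('cumulative').print_stats(10)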

Update 2

Of course, it would be better to iterate over the file instead of loading everything at once. In this case it is somewhat faster (I tried it) and it also reduces memory use. You could also try to use multiple CPU cores for the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. A word of warning: the fromstring method used here scales rather badly with the length of the string. E.g., from a certain string length on it becomes more efficient to use np.fromiter(itertools.imap(float, str_vector.split()), dtype=float) (itertools.imap is the Python 2 spelling; in Python 3 the built-in map is already lazy).
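As a sketch of the "iterate over the file" idea (read_block is my name, not part of the original code; it assumes the file handle is already positioned at the first data line of a variable):

import itertools
import numpy as np

def read_block(f, n_values, numbercol=6):
    """Read one variable (n_values floats) from an open text file,
    consuming only the lines that belong to it."""
    n_lines = -(-n_values // numbercol)         # ceiling division
    str_vector = ' '.join(itertools.islice(f, n_lines))
    # for very long strings this alternative tends to scale better:
    #   np.fromiter(map(float, str_vector.split()), dtype=float)
    return np.fromstring(str_vector, sep=' ')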

