Importing big Tecplot block files in Python as fast as possible
I want to import an ASCII file (Tecplot, CFD post-processing software) in Python. The rules for these files (at least, for those I need to import) are:

- the file is divided into several sections
- each section has a 2-line header, like:

```
variables =  "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roe" "m" "p" "pi" "tsta" "tgen"
 zone t="window(s) : e_w_block0002_all", i=29, j=17, k=25, f=block
```

- each section has the set of variables given in the first header line. When a section ends, a new section starts with 2 similar lines.
- for each variable there are i*j*k values.
- each variable is a contiguous block of values.
- there is a fixed number of values per row (6).
- when a variable ends, the next one starts on a new line.
- variables are "IJK-ordered data". The I-index varies fastest; the J-index next fastest; the K-index slowest. The I-index should be the inner loop, the K-index should be the outer loop, and the J-index the loop in between.
Here is an example of the data:

```
variables =  "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roe" "m" "p" "pi" "tsta" "tgen"
 zone t="window(s) : e_w_block0002_all", i=29, j=17, k=25, f=block
-3.9999999e+00 -3.3327306e+00 -2.7760824e+00 -2.3117116e+00 -1.9243209e+00 -1.6011492e+00
[...]
0.0000000e+00                                            #fin first variable
-4.3532482e-02 -4.3584235e-02 -4.3627592e-02 -4.3663762e-02 -4.3693815e-02 -4.3718831e-02  #second variable, 'y'
[...]
1.0738781e-01                                            #end of second variable
[...]
[...]
variables =  "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roe" "m" "p" "pi" "tsta" "tgen"  #next zone
 zone t="window(s) : e_w_block0003_all", i=17, j=17, k=25, f=block
```
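As a side note on the layout described above: "IJK ordered data" with the I-index varying fastest is exactly NumPy's Fortran ('F') memory order, so a flat vector of i*j*k values can be reshaped directly into a 3D array. A minimal sketch with made-up dimensions:

```python
import numpy as np

# hypothetical small zone: i=3, j=2, k=2 (i varies fastest, k slowest)
i, j, k = 3, 2, 2
flat = np.arange(i * j * k, dtype=float)  # stand-in for the values read from the file

# Fortran order makes the first index vary fastest, matching "IJK ordered data"
arr = flat.reshape((i, j, k), order='F')

# element (i0, j0, k0) sits at flat index i0 + i*j0 + i*j*k0
assert arr[1, 0, 1] == 1 + i * 0 + i * j * 1   # flat index 7
```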
I am quite new at Python and I have written some code to import the data into a dictionary, storing the variables as 3D numpy.arrays. The files can be big (up to a GB). How can I make this code faster? (Or, more generally, how can I import such files as fast as possible?)
```python
import re
from numpy import zeros, array, prod

def vectorr(I, J, K):
    """Build the list of (i, j, k) index tuples in IJK order
    (i fastest, k slowest)."""
    vect = []
    for k in range(0, K):
        for j in range(0, J):
            for i in range(0, I):
                vect.append([i, j, k])
    return vect

a = open('e:\\u.dat')
filelist = a.readlines()

numbercol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0

while count < leng:
    # first header line: variable names
    variables = re.findall(r'"(.*?)"', filelist[count])
    countzone = countzone + 1
    data[countzone] = {key: [] for key in variables}
    count = count + 1
    # second header line: zone dimensions i, j, k
    stri = re.findall('i=....', filelist[count])
    stri = re.findall(r'\d+', stri[0])
    i = int(stri[0])
    strj = re.findall('j=....', filelist[count])
    strj = re.findall(r'\d+', strj[0])
    j = int(strj[0])
    strk = re.findall('k=....', filelist[count])
    strk = re.findall(r'\d+', strk[0])
    k = int(strk[0])
    data[countzone]['indmax'] = array([i, j, k])
    pr = prod(data[countzone]['indmax'])
    # number of data lines per variable (6 values per row)
    lin = pr // numbercol
    if pr % numbercol != 0:
        lin = lin + 1
    vect = vectorr(i, j, k)
    for key in variables:
        init = zeros((i, j, k))
        for ii in range(0, lin):
            count = count + 1
            temp = list(map(float, filelist[count].split()))
            for iii in range(0, len(temp)):
                init.itemset(tuple(vect[ii * 6 + iii]), temp[iii])
        data[countzone][key] = init
    count = count + 1
```
PS. In pure Python only, no Cython or other languages.
Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here, maybe changing it to the following gives you a sufficient speedup:
```python
# add a line to the imports
from numpy import fromstring

# replace the triple-nested for-loop with:
count += 1
for key in variables:
    str_vector = ' '.join(filelist[count:count + lin])
    ar = fromstring(str_vector, sep=' ')
    ar = ar.reshape((i, j, k), order='F')
    data[countzone][key] = ar
    count += lin
```
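To make the idea of the snippet above concrete, here is a self-contained toy version (the two text rows and the 3x2x2 dimensions are made up for illustration): all values of one variable are joined into a single string, parsed in one `fromstring` call, and reshaped in Fortran order.

```python
import numpy as np

# a toy "block": 12 values, 6 per row, for a hypothetical i=3, j=2, k=2 zone
i, j, k = 3, 2, 2
lines = ["0.0 1.0 2.0 3.0 4.0 5.0\n",
         "6.0 7.0 8.0 9.0 10.0 11.0\n"]

# join the rows and parse all floats in one call, then reshape in Fortran order
str_vector = ' '.join(lines)
ar = np.fromstring(str_vector, sep=' ')
ar = ar.reshape((i, j, k), order='F')

print(ar[1, 0, 1])   # flat index 1 + 3*0 + 6*1 = 7 -> 7.0
```

The win comes from replacing per-element Python calls with a single vectorized parse, which is where NumPy is fast.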
Unfortunately, at the moment I only have access to my smartphone (no PC), so I can't test how fast this is, or even whether it works correctly or at all!
update
Finally I got around to doing some testing:
- My code contained a small error, but it does seem to work correctly now.
- The code with the proposed changes runs about 4 times faster than the original.
- Your code spends most of its time on `ndarray.itemset` and probably loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail.
- The improved code spends about 70% of its time in `numpy.fromstring`, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python / NumPy.
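As a quick sanity check (not from the original post) that the two routes agree, the element-by-element loop and the `fromstring` + Fortran-order reshape can be compared on a tiny made-up block:

```python
import numpy as np

i, j, k = 3, 2, 2
rows = ["0.0 1.0 2.0 3.0 4.0 5.0", "6.0 7.0 8.0 9.0 10.0 11.0"]

# slow route: explicit triple loop, i fastest, k slowest (as in the question's code)
slow = np.zeros((i, j, k))
vals = [float(v) for line in rows for v in line.split()]
n = 0
for kk in range(k):
    for jj in range(j):
        for ii in range(i):
            slow[ii, jj, kk] = vals[n]
            n += 1

# fast route: one parse + Fortran-order reshape
fast = np.fromstring(' '.join(rows), sep=' ').reshape((i, j, k), order='F')

assert np.array_equal(slow, fast)
```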
update 2
Of course it would be better to iterate over the file instead of loading it all at once. In this case it is somewhat faster (I tried it) and it also reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally a word of warning: the `fromstring` method I used scales rather badly with the length of the string. E.g. from some string length onward it becomes more efficient to use `np.fromiter(itertools.imap(float, str_vector.split()), dtype=float)`.
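Note that `itertools.imap` is Python 2; under Python 3 the built-in `map` is already lazy, so the same trick (a sketch, not from the original answer) becomes:

```python
import numpy as np

str_vector = "0.0 1.0 2.0 3.0 4.0 5.0"

# Python 3: map() is lazy, so itertools.imap is not needed;
# passing count= lets fromiter preallocate the output array
parts = str_vector.split()
ar = np.fromiter(map(float, parts), dtype=float, count=len(parts))

print(ar)   # [0. 1. 2. 3. 4. 5.]
```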