Python 2.7 - Unable to read a tab-delimited file into a NumPy 2-D array


I am quite new to NumPy and am trying to read a tab (\t) delimited text file into a NumPy array matrix using the following code:

train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t')

File contents:

38	private	215646	hs-grad	9	divorced	handlers-cleaners	not-in-family	white	male	0	0	40	united-states	<=50k
53	private	234721	11th	7	married-civ-spouse	handlers-cleaners	husband	black	male	0	0	40	united-states	<=50k
30	state-gov	141297	bachelors	13	married-civ-spouse	prof-specialty	husband	asian-pac-islander	male	0	0	40	india	>50k

What I expect is a 2-D array matrix of shape (3, 15).

But with the above code I only get a single-row array of shape (3,).

I am not sure why the fifteen fields of each row are not each assigned to their own column.

I also tried using NumPy's loadtxt(), but it does not handle type conversions on my data; i.e., even though I gave dtype=None, it tried to convert the strings to the default float type and failed at it.

The code I tried:

train_data = np.loadtxt('try.txt', dtype=None, delimiter='\t')

Error: ValueError: could not convert string to float: state-gov

Any pointers?

Thanks

Actually, the issue here is that np.genfromtxt and np.loadtxt both return a structured array if the dtype is structured (i.e., has multiple types). Your array reports a shape of (3,) because technically it is a 1-D array of 'records'. These 'records' hold all your columns, and you can access the data much as if it were 2-D.
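As a self-contained illustration of the point above, here is a minimal sketch that rebuilds the three records from the question in memory instead of reading a file. The `records` and `text` names are made up for this example, and `encoding="utf-8"` is an assumption for Python 3 with a reasonably recent NumPy (the original session was Python 2.7, where strings come back as byte strings).

```python
import io

import numpy as np

# The three records from the question, rebuilt as tab-delimited text.
# `records` and `text` are illustrative names, not part of the original post.
records = [
    ["38", "private", "215646", "hs-grad", "9", "divorced", "handlers-cleaners",
     "not-in-family", "white", "male", "0", "0", "40", "united-states", "<=50k"],
    ["53", "private", "234721", "11th", "7", "married-civ-spouse", "handlers-cleaners",
     "husband", "black", "male", "0", "0", "40", "united-states", "<=50k"],
    ["30", "state-gov", "141297", "bachelors", "13", "married-civ-spouse", "prof-specialty",
     "husband", "asian-pac-islander", "male", "0", "0", "40", "india", ">50k"],
]
text = "\n".join("\t".join(row) for row in records)

# dtype=None lets genfromtxt guess a type per column, which yields a
# structured array. encoding="utf-8" is an assumption (Python 3 setup).
d = np.genfromtxt(io.StringIO(text), dtype=None, delimiter="\t", encoding="utf-8")

print(d.shape)             # one entry per record, so the array is 1-D
print(len(d.dtype.names))  # number of fields held by each record
print(d["f0"])             # a whole "column", selected by field name
print(d[0]["f1"])          # a single value: record 0, field 'f1'
```

Each record here is one element of the array, which is why the shape has a single dimension even though every record carries fifteen fields.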

You are loading it correctly:

In [82]: d = np.genfromtxt('tmp', dtype=None)

As you reported, it has a 1-D shape:

In [83]: d.shape
Out[83]: (3,)

But all your data is there:

In [84]: d
Out[84]:
array([ (38, 'private', 215646, 'hs-grad', 9, 'divorced', 'handlers-cleaners', 'not-in-family', 'white', 'male', 0, 0, 40, 'united-states', '<=50k'),
       (53, 'private', 234721, '11th', 7, 'married-civ-spouse', 'handlers-cleaners', 'husband', 'black', 'male', 0, 0, 40, 'united-states', '<=50k'),
       (30, 'state-gov', 141297, 'bachelors', 13, 'married-civ-spouse', 'prof-specialty', 'husband', 'asian-pac-islander', 'male', 0, 0, 40, 'india', '>50k')],
      dtype=[('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])

The dtype of the array is structured, like so:

In [85]: d.dtype
Out[85]: dtype([('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])

And you can still access the "columns" (known as fields) using the names given in the dtype:

In [86]: d['f0']
Out[86]: array([38, 53, 30])

In [87]: d['f1']
Out[87]:
array(['private', 'private', 'state-gov'],
      dtype='|S9')

It's more convenient to give proper names to the fields:

In [104]: names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"

In [105]: d = np.genfromtxt('tmp', dtype=None, names=names)

So you can now access the 'age' field, etc.:

In [106]: d['age']
Out[106]: array([38, 53, 30])

In [107]: d['income']
Out[107]:
array(['<=50k', '<=50k', '>50k'],
      dtype='|S5')

Or the incomes of people under 35:

In [108]: d[d['age'] < 35]['income']
Out[108]:
array(['>50k'],
      dtype='|S5')

And over 35:

In [109]: d[d['age'] > 35]['income']
Out[109]:
array(['<=50k', '<=50k'],
      dtype='|S5')
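If a plain 2-D array of shape (3, 15) really is what you need, one possible workaround (not part of the approach above) is to read every column as a string, so that the dtype is homogeneous rather than structured. The sketch below again uses in-memory data in place of the file; `records`, `text`, `table` and `ages` are illustrative names. It also hints at why the loadtxt attempt in the question failed: `np.dtype(None)` is the default float type, so every field was parsed as a float.

```python
import io

import numpy as np

# Same three records as in the question, rebuilt as tab-delimited text.
records = [
    ["38", "private", "215646", "hs-grad", "9", "divorced", "handlers-cleaners",
     "not-in-family", "white", "male", "0", "0", "40", "united-states", "<=50k"],
    ["53", "private", "234721", "11th", "7", "married-civ-spouse", "handlers-cleaners",
     "husband", "black", "male", "0", "0", "40", "united-states", "<=50k"],
    ["30", "state-gov", "141297", "bachelors", "13", "married-civ-spouse", "prof-specialty",
     "husband", "asian-pac-islander", "male", "0", "0", "40", "india", ">50k"],
]
text = "\n".join("\t".join(row) for row in records)

# dtype=str keeps every field as text, so all columns share one type
# and the result is an ordinary 2-D array rather than a structured one.
table = np.loadtxt(io.StringIO(text), dtype=str, delimiter="\t")
print(table.shape)

# Numeric columns then have to be converted explicitly when needed.
ages = table[:, 0].astype(int)
print(ages)
```

The trade-off is that the numeric columns are stored as text until converted, so the structured array is usually the more convenient representation for mixed data like this.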
