python 2.7 - unable to read a tab delimited file into a numpy 2-D array -

I am quite new to NumPy and am trying to read a tab (\t) delimited text file into a NumPy array (matrix) using the following code:

    train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t')

File contents:

    38   private    215646   hs-grad    9    divorced    handlers-cleaners   not-in-family   white   male   0   0   40   united-states   <=50k
    53   private    234721   11th   7    married-civ-spouse  handlers-cleaners   husband     black   male   0   0   40   united-states   <=50k
    30   state-gov  141297   bachelors  13   married-civ-spouse  prof-specialty  husband     asian-pac-islander  male   0   0   40   india   >50k

What I expect is a 2-D array (matrix) of shape (3, 15),

but with the above code I only get a single-row array of shape (3,).

I am not sure why the fifteen fields of each row are not each assigned to a column.

I also tried NumPy's loadtxt(), but it does not handle type conversions on my data: even though I gave dtype=None, it tried to convert the strings to the default float type and failed.

The code I tried:

    train_data = np.loadtxt('try.txt', dtype=None, delimiter='\t')

Error:

    ValueError: could not convert string to float: state-gov

Any pointers?

Thanks

Actually, the issue here is that np.genfromtxt and np.loadtxt both return a structured array if the dtype is structured (i.e., has multiple types). Your array reports a shape of (3,) because technically it is a 1-D array of 'records'. These 'records' hold all of your columns, and you can still access the data much as if it were 2-D.
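
The behaviour described above can be reproduced without a file on disk. The following is a minimal sketch, assuming a recent NumPy on Python 3 (the `encoding=None` argument and the in-memory `SAMPLE` text are assumptions added for illustration, and only the first four columns of the data are used):

```python
import io

import numpy as np

# Hypothetical in-memory stand-in for the first four columns of the file.
SAMPLE = (
    "38\tprivate\t215646\ths-grad\n"
    "53\tprivate\t234721\t11th\n"
    "30\tstate-gov\t141297\tbachelors\n"
)

# dtype=None asks genfromtxt to infer a type per column; mixed column
# types produce a structured (record) array rather than a 2-D array.
d = np.genfromtxt(io.StringIO(SAMPLE), dtype=None, delimiter="\t", encoding=None)

print(d.shape)        # a single axis: one entry per record
print(d.dtype.names)  # auto-generated field names, one per column
```

The single axis counts rows; the columns live in the dtype's fields rather than in a second axis.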

You are loading it correctly:

    In [82]: d = np.genfromtxt('tmp', dtype=None)

As you reported, it has a 1-D shape:

    In [83]: d.shape
    Out[83]: (3,)

But all of your data is there:

    In [84]: d
    Out[84]:
    array([ (38, 'private', 215646, 'hs-grad', 9, 'divorced', 'handlers-cleaners', 'not-in-family', 'white', 'male', 0, 0, 40, 'united-states', '<=50k'),
           (53, 'private', 234721, '11th', 7, 'married-civ-spouse', 'handlers-cleaners', 'husband', 'black', 'male', 0, 0, 40, 'united-states', '<=50k'),
           (30, 'state-gov', 141297, 'bachelors', 13, 'married-civ-spouse', 'prof-specialty', 'husband', 'asian-pac-islander', 'male', 0, 0, 40, 'india', '>50k')],
          dtype=[('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])

The dtype of the array is structured, like so:

    In [85]: d.dtype
    Out[85]: dtype([('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])

And you can still access the "columns" (known as fields) using the names given in the dtype:

    In [86]: d['f0']
    Out[86]: array([38, 53, 30])

    In [87]: d['f1']
    Out[87]:
    array(['private', 'private', 'state-gov'],
          dtype='|S9')

It's more convenient to give proper names to your fields:

    In [104]: names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"

    In [105]: d = np.genfromtxt('tmp', dtype=None, names=names)

So you can now access the 'age' field, etc.:

    In [106]: d['age']
    Out[106]: array([38, 53, 30])

    In [107]: d['income']
    Out[107]:
    array(['<=50k', '<=50k', '>50k'],
          dtype='|S5')

Or the incomes of people under 35:

    In [108]: d[d['age'] < 35]['income']
    Out[108]:
    array(['>50k'],
          dtype='|S5')

And over 35:

    In [109]: d[d['age'] > 35]['income']
    Out[109]:
    array(['<=50k', '<=50k'],
          dtype='|S5')
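
If a true 2-D array is what is wanted, as in the question, one option is to read every column as text so that all columns share a single dtype. This is a minimal sketch under that assumption and not part of the answer above; the in-memory `SAMPLE` text and `encoding=None` are illustrative, and the numeric columns would have to be converted back from strings afterwards:

```python
import io

import numpy as np

# Hypothetical in-memory stand-in for the first four columns of the file.
SAMPLE = (
    "38\tprivate\t215646\ths-grad\n"
    "53\tprivate\t234721\t11th\n"
    "30\tstate-gov\t141297\tbachelors\n"
)

# A single string dtype for every column gives a regular 2-D array,
# at the cost of losing the per-column numeric types.
table = np.genfromtxt(io.StringIO(SAMPLE), dtype=str, delimiter="\t", encoding=None)

print(table.shape)               # rows x columns
ages = table[:, 0].astype(int)   # convert a numeric column back explicitly
print(ages)
```

The structured array shown in the answer is usually the better fit for mixed data, since it keeps integers as integers; the all-strings form is mainly useful when positional 2-D indexing matters more than types.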
