Python 2.7 - Unable to read a tab-delimited file into a NumPy 2-D array
I am quite new to NumPy and am trying to read a tab (\t) delimited text file into a NumPy array matrix using the following code:
train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t')
File contents:
38 private 215646 hs-grad 9 divorced handlers-cleaners not-in-family white male 0 0 40 united-states <=50k
53 private 234721 11th 7 married-civ-spouse handlers-cleaners husband black male 0 0 40 united-states <=50k
30 state-gov 141297 bachelors 13 married-civ-spouse prof-specialty husband asian-pac-islander male 0 0 40 india >50k
What I expect is a 2-D array matrix of shape (3, 15),
but with the above code I only get a single-row array of shape (3,).
I am not sure why the fifteen fields of each row are not each assigned to a column.
I also tried NumPy's loadtxt(), but it does not handle type conversions on my data; i.e., even though I gave dtype=None, it tried to convert the strings to the default float type and failed at it.
The code I tried:
train_data = np.loadtxt('try.txt', dtype=None, delimiter='\t')
Error:
ValueError: could not convert string to float: state-gov
Any pointers?
Thanks
Actually, the issue here is that np.genfromtxt and np.loadtxt both return a structured array if the dtype is structured (i.e., has multiple types). Your array reports a shape of (3,) because it is technically a 1-D array of 'records'. These records hold all of your columns, so you can still access the data column by column, much as you would with a 2-D array.
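To make the "1-D array of records" idea concrete, here is a minimal sketch. It assumes Python 3 with NumPy 1.14 or later (for the `encoding` argument) and uses a small made-up sample read from an in-memory string rather than the asker's file.

```python
import io
import numpy as np

# Two tab-delimited rows with mixed types (illustrative sample, not the real file).
text = "38\tprivate\t215646\n53\tstate-gov\t234721\n"

# dtype=None asks genfromtxt to infer one type per column.
# encoding="utf-8" makes the text fields str rather than bytes.
records = np.genfromtxt(io.StringIO(text), dtype=None, delimiter="\t",
                        encoding="utf-8")

shape = records.shape              # a single axis: one entry per row
field_names = records.dtype.names  # auto-generated names, one per column
first_row = records[0]             # one record holding all three columns
```

The shape counts rows only; the columns live inside the dtype as fields, which is why no second dimension is reported.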
You are loading it correctly:
In [82]: d = np.genfromtxt('tmp', dtype=None)
As reported, it has a 1-D shape:
In [83]: d.shape
Out[83]: (3,)
But all your data is there:
In [84]: d
Out[84]:
array([ (38, 'private', 215646, 'hs-grad', 9, 'divorced', 'handlers-cleaners', 'not-in-family', 'white', 'male', 0, 0, 40, 'united-states', '<=50k'),
       (53, 'private', 234721, '11th', 7, 'married-civ-spouse', 'handlers-cleaners', 'husband', 'black', 'male', 0, 0, 40, 'united-states', '<=50k'),
       (30, 'state-gov', 141297, 'bachelors', 13, 'married-civ-spouse', 'prof-specialty', 'husband', 'asian-pac-islander', 'male', 0, 0, 40, 'india', '>50k')],
      dtype=[('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
The dtype of the array is structured like so:
In [85]: d.dtype
Out[85]: dtype([('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])
And you can still access the "columns" (known as fields) using the names given in the dtype:
In [86]: d['f0']
Out[86]: array([38, 53, 30])
In [87]: d['f1']
Out[87]: array(['private', 'private', 'state-gov'], dtype='|S9')
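Since each field comes back as an ordinary 1-D array, the numeric fields can be stacked into a plain 2-D array when that is what later code expects. The following is a rough sketch under the same assumptions as before (Python 3, NumPy 1.14 or later, a made-up three-column sample); `np.column_stack` is one way to do this, not the only one.

```python
import io
import numpy as np

# Three tab-delimited rows: int, text, int (illustrative sample).
text = "38\tprivate\t215646\n53\tprivate\t234721\n30\tstate-gov\t141297\n"
d = np.genfromtxt(io.StringIO(text), dtype=None, delimiter="\t",
                  encoding="utf-8")

# Each field is a regular 1-D array; stacking the integer fields
# side by side yields a homogeneous 2-D array of shape (rows, 2).
numeric = np.column_stack([d["f0"], d["f2"]])
```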
It's more convenient if you give proper names to your fields:
In [104]: names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"
In [105]: d = np.genfromtxt('tmp', dtype=None, names=names)
So you can now access the 'age' field, etc.:
In [106]: d['age']
Out[106]: array([38, 53, 30])
In [107]: d['income']
Out[107]: array(['<=50k', '<=50k', '>50k'], dtype='|S5')
Or even the incomes of people under 35:
In [108]: d[d['age'] < 35]['income']
Out[108]: array(['>50k'], dtype='|S5')
And over 35:
In [109]: d[d['age'] > 35]['income']
Out[109]: array(['<=50k', '<=50k'], dtype='|S5')
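If a true 2-D array of shape (rows, columns) is what you need, as in the question, one option is to read every column as a string so that the dtype is no longer structured. This is only a sketch, assuming Python 3 with NumPy 1.14 or later and a shortened four-column sample; numeric columns then have to be converted explicitly.

```python
import io
import numpy as np

# Shortened sample: only four of the fifteen columns, tab-delimited.
text = (
    "38\tprivate\t215646\ths-grad\n"
    "53\tprivate\t234721\t11th\n"
    "30\tstate-gov\t141297\tbachelors\n"
)

# dtype=str gives every column the same string type, so the result
# is an ordinary 2-D array instead of a 1-D array of records.
table = np.genfromtxt(io.StringIO(text), dtype=str, delimiter="\t",
                      encoding="utf-8")

# Numeric columns are still text at this point; convert them as needed.
ages = table[:, 0].astype(int)
```

The trade-off is that per-column types are lost, so the structured array shown above is usually the better fit for mixed data.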