Python 2.7 - Unable to read a tab-delimited file into a NumPy 2-D array
I am quite new to NumPy and am trying to read a tab (\t) delimited text file into a NumPy array matrix using the following code:
    train_data = np.genfromtxt('training.txt', dtype=None, delimiter='\t')

File contents:

    38 private 215646 hs-grad 9 divorced handlers-cleaners not-in-family white male 0 0 40 united-states <=50k
    53 private 234721 11th 7 married-civ-spouse handlers-cleaners husband black male 0 0 40 united-states <=50k
    30 state-gov 141297 bachelors 13 married-civ-spouse prof-specialty husband asian-pac-islander male 0 0 40 india >50k

What I expect is a 2-D array matrix of shape (3, 15).
But with the above code I only get a single-row array of shape (3,).
I am not sure why the fifteen fields of each row are not assigned a column each.
I also tried using NumPy's loadtxt(), but it could not handle the type conversions on my data, i.e. even though I gave dtype=None, it tried to convert the strings to the default float type and failed at it.
I tried this code:

    train_data = np.loadtxt('try.txt', dtype=None, delimiter='\t')

Error:

    ValueError: could not convert string to float: state-gov

Any pointers?
Thanks
Actually, the issue here is that np.genfromtxt and np.loadtxt both return a structured array if the dtype is structured (i.e., has multiple types). Your array reports a shape of (3,) because technically it is a 1d array of 'records'. These 'records' hold all your columns, and you can access the data much as if it were 2d.
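To see the point about 'records' without the original file, here is a minimal sketch that feeds the three rows from the question through an in-memory buffer. The use of io.StringIO, the re-joining of the fields with tabs, and the explicit encoding="utf-8" argument are assumptions made so the example runs on a recent NumPy under Python 3; none of them come from the question itself.

```python
import io
import numpy as np

# The three rows from the question. The fields are re-joined with tabs below,
# which is an assumption about the exact layout of the original file.
lines = [
    "38 private 215646 hs-grad 9 divorced handlers-cleaners not-in-family white male 0 0 40 united-states <=50k",
    "53 private 234721 11th 7 married-civ-spouse handlers-cleaners husband black male 0 0 40 united-states <=50k",
    "30 state-gov 141297 bachelors 13 married-civ-spouse prof-specialty husband asian-pac-islander male 0 0 40 india >50k",
]
text = "\n".join("\t".join(line.split()) for line in lines)

# dtype=None asks genfromtxt to infer a type per column, so each row becomes
# one record with fifteen fields.
data = np.genfromtxt(io.StringIO(text), dtype=None, delimiter="\t", encoding="utf-8")

print(data.shape)             # one entry per row, so the array is 1-D
print(len(data.dtype.names))  # number of fields (columns) in each record
print(data["f0"])             # first column, via its default field name
```

The shape counts records, not fields; the fifteen columns live in the dtype.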
You are loading it correctly:
    In [82]: d = np.genfromtxt('tmp', dtype=None)

As reported, it has a 1d shape:

    In [83]: d.shape
    Out[83]: (3,)

But the data is all there:
    In [84]: d
    Out[84]:
    array([ (38, 'private', 215646, 'hs-grad', 9, 'divorced', 'handlers-cleaners', 'not-in-family', 'white', 'male', 0, 0, 40, 'united-states', '<=50k'),
           (53, 'private', 234721, '11th', 7, 'married-civ-spouse', 'handlers-cleaners', 'husband', 'black', 'male', 0, 0, 40, 'united-states', '<=50k'),
           (30, 'state-gov', 141297, 'bachelors', 13, 'married-civ-spouse', 'prof-specialty', 'husband', 'asian-pac-islander', 'male', 0, 0, 40, 'india', '>50k')],
          dtype=[('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])

The dtype of the array is structured, like so:

    In [85]: d.dtype
    Out[85]: dtype([('f0', '<i8'), ('f1', 'S9'), ('f2', '<i8'), ('f3', 'S9'), ('f4', '<i8'), ('f5', 'S18'), ('f6', 'S17'), ('f7', 'S13'), ('f8', 'S18'), ('f9', 'S4'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', 'S13'), ('f14', 'S5')])

And you can still access the "columns" (known as fields) using the names given in the dtype:
    In [86]: d['f0']
    Out[86]: array([38, 53, 30])

    In [87]: d['f1']
    Out[87]:
    array(['private', 'private', 'state-gov'],
          dtype='|S9')

It's more convenient to give proper names to your fields:
    In [104]: names = "age,military,id,edu,a,marital,job,fam,ethnicity,gender,b,c,d,country,income"

    In [105]: d = np.genfromtxt('tmp', dtype=None, names=names)

So you can now access the 'age' field, etc.:

    In [106]: d['age']
    Out[106]: array([38, 53, 30])

    In [107]: d['income']
    Out[107]:
    array(['<=50k', '<=50k', '>50k'],
          dtype='|S5')

Or the incomes of people under 35:
    In [108]: d[d['age'] < 35]['income']
    Out[108]:
    array(['>50k'],
          dtype='|S5')

And over 35:

    In [109]: d[d['age'] > 35]['income']
    Out[109]:
    array(['<=50k', '<=50k'],
          dtype='|S5')
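If you still want a plain 2-D array of shape (3, 15), one option is to convert the records to a list of tuples and build an object array from that. This is only a sketch and not part of the answer above: the in-memory io.StringIO input, the tab re-joining, the encoding="utf-8" argument, and the name `table` are all assumptions made for illustration.

```python
import io
import numpy as np

# Same three rows as in the question, re-joined with tabs (assumed layout).
lines = [
    "38 private 215646 hs-grad 9 divorced handlers-cleaners not-in-family white male 0 0 40 united-states <=50k",
    "53 private 234721 11th 7 married-civ-spouse handlers-cleaners husband black male 0 0 40 united-states <=50k",
    "30 state-gov 141297 bachelors 13 married-civ-spouse prof-specialty husband asian-pac-islander male 0 0 40 india >50k",
]
text = "\n".join("\t".join(line.split()) for line in lines)

# Load as a structured array first, exactly as in the answer.
data = np.genfromtxt(io.StringIO(text), dtype=None, delimiter="\t", encoding="utf-8")

# tolist() yields one tuple per record; dtype=object keeps the mixed
# ints and strings side by side in an ordinary 2-D array.
table = np.array(data.tolist(), dtype=object)

print(table.shape)   # rows x columns
print(table[:, 0])   # first column, selected by position instead of by name
```

An object array gives up NumPy's fast numeric operations, so for mixed columns like these the structured array is usually the better fit; the 2-D form is mainly useful when some other code insists on positional indexing.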