Python - Handling too many categorical features using scikit-learn


I am quite new to scikit-learn and am trying to use this package to make predictions on income data. This may be a duplicate question, as I saw another post on this, but I am looking for an easy example to understand what is expected by scikit-learn estimators.

The data I have is of the following structure, where many features are categorical (e.g. workclass, education, ...):

age: continuous.
workclass: private, self-emp-not-inc, self-emp-inc, federal-gov, local-gov, state-gov, without-pay, never-worked.
fnlwgt: continuous.
education: bachelors, some-college, 11th, hs-grad, prof-school, assoc-acdm, assoc-voc, 9th, 7th-8th, 12th, masters, 1st-4th, 10th, doctorate, 5th-6th, preschool.
education-num: continuous.
marital-status: married-civ-spouse, divorced, never-married, separated, widowed, married-spouse-absent, married-af-spouse.
occupation: tech-support, craft-repair, other-service, sales, exec-managerial, prof-specialty, handlers-cleaners, machine-op-inspct, adm-clerical, farming-fishing, transport-moving, priv-house-serv, protective-serv, armed-forces.
relationship: wife, own-child, husband, not-in-family, other-relative, unmarried.
race: white, asian-pac-islander, amer-indian-eskimo, other, black.
sex: female, male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: united-states, cambodia, england, puerto-rico, canada, germany, outlying-us(guam-usvi-etc), india, japan, greece, south, china, cuba, iran, honduras, philippines, italy, poland, jamaica, vietnam, mexico, portugal, ireland, france, dominican-republic, laos, ecuador, taiwan, haiti, columbia, hungary, guatemala, nicaragua, scotland, thailand, yugoslavia, el-salvador, trinadad&tobago, peru, hong, holand-netherlands.

Example records:

38   private    215646   hs-grad    9    divorced    handlers-cleaners   not-in-family   white   male   0   0   40   united-states   <=50k
53   private    234721   11th   7    married-civ-spouse  handlers-cleaners   husband     black   male   0   0   40   united-states   <=50k
30   state-gov  141297   bachelors  13   married-civ-spouse  prof-specialty  husband     asian-pac-islander  male   0   0   40   india   >50k

I am having a hard time handling the categorical features, as most of the models in scikit-learn expect all features to be numbers. The library provides some classes to transform/encode such features (like OneHotEncoder and DictVectorizer), but I cannot find a way to use these on my data. I know there are quite a number of steps involved before I can fully encode them as numbers, but I am wondering if anyone knows a simpler and more efficient way (since there are many such features) that can be understood with an example. I vaguely know DictVectorizer is the way to go, but I need help on how to proceed here.

Here's some example code using DictVectorizer. First, let's set up some data in the Python shell. I leave reading from a file up to you.

>>> features = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
...             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country"]
>>> input_text = """38   private    215646   hs-grad    9    divorced    handlers-cleaners   not-in-family   white   male   0   0   40   united-states   <=50k
... 53   private    234721   11th   7    married-civ-spouse  handlers-cleaners   husband     black   male   0   0   40   united-states   <=50k
... 30   state-gov  141297   bachelors  13   married-civ-spouse  prof-specialty  husband     asian-pac-islander  male   0   0   40   india   >50k
... """

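Now, parse these:

>>> samples = []   # one dict of feature name -> value per record
>>> y = []         # class labels (last column of each record)
>>> for ln in input_text.splitlines():
...     values = ln.split()
...     y.append(values[-1])
...     d = dict(zip(features, values[:-1]))
...     samples.append(d)
...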

What have we got now? Let's check:

>>> from pprint import pprint
>>> pprint(samples[0])
{'age': '38',
 'capital-gain': '0',
 'capital-loss': '0',
 'education': 'hs-grad',
 'education-num': '9',
 'fnlwgt': '215646',
 'hours-per-week': '40',
 'marital-status': 'divorced',
 'native-country': 'united-states',
 'occupation': 'handlers-cleaners',
 'race': 'white',
 'relationship': 'not-in-family',
 'sex': 'male',
 'workclass': 'private'}
>>> print(y)
['<=50k', '<=50k', '>50k']
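Every value in the parsed dict is still a string, including age and fnlwgt. DictVectorizer one-hot encodes string values and passes numeric values through unchanged, so if the continuous attributes should stay numeric, they can be converted before vectorizing. The following is only a sketch of that optional step; the names CONTINUOUS and to_typed are made up for illustration and are not part of scikit-learn.

```python
# Continuous attributes, as listed in the attribute description of the question.
CONTINUOUS = {"age", "fnlwgt", "education-num", "capital-gain",
              "capital-loss", "hours-per-week"}


def to_typed(sample):
    """Return a copy of `sample` with the continuous features converted to float.

    Hypothetical helper: categorical features are left as strings.
    """
    return {name: float(value) if name in CONTINUOUS else value
            for name, value in sample.items()}


# The first parsed record, written out so the snippet is self-contained.
sample = {"age": "38", "workclass": "private", "fnlwgt": "215646",
          "education": "hs-grad", "education-num": "9",
          "marital-status": "divorced", "occupation": "handlers-cleaners",
          "relationship": "not-in-family", "race": "white", "sex": "male",
          "capital-gain": "0", "capital-loss": "0", "hours-per-week": "40",
          "native-country": "united-states"}

typed = to_typed(sample)
print(typed["age"], typed["workclass"])  # 38.0 private
```

Applied to the whole data set, this would be something like a list comprehension calling to_typed on each parsed dict.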

These samples are ready for DictVectorizer, so pass them:

>>> from sklearn.feature_extraction import DictVectorizer
>>> dv = DictVectorizer()
>>> X = dv.fit_transform(samples)
>>> X
<3x29 sparse matrix of type '<type 'numpy.float64'>'
        with 42 stored elements in Compressed Sparse Row format>
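To see which column stands for which feature/value pair, the fitted vectorizer can be inspected through its feature_names_ attribute. The sketch below uses a deliberately tiny, hand-written pair of samples (not the full records) with age already numeric, so the mapping is easy to read.

```python
from sklearn.feature_extraction import DictVectorizer

# Two shortened, hand-written samples; "age" is numeric, the rest are strings.
samples = [
    {"age": 38.0, "sex": "male", "workclass": "private"},
    {"age": 30.0, "sex": "male", "workclass": "state-gov"},
]

dv = DictVectorizer()
X = dv.fit_transform(samples)

# Numeric features keep their name; string features become "name=value" columns.
print(dv.feature_names_)
# ['age', 'sex=male', 'workclass=private', 'workclass=state-gov']

# Dense view of the sparse matrix, one row per sample.
print(X.toarray())
```

Each string-valued feature contributes one column per distinct value seen during fitting, which is why the column count grows with the number of categories.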

Et voilà, you have X and y, which can be passed to an estimator, provided it supports sparse matrices. (Otherwise, pass sparse=False to the DictVectorizer constructor.)
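As a rough illustration of that last step, the sketch below fits an estimator on the vectorized data. LogisticRegression is only an example choice here (it accepts sparse input); the samples are shortened to two features to keep the snippet small, and three records are of course far too few to learn anything meaningful.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Shortened versions of the three example records (two features each).
samples = [
    {"education": "hs-grad", "hours-per-week": 40.0},
    {"education": "11th", "hours-per-week": 40.0},
    {"education": "bachelors", "hours-per-week": 40.0},
]
y = ["<=50k", "<=50k", ">50k"]

# Default: sparse output, accepted directly by estimators that support it.
dv = DictVectorizer()
X = dv.fit_transform(samples)

clf = LogisticRegression()  # example estimator, an assumption of this sketch
clf.fit(X, y)
print(clf.classes_)

# For estimators that need dense input, ask for a NumPy array instead.
X_dense = DictVectorizer(sparse=False).fit_transform(samples)
print(type(X_dense).__name__, X_dense.shape)
```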

test samples can passed dictvectorizer.transform; if there feature/value combinations in test set not occur in training set, these ignored (because learned model cannot sensible them anyway).

