Python - Handling too many categorical features using scikit-learn
I am quite new to scikit-learn and am trying to use the package to make predictions on income data. This may be a duplicate question, as I saw another post on this, but I am looking for an easy example to understand what is expected by scikit-learn estimators.
The data I have has the following structure, where many features are categorical (e.g. workclass, education, ...):
age: continuous.
workclass: private, self-emp-not-inc, self-emp-inc, federal-gov, local-gov, state-gov, without-pay, never-worked.
fnlwgt: continuous.
education: bachelors, some-college, 11th, hs-grad, prof-school, assoc-acdm, assoc-voc, 9th, 7th-8th, 12th, masters, 1st-4th, 10th, doctorate, 5th-6th, preschool.
education-num: continuous.
marital-status: married-civ-spouse, divorced, never-married, separated, widowed, married-spouse-absent, married-af-spouse.
occupation: tech-support, craft-repair, other-service, sales, exec-managerial, prof-specialty, handlers-cleaners, machine-op-inspct, adm-clerical, farming-fishing, transport-moving, priv-house-serv, protective-serv, armed-forces.
relationship: wife, own-child, husband, not-in-family, other-relative, unmarried.
race: white, asian-pac-islander, amer-indian-eskimo, other, black.
sex: female, male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: united-states, cambodia, england, puerto-rico, canada, germany, outlying-us(guam-usvi-etc), india, japan, greece, south, china, cuba, iran, honduras, philippines, italy, poland, jamaica, vietnam, mexico, portugal, ireland, france, dominican-republic, laos, ecuador, taiwan, haiti, columbia, hungary, guatemala, nicaragua, scotland, thailand, yugoslavia, el-salvador, trinadad&tobago, peru, hong, holand-netherlands.

Example records:
38 private 215646 hs-grad 9 divorced handlers-cleaners not-in-family white male 0 0 40 united-states <=50k
53 private 234721 11th 7 married-civ-spouse handlers-cleaners husband black male 0 0 40 united-states <=50k
30 state-gov 141297 bachelors 13 married-civ-spouse prof-specialty husband asian-pac-islander male 0 0 40 india >50k

I am having a hard time handling the categorical features, as most of the models in scikit-learn expect all features to be numbers. The library does provide some classes to transform/encode such features (like OneHotEncoder and DictVectorizer), but I cannot find a way to use these on my data. I know there are quite a number of steps involved before I can fully encode them into numbers, but I am wondering if anyone knows a simpler and more efficient way (since there are many such features) that can be understood with an example. I vaguely know that DictVectorizer is the way to go, but I need help on how to proceed here.
Here's some example code using DictVectorizer. First, let's set up some data in the Python shell. I leave reading from a file to you.
>>> features = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
...             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country"]
>>> input_text = """38 private 215646 hs-grad 9 divorced handlers-cleaners not-in-family white male 0 0 40 united-states <=50k
... 53 private 234721 11th 7 married-civ-spouse handlers-cleaners husband black male 0 0 40 united-states <=50k
... 30 state-gov 141297 bachelors 13 married-civ-spouse prof-specialty husband asian-pac-islander male 0 0 40 india >50k
... """

Now, parse these:
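>>> samples, y = [], []   # one dict of features per record, plus its label
>>> for ln in input_text.splitlines():
...     values = ln.split()
...     y.append(values[-1])                    # last field is the income label
...     d = dict(zip(features, values[:-1]))    # feature name -> raw string value
...     samples.append(d)
...

What have we got now? Let's check: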
>>> from pprint import pprint
>>> pprint(samples[0])
{'age': '38',
 'capital-gain': '0',
 'capital-loss': '0',
 'education': 'hs-grad',
 'education-num': '9',
 'fnlwgt': '215646',
 'hours-per-week': '40',
 'marital-status': 'divorced',
 'native-country': 'united-states',
 'occupation': 'handlers-cleaners',
 'race': 'white',
 'relationship': 'not-in-family',
 'sex': 'male',
 'workclass': 'private'}
>>> print(y)
['<=50k', '<=50k', '>50k']

These samples are ready for DictVectorizer, so pass them:
>>> from sklearn.feature_extraction import DictVectorizer
>>> dv = DictVectorizer()
>>> X = dv.fit_transform(samples)
>>> X
<3x29 sparse matrix of type '<type 'numpy.float64'>'
        with 42 stored elements in Compressed Sparse Row format>

Et voilà, you have X and y, which can be passed to an estimator, provided it supports sparse matrices. (Otherwise, pass sparse=False to the DictVectorizer constructor.)
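One thing worth noting about the session above: every value in the sample dicts is a string, so DictVectorizer one-hot encodes the continuous columns too (that is why three records produce 29 columns). The following is a minimal sketch of one way around this, assuming the columns the dataset description marks as "continuous" should stay numeric. The `CONTINUOUS` set is an illustrative helper, and `DecisionTreeClassifier` is an arbitrary choice of estimator that accepts sparse input; neither comes from the original answer.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

features = ["age", "workclass", "fnlwgt", "education", "education-num",
            "marital-status", "occupation", "relationship", "race", "sex",
            "capital-gain", "capital-loss", "hours-per-week", "native-country"]

# Assumption: these are the columns listed as "continuous" in the question.
CONTINUOUS = {"age", "fnlwgt", "education-num",
              "capital-gain", "capital-loss", "hours-per-week"}

input_text = """38 private 215646 hs-grad 9 divorced handlers-cleaners not-in-family white male 0 0 40 united-states <=50k
53 private 234721 11th 7 married-civ-spouse handlers-cleaners husband black male 0 0 40 united-states <=50k
30 state-gov 141297 bachelors 13 married-civ-spouse prof-specialty husband asian-pac-islander male 0 0 40 india >50k
"""

samples, y = [], []
for ln in input_text.splitlines():
    values = ln.split()
    y.append(values[-1])
    # Convert continuous fields to float; leave categorical fields as strings.
    d = {name: float(v) if name in CONTINUOUS else v
         for name, v in zip(features, values[:-1])}
    samples.append(d)

dv = DictVectorizer()
X = dv.fit_transform(samples)

# Numeric features keep one column each; string features get one column
# per observed "name=value" pair.
print(X.shape)           # (3, 23)
print(dv.feature_names_)

# Any estimator that supports sparse input can be fitted on X directly.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict(X))
```

With the conversion in place, "age" becomes a single numeric column rather than three indicator columns such as "age=38", which is usually what a model needs for continuous data.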
test samples can passed dictvectorizer.transform; if there feature/value combinations in test set not occur in training set, these ignored (because learned model cannot sensible them anyway).