Decision tree - Features used in the scikit-learn implementation of DT
I have implemented a DT classifier with CV in scikit-learn. However, I would also like to output the number of features that contributed to the classification. This is the code I have so far:
    from collections import defaultdict

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from scipy.sparse import csr_matrix

    lemma2feat = defaultdict(lambda: defaultdict(float))  # {lemma: {feat: weight}}
    lemma2cat = dict()
    features = set()

    # Each input line holds: lemma, feature, weight, class (whitespace-separated)
    with open("input.csv", "r") as infile:
        for line in infile:
            lemma, feature, weight, tclass = line.split()
            lemma2feat[lemma][feature] = float(weight)
            lemma2cat[lemma] = int(tclass)
            features.add(feature)

    sorted_rows = sorted(lemma2feat.keys())

    # Map each feature name to a column index
    col2index = dict()
    for colidx, col in enumerate(sorted(list(features))):
        col2index[col] = colidx

    dmat = np.zeros((len(sorted_rows), len(col2index.keys())), dtype=float)

    # Populate the matrix
    for vidx, vector in enumerate(sorted_rows):
        for feature in lemma2feat[vector].keys():
            dmat[vidx][col2index[feature]] = lemma2feat[vector][feature]

    # Class labels, in the same order as the rows of dmat
    res = []
    for lem in sorted_rows:
        res.append(lemma2cat[lem])

    clf = DecisionTreeClassifier(random_state=0)

    print("acc:")
    print(cross_val_score(clf, dmat, np.asarray(res), cv=10, scoring="accuracy"))
What can I include to output the number of features? I looked at RFE, for instance, as I inquired in a different question, but it apparently cannot be used with a DT. Therefore, I would like to know if there is a way to modify the above code to output the number of features that contribute to the highest accuracy. The overall goal here is to plot this in an elbow plot, in comparison with the output of other classifiers. Thank you.
You can inspect the relevant features using the feature_importances_ attribute once the tree is fit. It will give you an array of n_features float values such that feature_importances_[i] will be high (w.r.t. the other values) if the i-th feature was important/helpful to build the tree, and low (close to 0) if not.
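As a rough sketch of how this could be applied to the question, the snippet below counts the features with non-zero importance in the tree fitted on each cross-validation fold. It assumes a scikit-learn version that provides cross_validate with return_estimator=True; the synthetic X and y from make_classification are only stand-ins for dmat and res, and count_used_features is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for dmat / res from the question (assumption for illustration).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

clf = DecisionTreeClassifier(random_state=0)

# return_estimator=True keeps the tree that was fitted on each training fold.
cv_results = cross_validate(
    clf, X, y, cv=10, scoring="accuracy", return_estimator=True
)


def count_used_features(tree):
    """Hypothetical helper: number of features with non-zero importance."""
    return int(np.count_nonzero(tree.feature_importances_))


used_per_fold = [count_used_features(est) for est in cv_results["estimator"]]

print("accuracy per fold:", cv_results["test_score"])
print("features used per fold:", used_per_fold)

# Rank the features of a tree fitted on all the data, most important first.
clf.fit(X, y)
importances = clf.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking[:5]:
    print("feature %d: importance %.3f" % (idx, importances[idx]))
```

A feature with importance exactly 0 was never chosen for a split, so the per-fold counts give one possible number to plot against accuracy. Note that the counts can differ between folds, since each fold trains a separate tree.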