Sklearn
step 1
crreate a classification dataset(n_samples >= 1000, n_features >= 10)
dataset = datasets.make_classification(
n_samples=1000,
n_features=10,
n_informative = 8,
n_redundant=2,
n_repeated=0,
n_classes=4)
使用函数datasets.make_classification
来创建一个随机数据集。其中n_samples
表示样例数,n_features
是特征数,n_informative
是有效的特征数(受到n_features
的值的限制),n_redundant
是冗余特征数(某个冗余特征可以表示成其他特征的线性组合),n_repeated
是重复特征数。
n_classes
是分类数,表示可以分成多少类。
step 2
split the dataset using 10-fold cross validation
kf = cross_validation.KFold(
len(dataset[0]),
n_folds=10,
shuffle=True)
for train_index, test_index in kf:
X_train, y_train = dataset[0][train_index], dataset[1][train_index]
X_test, y_test = dataset[0][test_index], dataset[1][test_index]
cross_validation.kFold
可以直接完成cross validation.不过该函数返回的是类似于”索引”的东西,需要用for ... in
语法来获得其值。
step 3
GaussianNB
from sklearn.naive_bayes import GaussianNB
for train_index, test_index in kf:
X_train, y_train = dataset[0][train_index], dataset[1][train_index]
X_test, y_test = dataset[0][test_index], dataset[1][test_index]
clf = GaussianNB()
clf.fit(X_train, y_train)
pred = cli.predict(X_test)
采用.fit()
和.predict()
方法来拟合和预测。以下不再赘述。
SVC
possible C value [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel
from sklearn.svm import SVC
params = [1e-02, 1e-01, 1e00, 1e01, 1e02]
for train_index, test_index in kf:
X_train, y_train = dataset[0][train_index], dataset[1][train_index]
X_test, y_test = dataset[0][test_index], dataset[1][test_index]
for c in params:
clf = SVC(C=c, kernel='rbf', gamma=0.1)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
RandomForestClassifer
possible n estimators values[10, 100, 1000]
from sklearn.ensemble import RandomForestClassifier
params = [10, 100, 1000]
for train_index, test_index in kf:
X_train, y_train = dataset[0][train_index], dataset[1][train_index]
X_test, y_test = dataset[0][test_index], dataset[1][test_index]
for estimate in params:
clf = RandomForestClassifier(n_estimators = estimate)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
Evaluate the cross-validated performance
Accuracy
from sklearn import metrics
acc = metrics.accuracy_score(y_test, pred)
F1-score
from sklearn import metrics
f1 = metrics.f1_score(y_test, pred)
AUC ROC
from sklearn import metrics
auc = metrics.roc_auc_score(y_test, pred)