Sklearn

Sklearn

step 1

crreate a classification dataset(n_samples >= 1000, n_features >= 10)

dataset = datasets.make_classification(
    n_samples=1000, 
    n_features=10, 
    n_informative = 8, 
    n_redundant=2, 
    n_repeated=0, 
    n_classes=4)

使用函数datasets.make_classification来创建一个随机数据集。其中n_samples表示样例数,n_features是特征数,n_informative是有效的特征数(受到n_features的值的限制),n_redundant是冗余特征数(某个冗余特征可以表示成其他特征的线性组合),n_repeated是重复特征数。 n_classes是分类数,表示可以分成多少类。

step 2

split the dataset using 10-fold cross validation

kf = cross_validation.KFold(
    len(dataset[0]), 
    n_folds=10, 
    shuffle=True)
for train_index, test_index in kf:
    X_train, y_train = dataset[0][train_index], dataset[1][train_index]
    X_test, y_test = dataset[0][test_index], dataset[1][test_index]

cross_validation.kFold可以直接完成cross validation.不过该函数返回的是类似于”索引”的东西,需要用for ... in语法来获得其值。

step 3

GaussianNB

from sklearn.naive_bayes import GaussianNB
for train_index, test_index in kf:
    X_train, y_train = dataset[0][train_index], dataset[1][train_index]
    X_test, y_test = dataset[0][test_index], dataset[1][test_index]

    clf = GaussianNB()
    clf.fit(X_train, y_train)
    pred = cli.predict(X_test)

采用.fit().predict()方法来拟合和预测。以下不再赘述。

SVC

possible C value [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel

from sklearn.svm import SVC
params = [1e-02, 1e-01, 1e00, 1e01, 1e02]
for train_index, test_index in kf:
    X_train, y_train = dataset[0][train_index], dataset[1][train_index]
    X_test, y_test = dataset[0][test_index], dataset[1][test_index]

    for c in params:
        clf = SVC(C=c, kernel='rbf', gamma=0.1)
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)

RandomForestClassifer

possible n estimators values[10, 100, 1000]

from sklearn.ensemble import RandomForestClassifier

params = [10, 100, 1000]
for train_index, test_index in kf:
    X_train, y_train = dataset[0][train_index], dataset[1][train_index]
    X_test, y_test = dataset[0][test_index], dataset[1][test_index]

    for estimate in params:
        clf = RandomForestClassifier(n_estimators = estimate)
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)

Evaluate the cross-validated performance

Accuracy

from sklearn import metrics
acc = metrics.accuracy_score(y_test, pred)

F1-score

from sklearn import metrics
f1 = metrics.f1_score(y_test, pred)

AUC ROC

from sklearn import metrics
auc = metrics.roc_auc_score(y_test, pred)