Book: 《机器学习实战：基于Scikit-Learn和TensorFlow》 (Chinese edition)

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

Author: Aurélien Géron

Publisher: China Machine Press (机械工业出版社)

Chapter 3: Classification

Classifying handwritten digits with the MNIST dataset

1. A First Look at the Data

Import the dataset

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

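Datasets loaded by Scikit-Learn generally have a similar dictionary structure, including: a DESCR key describing the dataset; a data key containing an array with one row per instance and one column per feature; and a target key containing an array with the labels.

Look at the keys of the dataset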

print(mnist.keys())

The output is as follows

Check the size of the dataset

X, y = mnist["data"], mnist["target"]
print(X.shape, y.shape)

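The output is as follows:

The result shows 70,000 images in total, each with 784 features (one per pixel of a 28×28 image).

Next, look at one of the digits; the first one will do

Grab one instance's feature vector, reshape it into a 28×28 array, and display it with Matplotlib's imshow() function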

import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap=mpl.cm.binary)
plt.axis("off")
plt.show()
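
To make the reshaping step concrete, here is a minimal sketch in plain Python (no NumPy) of what a row-major reshape does to a 784-element vector. The helper to_grid and the stand-in vector flat are illustrative names, not part of the book's code; the point is only that flat index k ends up at row k // 28, column k % 28.

```python
# Row-major reshape of a flat 784-element vector into 28 rows of 28 pixels.
SIDE = 28

def to_grid(flat, side=SIDE):
    """Split a flat sequence into `side` rows of `side` values (row-major)."""
    if len(flat) != side * side:
        raise ValueError("expected %d values, got %d" % (side * side, len(flat)))
    return [list(flat[r * side:(r + 1) * side]) for r in range(side)]

flat = list(range(SIDE * SIDE))   # stand-in for one MNIST feature vector
grid = to_grid(flat)

k = 300                           # any flat index
row, col = divmod(k, SIDE)        # 300 = 10 * 28 + 20, so row 10, column 20
print(row, col, grid[row][col] == flat[k])
```

NumPy's reshape(28, 28) uses the same row-major order by default, which is why the pixels line up as an image.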

The image looks like a 5; checking the actual label confirms it is a 5

print(y[0])

Note that the label is a string, while most machine learning algorithms expect numbers, so convert y to integers

import numpy as np
y = y.astype(np.uint8)
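
A small illustration of why the conversion matters, using plain Python lists as a stand-in for the label array (the values below are toy labels chosen for the example): a string label never compares equal to an integer, so a test such as "label == 5" silently matches nothing until the labels are converted.

```python
labels = ["5", "0", "4", "1", "9"]          # toy string labels, as before conversion

# Comparing strings with an integer never matches.
print([lab == 5 for lab in labels])

# After converting to integers, the comparison behaves as intended.
int_labels = [int(lab) for lab in labels]
is_five = [lab == 5 for lab in int_labels]
print(is_five)
```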

Using the same approach, plot 100 images

def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.binary,
               interpolation="nearest")
    plt.axis("off")

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    # This is equivalent to n_rows = ceil(len(instances) / images_per_row):
    n_rows = (len(instances) - 1) // images_per_row + 1

    # Append empty images to fill the end of the grid, if needed:
    n_empty = n_rows * images_per_row - len(instances)
    padded_instances = np.concatenate([instances, np.zeros((n_empty, size * size))], axis=0)

    # Reshape the array so it's organized as a grid containing 28×28 images:
    image_grid = padded_instances.reshape((n_rows, images_per_row, size, size))

    # Combine axes 0 and 2 (vertical image grid axis, and vertical image axis),
    # and axes 1 and 3 (horizontal axes). We first need to move the axes that we
    # want to combine next to each other, using transpose(), and only then we
    # can reshape:
    big_image = image_grid.transpose(0, 2, 1, 3).reshape(n_rows * size,
                                                         images_per_row * size)
    # Now that we have a big image, we just need to show it:
    plt.imshow(big_image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

plt.figure(figsize=(9,9))
example_images = X[:100]
plot_digits(example_images, images_per_row=10)
plt.show()

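The 100 images:

Before digging deeper into the data, you should create a test set and set it aside. In fact, the MNIST dataset is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images)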

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

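2. Training a Binary Classifier

To simplify the problem, start by trying to identify a single digit, for example the digit 5. This "5-detector" is an example of a binary classifier: it can only distinguish between two classes, 5 and not-5. First create the target vectors for this classification task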

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

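Next, pick a classifier and train it. A good starting point is a stochastic gradient descent (SGD) classifier, using Scikit-Learn's SGDClassifier class. One advantage of this classifier is that it handles very large datasets efficiently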

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_5)

print(sgd_clf.predict([some_digit]))  # recall that some_digit = X[0]

Because the digit being predicted is the first one, which is indeed a 5, the result is True

3. Performance Evaluation

The first performance metric is accuracy

Accuracy can be measured with cross-validation

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print("SGD classifier, manual cross-validation accuracy:")
    print(n_correct / len(y_pred))
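
The property that stratification provides can be sketched without scikit-learn: every fold keeps roughly the same class ratio as the full dataset. The function stratified_folds below is a hypothetical, simplified illustration (it deals the indices of each class round-robin and does no shuffling); it is not how StratifiedKFold is implemented.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits=3):
    """Assign sample indices to folds, dealing each class's indices round-robin."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % n_splits].append(i)
    return folds

labels = [True] * 3 + [False] * 27      # toy data: 10% positives
folds = stratified_folds(labels)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)         # each fold: 10 samples, 1 positive
```

With a plain, unstratified split of such skewed data, a fold could easily end up with no positives at all, which would make its score meaningless for the positive class.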

Use the cross_val_score() function to evaluate the SGDClassifier model

from sklearn.model_selection import cross_val_score
cv3 = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print("SGD classifier, cross_val_score accuracy:")
print(list(cv3))

The result looks good, with accuracy as high as 95%. Now look at a classifier that predicts "not 5" for every digit, and see how it performs

from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        # Nothing to learn; return self to follow the Scikit-Learn estimator convention
        return self
    def predict(self, X):
        # Always predict "not 5": one False per instance
        return np.zeros(len(X), dtype=bool)
never_5_clf = Never5Classifier()
not5 = cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print("Never-5 classifier accuracy:")
print(not5)

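Accuracy results:

Surprisingly, its accuracy is also as high as 90%. Why? Only about 10% of the digits in this dataset are 5s, so always predicting "not 5" is right about 90% of the time. Accuracy is therefore generally not the preferred performance measure for classifiers, especially on skewed datasets where some classes are much more frequent than others; a better tool is the confusion matrix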
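
The 90% figure can be checked by hand from the counts in the confusion matrix reported further down in this article (53,892 + 687 non-5 images and 1,891 + 3,530 images of a 5). This is just arithmetic on those published counts, with illustrative variable names:

```python
# Counts taken from the confusion matrix reported in this article.
true_neg, false_pos = 53892, 687     # actual non-5 images
false_neg, true_pos = 1891, 3530     # actual 5 images

n_not_five = true_neg + false_pos
n_five = false_neg + true_pos
total = n_not_five + n_five

# A classifier that always answers "not 5" is right on every non-5 image.
never5_accuracy = n_not_five / total
print(total, n_five / total, never5_accuracy)
```

On the full training set this gives 54,579 / 60,000, just under 91%; the per-fold scores printed above vary slightly around that value.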

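Here the confusion matrix is a 2×2 matrix that shows, among the actual 5s, how many were classified correctly and how many incorrectly, and likewise among the non-5s

Now look at the confusion matrix, again using cross-validation

# Like cross_val_score(), cross_val_predict() performs K-fold cross-validation,
# but instead of returning evaluation scores it returns the predictions made on each test fold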
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

from sklearn.metrics import confusion_matrix
print("SGD classifier confusion matrix:")
print(confusion_matrix(y_train_5, y_train_pred))

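Each row of the confusion matrix represents an actual class, and each column a predicted class. Here the first row covers all the non-5 images (the negative class): 53,892 were correctly classified as non-5s (true negatives), while 687 were wrongly classified as 5s (false positives). The second row covers all the images of a 5 (the positive class): 1,891 were wrongly classified as non-5s (false negatives), while 3,530 were correctly classified as 5s (true positives). A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right)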
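
The row/column layout described above can be reproduced with a few lines of plain Python. The function confusion_matrix_2x2 and the toy label lists are illustrative only; they follow the same convention (rows are actual classes, columns are predicted classes, negative class first).

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for boolean labels (rows: actual, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[int(actual)][int(predicted)] += 1
    return matrix

# Toy data: 3 actual positives, 5 actual negatives.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False]

print(confusion_matrix_2x2(y_true, y_pred))   # [[4, 1], [1, 2]]
```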

Now look at the confusion matrix of a perfect set of predictions

y_train_perfect_predictions = y_train_5
print("Perfect-prediction confusion matrix:")
print(confusion_matrix(y_train_5, y_train_perfect_predictions))

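Confusion matrix results:

A confusion matrix is a table of raw counts and is not very intuitive; for a more concise assessment of performance, the following metrics are more convenient: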

# Scikit-Learn provides functions for computing many classifier metrics, including precision and recall; re-evaluate the model with them
from sklearn.metrics import precision_score, recall_score
pre5 = precision_score(y_train_5, y_train_pred)
# precision = TP / (TP + FP), i.e. pre5 = 3530 / (3530 + 687)
rec5 = recall_score(y_train_5, y_train_pred)
# recall = TP / (TP + FN), i.e. rec5 = 3530 / (3530 + 1891)
print("Precision:", pre5, "Recall:", rec5)

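The results show a precision of 83.7% and a recall of 65.1%. In other words, when the model claims an image is a 5, it is correct only 83.7% of the time, and it detects only 65.1% of the actual 5s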
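
Both percentages follow directly from the confusion matrix counts quoted earlier (3,530 true positives, 687 false positives, 1,891 false negatives). A short worked computation, with illustrative variable names:

```python
tp, fp, fn = 3530, 687, 1891    # counts from the confusion matrix in this article

precision = tp / (tp + fp)      # of everything flagged as a 5, how much really is a 5
recall = tp / (tp + fn)         # of all real 5s, how many were flagged

print(round(precision, 4), round(recall, 4))   # 0.8371 0.6512
```

Note that the true negatives (53,892) appear in neither formula, which is why these metrics are not inflated by the large number of non-5 images.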

Precision and recall can be combined into a single metric called the F1 score. The F1 score is the harmonic mean of precision and recall

# In sklearn, computing the F1 score only takes a call to f1_score()
from sklearn.metrics import f1_score
F1_5 = f1_score(y_train_5, y_train_pred)
print("F1 score:", F1_5)

Precision, recall, and F1 score results:
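
As a worked check of the harmonic-mean definition, the sketch below recomputes F1 from the same confusion matrix counts in three equivalent ways and compares it with the ordinary arithmetic mean. Only Python's standard library is used; the variable names are illustrative.

```python
from statistics import harmonic_mean

tp, fp, fn = 3530, 687, 1891    # counts from the confusion matrix in this article
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = 2 * precision * recall / (precision + recall)
f1_from_counts = tp / (tp + (fn + fp) / 2)       # algebraically the same quantity
f1_stdlib = harmonic_mean([precision, recall])

arithmetic = (precision + recall) / 2
print(round(f1, 4), round(arithmetic, 4))        # 0.7325 0.7441
```

The harmonic mean sits below the arithmetic mean because it gives more weight to the lower of the two values, so a classifier only gets a high F1 score when both precision and recall are high.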

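4. The Precision/Recall Trade-off

Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores it uses to make predictions. Instead of calling the classifier's predict() method, call its decision_function() method, which returns a score for each instance; you can then make predictions from those scores using any threshold you like.

Use the cross_val_predict() function to get the scores for all instances in the training set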

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
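
To see what "using any threshold" means in practice, here is a minimal sketch on made-up scores (the numbers and the helper precision_recall_at are invented for illustration, not taken from MNIST). An instance is predicted positive when its score is at least the threshold.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0   # convention when nothing is flagged
    recall = tp / (tp + fn)
    return precision, recall

# Toy decision scores and true labels (hypothetical values).
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
labels = [False, True, False, True, True, True]

for threshold in (-2.0, 0.0, 1.5):
    print(threshold, precision_recall_at(scores, labels, threshold))
# -2.0 (0.8, 1.0)
# 0.0 (0.75, 0.75)
# 1.5 (1.0, 0.5)
```

Raising the threshold can only lower recall, whereas precision usually rises but may dip along the way, as it does here between the first two thresholds.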

With these scores, use the precision_recall_curve() function to compute precision and recall for all possible thresholds

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Next, use Matplotlib to plot precision and recall as functions of the threshold

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.legend(loc="center right", fontsize=16)
    plt.xlabel("Threshold", fontsize=16)
    plt.grid(True)
    plt.axis([-50000, 50000, 0, 1])

recall_90_precision = recalls[np.argmax(precisions >= 0.90)]
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.plot([threshold_90_precision, threshold_90_precision], [0., 0.9], "r:")
plt.plot([-50000, threshold_90_precision], [0.9, 0.9], "r:")
plt.plot([-50000, threshold_90_precision], [recall_90_precision, recall_90_precision], "r:")
plt.plot([threshold_90_precision], [0.9], "ro")
plt.plot([threshold_90_precision], [recall_90_precision], "ro")
plt.show()

Precision and recall as functions of the threshold

Now look at the curve of precision against recall
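
The expression np.argmax(precisions >= 0.90) in the code above relies on the fact that argmax over a boolean array returns the index of the first True. The plain-Python sketch below mimics that lookup on invented arrays (first_index_at_least and all values are hypothetical) and makes the failure case explicit: np.argmax returns 0 when no element is True, which would silently select the lowest threshold.

```python
def first_index_at_least(values, target):
    """Index of the first value >= target, or None when no value qualifies."""
    for i, v in enumerate(values):
        if v >= target:
            return i
    return None

# Toy curve; as with precision_recall_curve(), precisions and recalls
# have one more entry than thresholds.
precisions = [0.50, 0.70, 0.85, 0.92, 0.95, 1.00]
recalls    = [1.00, 0.90, 0.70, 0.50, 0.20, 0.00]
thresholds = [-4.0, -1.0, 0.5, 2.0, 5.0]

idx = first_index_at_least(precisions, 0.90)
print(idx, thresholds[idx], recalls[idx])         # 3 2.0 0.5
print(first_index_at_least(precisions, 1.5))      # None: target never reached
```

One more edge case is worth guarding against in real code: if only the final precision entry reaches the target, the index has no matching entry in thresholds.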

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.plot([recall_90_precision, recall_90_precision], [0., 0.9], "r:")
plt.plot([0.0, recall_90_precision], [0.9, 0.9], "r:")
plt.plot([recall_90_precision], [0.9], "ro")
plt.show()

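Precision versus recall

As the curves show, precision and recall pull in opposite directions: raising one generally lowers the other, so you cannot maximize both

Choose the threshold that gives 90% precision, and check the resulting precision and recall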

y_train_pred_90 = (y_scores >= threshold_90_precision)
print(precision_score(y_train_5, y_train_pred_90), recall_score(y_train_5, y_train_pred_90))

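The result: with the threshold chosen for 90% precision, the model reaches a precision of 0.900003 and a recall of 0.47998. So without changing the model itself, requiring 90% precision leaves recall below 50%.

5. The ROC Curve

To plot the ROC curve, first use the roc_curve() function to compute the TPR and FPR for various thresholds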

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

Then plot the TPR against the FPR with Matplotlib. The following code produces the curve shown in Figure 3-6 of the book:

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16)
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
fpr_90 = fpr[np.argmax(tpr >= recall_90_precision)]
plt.plot([fpr_90, fpr_90], [0., recall_90_precision], "r:")
plt.plot([0.0, fpr_90], [recall_90_precision, recall_90_precision], "r:")
plt.plot([fpr_90], [recall_90_precision], "ro")
plt.show()

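ROC curve

Once again there is a trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dashed line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

One way to compare classifiers is to measure the area under the curve (AUC): a perfect classifier has a ROC AUC of 1, whereas a purely random classifier has a ROC AUC of 0.5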

from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_train_5, y_scores))
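
For intuition about what roc_curve() and roc_auc_score() compute, here is a minimal plain-Python sketch on made-up scores. The helpers roc_points and trapezoid_auc are illustrative and far less efficient than the library functions; they sweep the threshold from high to low, record (FPR, TPR) at each step, and integrate with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) lists, sweeping the threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]                      # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy decision scores and true labels (hypothetical values).
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
labels = [False, True, False, True, True, True]

fpr, tpr = roc_points(scores, labels)
print(list(zip(fpr, tpr)))
print(trapezoid_auc(fpr, tpr))                   # 0.875
```

The value 0.875 also equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score: 7 of the 8 pairs in this toy data.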

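Next, train another model (a RandomForestClassifier) and compare the ROC curves of the two models. RandomForestClassifier has no decision_function() method, so the code uses predict_proba() instead and takes the probability of the positive class as the score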

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")

y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)

recall_for_forest = tpr_forest[np.argmax(fpr_forest >= fpr_90)]

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.plot([fpr_90, fpr_90], [0., recall_90_precision], "r:")
plt.plot([0.0, fpr_90], [recall_90_precision, recall_90_precision], "r:")
plt.plot([fpr_90], [recall_90_precision], "ro")
plt.plot([fpr_90, fpr_90], [0., recall_for_forest], "r:")
plt.plot([fpr_90], [recall_for_forest], "ro")
plt.grid(True)
plt.legend(loc="lower right", fontsize=16)
plt.show()

ROC curves of the two models compared

Check the random forest's ROC AUC

print("Random forest ROC AUC:", roc_auc_score(y_train_5, y_scores_forest))

Now look at the random forest's precision and recall

y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
print("Random forest precision and recall:")
print(precision_score(y_train_5, y_train_pred_forest), recall_score(y_train_5, y_train_pred_forest))

Random forest ROC AUC, precision, and recall results:

Complete code

# Chapter 3: Classification
# 3.1 Classifying handwritten digits with the MNIST dataset

# Preparation
# Import the dataset
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

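# Datasets loaded by Scikit-Learn generally have a similar dictionary structure, including:
# a DESCR key describing the dataset; a data key containing an array with one row per instance
# and one column per feature; and a target key containing an array with the labels.

# Look at the keys of the dataset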
print(mnist.keys())

# Check the size of the dataset
X, y = mnist["data"], mnist["target"]
print(X.shape, y.shape)
# The result shows 70,000 images in total, each with 784 features.

# Look at one of the digits; the first one will do
# Grab one instance's feature vector, reshape it into a 28×28 array, and display it with Matplotlib's imshow() function
import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap=mpl.cm.binary)
plt.axis("off")
plt.show()

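# The image looks like a 5; checking the actual label confirms it is a 5
print("Label of the first digit:", y[0])

# Note that the label is a string, while most machine learning algorithms expect numbers, so convert y to integers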
import numpy as np
y = y.astype(np.uint8)


# Using the same approach, plot 100 images
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.binary,
               interpolation="nearest")
    plt.axis("off")

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    # This is equivalent to n_rows = ceil(len(instances) / images_per_row):
    n_rows = (len(instances) - 1) // images_per_row + 1

    # Append empty images to fill the end of the grid, if needed:
    n_empty = n_rows * images_per_row - len(instances)
    padded_instances = np.concatenate([instances, np.zeros((n_empty, size * size))], axis=0)

    # Reshape the array so it's organized as a grid containing 28×28 images:
    image_grid = padded_instances.reshape((n_rows, images_per_row, size, size))

    # Combine axes 0 and 2 (vertical image grid axis, and vertical image axis),
    # and axes 1 and 3 (horizontal axes). We first need to move the axes that we
    # want to combine next to each other, using transpose(), and only then we
    # can reshape:
    big_image = image_grid.transpose(0, 2, 1, 3).reshape(n_rows * size,
                                                         images_per_row * size)
    # Now that we have a big image, we just need to show it:
    plt.imshow(big_image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

plt.figure(figsize=(9,9))
example_images = X[:100]
plot_digits(example_images, images_per_row=10)
plt.show()

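# Before digging deeper into the data, you should create a test set and set it aside.
# In fact, the MNIST dataset is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images)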

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

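# 3.2 Training a binary classifier
# To simplify the problem, start by trying to identify a single digit, for example the digit 5. This "5-detector" is an example of a binary classifier.
# It can only distinguish between two classes, 5 and not-5. First create the target vectors for this classification task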

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

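# Next, pick a classifier and train it. A good starting point is a stochastic gradient descent (SGD) classifier, using Scikit-Learn's SGDClassifier class.
# One advantage of this classifier is that it handles very large datasets efficiently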

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_5)

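print(sgd_clf.predict([some_digit]))  # recall that some_digit = X[0]
# Because the digit being predicted is the first one, which is indeed a 5, the result is True

# 3.3 Performance evaluation
# 3.3.1 Measuring accuracy with cross-validation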
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print("SGD classifier, manual cross-validation accuracy:")
    print(n_correct / len(y_pred))

# Use the cross_val_score() function to evaluate the SGDClassifier model
from sklearn.model_selection import cross_val_score
cv3 = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print("SGD classifier, cross_val_score accuracy:")
print(list(cv3))

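# The result looks good, with accuracy as high as 95%. Now look at a classifier that predicts "not 5" for every digit, and see how it performs
from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        # Nothing to learn; return self to follow the Scikit-Learn estimator convention
        return self
    def predict(self, X):
        # Always predict "not 5": one False per instance
        return np.zeros(len(X), dtype=bool)
never_5_clf = Never5Classifier()
not5 = cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
print("Never-5 classifier accuracy:")
print(not5)
# Surprisingly, its accuracy is also as high as 90%. Why? Only about 10% of the digits in this dataset are 5s,
# so always predicting "not 5" is right about 90% of the time. Accuracy is therefore generally not the preferred
# performance measure for classifiers on skewed datasets; a better tool is the confusion matrix

# Here the confusion matrix is a 2×2 matrix that shows, among the actual 5s, how many were classified correctly
# and how many incorrectly, and likewise among the non-5s
# Now look at the confusion matrix

# Again using cross-validation
# Like cross_val_score(), cross_val_predict() performs K-fold cross-validation,
# but instead of returning evaluation scores it returns the predictions made on each test fold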
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

from sklearn.metrics import confusion_matrix
print("SGD classifier confusion matrix:")
print(confusion_matrix(y_train_5, y_train_pred))

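# Each row of the confusion matrix represents an actual class, and each column a predicted class.
# Here the first row covers all the non-5 images (the negative class):
# 53,892 were correctly classified as non-5s (true negatives), while 687 were wrongly classified as 5s (false positives);
# the second row covers all the images of a 5 (the positive class): 1,891 were wrongly classified as non-5s (false negatives),
# while 3,530 were correctly classified as 5s (true positives).

# A perfect classifier would have only true positives and true negatives, so its confusion matrix would have
# nonzero values only on its main diagonal (top left to bottom right)
# Now look at the confusion matrix of a perfect set of predictions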
y_train_perfect_predictions = y_train_5
print("Perfect-prediction confusion matrix:")
print(confusion_matrix(y_train_5, y_train_perfect_predictions))

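# For a more concise assessment of performance, the following metrics are more convenient

# Scikit-Learn provides functions for computing many classifier metrics, including precision and recall; re-evaluate the model with them
from sklearn.metrics import precision_score, recall_score
pre5 = precision_score(y_train_5, y_train_pred)
# precision = TP / (TP + FP), i.e. pre5 = 3530 / (3530 + 687)
rec5 = recall_score(y_train_5, y_train_pred)
# recall = TP / (TP + FN), i.e. rec5 = 3530 / (3530 + 1891)
print("Precision:", pre5, "Recall:", rec5)
# The results show a precision of 83.7% and a recall of 65.1%
# In other words, when the model claims an image is a 5, it is correct only 83.7% of the time,
# and it detects only 65.1% of the actual 5s


# Precision and recall can be combined into a single metric called the F1 score. The F1 score is the harmonic mean of precision and recall
# In sklearn, computing the F1 score only takes a call to f1_score()
from sklearn.metrics import f1_score
F1_5 = f1_score(y_train_5, y_train_pred)
print("F1 score:", F1_5)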

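# 3.3.4 The precision/recall trade-off
# Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores it uses to make predictions.
# Instead of calling the classifier's predict() method, call its decision_function() method,
# which returns a score for each instance; you can then make predictions from those scores using any threshold you like

# Use the cross_val_predict() function to get the scores for all instances in the training set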
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

# With these scores, use the precision_recall_curve() function to compute precision and recall for all possible thresholds
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

# Use Matplotlib to plot precision and recall as functions of the threshold
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.legend(loc="center right", fontsize=16)
    plt.xlabel("Threshold", fontsize=16)
    plt.grid(True)
    plt.axis([-50000, 50000, 0, 1])

recall_90_precision = recalls[np.argmax(precisions >= 0.90)]
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.plot([threshold_90_precision, threshold_90_precision], [0., 0.9], "r:")
plt.plot([-50000, threshold_90_precision], [0.9, 0.9], "r:")
plt.plot([-50000, threshold_90_precision], [recall_90_precision, recall_90_precision], "r:")
plt.plot([threshold_90_precision], [0.9], "ro")
plt.plot([threshold_90_precision], [recall_90_precision], "ro")
plt.show()

# Now look at the curve of precision against recall
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.plot([recall_90_precision, recall_90_precision], [0., 0.9], "r:")
plt.plot([0.0, recall_90_precision], [0.9, 0.9], "r:")
plt.plot([recall_90_precision], [0.9], "ro")
plt.show()


# Choose the threshold that gives 90% precision, and check the resulting precision and recall
y_train_pred_90 = (y_scores >= threshold_90_precision)
print(precision_score(y_train_5, y_train_pred_90), recall_score(y_train_5, y_train_pred_90))

# The ROC curve
# To plot the ROC curve, first use the roc_curve() function to compute the TPR and FPR for various thresholds

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

# Then plot the TPR against the FPR with Matplotlib. The following code produces the curve shown in Figure 3-6 of the book:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16)
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
fpr_90 = fpr[np.argmax(tpr >= recall_90_precision)]
plt.plot([fpr_90, fpr_90], [0., recall_90_precision], "r:")
plt.plot([0.0, fpr_90], [recall_90_precision, recall_90_precision], "r:")
plt.plot([fpr_90], [recall_90_precision], "ro")
plt.show()

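# Once again there is a trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces.
# The dashed line represents the ROC curve of a purely random classifier; a good classifier stays as far away
# from that line as possible (toward the top-left corner)

# One way to compare classifiers is to measure the area under the curve (AUC):
# a perfect classifier has a ROC AUC of 1, whereas a purely random classifier has a ROC AUC of 0.5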
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_train_5, y_scores))

# Next, train another model (a RandomForestClassifier) and compare the ROC curves of the two models
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")

y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)

recall_for_forest = tpr_forest[np.argmax(fpr_forest >= fpr_90)]

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.plot([fpr_90, fpr_90], [0., recall_90_precision], "r:")
plt.plot([0.0, fpr_90], [recall_90_precision, recall_90_precision], "r:")
plt.plot([fpr_90], [recall_90_precision], "ro")
plt.plot([fpr_90, fpr_90], [0., recall_for_forest], "r:")
plt.plot([fpr_90], [recall_for_forest], "ro")
plt.grid(True)
plt.legend(loc="lower right", fontsize=16)
plt.show()


# Check the random forest's ROC AUC
print("Random forest ROC AUC:", roc_auc_score(y_train_5, y_scores_forest))

# Now look at the random forest's precision and recall
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
print("Random forest precision and recall:")
print(precision_score(y_train_5, y_train_pred_forest), recall_score(y_train_5, y_train_pred_forest))