本文是深度学习入门（deep learning tutorial, DLT）系列的第三篇文章，主要介绍一下多元线性回归。想要学习深度学习或者想要了解机器学习的同学可以关注公众号GeodataAnalysis，我会逐步更新这一系列的文章。

多元线性回归和一元线性回归的模型表达基本一致，区别只在于输入的变量个数更多。因此，二者在假设函数和梯度下降方面有些区别，本文会在上一篇一元线性回归的基础上对其进行详细介绍。为了准确的衡量模型的精度，本文介绍了最小二乘法，使用该方法计算出的假设函数能最小化使损失函数的值。此外，多个变量的量纲不一定一致，因此本文还会介绍一下无量纲化在机器学习中的作用。

1 测试数据集

关注公众号GeodataAnalysis，回复20230302下载测试数据。测试数据为波士顿房价数据集，数据集的因变量为房价；共有十三个自变量，表示影响房价的十三个因素。

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('./housing.data', delim_whitespace=' ')
df.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

x = df[df.columns[:-1]].to_numpy().T
y = df[df.columns[-1]].to_numpy()
x.shape, y.shape

((13, 506), (506,))

2 假设函数

多元线性回归的输入变量是多维变量，其表示方法为：

容纳这些多个特征的假设函数的多变量形式如下：

使用矩阵乘法的定义，我们的多变量假设函数可以简洁地表示为如下形式，其中为全是1的矩阵，用于表示偏置项：

代码表示如下：

def hypothesis_fun(x, parameters):
    return parameters @ x

3 损失函数

多元线性回归的损失函数与一元线性回归一致，也是平⽅误差函数，即：

代码表示如下：

def loss_fun(x, y, parameters):
    y_predict = hypothesis_fun(x, parameters)
    return np.sum(np.square(y_predict-y))/(2*y.size)

4 梯度下降

多元线性回归的梯度下降算法和一元线性回归基本一致，区别只在于参数的数量多了一些，具体如下：

代码表示如下：

def gradient_decent(x, y, parameters, learning_rate):
    y_predict = hypothesis_fun(x, parameters)
    return parameters - learning_rate*(y_predict-y) @ x.T/y.size

5 最小二乘法

最小二乘法就是要找到一组使得（残差平方和) 最小，即，求。

最小二乘法的代数法解法就是对求偏导数，令偏导数为0，再解方程组，得到。矩阵法比代数法要简洁，下面主要讲解下矩阵法解法。

假设函数的矩阵并表达形式为：

其中，为维的矩阵，代表样本的特征数，代表样本的个数。假设函数为的向量, 为的向量，里面有个代数法的模型参数。

损失函数定义为，其中是样本的输出向量，维度为。在这主要是为了求导后系数为1，方便计算。

根据最小二乘法的原理，我们要对这个损失函数的向量求导取0。结果如下式：

对上述求导等式整理后可得：

如果不可逆，通常有以下两种原因：

冗余特征，其中两个特征非常密切相关（即它们是线性依赖的）
特征太多（例如m≤n）。在这种情况下，应删除某些特征或使用“正则化”。

使用最小二乘法计算假设函数的参数的代码如下：

x2 = np.vstack((np.ones(x.shape[1]), x))
parameters = np.linalg.inv(x2 @ x2.T) @ x2 @ y

6 无量纲化

本文介绍的无量纲化方法有两种，一是特征缩放，其计算公式为：

二是均值标准化，其计算公式为：

这两个无量纲化算法的代码表示如下：

def feature_scale(x):
    x_min = np.min(x, axis=1)[..., np.newaxis]
    x_max = np.max(x, axis=1)[..., np.newaxis]
    return (x - x_min)/(x_max - x_min)

def mean_normalization(x):
    x_min = np.min(x, axis=1)[..., np.newaxis]
    x_max = np.max(x, axis=1)[..., np.newaxis]
    x_mean = np.mean(x, axis=1)[..., np.newaxis]
    return (x - x_mean)/(x_max - x_min)

7 模型训练及预测

多元线性回归的训练步骤和一元线性回归基本一致，唯一的区别在于在正式训练之前对输入变量进行了无量纲化。此外，在输入变量x矩阵的第一行前插入全是1的一行，用于表示偏置项。

parameters = np.random.rand(x.shape[0]+1)
learning_rate = 0.1
losses = []

x2 = feature_scale(x)
x2 = np.vstack((np.ones(x2.shape[1]), x2))

batch_size = 300
epoch_size = 100

for epoch in range(epoch_size):
    for i in range(y.size//batch_size+1):
        random_samples = np.random.choice(x2.shape[1], batch_size)
        parameters = gradient_decent(x2[:, random_samples], 
                              y[random_samples], 
                              parameters, 
                              learning_rate)
        losses.append(loss_fun(x2[:, random_samples], 
                               y[random_samples], 
                               parameters))
plt.plot(losses);

现在我们的模型已经训练好了，下面我把模型预测的代码封装一下，具体如下：

def predict(x, parameters):
    x = feature_scale(x)
    x = np.vstack((np.ones(x.shape[1]), x))
    return parameters @ x

可视化预测结果的代码如下，这里我们对比了最小二乘法的结算结果和我们模型的结果，可以发现二者相差不大，具体如下：

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 9))
ax1.plot(losses)
ax1.set_title('Loss', fontdict={'size': 30})
ax2.scatter(predict(x, parameters), y)
ax2.plot(y, y, 'r-')
ax2.set_title('Linear Regression', fontdict={'size': 30})

# Best theta
x2 = np.vstack((np.ones(x.shape[1]), x))
theta = np.linalg.inv(x2 @ x2.T) @ x2 @ y
ax3.scatter(theta @ x2, y);
ax3.plot(y, y, 'r-')
ax3.set_title('Least squares', fontdict={'size': 30});

8 对模型进行封装

这一节我们将上文构建的模型封装成了Python的一个类，方便对其调用进行训练、预测等任务。目前这个类较为简单，可以将其视为一个建议的线性神经网络模型，理解这个类的构造对理解后文的代码十分有帮助，建议好好看一下。

class MutiLinearRe():

    def __init__(self, input_shape):
        self.input_shape = input_shape
        self.parameters = np.random.rand(self.input_shape[0]+1)
  
    def hypothesis_fun(self, x, parameters):
        return parameters @ x

    def loss_fun(self, x, y, parameters):
        y_predict = self.hypothesis_fun(x, parameters)
        return np.sum(np.square(y_predict-y))/(2*y.size)
  
    def gradient_decent(self, x, y, parameters, learning_rate):
        y_predict = self.hypothesis_fun(x, parameters)
        return parameters - learning_rate*(y_predict-y) @ x.T/y.size
  
    def feature_scale(self, x):
        x_min = np.min(x, axis=1)[..., np.newaxis]
        x_max = np.max(x, axis=1)[..., np.newaxis]
        return (x - x_min)/(x_max - x_min)

    def mean_normalization(self, x):
        x_min = np.min(x, axis=1)[..., np.newaxis]
        x_max = np.max(x, axis=1)[..., np.newaxis]
        x_mean = np.mean(x, axis=1)[..., np.newaxis]
        return (x - x_mean)/(x_max - x_min)

    def _normalization(self, x, normalization):
        if normalization:
            if 'feature' == normalization:
                x = self.feature_scale(x)
            elif 'mean' == normalization:
                x = self.mean_normalization(x)
        return x

    def fit(self, x, y, epoch_size, batch_size, learning_rate, normalization="feature"):
        self.learning_rate = learning_rate
        self.loss = []
        self.normalization = normalization
        self.x = self._normalization(x, self.normalization)
        self.x = np.vstack((np.ones(self.input_shape[1]), self.x))
        self.y = y.copy()
        
        for epoch in range(epoch_size):
            for i in range(y.size//batch_size+1):
                random_samples = np.random.choice(x2.shape[1], batch_size)
                self.parameters = self.gradient_decent(self.x[:, random_samples], 
                self.y[random_samples],
                self.parameters, 
                self.learning_rate)
                loss_ = self.loss_fun(self.x[:, random_samples], 
                                      self.y[random_samples], 
                                      self.parameters)
                if np.isinf(loss_):
                    raise ValueError("Overflow, please change the learning_rate")
                self.loss.append(loss_)
    
    def predict(self, x):
        x = self._normalization(x, self.normalization)
        x = np.vstack((np.ones(x.shape[1]), x))
        return self.parameters @ x

model = MutiLinearRe(input_shape=x.shape)
model.fit(x, y, epoch_size=100, batch_size=300, 
    learning_rate=0.1, normalization='feature')

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 9))
ax1.plot(model.loss[700:])
ax1.set_title('Loss', fontdict={'size': 30})
ax2.scatter(model.predict(x), y)
ax2.plot(y, y, 'r-')
ax2.set_title('Linear Regression', fontdict={'size': 30})

# Least squares: Best theta
x2 = np.vstack((np.ones(x.shape[1]), x))
theta = np.linalg.inv(x2 @ x2.T) @ x2 @ y
ax3.scatter(theta @ x2, y);
ax3.plot(y, y, 'r-')
ax3.set_title('Least squares', fontdict={'size': 30});