Feature Engineering and Polynomial Regression

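Goals

Explore feature engineering and polynomial regression, which let you use the machinery of linear regression to fit very complicated, even highly non-linear functions.
Understand how linear regression can model complicated, even highly non-linear functions using feature engineering.
Recognize that it is important to apply feature scaling when doing feature engineering.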

Tools

You will use the functions developed in previous labs, along with matplotlib and NumPy.

import numpy as np
import matplotlib.pyplot as plt
from lab_utils_multi import zscore_normalize_features, run_gradient_descent_feng
np.set_printoptions(precision=2)  # reduced display precision on numpy arrays

Feature Engineering and Polynomial Regression Overview

Out of the box, linear regression provides a means of building models of the form:

$$f_{\mathbf{w},b}(\mathbf{x}) = w_0x_0 + w_1x_1 + ... + w_{n-1}x_{n-1} + b \tag{1}$$
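As a minimal sketch of equation (1), the prediction for every example can be computed with one matrix-vector product. The helper name `predict` and the sample values are hypothetical and chosen only for illustration; they are not part of the lab utilities.

```python
import numpy as np

def predict(X, w, b):
    """Evaluate equation (1) for every row of X: f = X @ w + b."""
    return X @ w + b

# Hypothetical values, chosen only for illustration
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # 2 examples, n = 2 features
w = np.array([0.5, -1.0])    # one weight per feature
b = 2.0

print(predict(X, w, b))
```

Each output entry is a weighted sum of one row's features plus `b`, so the model stays linear in whatever columns `X` contains.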

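What if your features or data are non-linear, or are combinations of features? For example, housing prices do not tend to be linear in living area; they penalize very small or very large houses, resulting in a curve like the one shown in the figure above. How can we use the machinery of linear regression to fit such a curve?

Recall that the "machinery" we have is the ability to adjust the parameters $\mathbf{w}$ and $b$ in (1) to fit the equation to the training data. However, no amount of adjusting $\mathbf{w}$ and $b$ in (1) will achieve a fit to a non-linear curve.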

Polynomial Features

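Above we considered a scenario where the data was non-linear. Let's try using what we know so far to fit a non-linear curve. We'll start with a simple quadratic: $y = 1 + x^2$

You are already familiar with all the routines we are using. They are available in the lab_utils.py file for review. We will use np.c_[..], a NumPy routine that concatenates arrays along the column boundary.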
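A small illustration of what `np.c_` does may help here. The short array below is hypothetical and exists only to keep the printed result readable.

```python
import numpy as np

x = np.arange(0, 4, 1)   # a short 1-D array: 0, 1, 2, 3
X = np.c_[x, x**2]       # each 1-D array becomes one column of a 2-D array

print(X.shape)
print(X)
```

The result has one row per example and one column per feature, which is the layout the lab's gradient descent routine is given.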

# create target data
x = np.arange(0, 20, 1)
y = 1 + x**2
X = x.reshape(-1, 1)

model_w,model_b = run_gradient_descent_feng(X,y,iterations=1000, alpha = 1e-2)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("no feature engineering")
plt.plot(x,X@model_w + model_b, label="Predicted Value");  plt.xlabel("X"); plt.ylabel("y"); plt.legend(); plt.show()

Iteration         0, Cost: 1.65756e+03
Iteration       100, Cost: 6.94549e+02
Iteration       200, Cost: 5.88475e+02
Iteration       300, Cost: 5.26414e+02
Iteration       400, Cost: 4.90103e+02
Iteration       500, Cost: 4.68858e+02
Iteration       600, Cost: 4.56428e+02
Iteration       700, Cost: 4.49155e+02
Iteration       800, Cost: 4.44900e+02
Iteration       900, Cost: 4.42411e+02
w,b found by gradient descent: w: [18.7], b: -52.0834


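As expected, this is not a good fit. What is needed is something like $y = w_0x_0^2 + b$, that is, a polynomial feature.

To accomplish this, you can modify the input data to engineer the needed features. If you swap the original data for a version that squares the $x$ values, you can achieve $y = w_0x_0^2 + b$. Let's try it by replacing X with X**2 in the code below: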

# create target data
x = np.arange(0, 20, 1)  # integers 0 through 19
y = 1 + x**2             # target function y = 1 + x^2

# engineer features
X = x**2                 # <-- added engineered feature: replace x with x^2

X = X.reshape(-1, 1)     # X should be a 2-D matrix
model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha = 1e-5)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("Added x**2 feature")
plt.plot(x, np.dot(X,model_w) + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

Iteration         0, Cost: 7.32922e+03
Iteration      1000, Cost: 2.24844e-01
Iteration      2000, Cost: 2.22795e-01
Iteration      3000, Cost: 2.20764e-01
Iteration      4000, Cost: 2.18752e-01
Iteration      5000, Cost: 2.16758e-01
Iteration      6000, Cost: 2.14782e-01
Iteration      7000, Cost: 2.12824e-01
Iteration      8000, Cost: 2.10884e-01
Iteration      9000, Cost: 2.08962e-01
w,b found by gradient descent: w: [1.], b: 0.0490


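Great! The fit is nearly perfect. Notice the values of $\mathbf{w}$ and $b$ printed right above the graph: w,b found by gradient descent: w: [1.], b: 0.0490.

Gradient descent modified the initial values of $\mathbf{w}$ and $b$ to (1.0, 0.049), giving the model $y = 1 \cdot x_0^2 + 0.049$.
The weight already matches the target $y = 1 \cdot x_0^2 + 1$, while $b$ is still converging toward 1. If you ran more iterations, the model could match the target more closely.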
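One way to see where gradient descent is heading is to solve the same least-squares problem in closed form. This is a cross-check that swaps in `np.linalg.lstsq` in place of the lab's gradient descent routine; the column of ones is an assumption added here so that the solver can fit $b$.

```python
import numpy as np

x = np.arange(0, 20, 1)
y = 1 + x**2

# Design matrix: the engineered feature x**2 plus a column of ones for the bias b
A = np.c_[x**2, np.ones_like(x)]

# Closed-form least-squares solution (not gradient descent)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef

print(f"w = {w:.4f}, b = {b:.4f}")
```

Because the data were generated from $y = 1 + x^2$ without noise, the least-squares solution recovers the generating parameters, which is the point the iterative run approaches.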

Selecting Features

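Above, we knew that an $x^2$ term was required. In practice it is not always obvious which features are needed. One could add a variety of potential features and try to find the most useful. For example, what if we had instead tried the model $y = w_0x + w_1x^2 + w_2x^3 + b$?
Run the next cell to see the result.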

# create target data
x = np.arange(0, 20, 1)
y = x**2

# engineer features .
X = np.c_[x, x**2, x**3]   #<-- added engineered feature

model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha=1e-7)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("x, x**2, x**3 features")
plt.plot(x, X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

Iteration         0, Cost: 1.14029e+03
Iteration      1000, Cost: 3.28539e+02
Iteration      2000, Cost: 2.80443e+02
Iteration      3000, Cost: 2.39389e+02
Iteration      4000, Cost: 2.04344e+02
Iteration      5000, Cost: 1.74430e+02
Iteration      6000, Cost: 1.48896e+02
Iteration      7000, Cost: 1.27100e+02
Iteration      8000, Cost: 1.08495e+02
Iteration      9000, Cost: 9.26132e+01
w,b found by gradient descent: w: [0.08 0.54 0.03], b: 0.0106


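Note that the value of $\mathbf{w}$ is [0.08, 0.54, 0.03] and $b$ is 0.0106. This implies that the model after fitting/training is:

$$0.08x + 0.54x^2 + 0.03x^3 + 0.0106$$

Gradient descent has emphasized the feature that best fits the $x^2$ data by increasing the $w_1$ term (the weight of the $x^2$ feature) relative to the others. If you were to run it for a very long time, it would continue to reduce the impact of the other terms.

Gradient descent is picking the "correct" features for us by emphasizing their associated parameters.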

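Let's review this idea:

For weights to be compared directly, the features should first be rescaled so that they are comparable to each other (see the Scaling Features section below).
A smaller weight value implies a less important feature; in the extreme, when the weight approaches zero, the feature is not useful in fitting the model to the data.
Above, after fitting, the weight associated with the $x^2$ feature is much larger than the weights for $x$ or $x^3$, as it is the most useful in fitting the data. Reading the fitted weights in this way is how we judge each feature's contribution.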

An Alternate View

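Above, polynomial features were chosen based on how well they matched the target data. Another way to think about this is to note that we are still using linear regression once we have created the new features. Given that, the best features will be those that are linear relative to the target.

This is best understood with an example.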

# create target data
x = np.arange(0, 20, 1)
y = x**2

# engineer features .
X = np.c_[x, x**2, x**3]   #<-- added engineered feature
X_features = ['x','x^2','x^3']

fig,ax=plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X[:,i],y)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("y")
plt.show()


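Above, it is clear that the $x^2$ feature plotted against the target value $y$ is linear. Linear regression can therefore easily generate a model using that feature.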
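The visual impression can be accompanied by a number. As a rough sketch, the Pearson correlation coefficient from `np.corrcoef` measures how linear the relationship between each feature and the target is. Using correlation here is an assumption added for illustration, not a step from the lab.

```python
import numpy as np

x = np.arange(0, 20, 1)
y = x**2

X = np.c_[x, x**2, x**3]
X_features = ['x', 'x^2', 'x^3']

# Pearson correlation between each feature column and the target
corr = [np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])]

for name, r in zip(X_features, corr):
    print(f"{name}: r = {r:.4f}")
```

Only the $x^2$ column reaches a coefficient of 1. On this range $x$ and $x^3$ also correlate strongly with $y$ because all three features increase together, so the number is best read alongside the plots and not as a replacement for them.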

Scaling Features

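As described in the previous lab, if the data set has features with significantly different scales, feature scaling should be applied to speed up gradient descent. In the example above, the features $x$, $x^2$ and $x^3$ naturally have very different scales. Let's apply Z-score normalization to this example.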

# create target data
x = np.arange(0, 20, 1)   # integers 0 through 19
X = np.c_[x, x**2, x**3]  # feature matrix with columns x, x^2 and x^3
print(f"Peak to Peak range by column in Raw        X:{np.ptp(X, axis=0)}")

# add Z-score normalization
X = zscore_normalize_features(X)
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X, axis=0)}")

Peak to Peak range by column in Raw        X:[  19  361 6859]
Peak to Peak range by column in Normalized X:[3.3  3.18 3.28]
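The implementation of `zscore_normalize_features` is not shown in this lab. A minimal sketch of column-wise Z-score normalization, assuming the population standard deviation that `np.std` computes by default, could look like the following. The name `zscore_normalize` is hypothetical.

```python
import numpy as np

def zscore_normalize(X):
    """Column-wise Z-score: subtract each column's mean, divide by its standard deviation."""
    mu = np.mean(X, axis=0)     # mean of each feature (column)
    sigma = np.std(X, axis=0)   # population standard deviation of each feature
    return (X - mu) / sigma

x = np.arange(0, 20, 1)
X = np.c_[x, x**2, x**3]
X_norm = zscore_normalize(X)

print(f"Peak to Peak range by column in Normalized X:{np.ptp(X_norm, axis=0)}")
```

After this transformation every column has zero mean and unit standard deviation, so the three features span similar ranges even though their raw ranges differ by orders of magnitude.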

Now we can try again with a more aggressive learning rate ($\alpha$):

x = np.arange(0,20,1)
y = x**2

X = np.c_[x, x**2, x**3]
X = zscore_normalize_features(X) 

model_w, model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha=1e-1)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("Normalized x x**2, x**3 feature")
plt.plot(x,X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

Iteration         0, Cost: 9.42147e+03
Iteration      1000, Cost: 4.21521e+00
Iteration      2000, Cost: 3.23649e+00
Iteration      3000, Cost: 2.48501e+00
Iteration      4000, Cost: 1.90802e+00
Iteration      5000, Cost: 1.46500e+00
Iteration      6000, Cost: 1.12484e+00
Iteration      7000, Cost: 8.63665e-01
Iteration      8000, Cost: 6.63131e-01
Iteration      9000, Cost: 5.09160e-01
w,b found by gradient descent: w: [ 7.67 93.95 12.29], b: 123.5000


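Feature scaling allows this to converge much faster.
Note again the values of $\mathbf{w}$. The $w_1$ term, which is the weight of the $x^2$ feature, is the most emphasized (93.95, compared with 7.67 for $x$ and 12.29 for $x^3$). Because the target is exactly $x^2$, further iterations should keep shrinking the other two weights toward zero.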
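The effect of scaling on convergence can be checked with a small experiment. The sketch below uses a plain batch gradient descent written for this illustration; `gradient_descent` is a hypothetical stand-in, since the lab's `run_gradient_descent_feng` is not shown here. Each run uses the learning rate from the corresponding cell above and the same number of iterations.

```python
import numpy as np

def gradient_descent(X, y, alpha, iterations):
    """Plain batch gradient descent for linear regression; returns w, b and the final cost."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iterations):
        err = X @ w + b - y            # prediction error for every example
        w -= alpha * (X.T @ err) / m   # step along the gradient with respect to w
        b -= alpha * err.sum() / m     # step along the gradient with respect to b
    cost = np.sum((X @ w + b - y) ** 2) / (2 * m)
    return w, b, cost

x = np.arange(0, 20, 1)
y = x**2

X_raw = np.c_[x, x**2, x**3].astype(float)
X_norm = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)  # Z-score normalization

# Same iteration budget for both runs
_, _, cost_raw = gradient_descent(X_raw, y, alpha=1e-7, iterations=1000)
_, _, cost_norm = gradient_descent(X_norm, y, alpha=1e-1, iterations=1000)

print(f"cost without scaling: {cost_raw:.3e}")
print(f"cost with scaling:    {cost_norm:.3e}")
```

With unscaled features the large $x^3$ column forces a very small learning rate, which slows progress on the other parameters. After normalization a much larger learning rate remains stable, so the cost falls further within the same number of iterations.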

Complex Functions

With feature engineering, even quite complex functions can be modeled:

x = np.arange(0,20,1)
y = np.cos(x/2)

X = np.c_[x, x**2, x**3,x**4, x**5, x**6, x**7, x**8, x**9, x**10, x**11, x**12, x**13]
X = zscore_normalize_features(X) 

model_w,model_b = run_gradient_descent_feng(X, y, iterations=1000000, alpha = 1e-1)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("Normalized x, x**2, ..., x**13 features")
plt.plot(x,X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

Iteration         0, Cost: 2.20188e-01
Iteration    100000, Cost: 1.70074e-02
Iteration    200000, Cost: 1.27603e-02
Iteration    300000, Cost: 9.73032e-03
Iteration    400000, Cost: 7.56440e-03
Iteration    500000, Cost: 6.01412e-03
Iteration    600000, Cost: 4.90251e-03
Iteration    700000, Cost: 4.10351e-03
Iteration    800000, Cost: 3.52730e-03
Iteration    900000, Cost: 3.10989e-03
w,b found by gradient descent: w: [ -1.34 -10.    24.78   5.96 -12.49 -16.26  -9.51   0.59   8.7   11.94
   9.27   0.79 -12.82], b: -0.0073
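Writing out thirteen powers by hand is error-prone. As a sketch, the same feature matrix can be built by a small helper; `polynomial_features` is a hypothetical name introduced here, not a lab utility. The input is converted to float because high powers of integers can overflow on platforms where NumPy's default integer is 32 bits.

```python
import numpy as np

def polynomial_features(x, degree):
    """Return a matrix whose columns are x**1, x**2, ..., x**degree."""
    x = np.asarray(x, dtype=float)   # float avoids integer overflow for high powers
    return np.column_stack([x**k for k in range(1, degree + 1)])

x = np.arange(0, 20, 1)
X = polynomial_features(x, 13)

print(X.shape)   # (20, 13)
```

The resulting matrix can then be normalized and passed to gradient descent in the same way as the hand-written `np.c_[...]` expression.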
