Double Machine Learning(DML) 原理及其应用

1. 为什么需要DML？
2. DML原理
- 2.1 符号定义
- 2.2 DML训练过程
- 2.3 为什么残差正交化可得到无偏差因果效应?
- 2.4 使用DML估计ATE
- 2.5 使用DML估计CATE
- 2.6 直接预测反事实的Y
3. Econml DML应用实战

1. 为什么需要DML？

用来做因果推断
- 优势
  - 减少函数形式的假设
  - 可以对高维数据进行建模
  - 自带正则化可以达到变量选择的目的
  - 劣势
    
    只关注预测效果
    
    对Treatment effect的估计可能是有偏的，需要权衡偏差和方差
    
    的收敛速度一般小于
    
    不能给出置信区间
- 消除偏差
- 收敛速度
- 可以构建置信区间

2. DML原理

2.1 符号定义

Y是实验影响的核心指标
T是treatment，通常是0/1变量，代表样本进入实验组还是对照组，对随机AB实验T⊥X
X是Confounder，可以简单理解为未被实验干预过的用户特征，通常是高维向量

最直接的方法就是用X和T一起对Y建模，直接估计。但这样估计出的往往是有偏的，偏差部分来自于对样本的过拟合，部分来自于估计的偏差

2.2 DML训练过程

利用任意ML模型拟合Y和T得到残差，
对，利用任何ML模型拟合

的拟合可以是参数模型也可以是非参数模型，参数模型可以直接拟合。而非参数模型因为只接受输入和输出所以需要再做如下变换，模型target变为 , 样本权重为
Cross-fitting

保证估计无偏很重要的一步就是，用来降低带来的估计偏差。先把总样本分成两份: ，。先用估计残差，估计，再用估计残差，估计，取平均得到最终的估计。当然也可以进一步使用来增加估计的稳健性。

下图比较了不使用，使用但是不用以及使用的估计效果，从图中可以看出消除偏差的效果还是挺明显的。

2.3 为什么残差正交化可得到无偏差因果效应?

直观角度上，线性回归是拟合Y在特征空间X的最佳投影(既误差最小化)，所以残差是垂直样本空间的，既最大限度消除了(独立) 的相关性！

2.4 使用DML估计ATE

具体到因果推断的例子上，我们只关心对的影响，因此我们可以首先使用回归，得到一个T的残差(实际 -预测 )，然后使用回归，得到一个的残差(实际 -预测 )，最后使用的残差回归的残差，估计的参数即我们想要的ATE。

具体的方法也就出来了，其核心思想即分别用机器学习算法基于预测和，然后使用的残差回归的残差：

其中建模，建模。最后，我们再对残差建模得到。

为什么说DML能去偏呢？

其关键在于，它对Treatment实施了去偏。的残差可以看作将对的作用从中去除后剩下的量，此时的残差独立于。而的作用在于去除的方差，即将引起的Y的方差从中去除。

2.5 使用DML估计CATE

同样地，我们首先基于使用获得的残差和的残差，之后使用逻辑回归拟合残差，不同的是，这次我们把和的交互项加进来，即

然后我们可以计算CATE的值了，

其中表示最后的逻辑回归模型

2.6 直接预测反事实的Y

在处理非线性CATE时，另一种方案是我们将不再尝试估计CATE的线性近似。相反，我们将做出反事实的预测。 (这种方案在实际中使用也很多，但并没有严格的理论证明！)

其流程分为三个步骤：

第一步依然是估计和的残差，
第二步基于和的残差使用模型预测的残差
最后在预测的上加上，即得到最后的值。

3. Econml DML应用实战

Econml 官方使用示例Double Machine Learning Notebook
该案例有非常多小的案例
- Example Usage with Single Continuous Treatment Synthetic Data and Model Selection
- Example Usage with Single Binary Treatment Synthetic Data and Confidence Intervals
- Example Usage with Multiple Continuous Treatment Synthetic Data
- Example Usage with Single Continuous Treatment Observational Data
- Example Usage with Multiple Continuous Treatment, Multiple Outcome Observational Data

其中包括了：连续数值的T，二分类01的T，多个T，多个Y

# 连续数值的T
array([ 0.41083324,  0.57893171,  1.08130798, ..., -0.43799495,
        1.61941775,  1.64209826])

# 二分类01的T
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

# Multi-T
array([[1.35325451, 1.15373159, 0.46373402],
       [1.35325451, 1.15373159, 0.98954119],
       [1.35325451, 0.87129337, 0.73716407],
       ...,
       [1.13462273, 1.03673688, 0.46373402],
       [1.0260416 , 0.95165788, 0.39877612],
       [1.05779029, 0.78390154, 0.55961579]])

# Multi-Y
array([[ 9.01869549,  8.40737833,  9.26482856],
       [ 8.72323127,  8.44934252,  8.98719682],
       [ 8.25322765,  9.91145572,  8.83171192],
       ...,
       [ 9.93653541,  9.51546936,  9.50599061],
       [10.85591733,  9.31793838, 10.9273763 ],
       [10.23652533, 11.20553036,  8.85936345]])

方法包括：
- LinearDML：默认
- SparseLinearDML：Polynomial Features for Heterogeneity
- DML：Polynomial Features with regularization
- CausalForestDML： Non-Parametric Heterogeneity with Causal Forest

数据生成

import econml
## Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Main imports
from econml.dml import DML, LinearDML, SparseLinearDML, CausalForestDML

# Helper imports
import numpy as np
from itertools import product
from sklearn.linear_model import (Lasso, LassoCV, LogisticRegression,
                                  LogisticRegressionCV,LinearRegression,
                                  MultiTaskElasticNet,MultiTaskElasticNetCV)
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split

%matplotlib inline
# Treatment effect function
def exp_te(x):
    return np.exp(2*x[0])

# DGP constants
np.random.seed(123)
n = 2000
n_w = 30
support_size = 5
n_x = 1
# Outcome support
support_Y = np.random.choice(np.arange(n_w), size=support_size, replace=False)
coefs_Y = np.random.uniform(0, 1, size=support_size)
epsilon_sample = lambda n: np.random.uniform(-1, 1, size=n)
# Treatment support
support_T = support_Y
coefs_T = np.random.uniform(0, 1, size=support_size)
eta_sample = lambda n: np.random.uniform(-1, 1, size=n)

# Generate controls, covariates, treatments and outcomes
W = np.random.normal(0, 1, size=(n, n_w))
X = np.random.uniform(0, 1, size=(n, n_x))
# Heterogeneous treatment effects
TE = np.array([exp_te(x_i) for x_i in X])
T = np.dot(W[:, support_T], coefs_T) + eta_sample(n)
Y = TE * T + np.dot(W[:, support_Y], coefs_Y) + epsilon_sample(n)

Y_train, Y_val, T_train, T_val, X_train, X_val, W_train, W_val = train_test_split(Y, T, X, W, test_size=.2)
# Generate test data
X_test = np.array(list(product(np.arange(0, 1, 0.01), repeat=n_x)))

模型

# 模型
est = LinearDML(model_y=RandomForestRegressor(),
                model_t=RandomForestRegressor(),
                random_state=123)
est.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred = est.effect(X_test)
# 模型
est1 = SparseLinearDML(model_y=RandomForestRegressor(),
                       model_t=RandomForestRegressor(),
                       featurizer=PolynomialFeatures(degree=3),
                       random_state=123)
est1.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred1 = est1.effect(X_test)

# 模型
est2 = DML(model_y=RandomForestRegressor(),
           model_t=RandomForestRegressor(),
           model_final=Lasso(alpha=0.1, fit_intercept=False),
           featurizer=PolynomialFeatures(degree=10),
           random_state=123)
est2.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred2 = est2.effect(X_test)
# 模型
est3 = CausalForestDML(model_y=RandomForestRegressor(),
                       model_t=RandomForestRegressor(),
                       criterion='mse', n_estimators=1000,
                       min_impurity_decrease=0.001,
                       random_state=123)
est3.tune(Y_train, T_train, X=X_train, W=W_train)
est3.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred3 = est3.effect(X_test)

# 画图
plt.figure(figsize=(10,6))
plt.plot(X_test, te_pred, label='DML default')
plt.plot(X_test, te_pred1, label='DML polynomial degree=3')
plt.plot(X_test, te_pred2, label='DML polynomial degree=10 with Lasso')
plt.plot(X_test, te_pred3, label='ForestDML')
expected_te = np.array([exp_te(x_i) for x_i in X_test])
plt.plot(X_test, expected_te, 'b--', label='True effect')
plt.ylabel('Treatment Effect')
plt.xlabel('x')
plt.legend()
plt.show()

模型选择

score={}
score["DML default"] = est.score(Y_val, T_val, X_val, W_val)
score["DML polynomial degree=3"] = est1.score(Y_val, T_val, X_val, W_val)
score["DML polynomial degree=10 with Lasso"] = est2.score(Y_val, T_val, X_val, W_val)
score["ForestDML"] = est3.score(Y_val, T_val, X_val, W_val)

mse_te={}
mse_te["DML default"] = ((expected_te - te_pred)**2).mean()
mse_te["DML polynomial degree=3"] = ((expected_te - te_pred1)**2).mean()
mse_te["DML polynomial degree=10 with Lasso"] = ((expected_te - te_pred2)**2).mean()
mse_te["ForestDML"] = ((expected_te - te_pred3)**2).mean()

print("best model selected by MSE of TE: ", min(mse_te, key=lambda x: mse_te.get(x)))

score指的是MSE，越小越好