Paper notes: generalized random forests
2022-06-25 16:36:00 【#Super Pig】
These notes skip the asymptotic theory for now and focus on GRF's prediction and splitting methods.
References:
- Zhihu blog: https://zhuanlan.zhihu.com/p/448524822
- S. Athey, J. Tibshirani, and S. Wager, "Generalized random forests," Ann. Statist., vol. 47, no. 2, Apr. 2019, doi: 10.1214/18-AOS1709.
Motivation
The paper proposes a general forest-based estimation framework that generalizes the random forest; this is the work's main contribution. Specifically, the general estimation target is defined by:
$$\mathbb{E}[\Psi_{\theta(x),\nu(x)}(O_i)\mid X_i=x]=0 \tag{1}$$
Here, $\Psi(\cdot)$ is the scoring function, which can be understood as a loss function or optimization objective; $\theta(x)$ is the quantity we want to estimate; $\nu(x)$ is an optional nuisance parameter; and $O_i$ is the observation relevant to $\theta(x)$. The goal is to build a forest such that Eq. (1) holds.
To achieve this goal (Eq. (1)), we solve the following optimization problem:
$$(\hat\theta(x),\hat\nu(x))=\arg\min_{\theta,\nu}\left\|\sum_{i=1}^{n}\alpha_i(x)\cdot\Psi_{\theta,\nu}(O_i)\right\|_2 \tag{2}$$
The minimizer $\hat\theta(x)$ of this problem is our estimate. As for $\alpha_i(x)$, it measures the similarity between training sample $i$ and the test point $x$ and acts as a weight; it is computed as follows:
$$\alpha_i(x)=\frac{1}{B}\cdot\sum_{b=1}^B\alpha_{bi}(x) \tag{3}$$

$$\alpha_{bi}(x)=\frac{\mathbb{1}(\{X_i\in L_b(x)\})}{|L_b(x)|} \tag{4}$$
Here, $B$ is the number of trees, $b$ indexes the $b$-th tree, and $L_b(x)$ is the set of training samples that fall into the same leaf as the test point $x$ in the $b$-th tree. Therefore $\alpha_{bi}(x)$ records (normalized by leaf size) whether sample $i$ shares a leaf with $x$ in tree $b$, and $\alpha_i(x)$ is the frequency with which sample $i$ and $x$ fall into the same leaf across the forest (this frequency reflects similarity). Note that $\sum_{i=1}^n\alpha_i(x)=1$!
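As a concrete illustration of Eqs. (3)-(4), the weights can be computed from any fitted tree ensemble by comparing leaf memberships. The sketch below is my own (not the paper's implementation) and uses a plain scikit-learn forest rather than GRF's honest, subsample-grown trees; all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_weights(forest, X_train, x_test):
    """alpha_i(x) of Eqs. (3)-(4): how often training sample i shares a leaf
    with the test point x, normalized by leaf size and averaged over trees."""
    train_leaves = forest.apply(X_train)                # (n_train, B) leaf ids
    test_leaves = forest.apply(x_test.reshape(1, -1))   # (1, B) leaf ids

    n_train, B = train_leaves.shape
    alpha = np.zeros(n_train)
    for b in range(B):
        same_leaf = train_leaves[:, b] == test_leaves[0, b]  # 1({X_i in L_b(x)})
        alpha += same_leaf / same_leaf.sum()                 # alpha_{bi}(x), Eq. (4)
    return alpha / B                                         # Eq. (3); sums to 1

# toy usage
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=200)
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5).fit(X_train, y_train)
alpha = forest_weights(rf, X_train, X_train[0])
assert np.isclose(alpha.sum(), 1.0)
```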
To sum up, Eq. (1) and Eq. (2) are essentially equivalent. Once the forest is built, predictions are obtained from Eq. (1) or (2); these two formulas are the core of GRF. Many statistical problems (e.g., least squares, maximum likelihood, quantile regression) can be viewed as special cases of Eq. (1).
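For instance, for quantile regression the scoring function is $\Psi_\theta(O_i)=q-\mathbb{1}(Y_i\le\theta)$, and solving the weighted estimating equation (2) reduces to a weighted quantile. A small sketch of this special case (my own illustration; it reuses the `forest_weights` helper and toy data from the previous sketch):

```python
import numpy as np

def weighted_quantile(y, alpha, q):
    """Solve sum_i alpha_i * (q - 1{y_i <= theta}) = 0 for theta,
    i.e. the alpha-weighted q-quantile of y (a special case of Eq. (2))."""
    order = np.argsort(y)
    y_sorted, w_sorted = y[order], alpha[order]
    cdf = np.cumsum(w_sorted)        # weighted CDF evaluated at the sorted y values
    idx = np.searchsorted(cdf, q)    # first index where the CDF reaches q
    return y_sorted[min(idx, len(y) - 1)]

# usage with the weights from the previous sketch:
# theta_hat = weighted_quantile(y_train, alpha, q=0.5)   # weighted median at x
```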
Case of Regression
Taking the regression problem as an example, we show that the random forest is a special case of GRF:
For regression, the quantity we care about is $\mu(x)=\mathbb{E}[Y_i\mid X_i=x]$ (here $\mu(x)$ plays the role of $\theta(x)$), and the corresponding scoring function is $\Psi_{\mu(x)}(O_i)=Y_i-\mu(x)$.
Meanwhile, we know that once a random forest is built, the prediction for a test point $x$ is the average of the $Y$ values of the training samples in the leaves containing $x$, formally:
$$\hat\mu(x)=\frac{1}{B}\cdot\sum_{b=1}^B\hat\mu_b(x),\qquad \hat\mu_b(x)=\frac{\sum_{\{i:X_i\in L_b(x)\}}Y_i}{|L_b(x)|} \tag{5}$$
Now we only need to show that, when the scoring function is $\Psi_{\mu(x)}(O_i)=Y_i-\mu(x)$, Eq. (5) holds if and only if Eq. (1) holds. The proof is as follows:
With this scoring function, Eq. (1) (via its weighted empirical counterpart, Eq. (2)) is equivalent to Eq. (6):
$$\sum_{i=1}^n\alpha_i(x)\cdot(Y_i-\hat\mu(x))=0 \tag{6}$$
Since $\sum_{i=1}^n\alpha_i(x)=1$, Eq. (6) can be rearranged into Eq. (7):
$$\begin{aligned} \hat\mu(x) &=\sum_{i=1}^n\alpha_i(x)\cdot Y_i \\ &=\sum_{i=1}^n \frac{1}{B}\cdot\sum_{b=1}^B\alpha_{bi}(x)\cdot Y_i \\ &=\frac{1}{B}\cdot\sum_{b=1}^B\sum_{i=1}^n\frac{\mathbb{1}(\{X_i\in L_b(x)\})}{|L_b(x)|}\cdot Y_i \\ &=\frac{1}{B}\cdot\sum_{b=1}^B \hat\mu_b(x) \end{aligned} \tag{7}$$
Thus, when Eq. (1) holds, Eq. (6) follows and the GRF estimate coincides with the usual random forest prediction in Eq. (5); therefore the random forest is a special case of GRF.
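The chain of equalities in Eq. (7) can also be checked numerically. A quick sketch (reusing the toy forest and the `forest_weights` helper from above): the $\alpha$-weighted average of the labels equals the average of the per-tree leaf means of Eq. (5).

```python
import numpy as np

x = X_train[0]
alpha = forest_weights(rf, X_train, x)

# left-hand side of Eq. (7): alpha-weighted average of the labels
mu_weighted = np.dot(alpha, y_train)

# right-hand side: average over trees of the leaf means, Eq. (5)
train_leaves = rf.apply(X_train)
test_leaves = rf.apply(x.reshape(1, -1))[0]
mu_per_tree = [
    y_train[train_leaves[:, b] == test_leaves[b]].mean()
    for b in range(rf.n_estimators)
]
assert np.isclose(mu_weighted, np.mean(mu_per_tree))
```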
Split criterion
The original idea is to minimize the error between the child-node estimates and the true value, i.e., to minimize $err(C_1,C_2)$:
$$err(C_1,C_2)=\sum_{j=1}^2\mathbb{P}[X\in C_j\mid X\in P]\cdot\mathbb{E}\big[(\hat\theta_{C_j}(\mathcal{J})-\theta(x))^2\mid X\in C_j\big] \tag{8}$$
However, since the true value $\theta(x)$ is unknown, after some derivation the problem of minimizing $err(C_1,C_2)$ is converted into maximizing $\Delta(C_1,C_2)$:
$$\Delta(C_1,C_2)=\frac{n_{C_1}\cdot n_{C_2}}{n_P^2}\cdot\big(\hat\theta_{C_1}(\mathcal{J})-\hat\theta_{C_2}(\mathcal{J})\big)^2 \tag{9}$$
After this transformation, we can see that maximizing Eq. (9) amounts to maximizing the heterogeneity between the two child nodes.
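As a concrete illustration (my own sketch, regression case only, where $\hat\theta_{C_j}$ is simply the child-node mean), the criterion of Eq. (9) can be scanned over candidate thresholds of a single feature as follows:

```python
import numpy as np

def delta(y_left, y_right):
    """Heterogeneity criterion of Eq. (9), regression case:
    theta_hat_C is the mean outcome within child C."""
    n1, n2 = len(y_left), len(y_right)
    return n1 * n2 / (n1 + n2) ** 2 * (y_left.mean() - y_right.mean()) ** 2

def best_split(x_feature, y):
    """Pick the threshold on one feature that maximizes Eq. (9)."""
    best_t, best_d = None, -np.inf
    for t in np.unique(x_feature)[:-1]:
        left, right = y[x_feature <= t], y[x_feature > t]
        d = delta(left, right)
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d
```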
With this, we know the node-splitting criterion. In practice, however, computing $\hat\theta_{C_j}(\mathcal{J})$ for every candidate split is expensive, so the authors propose a gradient-based approximation:
Gradient tree algorithm
First, Proposition 1 states that $\hat\theta_{C}$ admits the following approximation $\tilde\theta_{C}$:
$$\tilde\theta_{C}=\hat\theta_p-\frac{1}{|\{i:X_i\in C\}|}\cdot\sum_{\{i:X_i\in C\}}\xi^T\cdot A_p^{-1}\Psi_{\hat\theta_p,\hat\nu_p}(O_i) \tag{10}$$
Here, $\hat\theta_p$ is the estimate of $\theta$ on the parent node $P$, obtained from Eq. (1) or Eq. (2). As for $\xi$, the paper describes it as the vector that selects the $\theta$-coordinates out of the $(\theta,\nu)$ vector, although other papers I have read often omit it. $A_p$ is the expected gradient of $\Psi_{\hat\theta_p,\hat\nu_p}(O_i)$, computed as:
$$A_p=\nabla\mathbb{E}\big[\Psi_{\hat\theta_p,\hat\nu_p}(O_i)\mid X_i\in P\big]=\frac{1}{|\{i:X_i\in P\}|}\cdot\sum_{\{i:X_i\in P\}}\nabla\Psi_{\hat\theta_p,\hat\nu_p}(O_i) \tag{11}$$
But I am not entirely sure with respect to which variable this gradient is taken.
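My reading (an assumption on my part) is that $\nabla$ denotes the gradient of $\Psi$ with respect to the parameters $(\theta,\nu)$, evaluated at $(\hat\theta_p,\hat\nu_p)$. A quick sanity check in the regression case: with $\Psi_{\mu}(O_i)=Y_i-\mu$ and no nuisance parameter, $\nabla\Psi=-1$, hence $A_p=-1$ and $\rho_i=-\xi^T A_p^{-1}\Psi_{\hat\mu_p}(O_i)=Y_i-\hat\mu_p$. Plugging this into Eq. (10) gives

$$\tilde\mu_C=\hat\mu_p+\frac{1}{|\{i:X_i\in C\}|}\sum_{\{i:X_i\in C\}}(Y_i-\hat\mu_p)=\frac{1}{|\{i:X_i\in C\}|}\sum_{\{i:X_i\in C\}}Y_i,$$

i.e., exactly the child-node mean, as one would hope.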
Given the approximation $\tilde\theta_{C}$ of $\hat\theta_{C}$, one can show that $\Delta(C_1,C_2)$ also has a corresponding approximation $\tilde\Delta(C_1,C_2)$ (the derivation of this step is omitted here for now):
$$\tilde\Delta(C_1,C_2)=\sum_{j=1}^2\frac{1}{|\{i:X_i\in C_j\}|}\cdot\Big(\sum_{\{i:X_i\in C_j\}}\rho_i\Big)^2 \tag{12}$$
where $\rho_i=-\xi^T\cdot A_p^{-1}\cdot\Psi_{\hat\theta_p,\hat\nu_p}(O_i)$ represents the influence of the $i$-th sample on the estimate $\hat\theta_p$.
With this, node splitting can be summarized in the following two steps (see the sketch after the list):
1. Labeling step
In this step, first compute $\hat\theta_p$ and $A_p$, then compute $\rho_i$. Note that at each split, each $\rho_i$ only needs to be computed once, because the parent node is already fixed ($\rho_i$ does not depend on the candidate children).
2. Regression step
Search for the pair of child nodes that maximizes $\tilde\Delta(C_1,C_2)$. This step can be implemented as a standard CART regression split on the pseudo-outcomes $\rho_i$.
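A minimal sketch of the two steps (my own illustration; `rho_fn` encapsulates the labeling step for a given scoring function, and the regression step maximizes Eq. (12) over the thresholds of one feature):

```python
import numpy as np

def grf_split(x_feature, y, rho_fn):
    """Gradient-based GRF split on a single feature."""
    # labeling step: pseudo-outcomes rho_i, computed once for the parent node
    rho = rho_fn(y)

    # regression step: scan thresholds and maximize tilde-Delta of Eq. (12)
    best_t, best_d = None, -np.inf
    for t in np.unique(x_feature)[:-1]:
        left, right = rho[x_feature <= t], rho[x_feature > t]
        d = left.sum() ** 2 / len(left) + right.sum() ** 2 / len(right)
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d

# regression case (see the sanity check above): A_p = -1, rho_i = Y_i - mu_hat_p
regression_rho = lambda y: y - y.mean()
# threshold, crit = grf_split(X_train[:, 0], y_train, regression_rho)
```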
GRF for CATE
Next, let's see how GRF is applied to CATE estimation.
In this application, the authors base $\Psi(\cdot)$ on a partially linear model, i.e., the data satisfy the following structure:
$$Y=\theta(x)\cdot T+g(x)+\epsilon,\qquad T=f(x)+\eta \tag{13}$$
The "partial linearity" is mainly reflected in the structure of $Y$.
In the CATE estimation problem, $\theta(x)$ is the treatment effect at $x$, formalized as $\theta(x)=\mathbb{E}[Y(T=1)-Y(T=0)\mid X=x]$.
Based on the partially linear model, the authors construct the scoring function $\Psi_{\theta(x),\nu(x)}(O_i)=Y_i-\theta(x)\cdot T_i-\nu(x)$. Intuitively, this scoring function seeks a pair $(\hat\theta(x),\hat\nu(x))$ that brings $Y_i$ as close as possible to $\theta(x)\cdot T_i+\nu(x)$ (essentially a fitting problem).
Under this setting, the relevant quantities work out to:
$$\hat\theta(x)=\xi^T\cdot\frac{\mathrm{Cov}(T_i,Y_i\mid X_i=x)}{\mathrm{Var}(T_i\mid X_i=x)} \tag{14}$$
$$A_p=\frac{1}{|\{i:X_i\in P\}|}\cdot\sum_{\{i:X_i\in P\}}(T_i-\bar T_p)^{\otimes 2} \tag{15}$$
$$\rho_i=\xi^T\cdot A_p^{-1}\cdot\big(Y_i-\bar Y_p-(T_i-\bar T_p)\cdot\hat\theta_p\big) \tag{16}$$
Of these derivations, so far I only understand where Eq. (14) comes from:
Consider $Y=\theta(x)\cdot T+g(x)$; the optimal $\theta(x)$ can be viewed as the slope of the one-dimensional linear equation $y=ax+b$, and this slope can be expressed through the variance and covariance [reference].
Note that the means, variances, and covariances here are weighted, with weights $\alpha_i$.
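Putting this together, a minimal sketch of Eq. (14) for a scalar treatment (the $\xi^T$ selector is then trivial); here `alpha` is the forest-weight vector from the earlier sketch, and `T`, `Y` are the training treatments and outcomes (names are illustrative):

```python
import numpy as np

def cate_estimate(alpha, T, Y):
    """theta_hat(x) of Eq. (14) using alpha-weighted moments."""
    T_bar = np.dot(alpha, T)                              # weighted mean of T
    Y_bar = np.dot(alpha, Y)                              # weighted mean of Y
    cov_ty = np.dot(alpha, (T - T_bar) * (Y - Y_bar))     # weighted Cov(T, Y | X = x)
    var_t = np.dot(alpha, (T - T_bar) ** 2)               # weighted Var(T | X = x)
    return cov_ty / var_t
```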
CausalForestDML
As the name suggests, CausalForestDML combines Causal Forest and DML. The core idea of DML for CATE estimation is based on the following equation:
$$Y-\mathbb{E}[Y\mid X]=\theta(x)\cdot(T-\mathbb{E}[T\mid X]),\quad\text{i.e.}\quad \tilde Y=\theta(x)\cdot\tilde T \tag{17}$$
That is, the CATE estimation problem is converted into regressing the residual of $Y$ on the residual of $T$, and the regression coefficient is the CATE (the residualization is in fact an orthogonalization step).
Based on the DML idea, CausalForestDML uses the following scoring function: $\Psi_{\theta(x),\nu(x)}(O_i)=Y_i-\mathbb{E}[Y_i\mid X_i]-\theta(x)\cdot(T_i-\mathbb{E}[T_i\mid X_i])-\nu(x)$.
The corresponding optimal $\theta(x)$ becomes $\hat\theta(x)=\xi^T\cdot\frac{\mathrm{Cov}(Y_i-\mathbb{E}[Y_i\mid X_i],\,T_i-\mathbb{E}[T_i\mid X_i]\mid X_i=x)}{\mathrm{Var}(T_i-\mathbb{E}[T_i\mid X_i]\mid X_i=x)}$.
Note: the original paper calls this method centered GRF! Table 1 there shows that GRF with centering performs better than GRF without centering.
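A minimal sketch of the orthogonalization (centering) step with cross-fitting, using scikit-learn (my own illustration, not econml's implementation); the residuals would then be fed into the GRF machinery above, e.g. Eq. (14) applied to $\tilde Y$ and $\tilde T$:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def residualize(X, T, Y, n_splits=5):
    """Cross-fitted residuals Y - E[Y|X] and T - E[T|X] (DML orthogonalization)."""
    y_model = RandomForestRegressor(n_estimators=200, min_samples_leaf=5)
    t_model = RandomForestRegressor(n_estimators=200, min_samples_leaf=5)
    # out-of-fold predictions: each residual uses a model that never saw that row
    Y_hat = cross_val_predict(y_model, X, Y, cv=n_splits)
    T_hat = cross_val_predict(t_model, X, T, cv=n_splits)
    return Y - Y_hat, T - T_hat

# Y_tilde, T_tilde = residualize(X_train, T_train, Y_train)
# theta_hat_x = cate_estimate(alpha, T_tilde, Y_tilde)   # Eq. (14) on the residuals
```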
GRF vs Causal Forest
Both papers are from Susan Athey's group. Compared with GRF, the biggest differences of Causal Forest are:
- Causal Forest splits using the exact loss criterion (i.e., Eq. (9)) rather than the gradient-based loss criterion (i.e., Eq. (12));
- When computing the treatment effect, Causal Forest does not use the weighted average of Eq. (2).