Paper notes: generalized random forests
2022-06-25 16:36:00 【#Super Pig】
These notes skip the asymptotic theory for now and focus on GRF's prediction and splitting methods.
ref:
- Zhihu blog: https://zhuanlan.zhihu.com/p/448524822
- S. Athey, J. Tibshirani, and S. Wager, “Generalized random forests,” Ann. Statist., vol. 47, no. 2, Apr. 2019, doi: 10.1214/18-AOS1709.
Motivation
This paper proposes a general forest-based estimation framework that generalizes random forests; this is its main contribution. Specifically, the estimation target is defined by:
$$\mathbb{E}[\Psi_{\theta(x),\nu(x)}(O_i)\mid X_i=x]=0 \tag{1}$$
where $\Psi(\cdot)$ is the scoring function, which can be understood as a loss function or optimization objective; $\theta(x)$ is the quantity we want to estimate; $\nu(x)$ is an optional nuisance parameter; and $O_i$ is the observation associated with $\theta(x)$. The goal is to build a forest such that Eq. (1) holds.
To achieve this (Eq. (1)), we solve the following optimization problem:
$$(\hat\theta(x),\hat\nu(x))=\arg\min_{\theta,\nu}\left\|\sum_{i=1}^{n}\alpha_i(x)\cdot\Psi_{\theta,\nu}(O_i)\right\|_2 \tag{2}$$
The optimizer $\hat\theta(x)$ of this problem is our estimate.
As for $\alpha_i(x)$, it measures the similarity between training sample $i$ and the test point $x$ and acts as a weight; it is computed as follows:
$$\alpha_i(x)=\frac{1}{B}\sum_{b=1}^B\alpha_{bi}(x) \tag{3}$$
$$\alpha_{bi}(x)=\frac{\mathbf{1}(\{X_i\in L_b(x)\})}{|L_b(x)|} \tag{4}$$
where $B$ is the number of trees, $b$ indexes the $b$-th tree, and $L_b(x)$ is the set of training samples that fall in the same leaf as the test point $x$ in the $b$-th tree. Hence $\alpha_{bi}(x)$ measures how often the $i$-th sample shares a leaf with $x$ in the $b$-th tree (this frequency reflects similarity). Note that $\sum_{i=1}^n\alpha_i(x)=1$.
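To make Eqs. (3)–(4) concrete, here is a minimal sketch (my own illustration, not the paper's or the grf package's implementation) that computes the weights from a fitted scikit-learn forest; `forest.apply` returns, for each sample, the index of the leaf it lands in within each tree:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_weights(forest, X_train, x_test):
    """Compute alpha_i(x) of Eq. (3)-(4): the average, over trees, of the
    indicator that training sample i shares a leaf with x, normalized by
    that leaf's size."""
    train_leaves = forest.apply(X_train)                   # shape (n_train, B)
    test_leaves = forest.apply(x_test.reshape(1, -1))[0]   # shape (B,)
    n, B = train_leaves.shape
    alpha = np.zeros(n)
    for b in range(B):
        in_leaf = train_leaves[:, b] == test_leaves[b]     # {X_i in L_b(x)}
        alpha += in_leaf / in_leaf.sum()                   # alpha_{bi}(x), Eq. (4)
    return alpha / B                                       # Eq. (3)

# toy usage
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 3)), rng.normal(size=200)
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5).fit(X, y)
alpha = forest_weights(rf, X, X[0])
assert np.isclose(alpha.sum(), 1.0)   # the weights sum to 1
```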
To sum up, Eq. (1) and Eq. (2) are essentially equivalent; once the forest is built, predictions are obtained from Eq. (1) or (2). These two equations are the heart of GRF, and many statistical problems (e.g., least squares, maximum likelihood, quantile regression) can be viewed as special cases of Eq. (1).
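For example, quantile regression fits this framework with a scoring function whose conditional root is the $q$-quantile, e.g. $\Psi_\theta(O_i)=q-\mathbf{1}\{Y_i\le\theta\}$ (equivalent in expectation to the form used in the paper); solving the weighted estimating equation of Eq. (2) then amounts to taking an $\alpha$-weighted quantile. A hedged sketch:

```python
import numpy as np

def weighted_quantile(y, alpha, q):
    """Approximately solve sum_i alpha_i * (q - 1{y_i <= theta}) = 0:
    theta is the alpha-weighted q-quantile of y (quantile-regression
    special case of Eq. (1)/(2))."""
    order = np.argsort(y)
    cum = np.cumsum(alpha[order])
    idx = np.searchsorted(cum, q * alpha.sum())
    return y[order][min(idx, len(y) - 1)]

# toy check: with uniform weights this is close to the empirical quantile
rng = np.random.default_rng(1)
y = rng.normal(size=1000)
alpha = np.full(1000, 1 / 1000)
print(weighted_quantile(y, alpha, 0.5), np.quantile(y, 0.5))
```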
Case of Regression
Take the regression problem as an example and show that the random forest is a special case of GRF:
For regression, the estimand we care about is $\mu(x)=\mathbb{E}[Y_i\mid X_i=x]$ (here $\mu(x)$ plays the role of $\theta(x)$), and the corresponding scoring function is $\Psi_{\mu(x)}(O_i)=Y_i-\mu(x)$.
Meanwhile, we know that after a random forest is built, the prediction for a test point $x$ is the mean of $Y$ over the training samples in the leaf containing $x$, averaged across trees; formally:
$$\hat\mu(x)=\frac{1}{B}\sum_{b=1}^B\hat\mu_b(x),\qquad \hat\mu_b(x)=\frac{\sum_{\{i:X_i\in L_b(x)\}}Y_i}{|L_b(x)|} \tag{5}$$
Now we only need to show that, when the scoring function is $\Psi_{\mu(x)}(O_i)=Y_i-\mu(x)$, Eq. (5) holds if and only if Eq. (1) holds. The proof is as follows:
Eq. (1) holding is equivalent to Eq. (6) holding:
$$\sum_{i=1}^n\alpha_i(x)\cdot (Y_i-\hat\mu(x))=0 \tag{6}$$
Since $\sum_{i=1}^n\alpha_i(x)=1$, Eq. (6) can be rewritten as Eq. (7):
$$\begin{aligned} \hat\mu(x) &=\sum_{i=1}^n\alpha_i(x)\cdot Y_i \\ &=\sum_{i=1}^n \frac{1}{B}\sum_{b=1}^B\alpha_{bi}(x)\, Y_i \\ &=\frac{1}{B}\sum_{b=1}^B\sum_{i=1}^n\frac{\mathbf{1}(\{X_i\in L_b(x)\})}{|L_b(x)|}\, Y_i \\ &=\frac{1}{B}\sum_{b=1}^B \hat\mu_b(x) \end{aligned} \tag{7}$$
Therefore, when Eq. (1) holds, Eq. (6) follows and the GRF estimate coincides with the usual random forest prediction (Eq. (5)), so the random forest is a special case of GRF.
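A quick numerical check of this equivalence (my own sketch; `bootstrap=False` is used so every tree sees the full training set and the leaf means of Eq. (5) are taken over exactly the samples counted by the weights of Eqs. (3)–(4)):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 4)), rng.normal(size=300)
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                           bootstrap=False, max_features=2).fit(X, y)

x = rng.normal(size=(1, 4))
train_leaves, test_leaves = rf.apply(X), rf.apply(x)[0]

# alpha_i(x) from Eq. (3)-(4)
alpha = np.zeros(len(y))
for b in range(rf.n_estimators):
    in_leaf = train_leaves[:, b] == test_leaves[b]
    alpha += in_leaf / in_leaf.sum()
alpha /= rf.n_estimators

# Eq. (7): the alpha-weighted mean of Y equals the usual RF prediction (Eq. (5))
print(alpha @ y, rf.predict(x)[0])   # the two numbers agree
```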
split criterion
The original idea is to minimize the error between the child nodes' estimates and the true value, i.e., to minimize $err(C_1,C_2)$:
$$err(C_1,C_2)=\sum_{j=1}^2\mathbb{P}[X\in C_j\mid X\in P]\cdot\mathbb{E}\bigl[(\hat\theta_{C_j}(\mathcal{J})-\theta(X))^2\mid X\in C_j\bigr] \tag{8}$$
However, since the true value $\theta(x)$ is unknown, after some derivation we turn minimizing $err(C_1,C_2)$ into maximizing $\Delta(C_1,C_2)$:
$$\Delta(C_1,C_2)=\frac{n_{c_1}\cdot n_{c_2}}{n_p^2}\cdot(\hat\theta_{C_1}(\mathcal{J})-\hat\theta_{C_2}(\mathcal{J}))^2 \tag{9}$$
After this transformation, we can see that maximizing Eq. (9) amounts to maximizing the heterogeneity between the child nodes.
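As an illustration for the regression case (where $\hat\theta_C$ is just the mean of $Y$ in $C$), the exact criterion of Eq. (9) can be scanned over candidate thresholds like this (my own sketch, not the paper's implementation):

```python
import numpy as np

def best_split_exact(x_feature, y):
    """Scan candidate thresholds and maximize Eq. (9),
    Delta = n1*n2/n_p^2 * (theta_hat_C1 - theta_hat_C2)^2,
    with theta_hat_C = mean(y in C) in the regression case."""
    n_p = len(y)
    best = (None, -np.inf)
    for t in np.unique(x_feature)[:-1]:
        left = x_feature <= t
        n1, n2 = left.sum(), (~left).sum()
        delta = n1 * n2 / n_p**2 * (y[left].mean() - y[~left].mean())**2
        if delta > best[1]:
            best = (t, delta)
    return best

rng = np.random.default_rng(2)
x = rng.uniform(size=200)
y = np.where(x > 0.5, 1.0, 0.0) + rng.normal(scale=0.1, size=200)
print(best_split_exact(x, y))   # the chosen threshold should be near 0.5
```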
We now have the node splitting criterion, but in practice computing $\hat\theta_{C_j}(\mathcal{J})$ for every candidate split is expensive, so the authors propose a gradient-based approximation:
gradient tree algorithm
First, Proposition 1 states that $\hat\theta_{C}$ admits the following approximation $\tilde\theta_{C}$:
$$\tilde\theta_{C}=\hat\theta_p-\frac{1}{|\{i:X_i\in C\}|}\sum_{\{i:X_i\in C\}}\xi^T A_p^{-1}\Psi_{\hat\theta_p,\hat\nu_p}(O_i) \tag{10}$$
Here $\hat\theta_p$ is the estimate of $\theta$ on the parent node $P$, obtained from Eq. (1) or Eq. (2). As for $\xi$, the paper says it is the vector that selects the $\theta$-coordinate out of the $(\theta,\nu)$ vector, although other papers I have read simply omit it. $A_p$ is the expected gradient of $\Psi_{\hat\theta_p,\hat\nu_p}(O_i)$, computed as:
$$A_p=\nabla\,\mathbb{E}[\Psi_{\hat\theta_p,\hat\nu_p}(O_i)\mid X_i\in P]=\frac{1}{|\{i:X_i\in P\}|}\sum_{\{i:X_i\in P\}}\nabla\Psi_{\hat\theta_p,\hat\nu_p}(O_i) \tag{11}$$
But I'm not entirely sure with respect to what the gradient here is taken (presumably with respect to $(\theta,\nu)$, evaluated at $(\hat\theta_p,\hat\nu_p)$).
Once $\hat\theta_{C}$ has the approximation $\tilde\theta_{C}$, the criterion $\Delta(C_1,C_2)$ also has a corresponding approximation $\tilde\Delta(C_1,C_2)$ (the derivation is omitted here for now):
$$\tilde\Delta(C_1,C_2)=\sum_{j=1}^2\frac{1}{|\{i:X_i\in C_j\}|}\Bigl(\sum_{\{i:X_i\in C_j\}}\rho_i\Bigr)^2 \tag{12}$$
where $\rho_i=-\xi^T A_p^{-1}\Psi_{\hat\theta_p,\hat\nu_p}(O_i)$ measures the influence of the $i$-th sample on the estimate $\hat\theta_p$.
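To make Eqs. (10)–(12) concrete, take the regression scoring function $\Psi_\theta(O_i)=Y_i-\theta$ again: then $\xi=1$, $A_p=-1$, so $\rho_i=Y_i-\hat\theta_p$, and $\tilde\Delta$ of Eq. (12) turns out to equal $n_p$ times the exact $\Delta$ of Eq. (9), i.e. the two criteria rank splits identically in this case. A minimal check (my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=100)
left = rng.random(100) < 0.4          # an arbitrary candidate split

# labeling step for the regression case: A_p = -1, so rho_i = y_i - mean(y_P)
rho = y - y.mean()

# gradient-based criterion, Eq. (12)
delta_tilde = sum(rho[idx].sum() ** 2 / idx.sum() for idx in (left, ~left))

# exact criterion, Eq. (9)
n1, n2, n_p = left.sum(), (~left).sum(), len(y)
delta_exact = n1 * n2 / n_p**2 * (y[left].mean() - y[~left].mean())**2

print(delta_tilde, n_p * delta_exact)   # identical up to floating-point error
```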
Node splitting can thus be summarized in the following two steps:
1. labeling step
In this step, first compute $\hat\theta_p$ and $A_p$, then compute $\rho_i$. Note that $\rho_i$ only needs to be computed once per parent node, not once per candidate split (because the parent node is fixed).
2. regression step
Find the child nodes that maximize $\tilde\Delta(C_1,C_2)$. This step can be implemented with a standard CART regression split on the $\rho_i$, as in the sketch below.
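A hedged sketch of the two steps for the regression case, where the regression step is literally a depth-one CART fit on the pseudo-outcomes $\rho_i$ (for zero-mean $\rho$ within the parent, CART's variance-reduction split maximizes $\tilde\Delta$):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 3))
y = 2.0 * (X[:, 1] > 0.6) + rng.normal(scale=0.2, size=300)

# 1. labeling step (regression case): theta_hat_p = mean(y), A_p = -1,
#    so the pseudo-outcome is rho_i = y_i - theta_hat_p
rho = y - y.mean()

# 2. regression step: a depth-1 CART fit on rho picks the split that
#    maximizes Delta_tilde (variance reduction on zero-mean rho)
stump = DecisionTreeRegressor(max_depth=1).fit(X, rho)
print("split feature:", stump.tree_.feature[0],
      "threshold:", stump.tree_.threshold[0])   # expect feature 1, around 0.6
```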
GRF for CATE
Next, let's see how GRF is applied to CATE estimation.
In this application, the authors base $\Psi(\cdot)$ on the partially linear model, which assumes the data have the following structure:
$$Y=\theta(x)\cdot T+g(x)+\epsilon,\qquad T=f(x)+\eta \tag{13}$$
The "partial linearity" refers mainly to the structural equation for $Y$.
In the CATE estimation problem, $\theta(x)$ is the treatment effect at $x$, formally $\theta(x)=\mathbb{E}[Y(T=1)-Y(T=0)\mid X=x]$.
Based on the partially linear model, the authors construct the scoring function $\Psi_{\theta(x),\nu(x)}(O_i)=Y_i-\theta(x)\cdot T_i-\nu(x)$; intuitively, this scoring function seeks $(\hat\theta(x),\hat\nu(x))$ such that $Y_i$ is as close as possible to $\theta(x)\cdot T_i+\nu(x)$ (essentially a fitting problem).
Under this setting, the relevant quantities are solved as follows:
$$\hat\theta(x)=\xi^T\cdot\frac{\mathrm{Cov}(T_i,Y_i\mid X_i=x)}{\mathrm{Var}(T_i\mid X_i=x)} \tag{14}$$
$$A_p=\frac{1}{|\{i:X_i\in P\}|}\sum_{\{i:X_i\in P\}}(T_i-\bar T_p)^{\otimes 2} \tag{15}$$
$$\rho_i=\xi^T A_p^{-1}\bigl(Y_i-\bar Y_p-(T_i-\bar T_p)\cdot \hat\theta_p\bigr) \tag{16}$$
Among these derivations, so far I only understand where Eq. (14) comes from:
Consider $Y=\theta(x)\cdot T+g(x)$: the optimal $\theta(x)$ can be viewed as the slope of a one-dimensional linear fit $y=ax+b$, and that slope can be expressed in terms of a covariance and a variance [see references].
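The standard least-squares identity behind that statement is:

$$
(a^*,b^*)=\arg\min_{a,b}\ \mathbb{E}\bigl[(Y-aT-b)^2\bigr]
\;\Longrightarrow\;
a^*=\frac{\mathrm{Cov}(T,Y)}{\mathrm{Var}(T)},\qquad b^*=\mathbb{E}[Y]-a^*\,\mathbb{E}[T].
$$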
Note that the means, variances, and covariances here are all weighted, with weights $\alpha_i$.
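A minimal numpy sketch of Eq. (14) with $\alpha$-weighted moments (scalar treatment, so $\xi$ is dropped; my own illustration, not the grf implementation):

```python
import numpy as np

def theta_hat(alpha, T, Y):
    """Eq. (14) with alpha-weighted mean/variance/covariance
    (scalar treatment, so the xi selector is omitted)."""
    T_bar, Y_bar = alpha @ T, alpha @ Y
    cov = alpha @ ((T - T_bar) * (Y - Y_bar))
    var = alpha @ ((T - T_bar) ** 2)
    return cov / var

# toy data with a constant treatment effect of 1.5
rng = np.random.default_rng(5)
n = 500
T = rng.binomial(1, 0.5, size=n).astype(float)
Y = 1.5 * T + rng.normal(size=n)
alpha = np.full(n, 1 / n)            # uniform weights for the toy check
print(theta_hat(alpha, T, Y))        # close to 1.5
```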
CausalForestDML
As the name suggests, CausalForestDML fuses Causal Forest and DML. When estimating CATE, the core idea of DML rests on the following equation:
$$Y-\mathbb{E}[Y\mid X]=\theta(x)\cdot (T-\mathbb{E}[T\mid X]),\quad\text{i.e.,}\quad \tilde Y=\theta(x)\cdot \tilde T \tag{17}$$
That is, the CATE estimation problem is converted into regressing the residual of $Y$ on the residual of $T$, and the regression coefficient is the CATE. (Computing the residuals is essentially an orthogonalization step.)
Based on the DML idea, CausalForestDML uses the following scoring function: $\Psi_{\theta(x),\nu(x)}(O_i)=Y_i-\mathbb{E}[Y_i\mid X_i]-\theta(x)\cdot (T_i-\mathbb{E}[T_i\mid X_i])-\nu(x)$.
The corresponding optimal $\theta(x)$ becomes $\hat\theta(x)=\xi^T\cdot\frac{\mathrm{Cov}(Y_i-\mathbb{E}[Y_i\mid X_i],\,T_i-\mathbb{E}[T_i\mid X_i]\mid X_i=x)}{\mathrm{Var}(T_i-\mathbb{E}[T_i\mid X_i]\mid X_i=x)}$.
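A hedged sketch of the residualization idea (cross-fitting omitted, and linear first-stage learners are used because the toy nuisance functions are linear; in GRF the uniform weights below would be replaced by the forest weights $\alpha_i(x)$):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 2000
X = rng.normal(size=(n, 5))
T = 0.5 * X[:, 0] + rng.normal(size=n)        # confounded treatment
Y = 2.0 * T + X[:, 0] + rng.normal(size=n)    # true effect = 2.0

# first stage: estimate E[Y|X] and E[T|X]; the toy nuisances are linear,
# so a linear learner suffices (real DML would cross-fit flexible learners)
Y_res = Y - LinearRegression().fit(X, Y).predict(X)
T_res = T - LinearRegression().fit(X, T).predict(X)

# second stage: slope of the residual-on-residual regression, cf. Eq. (17)
theta = (T_res @ Y_res) / (T_res @ T_res)
print(theta)   # close to 2.0
```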
Note: the original paper calls this approach centered GRF, and its Table 1 shows that GRF with centering performs better than GRF without centering.
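For reference, econml implements this as CausalForestDML; a minimal usage sketch (argument and method names are as I recall them from the econml documentation, so treat them as assumptions and check them against your installed version):

```python
# A minimal usage sketch of econml's CausalForestDML (API from memory --
# double-check argument names against your econml version).
import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 5))
T = 0.5 * X[:, 0] + rng.normal(size=n)
Y = (1.0 + X[:, 1]) * T + X[:, 0] + rng.normal(size=n)   # CATE = 1 + x1

est = CausalForestDML(
    model_y=RandomForestRegressor(),   # first stage for E[Y|X] (centering)
    model_t=RandomForestRegressor(),   # first stage for E[T|X] (centering)
    n_estimators=500,
)
est.fit(Y, T, X=X)          # second stage: the (centered) causal forest
cate = est.effect(X[:10])   # theta_hat(x) for the first ten points
print(cate)
```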
GRF vs Causal Forest
Both papers are from Susan Athey's group. Compared with GRF, the biggest differences of the (earlier) Causal Forest are:
- Causal Forest splits with the exact loss criterion (Eq. (9)) rather than the gradient-based loss criterion (Eq. (12));
- Causal Forest does not use the weighted average when computing the treatment effect;