Paper notes: generalized random forests
2022-06-25 16:36:00 【#Super Pig】
These notes skip the asymptotic theory for now and focus on GRF's prediction and splitting methods.
References:
- Zhihu blog: https://zhuanlan.zhihu.com/p/448524822
- S. Athey, J. Tibshirani, and S. Wager, "Generalized random forests," Ann. Statist., vol. 47, no. 2, Apr. 2019, doi: 10.1214/18-AOS1709.
Motivation
The paper proposes a general forest-based estimation framework that generalizes the random forest; this is the work's main contribution. Specifically, the general estimation target is defined by:
$$\mathbb{E}[\Psi_{\theta(x),\nu(x)}(O_i)\mid X_i=x]=0 \tag{1}$$
Here, $\Psi(\cdot)$ is the scoring function, which can be understood as a loss function or optimization objective; $\theta(x)$ is the quantity we want to estimate; $\nu(x)$ is an optional nuisance parameter; and $O_i$ is the observation relevant to $\theta(x)$. The goal is to build a forest such that Eq. (1) holds.
To achieve this goal (Eq. (1)), we solve the following optimization problem:
$$(\hat\theta(x),\hat\nu(x))=\arg\min_{\theta,\nu}\left\|\sum_{i=1}^{n}\alpha_i(x)\cdot\Psi_{\theta,\nu}(O_i)\right\|_2 \tag{2}$$
The minimizer $\hat\theta(x)$ of this problem is our estimate. As for $\alpha_i(x)$, it measures the similarity between training sample $i$ and the test point $x$ and acts as a weight; it is computed as follows:
$$\alpha_i(x)=\frac{1}{B}\cdot\sum_{b=1}^B\alpha_{bi}(x) \tag{3}$$

$$\alpha_{bi}(x)=\frac{\mathbb{1}(\{X_i\in L_b(x)\})}{|L_b(x)|} \tag{4}$$
Here, $B$ is the number of trees, $b$ indexes the $b$-th tree, and $L_b(x)$ is the set of training samples that fall into the same leaf as the test point $x$ in the $b$-th tree. Therefore $\alpha_{bi}(x)$ records (normalized by leaf size) whether sample $i$ shares a leaf with $x$ in tree $b$, and $\alpha_i(x)$ is the frequency with which sample $i$ and $x$ fall into the same leaf across the forest (this frequency reflects similarity). Note that $\sum_{i=1}^n\alpha_i(x)=1$!
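As a concrete illustration of Eqs. (3)-(4), the weights can be computed from any fitted tree ensemble by comparing leaf memberships. The sketch below is my own (not the paper's implementation) and uses a plain scikit-learn forest rather than GRF's honest, subsample-grown trees; all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_weights(forest, X_train, x_test):
    """alpha_i(x) of Eqs. (3)-(4): how often training sample i shares a leaf
    with the test point x, normalized by leaf size and averaged over trees."""
    train_leaves = forest.apply(X_train)                # (n_train, B) leaf ids
    test_leaves = forest.apply(x_test.reshape(1, -1))   # (1, B) leaf ids

    n_train, B = train_leaves.shape
    alpha = np.zeros(n_train)
    for b in range(B):
        same_leaf = train_leaves[:, b] == test_leaves[0, b]  # 1({X_i in L_b(x)})
        alpha += same_leaf / same_leaf.sum()                 # alpha_{bi}(x), Eq. (4)
    return alpha / B                                         # Eq. (3); sums to 1

# toy usage
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=200)
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5).fit(X_train, y_train)
alpha = forest_weights(rf, X_train, X_train[0])
assert np.isclose(alpha.sum(), 1.0)
```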
To sum up, Eq. (1) and Eq. (2) are essentially equivalent. Once the forest is built, predictions are obtained from Eq. (1) or (2); these two formulas are the core of GRF. Many statistical problems (e.g., least squares, maximum likelihood, quantile regression) can be viewed as special cases of Eq. (1).
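For instance, for quantile regression the scoring function is $\Psi_\theta(O_i)=q-\mathbb{1}(Y_i\le\theta)$, and solving the weighted estimating equation (2) reduces to a weighted quantile. A small sketch of this special case (my own illustration; it reuses the `forest_weights` helper and toy data from the previous sketch):

```python
import numpy as np

def weighted_quantile(y, alpha, q):
    """Solve sum_i alpha_i * (q - 1{y_i <= theta}) = 0 for theta,
    i.e. the alpha-weighted q-quantile of y (a special case of Eq. (2))."""
    order = np.argsort(y)
    y_sorted, w_sorted = y[order], alpha[order]
    cdf = np.cumsum(w_sorted)        # weighted CDF evaluated at the sorted y values
    idx = np.searchsorted(cdf, q)    # first index where the CDF reaches q
    return y_sorted[min(idx, len(y) - 1)]

# usage with the weights from the previous sketch:
# theta_hat = weighted_quantile(y_train, alpha, q=0.5)   # weighted median at x
```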
Case of Regression
Taking the regression problem as an example, we show that the random forest is a special case of GRF:
For regression, the quantity we care about is $\mu(x)=\mathbb{E}[Y_i\mid X_i=x]$ (here $\mu(x)$ plays the role of $\theta(x)$), and the corresponding scoring function is $\Psi_{\mu(x)}(O_i)=Y_i-\mu(x)$.
Meanwhile, we know that once a random forest is built, the prediction for a test point $x$ is the average of the $Y$ values of the training samples in the leaves containing $x$, formally:
$$\hat\mu(x)=\frac{1}{B}\cdot\sum_{b=1}^B\hat\mu_b(x),\qquad \hat\mu_b(x)=\frac{\sum_{\{i:X_i\in L_b(x)\}}Y_i}{|L_b(x)|} \tag{5}$$
Now we only need to show that, when the scoring function is $\Psi_{\mu(x)}(O_i)=Y_i-\mu(x)$, Eq. (5) holds if and only if Eq. (1) holds. The proof is as follows:
With this scoring function, Eq. (1) (via its weighted empirical counterpart, Eq. (2)) is equivalent to Eq. (6):
$$\sum_{i=1}^n\alpha_i(x)\cdot(Y_i-\hat\mu(x))=0 \tag{6}$$
Since $\sum_{i=1}^n\alpha_i(x)=1$, Eq. (6) can be rearranged into Eq. (7):
$$\begin{aligned} \hat\mu(x) &=\sum_{i=1}^n\alpha_i(x)\cdot Y_i \\ &=\sum_{i=1}^n \frac{1}{B}\cdot\sum_{b=1}^B\alpha_{bi}(x)\cdot Y_i \\ &=\frac{1}{B}\cdot\sum_{b=1}^B\sum_{i=1}^n\frac{\mathbb{1}(\{X_i\in L_b(x)\})}{|L_b(x)|}\cdot Y_i \\ &=\frac{1}{B}\cdot\sum_{b=1}^B \hat\mu_b(x) \end{aligned} \tag{7}$$
Thus, when Eq. (1) holds, Eq. (6) follows and the GRF estimate coincides with the usual random forest prediction in Eq. (5); therefore the random forest is a special case of GRF.
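The chain of equalities in Eq. (7) can also be checked numerically. A quick sketch (reusing the toy forest and the `forest_weights` helper from above): the $\alpha$-weighted average of the labels equals the average of the per-tree leaf means of Eq. (5).

```python
import numpy as np

x = X_train[0]
alpha = forest_weights(rf, X_train, x)

# left-hand side of Eq. (7): alpha-weighted average of the labels
mu_weighted = np.dot(alpha, y_train)

# right-hand side: average over trees of the leaf means, Eq. (5)
train_leaves = rf.apply(X_train)
test_leaves = rf.apply(x.reshape(1, -1))[0]
mu_per_tree = [
    y_train[train_leaves[:, b] == test_leaves[b]].mean()
    for b in range(rf.n_estimators)
]
assert np.isclose(mu_weighted, np.mean(mu_per_tree))
```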
Split criterion
The original idea is to minimize the error between the child-node estimates and the true value, i.e., to minimize $err(C_1,C_2)$:
$$err(C_1,C_2)=\sum_{j=1}^2\mathbb{P}[X\in C_j\mid X\in P]\cdot\mathbb{E}\big[(\hat\theta_{C_j}(\mathcal{J})-\theta(x))^2\mid X\in C_j\big] \tag{8}$$
However, since the true value $\theta(x)$ is unknown, after some derivation the problem of minimizing $err(C_1,C_2)$ is converted into maximizing $\Delta(C_1,C_2)$:
$$\Delta(C_1,C_2)=\frac{n_{C_1}\cdot n_{C_2}}{n_P^2}\cdot\big(\hat\theta_{C_1}(\mathcal{J})-\hat\theta_{C_2}(\mathcal{J})\big)^2 \tag{9}$$
After this transformation, we can see that maximizing Eq. (9) amounts to maximizing the heterogeneity between the two child nodes.
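As a concrete illustration (my own sketch, regression case only, where $\hat\theta_{C_j}$ is simply the child-node mean), the criterion of Eq. (9) can be scanned over candidate thresholds of a single feature as follows:

```python
import numpy as np

def delta(y_left, y_right):
    """Heterogeneity criterion of Eq. (9), regression case:
    theta_hat_C is the mean outcome within child C."""
    n1, n2 = len(y_left), len(y_right)
    return n1 * n2 / (n1 + n2) ** 2 * (y_left.mean() - y_right.mean()) ** 2

def best_split(x_feature, y):
    """Pick the threshold on one feature that maximizes Eq. (9)."""
    best_t, best_d = None, -np.inf
    for t in np.unique(x_feature)[:-1]:
        left, right = y[x_feature <= t], y[x_feature > t]
        d = delta(left, right)
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d
```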
With this, we know the node-splitting criterion. In practice, however, computing $\hat\theta_{C_j}(\mathcal{J})$ for every candidate split is expensive, so the authors propose a gradient-based approximation:
Gradient tree algorithm
First, Proposition 1 states that $\hat\theta_{C}$ admits the following approximation $\tilde\theta_{C}$:
$$\tilde\theta_{C}=\hat\theta_p-\frac{1}{|\{i:X_i\in C\}|}\cdot\sum_{\{i:X_i\in C\}}\xi^T\cdot A_p^{-1}\Psi_{\hat\theta_p,\hat\nu_p}(O_i) \tag{10}$$
Here, $\hat\theta_p$ is the estimate of $\theta$ on the parent node $P$, obtained from Eq. (1) or Eq. (2). As for $\xi$, the paper describes it as the vector that selects the $\theta$-coordinates out of the $(\theta,\nu)$ vector, although other papers I have read often omit it. $A_p$ is the expected gradient of $\Psi_{\hat\theta_p,\hat\nu_p}(O_i)$, computed as:
$$A_p=\nabla\mathbb{E}\big[\Psi_{\hat\theta_p,\hat\nu_p}(O_i)\mid X_i\in P\big]=\frac{1}{|\{i:X_i\in P\}|}\cdot\sum_{\{i:X_i\in P\}}\nabla\Psi_{\hat\theta_p,\hat\nu_p}(O_i) \tag{11}$$
But I am not entirely sure with respect to which variable this gradient is taken.
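My reading (an assumption on my part) is that $\nabla$ denotes the gradient of $\Psi$ with respect to the parameters $(\theta,\nu)$, evaluated at $(\hat\theta_p,\hat\nu_p)$. A quick sanity check in the regression case: with $\Psi_{\mu}(O_i)=Y_i-\mu$ and no nuisance parameter, $\nabla\Psi=-1$, hence $A_p=-1$ and $\rho_i=-\xi^T A_p^{-1}\Psi_{\hat\mu_p}(O_i)=Y_i-\hat\mu_p$. Plugging this into Eq. (10) gives

$$\tilde\mu_C=\hat\mu_p+\frac{1}{|\{i:X_i\in C\}|}\sum_{\{i:X_i\in C\}}(Y_i-\hat\mu_p)=\frac{1}{|\{i:X_i\in C\}|}\sum_{\{i:X_i\in C\}}Y_i,$$

i.e., exactly the child-node mean, as one would hope.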
Given the approximation $\tilde\theta_{C}$ of $\hat\theta_{C}$, one can show that $\Delta(C_1,C_2)$ also has a corresponding approximation $\tilde\Delta(C_1,C_2)$ (the derivation of this step is omitted here for now):
$$\tilde\Delta(C_1,C_2)=\sum_{j=1}^2\frac{1}{|\{i:X_i\in C_j\}|}\cdot\Big(\sum_{\{i:X_i\in C_j\}}\rho_i\Big)^2 \tag{12}$$
where $\rho_i=-\xi^T\cdot A_p^{-1}\cdot\Psi_{\hat\theta_p,\hat\nu_p}(O_i)$ represents the influence of the $i$-th sample on the estimate $\hat\theta_p$.
With this, node splitting can be summarized in the following two steps (see the sketch after the list):
1. Labeling step
In this step, first compute $\hat\theta_p$ and $A_p$, then compute $\rho_i$. Note that at each split, each $\rho_i$ only needs to be computed once, because the parent node is already fixed ($\rho_i$ does not depend on the candidate children).
2. Regression step
Search for the pair of child nodes that maximizes $\tilde\Delta(C_1,C_2)$. This step can be implemented as a standard CART regression split on the pseudo-outcomes $\rho_i$.
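A minimal sketch of the two steps (my own illustration; `rho_fn` encapsulates the labeling step for a given scoring function, and the regression step maximizes Eq. (12) over the thresholds of one feature):

```python
import numpy as np

def grf_split(x_feature, y, rho_fn):
    """Gradient-based GRF split on a single feature."""
    # labeling step: pseudo-outcomes rho_i, computed once for the parent node
    rho = rho_fn(y)

    # regression step: scan thresholds and maximize tilde-Delta of Eq. (12)
    best_t, best_d = None, -np.inf
    for t in np.unique(x_feature)[:-1]:
        left, right = rho[x_feature <= t], rho[x_feature > t]
        d = left.sum() ** 2 / len(left) + right.sum() ** 2 / len(right)
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d

# regression case (see the sanity check above): A_p = -1, rho_i = Y_i - mu_hat_p
regression_rho = lambda y: y - y.mean()
# threshold, crit = grf_split(X_train[:, 0], y_train, regression_rho)
```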
GRF for CATE
Next, let's see how GRF is applied to CATE estimation.
In this application, the authors base $\Psi(\cdot)$ on a partially linear model, i.e., the data satisfy the following structure:
$$Y=\theta(x)\cdot T+g(x)+\epsilon,\qquad T=f(x)+\eta \tag{13}$$
The "partial linearity" is mainly reflected in the structure of $Y$.
In the CATE estimation problem, $\theta(x)$ is the treatment effect at $x$, formalized as $\theta(x)=\mathbb{E}[Y(T=1)-Y(T=0)\mid X=x]$.
Based on the partially linear model, the authors construct the scoring function $\Psi_{\theta(x),\nu(x)}(O_i)=Y_i-\theta(x)\cdot T_i-\nu(x)$. Intuitively, this scoring function seeks a pair $(\hat\theta(x),\hat\nu(x))$ that brings $Y_i$ as close as possible to $\theta(x)\cdot T_i+\nu(x)$ (essentially a fitting problem).
Under this setting, the relevant quantities work out to:
$$\hat\theta(x)=\xi^T\cdot\frac{\mathrm{Cov}(T_i,Y_i\mid X_i=x)}{\mathrm{Var}(T_i\mid X_i=x)} \tag{14}$$
$$A_p=\frac{1}{|\{i:X_i\in P\}|}\cdot\sum_{\{i:X_i\in P\}}(T_i-\bar T_p)^{\otimes 2} \tag{15}$$
$$\rho_i=\xi^T\cdot A_p^{-1}\cdot\big(Y_i-\bar Y_p-(T_i-\bar T_p)\cdot\hat\theta_p\big) \tag{16}$$
Of these derivations, so far I only understand where Eq. (14) comes from:
Consider $Y=\theta(x)\cdot T+g(x)$; the optimal $\theta(x)$ can be viewed as the slope of the one-dimensional linear equation $y=ax+b$, and this slope can be expressed through the variance and covariance [reference].
Note that the means, variances, and covariances here are weighted, with weights $\alpha_i$.
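Putting this together, a minimal sketch of Eq. (14) for a scalar treatment (the $\xi^T$ selector is then trivial); here `alpha` is the forest-weight vector from the earlier sketch, and `T`, `Y` are the training treatments and outcomes (names are illustrative):

```python
import numpy as np

def cate_estimate(alpha, T, Y):
    """theta_hat(x) of Eq. (14) using alpha-weighted moments."""
    T_bar = np.dot(alpha, T)                              # weighted mean of T
    Y_bar = np.dot(alpha, Y)                              # weighted mean of Y
    cov_ty = np.dot(alpha, (T - T_bar) * (Y - Y_bar))     # weighted Cov(T, Y | X = x)
    var_t = np.dot(alpha, (T - T_bar) ** 2)               # weighted Var(T | X = x)
    return cov_ty / var_t
```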
CausalForestDML
As the name suggests, CausalForestDML combines Causal Forest and DML. The core idea of DML for CATE estimation is based on the following equation:
$$Y-\mathbb{E}[Y\mid X]=\theta(x)\cdot(T-\mathbb{E}[T\mid X]),\quad\text{i.e.}\quad \tilde Y=\theta(x)\cdot\tilde T \tag{17}$$
That is, the CATE estimation problem is converted into regressing the residual of $Y$ on the residual of $T$, and the regression coefficient is the CATE (the residualization is in fact an orthogonalization step).
Based on the DML idea, CausalForestDML uses the following scoring function: $\Psi_{\theta(x),\nu(x)}(O_i)=Y_i-\mathbb{E}[Y_i\mid X_i]-\theta(x)\cdot(T_i-\mathbb{E}[T_i\mid X_i])-\nu(x)$.
The corresponding optimal $\theta(x)$ becomes $\hat\theta(x)=\xi^T\cdot\frac{\mathrm{Cov}(Y_i-\mathbb{E}[Y_i\mid X_i],\,T_i-\mathbb{E}[T_i\mid X_i]\mid X_i=x)}{\mathrm{Var}(T_i-\mathbb{E}[T_i\mid X_i]\mid X_i=x)}$.
Note: the original paper calls this method centered GRF! Table 1 there shows that GRF with centering performs better than GRF without centering.
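A minimal sketch of the orthogonalization (centering) step with cross-fitting, using scikit-learn (my own illustration, not econml's implementation); the residuals would then be fed into the GRF machinery above, e.g. Eq. (14) applied to $\tilde Y$ and $\tilde T$:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def residualize(X, T, Y, n_splits=5):
    """Cross-fitted residuals Y - E[Y|X] and T - E[T|X] (DML orthogonalization)."""
    y_model = RandomForestRegressor(n_estimators=200, min_samples_leaf=5)
    t_model = RandomForestRegressor(n_estimators=200, min_samples_leaf=5)
    # out-of-fold predictions: each residual uses a model that never saw that row
    Y_hat = cross_val_predict(y_model, X, Y, cv=n_splits)
    T_hat = cross_val_predict(t_model, X, T, cv=n_splits)
    return Y - Y_hat, T - T_hat

# Y_tilde, T_tilde = residualize(X_train, T_train, Y_train)
# theta_hat_x = cate_estimate(alpha, T_tilde, Y_tilde)   # Eq. (14) on the residuals
```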
GRF vs Causal Forest
Both papers are from Susan Athey's group. Compared with GRF, the biggest differences of Causal Forest are:
- Causal Forest splits using the exact loss criterion (i.e., Eq. (9)) rather than the gradient-based loss criterion (i.e., Eq. (12));
- When computing the treatment effect, Causal Forest does not use the weighted average of Eq. (2).