
[Fundamentals of Machine Learning 04] Matrix Factorization

2022-06-22 06:54:00 chad_lee

Having finished the basics of machine learning, the author also shares the following matrix-factorization-based CTR models for reference:
Matrix factorization, advanced: FM, FFM
Matrix factorization and deep learning: DeepFM, xDeepFM
Matrix factorization and feature crossing: Wide & Deep, Deep & Cross Network

Matrix Factorization


For a dataset $\mathcal{D}$, the squared-error measure of the hypothesis is:

$$E_{\text{in}}\left(\{\mathbf{w}_{m}\},\{\mathbf{v}_{n}\}\right)=\frac{1}{\sum_{m=1}^{M}\left|\mathcal{D}_{m}\right|} \sum_{\text{user } n \text{ rated movie } m}\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2}$$
The task is now to use the dataset $\mathcal{D}$ to learn $\mathbf{v}_{n}$ and $\mathbf{w}_{m}$ so that this error is minimized:

$$\begin{aligned}
\min_{\mathbf{W}, \mathbf{V}} E_{\text{in}}\left(\{\mathbf{w}_{m}\},\{\mathbf{v}_{n}\}\right) &\propto \sum_{\text{user } n \text{ rated movie } m}\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2} \\
&=\sum_{m=1}^{M}\left(\sum_{(\mathbf{x}_{n}, r_{nm}) \in \mathcal{D}_{m}}\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2}\right) \\
&=\sum_{n=1}^{N}\left(\sum_{(\mathbf{x}_{n}, r_{nm}) \in \mathcal{D}_{m}}\left(r_{nm}-\mathbf{v}_{n}^{T} \mathbf{w}_{m}\right)^{2}\right)
\end{aligned}$$
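As a concrete illustration, here is a minimal NumPy sketch that evaluates this averaged squared error over the observed ratings. The toy sizes and the names `N`, `M`, `d_tilde`, `W`, `V`, and `ratings` are illustrative assumptions, not part of the original post.

```python
import numpy as np

# Hypothetical toy setup: N users, M movies, latent dimension d~.
N, M, d_tilde = 4, 5, 3
rng = np.random.default_rng(0)
V = rng.normal(size=(N, d_tilde))   # row n is the user feature vector v_n
W = rng.normal(size=(M, d_tilde))   # row m is the movie feature vector w_m

# Observed ratings as (user n, movie m, rating r_nm) triples.
ratings = [(0, 1, 4.0), (0, 3, 2.0), (2, 1, 5.0), (3, 4, 3.0)]

def e_in(W, V, ratings):
    """Average of (r_nm - w_m^T v_n)^2 over all observed ratings."""
    return sum((r - W[m] @ V[n]) ** 2 for n, m, r in ratings) / len(ratings)

print(e_in(W, V, ratings))
```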
Because there are two sets of variables, $\mathbf{v}_{n}$ and $\mathbf{w}_{m}$, optimizing both at the same time is difficult, so the basic idea is alternating minimization:

  1. Fix $\mathbf{v}_{n}$, i.e., fix the user feature vectors, then find each $\mathbf{w}_{m}$ that minimizes $E_{\text{in}}$ within $\mathcal{D}_{m}$.
  2. Fix $\mathbf{w}_{m}$, i.e., fix the movie feature vectors, then find each $\mathbf{v}_{n}$ that minimizes $E_{\text{in}}$ over the ratings made by user $n$.

This procedure is called the alternating least squares algorithm. Concretely, it proceeds as follows (a minimal code sketch follows the pseudocode):

initialize $\tilde{d}$-dimensional vectors $\{\mathbf{w}_{m}\}, \{\mathbf{v}_{n}\}$

alternating optimization of $E_{\text{in}}$: repeatedly

  1. optimize $\mathbf{w}_{1}, \mathbf{w}_{2}, \ldots, \mathbf{w}_{M}$: update $\mathbf{w}_{m}$ by $m$-th-movie linear regression on $\{(\mathbf{v}_{n}, r_{nm})\}$

  2. optimize $\mathbf{v}_{1}, \mathbf{v}_{2}, \ldots, \mathbf{v}_{N}$: update $\mathbf{v}_{n}$ by $n$-th-user linear regression on $\{(\mathbf{w}_{m}, r_{nm})\}$

until convergence
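Below is a minimal NumPy sketch of the alternating least squares loop. Each inner update is an ordinary least-squares regression solved via the normal equations; the small ridge term `lam`, the fixed iteration count, and all variable names are illustrative assumptions rather than part of the original algorithm statement.

```python
import numpy as np

def als(ratings, N, M, d_tilde=3, n_iters=20, lam=0.1, seed=0):
    """Alternating least squares for matrix factorization.
    ratings: list of (user n, movie m, rating r_nm) triples."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(N, d_tilde))  # user feature vectors v_n
    W = rng.normal(size=(M, d_tilde))  # movie feature vectors w_m

    by_movie = {m: [(n, r) for n, mm, r in ratings if mm == m] for m in range(M)}
    by_user = {n: [(m, r) for nn, m, r in ratings if nn == n] for n in range(N)}

    for _ in range(n_iters):
        # 1. fix {v_n}: per-movie linear regression on {(v_n, r_nm)}
        for m, obs in by_movie.items():
            if not obs:
                continue
            X = np.array([V[n] for n, _ in obs])
            y = np.array([r for _, r in obs])
            W[m] = np.linalg.solve(X.T @ X + lam * np.eye(d_tilde), X.T @ y)
        # 2. fix {w_m}: per-user linear regression on {(w_m, r_nm)}
        for n, obs in by_user.items():
            if not obs:
                continue
            X = np.array([W[m] for m, _ in obs])
            y = np.array([r for _, r in obs])
            V[n] = np.linalg.solve(X.T @ X + lam * np.eye(d_tilde), X.T @ y)
    return W, V
```

Each regression here is only $\tilde{d}$-dimensional, so one full pass over all movies and users is cheap; the ridge term merely guards against singular normal equations when a movie or user has very few ratings.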

SGD for Matrix Factorization

Recall that our error function is:
$$E_{\text{in}}\left(\{\mathbf{w}_{m}\},\{\mathbf{v}_{n}\}\right)=\frac{1}{\sum_{m=1}^{M}\left|\mathcal{D}_{m}\right|} \sum_{\text{user } n \text{ rated movie } m}\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2}$$
Since SGD picks out only one sample at a time for optimization, first look at the error measure on a single sample:

$$\operatorname{err}\left(\text{user } n, \text{ movie } m, \text{ rating } r_{nm}\right)=\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2}$$
Taking partial derivatives gives:

$$\begin{aligned}
\nabla_{\mathbf{v}_{n}} \operatorname{err}\left(\text{user } n, \text{ movie } m, \text{ rating } r_{nm}\right)&=-2\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right) \mathbf{w}_{m} \\
\nabla_{\mathbf{w}_{m}} \operatorname{err}\left(\text{user } n, \text{ movie } m, \text{ rating } r_{nm}\right)&=-2\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right) \mathbf{v}_{n}
\end{aligned}$$
That is, the current sample only influences its own $\mathbf{v}_{n}$ and $\mathbf{w}_{m}$; the partial derivatives with respect to all other parameters are zero. In summary:

$$\text{per-example gradient} \propto -(\text{residual})\,(\text{the other feature vector})$$
The practical steps of solving matrix factorization with stochastic gradient descent are then:

[Figure: SGD for matrix factorization, algorithm steps]
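Since the original figure did not survive extraction, the following is a minimal sketch of the procedure implied by the gradients above: repeatedly pick one known rating at random, compute the residual, and move each vector along (residual) times the other vector. The learning rate `eta`, iteration count `T`, and initialization scale are illustrative assumptions.

```python
import numpy as np

def sgd_mf(ratings, N, M, d_tilde=3, eta=0.05, T=100_000, seed=0):
    """SGD for matrix factorization.
    ratings: list of (user n, movie m, rating r_nm) triples."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(N, d_tilde))   # user feature vectors v_n
    W = rng.normal(scale=0.1, size=(M, d_tilde))   # movie feature vectors w_m

    for _ in range(T):
        # 1. randomly pick one known rating (n, m, r_nm)
        n, m, r = ratings[rng.integers(len(ratings))]
        v_n, w_m = V[n].copy(), W[m].copy()
        # 2. compute the residual r~_nm = r_nm - w_m^T v_n
        residual = r - w_m @ v_n
        # 3. gradient step: each vector moves along (residual) * (the other vector)
        V[n] = v_n + eta * residual * w_m
        W[m] = w_m + eta * residual * v_n
    return W, V
```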

Note that if the recommendation task involves time, for example when the test set consists of data recorded after the training set, the "S" in SGD can also respect time: pick (n, m) in chronological order rather than uniformly at random, as in the variant sketched below.
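A minimal sketch of that chronological variant, reusing the `W`, `V`, and `eta` conventions from the sketch above; the `timed_ratings` tuples with a timestamp field are a hypothetical input format for illustration.

```python
def sgd_mf_time_ordered(timed_ratings, W, V, eta=0.05):
    """One chronological pass of the SGD updates above.
    timed_ratings: hypothetical (n, m, r_nm, timestamp) tuples;
    W, V: movie/user feature matrices as in the previous sketch."""
    for n, m, r, _ in sorted(timed_ratings, key=lambda x: x[3]):
        v_n, w_m = V[n].copy(), W[m].copy()
        residual = r - w_m @ v_n
        V[n] = v_n + eta * residual * w_m
        W[m] = w_m + eta * residual * v_n
```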


Copyright notice
This article was written by [chad_lee]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/02/202202220543470663.html