
[Fundamentals of Machine Learning 04] Matrix Factorization

2022-06-22 06:54:00 chad_lee

Having finished the basics of machine learning, the author also shares the following matrix-factorization-based CTR models for reference:
Matrix factorization, advanced: FM, FFM
Matrix factorization and deep learning: DeepFM, xDeepFM
Matrix factorization and feature crossing: Wide & Deep, Deep & Cross Network

Matrix Factorization


For a dataset $\mathcal{D}$, the squared-error measure of the hypothesis is:

$$E_{\text{in}}\left(\{\mathbf{w}_{m}\},\{\mathbf{v}_{n}\}\right)=\frac{1}{\sum_{m=1}^{M}\left|\mathcal{D}_{m}\right|} \sum_{\text{user } n \text{ rated movie } m}\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2}$$
The task is now to use the dataset $\mathcal{D}$ to learn $\mathbf{v}_{n}$ and $\mathbf{w}_{m}$ so that this error is minimized:

$$\begin{aligned}
\min_{\mathbf{W}, \mathbf{V}} E_{\text{in}}\left(\{\mathbf{w}_{m}\},\{\mathbf{v}_{n}\}\right) &\propto \sum_{\text{user } n \text{ rated movie } m}\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2} \\
&=\sum_{m=1}^{M}\left(\sum_{(\mathbf{x}_{n}, r_{nm}) \in \mathcal{D}_{m}}\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2}\right) \\
&=\sum_{n=1}^{N}\left(\sum_{(\mathbf{x}_{n}, r_{nm}) \in \mathcal{D}_{m}}\left(r_{nm}-\mathbf{v}_{n}^{T} \mathbf{w}_{m}\right)^{2}\right)
\end{aligned}$$
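As a concrete illustration, here is a minimal NumPy sketch that evaluates this averaged squared error over the observed ratings. The toy sizes and the names `N`, `M`, `d_tilde`, `W`, `V`, and `ratings` are illustrative assumptions, not part of the original post.

```python
import numpy as np

# Hypothetical toy setup: N users, M movies, latent dimension d~.
N, M, d_tilde = 4, 5, 3
rng = np.random.default_rng(0)
V = rng.normal(size=(N, d_tilde))   # row n is the user feature vector v_n
W = rng.normal(size=(M, d_tilde))   # row m is the movie feature vector w_m

# Observed ratings as (user n, movie m, rating r_nm) triples.
ratings = [(0, 1, 4.0), (0, 3, 2.0), (2, 1, 5.0), (3, 4, 3.0)]

def e_in(W, V, ratings):
    """Average of (r_nm - w_m^T v_n)^2 over all observed ratings."""
    return sum((r - W[m] @ V[n]) ** 2 for n, m, r in ratings) / len(ratings)

print(e_in(W, V, ratings))
```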
Because there are two sets of variables, $\mathbf{v}_{n}$ and $\mathbf{w}_{m}$, optimizing both at the same time is difficult, so the basic idea is alternating minimization:

  1. Fix $\mathbf{v}_{n}$, i.e., fix the user feature vectors, then find each $\mathbf{w}_{m}$ that minimizes $E_{\text{in}}$ within $\mathcal{D}_{m}$.
  2. Fix $\mathbf{w}_{m}$, i.e., fix the movie feature vectors, then find each $\mathbf{v}_{n}$ that minimizes $E_{\text{in}}$ over the ratings made by user $n$.

This procedure is called the alternating least squares algorithm. Concretely, it proceeds as follows (a minimal code sketch follows the pseudocode):

initialize $\tilde{d}$-dimensional vectors $\{\mathbf{w}_{m}\}, \{\mathbf{v}_{n}\}$

alternating optimization of $E_{\text{in}}$: repeatedly

  1. optimize $\mathbf{w}_{1}, \mathbf{w}_{2}, \ldots, \mathbf{w}_{M}$: update $\mathbf{w}_{m}$ by $m$-th-movie linear regression on $\{(\mathbf{v}_{n}, r_{nm})\}$

  2. optimize $\mathbf{v}_{1}, \mathbf{v}_{2}, \ldots, \mathbf{v}_{N}$: update $\mathbf{v}_{n}$ by $n$-th-user linear regression on $\{(\mathbf{w}_{m}, r_{nm})\}$

until convergence
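Below is a minimal NumPy sketch of the alternating least squares loop. Each inner update is an ordinary least-squares regression solved via the normal equations; the small ridge term `lam`, the fixed iteration count, and all variable names are illustrative assumptions rather than part of the original algorithm statement.

```python
import numpy as np

def als(ratings, N, M, d_tilde=3, n_iters=20, lam=0.1, seed=0):
    """Alternating least squares for matrix factorization.
    ratings: list of (user n, movie m, rating r_nm) triples."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(N, d_tilde))  # user feature vectors v_n
    W = rng.normal(size=(M, d_tilde))  # movie feature vectors w_m

    by_movie = {m: [(n, r) for n, mm, r in ratings if mm == m] for m in range(M)}
    by_user = {n: [(m, r) for nn, m, r in ratings if nn == n] for n in range(N)}

    for _ in range(n_iters):
        # 1. fix {v_n}: per-movie linear regression on {(v_n, r_nm)}
        for m, obs in by_movie.items():
            if not obs:
                continue
            X = np.array([V[n] for n, _ in obs])
            y = np.array([r for _, r in obs])
            W[m] = np.linalg.solve(X.T @ X + lam * np.eye(d_tilde), X.T @ y)
        # 2. fix {w_m}: per-user linear regression on {(w_m, r_nm)}
        for n, obs in by_user.items():
            if not obs:
                continue
            X = np.array([W[m] for m, _ in obs])
            y = np.array([r for _, r in obs])
            V[n] = np.linalg.solve(X.T @ X + lam * np.eye(d_tilde), X.T @ y)
    return W, V
```

Each regression here is only $\tilde{d}$-dimensional, so one full pass over all movies and users is cheap; the ridge term merely guards against singular normal equations when a movie or user has very few ratings.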

SGD for Matrix Factorization

Recall that our error function is:
$$E_{\text{in}}\left(\{\mathbf{w}_{m}\},\{\mathbf{v}_{n}\}\right)=\frac{1}{\sum_{m=1}^{M}\left|\mathcal{D}_{m}\right|} \sum_{\text{user } n \text{ rated movie } m}\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2}$$
Since SGD picks out only one sample at a time for optimization, first look at the error measure on a single sample:

$$\operatorname{err}\left(\text{user } n, \text{ movie } m, \text{ rating } r_{nm}\right)=\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right)^{2}$$
Taking partial derivatives gives:

$$\begin{aligned}
\nabla_{\mathbf{v}_{n}} \operatorname{err}\left(\text{user } n, \text{ movie } m, \text{ rating } r_{nm}\right)&=-2\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right) \mathbf{w}_{m} \\
\nabla_{\mathbf{w}_{m}} \operatorname{err}\left(\text{user } n, \text{ movie } m, \text{ rating } r_{nm}\right)&=-2\left(r_{nm}-\mathbf{w}_{m}^{T} \mathbf{v}_{n}\right) \mathbf{v}_{n}
\end{aligned}$$
That is, the current sample only influences its own $\mathbf{v}_{n}$ and $\mathbf{w}_{m}$; the partial derivatives with respect to all other parameters are zero. In summary:

$$\text{per-example gradient} \propto -(\text{residual})\,(\text{the other feature vector})$$
The practical steps of solving matrix factorization with stochastic gradient descent are then:

[Figure: SGD for matrix factorization, algorithm steps]
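Since the original figure did not survive extraction, the following is a minimal sketch of the procedure implied by the gradients above: repeatedly pick one known rating at random, compute the residual, and move each vector along (residual) times the other vector. The learning rate `eta`, iteration count `T`, and initialization scale are illustrative assumptions.

```python
import numpy as np

def sgd_mf(ratings, N, M, d_tilde=3, eta=0.05, T=100_000, seed=0):
    """SGD for matrix factorization.
    ratings: list of (user n, movie m, rating r_nm) triples."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(N, d_tilde))   # user feature vectors v_n
    W = rng.normal(scale=0.1, size=(M, d_tilde))   # movie feature vectors w_m

    for _ in range(T):
        # 1. randomly pick one known rating (n, m, r_nm)
        n, m, r = ratings[rng.integers(len(ratings))]
        v_n, w_m = V[n].copy(), W[m].copy()
        # 2. compute the residual r~_nm = r_nm - w_m^T v_n
        residual = r - w_m @ v_n
        # 3. gradient step: each vector moves along (residual) * (the other vector)
        V[n] = v_n + eta * residual * w_m
        W[m] = w_m + eta * residual * v_n
    return W, V
```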

Note that if the recommendation task involves time, for example when the test set consists of data recorded after the training set, the "S" in SGD can also respect time: pick (n, m) in chronological order rather than uniformly at random, as in the variant sketched below.
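A minimal sketch of that chronological variant, reusing the `W`, `V`, and `eta` conventions from the sketch above; the `timed_ratings` tuples with a timestamp field are a hypothetical input format for illustration.

```python
def sgd_mf_time_ordered(timed_ratings, W, V, eta=0.05):
    """One chronological pass of the SGD updates above.
    timed_ratings: hypothetical (n, m, r_nm, timestamp) tuples;
    W, V: movie/user feature matrices as in the previous sketch."""
    for n, m, r, _ in sorted(timed_ratings, key=lambda x: x[3]):
        v_n, w_m = V[n].copy(), W[m].copy()
        residual = r - w_m @ v_n
        V[n] = v_n + eta * residual * w_m
        W[m] = w_m + eta * residual * v_n
```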


Copyright notice
This article was written by [chad_lee]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/02/202202220543470663.html