
[Andrew Ng's Notes] Fundamentals of Machine Learning

2022-06-24 21:52:00 Zzu dish

Fundamentals of machine learning

What is machine learning?

A program is said to learn from experience E with respect to some task T and performance measure P if, after gaining experience E, its performance on T, as measured by P, improves.

For a chess program, I take the experience E to be the tens of thousands of games the program plays against itself, the task T to be playing chess, and the performance measure P to be the probability of winning when it plays against new opponents.

Supervised Learning

Supervised learning: each example in the dataset has features (feature X) and a label (target Y).

We will later cover an algorithm called the support vector machine, which uses a clever mathematical trick that lets the computer handle an effectively infinite number of features.

Unsupervised Learning

Unsupervised learning: the dataset contains only features (feature X), with no labels.

Clustering algorithms: for example, separating audio sources recorded at different distances (the cocktail party problem), or deciding whether an email is spam, etc.

[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
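The one-liner above is the Octave snippet from the course (used for the cocktail-party audio separation demo). As a separate, minimal illustration of the clustering idea itself, here is a tiny 1-D k-means sketch in Python; the data points and k = 2 are made up for illustration:

```python
# Toy 1-D k-means: two made-up groups of points, k = 2 clusters.
points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
centroids = [0.0, 10.0]  # deliberately rough initial guesses

for _ in range(10):  # a few assignment/update rounds suffice on this data
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda j: abs(p - centroids[j]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    # (No empty-cluster handling here -- fine for this toy data only.)
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # → [1.0, 9.0]
```

The centroids land on the means of the two obvious groups; real uses need multiple random restarts and empty-cluster handling.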

Self-Supervised Learning

Explanation 1: Self-supervised learning lets us obtain high-quality representations without large-scale annotated data. Instead, we use large amounts of unlabeled data and optimize predefined pretext tasks. The learned features can then be used for new tasks that lack data.

Explanation 2: Self-supervised learning is a kind of unsupervised learning whose main purpose is to learn a general feature representation for downstream tasks. The main approach is to let the data supervise itself: for example, remove a few words from a paragraph and predict the missing words from their context, or remove parts of an image and predict the missing patch from the surrounding information.
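As a toy illustration of the masked-word pretext task (the corpus and the counting "model" here are made up, and are far simpler than anything used in practice), the label really does come from the unlabeled text itself:

```python
from collections import Counter, defaultdict

# Unlabeled "corpus" (made up). Pretext task: hide one word and predict
# it from its immediate neighbors -- no human annotation needed.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog sat on the rug",
    "the cat ate the fish",
]

# Count which word appears between each (left, right) neighbor pair.
context_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words) - 1):
        context_counts[(words[i - 1], words[i + 1])][words[i]] += 1

def predict_masked(left, right):
    """Guess the hidden word as the most frequent filler for this context."""
    counts = context_counts.get((left, right))
    return counts.most_common(1)[0][0] if counts else None

print(predict_masked("the", "sat"))  # → cat
```

The point is only the supervision pattern: inputs and targets are both carved out of raw data, exactly as in masked-language-model pretraining.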

Effect:

Learn useful information from unlabeled data, to be used in subsequent tasks.

Self-supervised tasks (also known as pretext tasks) require us to define a supervised loss function. However, we usually do not care about the final performance on the pretext task itself. In fact, we are only interested in the intermediate representations that are learned; we expect these representations to capture good semantic or structural meaning and to benefit a variety of downstream practical tasks.

Concrete understanding

Linear Regression with One Variable

Univariate linear regression:

  • One possible hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$. Because it contains only one feature/input variable, this kind of problem is called univariate linear regression.

Selling a house: given the prices of houses sold previously, use this dataset to predict the price at which your friend's house can be sold.

The Training Set is described with the following notation:

  • $m$ — the number of training examples
  • $x$ — the feature / input variable
  • $y$ — the target / output variable
  • $(x, y)$ — one training example
  • $(x^{(i)}, y^{(i)})$ — the $i$-th training example
  • $h$ — the hypothesis: the solution or function produced by the learning algorithm
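Putting this notation into code — a minimal sketch with a made-up training set of m = 3 examples (the values are illustrative only):

```python
# Made-up training set: (x^(i), y^(i)) pairs, e.g. (house size, price).
training_set = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
m = len(training_set)  # number of training examples

def h(theta0, theta1, x):
    """The hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# With theta0 = 1 and theta1 = 1, h reproduces every y in this toy set.
predictions = [h(1.0, 1.0, x) for x, _ in training_set]
print(m, predictions)  # → 3 [2.0, 3.0, 4.0]
```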

Cost Function

The cost function is also called the squared-error function, or sometimes the squared-error cost function. We take the sum of the squared errors because, for most problems, and especially regression problems, this is a reasonable choice. Other cost functions also work well, but the squared-error cost function is probably the most common choice for regression problems.

The cost function lets us choose, for $h_\theta(x) = \theta_0 + \theta_1 x$, the parameters $\theta_0$ and $\theta_1$ that make the line fit the data as well as possible.

Our goal is to select the model parameters that minimize the sum of squares of modeling errors .

  • That is, find $\theta_0$, $\theta_1$ that minimize the cost function $J(\theta_0, \theta_1) = \frac{1}{2m}\sum\limits_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$.
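A direct transcription of this cost function in Python (the data points are made up; they lie exactly on y = x + 1, so the best fit has zero cost):

```python
def h(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, data):
    """Squared-error cost J(theta0, theta1) over a list of (x, y) pairs."""
    m = len(data)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in data) / (2 * m)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]  # made-up points on y = x + 1
print(cost(1.0, 1.0, data))  # → 0.0  (a perfect fit gives zero cost)
print(cost(0.0, 0.0, data) > 0)  # → True (any other line costs more here)
```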

Visualizing the relationship between $\theta_0$, $\theta_1$, and $J(\theta_0, \theta_1)$:

To see how the global minimum of the cost function arises, first simplify by fixing $\theta_0 = 0$.

Assigning successive values to $\theta_1$ and computing the corresponding $J(\theta_1)$ yields the curve relating $J(\theta_1)$ to $\theta_1$.

Contour plot: a choice such as $\theta_0 = 360$, $\theta_1 = 0$ corresponds to one point in the contour map of $J(\theta_0, \theta_1)$.
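The one-dimensional visualization ($\theta_0 = 0$, varying $\theta_1$) can be reproduced numerically; the data here are made-up points on y = 2x:

```python
# Sweep theta1 with theta0 fixed at 0 and record J(theta1), as in the
# course's 1-D "bowl" plot. Data points (made up) lie exactly on y = 2x.
def cost1(theta1, data):
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
grid = [i / 10 for i in range(0, 41)]            # theta1 from 0.0 to 4.0
best_cost, best_theta1 = min((cost1(t, data), t) for t in grid)
print(best_theta1, best_cost)  # → 2.0 0.0 (minimum sits at the true slope)
```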

Gradient Descent

Gradient descent: an algorithm for finding the $\theta_0$, $\theta_1$ that minimize the cost function $J(\theta_0, \theta_1)$.

The idea behind gradient descent: start with a random combination of parameters $\left( \theta_0, \theta_1, \ldots, \theta_n \right)$, compute the cost function, and then look for the next parameter combination that decreases the cost function the most. Keep doing this until a local minimum is reached. Because we have not tried all parameter combinations, we cannot be sure that this local minimum is the global minimum; different choices of initial parameters may lead to different local minima.

Gradient descent algorithm

  • $\alpha$ is the learning rate: it determines how large a step we take downhill in the direction in which the cost function decreases fastest. In batch gradient descent, every step subtracts from each parameter the learning rate times the partial derivative of the cost function.
  • The update on the right — compute all the new values first, then assign them simultaneously — is correct; the sequential update on the left is wrong.

The gradient descent algorithm is as follows:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

Description: assign to $\theta_j$ a new value so that $J(\theta)$ moves in the direction of steepest gradient descent; keep iterating until a local minimum is reached. Here $\alpha$ is the learning rate: it determines how large a step we take in the direction in which the cost function decreases fastest.

If we consider only $\theta_1$ (with $\theta_0 = 0$), the update term is $\alpha \frac{\partial}{\partial \theta_1} J(\theta)$: the learning rate in front, followed by the derivative of the cost function $J(\theta)$ with respect to $\theta_1$.

Consider the graph of the cost function $J(\theta_1)$ against $\theta_1$:

If the learning rate is too small, many more iterations are needed to converge.

If the learning rate is too large, the update may overshoot the local minimum, oscillating back and forth and diverging from the minimum point.
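A quick numerical illustration of both failure modes, using the simple 1-D cost $J(\theta) = \theta^2$ (gradient $2\theta$) rather than the regression cost:

```python
# Illustrative only: gradient descent on J(theta) = theta^2, whose
# gradient is 2*theta, with a small versus an overly large learning rate.
def descend(alpha, theta=1.0, steps=30):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # theta := theta - alpha * J'(theta)
    return theta

small = descend(alpha=0.1)   # each step multiplies theta by 0.8: converges
large = descend(alpha=1.1)   # each step multiplies theta by -1.2: diverges
print(abs(small) < 1e-2, abs(large) > 100)  # → True True
```

With $\alpha = 0.1$ the iterate shrinks steadily toward the minimum at 0; with $\alpha = 1.1$ it flips sign each step and grows in magnitude, the oscillating divergence described above.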

Now combine gradient descent with the cost function, and apply it to the concrete linear regression algorithm for fitting a straight line.

Gradient descent algorithm and linear regression algorithm are shown in the figure below :

Applying gradient descent to our earlier linear regression problem, the key is to work out the derivative of the cost function, namely:

$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

When $j = 0$: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

When $j = 1$: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$

Then the algorithm is rewritten as:

Repeat {

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$

}
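The Repeat loop above can be sketched directly in Python; the dataset (points on y = 2x + 1) and the hyperparameters are made up for illustration:

```python
# Minimal batch gradient descent for univariate linear regression,
# following the update rule above. Synthetic data: y = 2x + 1.
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
m = len(data)
theta0, theta1, alpha = 0.0, 0.0, 0.1

for _ in range(2000):  # Repeat { ... }
    # Batch: each step sums the error over ALL m training examples.
    err = [(theta0 + theta1 * x) - y for x, y in data]
    grad0 = sum(err) / m
    grad1 = sum(e * x for e, (x, _) in zip(err, data)) / m
    # Simultaneous update: both gradients are computed before assignment.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(round(theta0, 3), round(theta1, 3))  # → 1.0 2.0
```

The tuple assignment on the last line of the loop is what makes the update simultaneous, matching the "compute all values, then assign" rule noted earlier.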

Reference: https://zhuanlan.zhihu.com/p/328261042

Batch gradient descent: every step of gradient descent uses the entire training set.

Summary

1. Hypothesis function (Hypothesis)

A linear function is used to fit the sample data set; it can be defined simply as:

$h_\theta(x) = \theta_0 + \theta_1 x$

where $\theta_0$ and $\theta_1$ are the parameters.

2. Cost function (Cost Function)

Measures the "loss" of a hypothesis function; also known as the "squared-error function" (Squared Error Function), defined as:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

That is, sum the squares of the differences between the hypothesis's predictions and the true values over all samples, and divide by $2m$ to get an average "loss". Our task is to find the $\theta_0$ and $\theta_1$ that make this "loss" minimal.

3. Gradient descent (Gradient Descent)

Gradient: the direction in which the directional derivative of a function at a point is maximal, i.e., the direction of the greatest rate of change (slope) at that point.

Gradient descent: move the parameters $\theta_0$ and $\theta_1$ in the direction that makes $J(\theta_0, \theta_1)$ descend fastest, reaching the minimum of $J$ as quickly as possible. It is defined as:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

As taught in Andrew Ng's course, gradient descent requires all parameters to "fall" simultaneously. In practice we take the partial derivative with respect to $\theta_0$ and $\theta_1$ separately: when differentiating with respect to $\theta_1$ we hold $\theta_0$ fixed and treat $\theta_1$ as the variable, and the reverse when differentiating with respect to $\theta_0$.

We know the cost function is $J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, where $h_\theta(x) = \theta_0 + \theta_1 x$. By the chain rule for composite functions, $\frac{\partial J}{\partial \theta_j} = \frac{\partial J}{\partial h_\theta} \cdot \frac{\partial h_\theta}{\partial \theta_j}$, which gives:

$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$

Finally, the course's results:

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$

Copyright notice: this article was created by [Zzu dish]. Please include the original link when reposting: https://yzsam.com/2022/175/202206241450266921.html