
[Andrew Ng's Notes] Fundamentals of Machine Learning

2022-06-24 21:52:00 Zzu dish

Fundamentals of machine learning

What is machine learning?

A program is said to learn from experience E with respect to some task T and performance measure P if, after gaining experience E, its performance on T, as measured by P, improves.

For a chess program, I take the experience E to be the tens of thousands of games the program plays against itself, the task T to be playing chess, and the performance measure P to be the probability of winning when it plays against new opponents.

Supervised Learning

Supervised learning: each example in the dataset has features (feature X) and a label (target Y).

We will later cover an algorithm called the support vector machine, which uses a clever mathematical trick that lets the computer handle an effectively infinite number of features.

Unsupervised Learning

Unsupervised learning: the dataset contains only features (feature X), with no labels.

Clustering algorithms: for example, separating audio sources recorded at different distances (the cocktail party problem), or deciding whether an email is spam, etc.

[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
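The one-liner above is the Octave snippet from the course (used for the cocktail-party audio separation demo). As a separate, minimal illustration of the clustering idea itself, here is a tiny 1-D k-means sketch in Python; the data points and k = 2 are made up for illustration:

```python
# Toy 1-D k-means: two made-up groups of points, k = 2 clusters.
points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
centroids = [0.0, 10.0]  # deliberately rough initial guesses

for _ in range(10):  # a few assignment/update rounds suffice on this data
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda j: abs(p - centroids[j]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    # (No empty-cluster handling here -- fine for this toy data only.)
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # → [1.0, 9.0]
```

The centroids land on the means of the two obvious groups; real uses need multiple random restarts and empty-cluster handling.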

Self-Supervised Learning

Explanation 1: Self-supervised learning lets us obtain high-quality representations without large-scale annotated data. Instead, we use large amounts of unlabeled data and optimize predefined pretext tasks. The learned features can then be used for new tasks that lack data.

Explanation 2: Self-supervised learning is a kind of unsupervised learning whose main purpose is to learn a general feature representation for downstream tasks. The main approach is to let the data supervise itself: for example, remove a few words from a paragraph and predict the missing words from their context, or remove parts of an image and predict the missing patch from the surrounding information.
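As a toy illustration of the masked-word pretext task (the corpus and the counting "model" here are made up, and are far simpler than anything used in practice), the label really does come from the unlabeled text itself:

```python
from collections import Counter, defaultdict

# Unlabeled "corpus" (made up). Pretext task: hide one word and predict
# it from its immediate neighbors -- no human annotation needed.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog sat on the rug",
    "the cat ate the fish",
]

# Count which word appears between each (left, right) neighbor pair.
context_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words) - 1):
        context_counts[(words[i - 1], words[i + 1])][words[i]] += 1

def predict_masked(left, right):
    """Guess the hidden word as the most frequent filler for this context."""
    counts = context_counts.get((left, right))
    return counts.most_common(1)[0][0] if counts else None

print(predict_masked("the", "sat"))  # → cat
```

The point is only the supervision pattern: inputs and targets are both carved out of raw data, exactly as in masked-language-model pretraining.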

Effect:

Learn useful information from unlabeled data, to be used in subsequent tasks.

Self-supervised tasks (also known as pretext tasks) require us to define a supervised loss function. However, we usually do not care about the final performance on the pretext task itself. In fact, we are only interested in the intermediate representations that are learned; we expect these representations to capture good semantic or structural meaning and to benefit a variety of downstream practical tasks.

Concrete understanding

Linear Regression with One Variable

Univariate linear regression:

  • One possible hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$. Because it contains only one feature/input variable, this kind of problem is called univariate linear regression.

Selling a house: given the prices of houses sold previously, use this dataset to predict the price at which your friend's house can be sold.

The Training Set is described with the following notation:

  • $m$ — the number of training examples
  • $x$ — the feature / input variable
  • $y$ — the target / output variable
  • $(x, y)$ — one training example
  • $(x^{(i)}, y^{(i)})$ — the $i$-th training example
  • $h$ — the hypothesis: the solution or function produced by the learning algorithm
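Putting this notation into code — a minimal sketch with a made-up training set of m = 3 examples (the values are illustrative only):

```python
# Made-up training set: (x^(i), y^(i)) pairs, e.g. (house size, price).
training_set = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
m = len(training_set)  # number of training examples

def h(theta0, theta1, x):
    """The hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# With theta0 = 1 and theta1 = 1, h reproduces every y in this toy set.
predictions = [h(1.0, 1.0, x) for x, _ in training_set]
print(m, predictions)  # → 3 [2.0, 3.0, 4.0]
```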

Cost Function

The cost function is also called the squared-error function, or sometimes the squared-error cost function. We take the sum of the squared errors because, for most problems, and especially regression problems, this is a reasonable choice. Other cost functions also work well, but the squared-error cost function is probably the most common choice for regression problems.

The cost function lets us choose, for $h_\theta(x) = \theta_0 + \theta_1 x$, the parameters $\theta_0$ and $\theta_1$ that make the line fit the data as well as possible.

Our goal is to select the model parameters that minimize the sum of squares of modeling errors .

  • That is, find $\theta_0$, $\theta_1$ that minimize the cost function $J(\theta_0, \theta_1) = \frac{1}{2m}\sum\limits_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$.
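A direct transcription of this cost function in Python (the data points are made up; they lie exactly on y = x + 1, so the best fit has zero cost):

```python
def h(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, data):
    """Squared-error cost J(theta0, theta1) over a list of (x, y) pairs."""
    m = len(data)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in data) / (2 * m)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]  # made-up points on y = x + 1
print(cost(1.0, 1.0, data))  # → 0.0  (a perfect fit gives zero cost)
print(cost(0.0, 0.0, data) > 0)  # → True (any other line costs more here)
```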

Visualizing the relationship between $\theta_0$, $\theta_1$, and $J(\theta_0, \theta_1)$:

To see how the global minimum of the cost function arises, first simplify by fixing $\theta_0 = 0$.

Assigning successive values to $\theta_1$ and computing the corresponding $J(\theta_1)$ yields the curve relating $J(\theta_1)$ to $\theta_1$.

Contour plot: a choice such as $\theta_0 = 360$, $\theta_1 = 0$ corresponds to one point in the contour map of $J(\theta_0, \theta_1)$.
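The one-dimensional visualization ($\theta_0 = 0$, varying $\theta_1$) can be reproduced numerically; the data here are made-up points on y = 2x:

```python
# Sweep theta1 with theta0 fixed at 0 and record J(theta1), as in the
# course's 1-D "bowl" plot. Data points (made up) lie exactly on y = 2x.
def cost1(theta1, data):
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
grid = [i / 10 for i in range(0, 41)]            # theta1 from 0.0 to 4.0
best_cost, best_theta1 = min((cost1(t, data), t) for t in grid)
print(best_theta1, best_cost)  # → 2.0 0.0 (minimum sits at the true slope)
```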

Gradient Descent

Gradient descent: an algorithm for finding the $\theta_0$, $\theta_1$ that minimize the cost function $J(\theta_0, \theta_1)$.

The idea behind gradient descent: start with a random combination of parameters $\left( \theta_0, \theta_1, \ldots, \theta_n \right)$, compute the cost function, and then look for the next parameter combination that decreases the cost function the most. Keep doing this until a local minimum is reached. Because we have not tried all parameter combinations, we cannot be sure that this local minimum is the global minimum; different choices of initial parameters may lead to different local minima.

Gradient descent algorithm

  • $\alpha$ is the learning rate: it determines how large a step we take downhill in the direction in which the cost function decreases fastest. In batch gradient descent, every step subtracts from each parameter the learning rate times the partial derivative of the cost function.
  • The update on the right — compute all the new values first, then assign them simultaneously — is correct; the sequential update on the left is wrong.

The gradient descent algorithm is as follows:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$

Description: assign to $\theta_j$ a new value so that $J(\theta)$ moves in the direction of steepest gradient descent; keep iterating until a local minimum is reached. Here $\alpha$ is the learning rate: it determines how large a step we take in the direction in which the cost function decreases fastest.

If we consider only $\theta_1$ (with $\theta_0 = 0$), the update term is $\alpha \frac{\partial}{\partial \theta_1} J(\theta)$: the learning rate in front, followed by the derivative of the cost function $J(\theta)$ with respect to $\theta_1$.

Consider the graph of the cost function $J(\theta_1)$ against $\theta_1$:

If the learning rate is too small, many more iterations are needed to converge.

If the learning rate is too large, the update may overshoot the local minimum, oscillating back and forth and diverging from the minimum point.
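A quick numerical illustration of both failure modes, using the simple 1-D cost $J(\theta) = \theta^2$ (gradient $2\theta$) rather than the regression cost:

```python
# Illustrative only: gradient descent on J(theta) = theta^2, whose
# gradient is 2*theta, with a small versus an overly large learning rate.
def descend(alpha, theta=1.0, steps=30):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # theta := theta - alpha * J'(theta)
    return theta

small = descend(alpha=0.1)   # each step multiplies theta by 0.8: converges
large = descend(alpha=1.1)   # each step multiplies theta by -1.2: diverges
print(abs(small) < 1e-2, abs(large) > 100)  # → True True
```

With $\alpha = 0.1$ the iterate shrinks steadily toward the minimum at 0; with $\alpha = 1.1$ it flips sign each step and grows in magnitude, the oscillating divergence described above.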

Now combine gradient descent with the cost function, and apply it to the concrete linear regression algorithm for fitting a straight line.

Gradient descent algorithm and linear regression algorithm are shown in the figure below :

Applying gradient descent to our earlier linear regression problem, the key is to work out the derivative of the cost function, namely:

$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

When $j = 0$: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

When $j = 1$: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$

Then the algorithm is rewritten as:

Repeat {

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$

}
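The Repeat loop above can be sketched directly in Python; the dataset (points on y = 2x + 1) and the hyperparameters are made up for illustration:

```python
# Minimal batch gradient descent for univariate linear regression,
# following the update rule above. Synthetic data: y = 2x + 1.
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
m = len(data)
theta0, theta1, alpha = 0.0, 0.0, 0.1

for _ in range(2000):  # Repeat { ... }
    # Batch: each step sums the error over ALL m training examples.
    err = [(theta0 + theta1 * x) - y for x, y in data]
    grad0 = sum(err) / m
    grad1 = sum(e * x for e, (x, _) in zip(err, data)) / m
    # Simultaneous update: both gradients are computed before assignment.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(round(theta0, 3), round(theta1, 3))  # → 1.0 2.0
```

The tuple assignment on the last line of the loop is what makes the update simultaneous, matching the "compute all values, then assign" rule noted earlier.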

Reference: https://zhuanlan.zhihu.com/p/328261042

Batch gradient descent: every step of gradient descent uses the entire training set.

Summary

1. Hypothesis function (Hypothesis)

A linear function is used to fit the sample data set; it can be defined simply as:

$h_\theta(x) = \theta_0 + \theta_1 x$

where $\theta_0$ and $\theta_1$ are the parameters.

2. Cost function (Cost Function)

Measures the "loss" of a hypothesis function; also known as the "squared-error function" (Squared Error Function), defined as:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

That is, sum the squares of the differences between the hypothesis's predictions and the true values over all samples, and divide by $2m$ to get an average "loss". Our task is to find the $\theta_0$ and $\theta_1$ that make this "loss" minimal.

3. Gradient descent (Gradient Descent)

Gradient: the direction in which the directional derivative of a function at a point is maximal, i.e., the direction of the greatest rate of change (slope) at that point.

Gradient descent: move the parameters $\theta_0$ and $\theta_1$ in the direction that makes $J(\theta_0, \theta_1)$ descend fastest, reaching the minimum of $J$ as quickly as possible. It is defined as:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

As taught in Andrew Ng's course, gradient descent requires all parameters to "fall" simultaneously. In practice we take the partial derivative with respect to $\theta_0$ and $\theta_1$ separately: when differentiating with respect to $\theta_1$ we hold $\theta_0$ fixed and treat $\theta_1$ as the variable, and the reverse when differentiating with respect to $\theta_0$.

We know the cost function is $J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, where $h_\theta(x) = \theta_0 + \theta_1 x$. By the chain rule for composite functions, $\frac{\partial J}{\partial \theta_j} = \frac{\partial J}{\partial h_\theta} \cdot \frac{\partial h_\theta}{\partial \theta_j}$, which gives:

$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$

Finally, the course's results:

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$

Copyright notice: this article was created by [Zzu dish]. Please include the original link when reposting: https://yzsam.com/2022/175/202206241450266921.html