[Andrew Ng Notes] Fundamentals of Machine Learning
2022-06-24 21:52:00 【Zzu dish】
Fundamentals of machine learning
What is machine learning?
A program is said to learn from experience E with respect to some task T and performance measure P if, and only if, its performance on T, as measured by P, improves with experience E.
For example, for a program that plays chess against itself: the experience E is the tens of thousands of games of self-practice, the task T is playing chess, and the performance measure P is the probability of winning a game against a new opponent.
Supervised Learning

Supervised learning: the dataset contains not only the features (X) but also the labels (target y).
Later we will discuss an algorithm called the support vector machine, which uses a clever mathematical trick that lets the computer handle an effectively infinite number of features.
Unsupervised Learning

Unsupervised learning: the dataset contains only the features, with no labels.
Clustering-style examples: separating overlapping audio sources recorded at different distances (the cocktail-party problem), deciding whether an email is spam, and so on.
In the lecture, the cocktail-party audio separation is demonstrated with a single line of Octave:
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
Self-Supervised Learning
Explanation 1: Self-supervised learning lets us obtain high-quality representations without large-scale annotated data. Instead, we use large amounts of unlabeled data and optimize a predefined pretext task; the features learned this way can then be used for new tasks that lack data.
Explanation 2: Self-supervised learning is a kind of unsupervised learning whose main goal is to learn a general feature representation for downstream tasks. The basic idea is to let the data supervise itself: for example, remove a few words from a passage and predict the missing words from their context, or remove part of an image and predict the missing patch from the surrounding information.
Purpose: learn useful information from unlabeled data, for use in subsequent tasks.
A self-supervised task (also called a pretext task) asks us to optimize a supervised loss function. However, we usually do not care about the final performance on the pretext task itself. In practice we are only interested in the intermediate representations it learns; we expect these representations to carry good semantic or structural meaning and to benefit a variety of downstream practical tasks.
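To make the word-masking pretext task above concrete, here is a minimal Python sketch; the sentence, the `<MASK>` token, and the `make_pretext_pair` helper are all invented for illustration:

```python
import random

MASK = "<MASK>"  # placeholder token; the name is chosen purely for illustration

def make_pretext_pair(sentence: str, seed: int = 0):
    """Turn a sentence into a (masked input, target word) pretext pair.

    The self-supervised job would be to predict the target word from the
    surrounding context -- no human labels are needed.
    """
    words = sentence.split()
    rng = random.Random(seed)
    idx = rng.randrange(len(words))   # pick one word to hide
    target = words[idx]
    masked = words.copy()
    masked[idx] = MASK
    return " ".join(masked), target

masked_input, target = make_pretext_pair("self supervised learning builds labels from the data itself")
print(masked_input)  # the sentence with one word replaced by <MASK>
print(target)        # the hidden word the model should predict
```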
Linear Regression with One Variable
Univariate linear regression:
- One possible hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$. Because it contains only one feature (input variable), this kind of problem is called univariate linear regression.
House-price example: given the prices of houses sold previously, predict from that dataset the price at which a friend's house could be sold.
The training-set notation is as follows:

- $m$ represents the number of training examples
- $x$ represents the features / input variable
- $y$ represents the target / output variable
- $(x, y)$ represents one training example
- $(x^{(i)}, y^{(i)})$ represents the $i$-th training example

$h$ represents the solution or function produced by the learning algorithm, also called the hypothesis.
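To make the notation concrete, here is a minimal Python sketch; the housing numbers and the function name are invented for illustration:

```python
# Toy training set: x = house size, y = price (in 1000s). Numbers are illustrative only.
x = [2104, 1416, 1534, 852]
y = [460, 232, 315, 178]

m = len(x)  # m: number of training examples

def h(theta0: float, theta1: float, xi: float) -> float:
    """Hypothesis h_theta(x) = theta0 + theta1 * x for univariate linear regression."""
    return theta0 + theta1 * xi

# (x^(i), y^(i)) is the i-th training example (0-indexed here, 1-indexed in the notes).
print(x[0], y[0])          # first training example
print(h(0.0, 0.2, x[0]))   # prediction for the first example with theta0=0, theta1=0.2
```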
Cost Function
The cost function is also called the squared-error function, or sometimes the squared-error cost function. We use the sum of squared errors because, for most problems, and especially regression problems, the squared-error cost function is a reasonable choice. Other cost functions also work well, but the squared-error cost function is probably the most common choice for regression problems.
The cost function helps us choose, for $h_\theta(x) = \theta_0 + \theta_1 x$, the **parameters** $\theta_0$ and $\theta_1$ so that the resulting line fits the data as well as possible.

Our goal is to select the model parameters that minimize the sum of squared modeling errors.
- That is, choose $\theta_0, \theta_1$ so that the cost function $J(\theta_0, \theta_1) = \frac{1}{2m}\sum\limits_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$ is minimized.
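A minimal Python sketch of this cost function; the toy data (repeated here so the snippet runs on its own) and the `compute_cost` name are invented for illustration:

```python
def compute_cost(theta0: float, theta1: float, x, y) -> float:
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    total = 0.0
    for xi, yi in zip(x, y):
        error = (theta0 + theta1 * xi) - yi   # h_theta(x^(i)) - y^(i)
        total += error ** 2
    return total / (2 * m)

# Example: cost for theta0 = 0, theta1 = 0.2 on the toy housing data.
x = [2104, 1416, 1534, 852]
y = [460, 232, 315, 178]
print(compute_cost(0.0, 0.2, x, y))
```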
Visualizing the relationship between $\theta_0$, $\theta_1$, and $J(\theta_0, \theta_1)$:

For now, to see how the cost function reaches its global minimum, simplify by setting $\theta_0 = 0$.

Assign a range of values to $\theta_1$ and compute the corresponding $J(\theta_1)$, obtaining the relationship between $J(\theta_1)$ and $\theta_1$.

Contour plot: for example, $\theta_0 = 360$, $\theta_1 = 0$ corresponds to one particular position on the contour plot of $J(\theta_0, \theta_1)$.
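A minimal sketch of this idea of sweeping $\theta_1$ and recording $J(\theta_1)$, with $\theta_0$ fixed at 0; the toy data and grid range are assumptions made for illustration:

```python
# Fix theta0 = 0 and sweep theta1 over a grid, recording J(theta1) for each value.
x = [2104, 1416, 1534, 852]   # toy sizes (illustrative)
y = [460, 232, 315, 178]      # toy prices (illustrative)
m = len(x)

def J(theta1: float) -> float:
    """Cost with theta0 fixed at 0: J(theta1) = (1/2m) * sum((theta1*x_i - y_i)^2)."""
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

theta1_grid = [i / 100 for i in range(0, 41)]           # theta1 from 0.00 to 0.40
costs = [J(t1) for t1 in theta1_grid]
best = min(range(len(costs)), key=lambda i: costs[i])   # index of the smallest J on the grid
print(f"approximate minimizer: theta1 = {theta1_grid[best]:.2f}, J = {costs[best]:.2f}")
```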

Gradient Descent
Gradient descent: an algorithm for finding the $\theta_0$, $\theta_1$ that minimize the cost function $J(\theta_0, \theta_1)$.
The idea behind gradient descent is this: start by choosing a random combination of parameters $(\theta_0, \theta_1, \ldots, \theta_n)$ and compute the cost function; then look for the next parameter combination that decreases the cost function the most, and keep doing this until reaching a local minimum. Because we have not tried every combination of parameters, we cannot be sure that this local minimum is the global minimum; different choices of initial parameters may lead to different local minima.

Gradient descent algorithm

- $\alpha$ is the learning rate. It determines how big a step we take in the direction in which the cost function decreases the most. In batch gradient descent, each update subtracts from every parameter the learning rate times the partial derivative of the cost function.

- The right-hand version is correct: assign the new values only after all of them have been computed (simultaneous update); the left-hand version is wrong.
The gradient descent algorithm is as follows:
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
Description: assign values to $\theta$ so that $J(\theta)$ moves in the direction of steepest descent; keep iterating, and eventually a local minimum is reached. Here $\alpha$ is the learning rate, which determines how big a step we take in the direction in which the cost function decreases the most.

If we consider only $\theta_1$ (with $\theta_0 = 0$), the update term is $\alpha \frac{\partial}{\partial \theta_1} J(\theta)$: the first factor is the learning rate, and the second is the derivative of the cost function $J(\theta)$ with respect to $\theta_1$.
The graph of the cost function $J(\theta_1)$ versus $\theta_1$:
If the learning rate is too small, gradient descent needs many more iterations to converge.
If the learning rate is too large, the update may overshoot the local minimum, oscillating back and forth or even moving away from it.
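A minimal sketch of this behavior on the simple one-dimensional cost $J(\theta_1) = \theta_1^2$ (whose derivative is $2\theta_1$); the starting point and learning rates are chosen arbitrarily for illustration:

```python
def descend(theta1: float, alpha: float, steps: int = 10) -> list:
    """Gradient descent on J(theta1) = theta1**2, whose derivative is 2*theta1."""
    path = [theta1]
    for _ in range(steps):
        theta1 = theta1 - alpha * 2 * theta1   # theta1 := theta1 - alpha * dJ/dtheta1
        path.append(theta1)
    return path

print(descend(1.0, alpha=0.01))  # too small: creeps toward the minimum at 0 very slowly
print(descend(1.0, alpha=0.5))   # well chosen: reaches the minimum at 0 immediately
print(descend(1.0, alpha=1.1))   # too large: oscillates and diverges away from 0
```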
Now we combine gradient descent with the cost function and apply it to the concrete linear regression algorithm for fitting a straight line.
The gradient descent algorithm and the linear regression algorithm are shown in the figure below:

Applying gradient descent to our earlier linear regression problem, the key is to work out the partial derivative of the cost function, namely:
$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
For $j=0$: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
For $j=1$: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$
The algorithm can then be rewritten as:
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \right)$
}
Refer to https://zhuanlan.zhihu.com/p/328261042

Batch gradient descent: every gradient descent step uses the entire training set.
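A minimal, self-contained Python sketch of batch gradient descent for univariate linear regression, following the update rules above; the toy data, learning rate, and iteration count are assumptions made for illustration:

```python
def batch_gradient_descent(x, y, alpha=1e-7, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent on the squared-error cost."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Each step uses the entire training set (batch gradient descent).
        errors = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m  # dJ/dtheta1
        # Simultaneous update: compute both gradients before changing either parameter.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy housing data (sizes and prices invented for illustration).
x = [2104, 1416, 1534, 852]
y = [460, 232, 315, 178]
theta0, theta1 = batch_gradient_descent(x, y)
print(theta0, theta1)  # fitted parameters of the straight line
```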
Summary
1. Hypothesis function (Hypothesis)
A linear function is used to fit the sample data set. It can be defined simply as $h_\theta(x) = \theta_0 + \theta_1 x$, where $\theta_0$ and $\theta_1$ are the parameters.
2. Cost function (Cost Function)
The cost function measures the "loss" of a hypothesis; it is also called the squared-error function (Square Error Function) and is defined as

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

In other words, sum the squared differences between the hypothesis values and the true values over all samples and divide by the number of samples $m$ (with a conventional factor of 1/2), obtaining an average "loss". Our task is to find the $\theta_0$ and $\theta_1$ that make this loss smallest.
3. Gradient descent (Gradient Descent)
Gradient: the direction along which the directional derivative of a function at a point is largest, i.e. the direction in which the rate of change (slope) at that point is greatest.
Gradient descent: move the parameters $\theta_0, \theta_1$ along the direction in which $J(\theta_0, \theta_1)$ decreases fastest, so as to reach the minimum of $J$ as quickly as possible. It is defined as

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
In Andrew Ng's course we learned that gradient descent requires all parameters to "descend" simultaneously. So we take the partial derivatives with respect to $\theta_0$ and $\theta_1$ separately: to differentiate with respect to $\theta_0$, hold $\theta_1$ fixed and treat $\theta_0$ as the variable, and likewise the other way around for $\theta_1$.
We know that the cost function is $J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, where $h_\theta(x) = \theta_0 + \theta_1 x$. By the chain rule for composite functions, $\frac{dJ}{d\theta} = \frac{dJ}{du} \cdot \frac{du}{d\theta}$ with $u = h_\theta(x^{(i)}) - y^{(i)}$, which gives

$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$

Finally, this matches the result given in the course:

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$, $\quad \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
