Machine Learning Strong Foundation Program 0-4: An intuitive understanding of Occam's razor and the no free lunch theorem
2022-07-25 13:07:00 【Mr.Winter`】
0 Preface
The Machine Learning Strong Foundation Program focuses on both depth and breadth, deepening the understanding and application of machine learning models. "Deep" means deriving in detail the mathematical principles behind the algorithm models; "broad" means analyzing a variety of machine learning models: decision trees, support vector machines, Bayesian and Markov decision methods, reinforcement learning, and so on.
Details: Machine Learning Strong Foundation Program
In "What is machine learning? What does it have to do with AI?" we mentioned that machine learning is the discipline that studies how to use computation and experience to build models that improve a system's own performance. "Core knowledge of data sets: a walkthrough and analysis of construction methods" explained the concepts related to data (experience). In this section we discuss the ideas behind machine learning models.
1 Occam's razor principle
A model is also called a hypothesis (hypothesis) about the underlying truth. Setting performance and accuracy aside, several hypotheses about the truth can be obtained from a data set, and the set of all such hypotheses is called the hypothesis space (hypothesis space).
Once the representation of the hypothesis, that is, of the model, is fixed, the hypothesis space and its size are determined as well.

The subset of hypotheses in the hypothesis space that are consistent with the training set is called the version space (version space). Because every hypothesis in the version space reaches the correct conclusion on the training data, this equivalence leads to uncertainty in the output of a machine learning algorithm: hypothesis A in the version space may judge a new sample to be a positive example while hypothesis B judges it to be a negative one.
Therefore, any effective machine learning algorithm must prefer certain hypotheses in the version space. This preference is called the inductive bias (inductive bias), and it can be viewed as a heuristic for searching the hypothesis space.
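To make this concrete, here is a minimal sketch. The toy Boolean concept, training points, and variable names are invented purely for illustration (they are not from the original series): it enumerates a small hypothesis space, keeps the version space consistent with two labeled samples, and shows that the surviving hypotheses still disagree on an unseen input, which is exactly why an inductive bias is needed.

```python
# Toy sketch (assumed setup): hypothesis space, version space, and why
# an inductive bias is needed to pick one hypothesis from the version space.
from itertools import product

inputs = list(product([0, 1], repeat=2))       # sample space: 4 binary points
train = {(0, 0): 0, (1, 1): 1}                 # tiny training set
unseen = (0, 1)                                # a sample outside the training set

# hypothesis space: every possible labeling of the 4 inputs (2^4 = 16 hypotheses)
hypothesis_space = [dict(zip(inputs, labels))
                    for labels in product([0, 1], repeat=len(inputs))]

# version space: hypotheses that agree with every training example
version_space = [h for h in hypothesis_space
                 if all(h[x] == y for x, y in train.items())]

print("hypotheses:", len(hypothesis_space), "| version space:", len(version_space))
print("predictions on the unseen sample:", sorted({h[unseen] for h in version_space}))
# Both 0 and 1 appear: hypotheses that are equally consistent with the training
# data disagree off the training set, so the learner must prefer some of them.
```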
These concepts may still feel abstract, so, as usual, let's walk through an example.
Suppose there is a series of data samples on a two-dimensional plane, and we need to build a model that captures the pattern of the data.
We might as well fix a model form, say a polynomial model
$$y=\sum_{i=0}^{n} a_i x^i$$
All polynomial functions then form the hypothesis space of this problem, and the many curves in this space that fit the distribution of the samples together form the version space.

We can regard each of these different curves as a model of this data set and make predictions based on it.
But in the end we cannot hand this many models over to users and let them choose for themselves, which brings us back to the inductive bias introduced above: we must prefer some model in order to decide which one to adopt in the end.
The Occam's razor principle is one such inductive bias. It gives our model selection a direction and can be stated in one sentence:
If there are multiple hypotheses consistent with the observations, choose the simplest one.
In this example, we can choose the curve with the lowest polynomial degree as the final model.
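As a hedged illustration of this preference (the synthetic data, the error tolerance, and the degree range below are assumptions made for the sketch, not part of the original example), one can fit polynomials of increasing degree and keep the lowest degree whose training error is already acceptable; this mirrors "choose the simplest hypothesis consistent with the observations".

```python
# Sketch of Occam's razor as a model-selection rule on toy data:
# among polynomial fits that explain the samples about equally well,
# prefer the one with the lowest degree.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 2 * x + 1 + rng.normal(scale=0.05, size=x.shape)   # roughly linear samples

tolerance = 0.01            # "consistent with the observations" threshold (assumed)
chosen = None
for degree in range(0, 8):  # candidate hypotheses: polynomials of degree 0..7
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    if mse < tolerance:     # the first, i.e. simplest, acceptable hypothesis wins
        chosen = (degree, mse)
        break

degree, mse = chosen
print(f"selected degree {degree}, training MSE {mse:.4f}")   # picks degree 1 here
```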

2 There is no such thing as a free lunch
The saying "there is no free lunch" teaches us that everything has a price. So what does the no free lunch theorem in machine learning actually say?
An example: suppose we have an algorithm A and improve one of its properties, say its computation speed, obtaining algorithm B; some other property must then degrade, say its storage usage increases. Which is better, algorithm A or B?
Can you say that algorithm A is better than algorithm B? You cannot, because in scenarios that demand high computing speed B is better, while in scenarios with strict storage requirements A is better. Improving one aspect of an algorithm therefore comes at a price, and comparing algorithms cannot be divorced from the practical setting.
Now let us formally state the no free lunch theorem: when all problems occur with equal probability, the expected performance of any two algorithms is the same.
Let us sketch a simple proof of this theorem.
Suppose the sample space is $\mathcal{X}$ and the training set is $X$, so the test set is $\mathcal{X}-X$, and the probability of a sample occurring is written $P(\boldsymbol{x})$. Denote the algorithm by $\mathfrak{L}$. How do we measure the performance of this algorithm? Through its expected prediction error on the test set
$$E_{ote}\left( \mathfrak{L} \mid X, f \right)$$
where $f$ is the objective target model on the sample space, which does not bend to human will. The expression above reads: the expected off-training-set performance of the algorithm, given the true target and a finite training set.
Expanding it with the law of total probability,
$$E_{ote}\left( \mathfrak{L} \mid X, f \right) =\sum_h \sum_{\boldsymbol{x}\in \mathcal{X}-X} P\left( \boldsymbol{x} \right) \ell \left( h\left( \boldsymbol{x} \right) \ne f\left( \boldsymbol{x} \right) \right) P\left( h \mid X, \mathfrak{L} \right)$$
where $\ell$ is the performance measure, $h$ is the model produced by the algorithm from the given training set, and $\ell\left( h\left( \boldsymbol{x} \right) \ne f\left( \boldsymbol{x} \right) \right)$ is the loss incurred when the model's prediction disagrees with the ground truth.
Recall the example above: in reality some problems demand high computing speed and others demand low storage. In other words, every problem is possible and all occur with the same probability, so every possible target function $f$ on the sample space is equally likely as well. Summing over these possibilities (that is, taking the expectation over $f$),
$$\sum_f E_{ote}\left( \mathfrak{L} \mid X, f \right) =\sum_f \sum_h \sum_{\boldsymbol{x}\in \mathcal{X}-X} P\left( \boldsymbol{x} \right) \ell \left( h\left( \boldsymbol{x} \right) \ne f\left( \boldsymbol{x} \right) \right) P\left( h \mid X, \mathfrak{L} \right)$$
Changing the order of summation,
$$\sum_f E_{ote}\left( \mathfrak{L} \mid X, f \right) =\sum_{\boldsymbol{x}\in \mathcal{X}-X} P\left( \boldsymbol{x} \right) \sum_h P\left( h \mid X, \mathfrak{L} \right) \sum_f \ell \left( h\left( \boldsymbol{x} \right) \ne f\left( \boldsymbol{x} \right) \right)$$
Without loss of generality, consider a binary classification problem. For any fixed $h$ and $\boldsymbol{x}$, exactly half of all possible target functions $f$ agree with $h(\boldsymbol{x})$ and the other half disagree, so
$$\sum_f \ell \left( h\left( \boldsymbol{x} \right) \ne f\left( \boldsymbol{x} \right) \right) =\frac{1}{2}\ell \left( h\left( \boldsymbol{x} \right) \ne f\left( \boldsymbol{x} \right) \right) +\frac{1}{2}\ell \left( h\left( \boldsymbol{x} \right) = f\left( \boldsymbol{x} \right) \right) =\sigma$$
which is a constant that does not depend on $h$.
The final result is therefore
$$\sum_f E_{ote}\left( \mathfrak{L} \mid X, f \right) =\sigma \sum_{\boldsymbol{x}\in \mathcal{X}-X} P\left( \boldsymbol{x} \right)$$
which has nothing to do with the algorithm itself: summed over all possible target functions, every algorithm has exactly the same expected off-training-set error.
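As a sanity check of this conclusion (a toy enumeration invented for illustration, not part of the original derivation), one can take a tiny binary sample space, enumerate every possible target function $f$, and compare two deliberately different "algorithms"; averaged over all targets, their off-training-set error comes out identical.

```python
# Numerical sketch of the no free lunch argument on an assumed toy setup:
# over ALL possible binary target functions on a small sample space, two
# very different learners have the same average off-training-set error.
from itertools import product

sample_space = [0, 1, 2, 3, 4]        # the whole sample space
train_points = [0, 1]                 # training inputs X
test_points = [x for x in sample_space if x not in train_points]

def learner_constant_zero(train_labels):
    """Ignores the data and always predicts 0."""
    return lambda x: 0

def learner_majority(train_labels):
    """Predicts the majority label seen in training (ties -> 1)."""
    majority = 1 if sum(train_labels.values()) * 2 >= len(train_labels) else 0
    return lambda x: majority

def avg_ote_error(learner):
    total_mistakes, total_predictions = 0, 0
    # enumerate every possible target function f: sample_space -> {0, 1}
    for labels in product([0, 1], repeat=len(sample_space)):
        f = dict(zip(sample_space, labels))
        h = learner({x: f[x] for x in train_points})
        total_mistakes += sum(h(x) != f[x] for x in test_points)
        total_predictions += len(test_points)
    return total_mistakes / total_predictions

print("constant-zero learner:", avg_ote_error(learner_constant_zero))
print("majority learner:     ", avg_ote_error(learner_majority))
# Both print 0.5: averaged over all possible targets, neither learner wins.
```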
Finally, to summarize the no free lunch theorem: it is meaningless to discuss which learning algorithm is better in isolation from a concrete application scenario; whether the inductive bias matches the problem at hand directly determines whether the algorithm can achieve good performance.
3 The ugly duckling theorem
The ugly duckling theorem poses a question:
Do you think the difference between the ugly duckling and a white swan is bigger, or the difference between one white swan and another white swan?
From the perspective of morphology, the ugly duckling and the white swan differ greatly; from a genetic perspective, one white swan and another white swan may differ greatly!
What does this tell us?
It tells us that a computer cannot carry out a machine learning task for us fully automatically and objectively; we must give it a criterion. In the ugly duckling problem, for example, the criterion is morphological comparison or genetic comparison. And as soon as a criterion is introduced, it is necessarily subjective.
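Here is a small hedged sketch of that point (the feature values below are invented purely for illustration): describing the same three birds with a "morphological" feature set and with a "genetic" feature set flips which pair looks most different, so similarity only exists relative to a chosen criterion.

```python
# Sketch of the ugly duckling idea (feature values are made-up assumptions):
# which pair of birds is "most different" depends entirely on the criterion.
import numpy as np

birds = ["ugly_duckling", "white_swan_1", "white_swan_2"]

# criterion 1: morphological features -- the duckling stands out
morphological = {
    "ugly_duckling": np.array([0.2, 0.3]),
    "white_swan_1":  np.array([0.9, 0.8]),
    "white_swan_2":  np.array([0.95, 0.85]),
}

# criterion 2: "genetic" features -- here the two swans differ most
genetic = {
    "ugly_duckling": np.array([0.5, 0.5]),
    "white_swan_1":  np.array([0.1, 0.9]),
    "white_swan_2":  np.array([0.9, 0.1]),
}

def most_different_pair(features):
    pairs = [(a, b) for i, a in enumerate(birds) for b in birds[i + 1:]]
    return max(pairs, key=lambda p: np.linalg.norm(features[p[0]] - features[p[1]]))

print("morphological criterion:", most_different_pair(morphological))
print("genetic criterion:      ", most_different_pair(genetic))
```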
Therefore, no judgment in this world can be completely fair and free of bias.
In machine learning, every algorithm carries its own subjectivity, so there is always room for improvement in how well it captures the objective laws.