Clustering, dimensionality reduction, and metric techniques for machine learning
2022-06-21 19:44:00 【WihauShe】
Clustering
Clustering tasks
The goal of "unsupervised learning" is to reveal the intrinsic properties and laws of the data by learning from unlabeled training samples, providing a basis for further data analysis.
Clustering attempts to divide the samples in a data set into several usually disjoint subsets, each of which is called a "cluster". Through such a division, each cluster may correspond to some potential concept (category). These concepts are unknown to the clustering algorithm in advance; clustering can only form the cluster structure automatically, and the conceptual semantics of each cluster must be grasped and named by the user.
Performance metrics
A clustering performance measure is also called a clustering "validity index".
Categories: comparing the clustering result with a "reference model" gives an "external index"; evaluating the clustering result directly, without any reference model, gives an "internal index". A small computation sketch follows the two lists below.
Commonly used external indices:
- Jaccard Coefficient (JC)
- Fowlkes-Mallows Index (FMI)
- Rand Index (RI)
Commonly used internal indices:
- Davies-Bouldin Index (DBI)
- Dunn Index (DI)
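A minimal sketch, assuming scikit-learn and NumPy are available, of computing a few of the indices above; the reference labels, cluster assignments, and data points are made up for illustration.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import rand_score, fowlkes_mallows_score, davies_bouldin_score

def jaccard_coefficient(y_true, y_pred):
    """Pair-counting Jaccard coefficient: JC = SS / (SS + SD + DS)."""
    ss = sd = ds = 0
    for i, j in combinations(range(len(y_true)), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        if same_pred and same_true:
            ss += 1          # together in both the result and the reference
        elif same_pred:
            sd += 1          # together in the result, apart in the reference
        elif same_true:
            ds += 1          # apart in the result, together in the reference
    return ss / (ss + sd + ds)

y_true = [0, 0, 0, 1, 1, 2]          # labels from a "reference model"
y_pred = [0, 0, 1, 1, 1, 2]          # labels produced by some clustering algorithm

# External indices compare the clustering result against the reference model.
print("JC :", jaccard_coefficient(y_true, y_pred))
print("FMI:", fowlkes_mallows_score(y_true, y_pred))
print("RI :", rand_score(y_true, y_pred))

# Internal indices use only the data and the clustering result itself.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 0]], dtype=float)
print("DBI:", davies_bouldin_score(X, y_pred))
```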
Distance calculation
A distance measure needs to satisfy some basic properties: non-negativity, identity, symmetry, and the triangle inequality. The main measures are the Minkowski distance and its special cases, the Euclidean distance and the Manhattan distance.
Attribute classification:
- Continuous attributes: have an infinite number of possible values in their domain
- Discrete attributes: have a finite number of possible values in their domain
- Ordered attributes: distances can be computed directly on the attribute values
- Unordered attributes: distances cannot be computed directly on the attribute values
Ordered attributes can use the Minkowski distance, while unordered attributes use the Value Difference Metric (VDM), as sketched below.
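A minimal sketch of the two distance measures, in plain NumPy; the toy attribute column and cluster labels are made up for illustration.

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

def vdm(column, labels, a, b, p=2):
    """Value Difference Metric between two values a, b of one unordered attribute.
    `column` holds the attribute value of each sample, `labels` its cluster."""
    column, labels = np.asarray(column), np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        frac_a = np.mean(labels[column == a] == c) if np.any(column == a) else 0.0
        frac_b = np.mean(labels[column == b] == c) if np.any(column == b) else 0.0
        total += abs(frac_a - frac_b) ** p
    return total

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, p=2))   # Euclidean distance
print(minkowski(x, y, p=1))   # Manhattan distance

color = ["red", "red", "blue", "blue", "blue"]   # an unordered attribute
cluster = [0, 1, 0, 0, 1]
print(vdm(color, cluster, "red", "blue"))
```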
Prototype clustering
Prototype clustering is also called "prototype-based clustering". Such algorithms assume that the cluster structure can be characterized by a set of prototypes. Usually the algorithm first constructs the prototypes and then updates and solves them iteratively. Representative methods are listed below, followed by a brief sketch.
- k-means algorithm
- Learning vector quantization
- Gaussian mixture clustering
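A minimal sketch, assuming scikit-learn, of two of the prototype-based methods above on made-up 2-D data: k-means and Gaussian mixture clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three made-up Gaussian blobs as toy data.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means prototypes (centroids):\n", kmeans.cluster_centers_)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("Gaussian mixture prototypes (means):\n", gmm.means_)
print("first 5 cluster labels:", gmm.predict(X[:5]))
```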
Density clustering
Density clustering is also called "density-based clustering". Such algorithms assume that the cluster structure can be determined by how tightly the samples are distributed. Usually, a density clustering algorithm examines the connectivity between samples from the perspective of sample density, and expands clusters based on density-connected samples to obtain the final clustering result. 【DBSCAN】
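A minimal DBSCAN sketch, assuming scikit-learn; the eps and min_samples values are illustrative, not tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.2, (40, 2)),
               rng.normal([4, 4], 0.2, (40, 2)),
               rng.uniform(-2, 6, (10, 2))])      # a few scattered "noise" points

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# Label -1 marks samples that are not density-reachable from any core point.
print("clusters found:", set(db.labels_) - {-1})
print("noise points:", int(np.sum(db.labels_ == -1)))
```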
Hierarchical clustering
Hierarchical clustering divides the data set at different levels, forming a tree-like cluster structure. The data set can be divided either with a "bottom-up" agglomerative strategy or with a "top-down" divisive strategy. 【AGNES algorithm】
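A minimal sketch of bottom-up (AGNES-style) agglomerative clustering, assuming scikit-learn; the average linkage used here is just one common choice.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.3, (30, 2)),
               rng.normal([5, 0], 0.3, (30, 2)),
               rng.normal([0, 5], 0.3, (30, 2))])

# Start with every sample as its own cluster and repeatedly merge the two
# closest clusters until the requested number of clusters remains.
agnes = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print("cluster sizes:", np.bincount(agnes.labels_))
```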
Dimensionality reduction and metric learning
k-nearest neighbor learning
The working mechanism of k-nearest neighbor learning: given a test sample, find the k training samples closest to it according to some distance measure, and then make a prediction based on the information of these k "neighbors".
- "Voting method": take the most frequent class label among the k samples as the prediction
- "Averaging method": take the mean of the real-valued output labels of the k samples as the prediction
- "Lazy learning": in the training phase, simply store the samples (training time cost is zero) and defer processing until a test sample is received
- "Eager learning": learn from the samples during the training phase
Although the nearest-neighbor classifier is simple, its generalization error rate is no more than twice that of the Bayes optimal classifier.
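A minimal k-nearest neighbor sketch, assuming scikit-learn: the classifier predicts by voting, the regressor by averaging; the data, labels, and k are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (100, 2))
y_class = (X[:, 0] + X[:, 1] > 10).astype(int)   # made-up class labels
y_real = X[:, 0] * 2.0 + X[:, 1]                 # made-up real-valued targets

# "Lazy learning": fitting only stores the training samples.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)   # voting method
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_real)     # averaging method

x_test = np.array([[4.0, 7.0]])
print("predicted class:", clf.predict(x_test))
print("predicted value:", reg.predict(x_test))
```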
Low-dimensional embedding
Curse of dimensionality: in high-dimensional settings, data samples are sparse and distance computation is difficult.
Dimensionality reduction: transform the original high-dimensional attribute space into a low-dimensional "subspace" through some mathematical transformation; in this subspace the sample density is greatly increased and distance computation becomes much easier.
Low-dimensional embedding: in many cases, although the observed or collected data samples are high-dimensional, perhaps only a low-dimensional distribution is closely related to the learning task.
Principal component analysis
For sample points in an orthogonal attribute space, use a hyperplane (the high-dimensional generalization of a line) to properly express all samples:
- Minimum reconstruction error: sample points are sufficiently close to the hyperplane
- Maximum separability: the projections of the sample points onto the hyperplane are separated as much as possible
PCA only needs to keep the projection matrix W and the mean vector of the samples; a new sample can then be projected into the low-dimensional space by a simple vector subtraction and a matrix-vector multiplication. Discarding part of the information increases the sampling density of the samples and can also have a certain denoising effect.
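A minimal PCA sketch in plain NumPy, illustrating that only the projection matrix W and the sample mean need to be kept; the data and target dimensionality are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))                  # 200 samples, 5 attributes
d_prime = 2                                    # target dimensionality

mean = X.mean(axis=0)
Xc = X - mean                                  # center the samples
cov = Xc.T @ Xc / len(Xc)                      # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :d_prime]              # top-d' eigenvectors as columns

# Projecting a new sample needs only a subtraction and a matrix-vector product.
x_new = rng.normal(size=5)
z_new = W.T @ (x_new - mean)
print("low-dimensional representation:", z_new)
```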
Kernelized linear dimensionality reduction
"Intrinsic" low-dimensional space: the low-dimensional space from which the data were "originally sampled".
Nonlinear dimensionality reduction: "kernelize" linear dimensionality reduction methods based on the kernel trick.
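A minimal kernelized dimensionality reduction sketch, assuming scikit-learn's KernelPCA; the RBF kernel, gamma value, and noisy-circle data are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] * rng.uniform(1, 1.1, (200, 1))  # a noisy circle

# Linear PCA cannot "unroll" such nonlinear structure; a kernelized variant can.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
Z = kpca.fit_transform(X)
print("embedded shape:", Z.shape)
```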
Manifold learning
A "manifold" is a space that is locally homeomorphic to Euclidean space. Locally it has the properties of Euclidean space, so the Euclidean distance can be used to compute distances.
If a low-dimensional manifold is embedded into a high-dimensional space, the distribution of the data samples in the high-dimensional space looks very complex, but locally it still has the properties of Euclidean space. Therefore, it is easy to establish a dimensionality-reduction mapping locally and then try to extend the local mapping to a global one.
Isometric mapping (Isomap)
Basic starting point: after a low-dimensional manifold is embedded into a high-dimensional space, computing the straight-line distance directly in the high-dimensional space is misleading, because the straight-line distance in the high-dimensional space may not be traversable on the low-dimensional embedded manifold.
We can use the fact that the manifold is locally homeomorphic to Euclidean space: for each point, find its neighboring points based on the Euclidean distance and build a nearest-neighbor connection graph, in which nearest neighbors are connected and non-neighboring points are not. The problem of computing the geodesic distance between two points is thus transformed into computing the shortest path between the two points on the nearest-neighbor connection graph.
Construction of the nearest-neighbor graph: either specify the number of nearest neighbors, or specify a distance threshold (points closer than the threshold are regarded as nearest neighbors).
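A minimal Isomap sketch, assuming scikit-learn; n_neighbors controls the nearest-neighbor graph described above, and the S-curve data set is a stand-in.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

X, _ = make_s_curve(n_samples=500, random_state=0)   # 3-D points on a 2-D manifold

# Geodesic distances are approximated by shortest paths on the k-NN graph,
# then the points are embedded into 2 dimensions.
iso = Isomap(n_neighbors=10, n_components=2)
Z = iso.fit_transform(X)
print("embedded shape:", Z.shape)
```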
Locally linear embedding
Locally linear embedding attempts to preserve the linear relationships between samples within a neighborhood.
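A minimal locally linear embedding sketch, assuming scikit-learn; the same S-curve stand-in data and an illustrative neighborhood size are used.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=500, random_state=0)

# Each sample is reconstructed as a linear combination of its neighbors, and
# those reconstruction weights are preserved in the low-dimensional embedding.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Z = lle.fit_transform(X)
print("embedded shape:", Z.shape)
```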
Metric learning
Basic motivation: "learn" a suitable distance measure.
Mahalanobis distance: dist_mah^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)
where M is also called the "metric matrix", and metric learning is the learning of M. To keep the distance non-negative and symmetric, M must be a (semi-)positive definite symmetric matrix, i.e., there must exist an orthogonal basis P such that M can be written as M = PP^T.
Not only can the error rate be used as the optimization objective of metric learning; domain knowledge can also be introduced into metric learning.
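A minimal sketch of the (squared) Mahalanobis distance with a metric matrix M = PP^T, in plain NumPy; P here is random, standing in for whatever a metric-learning algorithm would actually produce.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
P = rng.normal(size=(d, d))
M = P @ P.T                                   # positive semi-definite by construction

def mahalanobis_sq(xi, xj, M):
    """Squared Mahalanobis distance (xi - xj)^T M (xi - xj)."""
    diff = np.asarray(xi) - np.asarray(xj)
    return float(diff @ M @ diff)

xi, xj = rng.normal(size=d), rng.normal(size=d)
print(mahalanobis_sq(xi, xj, M))              # non-negative for any PSD M
print(mahalanobis_sq(xi, xj, np.eye(d)))      # M = I recovers the squared Euclidean distance
```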