Clustering, dimensionality reduction, and metric techniques for machine learning
2022-06-21 19:44:00 【WihauShe】
Clustering
Clustering tasks
The goal of "unsupervised learning" is to reveal the intrinsic properties and laws of the data by learning from unlabeled training samples, providing a basis for further data analysis.
Clustering attempts to divide the samples in a data set into several usually disjoint subsets, each of which is called a "cluster". Through such a division, each cluster may correspond to some potential concept (category). These concepts are unknown to the clustering algorithm in advance; clustering can only form the cluster structure automatically, and the conceptual semantics of each cluster must be grasped and named by the user.
Performance metrics
A clustering performance measure is also called a clustering "validity index".
Categories: comparing the clustering result with a "reference model" gives an "external index"; evaluating the clustering result directly, without any reference model, gives an "internal index". A small computation sketch follows the two lists below.
Commonly used external indices:
- Jaccard Coefficient (JC)
- Fowlkes-Mallows Index (FMI)
- Rand Index (RI)
Commonly used internal indices:
- Davies-Bouldin Index (DBI)
- Dunn Index (DI)
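A minimal sketch, assuming scikit-learn and NumPy are available, of computing a few of the indices above; the reference labels, cluster assignments, and data points are made up for illustration.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import rand_score, fowlkes_mallows_score, davies_bouldin_score

def jaccard_coefficient(y_true, y_pred):
    """Pair-counting Jaccard coefficient: JC = SS / (SS + SD + DS)."""
    ss = sd = ds = 0
    for i, j in combinations(range(len(y_true)), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        if same_pred and same_true:
            ss += 1          # together in both the result and the reference
        elif same_pred:
            sd += 1          # together in the result, apart in the reference
        elif same_true:
            ds += 1          # apart in the result, together in the reference
    return ss / (ss + sd + ds)

y_true = [0, 0, 0, 1, 1, 2]          # labels from a "reference model"
y_pred = [0, 0, 1, 1, 1, 2]          # labels produced by some clustering algorithm

# External indices compare the clustering result against the reference model.
print("JC :", jaccard_coefficient(y_true, y_pred))
print("FMI:", fowlkes_mallows_score(y_true, y_pred))
print("RI :", rand_score(y_true, y_pred))

# Internal indices use only the data and the clustering result itself.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 0]], dtype=float)
print("DBI:", davies_bouldin_score(X, y_pred))
```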
Distance calculation
A distance measure needs to satisfy some basic properties: non-negativity, identity, symmetry, and the triangle inequality. The main measures are the Minkowski distance and its special cases, the Euclidean distance and the Manhattan distance.
Attribute classification:
- Continuous attributes: have an infinite number of possible values in their domain
- Discrete attributes: have a finite number of possible values in their domain
- Ordered attributes: distances can be computed directly on the attribute values
- Unordered attributes: distances cannot be computed directly on the attribute values
Ordered attributes can use the Minkowski distance, while unordered attributes use the Value Difference Metric (VDM), as sketched below.
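A minimal sketch of the two distance measures, in plain NumPy; the toy attribute column and cluster labels are made up for illustration.

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance; p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

def vdm(column, labels, a, b, p=2):
    """Value Difference Metric between two values a, b of one unordered attribute.
    `column` holds the attribute value of each sample, `labels` its cluster."""
    column, labels = np.asarray(column), np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        frac_a = np.mean(labels[column == a] == c) if np.any(column == a) else 0.0
        frac_b = np.mean(labels[column == b] == c) if np.any(column == b) else 0.0
        total += abs(frac_a - frac_b) ** p
    return total

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, p=2))   # Euclidean distance
print(minkowski(x, y, p=1))   # Manhattan distance

color = ["red", "red", "blue", "blue", "blue"]   # an unordered attribute
cluster = [0, 1, 0, 0, 1]
print(vdm(color, cluster, "red", "blue"))
```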
Prototype clustering
Prototype clustering is also called "prototype-based clustering". Such algorithms assume that the cluster structure can be characterized by a set of prototypes. Usually the algorithm first constructs the prototypes and then updates and solves them iteratively. Representative methods are listed below, followed by a brief sketch.
- k-means algorithm
- Learning vector quantization
- Gaussian mixture clustering
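A minimal sketch, assuming scikit-learn, of two of the prototype-based methods above on made-up 2-D data: k-means and Gaussian mixture clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three made-up Gaussian blobs as toy data.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means prototypes (centroids):\n", kmeans.cluster_centers_)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("Gaussian mixture prototypes (means):\n", gmm.means_)
print("first 5 cluster labels:", gmm.predict(X[:5]))
```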
Density clustering
Density clustering is also called "density-based clustering". Such algorithms assume that the cluster structure can be determined by how tightly the samples are distributed. Usually, a density clustering algorithm examines the connectivity between samples from the perspective of sample density, and expands clusters based on density-connected samples to obtain the final clustering result. 【DBSCAN】
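A minimal DBSCAN sketch, assuming scikit-learn; the eps and min_samples values are illustrative, not tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.2, (40, 2)),
               rng.normal([4, 4], 0.2, (40, 2)),
               rng.uniform(-2, 6, (10, 2))])      # a few scattered "noise" points

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# Label -1 marks samples that are not density-reachable from any core point.
print("clusters found:", set(db.labels_) - {-1})
print("noise points:", int(np.sum(db.labels_ == -1)))
```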
Hierarchical clustering
Hierarchical clustering divides the data set at different levels, forming a tree-like cluster structure. The data set can be divided either with a "bottom-up" agglomerative strategy or with a "top-down" divisive strategy. 【AGNES algorithm】
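A minimal sketch of bottom-up (AGNES-style) agglomerative clustering, assuming scikit-learn; the average linkage used here is just one common choice.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.3, (30, 2)),
               rng.normal([5, 0], 0.3, (30, 2)),
               rng.normal([0, 5], 0.3, (30, 2))])

# Start with every sample as its own cluster and repeatedly merge the two
# closest clusters until the requested number of clusters remains.
agnes = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)
print("cluster sizes:", np.bincount(agnes.labels_))
```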
Dimensionality reduction and metric learning
k-nearest neighbor learning
The working mechanism of k-nearest neighbor learning: given a test sample, find the k training samples closest to it according to some distance measure, and then make a prediction based on the information of these k "neighbors".
- "Voting method": take the most frequent class label among the k samples as the prediction
- "Averaging method": take the mean of the real-valued output labels of the k samples as the prediction
- "Lazy learning": in the training phase, simply store the samples (training time cost is zero) and defer processing until a test sample is received
- "Eager learning": learn from the samples during the training phase
Although the nearest-neighbor classifier is simple, its generalization error rate is no more than twice that of the Bayes optimal classifier.
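A minimal k-nearest neighbor sketch, assuming scikit-learn: the classifier predicts by voting, the regressor by averaging; the data, labels, and k are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (100, 2))
y_class = (X[:, 0] + X[:, 1] > 10).astype(int)   # made-up class labels
y_real = X[:, 0] * 2.0 + X[:, 1]                 # made-up real-valued targets

# "Lazy learning": fitting only stores the training samples.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)   # voting method
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_real)     # averaging method

x_test = np.array([[4.0, 7.0]])
print("predicted class:", clf.predict(x_test))
print("predicted value:", reg.predict(x_test))
```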
Low-dimensional embedding
Curse of dimensionality: in high-dimensional settings, data samples are sparse and distance computation is difficult.
Dimensionality reduction: transform the original high-dimensional attribute space into a low-dimensional "subspace" through some mathematical transformation; in this subspace the sample density is greatly increased and distance computation becomes much easier.
Low-dimensional embedding: in many cases, although the observed or collected data samples are high-dimensional, perhaps only a low-dimensional distribution is closely related to the learning task.
Principal component analysis
For sample points in an orthogonal attribute space, use a hyperplane (the high-dimensional generalization of a line) to properly express all samples:
- Minimum reconstruction error: sample points are sufficiently close to the hyperplane
- Maximum separability: the projections of the sample points onto the hyperplane are separated as much as possible
PCA only needs to keep the projection matrix W and the mean vector of the samples; a new sample can then be projected into the low-dimensional space by a simple vector subtraction and a matrix-vector multiplication. Discarding part of the information increases the sampling density of the samples and can also have a certain denoising effect.
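A minimal PCA sketch in plain NumPy, illustrating that only the projection matrix W and the sample mean need to be kept; the data and target dimensionality are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))                  # 200 samples, 5 attributes
d_prime = 2                                    # target dimensionality

mean = X.mean(axis=0)
Xc = X - mean                                  # center the samples
cov = Xc.T @ Xc / len(Xc)                      # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :d_prime]              # top-d' eigenvectors as columns

# Projecting a new sample needs only a subtraction and a matrix-vector product.
x_new = rng.normal(size=5)
z_new = W.T @ (x_new - mean)
print("low-dimensional representation:", z_new)
```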
Kernelized linear dimensionality reduction
"Intrinsic" low-dimensional space: the low-dimensional space from which the data were "originally sampled".
Nonlinear dimensionality reduction: "kernelize" linear dimensionality reduction methods based on the kernel trick.
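A minimal kernelized dimensionality reduction sketch, assuming scikit-learn's KernelPCA; the RBF kernel, gamma value, and noisy-circle data are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] * rng.uniform(1, 1.1, (200, 1))  # a noisy circle

# Linear PCA cannot "unroll" such nonlinear structure; a kernelized variant can.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
Z = kpca.fit_transform(X)
print("embedded shape:", Z.shape)
```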
Manifold learning
A "manifold" is a space that is locally homeomorphic to Euclidean space. Locally it has the properties of Euclidean space, so the Euclidean distance can be used to compute distances.
If a low-dimensional manifold is embedded into a high-dimensional space, the distribution of the data samples in the high-dimensional space looks very complex, but locally it still has the properties of Euclidean space. Therefore, it is easy to establish a dimensionality-reduction mapping locally and then try to extend the local mapping to a global one.
Isometric mapping (Isomap)
Basic starting point: after a low-dimensional manifold is embedded into a high-dimensional space, computing the straight-line distance directly in the high-dimensional space is misleading, because the straight-line distance in the high-dimensional space may not be traversable on the low-dimensional embedded manifold.
We can use the fact that the manifold is locally homeomorphic to Euclidean space: for each point, find its neighboring points based on the Euclidean distance and build a nearest-neighbor connection graph, in which nearest neighbors are connected and non-neighboring points are not. The problem of computing the geodesic distance between two points is thus transformed into computing the shortest path between the two points on the nearest-neighbor connection graph.
Construction of the nearest-neighbor graph: either specify the number of nearest neighbors, or specify a distance threshold (points closer than the threshold are regarded as nearest neighbors).
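A minimal Isomap sketch, assuming scikit-learn; n_neighbors controls the nearest-neighbor graph described above, and the S-curve data set is a stand-in.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

X, _ = make_s_curve(n_samples=500, random_state=0)   # 3-D points on a 2-D manifold

# Geodesic distances are approximated by shortest paths on the k-NN graph,
# then the points are embedded into 2 dimensions.
iso = Isomap(n_neighbors=10, n_components=2)
Z = iso.fit_transform(X)
print("embedded shape:", Z.shape)
```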
Locally linear embedding
Locally linear embedding attempts to preserve the linear relationships between samples within a neighborhood.
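A minimal locally linear embedding sketch, assuming scikit-learn; the same S-curve stand-in data and an illustrative neighborhood size are used.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=500, random_state=0)

# Each sample is reconstructed as a linear combination of its neighbors, and
# those reconstruction weights are preserved in the low-dimensional embedding.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Z = lle.fit_transform(X)
print("embedded shape:", Z.shape)
```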
Metric learning
Basic motivation: "learn" a suitable distance measure.
Mahalanobis distance: dist_mah^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)
where M is also called the "metric matrix", and metric learning is the learning of M. To keep the distance non-negative and symmetric, M must be a (semi-)positive definite symmetric matrix, i.e., there must exist an orthogonal basis P such that M can be written as M = PP^T.
Not only can the error rate be used as the optimization objective of metric learning; domain knowledge can also be introduced into metric learning.
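A minimal sketch of the (squared) Mahalanobis distance with a metric matrix M = PP^T, in plain NumPy; P here is random, standing in for whatever a metric-learning algorithm would actually produce.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
P = rng.normal(size=(d, d))
M = P @ P.T                                   # positive semi-definite by construction

def mahalanobis_sq(xi, xj, M):
    """Squared Mahalanobis distance (xi - xj)^T M (xi - xj)."""
    diff = np.asarray(xi) - np.asarray(xj)
    return float(diff @ M @ diff)

xi, xj = rng.normal(size=d), rng.normal(size=d)
print(mahalanobis_sq(xi, xj, M))              # non-negative for any PSD M
print(mahalanobis_sq(xi, xj, np.eye(d)))      # M = I recovers the squared Euclidean distance
```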