What is cluster analysis? Categories of cluster analysis methods [easy to understand]
2022-07-25 20:05:00 【Full stack programmer webmaster】
Hello everyone, it's good to see you again. I'm your friend, Quan Jun.
Cluster analysis is the process of grouping a set of data objects into multiple classes, each made up of similar objects.
Basic concepts
Clustering is a technique for discovering the internal structure of data. It organizes all data instances into groups of similar instances, and these groups are called clusters. Data instances in the same cluster are similar to each other, while instances in different clusters differ from each other.
Clustering is often called unsupervised learning. Unlike supervised learning, the data to be clustered carries no class or group labels indicating which category each instance belongs to.
The similarity between data objects is judged by defining a distance or a similarity coefficient. Figure 1 shows an example of clustering by the distance between data objects: objects that are close to each other are placed in the same cluster.
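As a minimal sketch of the two kinds of measure mentioned above, the following computes a Euclidean distance and a cosine similarity coefficient. The function names and sample points are illustrative assumptions, not part of the original article.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical sample points.
p, q, r = (1.0, 2.0), (4.0, 6.0), (2.0, 4.0)
print(euclidean_distance(p, q))  # 5.0
print(cosine_similarity(p, r))   # very close to 1.0: r points the same way as p
```

A small distance or a large similarity coefficient both indicate that two objects are candidates for the same cluster; which measure is appropriate depends on the data.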
Figure 1 Schematic diagram of cluster analysis
Cluster analysis can be applied during data preprocessing. For multidimensional data with a complex structure, it can be used to aggregate the data and standardize the complex structure.
Cluster analysis can also be used to find dependencies between data items, so that closely dependent items can be removed or merged. It can also serve as a preprocessing step for other data mining methods, such as association rules and rough set methods.
In business, cluster analysis is an effective tool for market segmentation. It is used to find different customer groups and describe their characteristics, to study consumer behavior, and to look for new potential markets.
In biology, cluster analysis is used to classify animals, plants, and genes in order to understand the inherent structure of populations.
In the insurance industry, cluster analysis can identify groups of automobile insurance policyholders by their average spending, and can identify groups of real estate in a city by residence type, value, and geographical location.
In Internet applications, cluster analysis is used to classify documents on the web.
In e-commerce, cluster analysis groups customers with similar browsing behavior and analyzes their common characteristics, helping e-commerce companies understand their customers and serve them better.
Categories of cluster analysis methods
There are many clustering algorithms, and the choice depends on the type of data and on the purpose and application of the clustering. Clustering algorithms fall into five main categories: partition-based methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
1. Clustering method based on partition
The partition-based clustering method is a top-down method. Given a data set D of n data objects, it organizes the objects into k (k ≤ n) partitions, where each partition represents a cluster. Figure 2 is a schematic diagram of the partition-based clustering method.
Figure 2 Schematic diagram of hierarchical clustering algorithm
The classic partition-based algorithms are k-means and k-medoids; many other algorithms are refinements of these two.
The advantage of partition-based clustering is fast convergence. The disadvantages are that it requires a reasonable estimate of the number of clusters k, and that the choice of initial centers and the presence of noise can strongly affect the clustering result.
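To make the partitioning idea concrete, here is a minimal k-means sketch in plain Python. It is an illustration under stated assumptions rather than a reference implementation: the function name `k_means`, the sample points, and the choice to pass the initial centers in explicitly are all hypothetical. Passing the centers in makes the dependence on initialization described above easy to experiment with.

```python
import math

def k_means(points, initial_centers, max_iter=100):
    """Basic k-means on tuples of floats; k is len(initial_centers)."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, initial_centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Starting both centers inside the same group, or adding a distant noise point, is a quick way to see how the result can change.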
2. Hierarchical clustering method
Hierarchical clustering decomposes a given data set hierarchically until some condition is met. Depending on the direction of the decomposition, the algorithms are either bottom-up or top-down, that is, agglomerative hierarchical clustering and divisive hierarchical clustering, respectively.
1) Bottom-up.
Initially, each data object is its own cluster. The distances between data objects are computed, and the closest points are merged into the same cluster. Then the distances between clusters are computed, and the nearest clusters are merged into a larger cluster. Merging continues until everything has been merged into a single cluster or a termination condition is reached.
The distance between clusters can be computed with the shortest-distance method, the median-distance method, the group-average method, and so on. The shortest-distance method defines the distance between two clusters as the shortest distance between any pair of data objects drawn from the two clusters. The representative bottom-up algorithm is AGNES (AGglomerative NESting).
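The bottom-up procedure with the shortest-distance rule can be sketched as follows. This is a simplified illustration, not the AGNES algorithm as published: the function names are hypothetical, and the termination condition is assumed to be a target number of clusters.

```python
import math

def single_linkage(c1, c2):
    """Shortest-distance rule: the closest pair of points across two clusters."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerative(points, target_clusters):
    """Merge the two nearest clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Hypothetical data: two nearby pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, target_clusters=3)
print(result)
# [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)], [(10.0, 0.0)]]
```

Every merge rescans all cluster pairs, which is a direct view of why hierarchical methods are computationally expensive on large data sets.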
2) Top-down.
This method starts with all objects in a single cluster and then gradually splits it into smaller clusters, until each data object is in its own cluster or a termination condition is reached. The representative top-down algorithm is DIANA (DIvisive ANAlysis).
The main advantages of hierarchical clustering are that distance and similarity rules are easy to define, there are few restrictions, the number of clusters does not need to be set in advance, and the hierarchical relationship between clusters can be discovered. The main disadvantages are that the computational complexity is high, outliers can have a large impact, and the algorithm tends to produce chain-shaped clusters.
3. Density based clustering method
The main goal of density-based clustering is to find high-density regions separated by low-density regions. Distance-based clustering algorithms tend to produce spherical clusters, whereas density-based algorithms can find clusters of arbitrary shape.
Density-based clustering starts from the density of the region in which the data objects are distributed. As long as the density of data objects within a given neighborhood exceeds a certain threshold, clustering continues.
By connecting dense regions, this method can form clusters of different shapes and reduce the influence of outliers and noise on clustering quality, as shown in Figure 3.
The most representative density-based clustering algorithms are DBSCAN, OPTICS, and DENCLUE.
For the hierarchical methods described earlier, Figure 2 is the schematic diagram: the upper part shows the steps of the AGNES algorithm and the lower part shows the steps of the DIANA algorithm. Neither method is inherently better than the other; in practice, the choice depends on the characteristics of the data and the desired number of clusters, considering whether bottom-up or top-down will be faster.
Figure 3 Schematic diagram of density clustering algorithm
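The density idea can be illustrated with a compact DBSCAN-style sketch in plain Python. The parameter names `eps` (neighborhood radius) and `min_pts` (density threshold) follow common DBSCAN usage; the sample data and the simple linear neighbor search are assumptions chosen for brevity, not an optimized implementation.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical data: a line-shaped group, a compact group, and one outlier.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

The elongated first group is recovered as a single cluster even though it is not spherical, and the isolated point is reported as noise instead of being forced into a cluster.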
4. Grid based clustering method
Grid-based clustering quantizes the space into a finite number of cells that form a grid structure, and all clustering is performed on the grid. The basic idea is to divide the possible values of each attribute into many adjacent intervals, creating a collection of grid cells. Each object falls into the grid cell whose attribute space contains the object's values, as shown in Figure 4.
Figure 4 Schematic diagram of grid-based clustering algorithm
The main advantage of grid-based clustering is speed: its processing time is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. The disadvantage is that it can only find clusters with horizontal or vertical boundaries; oblique boundaries cannot be detected. In addition, with high-dimensional data, the number of grid cells grows exponentially with the number of attribute dimensions.
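A rough sketch of the grid idea for 2-D data: count the points in each cell, keep the cells that are dense enough, and join dense cells that share an edge. The function name, the `cell_size` and `min_count` parameters, and the sample points are illustrative assumptions; published grid-based algorithms differ in their details.

```python
from collections import Counter

def grid_clusters(points, cell_size, min_count):
    """Quantize 2-D points into square cells, then join adjacent dense cells."""
    # Step 1: a single pass over the data counts the points in each cell.
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # Step 2: work on cells only; flood-fill dense cells that share an edge.
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:
            cx, cy = stack.pop()
            component.append((cx, cy))
            for nxt in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nxt in dense and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters

# Hypothetical data.
points = [(0.2, 0.3), (0.7, 0.8), (1.1, 0.4), (1.6, 0.9),  # two adjacent dense cells
          (5.2, 5.5), (5.8, 5.1),                          # one separate dense cell
          (9.5, 0.5)]                                      # a sparse cell, ignored
print(grid_clusters(points, cell_size=1.0, min_count=2))
# [[(0, 0), (1, 0)], [(5, 5)]]

# With 10 intervals per attribute, the number of cells is 10 ** dimensions.
cells = {d: 10 ** d for d in (2, 5, 10)}
print(cells)  # {2: 100, 5: 100000, 10: 10000000000}
```

After the counting pass, the clustering step touches only cells, which is where the speed comes from. Because clusters are unions of axis-aligned cells, their boundaries are always horizontal or vertical.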
5. Model based clustering method
Model-based clustering tries to optimize the fit between the given data and some mathematical model. It assumes a model for each cluster and then finds the best fit of the data to that model. The assumed model may be a density function or another function describing the spatial distribution of the data objects. The underlying assumption is that the target data set is generated by a set of underlying probability distributions.
Figure 5 compares the partition-based clustering method with the model-based clustering method. The result on the left comes from a distance-based clustering method, whose core principle is to group together points that are close to each other. The result on the right comes from a clustering method based on a probability distribution model, where the model used is an ellipse with a certain curvature.
Two solid points are marked in Figure 5. These two points are very close to each other. The distance-based method puts them in the same cluster, but the method based on the probability distribution model puts them in different clusters in order to satisfy the specific probability distribution model.
Figure 5 Comparison of clustering methods
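The effect described for the two marked points can be reproduced with a small sketch. It assumes each cluster is an axis-aligned 2-D Gaussian (an elliptical model) with equal weight, and the means and standard deviations are fixed by hand for illustration instead of being fitted from data. All names and numbers here are hypothetical.

```python
import math

def log_density(point, mean, std):
    """Log-density of an axis-aligned 2-D Gaussian (x and y independent)."""
    return sum(
        -0.5 * ((v - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for v, m, s in zip(point, mean, std)
    )

# Hypothetical clusters: (mean, per-axis standard deviation).
clusters = [
    ((0.0, 0.0), (3.0, 0.3)),  # wide, flat ellipse
    ((2.5, 4.0), (0.5, 2.0)),  # narrow, tall ellipse
]

def by_distance(point):
    """Distance-based rule: index of the nearest cluster center."""
    return min(range(len(clusters)), key=lambda k: math.dist(point, clusters[k][0]))

def by_model(point):
    """Model-based rule: index of the cluster whose distribution fits best."""
    return max(range(len(clusters)), key=lambda k: log_density(point, *clusters[k]))

p1, p2 = (2.0, 0.3), (2.3, 1.2)  # two points less than 1.0 apart
print(by_distance(p1), by_distance(p2))  # 0 0
print(by_model(p1), by_model(p2))        # 0 1
```

Both points are nearest to the first center, so a distance-based rule keeps them together. Under the elliptical model, the second point lies far outside the flat ellipse vertically but well inside the tall one, so it is assigned to the second cluster.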
In model-based clustering, the number of clusters is determined automatically from standard statistics, and noise and outliers are also handled statistically.
Publisher: Full stack programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/127746.html Original link: https://javaforall.cn