What is cluster analysis? Categories of cluster analysis methods [easy to understand]

2022-07-25 20:05:00 【Full stack programmer webmaster】

Hello everyone, nice to see you again. I'm your friend Quan Jun.

Cluster analysis is the process of grouping a set of data objects into multiple classes, each composed of similar objects.

Basic concepts

Clustering is a technique for discovering the internal structure of data. It organizes all data instances into groups of similar instances, and these groups are called clusters. Data instances in the same cluster are similar to each other, while instances in different clusters are dissimilar.

Clustering is often called unsupervised learning. Unlike supervised learning, clustering works on data that carries no class or group labels indicating which category each instance belongs to.

The similarity between data objects is judged by defining a distance or a similarity coefficient. Figure 1 shows an example of clustering by the distance between data objects: objects that are close to one another are placed in the same cluster.
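As a rough illustration of "a distance or a similarity coefficient", the sketch below computes Euclidean distance and cosine similarity for numeric vectors. The function names are hypothetical, and these two measures are only common examples; the article does not say which measure Figure 1 uses.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A small distance or a similarity close to 1 both suggest that two objects belong together, so either can drive a clustering algorithm.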

Figure 1. Schematic diagram of cluster analysis

Cluster analysis can be applied during data preprocessing. For multidimensional data with a complex structure, cluster analysis can aggregate the data and so standardize it.

Cluster analysis can also be used to find dependencies between data items, so that items with close dependencies can be removed or merged. It can also serve as a preprocessing step for other data mining methods, such as association rules and rough set methods.

In business, cluster analysis is an effective tool for market segmentation. It is used to discover different customer groups and describe their characteristics, to study consumer behavior, and to look for new potential markets.

In biology, cluster analysis is used to classify animals, plants, and genes, in order to understand the inherent structure of populations.

In the insurance industry, cluster analysis can identify groups of automobile insurance policyholders by their average spending, and can identify groups of properties in a city by housing type, value, and geographical location.

In Internet applications, cluster analysis is used to classify documents on the Web.

In e-commerce, cluster analysis groups customers with similar browsing behavior and analyzes their common characteristics, helping e-commerce companies understand their customers and provide more suitable services.

Categories of cluster analysis methods

There are many clustering algorithms. The choice of algorithm depends on the type of data and on the purpose and application of the clustering. Clustering algorithms are mainly divided into five categories: partition-based methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

1. Clustering method based on partition

The partition-based clustering method is a top-down method. Given a data set D of n data objects, it organizes the objects into k (k ≤ n) partitions, where each partition represents a cluster. Figure 2 is a schematic diagram of the partition-based clustering method.

Figure 2. Schematic diagram of hierarchical clustering algorithm

The classic partition-based algorithms are k-means and k-medoids; many other algorithms are improvements on these two.

The advantage of partition-based clustering is fast convergence. The disadvantages are that the number of clusters k must be reasonably estimated in advance, and that the choice of initial centers and the presence of noise can strongly affect the clustering result.
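To make the partition idea concrete, here is a minimal k-means sketch in plain Python. It assumes points are tuples of numbers and that the caller supplies the initial centers, which is exactly the sensitive choice mentioned above. The names `k_means` and `initial_centers` are made up for this illustration, and the code is not taken from any particular library.

```python
import math

def k_means(points, initial_centers, max_iter=100):
    """Lloyd-style k-means on tuples of numbers; returns (centers, labels)."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, labels
```

Calling `k_means(points, initial_centers)` with different initial centers on the same data can give different partitions, which is why the initialization matters so much in practice.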

2. Hierarchical clustering method

Hierarchical clustering decomposes a given data set level by level until certain conditions are met. According to the order of decomposition, the algorithms are divided into bottom-up and top-down methods, that is, agglomerative and divisive hierarchical clustering algorithms.

1) Bottom-up.

At first, each data object is its own cluster. The distances between data objects are computed, and the closest points are merged into the same cluster. Then the distances between clusters are computed, and the nearest clusters are merged into a larger cluster. Merging continues until everything has been merged into a single cluster or until a termination condition is reached.

The distance between clusters can be computed with the shortest distance method, the middle distance method, the group average method, and others. The shortest distance method defines the distance between two clusters as the shortest distance between any data object in one cluster and any data object in the other. The representative bottom-up algorithm is AGNES (AGglomerative NESting).
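The bottom-up procedure with the shortest distance (single linkage) rule can be sketched as follows. This is a simplified, brute-force illustration in the spirit of AGNES, not a reference implementation; the function names and the `target_clusters` stopping condition are assumptions chosen for the example.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Shortest distance between any pair of points drawn from the two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerative(points, target_clusters):
    """Merge the two closest clusters until only target_clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > max(1, target_clusters):
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

Every merge rescans all cluster pairs, which hints at why the computational cost of hierarchical methods becomes a concern on larger data sets.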

2) Top-down.

At the start, all objects belong to a single cluster. The cluster is then gradually subdivided into smaller clusters until each data object is in its own cluster or until a termination condition is reached. The representative top-down algorithm is DIANA (DIvisive ANAlysis).
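One top-down step can be sketched like this: a single cluster is split into two by growing a "splinter" group, which is the general idea behind DIANA. The sketch assumes all points are distinct tuples and shows only one split; repeating it on the resulting clusters, and choosing which cluster to split next, is left out. The function names are hypothetical.

```python
import math

def average_distance(point, group):
    """Mean distance from point to the members of group (0.0 if group is empty)."""
    return sum(math.dist(point, q) for q in group) / len(group) if group else 0.0

def diana_split(cluster):
    """Split one cluster into (remaining, splinter) using a splinter-group rule."""
    remaining = list(cluster)
    # Seed the splinter group with the point farthest, on average, from the rest.
    seed = max(remaining, key=lambda p: average_distance(p, [q for q in remaining if q != p]))
    remaining.remove(seed)
    splinter = [seed]
    while len(remaining) > 1:
        def gain(p):
            # Positive when p is closer, on average, to the splinter group.
            others = [q for q in remaining if q != p]
            return average_distance(p, others) - average_distance(p, splinter)
        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break  # nobody prefers the splinter group any more
        remaining.remove(best)
        splinter.append(best)
    return remaining, splinter
```

Applied to three points near the origin plus two points near (10, 10), the split separates the two groups after moving only the far-away points.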

The main advantages of hierarchical clustering are that distances and similarity rules are easy to define, there are few restrictions, the number of clusters does not need to be set in advance, and the hierarchical relationship between clusters can be discovered. The main disadvantages are that the computational complexity is high, outliers can have a large impact, and the algorithm tends to produce chain-shaped clusters.

3. Density based clustering method

The main goal of density-based clustering is to find high-density regions that are separated by low-density regions. Distance-based clustering algorithms produce spherical clusters, whereas density-based algorithms can find clusters of arbitrary shape.

Density-based clustering starts from the density of the region in which data objects are distributed. As long as the density of data objects within a given neighborhood exceeds a certain threshold, clustering continues.

By connecting dense regions, this method can form clusters of different shapes, and it can eliminate the influence of outliers and noise on clustering quality, as shown in Figure 3.
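The "neighborhood plus threshold" idea can be sketched with a compact, brute-force version of DBSCAN, a well-known density-based algorithm. The parameter names `eps` (neighborhood radius) and `min_pts` (density threshold) follow common usage but are assumptions here, and a real implementation would use a spatial index rather than scanning every point.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not dense enough to start a cluster
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels
```

Points that never fall inside a dense neighborhood keep the label -1, which is how noise and outliers are kept from distorting the clusters.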

The most representative density-based clustering algorithms are DBSCAN, OPTICS, and DENCLUE.

Figure 2 is a schematic diagram of the hierarchical clustering algorithm: the upper part shows the steps of AGNES, and the lower part shows the steps of DIANA. Neither method is inherently better than the other; in practice, consider the characteristics of the data and the number of clusters you want when deciding whether bottom-up or top-down is faster.

Figure 3. Schematic diagram of density clustering algorithm

4. Grid based clustering method

Grid-based clustering quantizes the space into a finite number of cells that form a grid structure, and all clustering is performed on the grid. The basic idea is to divide the possible values of each attribute into many adjacent intervals, creating a collection of grid cells. Each object falls into one grid cell, whose attribute intervals contain the object's values, as shown in Figure 4.

Figure 4. Schematic diagram of grid-based clustering algorithm

The main advantage of grid-based clustering is fast processing: its processing time is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. The disadvantage is that it can only find clusters with horizontal or vertical boundaries; oblique boundaries cannot be detected. In addition, when handling high-dimensional data, the number of grid cells grows exponentially with the number of attribute dimensions.
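A very small sketch of the grid idea for 2-D data: points are quantized into square cells, cells holding at least `min_count` points are treated as dense, and dense cells that share an edge are joined into one cluster. The names `cell_size` and `min_count` and the edge-sharing rule are assumptions made for this illustration, not a specific published algorithm.

```python
from collections import Counter

def grid_clusters(points, cell_size, min_count):
    """Group dense grid cells that share an edge; returns a list of sets of cells."""
    # Quantize: map every 2-D point to the integer coordinates of its cell.
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    dense = {cell for cell, n in counts.items() if n >= min_count}

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        # Flood fill over dense cells that share an edge with the current one.
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            cx, cy = stack.pop()
            component.add((cx, cy))
            for nxt in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nxt in dense and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(component)
    return clusters
```

After the single counting pass, the work is done on cells rather than points, and because cells are axis-aligned squares the resulting cluster boundaries are always horizontal or vertical.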

5. Model based clustering method

Model-based clustering tries to optimize the fit between the given data and some mathematical model. It assumes a model for each cluster and then finds the data that best fit the model. The assumed model may be a density function or another function describing the spatial distribution of the data objects. The underlying assumption is that the target data set is generated by a set of latent probability distributions.

Figure 5 compares the partition-based clustering method with the model-based clustering method. The left side shows the result of a distance-based method, whose core principle is to group together points that are close to one another. The right side shows the result of a clustering method based on a probability distribution model; the model used here is an ellipse with a certain curvature.

Two solid points are marked in Figure 5. They are very close to each other, so the distance-based method puts them in the same cluster. The method based on a probability distribution model, however, assigns them to different clusters in order to satisfy the assumed distribution.

Figure 5. Comparison of clustering methods

In model-based clustering, the number of clusters is determined automatically from standard statistics, and noise and outliers are also handled through statistics.
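One common way to realize model-based clustering is a Gaussian mixture fitted with the EM algorithm. The sketch below handles one-dimensional data only and, unlike the automatic selection described above, takes the number of components as given through the initial parameters. All names are hypothetical, and the variance floor is an assumption added for numerical safety.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns (means, variances, weights)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-6,  # floor keeps a component from collapsing to zero variance
            )
            weights[j] = nj / len(data)
    return means, variances, weights
```

Each point ends up with a probability of belonging to every component instead of a hard assignment, which is why two nearby points can still be placed in different clusters when that fits the assumed distributions better.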

Publisher: Full stack programmer stack length. Please credit the source when reprinting: https://javaforall.cn/127746.html Link to the original text: https://javaforall.cn
