Machine Learning Clustering - Experimental Report
2022-06-26 11:05:00 【Obviously easy to prove】
Machine learning experiment report
- 0. The experimental report PDF can be downloaded from this site
- 1. Purpose and requirements of the experiment
- 2. Experiment content and method
- 3. Experimental steps and process
- 4. Experimental conclusions and experience
0. The experimental report PDF can be downloaded from this site
Machine learning experiment 5: clustering
Downloading it costs points (the experiment reports are checked for plagiarism in the background, so freeloading is not recommended).
I suggest reading the blog post itself; the experimental reports on this blog mark the points that will matter later in bold with 【…】.
1. Purpose and requirements of the experiment
- Outline the principle and procedure of the K-means clustering algorithm.
- Master the K-means clustering algorithm, the display of its results, and its code implementation. Run a clustering experiment on 2~3 classes of points in two- or three-dimensional space (10 points per class), showing the clustering results with different colors and symbols.
- Run clustering experiments on face images (taking the first 2~3 individuals' face images) and on rotating objects (taking the first 2~3 classes of images in the COIL20 dataset). Show the clustering results with different colors and symbols, and place each image next to its corresponding point, so that the correctness of the result is visible at a glance. Also list in a table the clustering accuracy for different values of K on the different databases.
- Survey earlier papers and design a new clustering algorithm; write a brief account of it in this experimental report, and submit a full-length paper to the "paper submission office".
2. Experiment content and method
2.1 Clustering algorithm learning and review
2.1.1 Clustering tasks
The clustering task is introduced here following Zhou Zhihua's book Machine Learning (《机器学习》).
1) Concept of clustering task
Clustering attempts to divide the samples of a dataset into several usually disjoint subsets, each of which is called a "cluster". Through such a partition, each cluster may correspond to some underlying concept. Note that these concepts are unknown to the clustering algorithm in advance: clustering can only form the cluster structure automatically, and the semantics of each cluster must be interpreted and named by the user.
Clustering can be a standalone process, used to discover the intrinsic distribution structure of the data, or it can serve as a preliminary step for other learning tasks such as classification. For example, some business applications need to categorize new users, but defining "user types" may not be easy for the business. In that case one can first cluster the user data, define each resulting cluster as a class, and then train a classification model on these classes to identify the type of each new user.
2) Symbol definition

3) Performance metrics
Clustering performance measures are also called clustering "validity indices". As with performance measures in supervised learning, we need some measure to evaluate the quality of a clustering result; moreover, if the performance measure to be used is specified in advance, it can serve directly as the optimization objective of the clustering process, which helps produce results that better meet the requirements.
Clustering partitions the sample set D into several disjoint sample clusters. What, then, makes one clustering result better than another? Intuitively, we want "birds of a feather to flock together": samples in the same cluster should be as similar as possible, while samples in different clusters should be as different as possible. In other words, a good clustering has high intra-cluster similarity and low inter-cluster similarity.
There are two kinds of clustering performance measures. One compares the clustering result against a "reference model"; such measures are called "external indices". The other examines the clustering result directly, without any reference model; these are called "internal indices".
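As a concrete illustration (a minimal sketch with made-up labels, using scikit-learn rather than the MATLAB code used elsewhere in this report): the adjusted Rand index is an external index that needs reference labels, while the silhouette coefficient is an internal index computed from the data and the clustering alone.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Four toy points forming two tight groups (made-up data).
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
y_ref = [0, 0, 1, 1]  # reference model ("ground truth") labels
y_clu = [1, 1, 0, 0]  # clustering result: same partition, labels swapped

# External index: compares the clustering with the reference labels.
ari = adjusted_rand_score(y_ref, y_clu)
# Internal index: judges the clustering from the data alone.
sil = silhouette_score(X, y_clu)
```

Note that `ari` is 1.0 despite the swapped label names: external indices compare partitions, not the label values themselves.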
2.1.2 The algorithm model of K-means
1) Optimization problem

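The figure that presented the optimization problem is missing from this page. In standard form (consistent with Zhou Zhihua's Machine Learning, which this section follows), K-means minimizes the within-cluster sum of squared errors over a partition C_1, ..., C_k of the sample set D:

```latex
\min_{C_1,\dots,C_k} \; E = \sum_{i=1}^{k} \sum_{\boldsymbol{x} \in C_i} \lVert \boldsymbol{x} - \boldsymbol{\mu}_i \rVert_2^2,
\qquad
\boldsymbol{\mu}_i = \frac{1}{\lvert C_i \rvert} \sum_{\boldsymbol{x} \in C_i} \boldsymbol{x}
```

The smaller E is, the more tightly the samples of each cluster gather around their cluster mean \(\boldsymbol{\mu}_i\).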
2) Iteration strategy

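The figure for the iteration strategy is likewise missing. The standard alternating iteration fixes the centers to update the assignments, then fixes the assignments to update the centers:

```latex
\lambda_j \leftarrow \arg\min_{i \in \{1,\dots,k\}} \lVert \boldsymbol{x}_j - \boldsymbol{\mu}_i \rVert_2,
\qquad
\boldsymbol{\mu}_i \leftarrow \frac{1}{\lvert C_i \rvert} \sum_{\boldsymbol{x} \in C_i} \boldsymbol{x}
```

Neither step can increase the within-cluster sum of squared errors, so the iteration converges to a local optimum.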
2.1.3 The algorithm flow of K-means

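The flow-chart figure is missing here; the procedure it described can be sketched as follows (a minimal NumPy sketch under the stopping rule used later in this report, not the MATLAB code actually used in the experiments):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=0.01, seed=0):
    """Plain K-means: X is an (n, d) sample matrix, k the number of
    clusters. Stops when the objective changes by less than `tol`."""
    rng = np.random.default_rng(seed)
    # 1) choose k initial centers at random from the samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    prev_obj = np.inf
    for _ in range(max_iter):
        # 2) assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) objective = sum of squared distances to assigned centers
        obj = (dists[np.arange(len(X)), labels] ** 2).sum()
        if abs(prev_obj - obj) < tol:  # 4) stop once the objective stabilizes
            break
        prev_obj = obj
        # 5) move each center to the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# toy check on two tight, well-separated blobs
rng = np.random.default_rng(1)
demo = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
demo_labels, _ = kmeans(demo, 2)
```

On the toy data the two blobs end up in two different clusters, whatever the random initialization happens to be.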
2.1.4 Analysis of the K-means algorithm
1) Complexity:
The complexity of K-means is O(mnk) per iteration, where m is the sample dimension, n the number of samples, and k the number of clusters.
2) Advantages:
- Easy to understand, with good clustering performance; the solution is only locally optimal, but a local optimum is often good enough;
- The algorithm scales well to large datasets;
- It works very well when the clusters are approximately Gaussian;
- Low algorithmic complexity.
3) Disadvantages:
- The value of K must be set manually, and different values of K give different results;
- Sensitive to the initial cluster centers: different initialization schemes give different results;
- Sensitive to outliers;
- Each sample can be assigned to only one cluster, so it is unsuitable for multi-label tasks;
- Unsuitable for overly scattered data, for classes with unbalanced sample counts, and for non-convex cluster shapes.
3. Experimental steps and process
3.0 Experimental datasets and class-label alignment
3.0.1 Datasets
(1) The ORL56_46 face dataset
This dataset contains 40 individuals with 10 images each; every image is 56×46 pixels. The experiment uses the first three classes of this dataset, with 10 sample points per class.
(2) The AR face dataset
The database consists of more than 3,200 color images of the frontal faces of 126 subjects, with 26 different images per subject. For each subject the images were recorded in two sessions two weeks apart, each session producing 13 images. All images were taken with the same camera under strictly controlled illumination and viewpoint conditions. Every image in the database is 768×576 pixels, each pixel represented by a 24-bit RGB color value. The experiment uses the first three classes of this dataset, with 26 sample points per class.
(3) The FERET face dataset
The dataset contains 200 people with 7 images each, organized by class, in grayscale at 80×80 pixels. Image 1 is the standard, unchanged image; images 2 and 5 show large pose changes; images 3 and 4 show small pose changes; image 7 shows an illumination change. The experiment uses the first three classes of this dataset, with 7 sample points per class.
(4) The COIL-20 dataset
COIL-20 is a collection of grayscale images of 20 objects photographed from different angles: one image is taken every 5 degrees, giving 72 images per object. Each image is uniformly resized to 128×128. The experiment uses the first three classes of this dataset, with 72 sample points per class.
3.0.2 Class-label alignment and the stopping condition
Because this experiment involves judging class membership, class-label alignment must be considered: the labels produced by clustering are misaligned with the true class labels when the recognition rate is computed. I use the Hungarian algorithm to solve this assignment problem; see the function ACC.m for details.
Iteration stops when the change in the objective function value is no more than 0.01.
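As an illustration of the label-alignment step (a Python sketch, not the actual ACC.m: for the small number of classes here, 2~3, a brute-force search over label permutations returns the same answer as the Hungarian algorithm):

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one relabelings of the
    predicted clusters (what Hungarian matching computes)."""
    classes = sorted(set(y_true) | set(y_pred))
    best = 0
    for perm in permutations(classes):
        relabel = dict(zip(classes, perm))  # predicted label -> class label
        hits = sum(t == relabel[p] for t, p in zip(y_true, y_pred))
        best = max(best, hits)
    return best / len(y_true)

# a perfect partition whose labels happen to be permuted
perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# one of four samples ends up in the wrong cluster
partial = clustering_accuracy([0, 0, 1, 1], [1, 1, 1, 0])
```

For larger numbers of classes the factorial search becomes infeasible, which is why ACC.m uses the Hungarian algorithm instead.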
3.1 Clustering experiment on 2~3 classes of points (10 points per class) in two or three dimensions
3.1.1 Experiment description
This experiment randomly initializes a set of points, each with coordinates and a class label. For the two-class case 20 points are generated; for the three-class case, 30 points. Accuracy is then computed by comparing the clustering results with the original class labels.
Because the data in this experiment is laid out exactly like the ORL dataset (10 points per class), the ORL dataset is used directly for the experiment and its visualization. The results are shown in 3.2.2.1.
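The two-class version of this setup can be sketched as follows (synthetic data and scikit-learn's KMeans as hypothetical stand-ins for the experiment's own data and MATLAB code):

```python
import numpy as np
from sklearn.cluster import KMeans

# two classes, 10 points each, drawn around well-separated centers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (10, 2)),
               rng.normal([4, 4], 0.5, (10, 2))])
y_true = np.repeat([0, 1], 10)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# with two clusters, accuracy up to a label swap is easy to compute
acc = max(np.mean(pred == y_true), np.mean(pred == 1 - y_true))
```

The clusters can then be shown in different colors and symbols with, e.g., matplotlib's `plt.scatter(X[:, 0], X[:, 1], c=pred)`.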
3.2 Face image clustering experiment
3.2.1 Experimental design
This experiment uses the ORL56_46, AR, and FERET face datasets. To minimize large differences in the results caused by the initial values, each algorithm is run five times and the average is taken as its measure.
3.2.2 Experimental results
1. The ORL dataset
Two classes



Three classes


2. The AR dataset
Two classes


Three classes


3. The FERET dataset
Two classes


Three classes


3.2.3 Experimental analysis
Comparing the 2~3-class clustering performance on the three datasets: ORL clusters best; FERET does well in the two-class case but relatively poorly with three classes; AR is relatively poor in both the two-class and the three-class case. Prompted by the composition of the AR dataset, I exported the AR clustering results for the two-class case and found that K-means tends to group face photos from the same session together rather than clustering people by identity. By comparison, the three-class clustering on AR is noticeably better.
Because the clustering result of K-means is strongly affected by the initialization, repeated runs confirm this: the accuracy varies from run to run. It is also related to the number of iterations needed for the objective to converge: the better the initialization, the more reasonable the iteration count and the higher the accuracy.
3.3 Rotating object image clustering experiment
3.3.1 Experimental design
This experiment uses the first 2~3 classes of the COIL-20 dataset. To minimize large differences in the results caused by the initial values, each algorithm is run five times and the average is taken as its measure.
3.3.2 Experimental results
Two classes


Three classes


3.3.3 Experimental analysis
It can be seen that K-means clusters the rotating-object dataset reasonably well with two classes but only moderately with three. This is because K-means suits roughly spherical data, while in the rotating-object images the changing viewing angle makes the within-class similarity small, so the clustering suffers. K-means assumes that each cluster has a spherical distribution, that all variables have the same variance, and that the classes have equal prior probability, i.e. roughly the same number of observations per class; these assumptions explain its weaknesses here. Using K-means alone is not only time-consuming but also clusters poorly, whereas combining PCA dimensionality reduction with clustering clearly helps: efficiency improves, and the two-class recognition rate reaches 100%. We can also see that the initialization matters greatly for clustering: if the initial points are close together, K-means tends to merge everything into one cluster, dropping the accuracy to only 50%.
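The reduce-then-cluster pipeline can be sketched as follows (random data of ORL-like dimensionality as a hypothetical stand-in, and scikit-learn's PCA and KMeans in place of the report's fastPCA and MATLAB code; this shows only the pipeline shape, not the reported accuracies):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 30 hypothetical flattened images of dimension 56*46 (as in ORL56_46)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 56 * 46))

# reduce to 20 dimensions first, then cluster in the reduced space
X_low = PCA(n_components=20).fit_transform(X)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)
```

Clustering 20-dimensional vectors instead of 2576-dimensional ones is what produces the efficiency gain described above.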
From the two figures above we find that FCM does not perform well on face clustering: driven by the magnitudes of the membership values, FCM tends to merge the samples into one cluster, which leads to low accuracy. On UCI datasets FCM often beats K-means; the difference here may be partly a matter of dimensionality, and partly that face images share common features.
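For reference, fuzzy C-means replaces hard assignments with membership degrees that sum to one per sample (a minimal NumPy sketch of the standard Bezdek updates, not the code used in the experiments):

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-4, seed=0):
    """Minimal fuzzy C-means: returns an (n, c) membership matrix U
    (rows sum to 1) and the (c, d) cluster centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)           # random initial memberships
    for _ in range(max_iter):
        Um = U ** m                             # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                   # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:       # memberships have stabilized
            U = U_new
            break
        U = U_new
    return U, centers

# toy check on two tight, well-separated blobs
rng = np.random.default_rng(1)
demo = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
U_demo, _ = fcm(demo, 2)
```

Hardening `U_demo.argmax(axis=1)` gives cluster labels; when the membership rows all drift toward being nearly uniform, exactly the collapse into one cluster described above occurs.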
3.4 Design a new clustering algorithm
【Not going to embarrass myself here; skipped.】
4. Experimental conclusions and experience
4.1 Experimental conclusions
The conclusions of this experiment are as follows:
- Among the face datasets, K-means clusters ORL best; FERET does well in the two-class case but relatively poorly with three classes; AR is relatively poor in both cases. Prompted by the composition of the AR dataset, I exported the two-class AR clustering results and found that K-means tends to group face photos from the same session together rather than clustering people by identity; by comparison, the three-class clustering on AR is noticeably better. Because the clustering result of K-means is strongly affected by the initialization, the accuracy varies from run to run; it is also related to the number of iterations needed for the objective to converge: the better the initialization, the more reasonable the iteration count and the higher the accuracy.
- K-means clusters the rotating-object dataset reasonably well with two classes but only moderately with three, because K-means suits roughly spherical data and the changing viewing angle in the rotating-object images makes the within-class similarity small. K-means assumes spherical cluster distributions, equal variance across variables, and equal prior probability (roughly the same number of observations per class); these assumptions explain its weaknesses here. Using K-means alone is not only time-consuming but also clusters poorly, whereas combining PCA dimensionality reduction with clustering improves efficiency and raises the two-class recognition rate to 100%. Initialization also matters greatly: if the initial points are close together, K-means tends to merge everything into one cluster, dropping the accuracy to only 50%.
- FCM does not perform well on face clustering: driven by the magnitudes of the membership values, it tends to merge the samples into one cluster, which leads to low accuracy. On UCI datasets FCM often beats K-means; the difference here may be partly a matter of dimensionality, and perhaps also that face images share common features, so the membership degrees of the samples tend toward uniformity.
4.2 Experimental experience
The main content of this experiment is clustering, a machine learning topic that is comparatively easy and the one I am most familiar with. The experiment compares K-means, fastPCA+K-means (at 20, 100, and 200 dimensions), and FCM on face clustering in terms of accuracy and efficiency. I have done quite a lot of related work, including clustering-algorithm research with senior students; algorithms such as PCM, RCM, and PRFCM could also be tried on face recognition. The tendency of K-means and FCM to collapse everything into one cluster has been studied in dedicated papers, and this drawback is especially prominent in face clustering.
Clustering is a common and very interesting method for grouping in mathematical modeling. I am very interested in it and may continue with clustering research in the future. Finally, my respects to the researchers who proposed and advanced these algorithms!
