
[Graduation Project] Research on sentiment analysis based on semi-supervised learning and ensemble learning


Summary: research on sentiment analysis based on semi-supervised learning and ensemble learning

Data: text/JDMilk.arff [TF-IDF features]

Baseline:
7% of the data as the labeled training set, 30% as the test set

SSL algorithms:
7% of the data as the labeled training set,
63% as the unlabeled dataset, 30% as the test set
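
As a sketch, such a 7% / 63% / 30% split could be produced with scikit-learn's train_test_split; the stand-in data, variable names, and stratification below are assumptions, not taken from the original code:

```python
# Hypothetical sketch: split a TF-IDF matrix X and labels y into
# 7% labeled training data, 63% unlabeled data, and 30% test data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 500)            # stand-in for the TF-IDF features
y = np.random.randint(0, 2, size=1000)   # stand-in for the sentiment labels

# Carve off the 30% test set first.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Keep 7% of the whole set as labeled data (0.07 / 0.70 = 10% of the rest);
# the remaining 63% is treated as unlabeled for the SSL algorithms.
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_rest, y_rest, train_size=0.10, stratify=y_rest, random_state=42)
```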

Reference: [Gao Wei, master's thesis] A semi-supervised sentiment classification method based on random subspace self-training

Training/test split: four-fold cross validation
Specifically: divide the data set into four parts, take turns using 3 parts as training data and 1 part as test data, run the experiment each time, and use the average accuracy over these runs as an estimate of the algorithm's accuracy. Obviously, this way of estimating accuracy has a high time complexity.
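
A minimal sketch of this estimation procedure with scikit-learn's KFold; the base classifier and the stand-in data are assumptions:

```python
# Hypothetical sketch: estimate accuracy with four-fold cross validation.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

X = np.random.rand(400, 50)              # stand-in for the TF-IDF features
y = np.random.randint(0, 2, size=400)    # stand-in for the sentiment labels

scores = []
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    clf = MultinomialNB().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("estimated accuracy: %.3f" % np.mean(scores))
```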

Evaluation metric: accuracy (Accuracy)

Environment configuration:
Python 2.7
scikit-learn, numpy, scipy
Docker

Algorithms:
Classifier selection for supervised learning (SL)
Selection criterion: the classifier must be able to output posterior probabilities
1. Support vector machine (SVC)
2. Naive Bayes with a multinomial distribution assumption (MultinomialNB)
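
Both classifiers satisfy this criterion through predict_proba; for SVC this requires constructing it with probability=True (Platt scaling). A minimal sketch with placeholder data:

```python
# Hypothetical sketch: the two base classifiers, configured so that
# posterior probabilities (predict_proba) are available.
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

X = np.random.rand(200, 30)              # stand-in for the TF-IDF features
y = np.random.randint(0, 2, size=200)

svm = SVC(kernel="linear", probability=True).fit(X, y)  # Platt-scaled probabilities
nb = MultinomialNB().fit(X, y)

print(svm.predict_proba(X[:3]))          # one row of class probabilities per sample
print(nb.predict_proba(X[:3]))
```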

Semi-supervised learning (SSL)
1. Self-Training
The most basic semi-supervised learning algorithm, but it easily learns the wrong thing: in the worst case there is no improvement at all, or performance even degrades.
Assumption: one's own high-confidence predictions are correct.

The main idea: train a classifier on the small labeled sample set, use it to classify the unlabeled samples, select the samples with the highest confidence (posterior probability), automatically annotate them and add them to the labeled set, then retrain the classifier, iterating this process.
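
A minimal sketch of this loop, assuming a base classifier with predict_proba and a fixed number of samples added per iteration (the batch size and stopping rule are assumptions):

```python
# Hypothetical sketch of Self-Training: iteratively pseudo-label the most
# confident unlabeled samples and retrain the classifier on the enlarged set.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_training(X_lab, y_lab, X_unlab, n_iter=10, per_iter=10):
    clf = MultinomialNB().fit(X_lab, y_lab)
    for _ in range(n_iter):
        if X_unlab.shape[0] == 0:
            break
        proba = clf.predict_proba(X_unlab)
        conf = proba.max(axis=1)                          # confidence = max posterior
        top = np.argsort(conf)[-per_iter:]                # most confident samples
        pseudo = clf.classes_[proba[top].argmax(axis=1)]  # their pseudo-labels
        X_lab = np.vstack([X_lab, X_unlab[top]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = np.delete(X_unlab, top, axis=0)
        clf = MultinomialNB().fit(X_lab, y_lab)           # retrain on the enlarged set
    return clf
```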

2. Co-Training
Characteristics: the original algorithm (Blum & Mitchell) was designed for multi-view data (web page text and hyperlinks); it looks at the problem from different views (angles) and exploits the differences between them.

In the original work there are 2 views: the web page text and the hyperlinks.
Parameters: p=1, n=3, k=30, u=75 (a co-training sketch using these parameters follows the two rules below).
Rule #1: samples can be represented by two or more redundant, conditionally independent views
Rule #2: each view is sufficient to train a strong classifier from the training samples
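
A minimal sketch of a Blum & Mitchell style co-training loop with the quoted parameters, assuming a binary task with labels 0/1 and two NumPy feature matrices (one per view); this is an illustration, not the original implementation:

```python
# Hypothetical sketch of Co-Training: two classifiers, one per view,
# teach each other their most confident predictions.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=3, k=30, u=75, seed=0):
    rng = np.random.RandomState(seed)
    order = list(rng.permutation(X1_u.shape[0]))
    pool, rest = order[:u], order[u:]              # small working pool of unlabeled data
    for _ in range(k):
        h1 = MultinomialNB().fit(X1_l, y_l)
        h2 = MultinomialNB().fit(X2_l, y_l)
        picked, labels = [], []
        for h, Xu in ((h1, X1_u), (h2, X2_u)):
            proba = h.predict_proba(Xu[pool])
            pos = np.argsort(proba[:, 1])[-p:]     # p most confident positives
            neg = np.argsort(proba[:, 0])[-n:]     # n most confident negatives
            picked += [pool[i] for i in pos] + [pool[i] for i in neg]
            labels += [1] * len(pos) + [0] * len(neg)
        X1_l = np.vstack([X1_l, X1_u[picked]])     # both views grow with the
        X2_l = np.vstack([X2_l, X2_u[picked]])     # newly pseudo-labeled samples
        y_l = np.concatenate([y_l, labels])
        pool = [j for j in pool if j not in set(picked)]
        while len(pool) < u and rest:              # replenish the working pool
            pool.append(rest.pop(0))
    return h1, h2
```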

A view count of 4 works better [from Su Yan's literature], and the number of features m in each view should be n/2, where n is the total number of features [from Wang Jiao's literature]. However, ordinary sentiment review text (NLP) has no natural multiple views, so considering the huge number of features in sentiment texts, the views are generated with the random feature subspace method.

[Random Subspace Method, RSM] The text feature space is divided into multiple parts that serve as multiple views.
However, the views must at least satisfy the 'redundant but not completely correlated' condition.

This is because the multiple views should be independent of each other: if they are fully correlated, the classifiers trained on the different views make exactly the same predictions for the same unlabeled examples, and Co-Training degenerates into the Self-Training algorithm [from Gao Yuan's master's thesis].

Source
The idea first appeared in Tin Kam Ho's paper "The Random Subspace Method for Constructing Decision Forests", for improving weak classifiers.

① From Dr. Wang Jiao's literature
Assume the original data feature space has n dimensions and the random subspace has m dimensions, with m < n. The labeled dataset contains l examples, i.e. |L| = l. Any p ∈ L can be written as p = (p1, p2, …, pn); projecting p into the m-dimensional space yields a vector p_sub = (p_s1, p_s2, …, p_sm). The set L_sub formed by all l vectors p_sub is the projection of the labeled dataset L into its m-dimensional random subspace. Repeating this process K times gives K different views of the data feature space, L_sub_k (1 ≤ k ≤ K).
Q: This still does not explain clearly how the projection (partition) is done and where the randomness comes in?
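
One concrete reading (an assumption, not Wang Jiao's code): the "projection" is simply keeping m of the n feature columns, and the "randomness" is that those m columns are drawn at random, independently for each of the K views:

```python
# Hypothetical sketch: build K random m-dimensional views of an n-dimensional
# labeled set L by keeping m randomly chosen feature columns per view.
import numpy as np

def random_subspace_views(L, m, K, seed=0):
    rng = np.random.RandomState(seed)
    n = L.shape[1]
    views = []
    for _ in range(K):
        cols = rng.choice(n, size=m, replace=False)    # the "random" part
        views.append((cols, L[:, cols]))               # the "projection" part: L_sub_k
    return views

L = np.random.rand(100, 2000)                          # stand-in for l labeled vectors
views = random_subspace_views(L, m=1000, K=4)          # m = n/2 features, K = 4 views
```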

② From Wikipedia:
1. Let the number of training points be N and the number of features in the training data be D.
2. Choose L to be the number of individual models in the ensemble.
3. For each individual model l, choose dl (dl < D) to be the number of input variables for l. It is common to have only one value of dl for all the individual models.
4. For each individual model l, create a training set by choosing dl features from D with replacement and train the model.
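
A minimal sketch of these four steps (note they sample the dl features with replacement, unlike the per-view selection above); the decision-tree base learner and majority vote are assumptions in line with Ho's decision-forest setting:

```python
# Hypothetical sketch of the Random Subspace Method ensemble: train L models,
# each on d_l input variables chosen from the D features with replacement.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rsm_fit(X, y, L=10, d_l=None, seed=0):
    rng = np.random.RandomState(seed)
    D = X.shape[1]
    d_l = d_l or D // 2
    models = []
    for _ in range(L):
        cols = rng.choice(D, size=d_l, replace=True)   # features chosen with replacement
        models.append((cols, DecisionTreeClassifier().fit(X[:, cols], y)))
    return models

def rsm_predict(models, X):
    votes = np.array([clf.predict(X[:, cols]) for cols, clf in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)     # majority vote (binary 0/1 labels)
```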

③ From the original source: Tin Kam Ho, "The Random Subspace Method for Constructing Decision Forests"

Download resources

https://download.csdn.net/download/s1t16/85724818
