[feature selection] several methods of feature selection
2022-07-24 20:29:00 【Sunny qt01】
- Feature selection *
Invalid variables
Irrelevant variables, redundant variables
Statistical feature selection
Variance thresholding, chi-square test, ANOVA and T test, Pearson correlation coefficient
Filtering highly correlated features (redundant variables)
Model-based feature selection
Decision tree, logistic regression, random forest, XGBoost
The model selects variables automatically
Recursive feature elimination
Features are eliminated step by step until the set is narrowed to a specific range.

As the number of input variables grows, more data is needed; otherwise the model becomes unstable.
- Invalid variables
Irrelevant variables, redundant variables

Redundancy: when the correlation between two variables is too high, the two probably express nearly the same concept, so they are redundant. They can be merged, or one of the fields can simply be deleted, since both carry the same information.
Irrelevancy: compare X4 and X3. As X4 gets larger, the target value changes with it. As X3 changes, the predicted value varies at random, so X3 is unrelated to the target and brings no information.

- Statistical feature selection
VT (variance thresholding): compute the variance of each numeric field. If it falls below a certain value, the field contains too little information.
Do not standardize the data before computing the variance. After Z-score standardization, for example, every field has variance 1 and mean 0, so the variances can no longer be compared.
A threshold must be chosen; fields whose variance falls below it are deleted.
Binary variables: encode one value as 1 and the other as 0 (do this feature transformation first). The variance is then P(1-P), where P is the proportion of 1s.

The larger the variance, the more important the field is taken to be. For a binary variable the maximum is 0.25, reached at P = 0.5.
Note that this method does not look at the target at all.
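A minimal sketch of variance thresholding using only the standard library. The column names, values and the 0.1 threshold are made up for illustration and are not from the original notes.

```python
from statistics import pvariance

# Hypothetical toy columns; names and values are assumptions for illustration.
columns = {
    "is_member": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],    # binary, P = 0.9
    "is_urban": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],     # binary, P = 0.5
    "region_code": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # constant column
}
THRESHOLD = 0.1  # assumed cut-off; in practice it is chosen per dataset

# Population variance of each raw (unstandardized) column.
variances = {name: pvariance(values) for name, values in columns.items()}

# Keep only the columns whose variance reaches the threshold.
kept = [name for name, var in variances.items() if var >= THRESHOLD]

for name, var in variances.items():
    print(f"{name}: variance={var:.2f}")
print("kept:", kept)
```

For the two binary columns the computed variance equals P(1-P): 0.9 * 0.1 = 0.09 and 0.5 * 0.5 = 0.25.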
- Statistical tests:
These test the relationship between an input field and the target field.
Categorical fields: chi-square test.
Numeric fields: ANOVA (when the target field has more than 2 values) or T test (when the target field has only 2 values, such as yes or no). Both verify the relevance between the input field and the target field.
Chi-square case study: does background music affect consumers' mood? The test looks at the relationship between music (the input field) and alcohol purchases.
Music: no music, French accordion, Italian accordion
Alcohol: French, Italian, other alcoholic beverages
Statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Here O is the actual (observed) sales count in a cell and E is the expected count in the same cell.

The expected frequency of a cell assumes the two fields are independent: probability 1 multiplied by probability 2, multiplied by the total of 243.
For each cell, subtract the expected table from the observed table, square the difference, and divide by the expected count; then sum over all cells.

The larger the value, the stronger the evidence of a relationship. The value to compare against can be found in a chi-square table.
First compute the chi-square value, then use it to look up the corresponding probability in the table. If that probability is less than the significance level 0.05, such a result would be very unlikely if the two fields were unrelated, so independence is ruled out.
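The calculation for the music example can be sketched as follows. The cell counts are illustrative assumptions, chosen only so that they total 243 as stated in the text, because the original table did not survive in this copy. The critical value 9.488 is the standard chi-square table entry for a significance level of 0.05 and 4 degrees of freedom.

```python
# Illustrative counts (assumption): rows are music types, columns are
# French / Italian / other alcoholic beverages.
observed = [
    [30, 11, 43],  # no music
    [39, 1, 35],   # French accordion
    [30, 19, 35],  # Italian accordion
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: P(row) * P(column) * total,
# which simplifies to row_total * col_total / total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_0_05_DF4 = 9.488  # table value for alpha = 0.05, dof = 4
related = chi2 > CRITICAL_0_05_DF4
print(f"total={total}, chi2={chi2:.2f}, dof={dof}, related={related}")
```

A statistic above the table value corresponds to a probability below 0.05, so with these counts the music field would be kept as related to the purchase field.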
Case: chi-square test results for the microfinance data:

Variables 1, 2, 3 and 4 are more important; variables 5, 6, 7 and 8 are not important.
T test procedure: run an F test first, then the T test.


A field with a value lower than 0.05 is treated as an important variable.
ANOVA procedure: compute the F-value first, then the T-value.

The results are very close.
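One way to see why the two tests agree so closely: for a target with exactly two values, the one-way ANOVA F-value equals the square of the pooled two-sample t statistic. The sketch below checks this on made-up numbers. The data and function names are hypothetical, equal variances are assumed, and the p-value lookup (which needs a distribution table or a statistics library) is left out.

```python
from statistics import mean

# Hypothetical values of one numeric input field, split by a yes/no target.
group_yes = [5.1, 6.0, 5.8, 6.4, 5.5]
group_no = [4.2, 4.9, 4.4, 5.0, 4.6]

def sum_sq(values):
    """Sum of squared deviations from the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    pooled_var = (sum_sq(a) + sum_sq(b)) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / len(a) + 1 / len(b))) ** 0.5

def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    within = sum(sum_sq(g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (between / df_between) / (within / df_within)

t = t_statistic(group_yes, group_no)
f = f_statistic([group_yes, group_no])
print(f"t={t:.3f}, t^2={t * t:.3f}, F={f:.3f}")
```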
Pearson correlation coefficient:
Filtering highly correlated features (redundant variables):
Highly correlated fields appear often, and their information is duplicated. Use the Pearson correlation coefficient to check the correlation between each pair of fields; if it is greater than 0.95, drop one of the two.
Which one to keep depends on how variable 1 and variable 2 each relate to the target.
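A minimal sketch of this filter, assuming the tie-break is "keep whichever field of the pair correlates more strongly with the target". The field names and values are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: height_in is height_cm in other units, so it is redundant.
features = {
    "height_cm": [150, 160, 165, 170, 180, 185],
    "height_in": [59.1, 63.0, 65.0, 66.9, 70.9, 72.8],
    "age": [23, 45, 31, 52, 38, 27],
}
target = [50, 58, 63, 66, 77, 80]
THRESHOLD = 0.95

dropped = set()
names = list(features)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if a in dropped or b in dropped:
            continue
        if abs(pearson(features[a], features[b])) > THRESHOLD:
            # Redundant pair: drop the field less correlated with the target.
            r_a = abs(pearson(features[a], target))
            r_b = abs(pearson(features[b], target))
            dropped.add(b if r_a >= r_b else a)

kept = [name for name in names if name not in dropped]
print("dropped:", sorted(dropped))
print("kept:", kept)
```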
- Model-based feature selection
Decision tree, logistic regression, random forest, XGBoost
The model automatically picks the most important variables and avoids collinear ones,
so it can handle both collinearity and irrelevance.
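As a rough illustration, a random forest exposes how much each input contributed through its feature importances. The sketch below assumes scikit-learn and NumPy are available (the notes do not name a library) and uses synthetic data in which only the first column determines the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): column 0 decides the target, columns 1-3 are noise.
rng = np.random.default_rng(0)
n_rows = 400
informative = rng.normal(size=n_rows)
noise = rng.normal(size=(n_rows, 3))
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; larger means more useful to the model.
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]  # column indices, most important first

for idx in ranking:
    print(f"column {idx}: importance={importances[idx]:.3f}")
```

Columns with importance near zero are candidates for removal; where to cut is a judgment call rather than something the model decides.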
- RFECV (recursive feature elimination with cross-validation)
CV: results are verified with cross-validation.
RFE: the elimination step is repeated.
Any evaluation metric you choose can be used. Each round removes a variable and checks whether the metric gets worse.

Backward: first get the metric value with cross-validation, then remove one variable. If the metric improves, continue removing; if it gets worse, go back and keep that variable.
There are 3 methods: forward selection, backward elimination and stepwise regression.
This approach gives the best results, but it costs more computation and time.
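A short sketch with scikit-learn's `RFECV`, assuming that library is available; the synthetic dataset, the logistic regression estimator and the accuracy metric are illustrative choices. Note that `RFECV` differs slightly from the stop-when-worse procedure described above: it eliminates features all the way down, scores every feature count with cross-validation, and then keeps the count with the best score.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 8 features, of which 3 are informative.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,              # remove one feature per round
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",  # any scikit-learn scoring name can be used here
)
selector.fit(X, y)

print("number of features kept:", selector.n_features_)
print("mask of kept features:", selector.support_)
print("ranking (1 = kept):", selector.ranking_)
```

Each round refits the model once per fold, which is where the extra computation and time come from.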