[summary of Feature Engineering] explain what features are and the steps of feature engineering
2022-07-24 20:29:00 【Sunny qt01】
- Introduction to feature engineering
A common saying is that data and features determine the upper limit of machine learning, while algorithms and models only approach that limit. Feature engineering therefore plays an indispensable role in machine learning.
Looking back at competitions such as Kaggle and KDD, at home and abroad, the winners often did not use particularly sophisticated algorithms. Most of them did excellent feature engineering and then got strong results with common algorithms.
Feature engineering is a key factor in machine learning.
- The importance and purpose of feature engineering
The purpose of feature engineering is to transform fields into features that better represent the underlying problem, and thereby improve the performance of machine learning.
1. The better the features, the more flexibility you have: good features perform well across many models, so you can choose a less complex model that runs faster and is easier to understand and maintain.
2. The better the features, the simpler the model: with good features, a model can still perform well even when its parameters are not optimal, so you spend less time searching for optimal parameters and can keep the model simpler.
3. The better the features, the better the model performs: improving model performance is the reason we look for good features.
How to evaluate feature engineering
1. Build a baseline machine learning model (Baseline Model, the most basic model).
2. Apply one or more feature engineering techniques to the raw data.
3. Rebuild the machine learning model and compare it with the Baseline Model.
4. If the gain in performance exceeds a chosen threshold, the feature engineering is beneficial.
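The evaluation procedure above can be sketched in a few lines of Python. This is only an illustration: the labels, the two sets of predictions, the `accuracy` metric, and the `threshold` value of 0.01 are all hypothetical choices, not something the original case study specifies.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def is_beneficial(baseline_score, engineered_score, threshold=0.01):
    """Return True when the gain over the baseline exceeds the threshold."""
    return (engineered_score - baseline_score) > threshold


# Hypothetical labels and predictions, for illustration only.
labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
baseline_predictions = [0] * 10                          # baseline: always predict 0
engineered_predictions = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # model rebuilt on engineered features

baseline_score = accuracy(baseline_predictions, labels)
engineered_score = accuracy(engineered_predictions, labels)
print(is_beneficial(baseline_score, engineered_score))
```

In practice both scores would come from the same validation split, so that the difference reflects the features and not the data.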
Before the main feature engineering work, we have to do two things:
1. Feature understanding: know what fields are in the dataset.
2. Feature improvement: preprocess the data in those fields.
- Feature understanding
1. Is the data structured (tables) or unstructured (text, speech, video, audio)?
2. Field types: numeric, categorical, ordinal, binary.
3. Exploratory data analysis (Exploratory Data Analysis), to get an overall picture of the data:
(1) Descriptive statistics (Descriptive Statistics): number of distinct values, number of null values, distribution of category values, maximum, minimum, mean, standard deviation, outliers, and other data quality reports.
(2) Data visualization (Data Visualization): use a mix of charts (pie charts, bar charts, histograms, scatter plots, etc.), which can be plotted against the target field.
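A data quality report of the kind listed under descriptive statistics could be produced roughly as follows. This is a minimal sketch using only the Python standard library; the function name `quality_report` and the toy rows are hypothetical and are not the article's data.

```python
import statistics
from collections import Counter


def quality_report(rows, numeric_fields, category_fields):
    """Summarize each field: distinct values, nulls, and basic statistics."""
    report = {}
    for field in numeric_fields + category_fields:
        values = [row.get(field) for row in rows]
        present = [v for v in values if v is not None]
        summary = {
            "distinct": len(set(present)),
            "nulls": len(values) - len(present),
        }
        if field in numeric_fields:
            summary.update(
                min=min(present),
                max=max(present),
                mean=statistics.mean(present),
                std=statistics.stdev(present),  # sample standard deviation
            )
        else:
            # Distribution of category values.
            summary["counts"] = dict(Counter(present))
        report[field] = summary
    return report


# Hypothetical toy rows, for illustration only.
rows = [
    {"age": 25, "sex": "F"},
    {"age": 40, "sex": "M"},
    {"age": 31, "sex": None},
    {"age": 52, "sex": "M"},
]
report = quality_report(rows, ["age"], ["sex"])
print(report["age"])
print(report["sex"])
```

An unexpected number of distinct values in a category field, or a minimum or maximum far from the mean in a numeric field, is the kind of signal such a report is meant to surface.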
Case study: microcredit dataset
The microfinance data contains 1551 customer records.
Each customer record contains one target field (Target Attribute) and 10 input fields (Input Attribute).
8 of the input fields are categorical and 2 are numeric.
This project divides customers into two classes:
1. Customers who will take up microfinance (respond): 167 records.
2. Customers who will not take up microfinance (no response): 1384 records.
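The two class counts above imply a strongly imbalanced target. The short calculation below uses only the counts given in the case study; the variable names are illustrative. It shows why a trivial model that always predicts "no response" already reaches high accuracy, which is worth keeping in mind when choosing the metric for the Baseline Model.

```python
total = 1551
responders = 167
non_responders = 1384

# The two classes should account for every record.
assert responders + non_responders == total

response_rate = responders / total
majority_baseline_accuracy = non_responders / total

print(f"response rate: {response_rate:.3f}")                            # 0.108
print(f"always-'no response' accuracy: {majority_baseline_accuracy:.3f}")  # 0.892
```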
Field 01: age (numeric field): age
Field 02: sex (category field): gender
Field 03: region (category field): residential area
Field 05: income (numeric field): monthly income
Field 06: children (category field): number of children in the family
Field 07: car (category field): whether the customer owns a car
Field 08: save_act (category field): whether the customer has a savings account
Field 09: current_act (category field): whether the customer has a current account
Field 10: mortgage (category field): whether the customer has a mortgage

Descriptive statistics: data quality report
Summary table

The number of distinct values for gender may indicate a data problem.
Table of numeric fields
Compare the maximum and minimum values with the last two columns to check for outliers.

Table 3: category fields

Gender seems to have little to do with the target field.

The loan ratio differs slightly between residential areas (more important than gender).

Married people seem to have a greater demand for loans.

The number of children is related to whether the customer takes a loan; the author attributes this to customers without children having fewer responsibilities (important).

Whether the customer owns a car seems to have little to do with whether they take a loan.

Current account versus loan: check whether the two are related.

For age, the chart shows a downward trend.

Income also shows a downward trend.

- Feature improvement
Improve the features on the basis of having understood them:
Data cleaning: handling of wrong values, null values, and outliers (explained in an earlier part).
Data encoding, data standardization (Data Standardization), and type conversion:
Z-score, Min-Max, and encoding of category and ordinal fields (explained in an earlier part).
Generalization of data and normalization of data (scaling a vector to length 1; used in text analysis and explained in Part 3).
Structuring unstructured data (covered in Part 3).
Normalization of (non-text) data:
L2: the Euclidean length of the vector equals 1, that is, the square root of the sum of the squared components is 1.
L1: the sum of the absolute values of the components is 1.
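The standardization and normalization steps named above can be written out directly. The sketch below uses only the Python standard library; the function names and the sample values are hypothetical. Note the difference in direction: Z-score and Min-Max rescale one field across all records, while L1 and L2 normalization rescale one record (vector) across its components.

```python
import math
import statistics


def z_score(values):
    """Rescale a field to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std; a constant field would divide by zero
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale a field to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def l2_normalize(vector):
    """Scale a vector so the square root of the sum of squares is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def l1_normalize(vector):
    """Scale a vector so the sum of absolute values is 1."""
    norm = sum(abs(x) for x in vector)
    return [x / norm for x in vector]


incomes = [1000.0, 2000.0, 3000.0]  # hypothetical values for one field
print(z_score(incomes))
print(min_max(incomes))             # [0.0, 0.5, 1.0]
print(l2_normalize([3.0, 4.0]))     # [0.6, 0.8]
print(l1_normalize([3.0, 4.0]))
```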
- Coverage of feature engineering
Feature construction: construct new features and explore the relationships between features.
Sources include external data, data exploration, expert experience, data analysis, and feature construction methods.
Feature selection: keep the useful features and say no to the bad ones.
Methods include statistical methods, highly correlated features, model-based methods (random forests, decision trees), and recursive feature selection (stepwise regression and so on).
Feature transformation (with assumptions, for example PCA): use mathematical methods (simple addition, subtraction, multiplication, and division; principal component analysis; factor analysis; etc.) to merge old fields (which may themselves be poor features) into new features and extract the latent structure hidden in the data.
Linear (PCA, matrix factorization NMF, SVD, TSVD, LDA)
Nonlinear (Kernel PCA, t-SNE, neural networks)
Two linear transformations
Feature learning (no assumptions): use deep learning to learn new features automatically.
(Association rules, neural networks, deep learning) feature-based learning; word-embedding-based text feature learning.
Using AI to advance AI
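As one concrete instance of the statistical approach to feature selection mentioned above, the sketch below keeps only the features whose Pearson correlation with the target is strong enough. It is one simple filter among many, not the method used in the case study; the feature values, the target, and the `min_abs_corr` cutoff of 0.5 are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(features, target, min_abs_corr=0.5):
    """Keep features whose absolute correlation with the target meets the cutoff."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    return [name for name, r in scores.items() if abs(r) >= min_abs_corr]


# Hypothetical toy data, for illustration only.
features = {
    "children": [0, 1, 2, 3, 0, 2],
    "car": [1, 0, 1, 0, 1, 0],
}
target = [0, 0, 1, 1, 0, 1]

selected = select_by_correlation(features, target)
print(selected)  # ['children']
```

A correlation filter only captures linear, one-feature-at-a-time relationships, which is why model-based and recursive methods are listed alongside it.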