[Summary of Feature Engineering] What features are and the steps of feature engineering
2022-07-24 20:29:00 【Sunny qt01】
- Introduction to feature engineering
A common saying holds that data and features determine the upper limit of machine learning, while algorithms and models only keep approaching that limit. Feature engineering therefore plays an indispensable role in machine learning.
Looking back at Kaggle, KDD, and other competitions at home and abroad, the winning entries generally did not rely on very sophisticated algorithms. Most of them did excellent feature engineering work and then achieved strong results with common algorithms.
Feature engineering is a key factor in machine learning.
- The importance and purpose of feature engineering
The purpose of feature engineering is to transform fields into features that better represent the underlying problem, and thereby improve the performance of machine learning.
1. Better features give more flexibility: good features perform well with almost any model, so you can choose a less complex model that runs faster and is easier to understand and maintain.
2. Better features make the model simpler to build: with good features, the model can still perform well even when its parameters are not optimal, so you spend less time searching for the best parameters and the model stays simpler.
3. Better features give better model performance: the goal of searching for features is to improve the performance of the model.
How to evaluate feature engineering
1. Build a baseline machine learning model (Baseline Model, the most basic model).
2. Apply one or more feature engineering techniques to the raw data.
3. Rebuild the machine learning model and compare it with the Baseline Model.
4. If the gain in performance exceeds a chosen threshold, the feature engineering is beneficial.
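The evaluation loop above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, a toy nearest-centroid classifier stands in for whatever model you actually use, and `THRESHOLD` is a hypothetical minimum gain that you would choose yourself.

```python
# Synthetic data (an assumption for illustration): one raw field x,
# and the label is 1 when |x| > 1.5.
xs = [i * 0.25 for i in range(-12, 13)]
ys = [1 if abs(x) > 1.5 else 0 for x in xs]


def split(values):
    """Even positions go to the training set, odd positions to the test set."""
    return values[0::2], values[1::2]


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit a one-dimensional nearest-centroid classifier, return test accuracy."""
    centroids = {}
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = sum(members) / len(members)
    correct = 0
    for x, y in zip(test_x, test_y):
        # On a tie, min() keeps the first minimum, i.e. the smallest label.
        predicted = min(centroids, key=lambda label: abs(x - centroids[label]))
        correct += predicted == y
    return correct / len(test_y)


def evaluate(feature):
    """Build the model on feature(x) and score it on the held-out rows."""
    train_x, test_x = split([feature(x) for x in xs])
    train_y, test_y = split(ys)
    return nearest_centroid_accuracy(train_x, train_y, test_x, test_y)


baseline_acc = evaluate(lambda x: x)        # step 1: baseline on the raw field
engineered_acc = evaluate(lambda x: x * x)  # steps 2-3: rebuild with x squared

THRESHOLD = 0.05  # hypothetical minimum gain, chosen by the analyst
gain = engineered_acc - baseline_acc
is_beneficial = gain > THRESHOLD            # step 4: compare gain to threshold

print(f"baseline={baseline_acc:.3f} engineered={engineered_acc:.3f} "
      f"beneficial={is_beneficial}")
```

The same train/test split is used for both runs so that the difference in accuracy can be attributed to the feature rather than to the sampling.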
Before the main feature engineering work, we need two preparatory steps:
1. Feature understanding: know which fields the dataset contains.
2. Feature improvement: preprocess the data in those fields.
- Feature understanding
1. Is the data structured (tables) or unstructured (text, speech, video, audio)?
2. Field types: numeric, categorical, ordinal, binary.
3. Exploratory data analysis (Exploratory Data Analysis), to get an overall picture of the data:
(1) Descriptive statistics (Descriptive Statistics): number of distinct values, number of null values, distribution of category values, maximum, minimum, mean, standard deviation, outliers, and other items of a data quality report.
(2) Data visualization (Data Visualization): combine various charts (pie chart, bar chart, histogram, scatter plot, etc.), which can be plotted against the target field.
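A data quality report of the kind described in (1) could be produced roughly as follows. This is a minimal sketch using only the standard library; the rows and the `quality_report` helper are hypothetical and are not taken from the dataset discussed in this post.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical sample rows; None marks a null value.
rows = [
    {"age": 25, "sex": "F", "income": 1800.0},
    {"age": 41, "sex": "M", "income": 3200.0},
    {"age": 33, "sex": "M", "income": None},
    {"age": 58, "sex": "f", "income": 2500.0},
    {"age": 47, "sex": "F", "income": 4100.0},
]


def quality_report(rows, field):
    """Summarize one field: null count, distinct count, type-specific stats."""
    values = [row[field] for row in rows]
    present = [v for v in values if v is not None]
    report = {
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric field: range, mean, and sample standard deviation.
        report.update(min=min(present), max=max(present), mean=mean(present))
        if len(present) > 1:
            report["std"] = stdev(present)
    else:
        # Category field: how often each value occurs.
        report["counts"] = dict(Counter(present))
    return report


for field in ("age", "sex", "income"):
    print(field, quality_report(rows, field))
```

In this made-up sample the sex field has three distinct values because of inconsistent casing, which is the kind of issue a distinct-value count is meant to surface.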
Case study: microcredit dataset
The microfinance dataset contains 1551 customer records.
Each customer record contains one target field (Target Attribute) and 10 input fields (Input Attribute).
8 fields are categorical and 2 fields are numeric.
This project divides customers into two classes:
1. Customers who will take a microloan (will respond): 167 records.
2. Customers who will not take a microloan (will not respond): 1384 records.
Field 01: age (numeric field): age
Field 02: sex (category field): gender
Field 03: region (category field): residential area
Field 05: income (numeric field): monthly income
Field 06: children (category field): number of children in the family
Field 07: car (category field): whether the customer owns a car
Field 08: save_act (category field): whether the customer has a savings account
Field 09: current_act (category field): whether the customer has a current account
Field 10: mortgage (category field): whether the customer has a mortgage

Descriptive statistics: data quality report
Table 1: overall summary
The number of distinct values for the sex field looks suspicious and may indicate a data problem.
Table 2: numeric fields
Compare the maximum and minimum values with the last two columns to check for outliers.
Table 3: characteristics of the categorical data
Sex seems to have little relationship with the target field.
Different residential areas have slightly different loan ratios (more important than sex).
Married customers seem to have a greater demand for loans.
The number of children is related to whether a customer takes a loan (important); customers without children may feel less family responsibility.
Whether the customer owns a car seems to have little relationship with taking a loan.
Current account: check whether it is related to taking a loan.
Age shows a downward trend.
Income also shows a downward trend.
Feature improvement:
Improve the features on the basis of feature understanding.
Data cleaning: handling of wrong values, null values, and outliers (explained previously).
Data encoding, data standardization (Data Standardization), and type conversion:
Z-score, Min-Max, and encoding of categorical and ordinal fields (explained previously).
Data generalization and data normalization (scale each vector to length 1; used for text analysis, explained in Part 3).
Structuring unstructured data (explained in Part 3).
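As a reminder of the two standardization methods named above, here is a minimal sketch using only the standard library. The `incomes` values are hypothetical, and the Z-score uses the population standard deviation, which is one common convention rather than the only one.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale so the result has mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]


incomes = [1800.0, 2500.0, 3200.0, 4100.0]  # hypothetical monthly incomes
print(z_score(incomes))
print(min_max(incomes))
```

Min-Max keeps every value inside [0, 1] but is sensitive to extreme values, while Z-score is unbounded and expresses each value as a distance from the mean in standard deviations.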
Normalization of (non-text) data:
L2: the Euclidean length (L2 norm) of the data vector equals 1, that is, the square root of the sum of the squared components is 1.
L1: the sum of the absolute values of the components is 1.
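Both normalizations can be sketched directly from their definitions. The two-component vector below is only an illustration, and the sketch assumes the vector is not all zeros.

```python
import math


def l2_normalize(vector):
    """Divide by the Euclidean length so the squared components sum to 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def l1_normalize(vector):
    """Divide by the sum of absolute values so the absolute values sum to 1."""
    norm = sum(abs(x) for x in vector)
    return [x / norm for x in vector]


v = [3.0, 4.0]  # illustrative vector with L2 norm 5 and L1 norm 7
print(l2_normalize(v))
print(l1_normalize(v))
```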
- Scope of feature engineering
Feature construction: build new features and explore the relationships between features.
Sources: external data, data exploration, expert experience, data analysis, and feature construction methods.
Feature selection: keep the useful features and say no to the bad ones.
Methods: statistical methods, highly correlated features, model-based methods (random forest, decision tree), and recursive feature selection (stepwise regression and so on).
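As one example of a statistical method for feature selection, the sketch below ranks fields by the absolute Pearson correlation with the target and keeps those above a cutoff. The data, the `select_features` helper, and the 0.3 cutoff are all hypothetical; model-based or recursive selection would look different.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes neither list is constant


def select_features(columns, target, min_abs_corr):
    """Return field names whose |correlation| with the target meets the cutoff."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= min_abs_corr]


# Hypothetical data: target is 1 when the customer responded.
target = [1, 1, 1, 0, 0, 0, 0, 0]
columns = {
    "children": [2, 1, 3, 0, 0, 1, 0, 0],
    "car": [1, 0, 1, 0, 1, 0, 1, 1],
}

print(select_features(columns, target, min_abs_corr=0.3))
```

A filter like this looks at each field in isolation, so it can miss fields that are only useful in combination with others.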
Feature transformation (relies on assumptions, e.g. PCA): use mathematical methods (simple addition, subtraction, multiplication, and division, principal component and factor analysis, etc.) to combine old fields (which may be poor features) into new features and extract the latent structure hidden in the data.
Linear (PCA, matrix factorization such as NMF, SVD, TSVD, LDA)
Nonlinear (Kernel PCA, t-SNE, neural networks)
Two linear transformations
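To make the idea of a linear feature transformation concrete, here is a minimal PCA sketch built on NumPy's SVD. The two-column data is synthetic and the `pca` helper is written for this illustration; it is not the implementation of any particular library.

```python
import numpy as np


def pca(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ components.T, components, ratio[:n_components]


# Synthetic data: the second column is almost a multiple of the first,
# so a single component should capture nearly all of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, 2.0 * base + 0.1 * rng.normal(size=(200, 1))])

projected, components, ratio = pca(X, n_components=1)
print(projected.shape, ratio)
```

Here two strongly related old fields are merged into one new feature, which is the "combine old fields into new features" step described above.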
Feature learning (no assumptions required): use deep learning to learn new features automatically.
Feature learning based on association rules, neural networks, and deep learning; text feature learning based on word embeddings.
Use AI to improve AI.