Data Engineering Series, Lecture 3: Feature Engineering for Data-Centric AI
2022-06-22 00:35:00 【Amazon cloud developer】

Preface
In Lecture 2 of the Data-centric AI feature engineering series, we introduced three sub-steps of feature preprocessing: handling class imbalance among samples, discretizing continuous features, and encoding categorical features into numerical values. Today we continue with the remaining feature preprocessing steps and the other steps of feature engineering.
Feature scaling in feature preprocessing
When the value ranges of different features differ by orders of magnitude, the features with the larger magnitudes will dominate the model. Feature scaling is a technique for unifying the range of variation of feature values in a data set, and it mainly targets continuous features.
Let's look at what feature scaling does (assuming the value ranges of the two features differ by an order of magnitude), as shown below:

The left figure above shows the contour lines of the objective function over the two model parameters of a model with two features and no feature scaling; the right figure shows the same after feature scaling. The optimization path toward the optimum in the right figure is clearly smoother and converges to the optimum more easily; in other words, after feature scaling, gradient descent oscillates less and converges faster. The drawback of feature scaling is that it can distort features and may lose some information that is important for training the model.
Since feature scaling has drawbacks, do not scale features right away (simplicity first; this can be treated as a general rule of feature engineering, because any feature processing step may introduce some noise). Try scaling only when the model turns out to perform poorly.
If the value ranges of all features are of the same order of magnitude, skip feature scaling for now. Then decide based on whether dimensionality reduction is needed: if you choose PCA (principal component analysis) for dimensionality reduction, only centering is required at this step (subtract the mean from each feature value, so the new feature has mean 0) and no other scaling is needed; if you choose LDA (linear discriminant analysis), neither centering nor any other scaling is required at this step.
If you choose a tree-based model, which is insensitive to the scale/magnitude of the data, skip this step as well; otherwise, scale the original features or the new features obtained after dimensionality reduction at the end. Note that deep learning models are sensitive to the scale of feature values (because they learn parameters with gradient descent), so feature scaling is usually required when modeling with deep learning.
Common feature scaling methods:
| Method | Description | Formula |
| --- | --- | --- |
| Z-score standardization | Transforms the feature into a new feature with mean 0 and variance 1. | x' = (x - mean) / std |
| Logarithmic (log) transform | Compresses the range of variation of the feature values (a method used in many real projects). Conversely, when a continuous feature varies only very slightly, an exponential transform can be used to amplify the variation a little. | log(x + delta), where x must be >= 0; delta is a smoothing term that prevents log(0) from becoming negative infinity when x is 0, and is usually a very small number such as 1e-7. |
| Interval scaling (also called normalization) | Uses the two extreme values (the maximum and the minimum) to scale the feature into the [0, 1] interval. | x' = (x - min) / (max - min) |
| Quantile-based normalization | If the training data contains abnormal values and still needs to be normalized, consider a scaler that is more robust to outliers, such as RobustScaler in scikit-learn. | x' = (x - median) / IQR |
More about RobustScaler:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
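The scaling methods in the table can be tried with a few lines of scikit-learn. Below is a minimal sketch; the toy DataFrame and its column names are illustrative assumptions, not data from this article.

```python
# Minimal sketch of the scaling methods above; the toy DataFrame and its
# column names are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X_train = pd.DataFrame({"price": [12.0, 15.5, 990.0, 8.2],
                        "clicks": [3, 120, 7, 55]})

# Z-score standardization: mean 0, variance 1 per feature.
z_scaled = StandardScaler().fit_transform(X_train)

# Interval scaling (min-max normalization) to [0, 1].
minmax_scaled = MinMaxScaler().fit_transform(X_train)

# Quantile-based normalization, robust to outliers (uses median and IQR).
robust_scaled = RobustScaler().fit_transform(X_train)

# Log transform to compress a wide value range; delta guards against log(0).
delta = 1e-7
log_price = np.log(X_train["price"] + delta)
```

Whichever scaler is chosen, fit it on the training set only and call `transform` with the fitted object on validation/test data.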
In practice, we often face the choice between standardization and normalization. If a feature is stable, with no extreme maximum or minimum, use normalization. If a feature has outliers or a lot of noise, use standardization (centering indirectly reduces the influence of outliers and noise) or quantile-based normalization. The best approach is to run an online A/B test and compare the results.
Feature generation
The purpose of feature generation is to inject human prior knowledge into the model through features. Feature generation can run through the whole life cycle of an ML project: at the beginning, raw features are extracted and processed from various logs (for example, user click logs) and database tables (such as user attribute tables and item attribute tables). Then it may be necessary to explore or analyze the raw features and generate new features to replace them or to add to them.
The features obtained after data exploration/analysis may still be too few; at that point, consider adding new features. If underfitting is observed after training, that is another moment to add new features (in this article feature generation is placed after feature preprocessing, but there is no fixed order between the two).
Prefer features that can be directly observed or collected/extracted over generated features, and consider feature generation only when model performance is poor. For clustering tasks, generate as few features as possible: clustering has no explicit target, while generated features are purposeful, so the two conflict. For classification and regression tasks, relatively more features can be created: their targets are explicit, so human understanding and knowledge of the task can be modeled as features to help reach the goal.
Common feature generation methods:
| Method | Description | When to use |
| --- | --- | --- |
| Statistical features within a time window | Historical statistics: e.g. the number of clicks in the past week, an item's clicks in the last day, a user's cumulative spending over a month. Aggregated statistics: e.g. the min, max, and average of an item's daily click-through rate over a week. Grouped statistics: the grouped median, e.g. the median salary of employees in a company; the grouped arithmetic mean, e.g. the average amount a customer spends per purchase; the grouped mode over a period, e.g. the mode of purchases of a certain type of customer within a week. See the pandas sketch after this table. | During raw feature extraction |
| New features from semantic addition, subtraction, multiplication, and division between features | Subtraction: e.g. the difference between a house's construction time and purchase time gives the age of the house at purchase; the difference between a game user's logout and login times gives the length of a single session. Addition: e.g. summing the sales of items in the same category gives the total sales of that category. Multiplication: e.g. multiplying the selling price by the conversion rate gives a new revenue-style feature. Division: e.g. dividing clicks by page exposures gives the click-through rate (both CTR and click counts can serve as features; CTR better reflects an item's popularity). Note: after creating new features from feature semantics, remember to remove possibly redundant features and keep only the most useful ones. See the pandas sketch after this table. | During raw feature extraction |
| Combine or modify existing features in a human-understandable way | E.g. turn a timestamp-based time feature into a categorical feature with values such as morning, forenoon, afternoon, and evening and replace the original time feature (replacing the original feature), or bucket it into the 24 hours of the day as a discrete feature (replacing the original feature). E.g. convert the two longitude/latitude features into country, province, and city features, either added as new features or replacing the longitude/latitude features. E.g. convert the user's last-login timestamp into the interval, in days, from the current time (adding a new feature). | During data exploration/analysis |
| Merge sparse categories of a categorical feature | If some categories of a categorical feature contain only a few samples, consider merging them into one big "Other" category. There are exceptions: in a dating app, some "city" values such as "Hefei" may have few samples, but merging these small-sample cities into "Other" works against the fact that people in the same city are more likely to match, so you may prefer not to merge. | During data exploration/analysis |
| Add prior knowledge of the business to the data set as new features | E.g. some clothes sell differently in different seasons, so a season feature should be added. E.g. the user's consumption level may help a churn-prediction task, so add a consumption-level feature. | During data exploration/analysis |
| Construct new features with machine learning algorithms or statistical analysis | E.g. extract hidden features with clustering algorithms or topic models; use GBDT and the indices of the leaf nodes output by each tree as new features; use statistical analysis or machine learning methods to build profile features for users and items. | When features are insufficient or the model underfits |
| Cross features | scikit-learn provides PolynomialFeatures to build polynomial features including cross features; some deep learning models, such as Deep & Cross Network, can generate cross features automatically. See the example after the links below. | When features are insufficient or the model underfits |
| Use tools that automatically generate features, such as tsfresh and featuretools | The problem with automatic feature generation tools is that they can lead to a "feature explosion" (few people use such tools in real projects). | When features are insufficient or the model underfits |
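To make the first two rows of the table concrete, here is a hedged pandas sketch of time-window statistics and arithmetic feature combinations. Every column name (`user_id`, `item_id`, `clicks`, `exposures`, `ts`) is a hypothetical placeholder, not part of the original article.

```python
# Hypothetical click-log table; all column names are illustrative assumptions.
import pandas as pd

logs = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 2],
    "item_id":   [10, 11, 10, 10, 12],
    "clicks":    [3, 1, 5, 2, 4],
    "exposures": [30, 20, 40, 25, 35],
    "ts": pd.to_datetime(["2022-06-01", "2022-06-03", "2022-06-02",
                          "2022-06-05", "2022-06-06"]),
})

# Time-window statistics: per-item min/max/mean of clicks over the last 7 days.
recent = logs[logs["ts"] >= logs["ts"].max() - pd.Timedelta(days=7)]
item_stats = recent.groupby("item_id")["clicks"].agg(["min", "max", "mean"])

# Feature division: click-through rate = clicks / exposures.
logs["ctr"] = logs["clicks"] / logs["exposures"]

# Feature subtraction: days elapsed since each event, relative to the latest timestamp.
logs["days_since"] = (logs["ts"].max() - logs["ts"]).dt.days
```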
PolynomialFeatures:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
Deep & Cross network:
https://arxiv.org/abs/1708.05123
tsfresh:
https://tsfresh.readthedocs.io/en/latest/
featuretools:
https://www.featuretools.com/
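A minimal example of the cross-feature row above, using the PolynomialFeatures class linked here; the toy matrix is an assumption, and `get_feature_names_out` requires a reasonably recent scikit-learn.

```python
# Build cross (interaction) features with scikit-learn's PolynomialFeatures.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# interaction_only=True keeps only products of distinct features (x0 * x1),
# dropping pure powers; include_bias=False drops the constant column.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_cross = poly.fit_transform(X)
print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0 x1']
```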
Feature selection
Feature selection is the process of selecting a subset of presumably important features from a given feature set.
First, real tasks may run into the curse of dimensionality. This is often because high-cardinality categorical features were one-hot encoded, but it can also be caused by feature crossing; in that case, reconsider whether crossing those two features is reasonable, and consider letting a deep learning model do the feature crossing instead.
If the important features can be selected so that subsequent learning only has to run on a subset of the features, the curse of dimensionality is greatly alleviated. In this sense, feature selection has a motivation similar to feature dimensionality reduction; in fact, they are the two mainstream techniques for handling high-dimensional data in traditional machine learning.
In current real-world ML projects, apart from doing some feature selection in the data exploration stage based on the variance of each feature, the correlation between pairs of features, and the correlation between features and the target variable, other feature selection and dimensionality reduction methods are rarely used. This may be because real projects have relatively few raw features, and there are better ways to handle high-cardinality categorical features (for example, using LightGBM or deep learning models).
For the completeness of the feature engineering methodology, we still briefly introduce feature selection and feature dimensionality reduction, but keep in mind that they are not the focus in real projects.
Second, removing irrelevant features often makes the learning task easier.
Feature selection may reduce the predictive power of the model (the eliminated features may contain useful information, and discarding it may reduce prediction accuracy to some extent); this is a trade-off between computational complexity and predictive power.
The workflow of using feature selection:

Feature selection methods (except for the variance-based method, the methods listed here require labeled data):
| Method | Description |
| --- | --- |
| Filter methods | Score each feature by its divergence or its correlation, and select features with a threshold. Either compute a statistic of the feature itself, e.g. the variance-based method (only suitable for continuous features), or compute the correlation between a single feature and the target variable, e.g. the Pearson correlation coefficient, the chi-square test, or mutual information. |
| Wrapper methods | Use a machine learning model and several rounds of training to select the optimal feature subset. For example, recursive feature elimination (RFE) trains a model for several rounds, eliminates some unimportant features after each round, and then trains on the new feature set; the LVW (Las Vegas Wrapper) algorithm searches feature subsets with a random strategy and uses the error of the final learner as the evaluation criterion for a subset. |
| Embedded methods | Train a machine learning model first, then select features according to the importance or weights obtained after training. For example, L1-regularization-based feature selection (L1 regularization tends to drive weights to 0), or feature selection based on tree algorithms. |
The difference between filter and wrapper methods is that filter methods do not consider the subsequent learner, while wrapper methods directly use the performance of the learner that will eventually be used as the evaluation criterion of a feature subset; the goal is to select the feature subset that is best customized for that learner.
The advantage of wrapper methods is that they optimize directly for the specific learner, so in terms of final learner performance they are usually better than filter methods; the disadvantage is that the learner has to be trained many times, so the computational overhead is usually much larger than that of filter methods.
In addition, in wrapper methods the machine learning model used for feature selection is the same as the model that will finally be trained, whereas in embedded methods the model used for feature selection can be different from the model that is finally trained, as illustrated in the sketch below.
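Here is a sketch of one representative from each family on a toy scikit-learn dataset; the dataset, learners, and hyperparameters are illustrative choices, not recommendations from this article.

```python
# Filter (chi2), wrapper (RFE), and embedded (L1-based) selection on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature against the target, independently of any learner.
X_filter = SelectKBest(chi2, k=10).fit_transform(X, y)

# Wrapper: recursively eliminate the least important features of a given learner.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapper = rfe.fit_transform(X, y)

# Embedded: train once with L1 regularization and keep the non-zero-weight features.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
X_embedded = SelectFromModel(l1_model).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```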
Feature dimensionality reduction
At this step, if the feature dimensionality is too high, computation becomes heavy, training takes long, the model capacity becomes too large, and the model overfits easily. First reconsider whether such a high dimensionality is reasonable: is it the result of one-hot encoding, and could an embedding be used instead? Only then consider reducing the feature dimensionality.
Feature dimensionality reduction rests on an assumption: although the observed or collected data samples are high-dimensional, what is closely related to the learning task may lie on a low-dimensional distribution, i.e. a low-dimensional "embedding" inside the high-dimensional space. Current mainstream dimensionality reduction methods all target continuous features.
There is no good direct metric for the effect of dimensionality reduction. Usually we compare the performance of the learner before and after reduction; if model performance improves (especially the online performance), the reduction probably helped. You can also reduce the data to two or three dimensions and judge the effect through visualization.
As for how many dimensions to keep, treat the target dimensionality as a hyperparameter: run PCA with different dimensionalities and compare the learner's performance after reduction (especially the online performance); the dimensionality with the best performance is probably the one to use. Alternatively, set it according to the percentage of the original information (variance) to retain, as in the PCA sketch further below.
Common dimensionality reduction methods can be categorized as follows:

In general, choose a non-generative method for dimensionality reduction; when new samples need to be generated, a generative method is the better or only option. PCA is the most widely discussed dimensionality reduction algorithm, so let us briefly introduce the PCA and LDA algorithms.
The principle of PCA:
Replace the original n features with a smaller number m of new features, each a linear combination of the old ones, and try to make the m new features mutually uncorrelated. The objective is to maximize the variance after projection (i.e. the samples are most spread out along the projection directions). From another perspective, the objective is to minimize the reconstruction error, i.e. the distance from the low-dimensional representation back to the original space is smallest.
PCA makes two assumptions. First, different features may contain redundant information; by linearly combining the original features, some redundant or unimportant features can be removed while the important ones are retained. Second, the reconstruction error follows a Gaussian distribution (PCA does not assume that the data itself is Gaussian; for ease of computation, probabilistic PCA assumes that both the data and the latent variables are Gaussian).
After PCA is fitted on the training set, it produces a projection matrix and stores the mean of each feature of the training set. When reducing the dimensionality of new data (validation set, test set, and future data), it first centers the data with the stored feature means and then multiplies by the projection matrix to obtain the reduced representation, as in the sketch below.
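A minimal PCA sketch of that behavior; the digits dataset and the 95% variance threshold are illustrative assumptions. Fit on the training split only, which stores the training means and the projection matrix, then reuse the same transform on new data.

```python
# Fit PCA on the training split only, then reuse it on the test split.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A float n_components keeps enough components to retain that share of the variance.
pca = PCA(n_components=0.95)
X_train_low = pca.fit_transform(X_train)  # learns the training means and projection matrix
X_test_low = pca.transform(X_test)        # centers with the stored means, then projects

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```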
The principle of LDA:
Project labeled data points into a lower-dimensional space so that the projected points form clusters: points of the same class end up closer to each other in the projected space.
The goal is to make points within a class as close as possible (compact) and points of different classes as far apart as possible. LDA does not assume that the sample data is Gaussian when reducing dimensionality.
PCA and LDA are similar in that both reduce the dimensionality of the data, both rely on matrix eigendecomposition, both are only suitable for linear settings, and both are not well suited to reducing the dimensionality of non-Gaussian samples (even though neither algorithm assumes Gaussian data when reducing dimensionality).

The differences between PCA and LDA:
| Difference | PCA | LDA |
| --- | --- | --- |
| Starting point | PCA starts from the covariance of the features and chooses the projection directions with the maximum variance of the sample points. | LDA also considers the class label information and chooses the directions that best separate the classes. |
| Learning paradigm | PCA is unsupervised learning. | LDA is supervised learning; besides dimensionality reduction it can also classify (although LDA is mainly used for dimensionality reduction). |
| Dimensions available after reduction | PCA can keep at most N dimensions (N is the feature dimensionality before reduction). | LDA can produce a subspace of at most C-1 dimensions (C is the number of classes of the classification task). |
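A small LDA sketch illustrating the last two rows of the table: LDA needs labels, and the reduced dimensionality is at most C-1. The iris dataset, with C = 3 classes, is an illustrative choice.

```python
# Supervised dimensionality reduction with LDA; at most n_classes - 1 components.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)  # iris has 3 classes, so at most 2
X_low = lda.fit_transform(X, y)                   # labels are required, unlike PCA
print(X_low.shape)                                # (150, 2)
```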
Summary
During the development of a machine learning project, besides the model itself, algorithm engineers should spend more time on feature engineering and sample engineering. When the online performance is poor, check the samples and features first; that is, adjust the samples and features before adjusting the model, and before switching to a more complex model, make sure feature engineering has been done properly.
This concludes the introduction to feature engineering in Data-centric AI. Over three lectures we covered the concepts of feature engineering, the processing steps for features, and the related practical knowledge.
By now, I believe you have a deeper sense of the importance of feature engineering. The counterpart of feature engineering is sample engineering; we will introduce sample engineering in Data-centric AI next. Thank you for reading.
Author of this article

Liang Yuhui
Amazon cloud technology
Machine learning product technologist
Responsible for consulting on and designing machine learning solutions based on Amazon cloud technology, focusing on the promotion and application of machine learning, and deeply involved in building and optimizing machine learning projects for many real customers. Rich experience in deep learning models, distributed training, recommender systems, and computational advertising.

