[pytorch basic tutorial 29] DIN model
Learning summary
- Most loss functions in the ranking stage of a recommender system are binary cross-entropy losses, but many recall models are not: the sampled softmax loss is also common for recall models.
- During training, even with a fixed seed the model loss can fluctuate a lot. Possible causes: early stopping kicks in too early, the batch_size is relatively small (making individual batches unbalanced), or the learning rate is too large.
- DIN uses a local activation unit structure, which uses the correlation between the candidate item and each item in the user's historical behavior to compute a weight; this weight represents how important each historical item is for predicting the click on the current candidate ad.
- In the rechub project this activation unit is an MLP. Attention is essentially a weighted average; the MLP produces the weighted score (roughly W·x, where W is the weight), and adding the softmax constraint on top of the MLP (making the weights sum to 1) is what gives attention.
- Attention comes in many forms, e.g. the Transformer (dot-product form) and DIN (MLP form). As long as you end up with an attention coefficient it works: back-propagation will learn appropriate weights. A small sketch contrasting the two score forms follows this list.
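A minimal sketch (not from the original post) contrasting the two score forms just mentioned: a Transformer-style dot-product score and a DIN-style MLP score. The sizes, the tiny scoring MLP and the final softmax are all illustrative.

import torch
import torch.nn as nn

D, H = 8, 5
hist = torch.randn(H, D)                 # embeddings of H historical items
cand = torch.randn(D)                    # candidate item embedding

# Transformer-style: dot-product score, one coefficient per historical item
dot_scores = hist @ cand                 # (H,)

# DIN-style: a small MLP scores each (history, candidate) pair
mlp = nn.Sequential(nn.Linear(2 * D, 16), nn.ReLU(), nn.Linear(16, 1))
pair = torch.cat([hist, cand.expand(H, D)], dim=-1)   # (H, 2D)
mlp_scores = mlp(pair).squeeze(-1)       # (H,)

# either set of scores can weight the history; softmax normalization is optional
interest = (mlp_scores.softmax(dim=0).unsqueeze(-1) * hist).sum(dim=0)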
1. Data feature representation
1.1 Feature representation
Industrial CTR prediction datasets are generally in multi-group categorical form, i.e. categorical features are the most common. Such a dataset usually looks like this:

The key part is the framed feature in the figure, which contains rich user-interest information.
- For feature encoding, the author gives an example: [weekday=Friday, gender=Female, visited_cate_ids={Bag,Book}, ad_cate_id=Book]. Normally such a record is one-hot encoded into sparse binary features.
- But notice visited_cate_ids, the user's historical item-category list. For a single user this is a multi-valued feature, and its length differs between users because different users have bought different numbers of items. For such a feature we usually use multi-hot encoding: more than one position can be 1, and every category that appears gets a 1 at its position. The encoded data then goes into the model, roughly like the sketch below:
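A minimal encoding sketch with a made-up category vocabulary; it is for illustration only, not the original post's figure or pipeline.

import torch

# hypothetical category vocabulary, just for illustration
cate_vocab = {"Bag": 0, "Book": 1, "Clothes": 2, "Phone": 3}

def multi_hot(cates, vocab):
    """Encode a variable-length list of categories as a fixed-length 0/1 vector."""
    vec = torch.zeros(len(vocab))
    for c in cates:
        vec[vocab[c]] = 1.0
    return vec

# visited_cate_ids={Bag, Book}  ->  multi-hot, more than one position is 1
print(multi_hot(["Bag", "Book"], cate_vocab))   # tensor([1., 1., 0., 0.])
# ad_cate_id=Book               ->  one-hot, exactly one position is 1
print(multi_hot(["Book"], cate_vocab))          # tensor([0., 1., 0., 0.])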

The features above contain no interaction or combination terms, i.e. no feature crossing; that interaction information is left for the downstream neural network to learn.
DIN's input features fall roughly into three classes: Dense (continuous), Sparse (categorical) and VarlenSparse (variable-length categorical, i.e. the historical behavior data above). The different types also determine how they are processed later:
- Dense features: these are numerical, so each gets an Input layer to receive it; they are concatenated together and, after the categorical side has been processed, concatenated with it and fed into the DNN.
- Sparse features: each categorical feature gets an Input layer, then an embedding layer that turns it into a low-dimensional dense vector; these vectors are concatenated and, after the variable-length side has been processed, everything is concatenated and fed into the DNN. One embedding needs special attention: the candidate item's embedding vector, which is used later to weight the historical behavior sequence.
- VarlenSparse features: these are usually the user's historical behaviors, i.e. variable-length data. They are first padded to equal length, then given an Input layer, then passed through an embedding layer to obtain the embedding vectors of the historical behaviors. Those vectors, together with the candidate item embedding above, go into an AttentionPoolingLayer that weights and merges the historical behavior features into the final output.
In the torch-rechub project, create_seq_features handles the corresponding historical sequences; a minimal padding sketch follows.
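A minimal padding sketch, assuming the raw history is a list of item-id lists of different lengths; the function name and the truncate-then-pad scheme are illustrative, not torch-rechub's actual create_seq_features.

import torch

def pad_histories(histories, max_len, pad_id=0):
    """Left-truncate / right-pad variable-length item-id histories to max_len."""
    padded = torch.full((len(histories), max_len), pad_id, dtype=torch.long)
    for i, hist in enumerate(histories):
        hist = hist[-max_len:]                      # keep the most recent items
        padded[i, :len(hist)] = torch.tensor(hist)  # right-pad the rest with pad_id
    return padded

histories = [[3, 7, 2], [5], [8, 1, 4, 9, 6]]
print(pad_histories(histories, max_len=4))
# tensor([[3, 7, 2, 0],
#         [5, 0, 0, 0],
#         [1, 4, 9, 6]])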
2. Deep Interest Network (DIN): adding attention
DIN's application scenario is Alibaba's most typical e-commerce advertising recommendation, where there is a lot of historical user behavior (purchased items or their category information). For items with paid advertising, Alibaba predicts the click-through rate with the model and recommends suitable ad items to suitable users, so DIN is essentially a CTR prediction model.
Figure 1 below is DIN's base model (Base Model). As you can see, the Base Model is a typical Embedding + MLP structure. Its input features are user profile features (User Profile Features), user behavior features (User Behaviors), candidate ad features (Candidate Ad) and context features (Context Features).
2.1 User behavior features and candidate ad features
User profile features and context features were covered earlier; here, pay attention to the user behavior features and candidate ad features in the colored part of the figure above:
(1) The user behavior features consist of a sequence of items the user has bought, i.e. Goods 1 through Goods N in the figure. Each item carries three sub-features, the three colored dots in the figure: red is the item ID, blue is the shop ID, and pink is the item category ID.
(2) The candidate ad features contain the same three ID-type sub-features, because the candidate ad here is also an item on Alibaba's platform.
In deep learning, when we encounter ID-type features we generally build an Embedding for them, then concatenate the Embedding with the other features and feed it into the MLP that follows.
Alibaba's Base Model does exactly that: it converts the three IDs into their corresponding Embeddings, then concatenates those Embeddings to form the current item's Embedding.
2.2 Aggregating the user behavior sequence
The user's behavior sequence is a sequence of items and can be long or short, but the input dimension of a neural network must be fixed. So how do we turn this group of item Embeddings into a single fixed-length Embedding? As the SUM Pooling layer in Figure 1 shows, the base model simply adds the item Embeddings together element-wise (vector accumulation), then concatenates the summed Embedding with all the other features and feeds the result into the MLP. A minimal sum-pooling sketch follows.
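A minimal sum-pooling sketch, assuming a padded history of item ids and an nn.Embedding table; padding_idx and the sizes are illustrative details, not taken from the paper.

import torch
import torch.nn as nn

emb = nn.Embedding(1000, 8, padding_idx=0)      # item-id vocabulary of 1000, dim 8
hist = torch.tensor([[3, 7, 2, 0],              # (batch=2, seq_len=4), 0 = padding
                     [5, 0, 0, 0]])

hist_emb = emb(hist)                            # (2, 4, 8)
sum_pooled = hist_emb.sum(dim=1)                # (2, 8): every item weighted equally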
【Deficiencies of SUM Pooling】
The SUM Pooling superposition of Embeddings treats all historical behaviors equally: everything is added up with no emphasis, which does not match how we actually shop.
For example, suppose the candidate ad's item is a "keyboard", while the user's historical behavior sequence contains the item IDs "mouse", "T-shirt" and "facial cleanser". From common shopping sense, the "mouse" history item should matter far more for predicting the CTR of the "keyboard" ad than the other two. From an attention perspective, when buying a keyboard we pay more attention to having bought related items such as a "mouse", because those purchase experiences help us make a better decision.
【Modules of the base model】
- Embedding layer: transforms the high-dimensional sparse input into low-dimensional dense vectors. Each categorical feature has its own embedding dictionary of dimension $D \times K$, where $D$ is the dimension of the latent vector and $K$ is the number of unique values of that feature. To make this concrete, take the weekday feature above:
Suppose a user's weekday feature is Friday. One-hot encoded it is $[0,0,0,0,1,0,0]$. If we assume the latent dimension is $D$, the embedding dictionary for this feature is a $D \times 7$ matrix (each column is one embedding; the 7 columns are 7 embedding vectors, one for Monday through Sunday). Passing the user's one-hot vector through the embedding layer yields a $D \times 1$ vector, namely the embedding corresponding to Friday; the computation is simply $\text{embedding matrix} \times [0,0,0,0,1,0,0]^{T}$.
In other words, the embedding layer takes out the column of the embedding matrix at the position where the one-hot vector is 1. That is how sparse features get their dense vectors. The same holds for the other categorical features, except that the multi-hot encoded one above yields a list of embedding vectors: its multi-hot vector has more than one 1, so multiplying it by the embedding matrix gives a list. After this layer, all the input features have their corresponding dense embedding vectors. A small lookup sketch follows.
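A minimal sketch showing that an nn.Embedding lookup is the same as multiplying the one-hot vector by the embedding matrix; the weekday vocabulary of 7 follows the example above, and D=4 is an example value.

import torch
import torch.nn as nn

D = 4                                   # latent dimension (example value)
emb = nn.Embedding(7, D)                # 7 weekdays, embedding dim D

friday = torch.tensor(4)                # index of Friday (0-based)
one_hot = torch.zeros(7)
one_hot[4] = 1.0

lookup = emb(friday)                    # (D,)  direct table lookup
matmul = one_hot @ emb.weight           # (D,)  one-hot times embedding matrix
print(torch.allclose(lookup, matmul))   # True: the two are the same vector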
pooling layer and Concat layer:
- The pooling layer turns the user's historical behavior embeddings into a single fixed-length vector. Each user has bought a different number of items, i.e. the number of 1s in each user's multi-hot vector differs, so after the embedding layer the number of historical behavior embeddings differs too: the embedding lists $t_i$ above have different lengths, and each user's historical behavior feature is therefore of different length. The fully connected network that follows needs a fixed-length input, so a pooling layer is used first to bring the user's historical behavior embeddings to a uniform fixed length:
$e_i = \text{pooling}(e_{i1}, e_{i2}, \ldots, e_{ik})$
Here $e_{ij}$ is one of the user's historical behavior embeddings and $e_i$ is the resulting fixed-length vector; $i$ denotes the $i$-th historical feature group (one historical behavior field, e.g. historical item ids or historical item category ids), and $k$ is the number of items the user bought in that feature group, i.e. the number of historical embeddings. The user behaviors part of the figure above shows exactly this process.
- The Concat layer does the splicing: it takes the embedding vectors of all these features (plus the continuous features, if any), concatenates them along the feature dimension, and uses the result as the input of the MLP.
MLP: ordinary fully connected layers, used to learn the various interactions among the features.
For the binary-classification CTR task, the loss function is generally the negative log-likelihood:
$L=-\frac{1}{N} \sum_{(\boldsymbol{x}, y) \in \mathcal{S}}\left(y \log p(\boldsymbol{x})+(1-y) \log (1-p(\boldsymbol{x}))\right)$
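A minimal sketch showing that this negative log-likelihood is just PyTorch's binary cross-entropy; p and y are made-up example values.

import torch
import torch.nn.functional as F

p = torch.tensor([0.9, 0.2, 0.7])         # predicted click probabilities p(x)
y = torch.tensor([1.0, 0.0, 1.0])         # click labels

manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin = F.binary_cross_entropy(p, y)    # same negative log-likelihood
print(torch.allclose(manual, builtin))    # True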
Weaknesses of the base model:
- Everything is summed together, so you cannot tell which item in the user's history is more relevant to the current candidate item; the importance of each historical item for the current prediction is lost.
- Moreover, if all the historical items a user has browsed are eventually compressed, through embedding and pooling, into one fixed-length embedding, this limits the model's ability to learn the user's diverse interests.
Concrete improvement ideas:
- Enlarge the embedding dimension to increase the expressiveness of each item, so that even after being summed together the embedding still carries the user's interest information. But in large-scale production recommendation scenarios the computation cost is huge, so this is not an option.
- Introduce an attention mechanism between the current candidate ad and the user's historical behaviors, which is exactly what DIN does. When predicting whether the current ad will be clicked, the model then pays more attention to the historical items related to the current ad, i.e. the historical behaviors more relevant to the current item are the ones driving the user's click.
2.3 Applying the attention mechanism: DIN
(1) Improvements
So, on top of the base model, Alibaba applies the attention mechanism to the processing of the user's historical behavior sequence.
The concrete operation is shown in the figure below: DIN adds an activation unit (Activation Unit) for each of the user's historical purchases. The activation unit produces a weight, which is the user's attention score for that historical item.
Comparison with the previous base model:
(2) Activation unit (local activation unit)
The detailed structure of the activation unit is shown on the right of Figure 3 above:
Input: the Embedding of the current historical behavior item and the Embedding of the candidate ad item.
Operation: the two Embeddings, together with the result of their outer product, are concatenated into one vector, which is fed into the activation unit's MLP layer; this finally produces an attention weight.
(1) The activation unit is equivalent to a small deep-learning model: it uses the Embeddings of the two items to produce an attention weight that represents their degree of relevance.
(2) The Sparrow code does not strictly use the outer product: it uses element-wise subtraction and multiplication, then concatenates these with the two vectors themselves to form activation_all.
Wang Zhe's practical experience: the outer product does not help much, and it greatly increases the number of parameters.

The local activation unit weights the user's historical behavior embeddings according to their correlation with the current ad. Inside it is a feed-forward neural network whose input is one of the user's historical items and the current candidate item, and whose output is their correlation. This correlation acts as the weight of each historical item; multiplying the weights by the original historical behavior embeddings and summing gives the user's interest representation $\boldsymbol{v}_{U}(A)$:
$\boldsymbol{v}_{U}(A)=f\left(\boldsymbol{v}_{A}, \boldsymbol{e}_{1}, \boldsymbol{e}_{2}, \ldots, \boldsymbol{e}_{H}\right)=\sum_{j=1}^{H} a\left(\boldsymbol{e}_{j}, \boldsymbol{v}_{A}\right) \boldsymbol{e}_{j}=\sum_{j=1}^{H} w_{j} \boldsymbol{e}_{j}$
The symbols in the formula:
- $\left\{\boldsymbol{e}_{1}, \boldsymbol{e}_{2}, \ldots, \boldsymbol{e}_{H}\right\}$ are the embeddings of user $U$'s historical behaviors;
- $\boldsymbol{v}_{A}$ is the embedding vector of candidate ad $A$;
- $a\left(\boldsymbol{e}_{j}, \boldsymbol{v}_{A}\right)=w_{j}$ is the weight, i.e. the relevance of the historical item to the current ad $A$;
- $a(\cdot)$ is the feed-forward network above, i.e. the attention mechanism;
- besides the historical behavior vector and the candidate ad vector, the input also includes their outer product, which the authors describe as explicit knowledge that helps relevance modeling. A small weighted-sum sketch of the formula above follows.
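A minimal sketch of the weighted sum above, assuming the relevance scores a(e_j, v_A) are already available (here a plain dot product stands in for the activation-unit MLP); it shows only the pooling step, not the full DIN.

import torch

H, D = 4, 8
e = torch.randn(H, D)            # historical behavior embeddings e_1..e_H
v_A = torch.randn(D)             # candidate ad embedding

# stand-in for a(e_j, v_A): a dot product used as the relevance score
w = e @ v_A                      # (H,) one weight per historical item
# DIN does not normalize these weights with softmax
v_U = (w.unsqueeze(-1) * e).sum(dim=0)   # sum_j w_j * e_j  -> (D,)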
The ActivationUnit code in RecHub:
class ActivationUnit(torch.nn.Module):

    def __init__(self, emb_dim, dims=[36], activation="dice", use_softmax=False):
        super(ActivationUnit, self).__init__()
        self.emb_dim = emb_dim
        self.use_softmax = use_softmax
        # MLP with Dice(36) activation
        self.attention = MLP(4 * self.emb_dim, dims=dims, activation=activation)

    def forward(self, history, target):
        seq_length = history.size(1)
        target = target.unsqueeze(1).expand(-1, seq_length, -1)
        # Concat [target, history, target - history, target * history]
        att_input = torch.cat([target, history, target - history, target * history], dim=-1)
        # Dice(36)
        att_weight = self.attention(att_input.view(-1, 4 * self.emb_dim))
        # Linear(1)
        att_weight = att_weight.view(-1, seq_length)
        if self.use_softmax:
            att_weight = att_weight.softmax(dim=-1)
        # (batch_size, emb_dim)
        output = (att_weight.unsqueeze(-1) * history).sum(dim=1)
        return output
You can see that self.attention is assigned an MLP here:
class MLP(nn.Module):
    """Multi Layer Perceptron Module, it is the most widely used module for
    learning feature. Note we default add `BatchNorm1d`, `Activation` and
    `Dropout` for each `Linear` Module.

    Args:
        input_dim (int): input size of the first Linear Layer.
        output_layer (bool): whether this MLP module is the output layer. If
            `True`, then append one Linear(*,1) module.
        dims (list): output size of Linear Layer (default=[]).
        dropout (float): probability of an element to be zeroed (default = 0.5).
        activation (str): the activation function, support
            `[sigmoid, relu, prelu, dice, softmax]` (default='relu').

    Shape:
        - Input: `(batch_size, input_dim)`
        - Output: `(batch_size, 1)` or `(batch_size, dims[-1])`
    """

    def __init__(self, input_dim, output_layer=True, dims=[], dropout=0, activation="relu"):
        super().__init__()
        layers = list()
        for i_dim in dims:
            layers.append(nn.Linear(input_dim, i_dim))
            layers.append(nn.BatchNorm1d(i_dim))
            layers.append(activation_layer(activation))
            layers.append(nn.Dropout(p=dropout))
            input_dim = i_dim
        if output_layer:
            layers.append(nn.Linear(input_dim, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        return self.mlp(x)
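A minimal usage sketch of the two classes above with dummy tensors, just to show the expected shapes; the sizes are made up, and it assumes rechub's activation_layer helper (used inside MLP) is importable.

import torch

emb_dim, seq_len, batch = 8, 5, 2
unit = ActivationUnit(emb_dim=emb_dim, dims=[36], activation="dice")

history = torch.randn(batch, seq_len, emb_dim)   # embeddings of the behavior sequence
target = torch.randn(batch, emb_dim)             # embedding of the candidate item

pooled = unit(history, target)                   # attention-weighted sum of the history
print(pooled.shape)                              # torch.Size([2, 8])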
3. Code section
3.1 The DIN model
import torch
import torch.nn as nn
import numpy as np
from torch.nn.modules.activation import Sigmoid
class DIN(nn.Module):
    def __init__(self, candidate_movie_num, recent_rate_num, user_profile_num, context_feature_num,
                 candidate_movie_dict, recent_rate_dict, user_profile_dict, context_feature_dict,
                 history_num, embed_dim, activation_dim, hidden_dim=[128, 64]):
        super().__init__()
        self.candidate_vocab_list = list(candidate_movie_dict.values())
        self.recent_rate_list = list(recent_rate_dict.values())
        self.user_profile_list = list(user_profile_dict.values())
        self.context_feature_list = list(context_feature_dict.values())
        self.embed_dim = embed_dim
        self.history_num = history_num
        # candidate_embedding_layer
        self.candidate_embedding_list = nn.ModuleList([nn.Embedding(vocab_size, embed_dim)
                                                       for vocab_size in self.candidate_vocab_list])
        # recent_rate_embedding_layer
        self.recent_rate_embedding_list = nn.ModuleList([nn.Embedding(vocab_size, embed_dim)
                                                         for vocab_size in self.recent_rate_list])
        # user_profile_embedding_layer
        self.user_profile_embedding_list = nn.ModuleList([nn.Embedding(vocab_size, embed_dim)
                                                          for vocab_size in self.user_profile_list])
        # context_embedding_list
        self.context_embedding_list = nn.ModuleList([nn.Embedding(vocab_size, embed_dim)
                                                     for vocab_size in self.context_feature_list])
        # activation_unit
        self.activation_unit = nn.Sequential(nn.Linear(4 * embed_dim, activation_dim),
                                             nn.PReLU(),
                                             nn.Linear(activation_dim, 1),
                                             nn.Sigmoid())
        # dnn part: embedding dims plus the remaining continuous-feature dims of each branch
        self.dnn_input_dim = (len(self.candidate_embedding_list) * embed_dim
                              + candidate_movie_num - len(self.candidate_embedding_list)
                              + embed_dim
                              + len(self.user_profile_embedding_list) * embed_dim
                              + user_profile_num - len(self.user_profile_embedding_list)
                              + len(self.context_embedding_list) * embed_dim
                              + context_feature_num - len(self.context_embedding_list))
        self.dnn = nn.Sequential(nn.Linear(self.dnn_input_dim, hidden_dim[0]),
                                 nn.BatchNorm1d(hidden_dim[0]),
                                 nn.PReLU(),
                                 nn.Linear(hidden_dim[0], hidden_dim[1]),
                                 nn.BatchNorm1d(hidden_dim[1]),
                                 nn.PReLU(),
                                 nn.Linear(hidden_dim[1], 1),
                                 nn.Sigmoid())

    def forward(self, candidate_features, recent_features, user_features, context_features):
        bs = candidate_features.shape[0]
        # candidate cate_feat embed
        candidate_embed_features = []
        for i, embed_layer in enumerate(self.candidate_embedding_list):
            candidate_embed_features.append(embed_layer(candidate_features[:, i].long()))
        candidate_embed_features = torch.stack(candidate_embed_features, dim=1).reshape(bs, -1).unsqueeze(1)
        ## add candidate continuous feat
        ## (the original slice used len(candidate_features), i.e. the batch size;
        ##  the number of categorical candidate features is what is intended, as in the other branches)
        candidate_continuous_features = candidate_features[:, len(self.candidate_vocab_list):]
        candidate_branch_features = torch.cat([candidate_continuous_features.unsqueeze(1),
                                               candidate_embed_features], dim=2).repeat(1, self.history_num, 1)
        # recent_rate cate_feat embed
        recent_embed_features = []
        for i, embed_layer in enumerate(self.recent_rate_embedding_list):
            recent_embed_features.append(embed_layer(recent_features[:, i].long()))
        recent_branch_features = torch.stack(recent_embed_features, dim=1)
        # user_profile feat embed
        user_profile_embed_features = []
        for i, embed_layer in enumerate(self.user_profile_embedding_list):
            user_profile_embed_features.append(embed_layer(user_features[:, i].long()))
        user_profile_embed_features = torch.cat(user_profile_embed_features, dim=1)
        ## add user_profile continuous feat
        user_profile_continuous_features = user_features[:, len(self.user_profile_list):]
        user_profile_branch_features = torch.cat([user_profile_embed_features,
                                                  user_profile_continuous_features], dim=1)
        # context embed feat
        context_embed_features = []
        for i, embed_layer in enumerate(self.context_embedding_list):
            context_embed_features.append(embed_layer(context_features[:, i].long()))
        context_embed_features = torch.cat(context_embed_features, dim=1)
        ## add context continuous feat
        context_continuous_features = context_features[:, len(self.context_embedding_list):]
        context_branch_features = torch.cat([context_embed_features, context_continuous_features], dim=1)
        # activation_unit
        sub_unit_input = recent_branch_features - candidate_branch_features
        product_unit_input = torch.mul(recent_branch_features, candidate_branch_features)
        unit_input = torch.cat([recent_branch_features, candidate_branch_features,
                                sub_unit_input, product_unit_input], dim=2)
        # weight-pool
        activation_unit_out = self.activation_unit(unit_input).repeat(1, 1, self.embed_dim)
        recent_branch_pooled_features = torch.mean(torch.mul(activation_unit_out, recent_branch_features), dim=1)
        # dnn part
        dnn_input = torch.cat([candidate_branch_features[:, 0, :], recent_branch_pooled_features,
                               user_profile_branch_features, context_branch_features], dim=1)
        dnn_out = self.dnn(dnn_input)
        return dnn_out
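A minimal usage sketch of the model above with made-up vocabularies and random inputs, just to show the expected input layout (categorical columns first, then continuous columns); all sizes here are illustrative.

import torch

candidate_movie_dict = {"movie_id": 1000}                   # 1 categorical candidate feature
recent_rate_dict = {f"hist_{i}": 1000 for i in range(5)}    # 5 historical movie ids
user_profile_dict = {"user_id": 500, "gender": 2}
context_feature_dict = {"hour": 24}

model = DIN(candidate_movie_num=1, recent_rate_num=5, user_profile_num=2, context_feature_num=1,
            candidate_movie_dict=candidate_movie_dict, recent_rate_dict=recent_rate_dict,
            user_profile_dict=user_profile_dict, context_feature_dict=context_feature_dict,
            history_num=5, embed_dim=8, activation_dim=16)

bs = 4
candidate = torch.randint(0, 1000, (bs, 1)).float()
recent = torch.randint(0, 1000, (bs, 5)).float()
user = torch.randint(0, 2, (bs, 2)).float()
context = torch.randint(0, 24, (bs, 1)).float()
print(model(candidate, recent, user, context).shape)        # torch.Size([4, 1])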
3.2 Using torch-rechub
For example, run the DIN model on the amazon_electronics_sample dataset. The raw data is in JSON format; we extract the required information and preprocess it into a CSV file containing only four feature columns: user_id, item_id, cate_id, time.
(1) Feature processing part
from torch_rechub.basic.features import DenseFeature, SparseFeature, SequenceFeature
n_users, n_items, n_cates = data["user_id"].max(), data["item_id"].max(), data["cate_id"].max()
# Specify how each feature column is handled. A SparseFeature goes through an embedding layer,
# so you must give the size of its feature space (vocabulary) and the output embedding dimension
features = [SparseFeature("target_item", vocab_size=n_items + 2, embed_dim=8),
SparseFeature("target_cate", vocab_size=n_cates + 2, embed_dim=8),
SparseFeature("user_id", vocab_size=n_users + 2, embed_dim=8)]
target_features = features
# Sequence features are handled like categorical features, except that the item sequence and the
# candidate item should live in the same space. We want the model to share their embedding,
# which is what the shared_with parameter does
history_features = [
SequenceFeature("history_item", vocab_size=n_items + 2, embed_dim=8, pooling="concat", shared_with="target_item"),
SequenceFeature("history_cate", vocab_size=n_cates + 2, embed_dim=8, pooling="concat", shared_with="target_cate")
]
(2) The model code
- The basic dataset must be processed to obtain the behavior feature hist_behavior.
- This historical behavior data is a sequence feature, and its length differs between users, so before entering the NN we usually pad every sequence to the length of the longest one; when a specific layer operates on it, a mask is used to hide the padded positions and keep the computation correct. A minimal mask sketch follows.
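A minimal sketch of masking padded positions when computing attention weights over a padded history; pad_id, the shapes and the use of softmax here are illustrative.

import torch

pad_id = 0
hist = torch.tensor([[3, 7, 2, 0],           # padded histories, 0 = padding
                     [5, 0, 0, 0]])
scores = torch.randn(2, 4)                   # raw attention scores for each position

mask = (hist != pad_id)                      # True where there is a real item
scores = scores.masked_fill(~mask, -1e9)     # padded positions get a huge negative score
weights = scores.softmax(dim=-1)             # so they receive ~0 weight after softmax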

class DIN(torch.nn.Module):

    def __init__(self, features, history_features, target_features, mlp_params, attention_mlp_params):
        super().__init__()
        self.features = features
        self.history_features = history_features
        self.target_features = target_features
        # number of historical behavior features
        self.num_history_features = len(history_features)
        # total embedding dim of all features
        self.all_dims = sum([fea.embed_dim for fea in features + history_features + target_features])
        # build the embedding layer
        self.embedding = EmbeddingLayer(features + history_features + target_features)
        # build the attention layers
        self.attention_layers = nn.ModuleList(
            [ActivationUnit(fea.embed_dim, **attention_mlp_params) for fea in self.history_features])
        self.mlp = MLP(self.all_dims, activation="dice", **mlp_params)

    def forward(self, x):
        embed_x_features = self.embedding(x, self.features)
        embed_x_history = self.embedding(x, self.history_features)
        embed_x_target = self.embedding(x, self.target_features)
        attention_pooling = []
        for i in range(self.num_history_features):
            attention_seq = self.attention_layers[i](embed_x_history[:, i, :, :], embed_x_target[:, i, :])
            attention_pooling.append(attention_seq.unsqueeze(1))
        # SUM Pooling
        attention_pooling = torch.cat(attention_pooling, dim=1)
        # Concat & Flatten
        mlp_in = torch.cat([
            attention_pooling.flatten(start_dim=1),
            embed_x_target.flatten(start_dim=1),
            embed_x_features.flatten(start_dim=1)
        ], dim=1)
        # e.g. mlp dims such as [80, 200] can be passed in via mlp_params
        y = self.mlp(mlp_in)
        # the code uses sigmoid(1) + BCELoss; the effect is similar to the softmax(2) + CELoss used in the DIN paper
        return torch.sigmoid(y.squeeze(1))
4. A few questions
- The DIN model is widely used in industry; feel free to look up how it is applied in concrete practice.
- For example, is the construction of the behavior sequence reasonable? If the time span is long, should it be split into several segments?
- For example, would a different way of computing attention work better? (We know attention is not limited to the DNN form.) And should the attention weights go through a softmax?
References
[1] 【CTR prediction】How to add dense continuous features and sequence features to a CTR model?
[2] The Datawhale rechub project
[3] 《Deep Learning Recommendation System》, Wang Zhe