[pytorch basic tutorial 29] DIN model
Learning summary
- Most loss functions in the ranking stage of a recommender system are binary cross-entropy losses, but many recall models are not: the sampled softmax loss is also common for recall models.
- During training, even with a fixed seed the model loss can fluctuate a lot. Possible causes: early stopping kicks in too early, the batch_size is relatively small (making individual batches unbalanced), or the learning rate is too large.
- DIN uses a local activation unit structure, which uses the correlation between the candidate item and each item in the user's historical behavior to compute a weight; this weight represents how important each historical item is for predicting the click on the current candidate ad.
- In the rechub project this activation unit is an MLP. Attention is essentially a weighted average; the MLP produces the weighted score (roughly W·x, where W is the weight), and adding the softmax constraint on top of the MLP (making the weights sum to 1) is what gives attention.
- Attention comes in many forms, e.g. the Transformer (dot-product form) and DIN (MLP form). As long as you end up with an attention coefficient it works: back-propagation will learn appropriate weights. A small sketch contrasting the two score forms follows this list.
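A minimal sketch (not from the original post) contrasting the two score forms just mentioned: a Transformer-style dot-product score and a DIN-style MLP score. The sizes, the tiny scoring MLP and the final softmax are all illustrative.

import torch
import torch.nn as nn

D, H = 8, 5
hist = torch.randn(H, D)                 # embeddings of H historical items
cand = torch.randn(D)                    # candidate item embedding

# Transformer-style: dot-product score, one coefficient per historical item
dot_scores = hist @ cand                 # (H,)

# DIN-style: a small MLP scores each (history, candidate) pair
mlp = nn.Sequential(nn.Linear(2 * D, 16), nn.ReLU(), nn.Linear(16, 1))
pair = torch.cat([hist, cand.expand(H, D)], dim=-1)   # (H, 2D)
mlp_scores = mlp(pair).squeeze(-1)       # (H,)

# either set of scores can weight the history; softmax normalization is optional
interest = (mlp_scores.softmax(dim=0).unsqueeze(-1) * hist).sum(dim=0)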
1. Data feature representation
1.1 Feature representation
Industrial CTR prediction datasets are generally in multi-group categorical form, i.e. categorical features are the most common. Such a dataset usually looks like this:

The key part is the framed feature in the figure, which contains rich user-interest information.
- For feature encoding, the author gives an example: [weekday=Friday, gender=Female, visited_cate_ids={Bag,Book}, ad_cate_id=Book]. Normally such a record is one-hot encoded into sparse binary features.
- But notice visited_cate_ids, the user's historical item-category list. For a single user this is a multi-valued feature, and its length differs between users because different users have bought different numbers of items. For such a feature we usually use multi-hot encoding: more than one position can be 1, and every category that appears gets a 1 at its position. The encoded data then goes into the model, roughly like the sketch below:
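A minimal encoding sketch with a made-up category vocabulary; it is for illustration only, not the original post's figure or pipeline.

import torch

# hypothetical category vocabulary, just for illustration
cate_vocab = {"Bag": 0, "Book": 1, "Clothes": 2, "Phone": 3}

def multi_hot(cates, vocab):
    """Encode a variable-length list of categories as a fixed-length 0/1 vector."""
    vec = torch.zeros(len(vocab))
    for c in cates:
        vec[vocab[c]] = 1.0
    return vec

# visited_cate_ids={Bag, Book}  ->  multi-hot, more than one position is 1
print(multi_hot(["Bag", "Book"], cate_vocab))   # tensor([1., 1., 0., 0.])
# ad_cate_id=Book               ->  one-hot, exactly one position is 1
print(multi_hot(["Book"], cate_vocab))          # tensor([0., 1., 0., 0.])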

The features above contain no interaction or combination terms, i.e. no feature crossing; that interaction information is left for the downstream neural network to learn.
DIN's input features fall roughly into three classes: Dense (continuous), Sparse (categorical) and VarlenSparse (variable-length categorical, i.e. the historical behavior data above). The different types also determine how they are processed later:
- Dense features: these are numerical, so each gets an Input layer to receive it; they are concatenated together and, after the categorical side has been processed, concatenated with it and fed into the DNN.
- Sparse features: each categorical feature gets an Input layer, then an embedding layer that turns it into a low-dimensional dense vector; these vectors are concatenated and, after the variable-length side has been processed, everything is concatenated and fed into the DNN. One embedding needs special attention: the candidate item's embedding vector, which is used later to weight the historical behavior sequence.
- VarlenSparse features: these are usually the user's historical behaviors, i.e. variable-length data. They are first padded to equal length, then given an Input layer, then passed through an embedding layer to obtain the embedding vectors of the historical behaviors. Those vectors, together with the candidate item embedding above, go into an AttentionPoolingLayer that weights and merges the historical behavior features into the final output.
In the torch-rechub project, create_seq_features handles the corresponding historical sequences; a minimal padding sketch follows.
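A minimal padding sketch, assuming the raw history is a list of item-id lists of different lengths; the function name and the truncate-then-pad scheme are illustrative, not torch-rechub's actual create_seq_features.

import torch

def pad_histories(histories, max_len, pad_id=0):
    """Left-truncate / right-pad variable-length item-id histories to max_len."""
    padded = torch.full((len(histories), max_len), pad_id, dtype=torch.long)
    for i, hist in enumerate(histories):
        hist = hist[-max_len:]                      # keep the most recent items
        padded[i, :len(hist)] = torch.tensor(hist)  # right-pad the rest with pad_id
    return padded

histories = [[3, 7, 2], [5], [8, 1, 4, 9, 6]]
print(pad_histories(histories, max_len=4))
# tensor([[3, 7, 2, 0],
#         [5, 0, 0, 0],
#         [1, 4, 9, 6]])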
2. Deep Interest Network (DIN): adding attention
DIN's application scenario is Alibaba's most typical e-commerce advertising recommendation, where there is a lot of historical user behavior (purchased items or their category information). For items with paid advertising, Alibaba predicts the click-through rate with the model and recommends suitable ad items to suitable users, so DIN is essentially a CTR prediction model.
Figure 1 below is DIN's base model (Base Model). As you can see, the Base Model is a typical Embedding + MLP structure. Its input features are user profile features (User Profile Features), user behavior features (User Behaviors), candidate ad features (Candidate Ad) and context features (Context Features).
2.1 User behavior features and candidate ad features
User profile features and context features were covered earlier; here, pay attention to the user behavior features and candidate ad features in the colored part of the figure above:
(1) The user behavior features consist of a sequence of items the user has bought, i.e. Goods 1 through Goods N in the figure. Each item carries three sub-features, the three colored dots in the figure: red is the item ID, blue is the shop ID, and pink is the item category ID.
(2) The candidate ad features contain the same three ID-type sub-features, because the candidate ad here is also an item on Alibaba's platform.
In deep learning, when we encounter ID-type features we generally build an Embedding for them, then concatenate the Embedding with the other features and feed it into the MLP that follows.
Alibaba's Base Model does exactly that: it converts the three IDs into their corresponding Embeddings, then concatenates those Embeddings to form the current item's Embedding.
2.2 Aggregating the user behavior sequence
The user's behavior sequence is a sequence of items and can be long or short, but the input dimension of a neural network must be fixed. So how do we turn this group of item Embeddings into a single fixed-length Embedding? As the SUM Pooling layer in Figure 1 shows, the base model simply adds the item Embeddings together element-wise (vector accumulation), then concatenates the summed Embedding with all the other features and feeds the result into the MLP. A minimal sum-pooling sketch follows.
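A minimal sum-pooling sketch, assuming a padded history of item ids and an nn.Embedding table; padding_idx and the sizes are illustrative details, not taken from the paper.

import torch
import torch.nn as nn

emb = nn.Embedding(1000, 8, padding_idx=0)      # item-id vocabulary of 1000, dim 8
hist = torch.tensor([[3, 7, 2, 0],              # (batch=2, seq_len=4), 0 = padding
                     [5, 0, 0, 0]])

hist_emb = emb(hist)                            # (2, 4, 8)
sum_pooled = hist_emb.sum(dim=1)                # (2, 8): every item weighted equally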
【Deficiencies of SUM Pooling】
The SUM Pooling superposition of Embeddings treats all historical behaviors equally: everything is added up with no emphasis, which does not match how we actually shop.
For example, suppose the candidate ad's item is a "keyboard", while the user's historical behavior sequence contains the item IDs "mouse", "T-shirt" and "facial cleanser". From common shopping sense, the "mouse" history item should matter far more for predicting the CTR of the "keyboard" ad than the other two. From an attention perspective, when buying a keyboard we pay more attention to having bought related items such as a "mouse", because those purchase experiences help us make a better decision.
【Modules of the base model】
- Embedding layer: transforms the high-dimensional sparse input into low-dimensional dense vectors. Each categorical feature has its own embedding dictionary of dimension $D \times K$, where $D$ is the dimension of the latent vector and $K$ is the number of unique values of that feature. To make this concrete, take the weekday feature above:
Suppose a user's weekday feature is Friday. One-hot encoded it is $[0,0,0,0,1,0,0]$. If we assume the latent dimension is $D$, the embedding dictionary for this feature is a $D \times 7$ matrix (each column is one embedding; the 7 columns are 7 embedding vectors, one for Monday through Sunday). Passing the user's one-hot vector through the embedding layer yields a $D \times 1$ vector, namely the embedding corresponding to Friday; the computation is simply $\text{embedding matrix} \times [0,0,0,0,1,0,0]^{T}$.
In other words, the embedding layer takes out the column of the embedding matrix at the position where the one-hot vector is 1. That is how sparse features get their dense vectors. The same holds for the other categorical features, except that the multi-hot encoded one above yields a list of embedding vectors: its multi-hot vector has more than one 1, so multiplying it by the embedding matrix gives a list. After this layer, all the input features have their corresponding dense embedding vectors. A small lookup sketch follows.
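A minimal sketch showing that an nn.Embedding lookup is the same as multiplying the one-hot vector by the embedding matrix; the weekday vocabulary of 7 follows the example above, and D=4 is an example value.

import torch
import torch.nn as nn

D = 4                                   # latent dimension (example value)
emb = nn.Embedding(7, D)                # 7 weekdays, embedding dim D

friday = torch.tensor(4)                # index of Friday (0-based)
one_hot = torch.zeros(7)
one_hot[4] = 1.0

lookup = emb(friday)                    # (D,)  direct table lookup
matmul = one_hot @ emb.weight           # (D,)  one-hot times embedding matrix
print(torch.allclose(lookup, matmul))   # True: the two are the same vector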
pooling layer and Concat layer:
- The pooling layer turns the user's historical behavior embeddings into a single fixed-length vector. Each user has bought a different number of items, i.e. the number of 1s in each user's multi-hot vector differs, so after the embedding layer the number of historical behavior embeddings differs too: the embedding lists $t_i$ above have different lengths, and each user's historical behavior feature is therefore of different length. The fully connected network that follows needs a fixed-length input, so a pooling layer is used first to bring the user's historical behavior embeddings to a uniform fixed length:
$e_i = \text{pooling}(e_{i1}, e_{i2}, \ldots, e_{ik})$
Here $e_{ij}$ is one of the user's historical behavior embeddings and $e_i$ is the resulting fixed-length vector; $i$ denotes the $i$-th historical feature group (one historical behavior field, e.g. historical item ids or historical item category ids), and $k$ is the number of items the user bought in that feature group, i.e. the number of historical embeddings. The user behaviors part of the figure above shows exactly this process.
- The Concat layer does the splicing: it takes the embedding vectors of all these features (plus the continuous features, if any), concatenates them along the feature dimension, and uses the result as the input of the MLP.
MLP: ordinary fully connected layers, used to learn the various interactions among the features.
For the binary-classification CTR task, the loss function is generally the negative log-likelihood:
$L=-\frac{1}{N} \sum_{(\boldsymbol{x}, y) \in \mathcal{S}}\left(y \log p(\boldsymbol{x})+(1-y) \log (1-p(\boldsymbol{x}))\right)$
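A minimal sketch showing that this negative log-likelihood is just PyTorch's binary cross-entropy; p and y are made-up example values.

import torch
import torch.nn.functional as F

p = torch.tensor([0.9, 0.2, 0.7])         # predicted click probabilities p(x)
y = torch.tensor([1.0, 0.0, 1.0])         # click labels

manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
builtin = F.binary_cross_entropy(p, y)    # same negative log-likelihood
print(torch.allclose(manual, builtin))    # True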
Weaknesses of the base model:
- Everything is summed together, so you cannot tell which item in the user's history is more relevant to the current candidate item; the importance of each historical item for the current prediction is lost.
- Moreover, if all the historical items a user has browsed are eventually compressed, through embedding and pooling, into one fixed-length embedding, this limits the model's ability to learn the user's diverse interests.
Concrete improvement ideas:
- Enlarge the embedding dimension to increase the expressiveness of each item, so that even after being summed together the embedding still carries the user's interest information. But in large-scale production recommendation scenarios the computation cost is huge, so this is not an option.
- Introduce an attention mechanism between the current candidate ad and the user's historical behaviors, which is exactly what DIN does. When predicting whether the current ad will be clicked, the model then pays more attention to the historical items related to the current ad, i.e. the historical behaviors more relevant to the current item are the ones driving the user's click.
2.3 Applying the attention mechanism: DIN
(1) Improvements
So, on top of the base model, Alibaba applies the attention mechanism to the processing of the user's historical behavior sequence.
The concrete operation is shown in the figure below: DIN adds an activation unit (Activation Unit) for each of the user's historical purchases. The activation unit produces a weight, which is the user's attention score for that historical item.
Comparison with the previous base model:
(2) Activation unit (local activation unit)
The detailed structure of the activation unit is shown on the right of Figure 3 above:
Input: the Embedding of the current historical behavior item and the Embedding of the candidate ad item.
Operation: the two Embeddings, together with the result of their outer product, are concatenated into one vector, which is fed into the activation unit's MLP layer; this finally produces an attention weight.
(1) The activation unit is equivalent to a small deep-learning model: it uses the Embeddings of the two items to produce an attention weight that represents their degree of relevance.
(2) The Sparrow code does not strictly use the outer product: it uses element-wise subtraction and multiplication, then concatenates these with the two vectors themselves to form activation_all.
Wang Zhe's practical experience: the outer product does not help much, and it greatly increases the number of parameters.

The local activation unit weights the user's historical behavior embeddings according to their correlation with the current ad. Inside it is a feed-forward neural network whose input is one of the user's historical items and the current candidate item, and whose output is their correlation. This correlation acts as the weight of each historical item; multiplying the weights by the original historical behavior embeddings and summing gives the user's interest representation $\boldsymbol{v}_{U}(A)$:
$\boldsymbol{v}_{U}(A)=f\left(\boldsymbol{v}_{A}, \boldsymbol{e}_{1}, \boldsymbol{e}_{2}, \ldots, \boldsymbol{e}_{H}\right)=\sum_{j=1}^{H} a\left(\boldsymbol{e}_{j}, \boldsymbol{v}_{A}\right) \boldsymbol{e}_{j}=\sum_{j=1}^{H} w_{j} \boldsymbol{e}_{j}$
The symbols in the formula:
- $\left\{\boldsymbol{e}_{1}, \boldsymbol{e}_{2}, \ldots, \boldsymbol{e}_{H}\right\}$ are the embeddings of user $U$'s historical behaviors;
- $\boldsymbol{v}_{A}$ is the embedding vector of candidate ad $A$;
- $a\left(\boldsymbol{e}_{j}, \boldsymbol{v}_{A}\right)=w_{j}$ is the weight, i.e. the relevance of the historical item to the current ad $A$;
- $a(\cdot)$ is the feed-forward network above, i.e. the attention mechanism;
- besides the historical behavior vector and the candidate ad vector, the input also includes their outer product, which the authors describe as explicit knowledge that helps relevance modeling. A small weighted-sum sketch of the formula above follows.
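A minimal sketch of the weighted sum above, assuming the relevance scores a(e_j, v_A) are already available (here a plain dot product stands in for the activation-unit MLP); it shows only the pooling step, not the full DIN.

import torch

H, D = 4, 8
e = torch.randn(H, D)            # historical behavior embeddings e_1..e_H
v_A = torch.randn(D)             # candidate ad embedding

# stand-in for a(e_j, v_A): a dot product used as the relevance score
w = e @ v_A                      # (H,) one weight per historical item
# DIN does not normalize these weights with softmax
v_U = (w.unsqueeze(-1) * e).sum(dim=0)   # sum_j w_j * e_j  -> (D,)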
The ActivationUnit code in RecHub:
class ActivationUnit(torch.nn.Module):

    def __init__(self, emb_dim, dims=[36], activation="dice", use_softmax=False):
        super(ActivationUnit, self).__init__()
        self.emb_dim = emb_dim
        self.use_softmax = use_softmax
        # MLP with Dice(36) activation
        self.attention = MLP(4 * self.emb_dim, dims=dims, activation=activation)

    def forward(self, history, target):
        seq_length = history.size(1)
        target = target.unsqueeze(1).expand(-1, seq_length, -1)
        # Concat [target, history, target - history, target * history]
        att_input = torch.cat([target, history, target - history, target * history], dim=-1)
        # Dice(36)
        att_weight = self.attention(att_input.view(-1, 4 * self.emb_dim))
        # Linear(1)
        att_weight = att_weight.view(-1, seq_length)
        if self.use_softmax:
            att_weight = att_weight.softmax(dim=-1)
        # (batch_size, emb_dim)
        output = (att_weight.unsqueeze(-1) * history).sum(dim=1)
        return output
You can see that self.attention is assigned an MLP here:
class MLP(nn.Module):
    """Multi Layer Perceptron Module, it is the most widely used module for
    learning feature. Note we default add `BatchNorm1d`, `Activation` and
    `Dropout` for each `Linear` Module.

    Args:
        input_dim (int): input size of the first Linear Layer.
        output_layer (bool): whether this MLP module is the output layer. If
            `True`, then append one Linear(*,1) module.
        dims (list): output size of Linear Layer (default=[]).
        dropout (float): probability of an element to be zeroed (default = 0.5).
        activation (str): the activation function, support
            `[sigmoid, relu, prelu, dice, softmax]` (default='relu').

    Shape:
        - Input: `(batch_size, input_dim)`
        - Output: `(batch_size, 1)` or `(batch_size, dims[-1])`
    """

    def __init__(self, input_dim, output_layer=True, dims=[], dropout=0, activation="relu"):
        super().__init__()
        layers = list()
        for i_dim in dims:
            layers.append(nn.Linear(input_dim, i_dim))
            layers.append(nn.BatchNorm1d(i_dim))
            layers.append(activation_layer(activation))
            layers.append(nn.Dropout(p=dropout))
            input_dim = i_dim
        if output_layer:
            layers.append(nn.Linear(input_dim, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        return self.mlp(x)
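A minimal usage sketch of the two classes above with dummy tensors, just to show the expected shapes; the sizes are made up, and it assumes rechub's activation_layer helper (used inside MLP) is importable.

import torch

emb_dim, seq_len, batch = 8, 5, 2
unit = ActivationUnit(emb_dim=emb_dim, dims=[36], activation="dice")

history = torch.randn(batch, seq_len, emb_dim)   # embeddings of the behavior sequence
target = torch.randn(batch, emb_dim)             # embedding of the candidate item

pooled = unit(history, target)                   # attention-weighted sum of the history
print(pooled.shape)                              # torch.Size([2, 8])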
3. Code section
3.1 The DIN model
import torch
import torch.nn as nn
import numpy as np
from torch.nn.modules.activation import Sigmoid
class DIN(nn.Module):
    def __init__(self, candidate_movie_num, recent_rate_num, user_profile_num, context_feature_num,
                 candidate_movie_dict, recent_rate_dict, user_profile_dict, context_feature_dict,
                 history_num, embed_dim, activation_dim, hidden_dim=[128, 64]):
        super().__init__()
        self.candidate_vocab_list = list(candidate_movie_dict.values())
        self.recent_rate_list = list(recent_rate_dict.values())
        self.user_profile_list = list(user_profile_dict.values())
        self.context_feature_list = list(context_feature_dict.values())
        self.embed_dim = embed_dim
        self.history_num = history_num
        # candidate_embedding_layer
        self.candidate_embedding_list = nn.ModuleList([nn.Embedding(vocab_size, embed_dim)
                                                       for vocab_size in self.candidate_vocab_list])
        # recent_rate_embedding_layer
        self.recent_rate_embedding_list = nn.ModuleList([nn.Embedding(vocab_size, embed_dim)
                                                         for vocab_size in self.recent_rate_list])
        # user_profile_embedding_layer
        self.user_profile_embedding_list = nn.ModuleList([nn.Embedding(vocab_size, embed_dim)
                                                          for vocab_size in self.user_profile_list])
        # context_embedding_list
        self.context_embedding_list = nn.ModuleList([nn.Embedding(vocab_size, embed_dim)
                                                     for vocab_size in self.context_feature_list])
        # activation_unit
        self.activation_unit = nn.Sequential(nn.Linear(4 * embed_dim, activation_dim),
                                             nn.PReLU(),
                                             nn.Linear(activation_dim, 1),
                                             nn.Sigmoid())
        # dnn part: embedding dims plus the remaining continuous-feature dims of each branch
        self.dnn_input_dim = (len(self.candidate_embedding_list) * embed_dim
                              + candidate_movie_num - len(self.candidate_embedding_list)
                              + embed_dim
                              + len(self.user_profile_embedding_list) * embed_dim
                              + user_profile_num - len(self.user_profile_embedding_list)
                              + len(self.context_embedding_list) * embed_dim
                              + context_feature_num - len(self.context_embedding_list))
        self.dnn = nn.Sequential(nn.Linear(self.dnn_input_dim, hidden_dim[0]),
                                 nn.BatchNorm1d(hidden_dim[0]),
                                 nn.PReLU(),
                                 nn.Linear(hidden_dim[0], hidden_dim[1]),
                                 nn.BatchNorm1d(hidden_dim[1]),
                                 nn.PReLU(),
                                 nn.Linear(hidden_dim[1], 1),
                                 nn.Sigmoid())

    def forward(self, candidate_features, recent_features, user_features, context_features):
        bs = candidate_features.shape[0]
        # candidate cate_feat embed
        candidate_embed_features = []
        for i, embed_layer in enumerate(self.candidate_embedding_list):
            candidate_embed_features.append(embed_layer(candidate_features[:, i].long()))
        candidate_embed_features = torch.stack(candidate_embed_features, dim=1).reshape(bs, -1).unsqueeze(1)
        ## add candidate continuous feat
        ## (the original slice used len(candidate_features), i.e. the batch size;
        ##  the number of categorical candidate features is what is intended, as in the other branches)
        candidate_continuous_features = candidate_features[:, len(self.candidate_vocab_list):]
        candidate_branch_features = torch.cat([candidate_continuous_features.unsqueeze(1),
                                               candidate_embed_features], dim=2).repeat(1, self.history_num, 1)
        # recent_rate cate_feat embed
        recent_embed_features = []
        for i, embed_layer in enumerate(self.recent_rate_embedding_list):
            recent_embed_features.append(embed_layer(recent_features[:, i].long()))
        recent_branch_features = torch.stack(recent_embed_features, dim=1)
        # user_profile feat embed
        user_profile_embed_features = []
        for i, embed_layer in enumerate(self.user_profile_embedding_list):
            user_profile_embed_features.append(embed_layer(user_features[:, i].long()))
        user_profile_embed_features = torch.cat(user_profile_embed_features, dim=1)
        ## add user_profile continuous feat
        user_profile_continuous_features = user_features[:, len(self.user_profile_list):]
        user_profile_branch_features = torch.cat([user_profile_embed_features,
                                                  user_profile_continuous_features], dim=1)
        # context embed feat
        context_embed_features = []
        for i, embed_layer in enumerate(self.context_embedding_list):
            context_embed_features.append(embed_layer(context_features[:, i].long()))
        context_embed_features = torch.cat(context_embed_features, dim=1)
        ## add context continuous feat
        context_continuous_features = context_features[:, len(self.context_embedding_list):]
        context_branch_features = torch.cat([context_embed_features, context_continuous_features], dim=1)
        # activation_unit
        sub_unit_input = recent_branch_features - candidate_branch_features
        product_unit_input = torch.mul(recent_branch_features, candidate_branch_features)
        unit_input = torch.cat([recent_branch_features, candidate_branch_features,
                                sub_unit_input, product_unit_input], dim=2)
        # weight-pool
        activation_unit_out = self.activation_unit(unit_input).repeat(1, 1, self.embed_dim)
        recent_branch_pooled_features = torch.mean(torch.mul(activation_unit_out, recent_branch_features), dim=1)
        # dnn part
        dnn_input = torch.cat([candidate_branch_features[:, 0, :], recent_branch_pooled_features,
                               user_profile_branch_features, context_branch_features], dim=1)
        dnn_out = self.dnn(dnn_input)
        return dnn_out
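A minimal usage sketch of the model above with made-up vocabularies and random inputs, just to show the expected input layout (categorical columns first, then continuous columns); all sizes here are illustrative.

import torch

candidate_movie_dict = {"movie_id": 1000}                   # 1 categorical candidate feature
recent_rate_dict = {f"hist_{i}": 1000 for i in range(5)}    # 5 historical movie ids
user_profile_dict = {"user_id": 500, "gender": 2}
context_feature_dict = {"hour": 24}

model = DIN(candidate_movie_num=1, recent_rate_num=5, user_profile_num=2, context_feature_num=1,
            candidate_movie_dict=candidate_movie_dict, recent_rate_dict=recent_rate_dict,
            user_profile_dict=user_profile_dict, context_feature_dict=context_feature_dict,
            history_num=5, embed_dim=8, activation_dim=16)

bs = 4
candidate = torch.randint(0, 1000, (bs, 1)).float()
recent = torch.randint(0, 1000, (bs, 5)).float()
user = torch.randint(0, 2, (bs, 2)).float()
context = torch.randint(0, 24, (bs, 1)).float()
print(model(candidate, recent, user, context).shape)        # torch.Size([4, 1])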
3.2 Using torch-rechub
For example, run the DIN model on the amazon_electronics_sample dataset. The raw data is in JSON format; we extract the required information and preprocess it into a CSV file containing only four feature columns: user_id, item_id, cate_id, time.
(1) Feature processing part
from torch_rechub.basic.features import DenseFeature, SparseFeature, SequenceFeature
n_users, n_items, n_cates = data["user_id"].max(), data["item_id"].max(), data["cate_id"].max()
# Specify how each feature column is handled. A SparseFeature goes through an embedding layer,
# so you must give the size of its feature space (vocabulary) and the output embedding dimension
features = [SparseFeature("target_item", vocab_size=n_items + 2, embed_dim=8),
SparseFeature("target_cate", vocab_size=n_cates + 2, embed_dim=8),
SparseFeature("user_id", vocab_size=n_users + 2, embed_dim=8)]
target_features = features
# Sequence features are handled like categorical features, except that the item sequence and the
# candidate item should live in the same space. We want the model to share their embedding,
# which is what the shared_with parameter does
history_features = [
SequenceFeature("history_item", vocab_size=n_items + 2, embed_dim=8, pooling="concat", shared_with="target_item"),
SequenceFeature("history_cate", vocab_size=n_cates + 2, embed_dim=8, pooling="concat", shared_with="target_cate")
]
(2) The model code
- The basic dataset must be processed to obtain the behavior feature hist_behavior.
- This historical behavior data is a sequence feature, and its length differs between users, so before entering the NN we usually pad every sequence to the length of the longest one; when a specific layer operates on it, a mask is used to hide the padded positions and keep the computation correct. A minimal mask sketch follows.
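A minimal sketch of masking padded positions when computing attention weights over a padded history; pad_id, the shapes and the use of softmax here are illustrative.

import torch

pad_id = 0
hist = torch.tensor([[3, 7, 2, 0],           # padded histories, 0 = padding
                     [5, 0, 0, 0]])
scores = torch.randn(2, 4)                   # raw attention scores for each position

mask = (hist != pad_id)                      # True where there is a real item
scores = scores.masked_fill(~mask, -1e9)     # padded positions get a huge negative score
weights = scores.softmax(dim=-1)             # so they receive ~0 weight after softmax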

class DIN(torch.nn.Module):

    def __init__(self, features, history_features, target_features, mlp_params, attention_mlp_params):
        super().__init__()
        self.features = features
        self.history_features = history_features
        self.target_features = target_features
        # number of historical behavior features
        self.num_history_features = len(history_features)
        # total embedding dim of all features
        self.all_dims = sum([fea.embed_dim for fea in features + history_features + target_features])
        # build the embedding layer
        self.embedding = EmbeddingLayer(features + history_features + target_features)
        # build the attention layers
        self.attention_layers = nn.ModuleList(
            [ActivationUnit(fea.embed_dim, **attention_mlp_params) for fea in self.history_features])
        self.mlp = MLP(self.all_dims, activation="dice", **mlp_params)

    def forward(self, x):
        embed_x_features = self.embedding(x, self.features)
        embed_x_history = self.embedding(x, self.history_features)
        embed_x_target = self.embedding(x, self.target_features)
        attention_pooling = []
        for i in range(self.num_history_features):
            attention_seq = self.attention_layers[i](embed_x_history[:, i, :, :], embed_x_target[:, i, :])
            attention_pooling.append(attention_seq.unsqueeze(1))
        # SUM Pooling
        attention_pooling = torch.cat(attention_pooling, dim=1)
        # Concat & Flatten
        mlp_in = torch.cat([
            attention_pooling.flatten(start_dim=1),
            embed_x_target.flatten(start_dim=1),
            embed_x_features.flatten(start_dim=1)
        ], dim=1)
        # e.g. mlp dims such as [80, 200] can be passed in via mlp_params
        y = self.mlp(mlp_in)
        # the code uses sigmoid(1) + BCELoss; the effect is similar to the softmax(2) + CELoss used in the DIN paper
        return torch.sigmoid(y.squeeze(1))
4. A few questions
- The DIN model is widely used in industry; feel free to look up how it is applied in concrete practice.
- For example, is the construction of the behavior sequence reasonable? If the time span is long, should it be split into several segments?
- For example, would a different way of computing attention work better? (We know attention is not limited to the DNN form.) And should the attention weights go through a softmax?
References
[1] 【CTR prediction】How to add dense continuous features and sequence features to a CTR model?
[2] The Datawhale rechub project
[3] 《Deep Learning Recommendation System》, Wang Zhe