2020 ACL: A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
2022-07-23 06:11:00 【CityD】
A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
Paper: https://aclanthology.org/2020.challengehml-1.1/
Sentiment analysis and emotion recognition
In sentiment analysis, emotional expression can come from text, audio, and images; combining two or more modalities to model sentiment is multimodal sentiment analysis. As shown in the figure below, the three modalities of text, image, and audio (all extracted from the same multimedia data) are used to analyze whether the sentiment expressed by the multimedia data is positive or negative, or which emotion it expresses (happy, excited, sad, angry).

The difference between sentiment analysis and emotion recognition is that sentiment analysis only judges whether the sentiment expressed by the multimedia data is positive or negative, so there are only two classes: positive (1) and negative (-1). Alternatively, based on the intensity of the sentiment, the data can be annotated on a scale from highly negative (-3) to highly positive (3), i.e. in the range [-3, 3].
Emotion recognition, on the other hand, identifies the specific emotion expressed by the multimedia data, such as happiness, sadness, anger, fear, disgust, and surprise, which are the labels commonly used in emotion recognition datasets. The CMU-MOSEI dataset used in this paper divides emotions into exactly these six categories.

The model proposed in this paper is applied to both emotion recognition and sentiment analysis. **It is a Transformer-based joint encoding: in addition to the Transformer architecture, the method relies on modular co-attention and a glimpse layer to jointly encode one or more modalities.** The model is inspired by machine translation (Transformers, Vaswani et al. (2017)) and visual question answering (modular co-attention, Yu et al. (2019)).
Self-attention (SA) unit & guided-attention (GA) unit
Two basic units are used here, the self-attention (SA) unit and the guided-attention (GA) unit, both inspired by the scaled dot-product attention proposed in the Transformer. SA and GA are the two basic attention units that realize multi-head attention over different input types. The SA unit takes one set of input features X and outputs the attended features Z of X; the GA unit takes two sets of input features X and Y and outputs the attended features Z of X under the guidance of Y. The SA unit is similar to the Transformer encoder and acts as a unimodal encoder; the GA unit adapts the SA unit to multimodal tasks and acts as a multimodal encoder.

Given a query $Q \in R^{n \times d}$, key $K \in R^{n \times d}$, and value $V \in R^{n \times d}$, the scaled dot-product attention function computes the dot product of the query with the key, divides it by $\sqrt{d}$, applies a $softmax$ to obtain the attention weights over the values, and then uses these weights to compute a weighted sum of the values:
$$f=Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d}}\right)V$$
To further improve the representational power of the attended features, multi-head attention is introduced. It consists of $h$ parallel "heads", each corresponding to an independent scaled dot-product attention function. The output features are obtained by the following formula:
$$f=MHA(Q,K,V)=Concat(head_1,\cdots,head_h)W_o \\ \text{where}\;head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)$$
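For reference, here is a minimal PyTorch sketch of the two formulas above (this is not the repository's MHAtt class; the default sizes and the absence of masking and dropout are simplifications for illustration):

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v):
        # q, k, v: (batch, heads, seq_len, d_head)
        d = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (batch, heads, n_q, n_k)
        weights = F.softmax(scores, dim=-1)               # attention weights over the values
        return weights @ v                                # weighted sum of the values

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model=512, num_heads=8):
            super().__init__()
            assert d_model % num_heads == 0
            self.h, self.d_head = num_heads, d_model // num_heads
            self.w_q = nn.Linear(d_model, d_model)  # W^Q for all heads, fused
            self.w_k = nn.Linear(d_model, d_model)  # W^K
            self.w_v = nn.Linear(d_model, d_model)  # W^V
            self.w_o = nn.Linear(d_model, d_model)  # W_o, applied after concatenating the heads

        def split(self, x):
            b, n, _ = x.shape
            return x.view(b, n, self.h, self.d_head).transpose(1, 2)

        def forward(self, q, k, v):
            # q: (batch, n_q, d_model); k, v: (batch, n_kv, d_model)
            out = scaled_dot_product_attention(self.split(self.w_q(q)),
                                               self.split(self.w_k(k)),
                                               self.split(self.w_v(v)))
            b, _, n, _ = out.shape
            out = out.transpose(1, 2).contiguous().view(b, n, self.h * self.d_head)
            return self.w_o(out)

With q = k = v = X this realizes the SA unit's MHA(X, X, X); with q = X and k = v = Y it realizes the GA unit's MHA(X, Y, Y) used below.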
Each SA and GA unit consists of a multi-head attention layer and a feed-forward layer; residual connections followed by layer normalization are applied to the outputs of both layers to ease optimization.
SA unit
The SA unit takes a set of input features $X=[x_1;\cdots;x_m]\in R^{m \times d_x}$. The multi-head attention layer learns the pairwise attention between samples $\langle x_i,x_j\rangle$ within $X$ and computes a weighted sum over all values in $X$ to obtain its output:
$$f=MHA(X,X,X)$$
This can be understood as reconstructing each $x_i$ from its normalized similarities with all samples in $X$. With the residual connections and layer normalization, the output of the SA unit is:
$$LayerNorm(X+MHA(X,X,X))\\ LayerNorm(X+MLP(X))$$
GA unit
The GA unit takes two sets of input features $X=[x_1;\cdots;x_m]\in R^{m \times d_x}$ and $Y=[y_1;\cdots;y_n]\in R^{n \times d_y}$, where $Y$ guides the attention learning of $X$. The shapes of $X$ and $Y$ are flexible, so they can represent features of different modalities (for example, text and images). The GA unit models the pairwise relationship between each sample pair $\langle x_i,y_j\rangle$ from $X$ and $Y$. The output of its multi-head attention layer is:
$$f=MHA(X,Y,Y)$$
This can be understood as reconstructing each $x_i$ from its normalized cross-modal similarities with all samples in $Y$. The output of the GA unit is:
$$LayerNorm(X+MHA(X,Y,Y))\\ LayerNorm(X+MLP(X))$$
Unimodal Transformer encoding
The unimodal Transformer encoder used by the proposed model stacks $B$ SA units, each with its own trainable parameters, as shown below:

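As a rough sketch (not the repository code) of such a stack, the following uses PyTorch's built-in nn.MultiheadAttention as a stand-in for one SA unit; the sizes 512/2048, 8 heads, and B = 4 are placeholder values:

    import torch
    import torch.nn as nn

    class SABlock(nn.Module):
        # Minimal stand-in for one SA unit: multi-head self-attention + feed-forward,
        # each followed by a residual connection and layer normalization.
        def __init__(self, d_model=512, num_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            x = self.norm1(x + self.attn(x, x, x)[0])   # MHA(X, X, X)
            return self.norm2(x + self.ffn(x))

    class UnimodalEncoder(nn.Module):
        # Stack of B SA units, each with its own parameters.
        def __init__(self, num_blocks=4, d_model=512):
            super().__init__()
            self.blocks = nn.ModuleList([SABlock(d_model) for _ in range(num_blocks)])

        def forward(self, x):
            for block in self.blocks:
                x = block(x)
            return x

    # e.g. a batch of 32 text sequences, 60 tokens, 512-dim features
    features = UnimodalEncoder()(torch.randn(32, 60, 512))   # -> (32, 60, 512)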
Modular Co-Attention
As mentioned earlier, besides the Transformer architecture, the proposed approach also relies on modular co-attention (Modular Co-Attention) and a glimpse layer to jointly encode one or more modalities. Let's first look at what Modular Co-Attention is.
Based on the two basic attention units introduced above, SA and GA, three modular co-attention variants can be composed. The following describes how these three variants are built; the figure below shows their structure.

In ID(Y)-GA(X,Y), the features of modality Y are passed directly to the output through an identity mapping, while Y and X are fed into a GA(X,Y) unit for cross-modal interaction, with Y guiding the feature learning of X.
In SA(Y)-GA(X,Y), modality Y first goes through an SA(Y) unit for intra-modal interaction; the result is passed to the output and is also fed, together with modality X, into a GA(X,Y) unit to compute cross-modal attention.
SA(Y)-SGA(X,Y) builds on SA(Y)-GA(X,Y) by adding an SA(X) unit that computes intra-modal attention for modality X, whose output then computes cross-modal attention with the intra-modal attention output of modality Y.
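As a hedged sketch of the third variant, SA(Y)-SGA(X,Y), here is a minimal composition built on PyTorch's nn.MultiheadAttention (this is not the authors' implementation, whose SGA and SA classes appear later, and the feed-forward sublayers of each unit are omitted for brevity):

    import torch
    import torch.nn as nn

    class ModularCoAttention(nn.Module):
        # Sketch of SA(Y)-SGA(X,Y): intra-modal attention on Y, intra-modal attention
        # on X, then cross-modal attention of X guided by Y.
        def __init__(self, d_model=512, num_heads=8):
            super().__init__()
            self.sa_y = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.sa_x = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.ga_xy = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.norm_y = nn.LayerNorm(d_model)
            self.norm_x = nn.LayerNorm(d_model)
            self.norm_g = nn.LayerNorm(d_model)

        def forward(self, x, y):
            y = self.norm_y(y + self.sa_y(y, y, y)[0])     # SA(Y): MHA(Y, Y, Y)
            x = self.norm_x(x + self.sa_x(x, x, x)[0])     # SA(X): MHA(X, X, X)
            x = self.norm_g(x + self.ga_xy(x, y, y)[0])    # GA(X, Y): MHA(X, Y, Y), Y guides X
            return x, y

    text  = torch.randn(32, 60, 512)   # modality X (e.g. language)
    audio = torch.randn(32, 80, 512)   # modality Y (e.g. acoustic); a different length is fine
    x_out, y_out = ModularCoAttention()(text, audio)   # shapes: (32, 60, 512), (32, 80, 512)

Dropping the SA(X) step gives SA(Y)-GA(X,Y); additionally replacing SA(Y) with an identity mapping gives ID(Y)-GA(X,Y).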
Cross-modal Transformer encoding
Now let's look at the cross-modal Transformer encoding proposed in this paper. As shown in the figure below, it stacks B of the Modular Co-Attention layers described above together with glimpse layers (introduced next). This encoding can be extended to any number of modalities by replicating the architecture. In the examples of this paper, the linguistic modality is always conditioned on the other modalities.
Note that the output of each Modular Co-Attention layer is fed into a glimpse layer, and the output of that glimpse layer is fed into the next Modular Co-Attention layer; this alternation is repeated B times. It is not the case that B Modular Co-Attention layers are computed first and their results then passed to B glimpse layers.

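To make this alternation explicit, here is a minimal wiring sketch; CrossModalEncoder, IdentityCoAttn, and the identity glimpse layers are illustrative placeholders following the description above (one possible wiring, applying the glimpse to the jointly encoded modality x), not the repository classes:

    import torch
    import torch.nn as nn

    class CrossModalEncoder(nn.Module):
        # Each of the B blocks applies one Modular Co-Attention layer and then one
        # glimpse layer; the glimpse output feeds the next block.
        def __init__(self, co_attn_layers, glimpse_layers):
            super().__init__()
            self.co_attn = nn.ModuleList(co_attn_layers)   # B co-attention layers
            self.glimpse = nn.ModuleList(glimpse_layers)   # B glimpse layers

        def forward(self, x, y):
            for co_attn, glimpse in zip(self.co_attn, self.glimpse):
                x, y = co_attn(x, y)   # joint encoding of the two modalities
                x = glimpse(x)         # project x into a new representation space
            return x

    # wiring check with identity stand-ins for both layer types (purely illustrative)
    class IdentityCoAttn(nn.Module):
        def forward(self, x, y):
            return x, y

    B = 4
    encoder = CrossModalEncoder([IdentityCoAttn() for _ in range(B)],
                                [nn.Identity() for _ in range(B)])
    out = encoder(torch.randn(32, 60, 512), torch.randn(32, 80, 512))   # -> (32, 60, 512)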
Modular Co-Attention layer
It is hard to tell from the figure above which of the three Modular Co-Attention variants just introduced is actually used; perhaps the diagram is simply not drawn precisely, so let's look at the code. From the code, modality y goes through an SA unit and is then fed, together with modality x, into guided attention, which at first glance looks like the SA(Y)-GA(X,Y) variant introduced above. A closer look shows a difference, though: in the SGA class, modality x also computes its own multi-head self-attention (without a following MLP) before attending to modality y, so the block actually corresponds to the SA(Y)-SGA(X,Y) variant. But roughly, that is the idea...
import torch.nn as nn

# MHAtt, FFN and LayerNorm are helper modules defined elsewhere in the repository.

class SGA(nn.Module):
    # Self-attention on x followed by guided attention of x on y.
    def __init__(self, args):
        super(SGA, self).__init__()
        self.mhatt1 = MHAtt(args)   # multi-head self-attention over x
        self.mhatt2 = MHAtt(args)   # multi-head guided attention: x attends to y
        self.ffn = FFN(args)        # position-wise feed-forward network

        self.dropout1 = nn.Dropout(args.dropout_r)
        self.norm1 = LayerNorm(args.hidden_size)
        self.dropout2 = nn.Dropout(args.dropout_r)
        self.norm2 = LayerNorm(args.hidden_size)
        self.dropout3 = nn.Dropout(args.dropout_r)
        self.norm3 = LayerNorm(args.hidden_size)

    def forward(self, x, y, x_mask, y_mask):
        # intra-modal self-attention on x, with residual connection and layer norm
        x = self.norm1(x + self.dropout1(
            self.mhatt1(v=x, k=x, q=x, mask=x_mask)
        ))
        # cross-modal guided attention: x queries the keys/values of y
        x = self.norm2(x + self.dropout2(
            self.mhatt2(v=y, k=y, q=x, mask=y_mask)
        ))
        # feed-forward layer, again with residual connection and layer norm
        x = self.norm3(x + self.dropout3(
            self.ffn(x)
        ))
        return x


class SA(nn.Module):
    # Standard Transformer encoder block: self-attention + feed-forward.
    def __init__(self, args):
        super(SA, self).__init__()
        self.mhatt = MHAtt(args)
        self.ffn = FFN(args)

        self.dropout1 = nn.Dropout(args.dropout_r)
        self.norm1 = LayerNorm(args.hidden_size)
        self.dropout2 = nn.Dropout(args.dropout_r)
        self.norm2 = LayerNorm(args.hidden_size)

    def forward(self, y, y_mask):
        # intra-modal self-attention with residual connection and layer norm
        y = self.norm1(y + self.dropout1(
            self.mhatt(y, y, y, y_mask)
        ))
        # feed-forward layer with residual connection and layer norm
        y = self.norm2(y + self.dropout2(
            self.ffn(y)
        ))
        return y
Glimpse layer
After the Modular Co-Attention layer comes the Glimpse layer; let's see how it works. Here the modality is projected into a new representation space. A Glimpse layer is composed of a stack of G soft-attention layers, and each soft attention is regarded as one glimpse. Formally, soft attention (SoA) $i$ over the input matrix $M\in R^{N \times k}$ (the output of the previous layer, i.e. of the Modular Co-Attention layer) is defined through an MLP and a weighted sum:
$$a_i=softmax(v_i^{a^T}(W_mM)) \tag{1}$$
$$SoA_i(M)=m_i=\sum_{j=0}^{N}a_{ij}M_j \tag{2}$$
The glimpse mechanism over the matrix M is defined as the stack of $G^m$ glimpses:
$$G_M=Stacking(m_1,\cdots,m_{G^m})\tag{3}$$
In this model, we always choose $G^m=N$, so the sizes match and a final residual connection can be applied to obtain the Glimpse layer output:
$$M=LayerNorm(M+G_M)$$
The formulas alone are a bit abstract, so let's analyze the Glimpse layer code (the AttFlat class)...
import torch
import torch.nn as nn
import torch.nn.functional as F

# MLP is a helper module defined elsewhere in the repository.

class AttFlat(nn.Module):
    # Glimpse layer: computes flat_glimpse soft-attention maps over the input sequence.
    def __init__(self, args, flat_glimpse, merge=False):
        super(AttFlat, self).__init__()
        self.args = args
        self.merge = merge
        self.flat_glimpse = flat_glimpse   # number of glimpses G^m

        self.mlp = MLP(
            in_size=args.hidden_size,
            mid_size=args.ff_size,
            out_size=flat_glimpse,         # one attention logit per glimpse and per position
            dropout_r=args.dropout_r,
            use_relu=True
        )

        if self.merge:
            # last block only: concatenate the glimpses and project to 2 * hidden_size
            self.linear_merge = nn.Linear(
                args.hidden_size * flat_glimpse,
                args.hidden_size * 2
            )

    def forward(self, x, x_mask):
        att = self.mlp(x)                  # formula (1), before the softmax
        if x_mask is not None:
            # mask out padded positions before the softmax
            att = att.masked_fill(
                x_mask.squeeze(1).squeeze(1).unsqueeze(2),
                -1e9
            )
        att = F.softmax(att, dim=1)        # attention weights over the sequence dimension

        att_list = []
        for i in range(self.flat_glimpse):
            # formula (2): glimpse i is a weighted sum over the sequence
            att_list.append(
                torch.sum(att[:, :, i: i + 1] * x, dim=1)
            )

        if self.merge:
            x_atted = torch.cat(att_list, dim=1)      # (batch, hidden_size * flat_glimpse)
            x_atted = self.linear_merge(x_atted)      # (batch, 2 * hidden_size)
            return x_atted

        # formula (3): stack the glimpses -> (batch, flat_glimpse, hidden_size)
        return torch.stack(att_list).transpose_(0, 1)
Looking at where this class is called, you can see that in the first B-1 blocks of the cross-modal Transformer encoding, the parameter flat_glimpse is the sequence length of the input modality, while in the B-th (last) block flat_glimpse is 1. This keeps the dimensionality of the final output small, so it can be fed directly into the linear layer for sentiment prediction.
Let's look at how the Glimpse layer works in the first B-1 blocks. Taking the text modality as an example, the input size is (32, 60, 512), where 32 is the batch size, 60 is the number of words, and 512 is the word-vector dimension.
- After the MLP layer, the size becomes (32, 60, 60).
- After the softmax, the size is still (32, 60, 60), but now each slice along dimension 1 sums to 1. These first two steps correspond to formula (1).
- `att[:, :, i: i + 1] * x` multiplies each of the 60 attention maps (the third dimension) with the original input: (32, 60, 1) * (32, 60, 512) = (32, 60, 512), which is then summed over dimension 1 to give (32, 512). The loop runs 60 times in total, and each result is stored in the list att_list. This step is formula (2).
- Finally, the tensors in att_list are stacked, giving a final size of (32, 60, 512). This step is formula (3).
In the B-th block, the output size is (32, 1024).
The following figure shows the calculation process .

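To make the shape bookkeeping concrete, here is a small self-contained check of the three steps above; it uses a plain two-layer MLP (the hidden width 2048 is an arbitrary choice) as a stand-in for the repository's MLP class, omits masking, and uses the 32/60/512 sizes from the example in the text:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    batch, seq_len, hidden = 32, 60, 512
    x = torch.randn(batch, seq_len, hidden)

    flat_glimpse = seq_len                        # G^m = N in the first B-1 blocks
    mlp = nn.Sequential(nn.Linear(hidden, 2048), nn.ReLU(), nn.Linear(2048, flat_glimpse))

    att = F.softmax(mlp(x), dim=1)                # formula (1): (32, 60, 60); each att[:, :, i] sums to 1 over dim 1
    glimpses = [torch.sum(att[:, :, i:i + 1] * x, dim=1)   # formula (2): each glimpse is (32, 512)
                for i in range(flat_glimpse)]
    g = torch.stack(glimpses).transpose(0, 1)     # formula (3): stack the G^m glimpses -> (32, 60, 512)

    out = nn.LayerNorm(hidden)(x + g)             # residual connection, possible because G^m = N
    print(att.shape, glimpses[0].shape, g.shape, out.shape)
    # torch.Size([32, 60, 60]) torch.Size([32, 512]) torch.Size([32, 60, 512]) torch.Size([32, 60, 512])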
So what is the purpose of this layer? It feels like it is computing attention over its own input, hence the name "soft attention"; presumably full self-attention is not used here because it would be more expensive to compute and would add more parameters to learn.
Classification layer
Next comes the sentiment prediction, i.e. the classification layer. As mentioned before, in the last block of the cross-modal Transformer encoding the Glimpse layer has flat_glimpse = 1, and for text its output is (32, 1024); so each modality is reduced to a single vector. The vectors of the individual modalities are summed element-wise; calling this sum s, it is projected onto the possible answers according to the following formula.
$$y_p=W_a(LayerNorm(s))$$
If there is only one modality, the summation is omitted.
Take a look at the code:
# Classification layers (in __init__)
self.proj_norm = LayerNorm(2 * args.hidden_size)
self.proj = nn.Linear(2 * args.hidden_size, args.ans_size)

# Classification layers (in forward): x and y are the (32, 1024) outputs of the two modalities
proj_feat = x + y                      # element-wise sum s
proj_feat = self.proj_norm(proj_feat)  # LayerNorm(s)
ans = self.proj(proj_feat)             # W_a(LayerNorm(s))
After the sum, a layer normalization is applied, and the final classification result is obtained through a linear layer.
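For completeness, here is a self-contained sketch of this classification head; the vector size 1024 = 2 × hidden_size follows the text, while ans_size = 6 (one output per CMU-MOSEI emotion class) is an assumption for illustration:

    import torch
    import torch.nn as nn

    hidden_size, ans_size = 512, 6          # ans_size = 6 emotion classes, assumed for illustration
    proj_norm = nn.LayerNorm(2 * hidden_size)
    proj = nn.Linear(2 * hidden_size, ans_size)

    x = torch.randn(32, 2 * hidden_size)    # glimpse output of modality 1 (e.g. text)
    y = torch.randn(32, 2 * hidden_size)    # glimpse output of modality 2 (e.g. audio)

    s = x + y                               # element-wise sum of the modality vectors
    ans = proj(proj_norm(s))                # y_p = W_a(LayerNorm(s)), shape (32, 6)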
Experiments
Standard accuracy:

Weighted accuracy: