
2020 ACL: A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis

2022-07-23 06:11:00 CityD

A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis

Paper: https://aclanthology.org/2020.challengehml-1.1/

Sentiment analysis and emotion recognition

In sentiment analysis, emotional expression can come from text, audio, and images. Combining two or more of these modalities to model sentiment is multimodal sentiment analysis. As shown in the figure below, three modalities (text, images, and audio, all extracted from the same multimedia data) are used to determine whether the sentiment expressed by the multimedia data is positive or negative, or which emotion it expresses (happy, excited, sad, angry); this is multimodal sentiment analysis.

[Figure: multimodal sentiment analysis from text, image, and audio]

The difference between sentiment analysis and emotion recognition is that sentiment analysis only determines whether the sentiment expressed by the multimedia data is positive or negative, so there are only two classes: positive (1) and negative (-1). Alternatively, according to the intensity of the sentiment, the data can be annotated on a [-3, 3] scale ranging from highly negative (-3) to highly positive (3).

Emotion recognition, by contrast, identifies which emotion the multimedia data expresses, such as happiness, sadness, anger, fear, disgust, or surprise, which are the labels most commonly used in emotion recognition datasets. The CMU-MOSEI dataset used in this paper divides emotions into exactly these six categories.

[Figure: emotion recognition example with the six emotion categories]

The model proposed in this paper is applied to both emotion recognition and sentiment analysis. **It is a Transformer-based joint encoding: besides the Transformer framework, the method relies on a modular co-attention and a glimpse layer to jointly encode one or more modalities.** The model draws inspiration from machine translation (Transformers, Vaswani et al. (2017)) and visual question answering (modular co-attention, Yu et al. (2019)).

Self-attention (SA) unit & guided-attention (GA) unit

Two basic units are introduced first: the self-attention (SA) unit and the guided-attention (GA) unit, both inspired by the scaled dot-product attention proposed in the Transformer. SA and GA are the two basic attention units from which attention over different input types is built. The SA unit takes one set of input features X and outputs the attended features Z of X; the GA unit takes two sets of input features X and Y and outputs the attended features Z of X under the guidance of Y. The SA unit is similar to the Transformer encoder and serves as a unimodal encoder; the GA unit adapts SA to multimodal tasks and serves as a multimodal encoder.

[Figure: the SA and GA attention units]

Given a query $Q \in R^{n \times d}$, a key $K \in R^{n \times d}$ and a value $V \in R^{n \times d}$, the scaled dot-product attention function computes the dot product of the query with the key, divides it by $\sqrt{d}$, and applies a softmax to obtain attention weights over the values, which are then used to weight the values:

$$f = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
To further improve the representational power of the attended features, multi-head attention is introduced. It consists of $h$ parallel "heads", each corresponding to an independent scaled dot-product attention function. The output features are computed as:

$$f = \mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h)W_o, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Each SA and GA unit consists of a multi-head attention layer and a feed-forward layer, with a residual connection and layer normalization applied to the output of both layers to ease optimization.
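The MHAtt module used by the SA and SGA code shown later in this post is not reproduced there. Below is a minimal, self-contained sketch of the multi-head attention formula above; the class name SimpleMHA, d_model = 512 and n_heads = 8 are illustrative choices, not the repository's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMHA(nn.Module):
    """Minimal multi-head attention: f = Concat(head_1..head_h) W_o."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # packs all W_i^Q
        self.w_k = nn.Linear(d_model, d_model)   # packs all W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # packs all W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W_o

    def forward(self, q, k, v):
        b = q.size(0)
        # project and split into h heads: (batch, heads, length, d_k)
        q = self.w_q(q).view(b, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(b, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(b, -1, self.h, self.d_k).transpose(1, 2)
        # scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        out = F.softmax(scores, dim=-1) @ v
        # concatenate the heads and apply the output projection
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)

mha = SimpleMHA()
x = torch.randn(32, 60, 512)   # e.g. 60 word vectors per sample
y = torch.randn(32, 45, 512)   # e.g. 45 audio frames per sample
print(mha(x, x, x).shape)      # MHA(X, X, X), the SA case: torch.Size([32, 60, 512])
print(mha(x, y, y).shape)      # MHA(X, Y, Y), the GA case: torch.Size([32, 60, 512])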

SA unit

The SA unit takes one set of input features $X=[x_1;\cdots;x_m]\in R^{m \times d_x}$. Its multi-head attention layer learns the pairwise attention between samples $\langle x_i, x_j\rangle$ within $X$ and produces its output by taking a weighted sum over all the values in $X$:

$$f = \mathrm{MHA}(X, X, X)$$

This can be read as reconstructing each $x_i$ from the normalized similarities between $x_i$ and all samples in $X$. With the residual connections and layer normalization, the output of the SA unit is computed in two steps:

$$X' = \mathrm{LayerNorm}(X + \mathrm{MHA}(X, X, X))$$
$$Z = \mathrm{LayerNorm}(X' + \mathrm{MLP}(X'))$$

GA unit

The GA unit takes two sets of input features $X=[x_1;\cdots;x_m]\in R^{m \times d_x}$ and $Y=[y_1;\cdots;y_n]\in R^{n \times d_y}$, where $Y$ guides the attention learning for $X$. The shapes of $X$ and $Y$ are flexible, so they can represent features of different modalities (for example text and images). The GA unit models the pairwise relationship between each sample pair $\langle x_i, y_j\rangle$ from $X$ and $Y$. The output of the multi-head attention layer is:

$$f = \mathrm{MHA}(X, Y, Y)$$

This can be read as reconstructing each $x_i$ from its normalized cross-modal similarities with all samples in $Y$. The output of the GA unit is:

$$X' = \mathrm{LayerNorm}(X + \mathrm{MHA}(X, Y, Y))$$
$$Z = \mathrm{LayerNorm}(X' + \mathrm{MLP}(X'))$$

Unimodal Transformer encoding

The unimodal Transformer encoder used by the proposed model stacks B SA units, each with its own trainable parameters, as shown below:

[Figure: the unimodal Transformer encoder as a stack of B SA units]
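As a quick illustration of this stacking (not the repository's code), PyTorch's built-in encoder layer, which also combines self-attention, a feed-forward network, residual connections and layer normalization, can stand in for an SA unit; B = 4 and the tensor shapes below are assumed values.

import torch
import torch.nn as nn

B = 4  # number of stacked SA-like units (assumed for illustration)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
unimodal_encoder = nn.TransformerEncoder(layer, num_layers=B)  # B independent copies, each with its own parameters

x = torch.randn(32, 60, 512)   # (batch, sequence length, feature size)
z = unimodal_encoder(x)        # same shape; intra-modal attention applied B times
print(z.shape)                 # torch.Size([32, 60, 512])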

Modular Co-Attention

As mentioned earlier, besides the Transformer framework, the proposed approach relies on a modular co-attention (Modular Co-Attention) and a glimpse layer to jointly encode one or more modalities. Let's first look at what modular co-attention is.

From the two basic attention units SA and GA introduced above, three modular co-attention variants can be composed. The following describes how these three modules are put together; their structures are shown in the diagram below.

[Figure: the three modular co-attention compositions: ID(Y)-GA(X,Y), SA(Y)-GA(X,Y), SA(Y)-SGA(X,Y)]

In ID(Y)-GA(X,Y), the features of modality Y are passed directly to the output through an identity mapping, while Y and X are fed into the GA(X,Y) unit for cross-modal interaction, with modality Y guiding the feature learning of modality X.

In SA(Y)-GA(X,Y), modality Y first goes through an SA(Y) unit for intra-modal interaction; the result is passed on to the output and is also fed, together with modality X, into the GA(X,Y) unit to compute cross-modal attention.

SA(Y)-SGA(X,Y) adds, on top of SA(Y)-GA(X,Y), an SA(X) unit that computes the intra-modal attention of modality X; cross-modal attention is then computed with the intra-modal attention output of modality Y.

Cross-modal Transformer encoding

Now let's look at the cross-modal Transformer encoding proposed in the paper. As shown in the figure below, it stacks B of the modular co-attention layers (Modular Co-Attention layer) introduced above together with the glimpse layers introduced next. The encoding can be extended to any number of modalities by replicating the architecture. In the examples of this post, the language modality is always used to condition the other modalities.

Note that the output of each modular co-attention layer is used as the input of a glimpse layer, and the output of that glimpse layer is used as the input of the next modular co-attention layer, repeated B times. It is not the case that the outputs of B modular co-attention layers are passed to B glimpse layers all at once.

[Figure: the cross-modal Transformer encoding, alternating modular co-attention layers and glimpse layers]
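To make the data flow concrete, here is a minimal runnable sketch of that alternation. The two modules below are simplified stand-ins built on PyTorch's nn.MultiheadAttention, chosen only to show how the B blocks are chained; they are not the repository's SA/SGA/AttFlat classes, and B = 4 and the tensor shapes are assumed.

import torch
import torch.nn as nn

class CoAttentionStandIn(nn.Module):
    """Stand-in for a modular co-attention layer: y attends to itself, then guides x."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.self_att = nn.MultiheadAttention(d, heads, batch_first=True)
        self.guided_att = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x, y):
        y = self.self_att(y, y, y)[0]     # intra-modal attention on the guiding modality
        x = self.guided_att(x, y, y)[0]   # y guides the attention over x
        return x, y

class GlimpseStandIn(nn.Module):
    """Stand-in for a glimpse layer: a learned soft re-weighting that keeps the shape."""
    def __init__(self, d=512):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, x):
        w = torch.softmax(self.score(x), dim=1)   # (batch, length, 1) attention weights
        return x * w                              # re-weighted sequence, same shape

B = 4
co_att = nn.ModuleList([CoAttentionStandIn() for _ in range(B)])
glimpses = nn.ModuleList([GlimpseStandIn() for _ in range(B)])

x = torch.randn(32, 60, 512)   # modality being encoded (e.g. audio or vision)
y = torch.randn(32, 60, 512)   # language modality, used as the guide
for block, glimpse in zip(co_att, glimpses):
    x, y = block(x, y)
    x = glimpse(x)             # the glimpse output feeds the next co-attention block
print(x.shape)                 # torch.Size([32, 60, 512])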

Modular co-attention layer

Of the three modular co-attention variants just introduced, it is hard to tell from the figure above which one is used (the author's drawing may not be exact), so let's look at the code. From the code, modality x first goes through a self-attention step and is then fed, together with modality y, into a guided-attention step. That looks like the SA(Y)-GA(X,Y) variant introduced above, but a closer look shows a difference: in the SGA class, x first computes multi-head self-attention, without going through an MLP, and the result is then used to compute cross-modal attention with modality y. In other words, this is the SGA block of the SA(Y)-SGA(X,Y) variant, with modality y getting its own SA unit in the SA class below. But the idea is roughly the same...

import torch.nn as nn

# MHAtt (multi-head attention), FFN (feed-forward network) and LayerNorm
# are defined elsewhere in the repository.

class SGA(nn.Module):
    """Guided attention: x first attends to itself, then is guided by y."""
    def __init__(self, args):
        super(SGA, self).__init__()

        self.mhatt1 = MHAtt(args)   # self-attention over x
        self.mhatt2 = MHAtt(args)   # attention over x guided by y
        self.ffn = FFN(args)

        self.dropout1 = nn.Dropout(args.dropout_r)
        self.norm1 = LayerNorm(args.hidden_size)

        self.dropout2 = nn.Dropout(args.dropout_r)
        self.norm2 = LayerNorm(args.hidden_size)

        self.dropout3 = nn.Dropout(args.dropout_r)
        self.norm3 = LayerNorm(args.hidden_size)

    def forward(self, x, y, x_mask, y_mask):
        # intra-modal (self) attention on x, with residual connection + layer norm
        x = self.norm1(x + self.dropout1(
            self.mhatt1(v=x, k=x, q=x, mask=x_mask)
        ))

        # cross-modal attention: x queries y (keys and values come from y)
        x = self.norm2(x + self.dropout2(
            self.mhatt2(v=y, k=y, q=x, mask=y_mask)
        ))

        # position-wise feed-forward network, with residual connection + layer norm
        x = self.norm3(x + self.dropout3(
            self.ffn(x)
        ))

        return x


class SA(nn.Module):
    """Self-attention unit: one multi-head attention layer plus an FFN."""
    def __init__(self, args):
        super(SA, self).__init__()

        self.mhatt = MHAtt(args)
        self.ffn = FFN(args)

        self.dropout1 = nn.Dropout(args.dropout_r)
        self.norm1 = LayerNorm(args.hidden_size)

        self.dropout2 = nn.Dropout(args.dropout_r)
        self.norm2 = LayerNorm(args.hidden_size)

    def forward(self, y, y_mask):
        # intra-modal (self) attention on y, with residual connection + layer norm
        y = self.norm1(y + self.dropout1(
            self.mhatt(y, y, y, y_mask)
        ))

        # position-wise feed-forward network, with residual connection + layer norm
        y = self.norm2(y + self.dropout2(
            self.ffn(y)
        ))

        return y

Glimpse layer

After the modular co-attention layer comes the glimpse layer; let's see how it works. Here a modality is projected into a new representation space. A glimpse layer is a stack of G soft-attention layers, and each soft attention is regarded as one glimpse. Formally, the i-th soft attention (SoA) over an input matrix $M \in R^{N \times k}$ (the output of the previous layer, i.e., of the modular co-attention layer) is defined with an MLP and a weighted sum:

$$a_i = \mathrm{softmax}\!\left(v_i^{a\,T}(W_m M)\right) \tag{1}$$

$$\mathrm{SoA}_i(M) = m_i = \sum_{j=0}^{N} a_{ij} M_j \tag{2}$$

The glimpse mechanism over the matrix $M$ is defined as the stack of the $G^m$ glimpses:

$$G_M = \mathrm{Stacking}(m_1, \cdots, m_{G^m}) \tag{3}$$

In this model we always choose $G^m = N$, so the sizes match and a final residual connection can be applied to obtain the glimpse layer output:

$$M = \mathrm{LayerNorm}(M + G_M)$$
If that still feels abstract, let's walk through the glimpse layer code...

import torch
import torch.nn as nn
import torch.nn.functional as F

# MLP is defined elsewhere in the repository.

class AttFlat(nn.Module):
    """Glimpse layer: flat_glimpse soft attentions over the input sequence."""
    def __init__(self, args, flat_glimpse, merge=False):
        super(AttFlat, self).__init__()
        self.args = args
        self.merge = merge
        self.flat_glimpse = flat_glimpse
        # MLP producing one attention score per glimpse for every position
        self.mlp = MLP(
            in_size=args.hidden_size,
            mid_size=args.ff_size,
            out_size=flat_glimpse,
            dropout_r=args.dropout_r,
            use_relu=True
        )

        if self.merge:
            # used in the last block: concatenate the glimpses and project
            self.linear_merge = nn.Linear(
                args.hidden_size * flat_glimpse,
                args.hidden_size * 2
            )

    def forward(self, x, x_mask):
        att = self.mlp(x)                       # (batch, seq_len, flat_glimpse)
        if x_mask is not None:
            # mask out padded positions before the softmax
            att = att.masked_fill(
                x_mask.squeeze(1).squeeze(1).unsqueeze(2),
                -1e9
            )
        att = F.softmax(att, dim=1)             # formula (1): softmax over the sequence dimension

        # formula (2): each glimpse is a weighted sum of the input positions
        att_list = []
        for i in range(self.flat_glimpse):
            att_list.append(
                torch.sum(att[:, :, i: i + 1] * x, dim=1)
            )

        if self.merge:
            # last block: concatenate the glimpses and project to 2 * hidden_size
            x_atted = torch.cat(att_list, dim=1)
            x_atted = self.linear_merge(x_atted)

            return x_atted

        # formula (3): stack the glimpses into (batch, flat_glimpse, hidden_size)
        return torch.stack(att_list).transpose_(0, 1)

In the later calls to this class you can see that, in the first B-1 blocks of the cross-modal Transformer encoding, the parameter flat_glimpse is the sequence length of the incoming modality, while in the B-th block, i.e., the last block, flat_glimpse is 1. This shrinks the dimensionality of the final output, making it convenient to feed into the linear layer that performs the sentiment prediction.

Let's first look at how the glimpse layer works in the first B-1 blocks. Take the text modality as an example: the input size is (32, 60, 512), where 32 is the batch size, 60 is the number of words, and 512 is the length of the word vectors.

  1. After the MLP layer, the size becomes (32, 60, 60).
  2. Then softmax is applied; the size is still (32, 60, 60), but along dimension 1 the values now sum to 1. These first two steps are formula (1).
  3. att[:, :, i: i + 1] * x takes each of the 60 slices along the last dimension (one per glimpse; here flat_glimpse equals the number of words) and multiplies it with the original input: (32, 60, 1) * (32, 60, 512) = (32, 60, 512). Summing over dimension 1 then gives an output of size (32, 512). The loop runs 60 times in total, and each result is stored in the list att_list. This step is formula (2).
  4. Finally, the tensors in att_list are stacked together, giving a final size of (32, 60, 512). This step is formula (3).

In the B-th block the output size is (32, 1024) instead; a standalone shape check of both cases follows below.
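The shape bookkeeping above can be reproduced with plain tensor operations. The sketch below does not call the repository's AttFlat (its MLP helper is not shown in this post); the 2048 hidden width stands in for args.ff_size, and the shapes follow the text-modality example.

import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, hidden = 32, 60, 512
x = torch.randn(batch, seq_len, hidden)

# First B-1 blocks: flat_glimpse equals the sequence length.
flat_glimpse = seq_len
mlp = nn.Sequential(nn.Linear(hidden, 2048), nn.ReLU(), nn.Linear(2048, flat_glimpse))

att = F.softmax(mlp(x), dim=1)                          # formula (1): (32, 60, 60), sums to 1 over dim 1
att_list = [torch.sum(att[:, :, i:i + 1] * x, dim=1)    # formula (2): each item is (32, 512)
            for i in range(flat_glimpse)]
stacked = torch.stack(att_list).transpose(0, 1)         # formula (3): (32, 60, 512)
print(att.shape, att_list[0].shape, stacked.shape)

# Last (B-th) block: flat_glimpse = 1 and merge=True, so the single glimpse
# is projected to 2 * hidden, giving the (32, 1024) vector mentioned above.
mlp_last = nn.Sequential(nn.Linear(hidden, 2048), nn.ReLU(), nn.Linear(2048, 1))
att_last = F.softmax(mlp_last(x), dim=1)                # (32, 60, 1)
single = torch.sum(att_last * x, dim=1)                 # (32, 512)
merged = nn.Linear(hidden, 2 * hidden)(single)          # (32, 1024)
print(merged.shape)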

The following figure shows the calculation process.

[Figure: the glimpse layer computation]

So what is this layer actually doing? It feels like it is computing self-attention again, hence the name soft attention; full self-attention is presumably not used here because it would cost more computation and add more parameters to learn.

Classification layer

Finally, the sentiment prediction, i.e., the classification layer. As mentioned above, in the last block of the cross-modal Transformer encoding the glimpse layer uses a single glimpse, and for the text modality its output is (32, 1024). So each modality is reduced to a single vector. The vectors of the modalities are summed element-wise; call this sum s. It is then projected onto the possible answers according to:

$$\hat{y}^{p} = W_a(\mathrm{LayerNorm}(s))$$

If there is only one modality, the summation is simply omitted.

Take a look at the code:

# Classification layers (in the model's __init__; LayerNorm is the repository's own class)
self.proj_norm = LayerNorm(2 * args.hidden_size)
self.proj = nn.Linear(2 * args.hidden_size, args.ans_size)

# Classification layers (in the forward pass)
proj_feat = x + y                       # element-wise sum of the modality vectors
proj_feat = self.proj_norm(proj_feat)
ans = self.proj(proj_feat)              # logits over the answer classes

After the summation, a layer normalization is applied, and the final classification result is obtained through a linear layer.
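As a final shape check, a self-contained version of this head might look like the following; hidden_size = 512 and six answer classes (the CMU-MOSEI emotions) are assumed values for args.hidden_size and args.ans_size.

import torch
import torch.nn as nn

hidden_size, ans_size = 512, 6             # six emotion classes assumed
proj_norm = nn.LayerNorm(2 * hidden_size)
proj = nn.Linear(2 * hidden_size, ans_size)

x = torch.randn(32, 2 * hidden_size)       # language-branch vector from the last glimpse layer
y = torch.randn(32, 2 * hidden_size)       # other-modality vector
s = x + y                                  # element-wise sum of the modality vectors
ans = proj(proj_norm(s))                   # (32, 6) scores over the answer classes
print(ans.shape)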

Experiments

Standard accuracies
[Figure: standard accuracy results from the paper]

Weighted accuracies
[Figure: weighted accuracy results from the paper]
