
2020 ACL: A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis

2022-07-23 06:11:00 CityD

A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis

Paper: https://aclanthology.org/2020.challengehml-1.1/

Sentiment analysis and emotion recognition

In sentiment analysis, emotional expression can come from text, audio, and images. Combining two or more of these modalities to model sentiment is multimodal sentiment analysis. As shown in the figure below, three modalities (text, images, and audio, all extracted from the same multimedia data) are used to determine whether the sentiment expressed by the multimedia data is positive or negative, or which emotion it expresses (happy, excited, sad, angry); this is multimodal sentiment analysis.

[Figure: multimodal sentiment analysis from text, image, and audio]

The difference between sentiment analysis and emotion recognition is that sentiment analysis only determines whether the sentiment expressed by the multimedia data is positive or negative, so there are only two classes: positive (1) and negative (-1). Alternatively, according to the intensity of the sentiment, the data can be annotated on a [-3, 3] scale ranging from highly negative (-3) to highly positive (3).

Emotion recognition, by contrast, identifies which emotion the multimedia data expresses, such as happiness, sadness, anger, fear, disgust, or surprise, which are the labels most commonly used in emotion recognition datasets. The CMU-MOSEI dataset used in this paper divides emotions into exactly these six categories.

[Figure: emotion recognition example with the six emotion categories]

The model proposed in this paper is applied to both emotion recognition and sentiment analysis. **It is a Transformer-based joint encoding: besides the Transformer framework, the method relies on a modular co-attention and a glimpse layer to jointly encode one or more modalities.** The model draws inspiration from machine translation (Transformers, Vaswani et al. (2017)) and visual question answering (modular co-attention, Yu et al. (2019)).

Self-attention (SA) unit & guided-attention (GA) unit

Two basic units are introduced first: the self-attention (SA) unit and the guided-attention (GA) unit, both inspired by the scaled dot-product attention proposed in the Transformer. SA and GA are the two basic attention units from which attention over different input types is built. The SA unit takes one set of input features X and outputs the attended features Z of X; the GA unit takes two sets of input features X and Y and outputs the attended features Z of X under the guidance of Y. The SA unit is similar to the Transformer encoder and serves as a unimodal encoder; the GA unit adapts SA to multimodal tasks and serves as a multimodal encoder.

[Figure: the SA and GA attention units]

Given a query $Q \in R^{n \times d}$, a key $K \in R^{n \times d}$ and a value $V \in R^{n \times d}$, the scaled dot-product attention function computes the dot product of the query with the key, divides it by $\sqrt{d}$, and applies a softmax to obtain attention weights over the values, which are then used to weight the values:

$$f = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
To further improve the representational power of the attended features, multi-head attention is introduced. It consists of $h$ parallel "heads", each corresponding to an independent scaled dot-product attention function. The output features are computed as:

$$f = \mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h)W_o, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Each SA and GA unit consists of a multi-head attention layer and a feed-forward layer, with a residual connection and layer normalization applied to the output of both layers to ease optimization.
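The MHAtt module used by the SA and SGA code shown later in this post is not reproduced there. Below is a minimal, self-contained sketch of the multi-head attention formula above; the class name SimpleMHA, d_model = 512 and n_heads = 8 are illustrative choices, not the repository's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMHA(nn.Module):
    """Minimal multi-head attention: f = Concat(head_1..head_h) W_o."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # packs all W_i^Q
        self.w_k = nn.Linear(d_model, d_model)   # packs all W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # packs all W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # W_o

    def forward(self, q, k, v):
        b = q.size(0)
        # project and split into h heads: (batch, heads, length, d_k)
        q = self.w_q(q).view(b, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(b, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(b, -1, self.h, self.d_k).transpose(1, 2)
        # scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        out = F.softmax(scores, dim=-1) @ v
        # concatenate the heads and apply the output projection
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)

mha = SimpleMHA()
x = torch.randn(32, 60, 512)   # e.g. 60 word vectors per sample
y = torch.randn(32, 45, 512)   # e.g. 45 audio frames per sample
print(mha(x, x, x).shape)      # MHA(X, X, X), the SA case: torch.Size([32, 60, 512])
print(mha(x, y, y).shape)      # MHA(X, Y, Y), the GA case: torch.Size([32, 60, 512])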

SA unit

The SA unit takes one set of input features $X=[x_1;\cdots;x_m]\in R^{m \times d_x}$. Its multi-head attention layer learns the pairwise attention between samples $\langle x_i, x_j\rangle$ within $X$ and produces its output by taking a weighted sum over all the values in $X$:

$$f = \mathrm{MHA}(X, X, X)$$

This can be read as reconstructing each $x_i$ from the normalized similarities between $x_i$ and all samples in $X$. With the residual connections and layer normalization, the output of the SA unit is computed in two steps:

$$X' = \mathrm{LayerNorm}(X + \mathrm{MHA}(X, X, X))$$
$$Z = \mathrm{LayerNorm}(X' + \mathrm{MLP}(X'))$$

GA unit

The GA unit takes two sets of input features $X=[x_1;\cdots;x_m]\in R^{m \times d_x}$ and $Y=[y_1;\cdots;y_n]\in R^{n \times d_y}$, where $Y$ guides the attention learning for $X$. The shapes of $X$ and $Y$ are flexible, so they can represent features of different modalities (for example text and images). The GA unit models the pairwise relationship between each sample pair $\langle x_i, y_j\rangle$ from $X$ and $Y$. The output of the multi-head attention layer is:

$$f = \mathrm{MHA}(X, Y, Y)$$

This can be read as reconstructing each $x_i$ from its normalized cross-modal similarities with all samples in $Y$. The output of the GA unit is:

$$X' = \mathrm{LayerNorm}(X + \mathrm{MHA}(X, Y, Y))$$
$$Z = \mathrm{LayerNorm}(X' + \mathrm{MLP}(X'))$$

Unimodal Transformer encoding

The unimodal Transformer encoder used by the proposed model stacks B SA units, each with its own trainable parameters, as shown below:

[Figure: the unimodal Transformer encoder as a stack of B SA units]
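As a quick illustration of this stacking (not the repository's code), PyTorch's built-in encoder layer, which also combines self-attention, a feed-forward network, residual connections and layer normalization, can stand in for an SA unit; B = 4 and the tensor shapes below are assumed values.

import torch
import torch.nn as nn

B = 4  # number of stacked SA-like units (assumed for illustration)
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
unimodal_encoder = nn.TransformerEncoder(layer, num_layers=B)  # B independent copies, each with its own parameters

x = torch.randn(32, 60, 512)   # (batch, sequence length, feature size)
z = unimodal_encoder(x)        # same shape; intra-modal attention applied B times
print(z.shape)                 # torch.Size([32, 60, 512])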

Modular Co-Attention

As mentioned earlier, besides the Transformer framework, the proposed approach relies on a modular co-attention (Modular Co-Attention) and a glimpse layer to jointly encode one or more modalities. Let's first look at what modular co-attention is.

From the two basic attention units SA and GA introduced above, three modular co-attention variants can be composed. The following describes how these three modules are put together; their structures are shown in the diagram below.

[Figure: the three modular co-attention compositions: ID(Y)-GA(X,Y), SA(Y)-GA(X,Y), SA(Y)-SGA(X,Y)]

In ID(Y)-GA(X,Y), the features of modality Y are passed directly to the output through an identity mapping, while Y and X are fed into the GA(X,Y) unit for cross-modal interaction, with modality Y guiding the feature learning of modality X.

In SA(Y)-GA(X,Y), modality Y first goes through an SA(Y) unit for intra-modal interaction; the result is passed on to the output and is also fed, together with modality X, into the GA(X,Y) unit to compute cross-modal attention.

SA(Y)-SGA(X,Y) adds, on top of SA(Y)-GA(X,Y), an SA(X) unit that computes the intra-modal attention of modality X; cross-modal attention is then computed with the intra-modal attention output of modality Y.

Cross-modal Transformer encoding

Now let's look at the cross-modal Transformer encoding proposed in the paper. As shown in the figure below, it stacks B of the modular co-attention layers (Modular Co-Attention layer) introduced above together with the glimpse layers introduced next. The encoding can be extended to any number of modalities by replicating the architecture. In the examples of this post, the language modality is always used to condition the other modalities.

Note that the output of each modular co-attention layer is used as the input of a glimpse layer, and the output of that glimpse layer is used as the input of the next modular co-attention layer, repeated B times. It is not the case that the outputs of B modular co-attention layers are passed to B glimpse layers all at once.

[Figure: the cross-modal Transformer encoding, alternating modular co-attention layers and glimpse layers]
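To make the data flow concrete, here is a minimal runnable sketch of that alternation. The two modules below are simplified stand-ins built on PyTorch's nn.MultiheadAttention, chosen only to show how the B blocks are chained; they are not the repository's SA/SGA/AttFlat classes, and B = 4 and the tensor shapes are assumed.

import torch
import torch.nn as nn

class CoAttentionStandIn(nn.Module):
    """Stand-in for a modular co-attention layer: y attends to itself, then guides x."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.self_att = nn.MultiheadAttention(d, heads, batch_first=True)
        self.guided_att = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x, y):
        y = self.self_att(y, y, y)[0]     # intra-modal attention on the guiding modality
        x = self.guided_att(x, y, y)[0]   # y guides the attention over x
        return x, y

class GlimpseStandIn(nn.Module):
    """Stand-in for a glimpse layer: a learned soft re-weighting that keeps the shape."""
    def __init__(self, d=512):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, x):
        w = torch.softmax(self.score(x), dim=1)   # (batch, length, 1) attention weights
        return x * w                              # re-weighted sequence, same shape

B = 4
co_att = nn.ModuleList([CoAttentionStandIn() for _ in range(B)])
glimpses = nn.ModuleList([GlimpseStandIn() for _ in range(B)])

x = torch.randn(32, 60, 512)   # modality being encoded (e.g. audio or vision)
y = torch.randn(32, 60, 512)   # language modality, used as the guide
for block, glimpse in zip(co_att, glimpses):
    x, y = block(x, y)
    x = glimpse(x)             # the glimpse output feeds the next co-attention block
print(x.shape)                 # torch.Size([32, 60, 512])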

Modular co-attention layer

Of the three modular co-attention variants just introduced, it is hard to tell from the figure above which one is used (the author's drawing may not be exact), so let's look at the code. From the code, modality x first goes through a self-attention step and is then fed, together with modality y, into a guided-attention step. That looks like the SA(Y)-GA(X,Y) variant introduced above, but a closer look shows a difference: in the SGA class, x first computes multi-head self-attention, without going through an MLP, and the result is then used to compute cross-modal attention with modality y. In other words, this is the SGA block of the SA(Y)-SGA(X,Y) variant, with modality y getting its own SA unit in the SA class below. But the idea is roughly the same...

import torch.nn as nn

# MHAtt (multi-head attention), FFN (feed-forward network) and LayerNorm
# are defined elsewhere in the repository.

class SGA(nn.Module):
    """Guided attention: x first attends to itself, then is guided by y."""
    def __init__(self, args):
        super(SGA, self).__init__()

        self.mhatt1 = MHAtt(args)   # self-attention over x
        self.mhatt2 = MHAtt(args)   # attention over x guided by y
        self.ffn = FFN(args)

        self.dropout1 = nn.Dropout(args.dropout_r)
        self.norm1 = LayerNorm(args.hidden_size)

        self.dropout2 = nn.Dropout(args.dropout_r)
        self.norm2 = LayerNorm(args.hidden_size)

        self.dropout3 = nn.Dropout(args.dropout_r)
        self.norm3 = LayerNorm(args.hidden_size)

    def forward(self, x, y, x_mask, y_mask):
        # intra-modal (self) attention on x, with residual connection + layer norm
        x = self.norm1(x + self.dropout1(
            self.mhatt1(v=x, k=x, q=x, mask=x_mask)
        ))

        # cross-modal attention: x queries y (keys and values come from y)
        x = self.norm2(x + self.dropout2(
            self.mhatt2(v=y, k=y, q=x, mask=y_mask)
        ))

        # position-wise feed-forward network, with residual connection + layer norm
        x = self.norm3(x + self.dropout3(
            self.ffn(x)
        ))

        return x


class SA(nn.Module):
    """Self-attention unit: one multi-head attention layer plus an FFN."""
    def __init__(self, args):
        super(SA, self).__init__()

        self.mhatt = MHAtt(args)
        self.ffn = FFN(args)

        self.dropout1 = nn.Dropout(args.dropout_r)
        self.norm1 = LayerNorm(args.hidden_size)

        self.dropout2 = nn.Dropout(args.dropout_r)
        self.norm2 = LayerNorm(args.hidden_size)

    def forward(self, y, y_mask):
        # intra-modal (self) attention on y, with residual connection + layer norm
        y = self.norm1(y + self.dropout1(
            self.mhatt(y, y, y, y_mask)
        ))

        # position-wise feed-forward network, with residual connection + layer norm
        y = self.norm2(y + self.dropout2(
            self.ffn(y)
        ))

        return y

Glimpse layer

After the modular co-attention layer comes the glimpse layer; let's see how it works. Here a modality is projected into a new representation space. A glimpse layer is a stack of G soft-attention layers, and each soft attention is regarded as one glimpse. Formally, the i-th soft attention (SoA) over an input matrix $M \in R^{N \times k}$ (the output of the previous layer, i.e., of the modular co-attention layer) is defined with an MLP and a weighted sum:

$$a_i = \mathrm{softmax}\!\left(v_i^{a\,T}(W_m M)\right) \tag{1}$$

$$\mathrm{SoA}_i(M) = m_i = \sum_{j=0}^{N} a_{ij} M_j \tag{2}$$

The glimpse mechanism over the matrix $M$ is defined as the stack of the $G^m$ glimpses:

$$G_M = \mathrm{Stacking}(m_1, \cdots, m_{G^m}) \tag{3}$$

In this model we always choose $G^m = N$, so the sizes match and a final residual connection can be applied to obtain the glimpse layer output:

$$M = \mathrm{LayerNorm}(M + G_M)$$
If that still feels abstract, let's walk through the glimpse layer code...

import torch
import torch.nn as nn
import torch.nn.functional as F

# MLP is defined elsewhere in the repository.

class AttFlat(nn.Module):
    """Glimpse layer: flat_glimpse soft attentions over the input sequence."""
    def __init__(self, args, flat_glimpse, merge=False):
        super(AttFlat, self).__init__()
        self.args = args
        self.merge = merge
        self.flat_glimpse = flat_glimpse
        # MLP producing one attention score per glimpse for every position
        self.mlp = MLP(
            in_size=args.hidden_size,
            mid_size=args.ff_size,
            out_size=flat_glimpse,
            dropout_r=args.dropout_r,
            use_relu=True
        )

        if self.merge:
            # used in the last block: concatenate the glimpses and project
            self.linear_merge = nn.Linear(
                args.hidden_size * flat_glimpse,
                args.hidden_size * 2
            )

    def forward(self, x, x_mask):
        att = self.mlp(x)                       # (batch, seq_len, flat_glimpse)
        if x_mask is not None:
            # mask out padded positions before the softmax
            att = att.masked_fill(
                x_mask.squeeze(1).squeeze(1).unsqueeze(2),
                -1e9
            )
        att = F.softmax(att, dim=1)             # formula (1): softmax over the sequence dimension

        # formula (2): each glimpse is a weighted sum of the input positions
        att_list = []
        for i in range(self.flat_glimpse):
            att_list.append(
                torch.sum(att[:, :, i: i + 1] * x, dim=1)
            )

        if self.merge:
            # last block: concatenate the glimpses and project to 2 * hidden_size
            x_atted = torch.cat(att_list, dim=1)
            x_atted = self.linear_merge(x_atted)

            return x_atted

        # formula (3): stack the glimpses into (batch, flat_glimpse, hidden_size)
        return torch.stack(att_list).transpose_(0, 1)

In the later calls to this class you can see that, in the first B-1 blocks of the cross-modal Transformer encoding, the parameter flat_glimpse is the sequence length of the incoming modality, while in the B-th block, i.e., the last block, flat_glimpse is 1. This shrinks the dimensionality of the final output, making it convenient to feed into the linear layer that performs the sentiment prediction.

Let's first look at how the glimpse layer works in the first B-1 blocks. Take the text modality as an example: the input size is (32, 60, 512), where 32 is the batch size, 60 is the number of words, and 512 is the length of the word vectors.

  1. After the MLP layer, the size becomes (32, 60, 60).
  2. Then softmax is applied; the size is still (32, 60, 60), but along dimension 1 the values now sum to 1. These first two steps are formula (1).
  3. att[:, :, i: i + 1] * x takes each of the 60 slices along the last dimension (one per glimpse; here flat_glimpse equals the number of words) and multiplies it with the original input: (32, 60, 1) * (32, 60, 512) = (32, 60, 512). Summing over dimension 1 then gives an output of size (32, 512). The loop runs 60 times in total, and each result is stored in the list att_list. This step is formula (2).
  4. Finally, the tensors in att_list are stacked together, giving a final size of (32, 60, 512). This step is formula (3).

In the B-th block the output size is (32, 1024) instead; a standalone shape check of both cases follows below.
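The shape bookkeeping above can be reproduced with plain tensor operations. The sketch below does not call the repository's AttFlat (its MLP helper is not shown in this post); the 2048 hidden width stands in for args.ff_size, and the shapes follow the text-modality example.

import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, hidden = 32, 60, 512
x = torch.randn(batch, seq_len, hidden)

# First B-1 blocks: flat_glimpse equals the sequence length.
flat_glimpse = seq_len
mlp = nn.Sequential(nn.Linear(hidden, 2048), nn.ReLU(), nn.Linear(2048, flat_glimpse))

att = F.softmax(mlp(x), dim=1)                          # formula (1): (32, 60, 60), sums to 1 over dim 1
att_list = [torch.sum(att[:, :, i:i + 1] * x, dim=1)    # formula (2): each item is (32, 512)
            for i in range(flat_glimpse)]
stacked = torch.stack(att_list).transpose(0, 1)         # formula (3): (32, 60, 512)
print(att.shape, att_list[0].shape, stacked.shape)

# Last (B-th) block: flat_glimpse = 1 and merge=True, so the single glimpse
# is projected to 2 * hidden, giving the (32, 1024) vector mentioned above.
mlp_last = nn.Sequential(nn.Linear(hidden, 2048), nn.ReLU(), nn.Linear(2048, 1))
att_last = F.softmax(mlp_last(x), dim=1)                # (32, 60, 1)
single = torch.sum(att_last * x, dim=1)                 # (32, 512)
merged = nn.Linear(hidden, 2 * hidden)(single)          # (32, 1024)
print(merged.shape)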

The following figure shows the calculation process.

[Figure: the glimpse layer computation]

So what is this layer actually doing? It feels like it is computing self-attention again, hence the name soft attention; full self-attention is presumably not used here because it would cost more computation and add more parameters to learn.

Classification layer

Finally, the sentiment prediction, i.e., the classification layer. As mentioned above, in the last block of the cross-modal Transformer encoding the glimpse layer uses a single glimpse, and for the text modality its output is (32, 1024). So each modality is reduced to a single vector. The vectors of the modalities are summed element-wise; call this sum s. It is then projected onto the possible answers according to:

$$\hat{y}^{p} = W_a(\mathrm{LayerNorm}(s))$$

If there is only one modality, the summation is simply omitted.

Take a look at the code:

# Classification layers (in the model's __init__; LayerNorm is the repository's own class)
self.proj_norm = LayerNorm(2 * args.hidden_size)
self.proj = nn.Linear(2 * args.hidden_size, args.ans_size)

# Classification layers (in the forward pass)
proj_feat = x + y                       # element-wise sum of the modality vectors
proj_feat = self.proj_norm(proj_feat)
ans = self.proj(proj_feat)              # logits over the answer classes

After the summation, a layer normalization is applied, and the final classification result is obtained through a linear layer.
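As a final shape check, a self-contained version of this head might look like the following; hidden_size = 512 and six answer classes (the CMU-MOSEI emotions) are assumed values for args.hidden_size and args.ans_size.

import torch
import torch.nn as nn

hidden_size, ans_size = 512, 6             # six emotion classes assumed
proj_norm = nn.LayerNorm(2 * hidden_size)
proj = nn.Linear(2 * hidden_size, ans_size)

x = torch.randn(32, 2 * hidden_size)       # language-branch vector from the last glimpse layer
y = torch.randn(32, 2 * hidden_size)       # other-modality vector
s = x + y                                  # element-wise sum of the modality vectors
ans = proj(proj_norm(s))                   # (32, 6) scores over the answer classes
print(ans.shape)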

Experiments

Standard accuracies
[Figure: standard accuracy results from the paper]

Weighted accuracies
[Figure: weighted accuracy results from the paper]
