
Implementation and analysis of transformer model

2022-06-21 05:45:00 huserblog



Introduction

  Transformer is a model that uses the attention mechanism to improve training speed. Its core idea is the self-attention mechanism: the ability to attend to different positions of an input sequence in order to compute a representation of that sequence.

  Reference link: https://www.tensorflow.org/tutorials/text/transformer?hl=zh-cn

Model structure

  The structure of the model is as follows:

(Figure: Transformer model structure, with the encoder on the left and the decoder on the right)

  This is a typical seq2seq model, split into an encoder and a decoder. The encoder, on the left, consists mainly of a multi-head attention layer and a feed-forward network; the decoder, on the right, consists of two multi-head attention layers and a feed-forward network. Compared with other seq2seq models, the Transformer's defining feature is that it replaces the RNN with multi-head attention.

  The detailed structure of multi-head attention is as follows:

(Figure: scaled dot-product attention on the left, multi-head attention on the right)

  The right side of the figure shows the multi-head attention structure. The three inputs q, k, and v are first each passed through a linear (fully connected) layer and then into a scaled dot-product attention layer, whose structure is shown on the left. Finally, the attention outputs are concatenated and passed through another linear layer to obtain the attention result.

  The scaled dot-product attention on the left of the figure is also very simple: first q and k are multiplied, the product is scaled, then the mask is applied, and finally a softmax yields the attention weights, which are multiplied by v.
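  For reference, this is the formula from the original "Attention Is All You Need" paper:

Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V

where d_k is the dimensionality of the keys; the sqrt(d_k) scaling keeps the dot products from growing so large that the softmax saturates.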

Code implementation

  The code is divided into four parts: data, model, training, and prediction.

Data processing

import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
import tensorflow_datasets as tfds


examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2 ** 13)

tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2 ** 13)
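# Note: in tensorflow_datasets 4.x, SubwordTextEncoder lives under
# tfds.deprecated.text rather than tfds.features.text.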

# print("==============")
# sample_string = 'Transformer is awesome.'
#
# tokenized_string = tokenizer_en.encode(sample_string)
# print('Tokenized string is {}'.format(tokenized_string))
#
# original_string = tokenizer_en.decode(tokenized_string)
# print('The original string: {}'.format(original_string))
#
# assert original_string == sample_string

BUFFER_SIZE = 200
BATCH_SIZE = 4
MAX_LENGTH = 40


def encode(lang1, lang2):
    lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(
        lang1.numpy()) + [tokenizer_pt.vocab_size + 1]

    lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(
        lang2.numpy()) + [tokenizer_en.vocab_size + 1]

    return lang1, lang2


def filter_max_length(x, y, max_length=MAX_LENGTH):
    return tf.logical_and(tf.size(x) <= max_length,
                          tf.size(y) <= max_length)


def tf_encode(pt, en):
    result_pt, result_en = tf.py_function(encode, [pt, en], [tf.int64, tf.int64])
    result_pt.set_shape([None])
    result_en.set_shape([None])

    return result_pt, result_en


train_dataset = train_examples.map(tf_encode)
train_dataset = train_dataset.filter(filter_max_length)
# Cache the dataset in memory to speed up reading.
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

val_dataset = val_examples.map(tf_encode)
val_dataset = val_dataset.filter(filter_max_length).padded_batch(BATCH_SIZE)


# pt_batch, en_batch = next(iter(val_dataset))
# print(pt_batch, en_batch)


def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates


def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)

    # apply sin to the even indices of the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to the odd indices of the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)


# pos_encoding = positional_encoding(50, 512)
# print(pos_encoding.shape)
#
# plt.pcolormesh(pos_encoding[0], cmap='RdBu')
# plt.xlabel('Depth')
# plt.xlim((0, 512))
# plt.ylabel('Position')
# plt.colorbar()
# plt.show()

def create_padding_mask(seq):
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

    # add extra dimensions so the padding can be added
    # to the attention logits
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)


def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)

  The data processing here follows the usual approach for translation data; the tensorflow_datasets interface makes it easy to download and preprocess the dataset.

  Besides the usual vocabulary processing, there are also positional encoding and mask operations. The positional encoding is needed because the Transformer replaces the RNN with attention, and attention cannot obtain positional information the way an RNN does, so the position information has to be computed separately and added to the data. The other difference is the mask: the attention mechanism can read all of the data at once, whereas an RNN reads tokens one by one and, at any step, has no information about later positions. The Transformer uses a look-ahead mask to solve this: at each position it covers the later positions, achieving an effect similar to the RNN.
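  As a quick illustration, here is a minimal sketch (using only the helper functions defined above, with made-up token ids) of what the two mask functions produce:

# Minimal mask sanity check; the token ids here are made up for illustration.
demo_seq = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0]])

# padding mask: 1.0 marks the padded (zero) positions
print(create_padding_mask(demo_seq))  # shape (2, 1, 1, 5)
# first row: [0. 0. 1. 1. 0.], second row: [0. 0. 0. 1. 1.]

# look-ahead mask: position i may only attend to positions <= i
print(create_look_ahead_mask(4))
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]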

Model structure

def scaled_dot_product_attention(q, k, v, mask):
    """ Calculate the weight of attention . q, k, v  Must have matching pre dimension . k, v  There must be a matching penultimate dimension , for example :seq_len_k = seq_len_v.  although  mask  According to its type ( Fill or look ahead ) There are different shapes ,  however  mask  Must be able to perform broadcast conversion in order to sum .  Parameters : q:  Requested shape  == (..., seq_len_q, depth) k:  The shape of the primary key  == (..., seq_len_k, depth) v:  The shape of the number  == (..., seq_len_v, depth_v) mask: Float  tensor , Its shape can be converted into  (..., seq_len_q, seq_len_k). The default is None.  Return value :  Output , Attention weight  """

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # softmax is normalized on the last axis (seq_len_k)
    # so that the scores add up to 1
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights


#
# def print_out(q, k, v):
#     temp_out, temp_attn = scaled_dot_product_attention(
#         q, k, v, None)
#     print('Attention weights are:')
#     print(temp_attn)
#     print('Output is:')
#     print(temp_out)
#
#
# np.set_printoptions(suppress=True)
#
# temp_k = tf.constant([[10, 0, 0],
#                       [0, 10, 0],
#                       [0, 0, 10],
#                       [0, 0, 10]], dtype=tf.float32)  # (4, 3)
#
# temp_v = tf.constant([[1, 0],
#                       [10, 0],
#                       [100, 5],
#                       [1000, 6]], dtype=tf.float32)  # (4, 2)
#
# # This query matches the second key,
# # so the second value is returned.
# temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32)  # (1, 3)
# print_out(temp_q, temp_k, temp_v)
#
# # This query matches a repeated key (third and fourth),
# # so all associated values are averaged.
# temp_q = tf.constant([[0, 0, 10]], dtype=tf.float32)  # (1, 3)
# print_out(temp_q, temp_k, temp_v)
#
# # This query matches the first and second keys,
# # so their values are averaged.
# temp_q = tf.constant([[10, 10, 0]], dtype=tf.float32)  # (1, 3)
# print_out(temp_q, temp_k, temp_v)
#
# temp_q = tf.constant([[0, 0, 10], [0, 10, 0], [10, 10, 0]], dtype=tf.float32)  # (3, 3)
# print_out(temp_q, temp_k, temp_v)


class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).

        Transpose the result so the shape is (batch_size, num_heads, seq_len, depth).
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        # print(q.shape, k.shape, v.shape, mask.shape)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)
        # print(scaled_attention.shape)
        scaled_attention = tf.transpose(scaled_attention,
                                        perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights


# temp_mha = MultiHeadAttention(d_model=512, num_heads=8)
# y = tf.random.uniform((1, 60, 512)) # (batch_size, encoder_sequence, d_model)
# out, attn = temp_mha(y, k=y, q=y, mask=None)
# print(out.shape, attn.shape)


def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])


class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2


class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training,
             look_ahead_mask, padding_mask):
        # enc_output.shape == (batch_size, input_seq_len, d_model)

        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.mha2(
            enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

        return out3, attn_weights_block1, attn_weights_block2


#
# sample_encoder_layer = EncoderLayer(512, 8, 2048)
#
# sample_encoder_layer_output = sample_encoder_layer(
#     tf.random.uniform((64, 43, 512)), False, None)
#
# print(sample_encoder_layer_output.shape)  # (batch_size, input_seq_len, d_model)
#
#
# sample_decoder_layer = DecoderLayer(512, 8, 2048)
#
# sample_decoder_layer_output, _, _ = sample_decoder_layer(
#     tf.random.uniform((64, 50, 512)), sample_encoder_layer_output,
#     False, None, None)
#
# print(sample_decoder_layer_output.shape)  # (batch_size, target_seq_len, d_model)


class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding,
                                                self.d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]

        # add the embedding and the positional encoding
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, d_model)


class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
                 maximum_position_encoding, rate=0.1):
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training,
             look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                                   look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i + 1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i + 1)] = block2

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights


# sample_encoder = Encoder(num_layers=2, d_model=512, num_heads=8,
#                          dff=2048, input_vocab_size=8500,
#                          maximum_position_encoding=10000)
#
# sample_encoder_output = sample_encoder(tf.random.uniform((64, 62)),
#                                        training=False, mask=None)
#
# print(sample_encoder_output.shape)  # (batch_size, input_seq_len, d_model)
#
# sample_decoder = Decoder(num_layers=2, d_model=512, num_heads=8,
#                          dff=2048, target_vocab_size=8000,
#                          maximum_position_encoding=5000)
#
# output, attn = sample_decoder(tf.random.uniform((64, 26)),
#                               enc_output=sample_encoder_output,
#                               training=False, look_ahead_mask=None,
#                               padding_mask=None)
#
# print(output.shape, attn['decoder_layer2_block2'].shape)


class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                               input_vocab_size, pe_input, rate)

        self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                               target_vocab_size, pe_target, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, training, enc_padding_mask,
             look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attention_weights

  First comes the scaled_dot_product_attention method, which corresponds to the left side of the attention figure above. It multiplies q by k (transposed), then divides by a scaling value: the square root of the last dimension of k. Then comes the mask handling: when the masks are created, valid positions are set to 0 and padded or occluded positions to 1. Here the mask is multiplied by a very large negative number (-1e9, i.e. -1 times 10 to the 9th power) and added to the scaled dot product, so masked positions become large negative numbers; in the subsequent softmax their values become 0, i.e. masked positions get a weight of 0. Finally the weights are multiplied by v to produce the output.
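  As a quick numeric check of the masking behavior, here is a small sketch reusing the toy tensors from the commented-out demo above:

# Toy check of the mask: block the second key and watch its weight vanish.
temp_k = tf.constant([[10, 0, 0], [0, 10, 0],
                      [0, 0, 10], [0, 0, 10]], dtype=tf.float32)  # (4, 3)
temp_v = tf.constant([[1, 0], [10, 0],
                      [100, 5], [1000, 6]], dtype=tf.float32)     # (4, 2)
temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32)              # (1, 3)

_, w = scaled_dot_product_attention(temp_q, temp_k, temp_v, None)
print(w)  # ~[[0. 1. 0. 0.]]: the query aligns with the second key

mask = tf.constant([[0., 1., 0., 0.]])  # 1 = masked position
_, w = scaled_dot_product_attention(temp_q, temp_k, temp_v, mask)
print(w)  # ~[[0.33 0. 0.33 0.33]]: the masked weight becomes 0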

  Next comes the MultiHeadAttention class, the multi-head attention layer. As shown in the figure above, it passes q, k, and v through separate linear (fully connected) layers, then splits out a new head dimension according to the given number of heads; attention is computed on the split tensors. After the attention is computed, the heads are merged back into the original dimension, and a final linear (fully connected) layer produces the output.
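  As a small shape sketch (with the same d_model=512 and num_heads=8 as in the commented-out example above):

# split_heads reshapes (batch, seq_len, d_model) into per-head tensors.
mha = MultiHeadAttention(d_model=512, num_heads=8)  # depth = 512 // 8 = 64
x = tf.random.uniform((2, 60, 512))                 # (batch, seq_len, d_model)
print(mha.split_heads(x, batch_size=2).shape)       # (2, 8, 60, 64)
# attention then runs on all 8 heads in parallel over the last two axes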

  Next comes the point_wise_feed_forward_network method, which is simply two fully connected layers.
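  In the paper this network is written as:

FFN(x) = max(0, x · W1 + b1) · W2 + b2

where the inner layer has dimensionality dff and the output is projected back to d_model, matching the two Dense layers above.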

  Next comes the EncoderLayer class, which corresponds to the contents of the left box in the first figure: a multi-head attention layer followed by a feed-forward network. In the call method, x is first passed in as the q, k, and v of the multi-head attention to get the attention output, followed by a dropout layer and a LayerNorm layer. Note that before the LayerNorm a residual connection is used: the input x is added to the processed attn_output, and LayerNorm is applied to the sum. Finally comes the feed-forward network, again followed by dropout and LayerNorm.
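  Both sub-layers therefore follow the same residual pattern from the paper:

output = LayerNorm(x + Dropout(Sublayer(x)))

where Sublayer is the multi-head attention for the first block and the feed-forward network for the second.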

  Next is the DecoderLayer class, which corresponds to the contents of the right box in the first figure: two multi-head attention layers and a feed-forward network. In the call method, the first multi-head attention layer takes the input x as q, k, and v, followed by the same dropout + LayerNorm as in the encoder layer. Then comes the second multi-head attention layer; note that here q is no longer the same as k and v: q is the output of the previous attention block, while k and v are the output of the encoder. Finally a feed-forward network is attached.
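  Here is a small shape sketch of this second, cross-attention block (the sequence lengths 62 and 26 below are hypothetical):

# Cross-attention sketch: queries come from the decoder, keys and values from
# the encoder, so the two sequence lengths may differ (lengths here are made up).
mha2 = MultiHeadAttention(d_model=512, num_heads=8)
enc_output = tf.random.uniform((64, 62, 512))  # encoder output
out1 = tf.random.uniform((64, 26, 512))        # decoder states after block 1
attn2, w = mha2(enc_output, enc_output, out1, None)  # arguments are (v, k, q, mask)
print(attn2.shape)  # (64, 26, 512): one vector per target position
print(w.shape)      # (64, 8, 26, 62): each target token attends over the whole input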

  Next come the Encoder and Decoder classes, which are very similar. Each first embeds the input tokens into word vectors, scales them by the square root of d_model, and adds the positional encoding; the result is then passed through the specified number of encoder or decoder layers in a loop.

  The final Transformer class is also very simple: it runs the encoder first, then the decoder, and finally passes the decoder output through a fully connected layer to get the final output.
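  A minimal end-to-end shape check (a sketch; the vocabulary sizes and sequence lengths below are made up):

# End-to-end shape sketch; sizes here are made up for illustration.
sample_transformer = Transformer(
    num_layers=2, d_model=512, num_heads=8, dff=2048,
    input_vocab_size=8500, target_vocab_size=8000,
    pe_input=10000, pe_target=6000)

temp_inp = tf.random.uniform((64, 38), dtype=tf.int64, minval=0, maxval=200)
temp_tar = tf.random.uniform((64, 36), dtype=tf.int64, minval=0, maxval=200)

out, _ = sample_transformer(temp_inp, temp_tar, training=False,
                            enc_padding_mask=None,
                            look_ahead_mask=None,
                            dec_padding_mask=None)
print(out.shape)  # (64, 36, 8000): logits over the target vocabulary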

Model training

num_layers = 4
d_model = 128
dff = 512
num_heads = 8

input_vocab_size = tokenizer_pt.vocab_size + 2
target_vocab_size = tokenizer_en.vocab_size + 2
dropout_rate = 0.1


class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        self.d_model = d_model
        self.d_model = tf.cast(self.d_model, tf.float32)

        self.warmup_steps = warmup_steps

    def __call__(self, step):
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)


learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

# temp_learning_rate_schedule = CustomSchedule(d_model)
#
# plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))
# plt.ylabel("Learning Rate")
# plt.xlabel("Train Step")
# plt.show()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')


def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)


train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
    name='train_accuracy')
transformer = Transformer(num_layers, d_model, num_heads, dff,
                          input_vocab_size, target_vocab_size,
                          pe_input=input_vocab_size,
                          pe_target=target_vocab_size,
                          rate=dropout_rate)


def create_masks(inp, tar):
    # padding mask for the encoder
    enc_padding_mask = create_padding_mask(inp)

    # used in the second attention block of the decoder:
    # this padding mask masks the encoder outputs.
    dec_padding_mask = create_padding_mask(inp)

    # used in the first attention block of the decoder:
    # pads and masks future tokens in the input received by the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)

    # print("look_ahead_mask:", look_ahead_mask.shape)
    # print("dec_target_padding_mask:", dec_target_padding_mask.shape)
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return enc_padding_mask, combined_mask, dec_padding_mask


checkpoint_path = "../savemodel/transformer"

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# if a checkpoint exists, restore the latest one
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')

EPOCHS = 10


# @tf.function traces and compiles train_step into a TF graph for faster
# execution. The function specializes on the exact shapes of its tensor
# arguments. To avoid retracing caused by variable sequence lengths or
# variable batch sizes (the last batch is smaller), use input_signature
# to specify more generic shapes.

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]


# @tf.function
@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]

    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)

    # print("enc_padding_mask:", enc_padding_mask.shape)
    # print("combined_mask:", combined_mask.shape)
    # print("dec_padding_mask:", dec_padding_mask.shape)

    with tf.GradientTape() as tape:
        predictions, _ = transformer(inp, tar_inp,
                                     True,
                                     enc_padding_mask,
                                     combined_mask,
                                     dec_padding_mask)
        loss = loss_function(tar_real, predictions)

    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    train_loss(loss)
    train_accuracy(tar_real, predictions)


for epoch in range(EPOCHS):
    start = time.time()

    train_loss.reset_states()
    train_accuracy.reset_states()

    # inp -> portuguese, tar -> english
    for (batch, (inp, tar)) in enumerate(train_dataset):
        train_step(inp, tar)

        if batch % 50 == 0:
            print('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(
                epoch + 1, batch, train_loss.result(), train_accuracy.result()))

    if (epoch + 1) % 5 == 0:
        ckpt_save_path = ckpt_manager.save()
        print('Saving checkpoint for epoch {} at {}'.format(epoch + 1,
                                                            ckpt_save_path))

    print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1,
                                                        train_loss.result(),
                                                        train_accuracy.result()))

    print('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))

  Model training is the same as for a general seq2seq model. The CustomSchedule class dynamically adjusts the learning rate according to the training step, and the loss function is similar to the general seq2seq loss but masks out the padded positions. Then there is the create_masks method for creating the masks; the create_padding_mask method it calls was analyzed earlier and masks the padded positions of its input. Two masks are created from inp: one used in the encoder, and one used in the decoder's second attention block. For the target tar, a look-ahead mask and a padding mask are created and then merged with tf.maximum.
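  Concretely, CustomSchedule implements the warm-up formula from the paper:

lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))

so the learning rate rises linearly for the first warmup_steps steps and then decays with the inverse square root of the step number.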

  The last thing to cover is the train_step method, which is also very simple: the target is first split into a decoder input and shifted labels, masks are created from the input and output, the input and output are passed into the transformer to get predictions, and finally the loss is computed and the parameters are updated.
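  The slicing at the top of train_step is the usual teacher-forcing shift. A tiny example (the token ids here are hypothetical) makes it concrete:

# Hypothetical ids: suppose 8214 = <start> and 8215 = <end>.
tar = tf.constant([[8214, 17, 52, 96, 8215]])
tar_inp = tar[:, :-1]   # [<start>, 17, 52, 96]  fed to the decoder
tar_real = tar[:, 1:]   # [17, 52, 96, <end>]    the labels, shifted one step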

Model prediction


def evaluate(inp_sentence):
    start_token = [tokenizer_pt.vocab_size]
    end_token = [tokenizer_pt.vocab_size + 1]

    # the input sentence is Portuguese; add the start and end tokens
    inp_sentence = start_token + tokenizer_pt.encode(inp_sentence) + end_token
    encoder_input = tf.expand_dims(inp_sentence, 0)

    # since the target is English, the first token passed to the
    # transformer should be the English start token.
    decoder_input = [tokenizer_en.vocab_size]
    output = tf.expand_dims(decoder_input, 0)

    for i in range(MAX_LENGTH):
        enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
            encoder_input, output)

        # predictions.shape == (batch_size, seq_len, vocab_size)
        predictions, attention_weights = transformer(encoder_input,
                                                     output,
                                                     False,
                                                     enc_padding_mask,
                                                     combined_mask,
                                                     dec_padding_mask)

        # select the last token from the seq_len dimension
        predictions = predictions[:, -1:, :]  # (batch_size, 1, vocab_size)

        predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

        # if predicted_id equals the end token, return the result
        if predicted_id == tokenizer_en.vocab_size + 1:
            return tf.squeeze(output, axis=0), attention_weights

        # concatenate predicted_id to the output, which is then fed to the decoder as its input
        output = tf.concat([output, predicted_id], axis=-1)

    return tf.squeeze(output, axis=0), attention_weights


def plot_attention_weights(attention, sentence, result, layer):
    fig = plt.figure(figsize=(16, 8))

    sentence = tokenizer_pt.encode(sentence)

    attention = tf.squeeze(attention[layer], axis=0)

    for head in range(attention.shape[0]):
        ax = fig.add_subplot(2, 4, head + 1)

        # plot the attention weights
        ax.matshow(attention[head][:-1, :], cmap='viridis')

        fontdict = {'fontsize': 10}

        ax.set_xticks(range(len(sentence) + 2))
        ax.set_yticks(range(len(result)))

        ax.set_ylim(len(result) - 1.5, -0.5)

        ax.set_xticklabels(
            ['<start>'] + [tokenizer_pt.decode([i]) for i in sentence] + ['<end>'],
            fontdict=fontdict, rotation=90)

        ax.set_yticklabels([tokenizer_en.decode([i]) for i in result
                            if i < tokenizer_en.vocab_size],
                           fontdict=fontdict)

        ax.set_xlabel('Head {}'.format(head + 1))

    plt.tight_layout()
    plt.show()


def translate(sentence, plot=''):
    result, attention_weights = evaluate(sentence)

    predicted_sentence = tokenizer_en.decode([i for i in result
                                              if i < tokenizer_en.vocab_size])

    print('Input: {}'.format(sentence))
    print('Predicted translation: {}'.format(predicted_sentence))

    if plot:
        plot_attention_weights(attention_weights, sentence, result, plot)


translate("este é um problema que temos que resolver.")
print ("Real translation: this is a problem we have to solve .")


translate("os meus vizinhos ouviram sobre esta ideia.")
print ("Real translation: and my neighboring homes heard about this idea .")

  This final part calls the trained model to translate text. The entry point is the translate method.


Copyright notice
This article was created by [huserblog]. Please keep the original link when reposting. Thanks.
https://yzsam.com/2022/172/202206210544351136.html