Implementation and analysis of transformer model
2022-06-21 05:45:00 【huserblog】
Introduction
The Transformer is a model that uses the attention mechanism to speed up model training. Its core idea is the self-attention mechanism (self-attention): the ability to attend to different positions of the input sequence in order to compute a representation of that sequence.
Reference link: https://www.tensorflow.org/tutorials/text/transformer?hl=zh-cn
Model structure
The structure of the model is as follows:

This is a typical seq2seq model, divided into an encoder and a decoder. On the left is the encoder, which mainly consists of a multi-head attention layer and a feed-forward network; on the right is the decoder, which mainly consists of two multi-head attention layers and a feed-forward network. Compared with other seq2seq models, the biggest feature of the Transformer is that it replaces the RNN with multi-head attention.
The detailed structure of multi-head attention is as follows:

The right side of the figure shows the multi-head attention structure. The three inputs q, k and v are first each passed through a linear layer (fully connected layer) and then fed into a scaled dot-product attention layer, whose structure is shown on the left. Finally, the attention output is passed through another linear layer to obtain the result of the attention block.
The scaled dot-product attention on the left of the figure is also very simple: q and k are first multiplied, the result is scaled, then the mask is applied, and finally a softmax produces the attention weights. The weights are then multiplied by v.
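For reference, this scaled dot-product attention can be written in the standard form from the original Transformer paper (this formula is not part of the original post, but it matches the code later in this article):

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$

where $d_k$ is the dimension of the keys; the multi-head block simply applies this attention $h$ times on linearly projected q, k and v and concatenates the results before the final linear layer.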
Code implementation
The code is divided into four parts: data, model, training, and prediction.
Data processing
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
import unicodedata
import re
import os
import io
import time
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']
tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
(en.numpy() for pt, en in train_examples), target_vocab_size=2 ** 13)
tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
(pt.numpy() for pt, en in train_examples), target_vocab_size=2 ** 13)
# print("==============")
# sample_string = 'Transformer is awesome.'
#
# tokenized_string = tokenizer_en.encode(sample_string)
# print('Tokenized string is {}'.format(tokenized_string))
#
# original_string = tokenizer_en.decode(tokenized_string)
# print('The original string: {}'.format(original_string))
#
# assert original_string == sample_string
BUFFER_SIZE = 200
BATCH_SIZE = 4
MAX_LENGTH = 40
def encode(lang1, lang2):
lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(
lang1.numpy()) + [tokenizer_pt.vocab_size + 1]
lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(
lang2.numpy()) + [tokenizer_en.vocab_size + 1]
return lang1, lang2
def filter_max_length(x, y, max_length=MAX_LENGTH):
return tf.logical_and(tf.size(x) <= max_length,
tf.size(y) <= max_length)
def tf_encode(pt, en):
result_pt, result_en = tf.py_function(encode, [pt, en], [tf.int64, tf.int64])
result_pt.set_shape([None])
result_en.set_shape([None])
return result_pt, result_en
train_dataset = train_examples.map(tf_encode)
train_dataset = train_dataset.filter(filter_max_length)
# Cache datasets in memory to speed up reading .
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
val_dataset = val_examples.map(tf_encode)
val_dataset = val_dataset.filter(filter_max_length).padded_batch(BATCH_SIZE)
# pt_batch, en_batch = next(iter(val_dataset))
# print(pt_batch, en_batch)
def get_angles(pos, i, d_model):
angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
return pos * angle_rates
def positional_encoding(position, d_model):
angle_rads = get_angles(np.arange(position)[:, np.newaxis],
np.arange(d_model)[np.newaxis, :],
d_model)
# Apply sin to the even indices of the array; 2i
angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
# Apply cos to the odd indices of the array; 2i+1
angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
pos_encoding = angle_rads[np.newaxis, ...]
return tf.cast(pos_encoding, dtype=tf.float32)
# pos_encoding = positional_encoding(50, 512)
# print(pos_encoding.shape)
#
# plt.pcolormesh(pos_encoding[0], cmap='RdBu')
# plt.xlabel('Depth')
# plt.xlim((0, 512))
# plt.ylabel('Position')
# plt.colorbar()
# plt.show()
def create_padding_mask(seq):
seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
# Add extra dimensions so that the padding can be added
# to the attention logits.
return seq[:, tf.newaxis, tf.newaxis, :] # (batch_size, 1, 1, seq_len)
def create_look_ahead_mask(size):
mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
return mask # (seq_len, seq_len)
The data processing here is the usual preprocessing for translation data; the tensorflow_datasets interface makes it easy to download and process the data.
Besides the usual vocabulary processing, there is also positional encoding and a mask operation. The positional encoding is needed because the Transformer replaces the RNN with attention, and attention cannot obtain position information the way an RNN does, so the position information has to be computed separately and added to the data. The other difference is the mask: the attention mechanism can read all of the data at once, whereas an RNN reads tokens in one by one and, at a given position, has no information about later positions. The Transformer uses a mask to solve this problem: at each position it covers the later positions, achieving an effect similar to an RNN.
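For reference, the positional encoding computed by get_angles and positional_encoding above corresponds to the standard formulas:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)
$$

And here is a quick illustration of the look-ahead mask (a small sketch added for clarity; the values follow directly from create_look_ahead_mask above, where 1 marks a position that must not be attended to):

```python
print(create_look_ahead_mask(3).numpy())
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]
```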
Model structure
def scaled_dot_product_attention(q, k, v, mask):
""" Calculate the weight of attention . q, k, v Must have matching pre dimension . k, v There must be a matching penultimate dimension , for example :seq_len_k = seq_len_v. although mask According to its type ( Fill or look ahead ) There are different shapes , however mask Must be able to perform broadcast conversion in order to sum . Parameters : q: Requested shape == (..., seq_len_q, depth) k: The shape of the primary key == (..., seq_len_k, depth) v: The shape of the number == (..., seq_len_v, depth_v) mask: Float tensor , Its shape can be converted into (..., seq_len_q, seq_len_k). The default is None. Return value : Output , Attention weight """
matmul_qk = tf.matmul(q, k, transpose_b=True) # (..., seq_len_q, seq_len_k)
# Scale matmul_qk
dk = tf.cast(tf.shape(k)[-1], tf.float32)
scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
# Add the mask to the scaled tensor.
if mask is not None:
scaled_attention_logits += (mask * -1e9)
# softmax is normalized on the last axis (seq_len_k)
# so that the scores add up to 1.
attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1) # (..., seq_len_q, seq_len_k)
output = tf.matmul(attention_weights, v) # (..., seq_len_q, depth_v)
return output, attention_weights
#
# def print_out(q, k, v):
# temp_out, temp_attn = scaled_dot_product_attention(
# q, k, v, None)
# print('Attention weights are:')
# print(temp_attn)
# print('Output is:')
# print(temp_out)
#
#
# np.set_printoptions(suppress=True)
#
# temp_k = tf.constant([[10, 0, 0],
# [0, 10, 0],
# [0, 0, 10],
# [0, 0, 10]], dtype=tf.float32) # (4, 3)
#
# temp_v = tf.constant([[1, 0],
# [10, 0],
# [100, 5],
# [1000, 6]], dtype=tf.float32) # (4, 2)
#
# # This query matches the second key,
# # so the second value is returned.
# temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32) # (1, 3)
# print_out(temp_q, temp_k, temp_v)
#
#
# # This query matches a repeated key (the third and fourth),
# # so all the corresponding values are averaged.
# temp_q = tf.constant([[0, 0, 10]], dtype=tf.float32) # (1, 3)
# print_out(temp_q, temp_k, temp_v)
#
# # This query matches the first and second keys equally,
# # so their values are averaged.
# temp_q = tf.constant([[10, 10, 0]], dtype=tf.float32) # (1, 3)
# print_out(temp_q, temp_k, temp_v)
#
#
# temp_q = tf.constant([[0, 0, 10], [0, 10, 0], [10, 10, 0]], dtype=tf.float32) # (3, 3)
# print_out(temp_q, temp_k, temp_v)
class MultiHeadAttention(tf.keras.layers.Layer):
def __init__(self, d_model, num_heads):
super(MultiHeadAttention, self).__init__()
self.num_heads = num_heads
self.d_model = d_model
assert d_model % self.num_heads == 0
self.depth = d_model // self.num_heads
self.wq = tf.keras.layers.Dense(d_model)
self.wk = tf.keras.layers.Dense(d_model)
self.wv = tf.keras.layers.Dense(d_model)
self.dense = tf.keras.layers.Dense(d_model)
def split_heads(self, x, batch_size):
""" Split the last dimension to (num_heads, depth). Transpose the result so that the shape is (batch_size, num_heads, seq_len, depth) """
x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
return tf.transpose(x, perm=[0, 2, 1, 3])
def call(self, v, k, q, mask):
batch_size = tf.shape(q)[0]
q = self.wq(q) # (batch_size, seq_len, d_model)
k = self.wk(k) # (batch_size, seq_len, d_model)
v = self.wv(v) # (batch_size, seq_len, d_model)
q = self.split_heads(q, batch_size) # (batch_size, num_heads, seq_len_q, depth)
k = self.split_heads(k, batch_size) # (batch_size, num_heads, seq_len_k, depth)
v = self.split_heads(v, batch_size) # (batch_size, num_heads, seq_len_v, depth)
# scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
# attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
# print(q.shape, k.shape, v.shape, mask.shape)
scaled_attention, attention_weights = scaled_dot_product_attention(
q, k, v, mask)
# print(scaled_attention.shape)
scaled_attention = tf.transpose(scaled_attention,
perm=[0, 2, 1, 3]) # (batch_size, seq_len_q, num_heads, depth)
concat_attention = tf.reshape(scaled_attention,
(batch_size, -1, self.d_model)) # (batch_size, seq_len_q, d_model)
output = self.dense(concat_attention) # (batch_size, seq_len_q, d_model)
return output, attention_weights
# temp_mha = MultiHeadAttention(d_model=512, num_heads=8)
# y = tf.random.uniform((1, 60, 512)) # (batch_size, encoder_sequence, d_model)
# out, attn = temp_mha(y, k=y, q=y, mask=None)
# print(out.shape, attn.shape)
def point_wise_feed_forward_network(d_model, dff):
return tf.keras.Sequential([
tf.keras.layers.Dense(dff, activation='relu'), # (batch_size, seq_len, dff)
tf.keras.layers.Dense(d_model) # (batch_size, seq_len, d_model)
])
class EncoderLayer(tf.keras.layers.Layer):
def __init__(self, d_model, num_heads, dff, rate=0.1):
super(EncoderLayer, self).__init__()
self.mha = MultiHeadAttention(d_model, num_heads)
self.ffn = point_wise_feed_forward_network(d_model, dff)
self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
self.dropout1 = tf.keras.layers.Dropout(rate)
self.dropout2 = tf.keras.layers.Dropout(rate)
def call(self, x, training, mask):
attn_output, _ = self.mha(x, x, x, mask) # (batch_size, input_seq_len, d_model)
attn_output = self.dropout1(attn_output, training=training)
out1 = self.layernorm1(x + attn_output) # (batch_size, input_seq_len, d_model)
ffn_output = self.ffn(out1) # (batch_size, input_seq_len, d_model)
ffn_output = self.dropout2(ffn_output, training=training)
out2 = self.layernorm2(out1 + ffn_output) # (batch_size, input_seq_len, d_model)
return out2
class DecoderLayer(tf.keras.layers.Layer):
def __init__(self, d_model, num_heads, dff, rate=0.1):
super(DecoderLayer, self).__init__()
self.mha1 = MultiHeadAttention(d_model, num_heads)
self.mha2 = MultiHeadAttention(d_model, num_heads)
self.ffn = point_wise_feed_forward_network(d_model, dff)
self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
self.dropout1 = tf.keras.layers.Dropout(rate)
self.dropout2 = tf.keras.layers.Dropout(rate)
self.dropout3 = tf.keras.layers.Dropout(rate)
def call(self, x, enc_output, training,
look_ahead_mask, padding_mask):
# enc_output.shape == (batch_size, input_seq_len, d_model)
attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask) # (batch_size, target_seq_len, d_model)
attn1 = self.dropout1(attn1, training=training)
out1 = self.layernorm1(attn1 + x)
attn2, attn_weights_block2 = self.mha2(
enc_output, enc_output, out1, padding_mask) # (batch_size, target_seq_len, d_model)
attn2 = self.dropout2(attn2, training=training)
out2 = self.layernorm2(attn2 + out1) # (batch_size, target_seq_len, d_model)
ffn_output = self.ffn(out2) # (batch_size, target_seq_len, d_model)
ffn_output = self.dropout3(ffn_output, training=training)
out3 = self.layernorm3(ffn_output + out2) # (batch_size, target_seq_len, d_model)
return out3, attn_weights_block1, attn_weights_block2
#
# sample_encoder_layer = EncoderLayer(512, 8, 2048)
#
# sample_encoder_layer_output = sample_encoder_layer(
# tf.random.uniform((64, 43, 512)), False, None)
#
# print(sample_encoder_layer_output.shape) # (batch_size, input_seq_len, d_model)
#
#
# sample_decoder_layer = DecoderLayer(512, 8, 2048)
#
# sample_decoder_layer_output, _, _ = sample_decoder_layer(
# tf.random.uniform((64, 50, 512)), sample_encoder_layer_output,
# False, None, None)
#
# print(sample_decoder_layer_output.shape) # (batch_size, target_seq_len, d_model)
class Encoder(tf.keras.layers.Layer):
def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
maximum_position_encoding, rate=0.1):
super(Encoder, self).__init__()
self.d_model = d_model
self.num_layers = num_layers
self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
self.pos_encoding = positional_encoding(maximum_position_encoding,
self.d_model)
self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
for _ in range(num_layers)]
self.dropout = tf.keras.layers.Dropout(rate)
def call(self, x, training, mask):
seq_len = tf.shape(x)[1]
# Add the embedding and the positional encoding.
x = self.embedding(x) # (batch_size, input_seq_len, d_model)
x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
x += self.pos_encoding[:, :seq_len, :]
x = self.dropout(x, training=training)
for i in range(self.num_layers):
x = self.enc_layers[i](x, training, mask)
return x # (batch_size, input_seq_len, d_model)
class Decoder(tf.keras.layers.Layer):
def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
maximum_position_encoding, rate=0.1):
super(Decoder, self).__init__()
self.d_model = d_model
self.num_layers = num_layers
self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
for _ in range(num_layers)]
self.dropout = tf.keras.layers.Dropout(rate)
def call(self, x, enc_output, training,
look_ahead_mask, padding_mask):
seq_len = tf.shape(x)[1]
attention_weights = {}
x = self.embedding(x) # (batch_size, target_seq_len, d_model)
x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
x += self.pos_encoding[:, :seq_len, :]
x = self.dropout(x, training=training)
for i in range(self.num_layers):
x, block1, block2 = self.dec_layers[i](x, enc_output, training,
look_ahead_mask, padding_mask)
attention_weights['decoder_layer{}_block1'.format(i + 1)] = block1
attention_weights['decoder_layer{}_block2'.format(i + 1)] = block2
# x.shape == (batch_size, target_seq_len, d_model)
return x, attention_weights
# sample_encoder = Encoder(num_layers=2, d_model=512, num_heads=8,
# dff=2048, input_vocab_size=8500,
# maximum_position_encoding=10000)
#
# sample_encoder_output = sample_encoder(tf.random.uniform((64, 62)),
# training=False, mask=None)
#
# print(sample_encoder_output.shape) # (batch_size, input_seq_len, d_model)
#
# sample_decoder = Decoder(num_layers=2, d_model=512, num_heads=8,
# dff=2048, target_vocab_size=8000,
# maximum_position_encoding=5000)
#
# output, attn = sample_decoder(tf.random.uniform((64, 26)),
# enc_output=sample_encoder_output,
# training=False, look_ahead_mask=None,
# padding_mask=None)
#
# print(output.shape, attn['decoder_layer2_block2'].shape)
class Transformer(tf.keras.Model):
def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
target_vocab_size, pe_input, pe_target, rate=0.1):
super(Transformer, self).__init__()
self.encoder = Encoder(num_layers, d_model, num_heads, dff,
input_vocab_size, pe_input, rate)
self.decoder = Decoder(num_layers, d_model, num_heads, dff,
target_vocab_size, pe_target, rate)
self.final_layer = tf.keras.layers.Dense(target_vocab_size)
def call(self, inp, tar, training, enc_padding_mask,
look_ahead_mask, dec_padding_mask):
enc_output = self.encoder(inp, training, enc_padding_mask) # (batch_size, inp_seq_len, d_model)
# dec_output.shape == (batch_size, tar_seq_len, d_model)
dec_output, attention_weights = self.decoder(
tar, enc_output, training, look_ahead_mask, dec_padding_mask)
final_output = self.final_layer(dec_output) # (batch_size, tar_seq_len, target_vocab_size)
return final_output, attention_weights
The first thing here is the scaled_dot_product_attention function, which follows the figure above: q and k are multiplied, then divided by a scaling value, which is the square root of the size of the last dimension. Then comes the mask handling: when the mask is created, valid positions are set to 0 and padded or occluded positions are set to 1. When the mask is applied, it is multiplied by a very large negative number (-1e9, i.e. -1 × 10^9) and added to the scaled dot product, so the masked positions become large negative numbers; in the following softmax their values become (almost) 0, i.e. the masked positions get a weight of 0. Finally, the weights are multiplied by v to get the output.
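As a quick sanity check of the masking trick just described (a toy example added here, not part of the original tutorial):

```python
logits = tf.constant([[2.0, 1.0, 0.5]])
mask = tf.constant([[0.0, 0.0, 1.0]])      # 1 marks a padded position
masked = logits + mask * -1e9
print(tf.nn.softmax(masked, axis=-1).numpy())
# approximately [[0.731 0.269 0.   ]] -- the masked position gets (almost) zero weight
```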
Next is the MultiHeadAttention class, the multi-head attention layer. As shown in the figure above, it passes q, k and v each through a linear layer (fully connected layer), then splits out a new dimension according to the number of heads, and each head computes attention on its own split. After the attention is computed, the result is reshaped back to the original dimension and finally passed through another linear layer (fully connected layer) to obtain the output.
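The head-splitting step can be illustrated in isolation (a minimal sketch with assumed toy sizes, mirroring the split_heads method above):

```python
batch_size, seq_len, d_model, num_heads = 2, 5, 16, 4
depth = d_model // num_heads
x = tf.random.uniform((batch_size, seq_len, d_model))   # (2, 5, 16)
x = tf.reshape(x, (batch_size, -1, num_heads, depth))   # (2, 5, 4, 4)
x = tf.transpose(x, perm=[0, 2, 1, 3])                  # (2, 4, 5, 4): (batch, heads, seq, depth)
print(x.shape)
```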
Next is the point_wise_feed_forward_network function, which is simply two fully connected layers.
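In formula form, this corresponds to the standard position-wise feed-forward network (where W1 maps d_model to dff with a ReLU, and W2 maps dff back to d_model):

$$
\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2
$$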
Next is the EncoderLayer class, which corresponds to the contents of the left box in the first figure: mainly a multi-head attention layer and a feed-forward network. In the call method, the input x is first passed in as the q, k and v of the multi-head attention to get the attention output, followed by a dropout and a LayerNorm layer. Note that before the LayerNorm the residual idea is used: the input x is added to the processed attn_output, and the LayerNorm is applied to the sum. Finally there is a feed-forward network followed by dropout + LayerNorm.
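Each sub-layer therefore follows the same residual pattern, which can be summarized as:

$$
\text{out} = \text{LayerNorm}\big(x + \text{Dropout}(\text{Sublayer}(x))\big)
$$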
Next is the DecoderLayer class, which corresponds to the contents of the right box in the first figure: two multi-head attention layers and a feed-forward network. In the call method, the first multi-head attention layer takes the input x as q, k and v, followed by the same dropout + LayerNorm as in the encoder layer. Then comes the second multi-head attention layer; note that here q is no longer the same as k and v: q is the output of the previous attention layer, while k and v are the output of the encoder. Finally a feed-forward network is attached.
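Note the argument order: MultiHeadAttention.call is defined as call(self, v, k, q, mask), so the cross-attention call in DecoderLayer (copied from the code above, with the roles annotated as comments) reads:

```python
attn2, attn_weights_block2 = self.mha2(
    enc_output,    # v: values come from the encoder output
    enc_output,    # k: keys come from the encoder output
    out1,          # q: queries come from the decoder's first attention block
    padding_mask)
```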
Next are the Encoder and Decoder classes, which are very similar: the input tokens are first embedded into word vectors, then the positional encoding is added, and the result is passed through the encoder layers or decoder layers a specified number of times.
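In both classes the input to the first layer is therefore (matching the scaling by the square root of d_model in the call methods above):

$$
x = \sqrt{d_{\text{model}}}\cdot \text{Embedding}(x_{\text{tokens}}) + PE_{0:seq\_len}
$$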
Finally, the Transformer class is also very simple: it runs the encoder first, then the decoder, and finally passes the decoder result through a fully connected layer to get the final output.
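A minimal smoke test of the full model might look like this (hyperparameters, shapes and token ids are assumed for illustration only; they are not the values used for training below, and the masks are omitted just to check shapes):

```python
sample_transformer = Transformer(num_layers=2, d_model=64, num_heads=4, dff=128,
                                 input_vocab_size=8500, target_vocab_size=8000,
                                 pe_input=10000, pe_target=6000)
temp_inp = tf.random.uniform((2, 38), minval=0, maxval=200, dtype=tf.int64)
temp_tar = tf.random.uniform((2, 36), minval=0, maxval=200, dtype=tf.int64)
out, _ = sample_transformer(temp_inp, temp_tar, False, None, None, None)
print(out.shape)  # (2, 36, 8000)
```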
Model training
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
input_vocab_size = tokenizer_pt.vocab_size + 2
target_vocab_size = tokenizer_en.vocab_size + 2
dropout_rate = 0.1
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
def __init__(self, d_model, warmup_steps=4000):
super(CustomSchedule, self).__init__()
self.d_model = d_model
self.d_model = tf.cast(self.d_model, tf.float32)
self.warmup_steps = warmup_steps
def __call__(self, step):
arg1 = tf.math.rsqrt(step)
arg2 = step * (self.warmup_steps ** -1.5)
return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
epsilon=1e-9)
# temp_learning_rate_schedule = CustomSchedule(d_model)
#
# plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))
# plt.ylabel("Learning Rate")
# plt.xlabel("Train Step")
# plt.show()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
from_logits=True, reduction='none')
def loss_function(real, pred):
mask = tf.math.logical_not(tf.math.equal(real, 0))
loss_ = loss_object(real, pred)
mask = tf.cast(mask, dtype=loss_.dtype)
loss_ *= mask
return tf.reduce_mean(loss_)
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
name='train_accuracy')
transformer = Transformer(num_layers, d_model, num_heads, dff,
input_vocab_size, target_vocab_size,
pe_input=input_vocab_size,
pe_target=target_vocab_size,
rate=dropout_rate)
def create_masks(inp, tar):
# Encoder padding mask
enc_padding_mask = create_padding_mask(inp)
# In the second attention module of the decoder .
# This padding mask is used to mask the output of the encoder .
dec_padding_mask = create_padding_mask(inp)
# Used in the first attention module of the decoder.
# It pads and masks future tokens in the input received by the decoder.
look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
dec_target_padding_mask = create_padding_mask(tar)
# print("look_ahead_mask:", look_ahead_mask.shape)
# print("dec_target_padding_mask:", dec_target_padding_mask.shape)
combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)
return enc_padding_mask, combined_mask, dec_padding_mask
checkpoint_path = "../savemodel/transformer"
ckpt = tf.train.Checkpoint(transformer=transformer,
optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
# If a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
ckpt.restore(ckpt_manager.latest_checkpoint)
print('Latest checkpoint restored!!')
EPOCHS = 10
# @tf.function traces and compiles train_step into a TF graph for faster
# execution. The function specializes to the exact shapes of its argument
# tensors. To avoid retracing caused by variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.
train_step_signature = [
tf.TensorSpec(shape=(None, None), dtype=tf.int64),
tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]
# @tf.function
@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
tar_inp = tar[:, :-1]
tar_real = tar[:, 1:]
enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)
# print("enc_padding_mask:", enc_padding_mask.shape)
# print("combined_mask:", combined_mask.shape)
# print("dec_padding_mask:", dec_padding_mask.shape)
with tf.GradientTape() as tape:
predictions, _ = transformer(inp, tar_inp,
True,
enc_padding_mask,
combined_mask,
dec_padding_mask)
loss = loss_function(tar_real, predictions)
gradients = tape.gradient(loss, transformer.trainable_variables)
optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
train_loss(loss)
train_accuracy(tar_real, predictions)
for epoch in range(EPOCHS):
start = time.time()
train_loss.reset_states()
train_accuracy.reset_states()
# inp -> portuguese, tar -> english
for (batch, (inp, tar)) in enumerate(train_dataset):
train_step(inp, tar)
if batch % 50 == 0:
print('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(
epoch + 1, batch, train_loss.result(), train_accuracy.result()))
if (epoch + 1) % 5 == 0:
ckpt_save_path = ckpt_manager.save()
print('Saving checkpoint for epoch {} at {}'.format(epoch + 1,
ckpt_save_path))
print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1,
train_loss.result(),
train_accuracy.result()))
print('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))
Model training is the same as for a general seq2seq model. The CustomSchedule class dynamically adjusts the learning rate according to the training step. Then comes the loss function, which is similar to that of a general seq2seq model. Next is the create_masks function that creates the masks; the create_padding_mask function it calls was analyzed earlier and marks the padded positions of the input as the mask. Two masks are created from inp: one used in the encoder and one used in the decoder. For the target tar, a look-ahead mask and a padding mask are created first and then merged.
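For reference, CustomSchedule above implements the warm-up learning-rate schedule:

$$
lrate = d_{\text{model}}^{-0.5} \cdot \min\left( step^{-0.5},\; step \cdot warmup\_steps^{-1.5} \right)
$$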
The last thing to mention is the train_step function. It is also very simple: first the masks are created from the input and target, then the input and target are passed into the transformer to get the predictions, and finally the loss is computed and the parameters are updated.
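Here is a small illustration of the teacher-forcing split and the resulting mask shapes (toy token ids assumed purely for illustration):

```python
toy_inp = tf.constant([[5, 7, 9, 0, 0]], dtype=tf.int64)     # padded Portuguese ids
toy_tar = tf.constant([[3, 8, 2, 4, 0, 0]], dtype=tf.int64)  # padded English ids
tar_inp, tar_real = toy_tar[:, :-1], toy_tar[:, 1:]          # decoder input vs. expected output
enc_mask, comb_mask, dec_mask = create_masks(toy_inp, tar_inp)
print(enc_mask.shape, comb_mask.shape, dec_mask.shape)
# (1, 1, 1, 5) (1, 1, 5, 5) (1, 1, 1, 5)
```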
Model prediction
def evaluate(inp_sentence):
start_token = [tokenizer_pt.vocab_size]
end_token = [tokenizer_pt.vocab_size + 1]
# The input sentence is Portuguese; add the start and end tokens.
inp_sentence = start_token + tokenizer_pt.encode(inp_sentence) + end_token
encoder_input = tf.expand_dims(inp_sentence, 0)
# Since the target is English, the first token fed to the transformer
# should be the English start token.
decoder_input = [tokenizer_en.vocab_size]
output = tf.expand_dims(decoder_input, 0)
for i in range(MAX_LENGTH):
enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
encoder_input, output)
# predictions.shape == (batch_size, seq_len, vocab_size)
predictions, attention_weights = transformer(encoder_input,
output,
False,
enc_padding_mask,
combined_mask,
dec_padding_mask)
# Select the last word from the seq_len dimension.
predictions = predictions[:, -1:, :] # (batch_size, 1, vocab_size)
predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)
# If predicted_id equals the end token, return the result.
if predicted_id == tokenizer_en.vocab_size + 1:
return tf.squeeze(output, axis=0), attention_weights
# Concatenate predicted_id with the output, which is then fed to the decoder as its input.
output = tf.concat([output, predicted_id], axis=-1)
return tf.squeeze(output, axis=0), attention_weights
def plot_attention_weights(attention, sentence, result, layer):
fig = plt.figure(figsize=(16, 8))
sentence = tokenizer_pt.encode(sentence)
attention = tf.squeeze(attention[layer], axis=0)
for head in range(attention.shape[0]):
ax = fig.add_subplot(2, 4, head + 1)
# Draw the attention weight
ax.matshow(attention[head][:-1, :], cmap='viridis')
fontdict = {'fontsize': 10}
ax.set_xticks(range(len(sentence) + 2))
ax.set_yticks(range(len(result)))
ax.set_ylim(len(result) - 1.5, -0.5)
ax.set_xticklabels(
['<start>'] + [tokenizer_pt.decode([i]) for i in sentence] + ['<end>'],
fontdict=fontdict, rotation=90)
ax.set_yticklabels([tokenizer_en.decode([i]) for i in result
if i < tokenizer_en.vocab_size],
fontdict=fontdict)
ax.set_xlabel('Head {}'.format(head + 1))
plt.tight_layout()
plt.show()
def translate(sentence, plot=''):
result, attention_weights = evaluate(sentence)
predicted_sentence = tokenizer_en.decode([i for i in result
if i < tokenizer_en.vocab_size])
print('Input: {}'.format(sentence))
print('Predicted translation: {}'.format(predicted_sentence))
if plot:
plot_attention_weights(attention_weights, sentence, result, plot)
translate("este é um problema que temos que resolver.")
print ("Real translation: this is a problem we have to solve .")
translate("os meus vizinhos ouviram sobre esta ideia.")
print ("Real translation: and my neighboring homes heard about this idea .")
This part calls the trained model to translate text. The entry point is the translate function.