Learn NLP with Transformer (Chapter 5)
2022-07-25 11:09:00 【Small black board】
Task05: BERT Code
This study follows the Datawhale open-source course: https://github.com/datawhalechina/learn-nlp-with-transformers
The content is largely derived from the original course, reorganized to fit my own learning path.
Personal summary: 1) HuggingFace implemented the BERT model in PyTorch, and the project has since grown into a large open-source community. 2) BERT consists of two parts: BertTokenizer and BertModel. BertTokenizer is the tokenizer; BertModel is the model body, which contains BertEmbeddings, BertEncoder, and BertPooler. 3) HuggingFace's BERT implementation uses several techniques to save GPU memory.
This chapter does not reproduce every line of code; it introduces the inputs and outputs of each parameter and each module. The code itself can be studied in HuggingFace/Transformers on GitHub. HuggingFace is a New York-based chatbot startup that caught the BERT wave very early and implemented a PyTorch-based BERT model. The project was originally called pytorch-pretrained-bert; while reproducing the original results, it provided an easy-to-use interface for all kinds of experiments and research built on this powerful model.
As the number of users grew, the project developed into a large open-source community, incorporated various pre-trained language models, added a TensorFlow implementation, and was renamed Transformers in the second half of 2019.
The main content of this chapter

It mainly includes:
- BERT Tokenization — the tokenizer (BertTokenizer)
- BERT Model — the model body (BertModel)
  - BertEmbeddings
  - BertEncoder
    - BertLayer
      - BertAttention
      - BertIntermediate
      - BertOutput
  - BertPooler
5. BERT Code
5.1 Tokenization — BertTokenizer
The BERT-related tokenizer is mainly implemented in the BertTokenizer class (see the Transformers source on GitHub).
BertTokenizer is a tokenizer built on BasicTokenizer and WordPieceTokenizer:
- BasicTokenizer handles the first step of processing: splitting on punctuation, whitespace, and so on, optionally lowercasing, and cleaning up illegal characters.
  - For Chinese characters, it splits by pre-processing (adding spaces around them);
  - Words listed in never_split are protected from splitting;
  - This step is optional (performed by default).
- WordPieceTokenizer further decomposes words into subwords.
  - A subword sits between a character and a word. It retains much of the word's meaning while handling English singular/plural forms and tenses, vocabulary explosion, and the OOV (Out-Of-Vocabulary) problem caused by unseen words. Separating the root from the affix reduces the vocabulary size and makes training easier;
  - For example, "tokenizer" can be decomposed into "token" and "##izer"; note that the "##" prefix marks a subword that attaches to the previous one.
BertTokenizer has the following commonly used methods:
- from_pretrained: initialize a tokenizer from a directory containing the vocabulary file (vocab.txt);
- tokenize: break text (a word or a sentence) into a list of subwords;
- convert_tokens_to_ids: convert a list of subwords into the list of their vocabulary indices;
- convert_ids_to_tokens: the inverse of the above;
- convert_tokens_to_string: join a subword list back into a word or sentence using the "##" markers;
- encode: for a single-sentence input, tokenize, add the special tokens to form "[CLS], x, [SEP]", and convert to the corresponding vocabulary indices; for a two-sentence input (only the first two are used if more are given), form "[CLS], x1, [SEP], x2, [SEP]" and convert to an index list;
- decode: turn the output of encode back into a complete sentence.
In addition, the tokenizer object itself is callable, as in the example below:
```python
from transformers import BertTokenizer

bt = BertTokenizer.from_pretrained('bert-base-uncased')
bt('I like natural language progressing!')
# {'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```
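For reference, a minimal sketch of some of the methods listed above, using the same bert-base-uncased tokenizer (the exact subword splits and indices depend on the vocabulary file):

```python
tokens = bt.tokenize('I like natural language progressing!')
# e.g. ['i', 'like', 'natural', 'language', 'progressing', '!']
ids = bt.convert_tokens_to_ids(tokens)        # vocabulary indices, no special tokens added
ids_with_special = bt.encode('I like natural language progressing!')  # adds [CLS] ... [SEP]
print(bt.convert_ids_to_tokens(ids_with_special))
print(bt.decode(ids_with_special))            # back to a sentence (special tokens included unless skip_special_tokens=True)
```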
5.2 Model — BertModel
The code for the BERT model itself is mainly in modeling_bert.py on GitHub, which contains the basic structure of the BERT model as well as the fine-tuning models built on top of it.
BertModel is essentially the transformer encoder structure and consists of three parts (a usage sketch follows this list):
- embeddings, i.e. the BertEmbeddings class, which maps token ids to vector representations;
- encoder, i.e. the BertEncoder class;
- pooler, i.e. the BertPooler class (this part is optional).
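As a quick sanity check of how these pieces fit together, a minimal usage sketch (assuming the bt tokenizer from the example above and the bert-base-uncased checkpoint):

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
inputs = bt('I like natural language progressing!', return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
print(outputs.pooler_output.shape)      # (batch_size, hidden_size)
```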
The parameters of BertModel's forward pass and their meanings are as follows:
```python
def forward(
    self,
    input_ids=None,
    attention_mask=None,
    token_type_ids=None,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    past_key_values=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
): ...
```
- input_ids: the list of subword indices produced by the tokenizer;
- attention_mask: in self-attention, this mask distinguishes real subwords from padding; padding positions are set to 0;
- token_type_ids: marks which sentence the current subword belongs to (first sentence / second sentence / padding);
- position_ids: marks the position of the current token within the sentence;
- head_mask: disables the attention computation of certain heads in certain layers;
- inputs_embeds: if provided, input_ids is not needed; the embeddings skip the embedding lookup and go straight into the encoder;
- encoder_hidden_states: only takes effect when BertModel is configured as a decoder, in which case cross-attention is performed instead of self-attention;
- encoder_attention_mask: same as above; used in cross-attention to mark the padding of the encoder input;
- past_key_values: this parameter appears to pass in precomputed K-V products to reduce the cost of cross-attention (which would otherwise be recomputed);
- use_cache: whether to save and return the previous parameter to speed up decoding;
- output_attentions: whether to return the attention output of each intermediate layer;
- output_hidden_states: whether to return the output of each intermediate layer;
- return_dict: whether to return the output as key-value pairs (a ModelOutput class, which can also be used as a tuple); defaults to True.
Note that head_mask, which disables the attention computation of some heads, is different from the attention-head pruning discussed below: it merely multiplies the computation results of some heads by this coefficient.
The output part is as follows:

```python
# The return part of BertModel's forward pass
if not return_dict:
    return (sequence_output, pooled_output) + encoder_outputs[1:]

return BaseModelOutputWithPoolingAndCrossAttentions(
    last_hidden_state=sequence_output,
    pooler_output=pooled_output,
    past_key_values=encoder_outputs.past_key_values,
    hidden_states=encoder_outputs.hidden_states,
    attentions=encoder_outputs.attentions,
    cross_attentions=encoder_outputs.cross_attentions,
)
```
As you can see, the return value contains not only the encoder and pooler outputs, but also the other outputs that were requested (hidden_states, attentions, etc., which live in encoder_outputs[1:]) for easy access:
```python
# The return part of BertEncoder's forward pass, i.e. the encoder_outputs above
if not return_dict:
    return tuple(
        v
        for v in [
            hidden_states,
            next_decoder_cache,
            all_hidden_states,
            all_self_attentions,
            all_cross_attentions,
        ]
        if v is not None
    )
return BaseModelOutputWithPastAndCrossAttentions(
    last_hidden_state=hidden_states,
    past_key_values=next_decoder_cache,
    hidden_states=all_hidden_states,
    attentions=all_self_attentions,
    cross_attentions=all_cross_attentions,
)
```
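To see these extra outputs in practice, a small sketch (again assuming the model and inputs from the earlier sketch):

```python
outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

# 12 encoder layers plus the embedding output -> 13 hidden-state tensors for bert-base
print(len(outputs.hidden_states))       # 13
print(outputs.hidden_states[-1].shape)  # (batch_size, sequence_length, hidden_size)
print(len(outputs.attentions))          # 12
print(outputs.attentions[0].shape)      # (batch_size, num_heads, seq_len, seq_len)
```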
In addition, BertModel provides the following methods so that BERT users can perform various operations:
- get_input_embeddings: get the word_embeddings part of the embedding layer, i.e. the word vectors;
- set_input_embeddings: assign new values to the word_embeddings of the embedding layer;
- _prune_heads: prune attention heads. The input is a dictionary of the form {layer_num: list of heads to prune in this layer}, which prunes the given attention heads of the specified layers.
Pruning is a fairly involved operation: the query/key/value weights of the heads to keep, together with the weights of the fully connected layer that follows the concatenation, must be copied into new, smaller weight matrices (taking care not to copy the grad as well), and the pruned heads must be recorded so that later indices stay correct. See the prune_heads method in the BertAttention section below for details.
5.2.1 BertEmbeddings
The embedding is the sum of three parts:
- word_embeddings: the embedding of each subword, as described above.
- token_type_embeddings: indicates which sentence the current token belongs to, helping distinguish sentences from padding and the two sentences of a sentence pair.
- position_embeddings: the embedding of each token's position in the sentence, used to distinguish word order. Unlike the design in the Transformer paper, these are learned rather than computed with fixed sinusoidal functions; this is generally considered to hurt extensibility (it is hard to transfer directly to longer sentences).
The three embeddings are summed without weighting and passed through a LayerNorm + dropout layer before being output; the output size is (batch_size, sequence_length, hidden_size).
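A minimal sketch of this forward logic, simplified from the real BertEmbeddings (the class name, default sizes, and the omission of registered buffers and padding handling are my own simplifications):

```python
import torch
from torch import nn

class SimpleBertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768, max_position=512,
                 type_vocab_size=2, dropout=0.1, layer_norm_eps=1e-12):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # sum the three embeddings without weighting, then LayerNorm + dropout
        embeddings = (self.word_embeddings(input_ids)
                      + self.position_embeddings(position_ids)
                      + self.token_type_embeddings(token_type_ids))
        return self.dropout(self.LayerNorm(embeddings))
```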
5.2.2 BertEncoder
BertEncoder is a stack of BertLayer modules. The module itself needs little explanation, but one detail is worth noting: it uses gradient checkpointing to reduce memory usage during training.
Gradient checkpointing reduces memory by saving fewer computation-graph nodes; the values that were not stored are then recomputed during the backward pass. See the paper "Training Deep Nets with Sublinear Memory Cost" for an illustration of the process.
In BertEncoder, gradient checkpointing is implemented via torch.utils.checkpoint.checkpoint, which is easy to use; see the documentation: torch.utils.checkpoint - PyTorch 1.8.1 documentation.
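As a rough illustration of how torch.utils.checkpoint.checkpoint is used (run_layer is a hypothetical helper for this sketch, not the actual BertEncoder code):

```python
import torch
from torch.utils.checkpoint import checkpoint

def run_layer(layer, hidden_states, attention_mask, use_checkpoint=True):
    """Run one encoder layer, optionally without storing its intermediate activations."""
    if use_checkpoint and hidden_states.requires_grad:
        # Activations inside `layer` are recomputed during backward instead of being stored.
        return checkpoint(layer, hidden_states, attention_mask)
    return layer(hidden_states, attention_mask)
```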
5.2.2.1 BertAttention
The self member implements multi-head attention, and the output member implements the fully connected layer + dropout + residual + LayerNorm that follows the attention.
```python
class BertAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)
        self.pruned_heads = set()
```
Let's look at this layer first. Here is the pruning operation mentioned above, i.e. the prune_heads method. Its implementation can be summarized as follows:
find_pruneable_heads_and_indices determines which heads need to be pruned and which dimension indices need to be kept; prune_linear_layer takes the Wq/Wk/Wv weight matrices (together with their biases), keeps the unpruned dimensions according to those indices, and copies them into new, smaller matrices.
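From the user's side, this is exposed through the prune_heads method inherited from PreTrainedModel; a minimal sketch (the layer and head indices are chosen arbitrarily for illustration):

```python
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
# Prune heads 0 and 2 of layer 0, and head 5 of layer 3.
model.prune_heads({0: [0, 2], 3: [5]})
# The query/key/value projections of layer 0 now output 10 heads * 64 dims = 640 features.
print(model.encoder.layer[0].attention.self.query.out_features)  # 640
```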
Next comes the main event: the concrete implementation of self-attention.
5.2.2.1.1 BertSelfAttention
This is arguably the core of the model, and the only place where the formulas come into play, so a fair amount of code is shown.
The initialization part:
```python
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder
```
- Apart from the familiar query, key, and value weights and a dropout, there is a mysterious position_embedding_type, plus an is_decoder flag;
- Note that hidden_size and all_head_size are initially the same. Why keep a seemingly redundant variable? Clearly because of the pruning function above: once a few attention heads are pruned, all_head_size naturally becomes smaller;
- hidden_size must be an integer multiple of num_attention_heads. Taking bert-base as an example, each attention module has 12 heads and hidden_size is 768, so each head has size attention_head_size = 768 / 12 = 64;
- What is position_embedding_type? Keep reading.
Now for the key part: the forward pass.
First, recall the basic formulas of multi-head self-attention:
$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$$

$$\mathrm{head}_i = \mathrm{SDPA}(QW_i^Q, KW_i^K, VW_i^V)$$

$$\mathrm{SDPA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
As is well known, these attention heads are computed in parallel, which is why the query, key, and value weights above are each a single matrix: it is not that all heads share one weight, but that the per-head weights are "concatenated" together.
Now look at the forward method:
```python
def transpose_for_scores(self, x):
    new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
    x = x.view(*new_x_shape)
    return x.permute(0, 2, 1, 3)

def forward(
    self,
    hidden_states,
    attention_mask=None,
    head_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    past_key_value=None,
    output_attentions=False,
):
    mixed_query_layer = self.query(hidden_states)

    # part of the cross-attention computation omitted
    key_layer = self.transpose_for_scores(self.key(hidden_states))
    value_layer = self.transpose_for_scores(self.value(hidden_states))
    query_layer = self.transpose_for_scores(mixed_query_layer)

    # Take the dot product between "query" and "key" to get the raw attention scores.
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
    # ...
```
- transpose_for_scores splits hidden_size into the shape of multiple head outputs and transposes the middle two dimensions so that the matrix multiplication works per head;
- key_layer / value_layer / query_layer now have shape (batch_size, num_attention_heads, sequence_length, attention_head_size);
- attention_scores has shape (batch_size, num_attention_heads, sequence_length, sequence_length), which matches the attention map obtained by computing each head separately (a quick shape check follows this list).
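A small sketch to verify these shapes on dummy tensors (dimensions chosen to match bert-base; this mirrors transpose_for_scores rather than calling the real module):

```python
import torch

batch_size, seq_len, num_heads, head_size = 2, 8, 12, 64
hidden = torch.randn(batch_size, seq_len, num_heads * head_size)  # (2, 8, 768)

# same reshaping as transpose_for_scores
x = hidden.view(batch_size, seq_len, num_heads, head_size).permute(0, 2, 1, 3)
print(x.shape)  # torch.Size([2, 12, 8, 64])

scores = torch.matmul(x, x.transpose(-1, -2))
print(scores.shape)  # torch.Size([2, 12, 8, 8])
```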
This is where K and Q are multiplied to get the raw attention scores. According to the formula, the next step should be scaling by $\sqrt{d_k}$ and applying softmax. However, what appears first is an unfamiliar positional_embedding, and a pile of Einstein summations (einsum):
```python
    # ...
    if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
        seq_length = hidden_states.size()[1]
        position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
        position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
        distance = position_ids_l - position_ids_r
        positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
        positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

        if self.position_embedding_type == "relative_key":
            relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
            attention_scores = attention_scores + relative_position_scores
        elif self.position_embedding_type == "relative_key_query":
            relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
            relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
            attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
    # ...
```
Depending on position_embedding_type, there are three behaviors (a small sketch of the distance matrix follows this list):
- absolute: the default; nothing is done in this branch;
- relative_key: multiply the positional_embedding with the query matrix and add the result to the scores as key-relative position encoding;
- relative_key_query: apply the same treatment to both query and key, adding both relative position score terms.
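To make the distance matrix in the snippet above concrete, a tiny sketch (seq_length chosen arbitrarily):

```python
import torch

seq_length, max_position_embeddings = 5, 512
position_ids_l = torch.arange(seq_length).view(-1, 1)
position_ids_r = torch.arange(seq_length).view(1, -1)
distance = position_ids_l - position_ids_r
print(distance)
# tensor([[ 0, -1, -2, -3, -4],
#         [ 1,  0, -1, -2, -3],
#         [ 2,  1,  0, -1, -2],
#         [ 3,  2,  1,  0, -1],
#         [ 4,  3,  2,  1,  0]])
# Shifting by max_position_embeddings - 1 maps all distances to non-negative embedding indices.
print((distance + max_position_embeddings - 1).min())  # tensor(507)
```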
Back to the normal attention flow:
```python
    # ...
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    if attention_mask is not None:
        # Apply the attention mask (precomputed for all layers in BertModel forward() function)
        attention_scores = attention_scores + attention_mask  # why + instead of *?

    # Normalize the attention scores to probabilities.
    attention_probs = nn.Softmax(dim=-1)(attention_scores)

    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = self.dropout(attention_probs)

    # Mask heads if we want to
    if head_mask is not None:
        attention_probs = attention_probs * head_mask

    context_layer = torch.matmul(attention_probs, value_layer)
    context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
    new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
    context_layer = context_layer.view(*new_context_layer_shape)

    outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

    # decoder return-value part omitted...
    return outputs
```
What is attention_scores = attention_scores + attention_mask doing here? Shouldn't masking be done by multiplication?
- Because the attention_mask has already been "tampered with": positions that were originally 1 become 0, and positions that were originally 0 (i.e. padding) become a large negative number, so adding it pushes the padding scores to large negative values.
- Why a large negative number? Because after softmax, those entries become decimals close to 0.
```
(Pdb) attention_mask
tensor([[[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        ...,
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]],
        [[[    -0.,     -0.,     -0.,  ..., -10000., -10000., -10000.]]]],
       device='cuda:0')
```
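A quick numeric check of why a large negative number works (toy scores chosen arbitrarily):

```python
import torch

scores = torch.tensor([1.0, 2.0, -10000.0])
print(torch.softmax(scores, dim=-1))
# tensor([0.2689, 0.7311, 0.0000])  -> the masked position contributes essentially nothing
```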
So where does this transformation happen?
There is no answer in modeling_bert.py, but modeling_utils.py contains a special class, ModuleUtilsMixin, and the clue lies in its get_extended_attention_mask method:
```python
def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple[int], device: device) -> Tensor:
    """
    Makes broadcastable attention and causal masks so that future and masked tokens are ignored.

    Arguments:
        attention_mask (:obj:`torch.Tensor`):
            Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
        input_shape (:obj:`Tuple[int]`):
            The shape of the input to the model.
        device: (:obj:`torch.device`):
            The device of the input to the model.

    Returns:
        :obj:`torch.Tensor` The extended attention mask, with the same dtype as :obj:`attention_mask.dtype`.
    """
    # part omitted...

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
    extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
    return extended_attention_mask
```
So when is this function called, and what does it have to do with BertModel?
This involves the inheritance chain of BertModel: BertModel inherits from BertPreTrainedModel, which inherits from PreTrainedModel, and PreTrainedModel in turn inherits from the three base classes nn.Module, ModuleUtilsMixin, and GenerationMixin — quite a deep wrapper!
That means BertModel must, at some step, call get_extended_attention_mask on the original attention_mask, turning its values from [1, 0] into [0, -1e4].
Indeed, this call shows up in BertModel's forward pass (line 944):
```python
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
```
Problem solved: this method not only changes the mask values, but also broadcasts the mask into a shape that can be added directly to the attention map.
So it was you all along, HuggingFace.
Other details worth noting:
- The scores are scaled by the dimension of each head; for bert-base that is 64, whose square root is 8;
- attention_probs not only goes through softmax but also through a dropout, presumably out of concern that the attention matrix is too dense. The comment notes that this looks unusual, but it is what the original Transformer paper does;
- head_mask is the per-head mask mentioned earlier; if not set, it defaults to all ones and has no effect here;
- context_layer is the product of the attention matrix and the value matrix; its original shape is (batch_size, num_attention_heads, sequence_length, attention_head_size);
- After the permute and view operations, context_layer is restored to shape (batch_size, sequence_length, hidden_size).
5.2.2.1.2 BertSelfOutput
```python
class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
```
Here again is the LayerNorm and dropout combination, except that dropout comes first and LayerNorm is applied after the residual connection. As for why a residual connection is used: its most direct purpose is to reduce the training difficulty caused by very deep networks and keep the model sensitive to the raw input.
5.2.2.2 BertIntermediate
With BertAttention covered, what follows the attention is a fully connected layer plus an activation:
```python
class BertIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states
```
- The fully connected layer here is an expansion: for bert-base, the expanded dimension is 3072, four times the original dimension of 768;
- The default activation function is GELU (Gaussian Error Linear Units). Its exact form involves the Gaussian error function, so in practice an approximation containing tanh is often used (see below).
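For reference, the standard definition of GELU and its commonly used tanh approximation (general formulas, not specific to this codebase):

$$\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left(1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right) \approx \frac{x}{2}\left(1 + \tanh\!\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right)\right)$$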
5.2.2.3 BertOutput
This is another fully connected layer + dropout + LayerNorm, again with a residual connection:
```python
class BertOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
```
The operations here are essentially identical to BertSelfOutput (only the input dimension differs), which makes the two components very easy to confuse.
The library also contains application models built on BERT, as well as BERT-related optimizers and their usage, which will be covered in detail in the next article.
5.2.3 BertPooler
This layer simply takes the first token of the sentence, i.e. the vector corresponding to [CLS], passes it through a fully connected layer and an activation function, and outputs the result. (This part is optional, since there are many possible pooling operations.)
```python
class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
```