Transformers PreTrainedTokenizer class
2022-06-24 08:11:00 【Live up to your youth】
Overview of the base class
The PreTrainedTokenizer class is the base class of all tokenizer classes. It cannot be instantiated directly; all tokenizer classes (such as BertTokenizer, DebertaTokenizer, etc.) inherit from PreTrainedTokenizer and implement the methods of the base class.
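The examples in the sections below all call a tokenizer object without showing how it was created. A minimal setup sketch, assuming the bert-base-uncased checkpoint (an assumption, but one consistent with the ids shown in the outputs below):
# PreTrainedTokenizer itself is never instantiated; load a concrete subclass instead
# "bert-base-uncased" is an assumed checkpoint
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")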
Base class methods
1、__call__ function
__call__(
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = False,
max_length: Optional[int] = None,
stride: int = 0,
is_split_into_words: bool = False,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
**kwargs
)
This function returns a BatchEncoding object. BatchEncoding inherits from the Python dictionary type, so it can be used like a dictionary. In addition, it provides methods for mapping between words/characters and tokens. Taking BertTokenizer as an example, the meaning of each parameter is explained below.
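A quick sketch of the dictionary-like behaviour (using the tokenizer set up above):
>>> encoding = tokenizer(text="The sailors rode the breeze clear of the rocks.")
>>> list(encoding.keys())       # BatchEncoding can be indexed and iterated like a dict
['input_ids', 'token_type_ids', 'attention_mask']
>>> encoding["input_ids"][:3]
[101, 1996, 11279]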
The text parameter is the sequence or batch of sequences to be encoded, and the text_pair parameter is the second sequence (sentence pair) or batch of second sequences. Each can be a str or a list of str; if the input is a list of words (pre-tokenized input), the parameter is_split_into_words should be True.
# text is a single string
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.")
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# text is a list of strings
>>> tokenizer(text=["The sailors rode the breeze clear of the rocks."])
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
# text is a batch of pre-tokenized inputs (a list of word lists); set is_split_into_words=True
>>> tokenizer(text=[["The", "sailors", "rode", "the", "breeze", "clear", "of", "the", "rocks"]], is_split_into_words=True)
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
# text_pair should have the same format as text: if text is a list, text_pair should also be a list
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",
text_pair="I demand that the more John eat, the more he pays.")
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The add_special_tokens parameter indicates whether to add the model's special tokens, such as [CLS], [SEP], and [PAD]. The default is True.
# add_special_tokens=False: do not add special tokens
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",add_special_tokens=False)
{'input_ids': [1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# The default is True: special tokens are added
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.")
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks."],add_special_tokens=True)
>>> tokenizer.batch_decode(encodings["input_ids"])
['[CLS] the sailors rode the breeze clear of the rocks. [SEP]']
The padding parameter indicates whether to pad, and the truncation parameter indicates whether to truncate. padding can be a bool or a str naming the padding strategy: longest, max_length, or do_not_pad. Likewise, truncation can be a bool or a str naming the truncation strategy: longest_first, only_first, only_second, or do_not_truncate. Both parameters default to False.
The max_length parameter specifies the maximum length used for padding or truncation; it is usually set together with padding and truncation to control the sequence length. If pad_to_multiple_of is set, the sequence is padded to a multiple of the provided value (see the sketch after the truncation examples below).
# padding=True: pad every sequence to the length of the longest sentence in the batch
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],padding=True)
>>> encodings
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 0, 0, 0], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
>>> tokenizer.batch_decode(encodings["input_ids"])
['[CLS] the sailors rode the breeze clear of the rocks. [SEP] [PAD] [PAD] [PAD]', '[CLS] i demand that the more john eat, the more he pays. [SEP]']
# truncation=True: enable truncation. Since max_length is not set, only sequences longer than the
# model's maximum length would be truncated, so neither sentence is truncated here.
# Set max_length to specify the truncation length.
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],truncation=True)
>>> encodings
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
# truncation=True together with max_length=10
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],truncation=True, max_length=10)
>>> encodings
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
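As noted above, a sketch of pad_to_multiple_of; the exact ids are omitted here, the point is the padded length, which should be rounded up from 15 (the longest sentence) to the next multiple of 8:
# padding=True pads to the longest sentence (15 tokens); pad_to_multiple_of=8 rounds that up to 16
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],padding=True, pad_to_multiple_of=8)
>>> [len(ids) for ids in encodings["input_ids"]]
[16, 16]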
The stride parameter: when truncation=True and return_overflowing_tokens=True, the returned overflowing sequence also contains some tokens from the end of the truncated sequence, providing overlap between the truncated sequence and the overflowing sequence. stride specifies the number of overlapping tokens.
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],truncation=True, max_length=10, stride=2, return_overflowing_tokens=True)
>>> encodings
# Here [1997, 1996] at the start of the first overflowing_tokens list are the overlapping tokens; there are 2 of them, matching stride=2
{'overflowing_tokens': [[1997, 1996, 5749, 1012], [4521, 1010, 1996, 2062, 2002, 12778, 1012]], 'num_truncated_tokens': [2, 5], 'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
The return_tensors parameter specifies the type of the returned data: "tf" (TensorFlow tensors), "pt" (PyTorch tensors), or "np" (NumPy arrays).
>>> encodings = tokenizer(text="The sailors rode the breeze clear of the rocks.",return_tensors="tf")
>>> type(encodings["input_ids"])
<class 'tensorflow.python.framework.ops.EagerTensor'>
>>> encodings = tokenizer(text="The sailors rode the breeze clear of the rocks.",return_tensors="np")
>>> type(encodings["input_ids"])
<class 'numpy.ndarray'>
The return_token_type_ids parameter indicates whether to return segment (token type) ids, return_attention_mask whether to return the attention mask, return_overflowing_tokens whether to return the ids of the truncated (overflowing) tokens, return_special_tokens_mask whether to return the special tokens mask, and return_length whether to return the length of each sequence.
# There is only one segment, so all token_type_ids are 0
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",return_token_type_ids=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# There are two segments; the token_type_ids of the second segment are 1
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",text_pair="I demand that the more John eat, the more he pays.",return_token_type_ids=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# overflowing_tokens contains the ids of the truncated tokens
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",truncation=True,max_length=10,return_overflowing_tokens=True)
{'overflowing_tokens': [5749, 1012], 'num_truncated_tokens': 2, 'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# In special_tokens_mask, special tokens such as [CLS] and [SEP] are marked 1 and all other tokens 0
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",return_special_tokens_mask=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# length gives the length of each sequence
>>> tokenizer(text=["The sailors rode the breeze clear of the rocks.", "I demand that the more John eat, the more he pays."],return_length=True)
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'length': [12, 15], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
The return_offsets_mapping parameter indicates whether to return the character offsets (start, end) of each token in the original sentence. It is only available for fast tokenizers, such as BertTokenizerFast.
>>> tokenizer("The sailors rode the breeze clear of the rocks.",return_offsets_mapping=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 3), (4, 11), (12, 16), (17, 20), (21, 27), (28, 33), (34, 36), (37, 40), (41, 46), (46, 47), (0, 0)]}
The verbose parameter indicates whether to print additional warning messages.
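Putting several of these options together, a typical preprocessing call for feeding a model might look like the following sketch (the choice of PyTorch tensors and max_length=32 are assumptions, not taken from the examples above):
>>> batch = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],padding=True, truncation=True, max_length=32, return_tensors="pt")
>>> batch["input_ids"].shape    # (batch_size, padded_length); the longest sentence here has 15 tokens
torch.Size([2, 15])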
2、encode function
encode(
text: Union[TextInput, PreTokenizedInput, EncodedInput],
text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = False,
max_length: Optional[int] = None,
stride: int = 0,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs
)
This function uses the tokenizer to encode a string into a list of ints. All parameters have the same meaning as in __call__. It is rarely used for batch processing; __call__ is normally used instead.
# text can be a str
>>> encoding = tokenizer.encode(text="The sailors rode the breeze clear of the rocks.")
>>> encoding
[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
# text can also be a List[str], where each str in the list is treated as a token
# (note: "The" with a capital T is not in the uncased vocabulary, so it maps to 100, the [UNK] id)
>>> encoding = tokenizer.encode(text=["The", "sailors", "rode", "the", "breeze", "clear", "of", "the", "rocks", "."])
>>> encoding
[101, 100, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
# text can also be a List[int], where each int is the id corresponding to a token
>>> encodings = tokenizer.encode(text=[100, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012])
>>> encodings
[101, 100, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
3、decode function
decode(
token_ids: Union[int, List[int], "np.ndarray", "torch.Tensor", "tf.Tensor"],
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs
)
This function uses the tokenizer to convert a list of ints into a str.
The clean_up_tokenization_spaces parameter indicates whether to clean up tokenization spaces. If it is set to False, the spaces between punctuation and words are kept when converting ids to a str; if True, they are removed. The default is True.
>>> decodings = tokenizer.decode([101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102])
>>> decodings
'[CLS] the sailors rode the breeze clear of the rocks. [SEP]'  # no space between the punctuation and the preceding word
>>> decodings = tokenizer.decode([101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], clean_up_tokenization_spaces=False)
>>> decodings
'[CLS] the sailors rode the breeze clear of the rocks . [SEP]'  # a space remains between the punctuation and the preceding word
4、batch_decode function
batch_decode(
sequences: Union[List[int], List[List[int]], "np.ndarray", "torch.Tensor", "tf.Tensor"],
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs
)
This function processes multiple List[int] at once, i.e. a List[List[int]], and returns a List[str].
# sequences is a List[List[int]]
>>> decodings = tokenizer.batch_decode([[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]])
>>> decodings
['[CLS] the sailors rode the breeze clear of the rocks. [SEP]']
5、convert_ids_to_tokens function
convert_ids_to_tokens(
ids: Union[int, List[int]], skip_special_tokens: bool = False
)
Converts a list of ids into a list of str (tokens).
>>> tokens = tokenizer.convert_ids_to_tokens([101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102])
>>> tokens
['[CLS]', 'the', 'sailors', 'rode', 'the', 'breeze', 'clear', 'of', 'the', 'rocks', '.', '[SEP]']
6、convert_tokens_to_ids function
convert_tokens_to_ids(tokens: Union[str, List[str]])
Converts tokens (a List[str]) into a list of ids (List[int]).
>>> ids = tokenizer.convert_tokens_to_ids(['[CLS]', 'the', 'sailors', 'rode', 'the', 'breeze', 'clear', 'of', 'the', 'rocks', '.', '[SEP]'])
>>> ids
[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
7、tokenize function
tokenize(text: TextInput, **kwargs)
Converts a str into a List[str] of tokens.
>>> tokenizer.tokenize("The sailors rode the breeze clear of the rocks.")
['the', 'sailors', 'rode', 'the', 'breeze', 'clear', 'of', 'the', 'rocks', '.']
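To tie these helpers together, a small round-trip sketch (still assuming the uncased BERT tokenizer from the beginning):
>>> text = "The sailors rode the breeze clear of the rocks."
>>> tokens = tokenizer.tokenize(text)                  # str -> List[str]
>>> ids = tokenizer.convert_tokens_to_ids(tokens)      # List[str] -> List[int], no special tokens
>>> ids == tokenizer.encode(text, add_special_tokens=False)
True
>>> tokenizer.decode(ids)                              # List[int] -> str (lower-cased by the uncased model)
'the sailors rode the breeze clear of the rocks.'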