Transformers PreTrainedTokenizer class
2022-06-24 08:11:00 【Live up to your youth】
Overview of the base class
The PreTrainedTokenizer class is the base class of all tokenizer classes. It cannot be instantiated directly; all tokenizer classes (such as BertTokenizer, DebertaTokenizer, etc.) inherit from PreTrainedTokenizer and implement the methods of the base class.
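The examples in the sections below all call a tokenizer object without showing how it was created. A minimal setup sketch, assuming the bert-base-uncased checkpoint (an assumption, but one consistent with the ids shown in the outputs below):
# PreTrainedTokenizer itself is never instantiated; load a concrete subclass instead
# "bert-base-uncased" is an assumed checkpoint
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")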
Base class methods
1、__call__ function
__call__(
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = False,
max_length: Optional[int] = None,
stride: int = 0,
is_split_into_words: bool = False,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_token_type_ids: Optional[bool] = None,
return_attention_mask: Optional[bool] = None,
return_overflowing_tokens: bool = False,
return_special_tokens_mask: bool = False,
return_offsets_mapping: bool = False,
return_length: bool = False,
verbose: bool = True,
**kwargs
)
This function returns a BatchEncoding object. BatchEncoding inherits from the Python dictionary type, so it can be used like a dictionary. In addition, it provides methods for mapping between words/characters and tokens. Taking BertTokenizer as an example, the meaning of each parameter is explained below.
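A quick sketch of the dictionary-like behaviour (using the tokenizer set up above):
>>> encoding = tokenizer(text="The sailors rode the breeze clear of the rocks.")
>>> list(encoding.keys())       # BatchEncoding can be indexed and iterated like a dict
['input_ids', 'token_type_ids', 'attention_mask']
>>> encoding["input_ids"][:3]
[101, 1996, 11279]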
The text parameter is the sequence or batch of sequences to be encoded, and the text_pair parameter is the second sequence (sentence pair) or batch of second sequences. Each can be a str or a list of str; if the input is a list of words (pre-tokenized input), the parameter is_split_into_words should be True.
# text is a single string
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.")
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# text is a list of strings
>>> tokenizer(text=["The sailors rode the breeze clear of the rocks."])
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
# text is a batch of pre-tokenized inputs (a list of word lists); set is_split_into_words=True
>>> tokenizer(text=[["The", "sailors", "rode", "the", "breeze", "clear", "of", "the", "rocks"]], is_split_into_words=True)
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
# text_pair should have the same format as text: if text is a list, text_pair should also be a list
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",
text_pair="I demand that the more John eat, the more he pays.")
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The add_special_tokens parameter indicates whether to add the model's special tokens, such as [CLS], [SEP], and [PAD]. The default is True.
# add_special_tokens=False: do not add special tokens
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",add_special_tokens=False)
{'input_ids': [1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# The default is True: special tokens are added
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.")
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks."],add_special_tokens=True)
>>> tokenizer.batch_decode(encodings["input_ids"])
['[CLS] the sailors rode the breeze clear of the rocks. [SEP]']
The padding parameter indicates whether to pad, and the truncation parameter indicates whether to truncate. padding can be a bool or a str naming the padding strategy: longest, max_length, or do_not_pad. Likewise, truncation can be a bool or a str naming the truncation strategy: longest_first, only_first, only_second, or do_not_truncate. Both parameters default to False.
The max_length parameter specifies the maximum length used for padding or truncation; it is usually set together with padding and truncation to control the sequence length. If pad_to_multiple_of is set, the sequence is padded to a multiple of the provided value (see the sketch after the truncation examples below).
# padding=True: pad every sequence to the length of the longest sentence in the batch
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],padding=True)
>>> encodings
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 0, 0, 0], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
>>> tokenizer.batch_decode(encodings["input_ids"])
['[CLS] the sailors rode the breeze clear of the rocks. [SEP] [PAD] [PAD] [PAD]', '[CLS] i demand that the more john eat, the more he pays. [SEP]']
# truncation=True: enable truncation. Since max_length is not set, only sequences longer than the
# model's maximum length would be truncated, so neither sentence is truncated here.
# Set max_length to specify the truncation length.
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],truncation=True)
>>> encodings
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
# truncation=True together with max_length=10
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],truncation=True, max_length=10)
>>> encodings
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
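As noted above, a sketch of pad_to_multiple_of; the exact ids are omitted here, the point is the padded length, which should be rounded up from 15 (the longest sentence) to the next multiple of 8:
# padding=True pads to the longest sentence (15 tokens); pad_to_multiple_of=8 rounds that up to 16
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],padding=True, pad_to_multiple_of=8)
>>> [len(ids) for ids in encodings["input_ids"]]
[16, 16]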
The stride parameter: when truncation=True and return_overflowing_tokens=True, the returned overflowing sequence also contains some tokens from the end of the truncated sequence, providing overlap between the truncated sequence and the overflowing sequence. stride specifies the number of overlapping tokens.
>>> encodings = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],truncation=True, max_length=10, stride=2, return_overflowing_tokens=True)
>>> encodings
# Here [1997, 1996] at the start of the first overflowing_tokens list are the overlapping tokens; there are 2 of them, matching stride=2
{'overflowing_tokens': [[1997, 1996, 5749, 1012], [4521, 1010, 1996, 2062, 2002, 12778, 1012]], 'num_truncated_tokens': [2, 5], 'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
The return_tensors parameter specifies the type of the returned data: "tf" (TensorFlow tensors), "pt" (PyTorch tensors), or "np" (NumPy arrays).
>>> encodings = tokenizer(text="The sailors rode the breeze clear of the rocks.",return_tensors="tf")
>>> type(encodings["input_ids"])
<class 'tensorflow.python.framework.ops.EagerTensor'>
>>> encodings = tokenizer(text="The sailors rode the breeze clear of the rocks.",return_tensors="np")
>>> type(encodings["input_ids"])
<class 'numpy.ndarray'>
The return_token_type_ids parameter indicates whether to return segment (token type) ids, return_attention_mask whether to return the attention mask, return_overflowing_tokens whether to return the ids of the truncated (overflowing) tokens, return_special_tokens_mask whether to return the special tokens mask, and return_length whether to return the length of each sequence.
# There is only one segment, so all token_type_ids are 0
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",return_token_type_ids=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# There are two segments; the token_type_ids of the second segment are 1
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",text_pair="I demand that the more John eat, the more he pays.",return_token_type_ids=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# overflowing_tokens contains the ids of the truncated tokens
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",truncation=True,max_length=10,return_overflowing_tokens=True)
{'overflowing_tokens': [5749, 1012], 'num_truncated_tokens': 2, 'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# In special_tokens_mask, special tokens such as [CLS] and [SEP] are marked 1 and all other tokens 0
>>> tokenizer(text="The sailors rode the breeze clear of the rocks.",return_special_tokens_mask=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# length gives the length of each sequence
>>> tokenizer(text=["The sailors rode the breeze clear of the rocks.", "I demand that the more John eat, the more he pays."],return_length=True)
{'input_ids': [[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], [101, 1045, 5157, 2008, 1996, 2062, 2198, 4521, 1010, 1996, 2062, 2002, 12778, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'length': [12, 15], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
The return_offsets_mapping parameter indicates whether to return the character offsets (start, end) of each token in the original sentence. It is only available for fast tokenizers, such as BertTokenizerFast.
>>> tokenizer("The sailors rode the breeze clear of the rocks.",return_offsets_mapping=True)
{'input_ids': [101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 3), (4, 11), (12, 16), (17, 20), (21, 27), (28, 33), (34, 36), (37, 40), (41, 46), (46, 47), (0, 0)]}
The verbose parameter indicates whether to print additional warning messages.
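Putting several of these options together, a typical preprocessing call for feeding a model might look like the following sketch (the choice of PyTorch tensors and max_length=32 are assumptions, not taken from the examples above):
>>> batch = tokenizer(text=["The sailors rode the breeze clear of the rocks.","I demand that the more John eat, the more he pays."],padding=True, truncation=True, max_length=32, return_tensors="pt")
>>> batch["input_ids"].shape    # (batch_size, padded_length); the longest sentence here has 15 tokens
torch.Size([2, 15])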
2、encode function
encode(
text: Union[TextInput, PreTokenizedInput, EncodedInput],
text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
add_special_tokens: bool = True,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = False,
max_length: Optional[int] = None,
stride: int = 0,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs
)
This function uses the tokenizer to encode a string into a list of ints. All parameters have the same meaning as in __call__. It is rarely used for batch processing; __call__ is normally used instead.
# text can be a str
>>> encoding = tokenizer.encode(text="The sailors rode the breeze clear of the rocks.")
>>> encoding
[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
# text can also be a List[str], where each str in the list is treated as a token
# (note: "The" with a capital T is not in the uncased vocabulary, so it maps to 100, the [UNK] id)
>>> encoding = tokenizer.encode(text=["The", "sailors", "rode", "the", "breeze", "clear", "of", "the", "rocks", "."])
>>> encoding
[101, 100, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
# text can also be a List[int], where each int is the id corresponding to a token
>>> encodings = tokenizer.encode(text=[100, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012])
>>> encodings
[101, 100, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
3、decode function
decode(
token_ids: Union[int, List[int], "np.ndarray", "torch.Tensor", "tf.Tensor"],
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs
)
This function uses the tokenizer to convert a list of ints into a str.
The clean_up_tokenization_spaces parameter indicates whether to clean up tokenization spaces. If it is set to False, the spaces between punctuation and words are kept when converting ids to a str; if True, they are removed. The default is True.
>>> decodings = tokenizer.decode([101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102])
>>> decodings
'[CLS] the sailors rode the breeze clear of the rocks. [SEP]'  # no space between the punctuation and the preceding word
>>> decodings = tokenizer.decode([101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102], clean_up_tokenization_spaces=False)
>>> decodings
'[CLS] the sailors rode the breeze clear of the rocks . [SEP]'  # a space remains between the punctuation and the preceding word
4、batch_decode function
batch_decode(
sequences: Union[List[int], List[List[int]], "np.ndarray", "torch.Tensor", "tf.Tensor"],
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs
)
This function processes multiple List[int] at once, i.e. a List[List[int]], and returns a List[str].
# sequences is a List[List[int]]
>>> decodings = tokenizer.batch_decode([[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]])
>>> decodings
['[CLS] the sailors rode the breeze clear of the rocks. [SEP]']
5、convert_ids_to_tokens function
convert_ids_to_tokens(
ids: Union[int, List[int]], skip_special_tokens: bool = False
)
Converts a list of ids into a list of str (tokens).
>>> tokens = tokenizer.convert_ids_to_tokens([101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102])
>>> tokens
['[CLS]', 'the', 'sailors', 'rode', 'the', 'breeze', 'clear', 'of', 'the', 'rocks', '.', '[SEP]']
6、convert_tokens_to_ids function
convert_tokens_to_ids(tokens: Union[str, List[str]])
Converts tokens (a List[str]) into a list of ids (List[int]).
>>> ids = tokenizer.convert_tokens_to_ids(['[CLS]', 'the', 'sailors', 'rode', 'the', 'breeze', 'clear', 'of', 'the', 'rocks', '.', '[SEP]'])
>>> ids
[101, 1996, 11279, 8469, 1996, 9478, 3154, 1997, 1996, 5749, 1012, 102]
7、tokenize function
tokenize(text: TextInput, **kwargs)
Converts a str into a List[str] of tokens.
>>> tokenizer.tokenize("The sailors rode the breeze clear of the rocks.")
['the', 'sailors', 'rode', 'the', 'breeze', 'clear', 'of', 'the', 'rocks', '.']
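To tie these helpers together, a small round-trip sketch (still assuming the uncased BERT tokenizer from the beginning):
>>> text = "The sailors rode the breeze clear of the rocks."
>>> tokens = tokenizer.tokenize(text)                  # str -> List[str]
>>> ids = tokenizer.convert_tokens_to_ids(tokens)      # List[str] -> List[int], no special tokens
>>> ids == tokenizer.encode(text, add_special_tokens=False)
True
>>> tokenizer.decode(ids)                              # List[int] -> str (lower-cased by the uncased model)
'the sailors rode the breeze clear of the rocks.'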