Datasets dataset class (2)
2022-06-26 14:32:00 【Live up to your youth】
Object methods (important)
1、map function
map(
function: Optional[Callable] = None,
with_indices: bool = False,
with_rank: bool = False,
input_columns: Optional[Union[str, List[str]]] = None,
batched: bool = False,
batch_size: Optional[int] = 1000,
drop_last_batch: bool = False,
remove_columns: Optional[Union[str, List[str]]] = None,
keep_in_memory: bool = False,
load_from_cache_file: bool = True,
cache_file_names: Optional[Dict[str, Optional[str]]] = None,
writer_batch_size: Optional[int] = 1000,
features: Optional[Features] = None,
disable_nullable: bool = False,
fn_kwargs: Optional[dict] = None,
num_proc: Optional[int] = None,
desc: Optional[str] = None,
)
Processes every element in the `Dataset` through a mapping function `function`. If `function` is not specified, the default is the identity function `lambda x: x`. The `batched` parameter indicates whether to process in batches, and `batch_size` sets the batch size, i.e. how many elements are processed at a time (1000 by default). The `drop_last_batch` parameter indicates whether the last batch should be dropped rather than processed when it contains fewer than `batch_size` elements.
>>> import transformers
>>> import datasets
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 8551
})
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],
...                                              padding="max_length",
...                                              truncation=True,
...                                              max_length=10),
...                       batched=True,
...                       batch_size=1000,
...                       drop_last_batch=False)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
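Two things the example above does not exercise are `drop_last_batch` and the `with_indices`/`fn_kwargs` pair. The following is a minimal sketch, not from the original article; the `tag` helper and the `prefix` keyword are made-up names, and the row count shown is what the documented behavior implies (8551 rows split into batches of 1000 leaves 8 full batches):
>>> # drop_last_batch=True drops the trailing incomplete batch of 551 examples
>>> dataset.map(lambda data: data, batched=True, batch_size=1000, drop_last_batch=True)
Dataset({
    features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 8000
})
>>> # with_indices=True passes each example's index as a second argument;
>>> # fn_kwargs supplies extra keyword arguments to the mapping function
>>> def tag(data, idx, prefix="row"):  # hypothetical helper
...     return {"tag": f"{prefix}-{idx}"}
>>> tagged = dataset.map(tag, with_indices=True, fn_kwargs={"prefix": "cola"})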
The `input_columns` parameter specifies which column(s) are passed to the function. By default it is `None` and each element is passed in as a dictionary containing all of the `Dataset`'s columns; when set, only the values of the named columns are passed in, as positional arguments. The `remove_columns` parameter specifies column(s) to remove from the result.
>>> import transformers
>>> import datasets
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 8551
})
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
# When using input_columns, note the form of the lambda: it receives the column values directly, not a dictionary
>>> dataset = dataset.map(lambda data: tokenizer(data,
...                                              padding="max_length",
...                                              truncation=True,
...                                              max_length=10),
...                       batched=True,
...                       batch_size=1000,
...                       drop_last_batch=False,
...                       input_columns=["sentence"])
>>> dataset
Dataset({
features: ['sentence', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
# Using the remove_columns parameter
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True, remove_columns=["sentence", "idx"])
>>> dataset
Dataset({
features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
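When `batched=False` (the default), the function receives a single example at a time, and any keys it returns are merged into the dataset as new columns. A minimal sketch, assuming the dataset from the previous step; the `num_tokens` column name is made up for illustration:
>>> dataset = dataset.map(lambda data: {"num_tokens": len(data["input_ids"])})
>>> dataset
Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask', 'num_tokens'],
    num_rows: 8551
})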
2、to_tf_dataset function
to_tf_dataset(
columns: Union[str, List[str]],
batch_size: int,
shuffle: bool,
collate_fn: Callable,
drop_remainder: bool = None,
collate_fn_args: Dict[str, Any] = None,
label_cols: Union[str, List[str]] = None,
dummy_labels: bool = False,
prefetch: bool = True,
)
Creates a `tf.data.Dataset` from a `datasets.Dataset` object. If `batch_size` is set, the `tf.data.Dataset` loads one batch of data at a time from the `datasets.Dataset`, and each batch is a dictionary whose keys come from the `columns` parameter.
The `columns` parameter specifies the key names of the generated data; valid values are one or more of the `datasets.Dataset`'s features. `batch_size` sets the size of each batch, `shuffle` indicates whether to shuffle the data, and `collate_fn` is the function used to assemble multiple examples into a batch.
The `drop_remainder` parameter indicates whether to drop the last incomplete batch when loading, ensuring that all batches produced by the dataset have the same length in the batch dimension. `label_cols` specifies the dataset column(s) to load as labels. `prefetch` indicates whether to run the data loader in a separate thread and keep a small buffer of batches ready, improving performance by letting data load in the background while the model trains.
>>> import transformers
>>> import datasets
>>> dataset = datasets.load_dataset("glue", "cola", split="train")
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 8551
})
>>> tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = dataset.map(lambda data: tokenizer(data["sentence"],padding=True), batched=True)
>>> dataset
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 8551
})
>>> data_collator = transformers.DataCollatorWithPadding(tokenizer, return_tensors="tf")
>>> dataset = dataset.to_tf_dataset(columns=["label", "input_ids"], batch_size=16, shuffle=False, collate_fn=data_collator)
>>> dataset
<PrefetchDataset element_spec={'input_ids': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'labels': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>
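The result is an ordinary `tf.data.Dataset`, so a batch can be pulled out and inspected directly. A minimal sketch; the keys follow the `element_spec` shown above (note that the collator renames `label` to `labels`):
>>> batch = next(iter(dataset))
>>> list(batch.keys())
['input_ids', 'attention_mask', 'labels']
>>> int(batch["input_ids"].shape[0])  # matches batch_size=16
16
If `label_cols` is set instead, the generated dataset yields `(features, labels)` tuples, which is the form that Keras's `Model.fit` expects.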