Data Collator — transformers 4.7.0 documentation
Data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

Parameters:
- tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) – The tokenizer used for encoding the data.
- mlm (bool, optional, defaults to True) – Whether or not to use masked language modeling.
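The behavior described above can be sketched in plain Python, without the library itself: pad every sequence to the longest one in the batch, then (when masked language modeling is on) replace a fraction of real tokens with the mask token and record labels only at masked positions. The `PAD_ID`, `MASK_ID`, and `mlm_probability` values here are illustrative assumptions, not the collator's actual implementation.

```python
import random

PAD_ID = 0     # assumed pad token id (BERT's convention)
MASK_ID = 103  # assumed [MASK] token id (BERT's convention)

def collate_mlm(batch, mlm_probability=0.15, seed=0):
    """Sketch of a language-modeling collator: dynamically pad each
    sequence to the batch maximum, then mask a fraction of the real
    tokens, keeping labels only where a token was masked."""
    rng = random.Random(seed)
    max_len = max(len(seq) for seq in batch)
    input_ids, labels = [], []
    for seq in batch:
        padded = seq + [PAD_ID] * (max_len - len(seq))
        row_labels = [-100] * max_len  # -100 positions are ignored by the loss
        for i in range(len(seq)):     # never mask padding positions
            if rng.random() < mlm_probability:
                row_labels[i] = padded[i]
                padded[i] = MASK_ID
        input_ids.append(padded)
        labels.append(row_labels)
    return {"input_ids": input_ids, "labels": labels}
```

With `mlm=False` the real collator instead copies the (padded) inputs into the labels for causal language modeling; the sketch covers only the masked case.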
Data Collator - Hugging Face
Data collator used for language modeling that masks entire words:
- collates batches of tensors, honoring their tokenizer's pad_token
- preprocesses batches for masked language modeling

This collator relies on details of the implementation of subword tokenization by BertTokenizer, specifically that subword tokens are prefixed with ##.
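The `##` convention the collator relies on can be illustrated with a small helper: group subword token indices into whole-word spans, so that masking one word masks all of its subword pieces together. This is a hypothetical sketch of the grouping step, not the collator's own code.

```python
def whole_word_spans(tokens):
    """Group BERT-style subword tokens into whole-word index spans.
    A token starting with '##' continues the preceding word, so it
    joins that word's span; any other token starts a new span."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)
        else:
            spans.append([i])
    return spans

# Masking the word "unaffordable" would mask indices 0, 1, and 2 together.
print(whole_word_spans(["un", "##afford", "##able", "car"]))
# → [[0, 1, 2], [3]]
```

Because the grouping keys on the `##` prefix, this collator only works with tokenizers that use that convention (e.g. BertTokenizer); tokenizers with other continuation markers would need a different check.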