Transformers v4.9.0 release notes
Published 7/22/2021
Minor release. Contains breaking changes.
This version introduces a new package, transformers.onnx, which can be used to export models to ONNX. In contrast to the previous implementation, this approach is designed as an easily extensible package where users may define their own ONNX configurations and export the models they wish to export.
python -m transformers.onnx --model=bert-base-cased onnx/bert-base-cased/
Validating ONNX model...
-[✓] ONNX model outputs' name match reference model ({'pooler_output', 'last_hidden_state'})
- Validating ONNX Model output "last_hidden_state":
-[✓] (2, 8, 768) matches (2, 8, 768)
-[✓] all values close (atol: 0.0001)
- Validating ONNX Model output "pooler_output":
-[✓] (2, 768) matches (2, 768)
-[✓] all values close (atol: 0.0001)
All good, model saved at: onnx/bert-base-cased/model.onnx
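The validation step in the log above compares each exported ONNX output against the reference model's output element-wise, within an absolute tolerance (atol: 0.0001). A minimal stdlib-only sketch of that kind of check; `outputs_close` is a hypothetical helper for illustration, not the actual transformers.onnx validation code:

```python
import math

def outputs_close(reference, exported, atol=1e-4):
    """Return True if two flat lists of floats match element-wise within atol."""
    if len(reference) != len(exported):
        return False  # shape mismatch, analogous to the "(2, 8, 768) matches" check
    return all(math.isclose(r, e, abs_tol=atol) for r, e in zip(reference, exported))

# Toy values: differences smaller than the tolerance are accepted.
ref = [0.12345, -0.98765, 0.5]
onnx_out = [0.123449, -0.987651, 0.50009]
print(outputs_close(ref, onnx_out))           # True: all diffs below 1e-4
print(outputs_close(ref, [0.2, -0.9, 0.5]))   # False: first diff is ~0.08
```

The real validation runs the reference model and an ONNX Runtime session on the same inputs and applies this comparison to each named output tensor.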
Four new models are released as part of the CANINE implementation: CanineForSequenceClassification, CanineForMultipleChoice, CanineForTokenClassification and CanineForQuestionAnswering, in PyTorch.
The CANINE model was proposed in CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It’s among the first papers that train a Transformer without using an explicit tokenization step (such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece). Instead, the model is trained directly at a Unicode character level. Training at a character level inevitably comes with a longer sequence length, which CANINE solves with an efficient downsampling strategy, before applying a deep Transformer encoder.
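The pipeline described above can be illustrated with a toy sketch: map text directly to Unicode code points (no tokenizer), then shrink the resulting long sequence with strided pooling before it would reach the deep encoder. This is a simplification for intuition only; CANINE's actual downsampling uses learned convolutions over character embeddings, and the rate of 4 here is an illustrative choice:

```python
def to_codepoints(text):
    """Tokenization-free input: each Unicode character becomes its code point."""
    return [ord(ch) for ch in text]

def downsample(seq, rate=4):
    """Strided mean pooling: one value per window of `rate` positions."""
    return [sum(seq[i:i + rate]) / len(seq[i:i + rate])
            for i in range(0, len(seq), rate)]

codepoints = to_codepoints("hello world!")
print(len(codepoints))               # 12 positions, one per character
print(len(downsample(codepoints)))   # 3 positions after 4x downsampling
```

The point of the sketch: character-level input makes sequences roughly 4-5x longer than subword tokenization would, and downsampling recovers a sequence length the deep Transformer stack can afford.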
Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=canine
This version introduces a new method to train a tokenizer from scratch based on the configuration of an existing tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
# We train on batches of texts, 1000 at a time here.
batch_size = 1000
corpus = (dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size))
tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=20000)
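train_new_from_iterator only needs an iterator yielding batches of texts, so the batching pattern in the snippet above works for any corpus, not just a datasets split. A stdlib-only sketch of that same pattern, with a toy list standing in for the wikitext split:

```python
def batched_texts(texts, batch_size=1000):
    """Yield successive lists of up to `batch_size` texts, like the generator above."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

toy_corpus = [f"line {n}" for n in range(2500)]
batches = list(batched_texts(toy_corpus))
print(len(batches))      # 3 batches: 1000 + 1000 + 500
print(len(batches[-1]))  # 500 (the final, partial batch)
```

Feeding texts in batches rather than one string at a time keeps the underlying (Rust-backed) tokenizer trainer fast, since each call crosses the Python boundary once per thousand texts instead of once per text.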
The TFTrainer is now deprecated and replaced by Keras. Version v4.9.0 marks the end of a long rework of the TensorFlow examples, making them more Keras-idiomatic, clearer, and more robust.
HuBERT is now implemented in TensorFlow.
When load_best_model_at_end was set to True in the TrainingArguments, having a different save_strategy and eval_strategy was accepted, but the save_strategy was silently overwritten by the eval_strategy (keeping track of the best model requires an evaluation each time there is a save). This led to a lot of confusion, with users not understanding why the script was not doing what it was told. This situation now raises an error asking you to set save_strategy and eval_strategy to the same value; when that value is "steps", save_steps must be a round multiple of eval_steps.
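The new consistency rule boils down to two conditions: the strategies must match, and under the "steps" strategy the save interval must divide evenly by the evaluation interval. A hypothetical stand-alone sketch of that rule, for illustration only (not the actual Trainer validation code):

```python
def check_best_model_args(load_best_model_at_end, save_strategy, eval_strategy,
                          save_steps=None, eval_steps=None):
    """Raise ValueError for argument combinations that v4.9.0 now rejects."""
    if not load_best_model_at_end:
        return  # no constraint when best-model tracking is off
    if save_strategy != eval_strategy:
        raise ValueError(
            "load_best_model_at_end requires save_strategy == eval_strategy")
    if save_strategy == "steps" and save_steps % eval_steps != 0:
        raise ValueError("save_steps must be a round multiple of eval_steps")

# Accepted: every save (step 500, 1000, ...) coincides with an evaluation.
check_best_model_args(True, "steps", "steps", save_steps=500, eval_steps=100)
```

With save_steps=500 and eval_steps=100, every checkpoint at steps 500, 1000, ... has a fresh evaluation to rank it by; save_steps=250 with eval_steps=100 would not, and is rejected.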
- --log_level feature #12365 (@bhadreshpsavani)
- print statement with logger.info in QA example utils #12368 (@bhadreshpsavani)
- einsum in Albert's attention computation #12394 (@mfuntowicz)
- push_to_hub #12391 (@patrickvonplaten)
- Repository import to the FLAX example script #12501 (@LysandreJik)
- model_kwargs when loading a model in pipeline() #12449 (@aphedges)
- _mask_hidden_states to avoid double masking #12692 (@mfuntowicz)
- config.mask_feature_prob > 0 #12705 (@mfuntowicz)
- list type of additional_special_tokens in special_token_map #12759 (@SaulLu)
- cls and checkpoint #12619 (@europeanplaice)
- datasets_modules ImportError with Ray Tune #12749 (@Yard1)
- save_steps=0|None and logging_steps=0 #12796 (@stas00)