Release notes
Published 1/25/2023
Minor release. Contains breaking changes.

GenerationConfig

The generate method has multiple arguments whose defaults previously lived in the model config. We have now decoupled these into a separate generation config, which makes it easier to store different sets of parameters for a given model, each corresponding to a different generation strategy. While we will keep supporting generate arguments in the model configuration for the foreseeable future, it is now recommended to use a generation config. You can learn more about its uses and its documentation in the Transformers docs.
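The new workflow can be sketched as follows, using the public GenerationConfig API; the final model call is shown only as a comment, since it requires loading actual model weights:

```python
import tempfile
from transformers import GenerationConfig

# Define a reusable set of generation parameters, decoupled from the model config.
gen_config = GenerationConfig(max_new_tokens=50, do_sample=True, top_p=0.9)

# The config can be saved and reloaded independently of any model weights,
# so different generation strategies for one model can be stored side by side.
with tempfile.TemporaryDirectory() as tmp:
    gen_config.save_pretrained(tmp)  # writes generation_config.json
    reloaded = GenerationConfig.from_pretrained(tmp)

# At generation time, pass it explicitly:
# outputs = model.generate(**inputs, generation_config=gen_config)
```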
- GenerationConfig as the basis for .generate() parametrization by @gante in #20388
- GenerationConfig as the basis for .generate() parametrization by @gante in #20994
- GenerationConfig as the basis for .generate() parametrization by @gante in #21007

ImageProcessor

In the vision integration, all feature extractor classes have been deprecated and renamed to ImageProcessor. The old feature extractors will be fully removed in version 5 of Transformers, and new vision models will only implement the ImageProcessor class, so be sure to switch your code to the new name sooner rather than later!
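As a minimal sketch of the renamed classes, using ViTImageProcessor as one concrete example (instantiating it with defaults requires no download; a dummy NumPy array stands in for a real image):

```python
import numpy as np
from transformers import ViTImageProcessor

# ViTImageProcessor replaces the deprecated ViTFeatureExtractor.
processor = ViTImageProcessor()

# A dummy 224x224 RGB image; image processors accept NumPy arrays directly.
image = np.zeros((224, 224, 3), dtype=np.uint8)

# Resizes, rescales, and normalizes, returning channel-first pixel values.
inputs = processor(images=image, return_tensors="np")
print(inputs["pixel_values"].shape)
```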
AltCLIP is a variant of CLIP obtained by swapping the text encoder for a pretrained multilingual text encoder (XLM-RoBERTa). Its performance is very close to CLIP's on almost all tasks, and it extends the original CLIP's capabilities to multilingual understanding.
BLIP is a model that is able to perform various multi-modal tasks including visual question answering, image-text retrieval (image-text matching) and image captioning.
BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.
BiT is a simple recipe for scaling up pre-training of ResNet-like architectures (specifically, ResNetv2). The method results in significant improvements for transfer learning.
EfficientFormer proposes a dimension-consistent pure transformer that can be run on mobile devices for dense prediction tasks like image classification, object detection and semantic segmentation.
GIT is a decoder-only Transformer that leverages CLIP’s vision encoder to condition the model on vision inputs besides text. The model obtains state-of-the-art results on image captioning and visual question answering benchmarks.
GPT-Sw3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.
Graphormer is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention mechanism.
Mask2Former is a unified framework for panoptic, instance and semantic segmentation and features significant performance and efficiency improvements over MaskFormer.
OneFormer is a universal image segmentation framework that can be trained on a single panoptic dataset to perform semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference.
The RoBERTa-PreLayerNorm model is identical to RoBERTa but uses the --encoder-normalize-before flag in fairseq.
Swin2SR improves the SwinIR model by incorporating Swin Transformer v2 layers, which mitigates issues such as training instability, resolution gaps between pre-training and fine-tuning, and data hunger.
TimeSformer is the first video transformer. It inspired many transformer-based video understanding and classification papers.
UPerNet is a general framework to effectively segment a wide range of concepts from images, leveraging any vision backbone like ConvNeXt or Swin.
ViT hybrid is a slight variant of the plain Vision Transformer, by leveraging a convolutional backbone (specifically, BiT) whose features are used as initial “tokens” for the Transformer. It’s the first architecture that attains similar results to familiar convolutional architectures.
Breaking the one-model-per-file policy a bit, we introduce backbones (mainly for vision models), which can then be re-used in more complex models like DETR, MaskFormer, Mask2Former, etc.
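A minimal sketch of the backbone idea, assuming ResNetBackbone as the concrete class (a randomly initialized backbone is built here so no weights are downloaded; out_features selects which stages' feature maps are returned):

```python
import torch
from transformers import ResNetConfig, ResNetBackbone

# Build a randomly initialized backbone; stage names follow the
# "stem", "stage1", ..., "stage4" convention.
config = ResNetConfig(out_features=["stage2", "stage4"])
backbone = ResNetBackbone(config)

pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    outputs = backbone(pixel_values)

# One feature map per requested stage; downstream models such as DETR or
# Mask2Former consume these multi-scale features.
print([tuple(fm.shape) for fm in outputs.feature_maps])
```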
- BeitDropPath layers by @younesbelkada in #20587
- natten with CUDA version by @ydshieh in #20546
- FEATURE_EXTRACTOR_MAPPING_NAMES by @ydshieh in #20551
- require_torch to 2 pipeline tests by @ydshieh in #20585
- tensorflow_probability for TF pipeline CI by @ydshieh in #20586
- classifier_dropout in config by @ydshieh in #20596
- set-output by $GITHUB_OUTPUT by @ydshieh in #20547
- .to function for ImageProcessors by @younesbelkada in #20536
- AutomaticSpeechRecognitionPipelineTests.run_pipeline_test by @ydshieh in #20597
- natten installation in docker file by @ydshieh in #20632
- 8bit models by @younesbelkada in #20651
- [ViTHybrid] + [BiT] cleaner __init__ by @younesbelkada in #20649
- run_pipeline_test by @ydshieh in #20623
- dpt-hybrid support by @younesbelkada in #20645
- [BiT] Small patch fix by @younesbelkada in #20657
- BackboneMixin by @ydshieh in #20660
- [ViTHybrid] Fix accelerate slow tests by @younesbelkada in #20679
- test_tokenization_led by @IMvision12 in #20568
- test_multi_gpu_data_parallel_forward for MaskFormerSwinModelTest by @ydshieh in #20688
- [ViTHybrid] fix last accelerate slow test by @younesbelkada in #20705
- accelerate support for LongT5 models by @pszemraj in #20341
- AutoModelTest.test_model_from_pretrained by @ydshieh in #20730
- layoutlm_job to exotic_models_job by @ydshieh in #20736
- keep_in_fp32_modules support by @younesbelkada in #20683
- torch_tensorrt in DeepSpeed CI image for now by @ydshieh in #20758
- () in some usage of is_flaky by @ydshieh in #20749
- torch-tensorrt 1.3.0 for DeepSpeed CI by @ydshieh in #20764
- pipeline test by @younesbelkada in #20778
- layoutlm by @Narsil in #20776
- apex in DeepSpeed CI image by @ydshieh in #20788
- IMAGE_PROCESSOR_MAPPING by @younesbelkada in #20790
- sentencepiece in DeepSpeed CI image by @ydshieh in #20795
- [Vision] [Refactor] Initialize weights on the correct place by @younesbelkada in #20803
- max_position_embeddings in config classes by @ydshieh in #20836
- use_cache in config classes by @ydshieh in #20844
- use_fast parameter in docstring by @stevhliu in #20840
- config.num_channels in CLIP-like modeling files by @ydshieh in #20857
- evaluate to the list of libraries required in generated notebooks by @MKhalusova in #20850
- LevitModelTest.test_problem_types by @ydshieh in #20859
- HubertModelIntegrationTest.test_inference_keyword_spotting by @ydshieh in #20863
- [FSMT] Make it compatible with xxxForConditionalGeneration models by @younesbelkada in #20825
- [MobileNet-v2] Fix ONNX typo by @younesbelkada in #20860
- fp16 for asr pipeline by @Narsil in #20864
- [T5] fix fp16 loading issue by @younesbelkada in #20878
- WhisperFeatureExtractor by @bofenghuang in #20936
- [distributed_concat] ensure all_gather's inputs are contiguous by @stas00 in #20951
- AutomaticSpeechRecognitionPipeline by @bofenghuang in #20952
- _reorder_cache by @gante in #20964
- MinNewTokensLengthLogitsProcessor for .generate method #20814 by @kotikkonstantin in #20892
- decoder_attention_mask in generate function by @samuelpullely in #20726
- [BLIP] Fix daily CI failing test by @younesbelkada in #20877
- past with past_key_values by @ArthurZucker in #20944
- documentation_tests.txt by @ydshieh in #21036
- torchscript tests for AltCLIP by @ydshieh in #21102
- min_new_tokens argument in generate() (implementation based on MinNewTokensLengthLogitsProcessor) by @silverriver in #21044
- RealmModelIntegrationTest.test_inference_open_qa by @ydshieh in #21136
- TFTapasEmbeddings by @ydshieh in #21107
- use_cache from model_kwargs by @gante in #21149
- installation.mdx to Korean by @wonhyeongseo in #20948
- test_save_pretrained_signatures slow test by @ydshieh in #21105
- blip support for training by @younesbelkada in #21021
- Mask2FormerForUniversalSegmentation by @ydshieh in #21175
- UperNetModelIntegrationTest by @ydshieh in #21192
- [CVT] Fix module initialization issue by @younesbelkada in #21193
- automatic-speech-recognition asr for Whisper by @Narsil in #21196
- huggingface_hub version by @ydshieh in #21212
- CONFIG_ARCHIVE_MAP_MAPPING_NAMES by @ydshieh in #21207
- GPTJ doctest by @ydshieh in #21213
- parallelism for CircleCI jobs work - but keep it 1 for now by @ydshieh in #21157
- [BLIP] fix docstring for BlipTextxxx by @younesbelkada in #21224
- [BLIP] fix doctest by @younesbelkada in #21217
- .save_pretrained() by @gante in #21264

The following contributors have made significant changes to the library over the last release: