release notes
Published 3/15/2023
Minor release. Contains breaking changes.

BridgeTower: the goal of this model is to build a bridge between each uni-modal encoder and the cross-modal encoder, enabling comprehensive and detailed interaction at each layer of the cross-modal encoder. It achieves remarkable performance on various downstream tasks with almost negligible additional parameters and computational costs.
The Whisper model was integrated a few releases ago. This release brings significant performance optimizations when generating with timestamps, made possible by rewriting Whisper's generate() function, which now uses the generation_config and implements batched timestamp prediction. The language and task can now also be set when calling generate(). For more details about this refactoring, check out this colab.
Notably, Whisper is now also supported in Flax 🚀 thanks to @andyehrenberg! More Whisper-related commits:
- WhisperModelTest by @ydshieh in #21883
- do_normalize by @ArthurZucker in #21263
- model_split_percents for WhisperModelTest by @ydshieh in #21922
- WhisperFeatureExtractor by @bofenghuang in #21938
- WhisperEncoderModelTest by @ydshieh in #22060
- [Whisper] add get_input_embeddings to WhisperForAudioClassification by @younesbelkada in #22133

DETA (short for Detection Transformers with Assignment) improves Deformable DETR by replacing the one-to-one bipartite Hungarian matching loss with the one-to-many label assignments used in traditional detectors with non-maximum suppression (NMS). This leads to significant gains of up to 2.5 mAP.
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.
XLM-V is a multilingual language model with a one-million-token vocabulary, trained on 2.5TB of data from Common Crawl (the same as XLM-R).
BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon Flamingo, an 80 billion parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.
X-MOD extends multilingual masked language models like XLM-R to include language-specific modular components (language adapters) during pre-training. For fine-tuning, the language adapters in each transformer layer are frozen.
ERNIE-M is a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance.
The Textless Vision-Language Transformer (TVLT) is a model that uses raw visual and audio inputs for vision-and-language representation learning, without using text-specific modules such as tokenization or automatic speech recognition (ASR). It can perform various audiovisual and vision-language tasks like retrieval, question answering, etc.
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed to predict the most relevant text snippet for a given audio clip, without directly optimizing for the task. The CLAP model uses a Swin Transformer to extract audio features from a log-Mel spectrogram input, and a RoBERTa model to extract text features. Both the text and audio features are then projected into a latent space of identical dimension. The dot product between the projected audio and text features is used as a similarity score.
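The scoring step described above can be illustrated in plain Python. This is a toy sketch with made-up three-dimensional embeddings (real CLAP projections are much higher-dimensional); the helper name and values are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Dot product of L2-normalized vectors: the audio-text similarity score."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Hypothetical projected embeddings for one audio clip and two text snippets.
audio_embed = [0.2, 0.9, 0.1]
text_embeds = {
    "a dog barking": [0.1, 0.95, 0.05],
    "rain falling": [0.9, 0.1, 0.3],
}

# Pick the most relevant text snippet for the audio clip.
best = max(text_embeds, key=lambda t: cosine_similarity(audio_embed, text_embeds[t]))
print(best)  # prints "a dog barking"
```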
- [CLAP] Fix few broken things by @younesbelkada in #21670

GPTSAN is a Japanese language model using Switch Transformer. It has the same structure as the Prefix LM model introduced in the T5 paper, and supports both text generation and masked language modeling tasks. It can similarly be fine-tuned for translation or summarization.
EfficientNets are a family of image classification models that achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.
ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.
Informer is a method for long-sequence time-series forecasting. It introduces a Probabilistic Attention mechanism that selects the "active" queries rather than the "lazy" queries, yielding a sparse Transformer that mitigates the quadratic compute and memory requirements of vanilla attention.
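The active/lazy distinction can be sketched with a simplified query-selection step: rank queries by how far their attention scores are from uniform, and keep only the top few. This is a heavily simplified toy, not Informer's actual implementation; the max-minus-mean measurement and all numbers below are illustrative.

```python
def sparsity_measurement(scores):
    """Max-minus-mean proxy for how far a query's scores are from uniform."""
    return max(scores) - sum(scores) / len(scores)

def select_active_queries(score_matrix, top_u):
    """Keep only the top_u queries with the most concentrated attention scores."""
    ranked = sorted(range(len(score_matrix)),
                    key=lambda i: sparsity_measurement(score_matrix[i]),
                    reverse=True)
    return sorted(ranked[:top_u])

# Rows: per-query unnormalized attention scores over 4 keys (made-up numbers).
scores = [
    [0.1, 0.1, 0.1, 0.1],  # "lazy" query: near-uniform scores
    [5.0, 0.1, 0.1, 0.1],  # "active" query: sharply peaked scores
    [0.2, 0.2, 0.3, 0.2],  # nearly uniform
]
print(select_active_queries(scores, top_u=1))  # prints [1]
```

Only the selected queries would get full attention computed; the rest fall back to a cheap approximation, which is where the sub-quadratic cost comes from.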
safetensors is a safe serialization format for tensors, which has been supported in transformers as a first-class citizen for the past few versions.
This change adds the ability to explicitly force the from_pretrained method to use (or not use) safetensors. This unlocks a few use cases, notably the possibility of loading only from this format, limiting security risks.
Example of usage:
from transformers import AutoModel
# As of version v4.27.0, this loads the `pytorch_model.bin` by default if `safetensors` is not installed.
# It loads the `model.safetensors` file if `safetensors` is installed.
model = AutoModel.from_pretrained('bert-base-cased')
# This forces the load from the `model.safetensors` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=True)
# This forces the load from the `pytorch_model.bin` file.
model = AutoModel.from_pretrained('bert-base-cased', use_safetensors=False)
This PR adds a "variant" keyword argument to PyTorch's from_pretrained and save_pretrained so that multiple weight variants can be saved in the model repo.
Example of usage with the model hosted in this folder on the Hub:
from transformers import CLIPTextModel
path = "huggingface/the-no-branch-repo" # or ./text_encoder if local
# Loads the `fp16` variant. This loads the `pytorch_model.fp16.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder", variant="fp16")
# This loads the no-variant checkpoint, loading the `pytorch_model.bin` file from this folder.
model = CLIPTextModel.from_pretrained(path, subfolder="text_encoder")
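The naming scheme behind the variant argument can be summarized with a small illustrative helper (this is not the library's internal function, just a sketch of the convention shown above: the variant is inserted between the base weights name and the extension).

```python
def weights_file_name(variant=None, base="pytorch_model", ext="bin"):
    """Build the checkpoint file name for a given variant, e.g. pytorch_model.fp16.bin."""
    if variant is None:
        return f"{base}.{ext}"
    return f"{base}.{variant}.{ext}"

print(weights_file_name())        # prints "pytorch_model.bin"
print(weights_file_name("fp16"))  # prints "pytorch_model.fp16.bin"
```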
The bitsandbytes integration has been overhauled and now offers a new configuration class: BitsAndBytesConfig.
Read more about it in the documentation.
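A configuration sketch of the new API is shown below. This is not runnable without a CUDA GPU and the bitsandbytes package, and the checkpoint and parameter values are illustrative only; see the documentation for the authoritative options.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 8-bit quantization settings.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold for mixed int8/fp16 matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map="auto",
    quantization_config=quantization_config,
)
```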
- [bnb] Introducing BitsAndBytesConfig by @younesbelkada in #21579
- [bnb] fix bnb decoders bug by @younesbelkada in #21688

This PR enables the user to make use of the PyTorch/XLA implementation of FSDP, including the newly added auto-wrap feature. Four arguments have been added to training_args.py to facilitate this functionality:

- xla_fsdp: a string containing the location of a .json file which specifies the FSDP arguments the user wants to use when wrapping their model.
- xla_fsdp_min_num_params: an int which sets a size-based automatic wrapping policy: any module with at least xla_fsdp_min_num_params parameters is FSDP-wrapped.
- xla_fsdp_transformer_layer_cls_to_wrap: a list of (case-sensitive) strings which sets a layer-class-based automatic wrapping policy: any module whose name matches one of the listed strings is FSDP-wrapped.
- xla_fsdp_grad_ckpt: a bool which determines whether gradient checkpointing is enabled for the automatically wrapped layers.

Generate
This PR standardizes beam search behavior across all three frameworks through early_stopping. PyTorch is unchanged, but TensorFlow and Flax users will see a significant speedup if they keep the default generation parameters.
There are, however, minor differences in the outputs of the .generate method with beam search on TensorFlow and Flax. These differences should be very small and come with significant speedups, but if they break your workflow, we recommend downgrading to a previous version and letting us know in a GitHub issue so that we can investigate.
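The early_stopping semantics being standardized can be sketched with a toy stopping check. This is a simplified illustration, not the library's implementation: the function name, signature, and the length-penalty heuristic shown are assumptions made for the example.

```python
def should_stop(finished_scores, running_scores, num_beams,
                early_stopping, cur_len, length_penalty=1.0):
    """Toy beam-search stopping rule illustrating early_stopping semantics."""
    if len(finished_scores) < num_beams:
        # Not enough finished hypotheses yet; keep searching.
        return False
    if early_stopping:
        # early_stopping=True: stop as soon as num_beams hypotheses are finished.
        return True
    # early_stopping=False: stop only when no running beam can still beat
    # the worst finished hypothesis (scores are sums of log-probs, so negative).
    worst_finished = min(finished_scores)
    best_possible = max(running_scores) / (cur_len ** length_penalty)
    return best_possible <= worst_finished
```

With two finished beams out of num_beams=2, early_stopping=True stops immediately, while early_stopping=False keeps going as long as a running beam could still achieve a better length-penalized score.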
Single model initialization
Model initialization had issues which made it incoherent across models and across initialization techniques. This is technically a bugfix, but as it may result in your models being initialized with different values, we think it best to highlight it here.
This PR deprecates the parallelize API, which was replaced by accelerate months ago. We recommend loading the model with the device_map attribute set to balanced to obtain the previous behavior.
Setting your own device_map is still permitted, but it needs to be a dictionary from module name to device, for example:
device_map = {'h.0': 0, 'h.1': 1, ...}
A new pipeline focused on zero-shot audio classification is added to the repository.
The task and model summaries have been refactored to take into account the larger number of tasks and models we now have.
- [t5] Fix T5 inference in float16 + bnb error by @younesbelkada in #21281
- TrainingArguments.label_names docs to reflect the correct default value behaviour by @fredtcaroli in #21288
- ImageProcessor in place of FeatureExtractor for pipelines by @Narsil in #20851
- oneformer by @Narsil in #21292
- EfficientFormer by @ydshieh in #21294
- OneFormerModelIntegrationTest expected values by @ydshieh in #21295
- Blenderbot doctest by @younesbelkada in #21297
- past in prepare inputs for generation by @ArthurZucker in #21296
- model_class.__name__ and compare against XXX_MAPPING_NAMES by @ydshieh in #21304
- utils/documentation_tests.txt by @ydshieh in #21315
- TFEncoderDecoder tests by @ydshieh in #21301
- compute_transition_scores examples by @gante in #21323
- Perceiver doctest by @younesbelkada in #21318
- RobertaPreLayerNorm doctest by @ydshieh in #21337
- GitModelIntegrationTest.test_batched_generation device issue by @ydshieh in #21362
- max_length and max_new_tokens coexistence by @gante in #21347
- [run_(clm|mlm).py examples] add streaming dataset support by @stas00 in #21343
- layer_norm_eps in some models by @ydshieh in #21336
- max_position_embeddings or max_target_positions by @gante in #21389
- Graphormer and fix its torchscript test failures by @ydshieh in #21380
- inputs_embeds by @gante in #21405
- 1.13.1 in push/schedule CI by @ydshieh in #21421
- [bnb] Fine-tuning HF 8-bit models by @younesbelkada in #21290
- is_flaky by @ydshieh in #21426
- inputs_embeds support for .generate() with BLOOM models by @akreal in #21430
- ConvBertModelTest test by @ydshieh in #21438
- SpeechT5ForSpeechToSpeechIntegrationTests device issue by @ydshieh in #21460
- PushToHubCallback import in Share a model docs by @ireneisdoomed in #21457
- more_itertools dependency by @Narsil in #21473
- prepare_inputs_for_generation by @gante in #21477
- past in favor of past_key_values by @ArthurZucker in #21443
- [Doc] Fix int8 docs by @younesbelkada in #21487
- GPT2TokenizerFast to the list of tokenizers to use for OPT by @ArthurZucker in #20823
- compute_transition_scores by @gante in #21341
- report_to none by @stas00 in #21505
- image_processor in pipeline by @Narsil in #21513
- eos_token_ids in model.generate(...) by @tokestermw in #21461
- __len__ method to _LazyAutoMapping by @ydshieh in #21522
- .generate() signature == PT .generate() signature by @gante in #21525
- .generate() can now be exported with dynamic length by @gante in #21474
- [pipeline] A simple fix for half-precision & 8bit models by @younesbelkada in #21479
- torch_dtype="auto" to look up config.torch_dtype first, expand docs by @stas00 in #21524
- config.hidden_size by @stas00 in #21504
- [Blip2] Add int8 support for blip2-flan-t5-xxl by @younesbelkada in #21574
- inputs_embeds support when generating with GPT-J by @dimitry12 in #21575
- test_constrained_beam_search_generate_dict_output by @gante in #21561
- [bnb] Let's make the daily CI green 🍏 by @younesbelkada in #21597
- requires_grad on input embedding to train on top of frozen layers by @younesbelkada in #21598
- "max_length is reached." from InfNaNLogitsProcessor documentation by @mmcdermott in #21634
- [ImageProcessor] Refactor default mean & std to OPENAI_CLIP_MEAN & OPENAI_CLIP_STD by @younesbelkada in #21425
- [BLIP] update blip path on slow tests by @younesbelkada in #21476
- PROCESSOR_MAPPING_NAMES and add tests by @ydshieh in #21703
- get_class_in_module by @ydshieh in #21709
- [MBart] Fix cross attention mask check by @younesbelkada in #21730
- gptsan_japanese from doctest list to avoid GPU OOM by @ydshieh in #21722
- BigBirdForQuestionAnswering by @ydshieh in #21723
- ErnieMEmbeddings device issue by @ydshieh in #21726
- GPTSanJapaneseModel by @ydshieh in #21731
- [GPTNeo] Fix gradient checkpointing bug by @younesbelkada in #21733
- max_length and num_beams by @bofenghuang in #21740
- concrete_args from outside available by @lygztq in #21775
- [tests] add accelerate marker by @younesbelkada in #21743
- PerceiverFourierPositionEncoding with fp16 by @fxmarty in #21787
- ruff==0.0.253 by @ydshieh in #21828
- logger.warning_once and use it for grad checkpointing code by @stas00 in #21804
- MobileViTModelTest to TFMobileViTModelTest by @ydshieh in #21825
- [T5] Fix torchquant issue by @younesbelkada in #21843
- [Blip2] Add Blip2Model by @younesbelkada in #21817
- [Blip2] Fix Blip-2 multi gpu by @younesbelkada in #21707
- PipelineTestCaseMeta 🚀 by @ydshieh in #21516
- [Blip] Fix blip doctest by @younesbelkada in #21868
- test_load_default_pipelines_pt for ClapModel by @ydshieh in #21886
- inputs_embeds functionality when generating with BioGPT by @sidkiblawi in #21889
- d_kv by @ArthurZucker in #21896
- BridgeTowerModelTest by @ydshieh in #21908
- repo_utils_job by @ydshieh in #21928
- AlignModelTest tests by @ydshieh in #21923
- check_repo.py due to missing backends by @ydshieh in #21930
- XLMProphetNetModelIntegrationTest by @ydshieh in #21957
- torch.allclose for some tests by @ydshieh in #21966
- test_xglm_sample by @ydshieh in #21975
- Jukebox tests by @ydshieh in #21984
- notification_service.py by @ydshieh in #21992
- test_multi_gpu_data_parallel_forward for some model tests by @ydshieh in #21991
- AudioClassificationPipelineTests::test_small_model_pt for PT 2.0.0 by @ydshieh in #22023
- [bnb] Fix bnb error message by @younesbelkada in #22026
- text_config_dict and vision_config_dict being saved for CLIP-like models by @ydshieh in #22035
- BridgeTower tests slow for now by @ydshieh in #22039
- huggingface_hub warnings in CI report by @ydshieh in #22054
- image_processing_donut to match code by @vermouthmjl in #22033
- [Blip2] skip accelerate test by @younesbelkada in #22124
- is_pipeline_test_to_skip to specific model test classes by @ydshieh in #21999
- --optim adamw_torch_fused for pt-2.0+ by @stas00 in #22144

The following contributors have made significant changes to the library over the last release: