Release notes
Published 6/27/2024
Minor release. Contains breaking changes.

The Gemma2 model was proposed in Gemma2: Open Models Based on Gemini Technology and Research by the Gemma2 team, Google. Gemma2 models are trained on 6T tokens and released in two versions, 2B and 7B.
The abstract from the paper is the following:
This work introduces Gemma2, a new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma2 outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of our model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.
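A minimal generation sketch; the checkpoint id below is an assumption, so substitute the id of the released Gemma2 checkpoint you want from the Hub:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b"  # assumed checkpoint id; check the Hub for the released ones
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```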
The RT-DETR model was proposed in DETRs Beat YOLOs on Real-time Object Detection by Wenyu Lv, Yian Zhao, Shangliang Xu, Jinman Wei, Guanzhong Wang, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu.
RT-DETR is an object detection model that stands for “Real-Time DEtection Transformer.” This model is designed to perform object detection tasks with a focus on achieving real-time performance while maintaining high accuracy. Leveraging the transformer architecture, which has gained significant popularity in various fields of deep learning, RT-DETR processes images to identify and locate multiple objects within them.
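A hedged inference sketch; the class and checkpoint names below are assumptions based on this release, so verify them against the model docs:

```python
import requests
import torch
from PIL import Image
from transformers import RTDetrImageProcessor, RTDetrForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Assumed checkpoint id for the release; substitute the one from the docs.
processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into (score, label, box) triples above a threshold.
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)
```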
The InstructBLIP model was proposed in InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. InstructBLIP leverages the BLIP-2 architecture for visual instruction tuning.
InstructBLIP uses the same architecture as BLIP-2 with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.
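A minimal sketch, assuming the Salesforce/instructblip-vicuna-7b checkpoint (one of the released ones); note that the instruction text is consumed by both the Q-Former and the language model:

```python
import requests
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"  # assumed checkpoint id
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The instruction is passed as text; InstructBLIP also routes it to the Q-Former.
inputs = processor(images=image, text="What is unusual about this image?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```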
The LLaVa-NeXT-Video model was proposed in LLaVA-NeXT: A Strong Zero-shot Video Understanding Model by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. LLaVa-NeXT-Video improves upon LLaVa-NeXT by fine-tuning on a mix of video and image data, thus increasing the model's performance on videos.
LLaVA-NeXT shows surprisingly strong zero-shot performance in understanding video content, thanks to the AnyRes technique it uses. AnyRes represents a high-resolution image as a set of multiple lower-resolution images, and this generalizes naturally to video, since a video can be considered a set of frames (similar to the set of images in LLaVa-NeXT). The current version makes use of AnyRes and applies supervised fine-tuning (SFT) on top of LLaVA-NeXT with video data to achieve better video understanding. The model is currently SOTA among open-source models on the VideoMME benchmark.
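A hedged usage sketch; the class and checkpoint names are assumptions, and the random array below stands in for real decoded video frames:

```python
import numpy as np
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # assumed checkpoint id
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(model_id)

# 8 dummy frames of shape (H, W, C); in practice, decode frames from a real video.
clip = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)

# Assumed prompt format for the vicuna-based checkpoint; check the model card.
prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```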
A very significant change makes its way into the transformers codebase, introducing a new way to add models to transformers. We recommend reading the description of the PR below, but here is the gist of it:
The diff_converter tool replaces our old # Copied from statements, while keeping our core transformers philosophy:
- single model, single file
- explicit code
- standardization of modeling code
- readable and educative code
- simple code
- least amount of modularity
This additionally unlocks the ability to very quickly see the differences between new architectures that get developed. While many architectures are similar, the "single model, single file" policy can obfuscate the changes. With this diff converter, we want to make the changes between architectures very explicit.
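As a purely illustrative sketch (the exact file layout and converter invocation are described in the PR), a diff file subclasses an existing model and states only the differences, which the converter then expands into a full, single-file modeling module:

```python
# Hypothetical diff file: names and conventions here are illustrative only.
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaModel

class MyNewModelAttention(LlamaAttention):
    # Override only what actually differs from Llama; everything else is
    # inherited, so the delta between the architectures stays explicit and small.
    pass

class MyNewModelModel(LlamaModel):
    # Inherits the full Llama implementation; the converter expands this into
    # a complete, self-contained modeling file for the new architecture.
    pass
```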
We've made major updates to our support for tool-use and RAG models. We can now automatically generate JSON schema descriptions for Python functions which are suitable for passing to tool models, and we've defined a standard API for tool models which should allow the same tool inputs to be used with many different models. Models will need updates to their chat templates to support the new API, and we're targeting the Nous-Hermes, Command-R and Mistral/Mixtral model families for support in the very near future. Please see the updated chat template docs for more information.
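As a sketch of the schema generation, assuming the helper is exposed as transformers.utils.get_json_schema (check the chat template docs for the exact location), a typed, docstring-documented Python function can be turned into a tool description:

```python
# Hedged sketch: the import path is an assumption; see the chat template docs.
from transformers.utils import get_json_schema

def get_current_temperature(location: str, unit: str = "celsius") -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The city and country, e.g. "Paris, France"
        unit: The unit to return the temperature in
    """
    return 22.0  # dummy value; a real tool would call a weather API

# Returns a JSON schema dict suitable for passing to tool-use chat templates.
print(get_json_schema(get_current_temperature))
```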
If you are the owner of a model that supports tool use, but you're not sure how to update its chat template to support the new API, feel free to reach out to us for assistance with the update, for example on the Hugging Face Discord server. Ping Matt and yell key phrases like "chat templates" and "Jinja" and your issue will probably get resolved.
We have furthered support for GGUF files, enabling fine-tuning within the Python/HF ecosystem before converting models back to the GGUF/GGML/llama.cpp ecosystem.
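A hedged sketch of the first half of that round trip, loading a GGUF file into a standard transformers model (repo and file names below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "some-org/some-model-GGUF"   # placeholder repo id
gguf_file = "model.Q4_K_M.gguf"     # placeholder file name

# gguf_file= dequantizes the GGUF weights into a regular PyTorch model.
tokenizer = AutoTokenizer.from_pretrained(repo, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo, gguf_file=gguf_file)
# ...fine-tune as usual, then convert back with llama.cpp's conversion scripts.
```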
A new optimizer has been added to the Trainer.
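The notes don't name the optimizer here; whichever it is, Trainer optimizers are selected via the optim string in TrainingArguments. A generic sketch with a long-standing value; substitute the new optimizer's key from the PR:

```python
from transformers import TrainingArguments

# "adamw_torch" is a standard, pre-existing value; replace it with the key
# of the newly added optimizer (see the release PR) to opt in.
args = TrainingArguments(output_dir="out", optim="adamw_torch")
```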
Several improvements have been made to quantization: a new cache, the quantized KV cache, is added, offering the ability to quantize the cache of generative models and further reduce memory requirements.
Additionally, the documentation related to quantization has been entirely redone, with the aim of helping users choose the best quantization method for their use case.
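A hedged sketch of opting into the quantized KV cache at generation time; the checkpoint id is illustrative, and the cache_config backend/bit-width values are assumptions (a quantization backend such as quanto must be installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai-community/gpt2"  # any generative checkpoint; id is illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The quantized KV cache", return_tensors="pt")
# cache_implementation="quantized" swaps the default cache for the quantized one;
# the backend/nbits values below are assumptions, not the only supported ones.
out = model.generate(
    **inputs,
    max_new_tokens=20,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
```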
New instance segmentation examples have been added by @qubvel.
As a notable improvement to the HF vision models that leverage backbones, we enable loading HF pretrained model weights to serve as the backbone, with the following API:
```python
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

# Use a pretrained Hub checkpoint (here ResNet-50) as the backbone...
config = MaskFormerConfig(backbone="microsoft/resnet-50", use_pretrained_backbone=True)
# ...and build the segmentation model around it.
model = MaskFormerForInstanceSegmentation(config)
```
Additionally, we thank @Cyrilvallez for diving into our generate() method and greatly reducing its memory requirements 🔥🔥🔥 (#30536).

Both the ConversationalPipeline and the Conversation object have been deprecated for a while, and are due for removal in 4.42, which is the upcoming version.
The TextGenerationPipeline is recommended for this use case, and it now accepts inputs in the OpenAI API format.
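A minimal sketch of the chat-style input, using a placeholder model id; any chat model with a chat template should work:

```python
from transformers import pipeline

# Placeholder model id; substitute a chat model that has a chat template.
pipe = pipeline("text-generation", model="some-org/some-chat-model")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, who are you?"},
]
print(pipe(messages, max_new_tokens=50))
```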
A duplicate softmax application in FLAVA attention has been removed. This is likely to cause a small change in outputs, so it is flagged with 🚨 as results will change slightly.
The ignore_index attribute of the loss is updated to -100.

timm being updated: recent updates to timm changed the type of the attribute model.feature_info.out_indices. Previously, out_indices would reflect the type passed to the create_model call, i.e. either tuple or list. Now, this value is always a tuple.
As lists are more useful and consistent for us (we cannot save tuples in configs; they must be converted to lists first), we instead choose to cast out_indices to always be a list.
This is potentially a slight breaking change for users who create models and rely on out_indices being a tuple. As this only happens when a new model is created, and not when it is saved and reloaded (because of the config), we think it has a low chance of having much impact.
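For illustration only (not library code), the normalization amounts to:

```python
# timm now always returns out_indices as a tuple; transformers casts it to a
# list so that it can be serialized into configs.
out_indices = (0, 1, 2, 3)        # as exposed by timm's model.feature_info
out_indices = list(out_indices)   # cast applied on the transformers side
assert out_indices == [0, 1, 2, 3]
```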
Bugfixes and improvements:

- mamba slow forward by @vasqu in #30691
- tokenizer_class = "AutoTokenizer" Llava Family by @ArthurZucker in #30912
- optimum-benchmark by @ydshieh in #30615
- torch.use_deterministic_algorithms for XPU by @faaany in #30774
- MptIntegrationTests expected outputs by @ydshieh in #30989
- uv==0.1.45 by @ydshieh in #31006
- test_model_parallelism device-agnostic by @faaany in #30844
- test_model_parallelism for 2 model test classes by @ydshieh in #31067
- @main by @ydshieh in #31065
- ninja from docker image build by @ydshieh in #31080
- accelerate as a hard requirement by @younesbelkada in #31090
- OPTForQuestionAnswering by @younesbelkada in #31092
- test_multi_gpu_data_parallel_forward for vit and deit by @ydshieh in #31086
- HF_HUB_OFFLINE + fix has_file in offline mode by @Wauplin in #31016
- transformers-cli env reporting by @statelesshz in #31003
- load_in_8bit with bnb config by @younesbelkada in #31136
- IS_GITHUB_CI by @younesbelkada in #31147
- [GemmaModel] fix small typo by @ArthurZucker in #31202
- test_compile_static_cache by @ydshieh in #30991
- mistral.py::Mask4DTestHard by @ydshieh in #31212
- MistralIntegrationTest by @ydshieh in #31231
- BlipModel by @younesbelkada in #31235
- name 'torch' is not defined in bitsandbytes integration by @jamesbraza in #31243
- benchmark job in push-important-models.yml by @ydshieh in #31259
- [SwitchTransformer] Significant performance improvement on MoE blocks by @ranggihwang in #31173
- cached_download to hf_hub_download in remaining occurrences by @Wauplin in #31284
- str should be used not int when setting env variables by @statelesshz in #31272
- decoder_attention_mask shape by @ylacombe in #28071
- inputs_embeds padding logger.warning to logger.warning_once by @naimenz in #31411
- tokenizer being popped twice by @gante in #31427
- TestDeepSpeedModelZoo device-agnostic by @faaany in #31402
- dataloader_persistent_workers=True by @bastienlc in #30627
- Qwen2ForTokenClassification by @kevinhu in #31440
- generate call from local path by @gante in #31470
- PreTrainedTokenizerFast loading time when there are many added tokens by @ydshieh in #31404
- metric_for_best_model errors by @tomaarsen in #31450
- [GPT2] Add SDPA support by @vasqu in #31172
- test_config_object to test_ds_config_object by @faaany in #31403
- torch.compile support for AQLM by @younesbelkada in #31473
- wandb integration with SetFit model by @timothepearce in #30021
- tokenization_utils_base.py's docstring by @sadra-barikbin in #31510
- spectrogram_batch by @ravenouse in #27159
- TrainingArguments by @qgallouedec in #31503
- _no_split_module by @zucchini-nlp in #31566
- i18n by @SauravMaheshkar in #31584
- self.projection call in VivitTubeletEmbeddings by @v-iashin in #31632
- [GPT-NeoX] Add SDPA support by @vasqu in #31031
- past_key_values passed as kwargs by @gante in #31644

The following contributors have made significant changes to the library over the last release:
- @ravenouse: spectrogram_batch (#27159)