release notes
Published 9/25/2024
Minor. Contains breaking changes.
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
Qwen2-VL is a major update to the previous Qwen-VL from the Qwen team.
An extract from the Qwen2-VL blog post is as follows:
Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model families. Compared with Qwen-VL, Qwen2-VL has the capabilities of:
Qwen2-Audio is a new series of large audio-language models from the Qwen team. Qwen2-Audio can accept various audio signal inputs and either perform audio analysis or respond directly in text to speech instructions.
They introduce two distinct audio interaction modes:
OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.
LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone. Images are processed with the anyres-9 technique, in which the image is split into 9 patches to better handle high-resolution images and capture as much detail as possible. Videos, however, are pooled to a sequence length of 196 tokens per frame for more memory-efficient computation. LLaVA-Onevision is available in three sizes (0.5B, 7B and 72B) and achieves remarkable performance on benchmark evaluations.
The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.
The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefinedWeb, Cosmopedia and Math data.
The team releases an accompanying blog post.
The Granite model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in its size category across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.
The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to dense models with 2x the active parameters across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.
The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.
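As a rough check on those figures, the sketch below compares the 8 kbps token stream with uncompressed PCM audio. The 16-bit mono PCM baseline is an assumption made for illustration only; the notes do not say which reference format the comparison is based on.

```python
# Back-of-the-envelope compression ratio for the figures quoted above.
SAMPLE_RATE_HZ = 44_100      # from the text: 44.1 kHz audio
CODEC_BITRATE_BPS = 8_000    # from the text: 8 kbps token stream
BITS_PER_SAMPLE = 16         # assumption: 16-bit PCM
CHANNELS = 1                 # assumption: mono


def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bitrate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


raw_bps = pcm_bitrate_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
ratio = raw_bps / CODEC_BITRATE_BPS

print(f"raw PCM: {raw_bps / 1000:.1f} kbps")  # 705.6 kbps
print(f"compression ratio: {ratio:.1f}x")      # 88.2x
```

Under these assumptions the token stream is roughly 88 times smaller than the raw signal; a stereo or higher bit-depth baseline would make the ratio larger.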
The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders.
The model uses PixtralVisionModel for its vision encoder, and MistralForCausalLM for its language decoder. The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).
The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team that combines semantic and acoustic information into audio tokens running at 12 Hz and a bitrate of 1.1 kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.
The OmDet-Turbo model was proposed in Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection capabilities while maintaining high accuracy. The base model achieves performance of up to 100.2 FPS and 53.4 AP on COCO zero-shot.
GGUF support continues to be enhanced in the library by offering a way to load GGUF models within transformers by unquantizing them, before re-quantizing them for re-use within the GGUF/GGML ecosystem.
An ongoing effort is to add the ability to use torchao as a quantization backend. Future PRs will enable saving and fine-tuning with peft.
The Liger kernel is now supported in the Trainer class.
This PR introduces modularity for transformers, something that has always been prohibited in the library (see the blog post for the accompanying design philosophy).
The core idea behind this PR is to facilitate model addition by enabling Pythonic inheritance while staying true to our single-file policy, under which models/processors must be contained within a single file so that an object can be worked on without going through 10 layers of abstraction.
It is strongly recommended to read the PR description in order to understand the depth of the change: https://github.com/huggingface/transformers/pull/33248
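The inheritance idea can be pictured with a toy example. Every class name below is hypothetical and no transformers API is used; the real mechanism, including how inherited code ends up in a single file, is the one described in the PR.

```python
class BaseAttention:
    """Stand-in for an attention block defined by an existing model."""

    def __init__(self, hidden_size: int, num_heads: int):
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads

    def describe(self) -> str:
        return f"{type(self).__name__}(heads={self.num_heads}, head_dim={self.head_dim})"


class NewModelAttention(BaseAttention):
    """A new model reuses the parent block and overrides only what differs."""

    def __init__(self, hidden_size: int, num_heads: int, sliding_window: int):
        super().__init__(hidden_size, num_heads)
        self.sliding_window = sliding_window  # the only new attribute

    def describe(self) -> str:
        return f"{super().describe()} + window={self.sliding_window}"


attn = NewModelAttention(hidden_size=64, num_heads=8, sliding_window=128)
print(attn.describe())
```

The point of the sketch is only that the new class states its differences from the parent instead of copying the parent's code.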
transformers: modularity and inheritance for new model additions by @ArthurZucker in #33248
Agents continue being improved at each release; this time making it much simpler to leverage a local engine through a local Transformers Engine.
This PR adds dynamic cache support to all decoder-only models (except for XLNet).
The documentation for the Dynamic cache can be found here, and documentation related to the KV cache in transformers in general can be found here.
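For readers new to the idea, the following is a toy sketch of a key/value cache that grows as tokens are generated. `ToyDynamicCache` and its methods are hypothetical names written for this illustration and use plain Python lists; this is not the library's implementation, whose behavior is described in the cache documentation.

```python
from typing import List, Tuple

# Each cached entry is a plain Python list of per-token vectors.
Vectors = List[List[float]]


class ToyDynamicCache:
    """Grows one (keys, values) pair per layer as new tokens arrive."""

    def __init__(self) -> None:
        self.keys: List[Vectors] = []
        self.values: List[Vectors] = []

    def update(self, new_keys: Vectors, new_values: Vectors, layer_idx: int) -> Tuple[Vectors, Vectors]:
        # First time a layer is seen: create its slot lazily.
        while len(self.keys) <= layer_idx:
            self.keys.append([])
            self.values.append([])
        # Append along the sequence dimension; nothing is preallocated.
        self.keys[layer_idx].extend(new_keys)
        self.values[layer_idx].extend(new_values)
        return self.keys[layer_idx], self.values[layer_idx]

    def get_seq_length(self, layer_idx: int = 0) -> int:
        return len(self.keys[layer_idx]) if layer_idx < len(self.keys) else 0


cache = ToyDynamicCache()
cache.update([[0.1, 0.2], [0.3, 0.4]], [[1.0, 1.0], [2.0, 2.0]], layer_idx=0)  # 2-token prompt
cache.update([[0.5, 0.6]], [[3.0, 3.0]], layer_idx=0)  # one decoded token
print(cache.get_seq_length())  # 3
```

The cache holds exactly as many entries as tokens seen so far, so each decoding step only has to compute keys and values for the newest token.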
We've made several updates to our handling of chat models and chat templates. The most noticeable change is that assistant prefill is now supported. This means you can end a chat with an assistant message, and the model will continue that message instead of starting a new one, allowing you to guide the model's response:
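from transformers import pipeline

# model_checkpoint is assumed to hold the name or path of a chat model
pipe = pipeline("text-generation", model_checkpoint)
chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},  # assistant prefill
]
output = pipe(chat)  # The model will continue outputting JSON!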
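To see why ending on an assistant message changes the prompt, here is a toy renderer in plain Python. The `<|role|>` and `<|end|>` markers and the `render_chat` function are made up for this sketch; real chat templates are model-specific and are not reproduced here.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]]) -> str:
    """Render messages with a made-up template; the markers are hypothetical."""
    parts = []
    for i, message in enumerate(messages):
        is_last = i == len(messages) - 1
        text = f"<|{message['role']}|>\n{message['content']}"
        if is_last and message["role"] == "assistant":
            # Prefill: leave the final assistant turn open so generation
            # continues this message instead of starting a new one.
            parts.append(text)
        else:
            parts.append(text + "<|end|>\n")
    if messages and messages[-1]["role"] != "assistant":
        # No prefill: open a fresh assistant turn for the model to fill.
        parts.append("<|assistant|>\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'},
]
prompt = render_chat(chat)
print(repr(prompt))
```

With the prefilled message the prompt ends in `{"name": "`, so the next generated token has to continue the JSON object; without it, the prompt ends with an empty assistant turn.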
We've also enabled several new features in Jinja that will allow more powerful templates in the future, including Loop Controls and a strftime_now function that returns the current date and time, which is commonly used in system messages. For more details, see the updated chat template docs.
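As a plain-Python sketch of what such a date helper might do (an assumption for illustration, not the library's code; the chat template docs cover how it is exposed to templates), the function below formats the current time with a strftime pattern and embeds it in a system message whose wording is hypothetical.

```python
from datetime import datetime


def strftime_now(fmt: str) -> str:
    """Format the current local date and time with a strftime pattern."""
    return datetime.now().strftime(fmt)


# A system message that embeds today's date (the wording is hypothetical).
system_message = f"You are a helpful assistant. Today's date is {strftime_now('%d %b %Y')}."
print(system_message)
```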
mask_generation.md to Korean by @jeongiin in #32257
idefics.md to Korean by @boyunJang in #32258
image_to_image.md to Korean by @shinhyunji36 in #32327
gptq.md to Korean by @1kmmk1 in #32293
prompting.md to Korean by @chhaewxn in #32294
quantization/quanto.md to Korean by @fabxoe in #32281
image_feature_extraction.md to Korean by @mreraser in #32239
chat_templating.md to Korean by @enchantee00 in #32362
_supports_sdpa to True by @pocca2048 in #32457
ko-llm_tutorial_optimization.md to Korean by @010kim in #32372
trainer.md to Korean by @cjfghk5697 in #32260
eetq.md to Korean by @jun048098 in #32352
fsdp.md to Korean by @win2dvp21 in #32261
bitsandbytes.md to Korean by @SeungAhSon in #32408
inputs_embeds as input by @molbap in #32493
test_static_cache_exportability with torch 2.4.0 by @guangy10 in #32516
agent.md to Korean by @Jwaminju in #32351
encodec model names by @Sai-Suraj-27 in #32581
.push_to_hub(..., create_pr=True, revision="my-branch") when creating PR on not-owned repo by @Wauplin in #32094
deepspeed.md to Korean by @4N3MONE in #32431
awq.md to Korean by @ahnjj in #32324
test_find_base_model_checkpoint by @Sai-Suraj-27 in #32638
is_torch_mps_available() function to include min_version argument by @Sai-Suraj-27 in #32545
transformers tag to the modelcard by @LysandreJik in #32623
WhisperGenerationMixin by @faaany in #32316
test_tokenization_utils.py by @Sai-Suraj-27 in #32601
tests/utils/test_add_new_model_like.py by @Sai-Suraj-27 in #32678
JetMoeIntegrationTest by @ydshieh in #32332
doctest_glob by @Sai-Suraj-27 in #32475
falcon-mamba-7b model checkpoint name by @Sai-Suraj-27 in #32837
LogitsWarper and LogitsProcessor by @gante in #32626
batch_size instead of max_batch_size by @gante in #32657
to in DoLa body, causing exceptions in multi-gpu generation by @gante in #32856
test_sdpa_can_compile_dynamic device-agnostic by @faaany in #32519
whisper-large-v2 model link in docs by @Sai-Suraj-27 in #32871
norm_before_gate usage by @vasqu in #32686
tensor.norm() with decomposed version for CLIP executorch export by @qubvel in #32887
return_timestamps when return_timestamps is not passed to generate function by @hrl in #31296
huggingface_hub installation to workflows by @Sai-Suraj-27 in #32891
exceptions.ConnectionError by @younesbelkada in #31469
AttributeError raised when using Trainer with eval_on_start=True in Jupyter Notebook. by @fshp971 in #32849
Processor.save_pretrained caused by #31691 by @leloykun in #32921
use_cache=False by @gante in #32863
PretrainedConfig from saving generate parameters; Update deprecations in generate-related code 🧹 by @gante in #32659
atol in test_forward_with_num_logits_to_keep by @gante in #33093
isin_mps_friendly, a wrapper function for torch.isin by @gante in #33099
pydantic required version in dockerfiles to make it compatible with DeepSpeed by @Sai-Suraj-27 in #33105
efficientnet pipeline timeout and prevent future similar issues due to large image size by @gante in #33123
conversations.md to Korean by @newfull5 in #32468
llm_optims.md to Korean by @yijun-lee in #32325
return_dict_in_generate is False but should be True by @gante in #33146
bitsandbytes) in docstrings by @rapsealk in #33230
torch.from_numpy() to create tensors for np.ndarrays by @shinyano in #33201
num_logits_to_keep in composite models by @zucchini-nlp in #33168
FalconMamba training issues due to incompatible kernels by @younesbelkada in #33195
torch.jit.trace for interpolate_pos_encoding in all vision models by @xenova in #33226
inputs_embeds by @zucchini-nlp in #32932
transformers[en] Documentation by @nnilayy in #33350
FalconMambaForCausalLM by @younesbelkada in #33381
FbgemmFp8Linear not preserving tensor shape by @vgel in #33239
Zero-shot object detection documentation by @sergiopaniego in #33430
SSH into runner info. to DM by @ydshieh in #33346
train with a script by @faaany in #33423
padding_side as call time kwargs by @zucchini-nlp in #33385
Agents and tools documentation links typos by @sergiopaniego in #33471
Agents, supercharged - Multi-agents, External tools, and more docs typo fixed by @sergiopaniego in #33478
docs/source/ar/_toctree.yml by @AhmedAlmaghz in #32696
accelerator.use_fp16 in examples by @hlky in #33513
sequences_scores in the Whisper beam search output by @Nik-Kras in #32970
model.config and model.generation_config 🔫 by @gante in #33480
past_key_values is None by @gante in #33541
attention_mask is 2D by @gante in #33575
[Mamba2] Move dt calculations to kernel by @vasqu in #33520
gemma2 when instantiating a new cache by @gante in #33595
test_generate_from_inputs_embeds_decoder_only by @gante in #33602
torch_job by @ydshieh in #33593
PreTrainedModel inheriting from GenerationMixin by @gante in #33203
cache_implementation) by @gante in #33684
The following contributors have made significant changes to the library over the last release:
chat_templating.md to Korean (#32362)
ko-llm_tutorial_optimization.md to Korean (#32372)
trainer.md to Korean (#32260)
deepspeed.md to Korean (#32431)