release notes
Published 4/5/2025
Minor (contains new features)
Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models: Llama 4 Maverick (~400B) and Llama 4 Scout (~109B).
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).
For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
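To get a feel for what the quantization options mean in practice, here is a back-of-the-envelope sketch of weight-only memory. It assumes the approximate parameter counts quoted for the model repos (~109B for Scout, ~400B for Maverick) and simply multiplies by bytes per parameter. The helper name `weight_memory_gb` is made up for this illustration, and the estimate ignores activations, the KV cache, and any quantization overhead.

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
# Parameter counts are approximate (~109B Scout, ~400B Maverick); activations,
# KV cache and quantization metadata are deliberately ignored.

BYTES_PER_PARAM = {
    "bf16": 2.0,  # 16 bits per weight
    "fp8": 1.0,   # 8 bits per weight
    "int8": 1.0,  # 8 bits per weight
    "int4": 0.5,  # 4 bits per weight
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in [("Scout", 109e9), ("Maverick", 400e9)]:
        for dtype in ("bf16", "int8", "int4"):
            size = weight_memory_gb(params, dtype)
            print(f"{name:9s} {dtype:5s} ~{size:.1f} GB")
```

Compare the resulting numbers against the memory of your target GPU, and leave headroom for everything the estimate leaves out.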
Getting started with Llama 4 using transformers is straightforward. Make sure you have transformers v4.51.0 or later installed:
pip install -U transformers[hf_xet]
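If you are unsure which version is installed, a small check along these lines can help. This is a minimal sketch using only the standard library; `parse_release` and `meets_minimum` are hypothetical helpers written for this example, and they compare only the leading numeric parts of the version, so pre-release suffixes such as `rc1` are not distinguished from the final release.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Turn '4.51.0' or '4.52.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so that '4.51' compares equal to '4.51.0'
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.51.0") -> bool:
    """True if the installed version is at least the minimum version."""
    return parse_release(installed) >= parse_release(minimum)


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old for Llama 4"
        print(installed, status)
```

Comparing integer tuples rather than raw strings matters here: as strings, "4.9.0" would sort after "4.51.0".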
Here's a quick example using the instruction-tuned Maverick model responding about two images, using tensor parallel for maximum speed. You need to run this script on an instance with 8 GPUs, using a command like:
torchrun --nproc-per-node=8 script.py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
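The slice passed to `batch_decode` above is there because, for decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones. The sketch below shows the same idea on plain Python lists with made-up token ids; `strip_prompt` is a hypothetical helper for this illustration, not part of transformers.

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first `prompt_length` token ids from every sequence in the batch."""
    return [seq[prompt_length:] for seq in sequences]


# Made-up token ids: a 4-token prompt followed by 3 generated tokens.
prompt_ids = [[101, 7, 8, 9]]
generated = [[101, 7, 8, 9, 42, 43, 44]]

new_tokens = strip_prompt(generated, len(prompt_ids[0]))
print(new_tokens)  # [[42, 43, 44]]
```

Without the slice, the decoded response would start with the full prompt text.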
Make sure to check the model cards on the repos (Llama 4 Maverick (~400B) and Llama 4 Scout (~109B)) for detailed usage instructions, including multimodal examples, specific prompt formats (like system prompts), quantization details, and advanced configuration options!
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generating text outputs, and comes with 128K token context length. The model underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modality supports are the following:
DeepSeek-V3 is covered in depth in the following model-based release, and we recommend reading it if you want all the information relevant to that model.
The model is detailed in the following paper.
The DeepSeek-V3 model was proposed in the DeepSeek-V3 Technical Report by the DeepSeek-AI Team.
The abstract from the paper is the following:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
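To make the "37B activated out of 671B total" idea concrete, here is a toy sketch of top-k Mixture-of-Experts routing on scalar inputs. It uses generic softmax gating with renormalized weights, which is an assumption chosen for illustration; it is not the router used by DeepSeek-V3 or Llama 4, both of which add details this sketch omits. The function names and the four toy experts are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_scores, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized so that they sum to 1.
    """
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, router_scores, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_scores, k))


# Four toy "experts"; a real model would use feed-forward networks here.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
scores = [0.1, 2.0, 1.0, -1.0]

routes = top_k_route(scores, k=2)
print([i for i, _ in routes])  # indices of the two selected experts
print(moe_forward(3.0, experts, scores, k=2))

# Fraction of parameters active per token, using the figures from the abstract.
print(f"active fraction: {37 / 671:.1%}")
```

Only the selected experts are evaluated for a given token, which is why the active parameter count can be a small fraction of the total.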
The Qwen3 architecture has been contributed to transformers and is available in v4.51.0. At time of release, the models themselves have not yet been released - stay tuned for a release from the Qwen team!
Model docs are getting a significant overhaul: they now provide much-needed, ready-to-use examples that can be copy-pasted into modules and consoles. We will adapt these examples to each model, with the goal of providing relevant examples on a per-model basis.
A very large PR by @nikosanto13 helped add modular files to all speech models in the library. Seeing the differences between these models is now much simpler, as are maintenance and eventual refactors.
original_max_position_embeddings to YARN rope_scaling optional keys by @JustinTong0323 in #36877
trainer_pt_utils.py docstrings for consistency by @ethanknights in #36912
DataCollatorForWholeWordMask by @capemox in #36903
uv for installing packages by @Sai-Suraj-27 in #36957
networkx==3.2.1 manually in some CircleCI jobs after #36957 by @ydshieh in #37000
to_py_obj for python-native numeric lists and scalars by @n0gu-furiosa in #36885
qwen2_vl.md to Korean by @MinJu-Ha in #36750
AwqConfigTest by @faaany in #37032
test_assisted_decoding_in_different_gpu test on XPU by @yao-matrix in #37120
_VALID_DICT_FIELDS to class attribute for shared dict parsing in subclasses by @Tavish9 in #36736
[ModernBERT] Never save 'reference_compile' config; should be set based on end user by @tomaarsen in #36305
307 in RequestCounter by @ydshieh in #36953
TASK_MAPPING by @saattrupdan in #37107
min_new_tokens to prevent flaky length checks by @gante in #37175
num_items_in_batch if necessary by @regisss in #36967
utils/check_bad_commit.py by @ydshieh in #37272
return_tensors in audio chat templates by @zucchini-nlp in #34601
0.11.2 by @ydshieh in #36962
lru_cache for tokenization tests by @ydshieh in #36818
return_dict logic to remove complicated if/else paths by @qubvel in #36794
The following contributors have made significant changes to the library over the last release: