release notes
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal domains, for both inference and training.
Published 6/26/2025
Minor release. Contains breaking changes.

Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, accepting text, image, video, and audio and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.
Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.
```python
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    torch_dtype=torch.bfloat16,
    model="google/gemma-3n-e4b",
    device="cuda",
)
output = pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    text="<image_soft_token> in this image, there is",
)
print(output)
```
Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs. It can generate highly realistic dialogue from a transcript, including nonverbal communication such as laughter and coughing. Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning).
Model Architecture: Dia is an encoder-decoder transformer based on the original transformer architecture. However, some more modern features such as rotary positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is used, while for the audio portion (decoder), a pretrained codec model, DAC, is used: DAC encodes speech into discrete codebook tokens and decodes them back into audio.
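The byte tokenizer mentioned above can be sketched in a few lines. This is an illustrative stand-in, not Dia's actual processor API, and the `[S1]` speaker tag is an assumed transcript convention:

```python
def byte_tokenize(text: str) -> list[int]:
    # A byte tokenizer needs no learned vocabulary: UTF-8 bytes
    # double as token ids, so the vocabulary size is just 256.
    return list(text.encode("utf-8"))

def byte_detokenize(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

ids = byte_tokenize("[S1] Hello! (laughs)")
assert all(0 <= i < 256 for i in ids)
assert byte_detokenize(ids) == "[S1] Hello! (laughs)"
```

Byte-level tokenization sidesteps out-of-vocabulary issues entirely, at the cost of longer token sequences than a subword tokenizer would produce.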

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai has released two model checkpoints.
Read more about the model in the documentation.

V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
Read more about the model in the documentation.
Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.
The Arcee model is architecturally similar to Llama but uses x * relu(x) in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
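The activation itself is tiny; a minimal sketch in plain Python (illustrative, not the actual modeling code):

```python
def relu2(x: float) -> float:
    # ReLU-squared, used in Arcee's MLP blocks in place of SiLU:
    # x * relu(x) is identical to relu(x) ** 2.
    return x * max(x, 0.0)

assert relu2(3.0) == 9.0   # positive inputs are squared
assert relu2(-2.0) == 0.0  # negative inputs are zeroed, like ReLU
```

Like plain ReLU, the function is exactly zero for negative inputs, but the squared positive branch grows faster, which is the property the cited training-efficiency research exploits.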
Read more about the model in the documentation.
ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
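The late interaction scoring step can be illustrated with a small MaxSim sketch, here over toy 2-d vectors in pure Python (the real model scores high-dimensional multi-vector embeddings):

```python
def late_interaction_score(query_vecs, doc_vecs):
    # ColBERT-style MaxSim: each query vector keeps only its best
    # dot-product match among the document's vectors; the final
    # score is the sum of those per-query maxima.
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )

query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, -1.0]]
assert late_interaction_score(query, doc) == 1.0
```

Because each query vector is matched independently, the score rewards documents that cover every aspect of the query, not just its dominant direction.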
Read more about the model in the documentation.
MiniMax is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the long context capabilities of the model, MiniMax adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods, such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), MiniMax's training context length is extended to 1 million tokens, and it can handle a context of up to 4 million tokens during inference. On various academic benchmarks, MiniMax also demonstrates the performance of a top-tier model.
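The "activated per token" figure comes from MoE routing: only the top-scoring experts run for each token. A toy top-k router (illustrative only, not MiniMax's actual gating network):

```python
def topk_route(gate_logits, k=2):
    # Keep only the k experts with the highest router scores; the
    # remaining experts' parameters are never touched for this token,
    # which is why activated parameters << total parameters.
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    return sorted(ranked[:k])

# Experts 1 and 3 win, so only their weights run for this token.
assert topk_route([0.1, 2.0, -1.0, 0.5], k=2) == [1, 3]
```

With this kind of routing, compute per token scales with k rather than with the total expert count, which is how a 456B-parameter model can run with roughly a tenth of its parameters active.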
For more details on the architecture, refer to the release blog post.
Read more about the model in the documentation.
T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large language models, developed by adapting pretrained decoder-only models into an encoder-decoder architecture. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on the transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.
T5Gemma has two groups of model sizes: 1) Gemma 2 sizes (2B-2B, 9B-2B, and 9B-9B), which are based on the official Gemma 2 models (2B and 9B); and 2) T5 sizes (Small, Base, Large, and XL), which are pretrained under the Gemma 2 framework following the T5 configuration. In addition, we also provide a model at ML size (medium large, ~2B in total), which sits between T5 Large and T5 XL.
The pretrained variants are trained with two objectives, separately: prefix language modeling with knowledge distillation (PrefixLM) and UL2. We release both variants for each model size. The instruction-tuned variants were post-trained with supervised fine-tuning and reinforcement learning.
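A prefix-LM objective relies on a mixed attention mask: bidirectional over the prefix, causal over the continuation. A minimal sketch of such a mask (an illustration of the objective in general, not T5Gemma's actual training code):

```python
def prefix_lm_mask(seq_len: int, prefix_len: int):
    # mask[i][j] is True where position i may attend to position j:
    # every position sees the whole prefix bidirectionally, while
    # positions after the prefix only see up to themselves (causal).
    return [[j < prefix_len or j <= i for j in range(seq_len)]
            for i in range(seq_len)]

mask = prefix_lm_mask(seq_len=4, prefix_len=2)
assert mask[0][1] is True   # prefix tokens attend to each other
assert mask[2][3] is False  # the continuation stays causal
```

The loss is then computed only on the continuation tokens, so the model learns to generate a target given a fully visible input prefix.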
Read more about the model in the documentation.
The GLM-4.1V model architecture has been added to transformers; no models have yet been released with that architecture. Stay tuned for the GLM team's upcoming releases!
Read more about the model in the documentation.
The FalconH1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series on this website.
Read more about the model in the documentation.
The LightGlue model was proposed in LightGlue: Local Feature Matching at Light Speed by Philipp Lindenberger, Paul-Edouard Sarlin and Marc Pollefeys.
Similar to SuperGlue, this model matches two sets of local features extracted from two images, with the goal of being faster than SuperGlue. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching and homography estimation.
The abstract from the paper is the following:
We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. Cumulatively, they make LightGlue more efficient - in terms of both memory and computation, more accurate, and much easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like 3D reconstruction. The code and trained models are publicly available at this https URL
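The adaptive-inference property described in the abstract amounts to an early-exit loop: stop refining matches once a confidence estimate clears a threshold. A toy sketch of the idea (hypothetical layers and exit criterion, not LightGlue's actual implementation):

```python
def adaptive_match(layers, confidence_threshold=0.95):
    # Run matching layers one at a time and exit early once the
    # running confidence estimate clears the threshold, so easy
    # image pairs cost fewer layers than hard ones.
    state, used = 0.0, 0
    for layer in layers:
        state, confidence = layer(state)
        used += 1
        if confidence >= confidence_threshold:
            break
    return state, used

# Toy layers whose confidence grows each step: exit after 3 of 5.
layers = [lambda s, c=c: (s + 1.0, c) for c in (0.3, 0.7, 0.96, 0.99, 1.0)]
state, used = adaptive_match(layers)
assert used == 3 and state == 3.0
```

On an "easy" pair, confidence rises quickly and most layers are skipped, which is exactly the latency win the abstract claims for applications like 3D reconstruction.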
Read more about the model in the documentation.
The abstract from the report is the following:
Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.
Read more about the model in the documentation.
SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a Transformer decoder architecture with Grouped Query Attention (GQA) to reduce the KV cache, and NoPE (selectively omitting rotary position embeddings), enabling improved performance on long-context tasks. It is trained with a multi-stage approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very large context lengths. The instruct variant is optimized for reasoning and tool use.
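The KV-cache saving from GQA comes from query heads sharing key/value heads. A small sketch of the head mapping (illustrative; the head counts are made up, not SmolLM3's actual configuration):

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    # Grouped Query Attention: consecutive query heads share one
    # key/value head, so the KV cache shrinks by a factor of
    # n_q_heads // n_kv_heads compared to full multi-head attention.
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# 16 query heads sharing 4 KV heads -> a 4x smaller KV cache.
assert [kv_head_for_query_head(h, 16, 4) for h in (0, 3, 4, 15)] == [0, 0, 1, 3]
```

With n_kv_heads equal to n_q_heads this degenerates to standard multi-head attention, and with n_kv_heads equal to 1 it becomes multi-query attention; GQA sits between the two.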
Read more about the model in the documentation.
In previous versions, installing the kernels library would automatically activate the custom kernels added to transformers, because the use_kernel_forward_from_the_hub decorator directly swapped out the model’s forward method. This implicit behavior caused several issues for users — including problems with torch.compile, non-determinism, and inconsistent outputs.
To address this, we've introduced a new opt-in mechanism called kernelize. You can now enable kernel usage explicitly by passing use_kernels=True to from_pretrained. The use_kernel_forward_from_the_hub decorator now simply stores the kernel name that the user wants to use — and kernelize handles the rest under the hood.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    use_kernels=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

prompt = "Hello"
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device).input_ids
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
More kernels will be added over time — this will be a collaborative, community-driven effort to make transformers lighter and faster 🤗
Support for Flash Attention 3 has been added across the most popular models.
Several repository refactoring efforts are happening in parallel. The direction is to greatly simplify the library, removing unnecessary code paths. While the efforts are spread across the library, they're particularly visible in individual models, where non-modeling-specific code will be simplified and eventually removed.
We operate under the assumption that model-agnostic utilities shouldn't live in the modeling code. Outputs such as attentions, hidden states, and router logits are important for end users, but they don't need to be explicitly materialized in the modeling code.
Several minimal breaking changes aiming to bring clearer defaults while greatly simplifying the library have been merged.
- dtype for pipelines to auto by @Vaibhavs10 in #38882
- output_attentions=True and the attn implementation is wrong by @ArthurZucker in #38288
- [Attention] Refactor Attention Interface for Bart-based Models by @vasqu in #38108
- [Attention] Attention refactor for Whisper-based models by @vasqu in #38235
- [compile] re-enable for Qwen-VL models by @zucchini-nlp in #38127
- forced_decoder_ids by @gante in #38232
- liger-kernel to docker file by @ydshieh in #38292
- transformers env output by @yao-matrix in #38274
- forced_decoder_ids deletion by @gante in #38316
- beam_indices by @gante in #38259
- custom_generate and trust_remote_code by @gante in #38304
- vasqu to self-comment-ci.yml by @ydshieh in #38324
- [FlexAttention] Reenable flex for encoder-decoder and make the test more robust by @vasqu in #38321
- kernels for AMD docker images by @ydshieh in #38354
- [OPT] Fix attention scaling by @vasqu in #38290
- get_default_device for torch<2.3 by @Cyrilvallez in #38376
- utils/notification_service.py by @ydshieh in #38379
- initialize_weights by @Cyrilvallez in #38382
- tokenizer -> tokenize by @foldl in #38357
- generation_config.json as base parameterization by @gante in #38330
- test_offloaded_cache_implementation by @gante in #37896
- pixel_values with inputs_embeds by @dxoigmn in #38334
- CsmForConditionalGenerationIntegrationTest by @ydshieh in #38424
- huggingface/transformers by @ydshieh in #38413
- from_pretrained by @pstjohn in #38155
- from_args_and_dict ProcessorMixin by @yonigozlan in #38296
- microsoft/python-type-stubs (post dropping support for Python 3.8) by @Avasam in #38335
- BatchFeature and BatchEncoding by @lgeiger in #38459
- Gemma3IntegrationTest by @ydshieh in #38471
- SinkCache to a custom_generate repo by @gante in #38399
- Gemma2IntegrationTest by @ydshieh in #38492
- av by @ydshieh in #38548
- python3 by @S1ro1 in #38555
- utils/notification_service.py by @ydshieh in #38556
- chameleon tests by @ydshieh in #38565
- utils/notification_service.py for AMD vs Nvidia by @ydshieh in #38563
- deepseekv3 by @ydshieh in #38562
- [FlexAttn] Fix models with unique characteristics by @vasqu in #38433
- repository field to benchmarks table by @McPatate in #38582
- mlm_probability to be set to None when mlm=False in DataCollatorForLanguageModeling by @KameniAlexNea in #38522
- isort from dependencies by @Sai-Suraj-27 in #38616
- return_dict=False giving errors in a few VLM models by @ydshieh in #38519
- MiniMax (docs and integration tests checkpoint) by @geetu040 in #38575
- test_initialization by @ydshieh in #38607
- ColQwen2ModelIntegrationTest by @ydshieh in #38583
- test_initialization for SwiftFormer by @ydshieh in #38636
- AriaForConditionalGenerationModelTest on CircleCI by @ydshieh in #38615
- InternVL integration test by @ydshieh in #38612
- aya_vision test by @ydshieh in #38674
- is_bitsandbytes_available() by @ved1beta in #38528
- llava tests by @ydshieh in #38722
- None instead of try/except by @zucchini-nlp in #38561
- average_tokens_across_devices=True and world size = 1 by @qgallouedec in #38785
- qwen_2_5 omni by @ydshieh in #38658
- llava_onevision tests by @ydshieh in #38791
- mllama by @ydshieh in #38704
- low_cpu_mem_usage by @Cyrilvallez in #38792
- llava_next tests by @ydshieh in #38813
- wandb.run.url instead of wandb.run.get_url() (deprecated) by @qgallouedec in #38817
- align_to_words=True in QuestionAnsweringPipeline can lead to duplicate answers by @yushi2006 in #38761
- qwen2_5_vl tests by @ydshieh in #38845
- auxiliary_in_channels default behavior in UperNet by @simonreise in #37540
- qwen3 tests by @ydshieh in #38862
- phi4_multimodal tests by @ydshieh in #38816
- qwen3_moe tests by @ydshieh in #38865
- raise from e in hub.py utility by @Wauplin in #37241
- fsmt tests by @ydshieh in #38904
- FalconMambaIntegrationTests by @ydshieh in #38566
- ALL_LAYERNORM_LAYERS by @Cyrilvallez in #38922
- test_initialization by @ydshieh in #38932
- mistral and mistral3 tests by @ydshieh in #38978
- is_split_into_words in the TokenClassificationPipeline by @yushi2006 in #38818
- rag by @ydshieh in #38585
- [Attention] Small fix on output attentions by @vasqu in #38948
- require_tf by @gante in #38944

The following contributors have made significant changes to the library over the last release:
- liger-kernel to docker file (#38292)
- vasqu to self-comment-ci.yml (#38324)
- kernels for AMD docker images (#38354)
- utils/notification_service.py (#38379)
- CsmForConditionalGenerationIntegrationTest (#38424)
- huggingface/transformers (#38413)
- Gemma3IntegrationTest (#38471)
- Gemma2IntegrationTest (#38492)
- av (#38548)
- utils/notification_service.py (#38556)
- chameleon tests (#38565)
- utils/notification_service.py for AMD vs Nvidia (#38563)
- deepseekv3 (#38562)
- return_dict=False giving errors in a few VLM models (#38519)
- test_initialization (#38607)
- ColQwen2ModelIntegrationTest (#38583)
- test_initialization for SwiftFormer (#38636)
- AriaForConditionalGenerationModelTest on CircleCI (#38615)
- InternVL integration test (#38612)
- aya_vision test (#38674)
- llava tests (#38722)
- qwen_2_5 omni (#38658)
- llava_onevision tests (#38791)
- mllama (#38704)
- llava_next tests (#38813)
- qwen2_5_vl tests (#38845)
- qwen3 tests (#38862)
- phi4_multimodal tests (#38816)
- qwen3_moe tests (#38865)
- fsmt tests (#38904)
- FalconMambaIntegrationTests (#38566)
- test_initialization (#38932)
- mistral and mistral3 tests (#38978)
- rag (#38585)
- transformers env output (#38274)
- [Attention] Refactor Attention Interface for Bart-based Models (#38108)
- [FlexAttention] Reenable flex for encoder-decoder and make the test more robust (#38321)
- [OPT] Fix attention scaling (#38290)
- [Attention] Attention refactor for Whisper-based models (#38235)
- [FlexAttn] Fix models with unique characteristics (#38433)
- [Attention] Small fix on output attentions (#38948)
- microsoft/python-type-stubs (post dropping support for Python 3.8) (#38335)