release notes
Published 3/4/2026
Minor release. Contains breaking changes.

EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It supports a mix of European and other widely spoken languages, with sequences of up to 8192 tokens.
Links: Documentation | Paper | Blog Post
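Bidirectional attention is the key difference from a Llama-style decoder: every token can attend to every other token, rather than only to earlier positions. A minimal sketch (plain NumPy, not the actual implementation) contrasting the two mask shapes:

```python
import numpy as np

seq_len = 4

# Causal mask (Llama-style decoder): token i may only attend to tokens <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Bidirectional mask (EuroBERT-style encoder): every token attends everywhere.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask.sum())         # 10 allowed positions (lower triangle)
print(bidirectional_mask.sum())  # 16 allowed positions (full matrix)
```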
VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.
Links: Documentation | Paper
TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.
Links: Documentation | Paper
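Input patching groups consecutive time steps into fixed-length patches that become the decoder's tokens, which is what lets a decoder-only model consume long series cheaply. A toy sketch of the idea (the patch length here is illustrative, not TimesFM's actual value):

```python
import numpy as np

def patch_series(series: np.ndarray, patch_len: int) -> np.ndarray:
    """Split a 1-D series into non-overlapping patches, dropping any remainder."""
    n_patches = len(series) // patch_len
    return series[: n_patches * patch_len].reshape(n_patches, patch_len)

series = np.arange(10.0)           # 10 time steps
patches = patch_series(series, 4)  # 2 patches of length 4; last 2 steps dropped
print(patches.shape)               # (2, 4)
```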
PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.
Links: Documentation
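For intuition, a naive reading-order baseline simply sorts detected boxes top-to-bottom, then left-to-right; the learned pointer network replaces exactly this heuristic with a prediction conditioned on the detected layout. A toy sketch of the baseline (the box format is hypothetical, not the model's output schema):

```python
# Each detected element as (label, x, y), with (x, y) the box's top-left corner.
boxes = [
    ("figure", 300, 80),
    ("title", 50, 10),
    ("paragraph", 50, 80),
]

# Naive baseline: sort by vertical position, then horizontal position.
reading_order = sorted(boxes, key=lambda b: (b[2], b[1]))
print([label for label, _, _ in reading_order])
# ['title', 'paragraph', 'figure']
```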
OLMo Hybrid is a hybrid-architecture model from Ai2 that interleaves standard transformer attention layers with linear attention layers based on Gated DeltaNet. This hybrid approach aims to improve efficiency while maintaining model quality. The model uses a custom cache system that handles both the KV cache for attention layers and the recurrent state for linear attention layers.
Links: Documentation
ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.
Links: Documentation | Paper
ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.
Links: Documentation | Paper
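Relevance scoring in this multi-vector setup uses ColPali-style late interaction: each query-token embedding is matched against its best-matching document-patch embedding, and those maxima are summed (MaxSim). A minimal NumPy sketch of that scoring rule, separate from the actual model code:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColPali-style late interaction.

    query_emb: (n_query_tokens, dim), doc_emb: (n_doc_patches, dim).
    For each query token, take the max similarity over all patches, then sum.
    """
    sims = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_patches)
    return float(sims.max(axis=1).sum())  # best patch per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))  # 3 query-token embeddings
d = rng.normal(size=(5, 8))  # 5 document-patch embeddings
score = maxsim_score(q, d)   # higher score = more relevant document
```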
Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.
Links: Documentation
The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.
Links: Documentation
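At 24 kHz audio and 25 token frames per second, each frame covers a fixed number of raw samples, which is what keeps token sequences short relative to the waveform. The arithmetic, as a quick sanity check:

```python
sample_rate = 24_000  # Hz, per the description above
frame_rate = 25       # token frames per second

samples_per_frame = sample_rate // frame_rate  # raw samples compressed per frame
frames_per_minute = frame_rate * 60            # token frames per minute of audio

print(samples_per_frame)  # 960
print(frames_per_minute)  # 1500
```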
Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.
The Ernie4.5 VL MoE model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.
Several pipeline tasks have been removed or updated in the V5 cleanup (including question-answering, visual-question-answering, and image-to-image), requiring users to migrate to the replacement pipelines or updated task names.
3D position IDs for vision-language models have been unified under a common interface (sourced from qwen2-vl), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.
Unigram tokenizers were missing support for the SentencePiece (spm) precompiled charsmap. We ran an overall v4-vs-v5 regression test and fixed what we had missed.
Generation input preparation was significantly refactored to stop relying on cache_position and instead pass pre-sliced input_ids/inputs_embeds directly to prepare_inputs_for_generation, simplifying the generation loop and laying groundwork for broader cache_position removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.
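The shape of that change can be sketched abstractly (hypothetical names, not the real generate() internals): the loop now slices the inputs before calling the preparation hook, so the hook no longer needs cache_position to recover the unprocessed tokens.

```python
def prepare_inputs_for_generation(input_ids, **model_kwargs):
    # v5-style hook: input_ids arrive already sliced to the unprocessed tokens,
    # so no cache_position bookkeeping is needed inside the hook.
    return {"input_ids": input_ids, **model_kwargs}

full_ids = [101, 7, 8, 9]  # prompt plus tokens generated so far
past_length = 3            # tokens already represented in the KV cache
new_ids = full_ids[past_length:]  # the loop pre-slices before calling the hook
inputs = prepare_inputs_for_generation(new_ids)
print(inputs["input_ids"])  # [9]
```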
- prepare_inputs_for_generation (#44226) by @Cyrilvallez in [#44226]
- cache_position to prepare inputs (#44130) by @Cyrilvallez in [#44130]

Several tokenization bugs were fixed in this release, including resolving an AttributeError in MLukeTokenizer caused by the v5 rename of additional_special_tokens, correcting the Fuyu tokenizer class mapping, fixing LayoutXLM tokenization test failures from the slow-tokenizer-removal refactor, and adding olmo_hybrid to the auto-tokenizer mapping. The tokenizer documentation was also updated to reflect the new unified v5 backend architecture and reorganized for clarity.
Fixed several kernel-related issues including a security vulnerability, corrected Mamba kernel loading to handle incompatible import structures, ensured Liger Kernel is properly enabled during hyperparameter search, and expanded Flash Attention to support multiple compatible implementations.
- [Mamba] Fix kernel loading (#44176) by @vasqu in [#44176]
- [Flash Attn] Enable compatible implementations (#44177) by @vasqu in [#44177]

This release adds several new quantization backends and fixes, including MLX quantization support for MPS devices, Four Over Six (4/6) NVFP4 quantization integration for NVIDIA Blackwell GPUs, and CPU support for MXFP4 models, alongside a bug fix for MXFP4 model saving using reverse_op.
Fixed backward compatibility for image processors loaded from older remote code that lack valid_kwargs definitions, and resolved test failures in AMD ROCm CI by adding the missing timm dependency to the Docker image.
- from_dict backward compatibility with old remote code (#44245) by @yonigozlan in [#44245]
- speaking_rate as an optional forward argument (#43283) by @gau-nernst in [#43283]
- ProcessingKwargs, ImagesKwargs etc. to docs (#44269) by @yonigozlan in [#44269]
- has_similar_generate_outputs assertions (#44166) by @tarekziade in [#44166]
- TokenizersBackend for Olmo3 to preserve custom pre_tokenizer (#44294) by @mario-sanz in [#44294]
- [Modular] Fix file type regression (#44283) by @vasqu in [#44283]
- Trainer class docs (compute_loss & hyperparameter_search) (#44268) by @ethanknights in [#44268]
- [fix] Set input_modalities on various architectures that aren't just text (#44078) by @tomaarsen in [#44078]
- VersionComparison.from_string return type mismatch (#43709) by @tarekziade in [#43709]
- AnyToAnyPipeline.__call__ docstring (#44229) by @alvarobartt in [#44229]
- test_generate_with_and_without_position_ids in GLM ORC (#44173) by @tarekziade in [#44173]
- Seq2SeqTrainingArguments documentation (#35258) by @qgallouedec in [#35258]
- __setitem__ on ModelOutput even if the parameter was previously None (#44080) by @tomaarsen in [#44080]
- [simple] Fix up __repr__ whitespace/brackets (#44048) by @tomaarsen in [#44048]
- [chore] Fix incorrect forward type hint for Gemma3n (#44051) by @tomaarsen in [#44051]
- get_audio_features (#44040) by @zucchini-nlp in [#44040]
- Kosmos2ModelTest test (#44061) by @tarekziade in [#44061]
- grouped_mm fallback (#44043) by @IlyasMoutawwakil in [#44043]

The following contributors have made significant changes to the library over the last release:
- has_similar_generate_outputs assertions (#44166)
- VersionComparison.from_string return type mismatch (#43709)
- test_generate_with_and_without_position_ids in GLM ORC (#44173)
- Kosmos2ModelTest test (#44061)