release notes
Published 2 weeks ago
Minor release. Contains breaking changes.

Video Encoder-only Mask Transformer (VidEoMT) is a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It eliminates the need for dedicated tracking modules by introducing a lightweight query propagation mechanism that carries information across frames, combined with a query fusion strategy that merges propagated queries with temporally-agnostic learned queries. VidEoMT achieves competitive accuracy while being 5x-10x faster than existing approaches, running at up to 160 FPS with a ViT-L backbone.
Links: Documentation | Paper
UVDoc is a machine learning model for document image rectification. It applies geometric transformations to correct distortion, skew, perspective warping, and similar problems in document images, and it supports both single-image and batched inference for processing distorted documents.
Links: Documentation
Jina-Embeddings-v3 is a multilingual, multi-task text embedding model designed for a variety of NLP applications. Based on the XLM-RoBERTa architecture, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE) to support long input sequences of up to 8192 tokens. It also includes five built-in task-specific LoRA adapters that let the model generate task-specific embeddings (e.g., for retrieval vs. classification) without significantly increasing inference latency.
Links: Documentation | Paper
Jina-Embeddings-V3 Model (#44251) by @Sai-Suraj-27

Mistral 4 is a powerful hybrid model that can act as both a general instruction model and a reasoning model. It unifies the capabilities of three different model families — Instruct, Reasoning (previously called Magistral), and Devstral — into a single, unified model. The model features a MoE architecture with 128 experts (4 active), 119B total parameters with 6.5B activated per token, a 256k context length, and multimodal input with both text and image processing capabilities.
Links: Documentation
PI0 is a vision-language-action model for robotics manipulation that jointly processes visual observations and language instructions to generate robot actions. It uses a novel flow matching architecture built on top of a pre-trained vision-language model to inherit Internet-scale semantic knowledge. The model can perform complex dexterous tasks like laundry folding, table cleaning, and assembling boxes across multiple robot platforms including single-arm robots, dual-arm robots, and mobile manipulators.
Links: Documentation | Paper
SLANeXt is a series of dedicated lightweight models for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. The SLANeXt series is a new generation of table structure recognition models independently developed by the Baidu PaddlePaddle Vision Team, with dedicated weights trained separately for wired and wireless tables. The recognition ability for all types of tables has been significantly improved, especially for wired tables.
Links: Documentation
PP-OCRv5_mobile_rec is a dedicated lightweight model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.
Links: Documentation
PP-OCRv5_server_rec is a dedicated text recognition model optimized for server-side deployment, focusing on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, and Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters, with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.
Links: Documentation
PP-OCRv5_mobile_det is a dedicated lightweight model for text detection, focusing specifically on efficient detection and understanding of text elements in multi-language documents and natural scenes. It is part of the latest generation of text detection models developed by the PaddleOCR team that efficiently and accurately supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.
Links: Documentation
PP-LCNet is a family of efficient, lightweight convolutional neural networks designed for real-world document understanding and OCR tasks. It balances accuracy, speed, and model size, making it ideal for both server-side and edge deployment. The model has three main variants optimized for specific tasks: document image orientation classification, table classification, and text line orientation classification.
Links: Documentation
PPLCNetV3 is a lightweight CPU-optimized convolutional backbone designed for efficient image classification and downstream vision tasks. It builds on the PP-LCNet architecture with improved training strategies and structural refinements for better accuracy-latency tradeoffs on CPU hardware.
Links: Documentation | Paper
PP-OCRv5_server_det is a high-performance text detection model optimized for server-side applications, focusing on accurate detection of multi-language text in documents and natural scenes. It supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.
Links: Documentation
CHMv2 is a global, meter-resolution canopy height mapping model that uses DINOv3 to estimate forest canopy heights from high-resolution optical satellite imagery. Building on the original canopy height maps released in 2024, CHMv2 delivers substantial improvements in accuracy, detail, and global consistency by leveraging Meta's self-supervised vision model. The model is trained against airborne laser scanning data and provides essential information for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure.
Links: Documentation | Paper | Blog Post
The dual BaseImageProcessor/BaseImageProcessorFast design has been replaced with a unified backend architecture, and the image_processing_utils_fast module has been removed — users should migrate to the new unified image_processing_utils module.
PreTrainedConfig and model config classes have been refactored to use the @dataclass decorator and no longer accept positional arguments — users must update any config instantiation calls to use keyword arguments only.
Flash Attention 2 (FA2) support now requires version 2.3.3 or newer, and initial Flash Attention 4 (FA4) support has been added — users on older FA2 versions must upgrade to at least 2.3.3.
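A version floor like this is easy to get wrong with string comparison ("2.10" sorts before "2.3" lexically), so a tuple comparison is the safe check. The helper below is a hypothetical sketch of such a guard; only the 2.3.3 minimum comes from the release notes:

```python
# Minimum flash-attn version required by this release.
MIN_FA2 = (2, 3, 3)

def fa2_version_ok(version_string: str) -> bool:
    """Return True if an installed flash-attn version meets the 2.3.3 floor."""
    parts = tuple(int(p) for p in version_string.split(".")[:3])
    return parts >= MIN_FA2

print(fa2_version_ok("2.3.3"))   # True
print(fa2_version_ok("2.2.0"))   # False
print(fa2_version_ok("2.10.1"))  # True (tuple compare avoids the "2.10" < "2.3" string trap)
```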
Weight tying behavior has changed so that weights are now tied even when both keys are already present in a checkpoint — users relying on the previous behavior (e.g., with .bin checkpoints containing duplicate keys) should verify their models load as expected.
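The practical consequence is easiest to see with a toy state dict (plain Python, no torch; the key names are illustrative): under the new behavior, a stale duplicate of the output-head weights in the checkpoint is overridden by tying.

```python
# A .bin-style checkpoint that carries separate copies of both tied keys.
checkpoint = {
    "embed_tokens.weight": [1.0, 2.0],
    "lm_head.weight": [9.0, 9.0],  # stale duplicate that used to "win"
}

model = {k: list(v) for k, v in checkpoint.items()}

# New behavior: tie unconditionally, even though both keys were present,
# so the head now shares the embedding weights.
model["lm_head.weight"] = model["embed_tokens.weight"]

print(model["lm_head.weight"])                                  # [1.0, 2.0]
print(model["lm_head.weight"] is model["embed_tokens.weight"])  # True
```

If a checkpoint deliberately stored different weights under the two keys, that difference is now silently discarded, which is why loaded models should be spot-checked.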
The cache_position argument has been removed from the forward signatures of most major models — users passing cache_position directly to these models should remove it, as it is now handled internally by generate.
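The reason callers no longer need to pass it is that a generation loop can derive cache_position from state it already tracks. The toy loop below sketches that bookkeeping (names are illustrative, not the transformers internals): a prefill step covers the prompt positions, then each decode step advances by one.

```python
def toy_generate(prompt_len: int, max_new_tokens: int):
    """Sketch of a generate-style loop computing cache_position internally."""
    positions_seen = []
    past_len = 0
    input_len = prompt_len
    for _ in range(max_new_tokens):
        # cache_position for this step, derived from the running past length:
        cache_position = list(range(past_len, past_len + input_len))
        positions_seen.append(cache_position)
        past_len += input_len
        input_len = 1  # after prefill, decode one token at a time
    return positions_seen

print(toy_generate(4, 3))  # [[0, 1, 2, 3], [4], [5]]
```

Callers that were forwarding cache_position manually can simply drop the argument; generate reconstructs the same values.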
Several bug fixes and improvements were made to pipeline parallel (PP) and tensor parallel (TP) support, including fixing supports_tp/pp_plan detection, resolving attribute errors in PP for Qwen2VL-based models, correcting FSDP loading with meta devices, and ensuring TP weight sharding properly updates parent module attributes (e.g., in_features/out_features) to improve compatibility with libraries like PEFT.
supports_{tp/pp}_plan (#44696) by @hmellor
torch.distributed.fsdp in trainer_seq2seq.py (#44507) by @0xDELUXA

Quantization support was improved with up to 30x faster FP8 grouped and batched matmuls, static FP8 expert support for multi-GPU setups, and a torchao minimum version bump to 0.15.0. Additionally, MXFP4 dependency error messages were made more actionable, and AWQ tests were updated to align with the GPTQModel migration.
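"Static" here means the quantization scale is computed once offline rather than per batch. The scaling idea can be sketched in plain Python using FP8 E4M3's maximum representable magnitude of 448; this is an illustration of the concept only, not the fused kernels the release ships:

```python
E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def static_quantize(weights, scale):
    # Scale into the FP8 range and clamp; a real kernel would cast to float8.
    return [max(-E4M3_MAX, min(E4M3_MAX, w / scale)) for w in weights]

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.25, 2.0]
scale = max(abs(v) for v in w) / E4M3_MAX  # "static": computed once, reused at runtime
q = static_quantize(w, scale)
print(dequantize(q, scale))  # round-trips the original values (up to clamping)
```

Precomputing the scale is what makes multi-GPU expert dispatch cheap: every rank can quantize and dequantize with the same constant instead of synchronizing per-batch statistics.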
Several performance improvements were made to tokenizer loading and saving, including eliminating redundant file parsing and unnecessary deep copies of large vocabularies that caused significant overhead. Additionally, bug fixes were applied for incorrect tokenizer class names on the Hub (DeepSeek V2/V3, ModernBERT), a clean_up_tokenization_spaces misconfiguration in Llama 3 tokenizer conversion, and a string replacement issue in AutoTokenizer class name resolution.
processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894) by @ydshieh
clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion (#44914) by @maxsloef-goodfire

Kernel support has been expanded with Flash Attention 4 fallback integration, a paged_attention kernel for continuous batching, and Neuron device support for custom kernels. Several stability fixes were also made, including bumping the kernels version dependency to prevent crashes and correcting the LFM2 kernel path.
[FA4] Add kernels fallback (#44797) by @vasqu

Several cache-related fixes and improvements were made, including aligning LFM2's cache implementation with other Mamba caches, fixing a tensor indexing crash in KV cache continuation for the transformers serve streaming endpoint, and resolving a generation bug in Idefics3 when using use_cache=False. A caching layer was also added to the model linter to skip unchanged valid files and improve build performance.
Fixed backward compatibility for full-path imports of Fast Image Processors and resolved a Llama4 vision rotary embedding initialization error where freqs_ci was not registered as a buffer, causing failures when loading models with device_map="auto".
The cache_position argument has been fully removed from the generation pipeline, as all models have been updated to no longer use it (with a backward-compatibility path retained for remote code models). Additionally, integration tests for LASR with chunked decoding were added, and outdated references to deprecated pipeline tasks were cleaned up.
cache_position anymore in generation (#44816) by @Cyrilvallez
text2text-generation, summarization and translation pipeline tasks (#44510) by @math-hiyoko
tests_hub if no tests found (#45014) by @ydshieh
attention_chunk_size in Llama4TextConfig (#45002) by @hmellor
maybe_autocast crashing on meta device tensors (#44984) by @Butanium
mm_token_type be non-padded lists (#44563) by @zucchini-nlp
Qwen2VL (#44976) by @hmellor
check_auto_docstrings (#44803) by @yonigozlan
[vllm x v5] nit (#44971) by @ArthurZucker
T5ModelIntegrationTest (#44934) by @Sai-Suraj-27
Update Transformers metadata after #43514 (#44941) by @ydshieh
from_pretrained (url input deprecated) (#44946) by @BSchilperoort
image_processing_utils_fast (#44897) by @yonigozlan
NemotronH is torch compiled (#44854) by @ydshieh
SizeDict (#44884) by @hmellor
layer_types type hint for AFMoE and Llama4 (#44874) by @hmellor
PreTrainedModel (#44672) by @neo
KeyError when patching mistral regex (#43376) by @LeonardoEmili
position_ids keys when loading OwlViT models (#44508) by @KartikPawade
.ai (#44489) by @tarekziade
@strict (#44770) by @zucchini-nlp
is_causal from EuroBertConfig (#44774) by @ydshieh
mlcd auto config/model/mapping issues (#44730) by @ydshieh
config class in some model class definitions (#44715) by @ydshieh
[FA] Fix fa detection (#44703) by @vasqu
set_encoder (#44698) by @hmellor
parent issue (#44685) by @ydshieh
ParallelInterface (#44640) by @michaelbenayoun
[Chmv2] Fix conversion after capture refactor (#44665) by @vasqu
dtype for subconfig when _from_config (#44629) by @zucchini-nlp
cache_position in more models (2) (#44602) by @Cyrilvallez
VibeVoiceAcousticTokenizer (#44628) by @ydshieh
cache_position in more models (#44330) by @Cyrilvallez
src/transformers/quantizers (#44412) by @tarekziade
[fix] Prevent crash with Apertus without xielu installed (#44567) by @tomaarsen
MusicgenStereo integration tests (#44527) by @Sai-Suraj-27
higgs_audio_v2 tests (#44482) by @kaixuanliu
_prepare_input_fn and _prepare_output_fn instance methods (#44499) by @michaelbenayoun
mps device (#44506) by @michaelbenayoun
GPTNeoModelLanguageGenerationTest (#44515) by @Sai-Suraj-27
MarianIntegrationTests (#44519) by @Sai-Suraj-27
build_pr_documentation.yml (will be the new required job) (#44538) by @ydshieh
build_pr_documentation workflow for merge_group event (#44532) by @ydshieh
ty to 0.0.20 (#44494) by @tarekziade
diffusers to CI docker file (#44480) by @ydshieh
DepthProModelIntegrationTest (#44456) by @Sai-Suraj-27
ProphetNetModelIntegrationTest (#44439) by @Sai-Suraj-27

The following contributors have made significant changes to the library over the last release:
tests_hub if no tests found (#45014)
Update Transformers metadata after #43514 (#44941)
processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894)
NemotronH is torch compiled (#44854)
is_causal from EuroBertConfig (#44774)
mlcd auto config/model/mapping issues (#44730)
config class in some model class definitions (#44715)
parent issue (#44685)
VibeVoiceAcousticTokenizer (#44628)
build_pr_documentation.yml (will be the new required job) (#44538)
build_pr_documentation workflow for merge_group event (#44532)
diffusers to CI docker file (#44480)
.ai (#44489)
src/transformers/quantizers (#44412)
ty to 0.0.20 (#44494)
T5ModelIntegrationTest (#44934)
Jina-Embeddings-V3 Model (#44251)
MusicgenStereo integration tests (#44527)
GPTNeoModelLanguageGenerationTest (#44515)
MarianIntegrationTests (#44519)
DepthProModelIntegrationTest (#44456)
ProphetNetModelIntegrationTest (#44439)
higgs_audio_v2 tests (#44482)
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.