Release notes
Published 2/17/2025
Minor release. Contains breaking changes.

Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, and Spanish.
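A minimal sketch of running it with the text-generation pipeline; the checkpoint id below is an assumption, so verify it on the Hub:

```python
from transformers import pipeline

# Minimal sketch: the checkpoint id is an assumption; check the Hub for
# the official Helium-1 preview repository.
generator = pipeline("text-generation", model="kyutai/helium-1-preview-2b")
print(generator("Hello, my name is", max_new_tokens=20)[0]["generated_text"])
```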
The Qwen2.5-VL model is an update to Qwen2-VL from the Qwen team at Alibaba Group.
The abstract from this update is the following:
Qwen2.5-VL marks a major step forward from Qwen2-VL, built upon the latest Qwen2.5 LLM. We’ve accelerated training and testing through the strategic implementation of window attention within the ViT. The ViT architecture itself has been refined with SwiGLU and RMSNorm, aligning it more closely with the LLM’s structure. A key innovation is the expansion of native dynamic resolution to encompass the temporal dimension, in addition to spatial aspects. Furthermore, we’ve upgraded MRoPE, incorporating absolute time alignment on the time axis to allow the model to effectively capture temporal dynamics, regardless of frame rate, leading to superior video understanding.
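As a sketch, Qwen2.5-VL can be exercised through the image-text-to-text pipeline; the checkpoint id and image URL below are illustrative assumptions:

```python
from transformers import pipeline

# Minimal sketch: checkpoint id and image URL are illustrative.
pipe = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-3B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(pipe(text=messages, max_new_tokens=40))
```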
The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.
This model matches two sets of interest points detected across a pair of images. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. It is useful for tasks such as image matching and homography estimation.
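A minimal matching sketch, assuming the magic-leap-community/superglue_outdoor checkpoint and two local photos of the same scene:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed inputs: two local photos of the same scene.
image1 = Image.open("view_a.jpg")
image2 = Image.open("view_b.jpg")

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superglue_outdoor")
model = AutoModel.from_pretrained("magic-leap-community/superglue_outdoor")

inputs = processor([image1, image2], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # matched keypoints and matching scores
```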
The Granite Vision model is a variant of LLaVA-NeXT, leveraging a Granite language model alongside a SigLIP visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to VipLlava. It also uses a larger set of image grid pinpoints than the original LLaVA-NeXT models to support additional aspect ratios.
Zamba2 is a large language model (LLM) trained by Zyphra, and made available under an Apache 2.0 license.
Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B are hybrid models combining state-space models (specifically Mamba) with transformer blocks, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 Mamba blocks, and uses the Mistral v0.1 tokenizer. We arrived at this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on between 2T and 3T tokens.
GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like pdftex, mathpix, matplotlib, tikz, verovio or pyecharts. The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region’s bounding box.
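A minimal plain-text OCR sketch; the checkpoint id and the processor's acceptance of a file path are assumptions based on the model card, so verify both:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

# Minimal sketch, assuming the "stepfun-ai/GOT-OCR-2.0-hf" checkpoint id.
processor = AutoProcessor.from_pretrained("stepfun-ai/GOT-OCR-2.0-hf")
model = AutoModelForImageTextToText.from_pretrained("stepfun-ai/GOT-OCR-2.0-hf")

# Assumption: the processor accepts a file path, URL, or PIL image here.
inputs = processor("path/to/scanned_page.png", return_tensors="pt")
generated = model.generate(**inputs, do_sample=False, max_new_tokens=256)
text = processor.decode(
    generated[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(text)
```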
DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.
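A minimal sketch with the object-detection pipeline; the checkpoint id is an assumption, so verify it on the Hub:

```python
from transformers import pipeline

# Minimal sketch: the checkpoint id is an assumption.
detector = pipeline("object-detection", model="IDEA-Research/dab-detr-resnet-50")
for det in detector("path/to/street_scene.jpg"):
    print(det["label"], round(det["score"], 3), det["box"])
```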
DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.
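A minimal sketch with the depth-estimation pipeline; the checkpoint id is an assumption, so verify it on the Hub:

```python
from transformers import pipeline

# Minimal sketch: the checkpoint id is an assumption.
estimator = pipeline("depth-estimation", model="apple/DepthPro-hf")
result = estimator("path/to/photo.jpg")
result["depth"].save("depth_map.png")  # PIL image of the predicted depth
```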

RT-DETRv2 is an improved Real-Time DEtection TRansformer (RT-DETR). It refines RT-DETR by introducing selective multi-scale feature extraction and a discrete sampling operator for broader deployment compatibility. These improvements yield a 0.3 to 1.4 point increase in mAP on the COCO dataset, all while maintaining the same parameter count and frames-per-second (FPS) performance.

Transformers' CLI welcomes a new command: chat. This command starts a conversation with the model of your choosing directly in your terminal.
This feature exists in TRL and has been migrated to transformers for easier usage.
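As a quick illustration, you can start a session with something like `transformers-cli chat --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct`; the flag name is an assumption carried over from the TRL version of the command, and the model id is illustrative, so check `transformers-cli chat --help` for the exact interface.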
Ongoing work aims to standardize the image processors so that their APIs are equivalent. Additionally, each processor is gaining a fast variant so that image processing is never a bottleneck in the pipelines.
In this release, several processors have been standardized and have received fast versions.
The DPT image processor did not support segmentation_maps, accepting only images. This has been fixed.
This adds an argument to the preprocess method; users passing arguments positionally to that method may therefore see changed behavior. We recommend using keyword arguments with such methods so that newly added parameters do not affect existing calls.
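A sketch of the recommended keyword-argument style; the checkpoint id and dummy inputs are illustrative:

```python
import numpy as np
from PIL import Image
from transformers import DPTImageProcessor

# Illustrative inputs: a dummy image and segmentation map.
image = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))
seg_map = np.zeros((480, 640), dtype=np.uint8)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large-ade")
# Keyword arguments protect against behavior changes when new positional
# parameters (such as segmentation_maps) are added to preprocess.
inputs = processor(images=image, segmentation_maps=seg_map, return_tensors="pt")
```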
Segmentation maps support for DPT image processor by @simonreise in #34345

The problem_type in the config.json file was read incorrectly by the pipeline, mapping single-label to multi-label losses and vice-versa. This has been fixed.
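For reference, problem_type is set on the model config; a minimal sketch, with an illustrative model id:

```python
from transformers import AutoModelForSequenceClassification

# problem_type selects the loss: "single_label_classification" uses
# cross-entropy, "multi_label_classification" uses BCE-with-logits.
# The model id is illustrative.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,
    problem_type="multi_label_classification",
)
```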
The pull request description is the easiest way to understand the problem, why it exists, and how it is solved.
The ignore_index property of the Llava configuration has been removed, as it served no purpose.
Quantization has received several improvements and fixes, including the contribution of FP8 quantization and the HIGGS quantization interface.
Additionally, we're replacing the AutoGPTQ implementation with GPTQModel from ModelCloud.
GPTQModel originated as a major refactor of AutoGPTQ, but is now a full drop-in replacement with a cleaner API, up-to-date model support, faster inference, and higher-quality quants.
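For reference, a minimal quantization sketch; the model id is illustrative, and optimum plus a GPTQ backend (gptqmodel) must be installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Minimal sketch: quantize a small model to 4-bit GPTQ. With GPTQModel
# installed, it is used as the backend in place of AutoGPTQ.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m", quantization_config=quant_config, device_map="auto"
)
```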
Generation-related changes:
- max_length by @gante in #36120
- generate-related objects and methods scheduled for removal in v4.48 by @gante in #35677
- GenerationConfig(cache_implementation="static") by @gante in #35679
- SequenceBiasLogitsProcessor by @gante in #35699
- torch.compile(model.forward) as a fast test by @gante in #34544

Pipelines have received several bug fixes and improvements which are detailed below.
- test_custom_4d_attention_mask by @ydshieh in #35606
- EarlyStoppingCallback not require load_best_model_at_end by @muellerzr in #35101
- test_beam_search_low_memory by @ydshieh in #35611
- MobileNetV1ModelTest::test_batching_equivalence for now by @ydshieh in #35614
- [Phi] bias should be True by @ArthurZucker in #35650
- [Compile] Only test compiling model forward pass by @ArthurZucker in #35658
- zero_shot_image_classification documentation guide link in SigLIP by @aretrace in #35671
- Trainer cannot correctly call torch_jit_model_eval by @Wanguy in #35722
- pt_to_tf by @gante in #35672
- check_circleci_user job by @Sai-Suraj-27 in #32866
- MimiModel with DeepSpeed ZeRO-3 by @anferico in #34735
- PeftModel by @ambroser53 in #35680
- MimiModel with DeepSpeed ZeRO-3" by @eustlb in #35755
- self-comment-ci.yml by @ydshieh in #35548
- timm import behaviour by @rwightman in #35800
- test_batching_equivalence's flakiness by @ydshieh in #35729
- TimmWrapper by @ariG23498 in #35744
- timm tag to timm-wrapper models. by @pcuenca in #35794
- get_cached_models by @Wauplin in #35809
- docs/source/ar/tasks/masked_language_modeling.md into Arabic by @AhmedAlmaghz in #35198
- benchmark code by @gante in #35730
- self-comment-ci.yml by @ydshieh in #35816
- working-directory in self-comment-ci.yml by @ydshieh in #35833
- head_dim in config extracted from Gemma2 GGUF model by @Isotr0py in #35818
- [tests] remove some flash attention class tests by @ArthurZucker in #35817
- num_logits_to_keep as Tensor + add flag by @Cyrilvallez in #35757
- test_pipelines_video_classification that was always failing by @CalOmnie in #35842
- Rocketknight1 to self-comment-ci.yml by @ydshieh in #35881
- _supports_static_cache = True for some model classes by @ydshieh in #34975
- test_generated_length_assisted_generation by @keyboardAnt in #34935
- unwrap_and_save_reload_schedule to use weights_only=False by @ydshieh in #35952
- squad_convert_example_to_features to work with numpy v2 by @ydshieh in #35955
- test_assisted_decoding_matches_greedy_search by @ydshieh in #35951
- transformers-pytorch-deepspeed-latest-gpu by @ydshieh in #35940
- Tester object has no attribute '_testMethodName' by @faaany in #35781
- TimmBackboneModelTest::test_batching_equivalence by @ydshieh in #35971
- benchmark.yml by @ydshieh in #35974
- generation / quantization) by @ydshieh in #35341
- self-comment-ci.yml by @ydshieh in #36030
- Qwen2VLImageProcessorFast into Qwen2VLProcessor by @yeliudev in #35987
- past_key_values by @yaswanth19 in #35890
- test_flash_attn_2_can_dispatch_composite_models by @ydshieh in #36050
- trainer.md by @faaany in #36066
- perf_infer_gpu_one.md by @faaany in #36087
- torch.export and fix some vision models by @qubvel in #35124
- output_dir Optional in TrainingArguments #27866 by @sambhavnoobcoder in #35735
- PretrainedConfig and PreTrainedModel by @hmellor in #36091
- test_initialization for VitPoseBackboneModelTest for now by @ydshieh in #36154
- get_default_model_revision by @MarcoGorelli in #35982
- DataCollatorForMultipleChoice from the docs to the package by @bauwenst in #34763
- check_repository_consistency run faster by MP by @ydshieh in #36175
- test-save-trainer by @zucchini-nlp in #36191

The following contributors have made significant changes to the library over the last release:
- @bauwenst: DataCollatorForMultipleChoice from the docs to the package (#34763)