release notes
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
release notes
Published 7/25/2025
MinorContains breaking changesIn order to become the source of truth, we recognize that we need to address two common and long-heard critiques about transformers:
transformers is bloatedtransformers is slowOur team has focused on improving both aspects, and we are now ready to announce this.
The modeling files for the standard Llama models are down to 500 LOC and should be much more readable, keeping just the core of the modeling and hiding the "powerful transformers features."
The MoEs are getting some kernel magic, enabling the use of the efficient megablocks kernels, setting a good precedent to allow the community to leverage any of the most powerful kernels developed for quantization as well!
It should also be much more convenient to use with any attention implementation you want. This opens the door to some optimizations such as leveraging flash-attention on Metal (MPS Torch backend).
This is but the tip of the iceberg: with the work on kernels that we're heavily pushing forward, expect speed-ups on several backends in the coming months!!
This release also includes the first steps to enabling efficient distributed training natively in transformers. Loading a 100B model takes ~3 seconds on our cluster — we hope this will be the norm for everyone! We are working on distributed checkpointing as well, and want to make sure our API can be easily used for any type of parallelism.
We want the community to benefit from all of the advances, and as always, include all hardware and platforms! We believe the kernels library will give the tools to optimize everything, making a big difference for the industry!
The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by baidu. This family of models contains multiple different architectures and model sizes. This model in specific targets the base text model without mixture of experts (moe) with 0.3B parameters in total. It uses the standard Llama at its core.
Other models from the family can be found at Ernie 4.5 MoE.

Ernie 4.5] Add ernie text models by @vasqu in #39228Voxtral is an upgrade of Ministral 3B and Mistral Small 3B, extending its language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
You can read more in Mistral's realease blog post.
The model is available in two checkpoints:
Voxtral builds on Ministral-3B by adding audio processing capabilities:
LFM2 represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.
The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.
The DeepSeek-V2 model was proposed in DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by DeepSeek-AI Team.
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.

Doge is a series of small language models based on the Doge architecture, aiming to combine the advantages of state-space and self-attention algorithms, calculate dynamic masks from cached value states using the zero-order hold method, and solve the problem of existing mainstream language models getting lost in context. It uses the wsd_scheduler scheduler to pre-train on the smollm-corpus, and can continue training on new datasets or add sparse activation feedforward networks from stable stage checkpoints.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/model_doc/doge_architecture.png" alt="drawing" width="600"
The AIMv2 model was proposed in Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
The abstract from the paper is the following:
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of a vision encoder with a small scale (<8B parameters) LLM decoder.
The EfficientLoFTR model was proposed in Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.
This model consists of matching two images together by finding pixel correspondences. It can be used to estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.
Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.
The xLSTM model was proposed in xLSTM: Extended Long Short-Term Memory by Maximilian Beck*, Korbinian Pöppel*, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter and Sepp Hochreiter. xLSTM updates the original LSTM architecture to be competitive with Transformer models by introducing exponential gating, matrix memory expansion, and parallelizable training and ingestion.
The 7B model variant was trained by the xLSTM team Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick Blies, Sebastian Böck and Sepp Hochreiter at NXAI.
EXAONE 4.0 model is the language model, which integrates a Non-reasoning mode and Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.
The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.
We've added Expert Parallel support for Llama4, next release will include it for all model! You can just set a distributed_config with enable_expert_parallel=True. This is enabling efficient training of sparse Mixture-of-Experts (MoE) models across multiple devices. This allows each expert in the MoE layer to run in parallel (instead of previous TP which requires more communication), significantly improving scalability and memory efficiency.
FP-Quant is a quantization method optimized for Blackwell-generation Nvidia GPUs, supporting efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs using MXFP4 and NVFP4 formats.
Currently, only PTQ with MXFP4 is available. You can quantize models on-the-fly using transformers:
from transformers import AutoModelForCausalLM, FPQuantConfig
model = AutoModelForCausalLM.from_pretrained(
"qwen/Qwen3-8B",
quantization_config=FPQuantConfig(),
device_map="cuda",
torch_dtype=torch.bfloat16,
)
FP-Quant requires a Blackwell GPU and runs via the QuTLASS library. No Blackwell GPU? Use FPQuantConfig(pseudoquant=True) to emulate quantization (no QuTLASS needed).
The following results show the inference speedup of QuTLASS MXFP4 over PyTorch BF16 in Transformers. MXFP4 gives consistent speedups across all batch sizes, reaching up to 4× faster at larger scales.
The kernels project aims to become the single trusted source for high-performance kernels in the Transformers ecosystem. We're working toward centralizing all kernels on the Hub, so updates, bug fixes, and improvements can happen in one place—no more scattered repos and no compilation headaches!
You can already try it out today by setting use_kernels=True in from_pretrained. Any contributor can build their kernel, register it and use it right away—no extra setup, more on this here
Even better: want to use Flash Attention 3? No need to deal with tricky compilation and missing symbols issues! Just drop in:
model.set_attn_implementation("kernels-community/flash-attn3")
This automatically fetches the right build for your setup (e.g. CUDA and PyTorch versions).
We’re also teaming up with amazing kernel devs from Unsloth, Liger, vLLM, and more to bring their work directly to the Hub—making it easier than ever to access amazing performance with a single line of code.
https://github.com/user-attachments/assets/9928f62b-543c-4b8a-b81b-4a6e262c229e
Over the past few months, we have been putting more and more functionality in the transformers chat utility, which offers a CLI-based app to chat with chat models. We've chosen to push this further by splitting the backend of transformers chat in a new, separate utility called transformers serve.
This is ideal for experimentation purposes, or to run models locally for personal and private use. It does not aim to compete with dedicated inference engines such as vLLM or SGLang.
Models of diverse modalities supported by transformers may be served with the transformers serve CLI. It spawns a local server that offers compatibility with the OpenAI SDK, which is the de-facto standard for LLM conversations and other related tasks. This way, you can use the server from many third party applications, or test it using the transformers chat CLI (docs).
The server supports the following REST APIs:
/v1/chat/completions/v1/responses/v1/audio/transcriptions/v1/modelsRelevant commits:
transformers chat and transformers serve by @LysandreJik in #38443transformers serve by @LysandreJik in #39149generation_config by @gante in #39230transformers serve by @LysandreJik in #39155/v1/audio/transcriptions) by @gante in #39434Significant refactors have been underway in transformers, aiming to reduce the complexity of the code. A metric we follow to see how the refactors impact our code is to follow the number of lines in a given model; we try to reduce it as much as possible, while keeping everything related to the forward pass and model definition in that file.
See the evolution here:
Some notable refactors:
KV caches are now defined per layer, enabling new hybrid caches that mix different attention types. CacheProcessors also encapsulate cache quantization and offloading, making them easy to customize.
output_attentions or output_hidden_statesSuch attributes require very specific handling within the forward call, while they're not important to understand how the model works. We remove that code but keep the functionality by providing a better utility to handle it.
We refactor the way to explicitly set the attention implementation so that it has a method dedicated to it.
average_tokens_across_devices by default in TrainingArguments by @Krish0909 in #39395Flex Attn] Fix torch 2.5.1 incompatibilities by @vasqu in #37406test_compare_unprocessed_logit_scores by @ydshieh in #39053t5gemma tests by @ydshieh in #39052layoutlmv3 tests by @ydshieh in #39050Gemma3nProcessorTest by @ydshieh in #39068mistral3 tests by @ydshieh in #38989dots1 tests by @ydshieh in #39088test_is_split_into_words in test_pipelines_token_classification.py by @st81 in #39079test_sdpa_can_dispatch_on_flash by @ydshieh in #39092[@lru](https://github.com/lru)_cache() to [@lru](https://github.com/lru)_cache to match styles from #38883. by @rasmi in #39093run-slow by @ydshieh in #39100llama tests by @ydshieh in #39161Dia] Change ckpt path in docs by @vasqu in #39181from_pretrained by @qubvel in #39184fastspeech2_conformer tests by @ydshieh in #39229is not None -> isinstance(..., dict) by @qubvel in #39145segmentation_maps support to MobileNetV2ImageProcessor by @simonreise in #37312tests/generation/test_utils.py by @ydshieh in #39254test_eager_matches sdpa generate and update an integration test for blip-like models by @ydshieh in #39248smollm3 by @gante in #39271PretrainedConfig.__init__ method to make it more explicit by @qubvel in #39158test_generate_compile_model_forward by @ydshieh in #39276datasets 4.0 by @lhoestq in #39156aria tests by @ydshieh in #39277test_torchscript_* for now until the majority of the community ask for it by @ydshieh in #39307stevhliu to the list in self-comment-ci.yml by @ydshieh in #39315src/ for doctest (for now) by @ydshieh in #39316max_length_q and max_length_k types to flash_attn_varlen_func by @HollowMan6 in #37206phi3 tests by @ydshieh in #39312position_ids in masking_utils by @Cyrilvallez in #39310test_sdpa_can_dispatch_on_flash by @ydshieh in #39259timm (for perception_lm) by @ydshieh in #39380/v1/models output payload by @alvarobartt in #39414set_tracer_provider and set_meter_provider calls by @McPatate in #39422JetMoeForCausalLM by @Phoenix-Shen in #37830ContinuousBatchProcessor by @qgallouedec in #39372CI] Fix partially red CI by @vasqu in #39448GemmaIntegrationTest::test_model_2b_bf16_dola by @ydshieh in #39362datasets pin by @gante in #39500args_doc.py to auto_docstring.py by @yonigozlan in #39439_supports_flash_attn_2 in examples and tests by @zucchini-nlp in #39471TypeError instead of ValueError for invalid types by @Sai-Suraj-27 in #38660MambaCache to modeling_mamba.py by @manueldeprada in #38086perf_infer_gpu_multi.md to Korean by @luckyvickyricky in #39441CI] Fix post merge ernie 4.5 by @vasqu in #39561docs/source/ko/_toctree.yml by @jungnerd in #39516supports_static_cache to can_compile_fullgraph by @zucchini-nlp in #39505device_mesh have multiple dim by @S1ro1 in #38949test_export_static_cache by @gante in #39662Ernie 4.5] Post merge adaptations by @vasqu in #39664kyutai tests by @ydshieh in #39416typing.Literal as type of tool parameters or return value by @grf53 in #39633The following contributors have made significant changes to the library over the last release:
segmentation_maps support to MobileNetV2ImageProcessor (#37312)docs/source/ko/_toctree.yml (#39516)release notes
Published 7/25/2025
MinorContains breaking changesIn order to become the source of truth, we recognize that we need to address two common and long-heard critiques about transformers:
transformers is bloatedtransformers is slowOur team has focused on improving both aspects, and we are now ready to announce this.
The modeling files for the standard Llama models are down to 500 LOC and should be much more readable, keeping just the core of the modeling and hiding the "powerful transformers features."
The MoEs are getting some kernel magic, enabling the use of the efficient megablocks kernels, setting a good precedent to allow the community to leverage any of the most powerful kernels developed for quantization as well!
It should also be much more convenient to use with any attention implementation you want. This opens the door to some optimizations such as leveraging flash-attention on Metal (MPS Torch backend).
This is but the tip of the iceberg: with the work on kernels that we're heavily pushing forward, expect speed-ups on several backends in the coming months!!
This release also includes the first steps to enabling efficient distributed training natively in transformers. Loading a 100B model takes ~3 seconds on our cluster — we hope this will be the norm for everyone! We are working on distributed checkpointing as well, and want to make sure our API can be easily used for any type of parallelism.
We want the community to benefit from all of the advances, and as always, include all hardware and platforms! We believe the kernels library will give the tools to optimize everything, making a big difference for the industry!
The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by baidu. This family of models contains multiple different architectures and model sizes. This model in specific targets the base text model without mixture of experts (moe) with 0.3B parameters in total. It uses the standard Llama at its core.
Other models from the family can be found at Ernie 4.5 MoE.

Ernie 4.5] Add ernie text models by @vasqu in #39228Voxtral is an upgrade of Ministral 3B and Mistral Small 3B, extending its language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
You can read more in Mistral's realease blog post.
The model is available in two checkpoints:
Voxtral builds on Ministral-3B by adding audio processing capabilities:
LFM2 represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.
The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.
The DeepSeek-V2 model was proposed in DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by DeepSeek-AI Team.
The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.

Doge is a series of small language models based on the Doge architecture, aiming to combine the advantages of state-space and self-attention algorithms, calculate dynamic masks from cached value states using the zero-order hold method, and solve the problem of existing mainstream language models getting lost in context. It uses the wsd_scheduler scheduler to pre-train on the smollm-corpus, and can continue training on new datasets or add sparse activation feedforward networks from stable stage checkpoints.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/model_doc/doge_architecture.png" alt="drawing" width="600"
The AIMv2 model was proposed in Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
The abstract from the paper is the following:
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of a vision encoder with a small scale (<8B parameters) LLM decoder.
The EfficientLoFTR model was proposed in Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.
This model consists of matching two images together by finding pixel correspondences. It can be used to estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.
Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.
The xLSTM model was proposed in xLSTM: Extended Long Short-Term Memory by Maximilian Beck*, Korbinian Pöppel*, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter and Sepp Hochreiter. xLSTM updates the original LSTM architecture to be competitive with Transformer models by introducing exponential gating, matrix memory expansion, and parallelizable training and ingestion.
The 7B model variant was trained by the xLSTM team Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick Blies, Sebastian Böck and Sepp Hochreiter at NXAI.
EXAONE 4.0 model is the language model, which integrates a Non-reasoning mode and Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.
The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.
We've added Expert Parallel support for Llama4, next release will include it for all model! You can just set a distributed_config with enable_expert_parallel=True. This is enabling efficient training of sparse Mixture-of-Experts (MoE) models across multiple devices. This allows each expert in the MoE layer to run in parallel (instead of previous TP which requires more communication), significantly improving scalability and memory efficiency.
FP-Quant is a quantization method optimized for Blackwell-generation Nvidia GPUs, supporting efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs using MXFP4 and NVFP4 formats.
Currently, only PTQ with MXFP4 is available. You can quantize models on-the-fly using transformers:
from transformers import AutoModelForCausalLM, FPQuantConfig
model = AutoModelForCausalLM.from_pretrained(
"qwen/Qwen3-8B",
quantization_config=FPQuantConfig(),
device_map="cuda",
torch_dtype=torch.bfloat16,
)
FP-Quant requires a Blackwell GPU and runs via the QuTLASS library. No Blackwell GPU? Use FPQuantConfig(pseudoquant=True) to emulate quantization (no QuTLASS needed).
The following results show the inference speedup of QuTLASS MXFP4 over PyTorch BF16 in Transformers. MXFP4 gives consistent speedups across all batch sizes, reaching up to 4× faster at larger scales.
The kernels project aims to become the single trusted source for high-performance kernels in the Transformers ecosystem. We're working toward centralizing all kernels on the Hub, so updates, bug fixes, and improvements can happen in one place—no more scattered repos and no compilation headaches!
You can already try it out today by setting use_kernels=True in from_pretrained. Any contributor can build their kernel, register it and use it right away—no extra setup, more on this here
Even better: want to use Flash Attention 3? No need to deal with tricky compilation and missing symbols issues! Just drop in:
model.set_attn_implementation("kernels-community/flash-attn3")
This automatically fetches the right build for your setup (e.g. CUDA and PyTorch versions).
We’re also teaming up with amazing kernel devs from Unsloth, Liger, vLLM, and more to bring their work directly to the Hub—making it easier than ever to access amazing performance with a single line of code.
https://github.com/user-attachments/assets/9928f62b-543c-4b8a-b81b-4a6e262c229e
Over the past few months, we have been putting more and more functionality in the transformers chat utility, which offers a CLI-based app to chat with chat models. We've chosen to push this further by splitting the backend of transformers chat in a new, separate utility called transformers serve.
This is ideal for experimentation purposes, or to run models locally for personal and private use. It does not aim to compete with dedicated inference engines such as vLLM or SGLang.
Models of diverse modalities supported by transformers may be served with the transformers serve CLI. It spawns a local server that offers compatibility with the OpenAI SDK, which is the de-facto standard for LLM conversations and other related tasks. This way, you can use the server from many third party applications, or test it using the transformers chat CLI (docs).
The server supports the following REST APIs:
/v1/chat/completions/v1/responses/v1/audio/transcriptions/v1/modelsRelevant commits:
transformers chat and transformers serve by @LysandreJik in #38443transformers serve by @LysandreJik in #39149generation_config by @gante in #39230transformers serve by @LysandreJik in #39155/v1/audio/transcriptions) by @gante in #39434Significant refactors have been underway in transformers, aiming to reduce the complexity of the code. A metric we follow to see how the refactors impact our code is to follow the number of lines in a given model; we try to reduce it as much as possible, while keeping everything related to the forward pass and model definition in that file.
See the evolution here:
Some notable refactors:
KV caches are now defined per layer, enabling new hybrid caches that mix different attention types. CacheProcessors also encapsulate cache quantization and offloading, making them easy to customize.
output_attentions or output_hidden_statesSuch attributes require very specific handling within the forward call, while they're not important to understand how the model works. We remove that code but keep the functionality by providing a better utility to handle it.
We refactor the way to explicitly set the attention implementation so that it has a method dedicated to it.
average_tokens_across_devices by default in TrainingArguments by @Krish0909 in #39395Flex Attn] Fix torch 2.5.1 incompatibilities by @vasqu in #37406test_compare_unprocessed_logit_scores by @ydshieh in #39053t5gemma tests by @ydshieh in #39052layoutlmv3 tests by @ydshieh in #39050Gemma3nProcessorTest by @ydshieh in #39068mistral3 tests by @ydshieh in #38989dots1 tests by @ydshieh in #39088test_is_split_into_words in test_pipelines_token_classification.py by @st81 in #39079test_sdpa_can_dispatch_on_flash by @ydshieh in #39092[@lru](https://github.com/lru)_cache() to [@lru](https://github.com/lru)_cache to match styles from #38883. by @rasmi in #39093run-slow by @ydshieh in #39100llama tests by @ydshieh in #39161Dia] Change ckpt path in docs by @vasqu in #39181from_pretrained by @qubvel in #39184fastspeech2_conformer tests by @ydshieh in #39229is not None -> isinstance(..., dict) by @qubvel in #39145segmentation_maps support to MobileNetV2ImageProcessor by @simonreise in #37312tests/generation/test_utils.py by @ydshieh in #39254test_eager_matches sdpa generate and update an integration test for blip-like models by @ydshieh in #39248smollm3 by @gante in #39271PretrainedConfig.__init__ method to make it more explicit by @qubvel in #39158test_generate_compile_model_forward by @ydshieh in #39276datasets 4.0 by @lhoestq in #39156aria tests by @ydshieh in #39277test_torchscript_* for now until the majority of the community ask for it by @ydshieh in #39307stevhliu to the list in self-comment-ci.yml by @ydshieh in #39315src/ for doctest (for now) by @ydshieh in #39316max_length_q and max_length_k types to flash_attn_varlen_func by @HollowMan6 in #37206phi3 tests by @ydshieh in #39312position_ids in masking_utils by @Cyrilvallez in #39310test_sdpa_can_dispatch_on_flash by @ydshieh in #39259timm (for perception_lm) by @ydshieh in #39380/v1/models output payload by @alvarobartt in #39414set_tracer_provider and set_meter_provider calls by @McPatate in #39422JetMoeForCausalLM by @Phoenix-Shen in #37830ContinuousBatchProcessor by @qgallouedec in #39372CI] Fix partially red CI by @vasqu in #39448GemmaIntegrationTest::test_model_2b_bf16_dola by @ydshieh in #39362datasets pin by @gante in #39500args_doc.py to auto_docstring.py by @yonigozlan in #39439_supports_flash_attn_2 in examples and tests by @zucchini-nlp in #39471TypeError instead of ValueError for invalid types by @Sai-Suraj-27 in #38660MambaCache to modeling_mamba.py by @manueldeprada in #38086perf_infer_gpu_multi.md to Korean by @luckyvickyricky in #39441CI] Fix post merge ernie 4.5 by @vasqu in #39561docs/source/ko/_toctree.yml by @jungnerd in #39516supports_static_cache to can_compile_fullgraph by @zucchini-nlp in #39505device_mesh have multiple dim by @S1ro1 in #38949test_export_static_cache by @gante in #39662Ernie 4.5] Post merge adaptations by @vasqu in #39664kyutai tests by @ydshieh in #39416typing.Literal as type of tool parameters or return value by @grf53 in #39633The following contributors have made significant changes to the library over the last release:
segmentation_maps support to MobileNetV2ImageProcessor (#37312)docs/source/ko/_toctree.yml (#39516)