release notes
Published 2 weeks ago
Minor: Contains new features
DiffusionGemma is designed to reduce the sequential bottleneck of standard causal language models by using an encoder-decoder architecture optimized for inference speed. During inference, DiffusionGemma uses multi-canvas sampling: rather than generating one token at a time, the model iteratively denoises a full block of tokens with a diffusion sampler. Because generation is autoregressive over blocks rather than over individual tokens, this approach produces text faster than token-by-token decoding.
Links: Documentation
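The block-autoregressive decoding described for DiffusionGemma can be illustrated with a toy loop. This is a minimal sketch and not the model's actual sampler: it swaps in a simple confidence-based unmasking schedule for the diffusion sampler, and every name in it (`denoise_block`, `generate`, `toy_denoiser`, `MASK`) is hypothetical rather than part of the Transformers API.

```python
MASK = None  # placeholder for a position that has not been denoised yet


def denoise_block(context, block_size, num_steps, denoiser):
    """Fill one block by committing the most confident predictions each step."""
    canvas = [MASK] * block_size
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        if not masked:
            break
        # The denoiser proposes a (token, confidence) pair for every position.
        predictions = denoiser(context, canvas)
        # Spread the remaining masked positions over the remaining steps
        # (ceiling division), so the final step always fills the block.
        remaining_steps = num_steps - step
        n_commit = -(-len(masked) // remaining_steps)
        best = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        for i in best[:n_commit]:
            canvas[i] = predictions[i][0]
    return canvas


def generate(prompt, num_blocks, block_size, num_steps, denoiser):
    """Autoregressive over blocks: each finished block joins the context."""
    tokens = list(prompt)
    for _ in range(num_blocks):
        tokens.extend(denoise_block(tokens, block_size, num_steps, denoiser))
    return tokens


def toy_denoiser(context, canvas):
    """Hypothetical stand-in for a model: continues an integer sequence and
    is more confident about positions near the start of the block."""
    return [(len(context) + i, 1.0 / (i + 1)) for i in range(len(canvas))]


print(generate([0, 1, 2], num_blocks=2, block_size=4, num_steps=2,
               denoiser=toy_denoiser))
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

In this sketch each block of four tokens takes two denoiser calls instead of four sequential decoding steps, which is the kind of saving the block-wise approach is after.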
DeepSeek-V3.2-Exp is an experimental model from DeepSeek-AI that introduces DeepSeek Sparse Attention (DSA), a trainable, fine-grained sparse attention mechanism designed to improve training and inference efficiency in long-context scenarios. Built on top of DeepSeek-V3.1-Terminus with a 685B-parameter Mixture-of-Experts backbone, it reduces the quadratic cost of attention over long sequences by attending only to a selected subset of past tokens while maintaining virtually identical benchmark performance. The work was extended in DeepSeek-V3.2, which pairs DSA with scalable reinforcement learning and achieves gold-medal-level results on competition math and competitive programming benchmarks.
Links: Documentation | Paper
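The idea of attending only to a selected subset of past tokens can be sketched for a single query as follows. This is an illustrative top-k selection, not DSA's actual implementation: the `index_scores` argument is an assumed stand-in for whatever learned scoring picks the tokens, and the function name is hypothetical.

```python
import math


def topk_sparse_attention(query, keys, values, index_scores, k):
    """Attend from one query to only the k past tokens ranked highest.

    query:        list of floats, length d
    keys, values: one list of floats per past token
    index_scores: one relevance score per past token (assumed to come from
                  some cheap, trainable selector)
    """
    k = min(k, len(keys))
    # Keep the k highest-scoring positions, then restore token order.
    selected = sorted(range(len(keys)), key=lambda i: index_scores[i],
                      reverse=True)[:k]
    selected.sort()

    # Scaled dot-product attention restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i]))
              for i in selected]
    top = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    weights = [w / total for w in weights]

    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, selected))
              for d in range(dim)]
    return selected, output
```

With a fixed `k`, the attention step costs O(k) per query instead of O(n) for n past tokens, so the total attention cost over a sequence grows linearly rather than quadratically with length; the selection step itself still has to score every past token.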
The KernelConfig API was extended to support n-to-1 module fusion and parameter transformation, simplifying how custom kernels are integrated with Transformers modules. Additional fixes include resolving a dtype mismatch in the Mamba2 CUDA kernel path for NemotronH/Zamba2, adding fine-grained fp8/fp4 Triton kernel support, and correcting the FalconMamba fast-path warning to recommend pip install kernels instead of mamba-ssm.
out_proj) (#46487) by @yuekaizhang in [#46487]
pip install kernels in fast-path warning (#46343) by @Anai-Guo in [#46343]
Fixed model parallel beam search bugs in the Qwen2-VL, Qwen2.5-VL, and Qwen3-VL MoE model families, and added documentation for tensor parallelism support with continuous batching.
pr-ci-caller.yml (#46505) by @ydshieh in [#46505]
.github/workflows/pr-ci-post-dashboard-link.yml (#46499) by @ydshieh in [#46499]
no_inherit_decorators and fixup wrong RoPE related inheritances (#46440) by @Bissmella in [#46440]
pipeline_tutorial.md, pipeline_gradio.md, pipeline_webserver.md and add_new_pipeline.md. (#46388) by @filipinescu in [#46388]
fast_tokenizers.md, custom_tokenizers.md, tokenizer_summary.md, image_processors.md and video_processors.md. (#46356) by @filipinescu in [#46356]
The following contributors have made significant changes to the library over the last release:
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.