release notes
Published 2/16/2026
Minor release · Contains breaking changes

VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model, which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription, processing audio in chunks as they arrive.
The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.
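The "padding cache" idea above can be illustrated in isolation: instead of zero-padding every incoming chunk, a causal convolution carries the tail of the previous chunk forward, so streaming output matches a single offline pass. This is a minimal sketch in pure Python; the class and names are illustrative assumptions, not the actual VoxtralRealtime implementation.

```python
class CausalConv1d:
    """1-D causal convolution over a scalar stream.

    Keeps the last (kernel_size - 1) samples of the previous chunk as a
    cache, so chunk-by-chunk output matches one pass over the full signal.
    """

    def __init__(self, kernel):
        self.kernel = kernel                      # e.g. [w0, w1, w2]
        self.cache = [0.0] * (len(kernel) - 1)    # left-padding cache

    def forward_chunk(self, chunk):
        padded = self.cache + list(chunk)
        k = len(self.kernel)
        out = []
        for t in range(len(chunk)):
            window = padded[t:t + k]              # only past samples: causal
            out.append(sum(w * x for w, x in zip(self.kernel, window)))
        self.cache = padded[-(k - 1):]            # carry state to next chunk
        return out

# Streaming two chunks gives the same result as one full pass:
conv_stream = CausalConv1d([0.5, 0.25, 0.25])
stream_out = conv_stream.forward_chunk([1, 2, 3]) + conv_stream.forward_chunk([4, 5, 6])

conv_full = CausalConv1d([0.5, 0.25, 0.25])
full_out = conv_full.forward_chunk([1, 2, 3, 4, 5, 6])
# stream_out == full_out
```

Because the cache makes each chunk's computation independent of future audio, latency is bounded by the chunk size rather than the utterance length.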
The zAI team launches GLM-5 and introduces it as follows:
GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.
Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.
The Qwen team launches Qwen3.5 and introduces it as follows:
We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.
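The "397B total, 17B active" figure comes from sparse mixture-of-experts routing: a router scores all experts per token but only the top-k are executed. The sketch below shows that mechanism with toy sizes; the function names and numbers are illustrative assumptions, not Qwen3.5's real configuration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, k):
    """Return (expert_index, weight) pairs for the top-k experts only."""
    ranked = sorted(range(len(router_logits)), key=lambda i: -router_logits[i])
    top = ranked[:k]
    probs = softmax([router_logits[i] for i in top])  # renormalize over top-k
    return list(zip(top, probs))

num_experts, top_k = 8, 2
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
selected = route(logits, top_k)                # only 2 of 8 experts run
active_fraction = top_k / num_experts          # fraction of expert params used
```

Only the selected experts' parameters participate in the forward pass, which is how a model can keep hundreds of billions of total parameters while activating a small fraction per token.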
VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It's designed to capture the authentic conversational "vibe" and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.
One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.
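The two-tokenizer split can be pictured as two continuous feature streams extracted from the same waveform at different rates: a higher-rate acoustic branch and a coarser semantic branch. This is a deliberately simplified sketch (mean-pooling stands in for the real learned encoders), assuming illustrative frame sizes rather than VibeVoice's actual ones.

```python
def frame_mean(samples, frame):
    """Downsample a waveform by averaging non-overlapping frames."""
    return [sum(samples[i:i + frame]) / frame
            for i in range(0, len(samples) - frame + 1, frame)]

def tokenize(waveform, acoustic_frame=4, semantic_frame=8):
    # Both branches emit continuous features (no discrete codebook):
    acoustic = frame_mean(waveform, acoustic_frame)   # finer, higher-rate
    semantic = frame_mean(waveform, semantic_frame)   # coarser summary
    return acoustic, semantic

wave = [float(i % 5) for i in range(32)]
acoustic, semantic = tokenize(wave)
# The acoustic stream has twice the frame rate of the semantic stream.
```

The point of the split is that the decoder can condition on a compact semantic stream for long-form coherence while the acoustic stream preserves the detail needed for high-fidelity synthesis.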
[Attn] New attn mask interface everywhere (#42848) 🚨 This one is quite breaking for very old models 🚨
- convert_rope_params_to_dict so it uses rope_theta from the config (#43766) by @hmellor
- AGENTS.md (#43763) by @tarekziade
- [Modular Dependencies] Fixup qwen rms norms (#43772) by @vasqu
- [Repo Consistency] Fix rms norm (#43803) by @vasqu
- check_model_inputs implementation (#43765) by @Cyrilvallez
- do_sample=False to qwen2_5_vl model tests to stablize the output (#43728) by @kaixuanliu
- [Jamba] Fallback to slow path and warn instead of error out (#43889) by @vasqu
- [fix] Use last_hidden_state key from get_image_features for llama4 (#43882) by @tomaarsen
- check_model_inputs into capture_outputs and merge_with_config_defaults + ensure correctness (#43862) by @Cyrilvallez
- _keys_to_ignore_on_load_missing for now (#43893) by @ArthurZucker
- input_embeds to inputs_embeds everywhere (#43916) by @Cyrilvallez
- image_url content support in apply_chat_template (#43786) by @kaixuanliu
- generate (#43734) by @zucchini-nlp
- run_*_no-trainer.py examples (#42769) by @casinca
- run_*_no-trainer.py examples (#43947) by @casinca
- out_features (#43886) by @zucchini-nlp
- get_number_of_image_tokens (#43948) by @zucchini-nlp
- other_workflow_run_ids for issue_comment in utils/notification_service.py (#44036) by @ydshieh

The following contributors have made significant changes to the library over the last release:
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.