Release notes
Published 2/21/2024
Minor release; contains breaking changes.

Gemma is a new open-source language model series from Google AI that comes in 2B and 7B variants. The release includes both pre-trained and instruction fine-tuned versions, and you can use them via AutoModelForCausalLM, GemmaForCausalLM, or the pipeline interface!
Read more about it in the Gemma release blogpost: https://hf.co/blog/gemma
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", torch_dtype=torch.float16)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
You can use the model with Flash Attention, SDPA, the static cache, and the quantization API for further optimizations!
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", device_map="auto", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
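SDPA follows the same pattern; a minimal sketch, assuming the same google/gemma-2b checkpoint and a CUDA device, where only the attn_implementation value changes:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
# attn_implementation="sdpa" selects PyTorch's scaled_dot_product_attention kernels
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", device_map="auto", torch_dtype=torch.float16, attn_implementation="sdpa"
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)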
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2b", device_map="auto", load_in_4bit=True
)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2b", device_map="auto"
)
model.generation_config.cache_implementation = "static"
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
The Depth Anything model was proposed in Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. Depth Anything is based on the DPT architecture, trained on ~62 million images, obtaining state-of-the-art results for both relative and absolute depth estimation.
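As a quick illustration, here is a minimal sketch using the depth-estimation pipeline; the LiheYoung/depth-anything-small-hf checkpoint name and the example image URL are assumptions, substitute the checkpoint you want to use:

from transformers import pipeline
from PIL import Image
import requests

# load a Depth Anything checkpoint through the depth-estimation pipeline (checkpoint name assumed)
depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# the pipeline returns a dict with the raw predicted depth tensor and a PIL image of the depth map
result = depth_estimator(image)
result["depth"].save("depth_map.png")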
StableLM 3B 4E1T was proposed in StableLM 3B 4E1T: Technical Report by Stability AI and is the first model in a series of multi-epoch pre-trained language models.
StableLM 3B 4E1T is a decoder-only base language model pre-trained on 1 trillion tokens of diverse English and code datasets for four epochs. The model architecture is transformer-based with partial Rotary Position Embeddings, SwiGLU activation, LayerNorm, etc.
The team also provides StableLM Zephyr 3B, an instruction fine-tuned version of the model that can be used for chat-based applications.
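A minimal sketch of running the base model, assuming the stabilityai/stablelm-3b-4e1t checkpoint on the Hub and a CUDA device:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")
model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-3b-4e1t", device_map="auto", torch_dtype=torch.float16
)

inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))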
A static past key-value cache allows LlamaForCausalLM's forward pass to be compiled with torch.compile!
This means that CUDA graphs can be used for inference, which speeds up the decoding step by 4x!
A forward pass of Llama 2 7B takes around 10.5 ms on an A100 with this enabled, matching TGI performance! ⚡️
- [Core generation] Adds support for static KV cache by @ArthurZucker in #27931
- [CLeanup] Revert SDPA attention changes that got in the static kv cache PR by @ArthurZucker in #29027

⚠️ Support for generate is not included yet. This feature is experimental and subject to changes in subsequent releases.
from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache
import torch
import os
# compilation triggers multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "true"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
device_map="auto",
torch_dtype=torch.float16
)
# set up the static cache in advance of using the model
model._setup_cache(StaticCache, max_batch_size=1, max_cache_len=128)
# trigger compilation!
compiled_model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
# run the model as usual
input_text = "A few facts about the universe: "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda").input_ids
model_outputs = compiled_model(input_ids)
HfQuantizer makes it easy for quantization method researchers and developers to add inference and/or quantization support in 🤗 transformers. If you are interested in adding support for new methods, please refer to this documentation page: https://huggingface.co/docs/transformers/main/en/hf_quantizer
- HfQuantizer class for quantization-related stuff in modeling_utils.py by @poedator in #26610
- [HfQuantizer] Move it to "Developper guides" by @younesbelkada in #28768
- [HFQuantizer] Remove check_packages_compatibility logic by @younesbelkada in #28789

AQLM is a new quantization method that enables 2-bit precision with no performance degradation. Check out this demo of how to run Mixtral in 2-bit on a free-tier Google Colab instance: https://huggingface.co/posts/ybelkada/434200761252287
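Running an AQLM model only requires the aqlm package and a checkpoint whose weights were quantized with it; a minimal sketch, where the repository id below is a placeholder rather than a real checkpoint name:

# requires the aqlm package to be installed (e.g. pip install aqlm)
from transformers import AutoTokenizer, AutoModelForCausalLM

# placeholder id: any Hub repository whose weights were quantized with AQLM
model_id = "<org>/<aqlm-2bit-checkpoint>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# quantization settings are read from the checkpoint's config, no extra arguments needed
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))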
The canonical repositories on the Hugging Face Hub (models that did not have an organization, like bert-base-cased) have been moved under organizations.
You can find the entire list of moved models here: https://huggingface.co/collections/julien-c/canonical-models-65ae66e29d5b422218567567
Redirection has been set up so that your code keeps working even if you continue calling the previous paths. We still encourage you to update your code to use the new links so that it is entirely future-proof.
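For example, bert-base-cased now lives under the google-bert organization; both forms below load the same weights thanks to the redirect, but the second one is the future-proof path:

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")              # old canonical path, still works via redirection
model = AutoModel.from_pretrained("google-bert/bert-base-cased")  # new organization-scoped path, preferred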
A Flax implementation of the Mistral model was added to the library.
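A minimal sketch of loading it in Flax, assuming the mistralai/Mistral-7B-v0.1 checkpoint; if the repository only hosts PyTorch weights, from_pt=True converts them on the fly (which requires PyTorch to be installed):

from transformers import AutoTokenizer, FlaxMistralForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# from_pt=True converts PyTorch weights to Flax when no Flax weights are hosted
model = FlaxMistralForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", from_pt=True)

inputs = tokenizer("My favourite condiment is", return_tensors="np")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))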
With Keras 3 becoming the standard version of Keras in TensorFlow 2.16, we've made some internal changes to maintain compatibility. We now have full compatibility with TF 2.16 as long as the tf-keras compatibility package is installed. We've also taken the opportunity to do some cleanup - in particular, the objects like BatchEncoding that are returned by our tokenizers and processors can now be directly passed to Keras methods like model.fit(), which should simplify a lot of code and eliminate a long-standing source of annoyances.
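A minimal sketch of what this enables, assuming a TensorFlow install with the tf-keras compatibility package and the bert-base-cased checkpoint; the tokenizer output is handed to fit() without any manual conversion:

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# BatchEncoding objects returned by the tokenizer can now be passed straight to Keras methods
batch = tokenizer(["great movie!", "terrible movie!"], padding=True, return_tensors="tf")
labels = tf.constant([1, 0])

# no loss specified: the model's internal loss computation is used
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5))
model.fit(batch, labels, epochs=1)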
Enable loading in pretrained backbones in a new model, where all other weights are randomly initialized. Note: validation checks are still in place when creating a config, so passing use_pretrained_backbone=True to the config constructor will raise an error. You can override this by setting config.use_pretrained_backbone = True after creating the config. However, this is not yet guaranteed to be fully backwards compatible.
from transformers import MaskFormerConfig, MaskFormerModel
config = MaskFormerConfig(
use_pretrained_backbone=False,
backbone="microsoft/resnet-18"
)
config.use_pretrained_backbone = True
# Both models have resnet-18 backbone weights and all other weights randomly
# initialized
model_1 = MaskFormerModel(config)
model_2 = MaskFormerModel(config)
Introduce a helper function load_backbone to load a backbone either from a backbone's model config, e.g. ResNetConfig, or from a model config which contains backbone information. This enables cleaner modeling files and cross-loading between timm and transformers backbones.
from transformers import ResNetConfig, MaskFormerConfig
from transformers.utils.backbone_utils import load_backbone
# Resnet defines the backbone model to load
config = ResNetConfig()
backbone = load_backbone(config)
# Maskformer config defines a model which uses a resnet backbone
config = MaskFormerConfig(use_timm_backbone=True, backbone="resnet18")
backbone = load_backbone(config)
config = MaskFormerConfig(backbone_config=ResNetConfig())
backbone = load_backbone(config)
- [Backbone] Use load_backbone instead of AutoBackbone.from_config by @amyeroberts in #28661
- Add in API references, list supported backbones, updated examples, clarification and moving information to better reflect usage and docs
- [Llava] Update convert_llava_weights_to_hf.py script by @isaac-vidas in #28617
- [GPTNeoX] Fix GPTNeoX + Flash Attention 2 issue by @younesbelkada in #28645
- [SigLIP] Only import tokenizer if sentencepiece available by @amyeroberts in #28636
- PartialState().default_device as it has been officially released by @statelesshz in #27256
- tensor_size - fix copy/paste error msg typo by @scruel in #28660
- CodeGenTokenizer by @cmathw in #28628
- GenerationConfig, now the generation_config.json can be loaded successfully by @ParadoxZW in #28604
- [chore] Add missing space in warning by @tomaarsen in #28695
- [Vilt] align input and model dtype in the ViltPatchEmbeddings forward pass by @faaany in #28633
- [docs] Improve visualization for vertical parallelism by @petergtz in #28583
- LocalEntryNotFoundError during processor_config.json loading by @ydshieh in #28709
- [docs] Update preprocessing.md by @velaia in #28719
- weights_only by @ydshieh in #28725
- GatedRepoError to use cache file (fix #28558). by @scruel in #28566
- [Siglip] protect from imports if sentencepiece not installed by @amyeroberts in #28737
- DepthEstimationPipeline's docstring by @ydshieh in #28733
- Block. by @xkszltl in #28727
- load_in_8bit and load_in_4bit at the same time by @osanseviero in #28266
- [bnb] Fix bnb slow tests by @younesbelkada in #28788
- torch.arange dtype on float usage to avoid incorrect initialization by @gante in #28760
- is_torch_bf16_available_on_device more strict by @ydshieh in #28796
- -v for pytest on CircleCI by @ydshieh in #28840
- test_encoder_decoder_model_generate for vision_encoder_deocder as flaky by @amyeroberts in #28842
- [Doc] update contribution guidelines by @ArthurZucker in #28858
- save_only_model with load_best_model_at_end for DeepSpeed/FSDP by @pacman100 in #28866
- FastSpeech2ConformerModelTest and skip it on CPU by @ydshieh in #28888
- torchaudio get the correct version in torch_and_flax_job by @ydshieh in #28899
- logging_first_step by removing "evaluate" by @Sai-Suraj-27 in #28884
- Exception when trying to generate 0 tokens ⚠️ by @danielkorat in #28621
- torch_dtype as str to actual torch data type (i.e. "float16" …to torch.float16) by @KossaiSbai in #28208
- [pipelines] updated docstring with vqa alias by @cmahmut in #28951
- test_save_load_fast_init_from_base as flaky by @gante in #28930
- [NllbTokenizer] refactor with added tokens decoder by @ArthurZucker in #27717
- [DETR] Update the processing to adapt masks & bboxes to reflect padding by @amyeroberts in #28363
- quantization_config is in config but not passed as an arg by @younesbelkada in #28988
- [AutoQuantizer]: enhance trainer + not supported quant methods by @younesbelkada in #28991
- [Doc] Fix docbuilder - make BackboneMixin and BackboneConfigMixin importable from utils. by @amyeroberts in #29002
- test_trainer to float32 by @statelesshz in #28920
- [Trainer / tags]: Fix trainer + tags when users do not pass "tags" to trainer.push_to_hub() by @younesbelkada in #29009
- logger.warning + inline with recent refactor by @younesbelkada in #29039
- test_save_load_low_cpu_mem_usage tests by @amyeroberts in #29043
- generation/utils.py::GenerateEncoderDecoderOutput's docstring by @sadra-barikbin in #29044
- auto_find_batch_size isn't yet supported with DeepSpeed/FSDP. Raise error accordingly. by @pacman100 in #29058
- [Awq] Add peft support for AWQ by @younesbelkada in #28987
- [bnb / tests]: Fix currently failing bnb tests by @younesbelkada in #29092
- bert-base-cased tokenizer configuration test by @LysandreJik in #29105
- examples/pytorch/text-classification/run_classification.py by @Ja1Zhou in #29072
- pipelines/base.py::Pipeline::_sanitize_parameters()'s docstring by @sadra-barikbin in #29102
- [gradient_checkpointing] default to use it for torch 2.3 by @ArthurZucker in #28538
- [Trainer / bnb]: Add RMSProp from bitsandbytes to HF Trainer by @younesbelkada in #29082
- [bnb / tests] Propagate the changes from #29092 to 4-bit tests by @younesbelkada in #29122
- [cuda kernels] only compile them when initializing by @ArthurZucker in #29133
- [PEFT / Trainer] Handle better peft + quantized compiled models by @younesbelkada in #29055
- [Core tokenization] add_dummy_prefix_space option to help with latest issues by @ArthurZucker in #28010
- [pipeline] Add pool option to image feature extraction pipeline by @amyeroberts in #28985

The following contributors have made significant changes to the library over the last release:
- HfQuantizer class for quantization-related stuff in modeling_utils.py (#26610)
- StableLM (#28810)