release notes
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Published 7/23/2024
Minor. Contains breaking changes.
The Llama 3.1 models are released by Meta and come in three flavours: 8B, 70B, and 405B.
To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.
We release a repository of Llama recipes to showcase usage for inference and for full and partial fine-tuning of the different variants.
The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by META AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including in an interleaved format, and generates a textual response.
The ZoeDepth model was proposed in ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller. ZoeDepth extends the DPT framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head is used with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.
Hiera was proposed in Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.
The paper introduces “Hiera,” a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising on accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed “bells-and-whistles,” are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed both in inference and training across various image and video recognition tasks. The approach suggests that spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity.
Our ReactAgent has a specific way to return its final output: it calls the tool final_answer, added to the user-defined toolbox upon agent initialization, with the answer as the tool argument. We found that even for a one-shot agent like CodeAgent, using a specific final_answer tool helps the llm_engine find what to return, so we generalized the final_answer tool to all agents.
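The final_answer convention described above can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the library's agent code: `run_agent`, `fake_llm`, and the `double` tool are hypothetical stand-ins, and the "model" is a scripted function rather than a real llm_engine.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a tool call is (tool_name, argument).
ToolCall = Tuple[str, object]


def final_answer(answer: object) -> object:
    """Tool whose only job is to carry the answer out of the loop."""
    return answer


def run_agent(
    llm: Callable[[List[str]], ToolCall],
    user_tools: Dict[str, Callable[[object], object]],
    max_steps: int = 5,
) -> object:
    # final_answer is added to the user-defined toolbox at initialization.
    toolbox = dict(user_tools)
    toolbox["final_answer"] = final_answer
    history: List[str] = []
    for _ in range(max_steps):
        name, arg = llm(history)
        result = toolbox[name](arg)
        if name == "final_answer":
            # Calling final_answer is the single, explicit way to stop.
            return result
        history.append(f"{name}({arg!r}) -> {result!r}")
    raise RuntimeError("agent did not call final_answer")


def fake_llm(history: List[str]) -> ToolCall:
    # Scripted "model": use a tool once, then return the answer.
    if not history:
        return ("double", 21)
    return ("final_answer", 42)


answer = run_agent(fake_llm, {"double": lambda x: x * 2})
print(answer)  # 42
```

Because the loop has one explicit exit, the same stopping rule works for a multi-step agent and for a one-shot agent that calls final_answer on its first step.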
Now if your code-based agent (like ReactCodeAgent) defines a function at step 1, it will remember the function definition indefinitely. This means your agent can create its own tools for later re-use!
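The idea of keeping definitions alive between steps can be sketched with a shared namespace. This is a simplified illustration that uses Python's built-in `exec` as a stand-in; the library's own code interpreter is assumed to work differently, and `run_steps` is a hypothetical helper.

```python
from typing import Dict, List


def run_steps(code_snippets: List[str]) -> Dict[str, object]:
    # One namespace is kept for the whole run, so anything defined
    # at an early step is still visible at later steps.
    state: Dict[str, object] = {}
    for snippet in code_snippets:
        exec(snippet, state)
    return state


steps = [
    "def square(x):\n    return x * x",  # step 1 defines a helper
    "result = square(7)",                # step 2 re-uses it
]
state = run_steps(steps)
print(state["result"])  # 49
```

If each step ran in a fresh namespace instead, the second snippet would fail with a NameError, which is the behaviour this change removes.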
This is a transformative PR: it allows the agent to regularly run a dedicated step for planning its actions in advance. This is activated if you set an int for planning_interval upon agent initialization. At step 0, a first plan is made. At later steps (like steps 3, 6, 9 if you set planning_interval=3), the agent updates this plan based on the history of previous steps. More detail soon!
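The schedule described above can be written down as a small function. This is a sketch that assumes planning runs whenever the step index is a multiple of planning_interval, which matches the 0, 3, 6, 9 example; `planning_steps` is a hypothetical helper, not part of the library.

```python
from typing import List, Optional


def planning_steps(max_steps: int, planning_interval: Optional[int]) -> List[int]:
    """Return the step indices at which a planning step would run."""
    if planning_interval is None:
        # Planning is disabled unless an int is provided.
        return []
    return [step for step in range(max_steps) if step % planning_interval == 0]


print(planning_steps(10, 3))     # [0, 3, 6, 9]
print(planning_steps(10, None))  # []
```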
A significant RoPE refactor was done to make it model agnostic and more easily adaptable to any architecture. It is only applied to Llama for now but will be applied to all models using RoPE over the coming days.
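For readers unfamiliar with RoPE, the rotation itself can be sketched in a few lines of plain Python. This follows the standard rotary position embedding formulation with the "rotate half" pairing used by Llama-style models; the base of 10000 and the function names are assumptions for illustration, and this is not the refactored library code.

```python
import math
from typing import List


def rope_inv_freq(head_dim: int, base: float = 10000.0) -> List[float]:
    # One inverse frequency per rotated pair of dimensions.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate a single head vector according to its token position."""
    head_dim = len(vec)
    half = head_dim // 2
    out = [0.0] * head_dim
    for i, freq in enumerate(rope_inv_freq(head_dim, base)):
        angle = position * freq
        cos, sin = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]  # dimension i is paired with i + half
        out[i] = x1 * cos - x2 * sin
        out[i + half] = x1 * sin + x2 * cos
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The attention score depends only on the relative offset (2 in both cases).
print(dot(apply_rope(q, 5), apply_rope(k, 3)))
print(dot(apply_rope(q, 2), apply_rope(k, 0)))
```

Nothing in the rotation depends on the rest of the model, which is why the computation can in principle be shared across architectures.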
🚨🚨 This PR changes the code to rely on the tokenizer's defaults when these flags are unset. Previously, TextGenerationPipeline did not add a <bos> token by default for some models, which negatively impacted their performance. In practice, this is a breaking change.
Example of a script changed as a result of this PR:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Foo bar"))
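The rule "use the tokenizer's default when a flag is unset" can be sketched as follows. This is a toy illustration, not the pipeline's code: `resolve_flag` is a hypothetical helper, and the choice of a single boolean flag (for example, whether special tokens such as <bos> are added) is an assumption.

```python
from typing import Optional


def resolve_flag(user_value: Optional[bool], tokenizer_default: bool) -> bool:
    """Use the caller's value if given, otherwise defer to the tokenizer."""
    return tokenizer_default if user_value is None else user_value


print(resolve_flag(None, True))   # True: unset, so the tokenizer default wins
print(resolve_flag(False, True))  # False: an explicit user choice wins
```

Under this rule, a script that never set the flag can see different tokenization after upgrading, which is why the change is flagged as breaking.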
get_seq_length method by @sanchit-gandhi in #31661
keras-nlp<0.14 pin by @gante in #31684
tets/test_xxx_utils.py) to tests/utils by @ydshieh in #31730
pytest_num_workers=4 for some CircleCI jobs by @ydshieh in #31764
sdpa support for SigLIP by @qubvel in #31499
TFBlipModelTest::test_pipeline_image_to_text by @ydshieh in #31827
TrainingArguments by @andstor in #31812
vocab_size in other two VLMs by @zucchini-nlp in #31681
.generate() by @voidism in #29619
_init_weights for ResNetPreTrainedModel by @ydshieh in #31851
_init_weights for ResNetPreTrainedModel" by @ydshieh in #31868
duplicate field definitions in some classes by @Sai-Suraj-27 in #31888
push_to_hub=True in TrainingArguments by @SunMarc in #31808
warnings in a with block to avoid flaky tests by @ydshieh in #31893
[ConvertSlow] make sure the order is preserved for addedtokens by @ArthurZucker in #31902
[Gemma2] Support FA2 softcapping by @ArthurZucker in #31887
1st argument name in classmethods by @Sai-Suraj-27 in #31907
SlidingWindowCache.reset() by @gante in #31917
Trainer.get_optimizer_cls_and_kwargs to be overridden by @apoorvkh in #31875
GenerationMixin.generate compatibility with pytorch profiler by @fxmarty in #31935
Cache and cache_position being default by @gante in #31898
sigmoid_focal_loss() function call by @Sai-Suraj-27 in #31951
logits_warper update in models with custom generate fn by @gante in #31957
create_repo() function call by @Sai-Suraj-27 in #31947
test_stage3_nvme_offload by @faaany in #31881
src/transformers/__init__.py by @Sai-Suraj-27 in #31993
log messages that are resulting in TypeError due to too many arguments by @Sai-Suraj-27 in #32017
SeamlessM4Tv2ConformerEncoderLayer.forward() when gradient checkpointing is enabled by @anferico in #31945
sdpa and FA2 for CLIP by @qubvel in #31940
numpy<2.0 by @ydshieh in #32018
head_dim through config (and do not require head_dim * num_heads == hidden_size) by @xenova in #32050
duplicate entries in a dictionary by @Sai-Suraj-27 in #32041
huggingface_hub 0.24 by @Wauplin in #32054
mktemp() function by @Sai-Suraj-27 in #32123
ko/_toctree.yml and remove custom_tools.md to reflect latest changes by @jungnerd in #31969
TypeError instead of ValueError for invalid type by @Sai-Suraj-27 in #32111
trust_remote_code when loading Libri Dummy by @sanchit-gandhi in #31748
GPTNeoX and GPT2 by @vasqu in #31944
The following contributors have made significant changes to the library over the last release:
.generate() (#29619)