From 4a65b5752c2023271eb11822802a167022b01b89 Mon Sep 17 00:00:00 2001 From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com> Date: Wed, 6 Nov 2024 21:59:31 +0000 Subject: [PATCH] Docs update (0252da9) --- docs/source/en/chat_templating.md | 472 ++++++++++----------- docs/source/en/conversations.md | 48 ++- docs/source/en/model_doc/idefics2.md | 59 ++- docs/source/en/model_doc/llava_next.md | 126 +++--- docs/source/en/model_doc/mllama.md | 44 +- docs/source/en/model_summary.md | 3 + docs/source/en/pipeline_tutorial.md | 69 ++- docs/source/en/tasks/image_text_to_text.md | 76 ++-- docs/source/en/tasks/video_text_to_text.md | 28 +- docs/source/es/chat_templating.md | 101 ++--- docs/source/fr/quicktour.md | 124 +++--- docs/source/ja/pipeline_tutorial.md | 36 +- docs/source/ko/pipeline_tutorial.md | 76 +++- 13 files changed, 735 insertions(+), 527 deletions(-) diff --git a/docs/source/en/chat_templating.md b/docs/source/en/chat_templating.md index 1bdf05a26c8d..db6054abf812 100644 --- a/docs/source/en/chat_templating.md +++ b/docs/source/en/chat_templating.md @@ -62,7 +62,7 @@ with totally different chat formats. Without chat templates, you would have to w model, and it's very easy to make minor errors that hurt performance! Chat templates handle the details of formatting for you, allowing you to write universal code that works for any model. - +With the introduction of the `ImageTextToTextPipeline`, chat templates can now also handle multi-modal inputs, where messages can include both text and images. This allows for more complex interactions, such as visual question answering or image-based text generation, using the same chat templating system. ## How do I use chat templates? As you can see in the example above, chat templates are easy to use. Simply build a list of messages, with `role` @@ -80,28 +80,28 @@ tokenizer = AutoTokenizer.from_pretrained(checkpoint) model = AutoModelForCausalLM.from_pretrained(checkpoint) # You may want to use bfloat16 and/or move to GPU here messages = [ - { - "role": "system", - "content": "You are a friendly chatbot who always responds in the style of a pirate", - }, - {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, - ] +{ +"role": "system", +"content": "You are a friendly chatbot who always responds in the style of a pirate", +}, +{"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, +] tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") print(tokenizer.decode(tokenized_chat[0])) ``` This will yield a string in the input format that Zephyr expects. ```text <|system|> -You are a friendly chatbot who always responds in the style of a pirate +You are a friendly chatbot who always responds in the style of a pirate <|user|> -How many helicopters can a human eat in one sitting? +How many helicopters can a human eat in one sitting? <|assistant|> ``` Now that our input is formatted correctly for Zephyr, we can use the model to generate a response to the user's question: ```python -outputs = model.generate(tokenized_chat, max_new_tokens=128) +outputs = model.generate(tokenized_chat, max_new_tokens=128) print(tokenizer.decode(outputs[0])) ``` @@ -109,9 +109,9 @@ This will yield: ```text <|system|> -You are a friendly chatbot who always responds in the style of a pirate +You are a friendly chatbot who always responds in the style of a pirate <|user|> -How many helicopters can a human eat in one sitting? 
+How many helicopters can a human eat in one sitting? <|assistant|> Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all. ``` @@ -130,11 +130,11 @@ from transformers import pipeline pipe = pipeline("text-generation", "HuggingFaceH4/zephyr-7b-beta") messages = [ - { - "role": "system", - "content": "You are a friendly chatbot who always responds in the style of a pirate", - }, - {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, +{ +"role": "system", +"content": "You are a friendly chatbot who always responds in the style of a pirate", +}, +{"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, ] print(pipe(messages, max_new_tokens=128)[0]['generated_text'][-1]) # Print the assistant's response ``` @@ -153,9 +153,9 @@ the template to add tokens that indicate the start of a bot response. For exampl ```python messages = [ - {"role": "user", "content": "Hi there!"}, - {"role": "assistant", "content": "Nice to meet you!"}, - {"role": "user", "content": "Can I ask a question?"} +{"role": "user", "content": "Hi there!"}, +{"role": "assistant", "content": "Nice to meet you!"}, +{"role": "user", "content": "Can I ask a question?"} ] ``` @@ -198,48 +198,32 @@ effect that `add_generation_prompt` has will depend on the template being used. ## What does "continue_final_message" do? -When passing a list of messages to `apply_chat_template` or `TextGenerationPipeline`, you can choose -to format the chat so the model will continue the final message in the chat instead of starting a new one. This is done -by removing any end-of-sequence tokens that indicate the end of the final message, so that the model will simply -extend the final message when it begins to generate text. This is useful for "prefilling" the model's response. +When passing a list of messages to `apply_chat_template`, `TextGenerationPipeline`, or `ImageTextToTextPipeline`, you can choose to format the chat so the model will continue the final message in the chat instead of starting a new one. This is done by removing any end-of-sequence tokens that indicate the end of the final message, so that the model will simply extend the final message when it begins to generate text. This is useful for "prefilling" the model's response. Here's an example: ```python chat = [ - {"role": "user", "content": "Can you format the answer in JSON?"}, - {"role": "assistant", "content": '{"name": "'}, +{"role": "user", "content": "Can you format the answer in JSON?"}, +{"role": "assistant", "content": '{"name": "'}, ] formatted_chat = tokenizer.apply_chat_template(chat, tokenize=True, return_dict=True, continue_final_message=True) model.generate(**formatted_chat) ``` -The model will generate text that continues the JSON string, rather than starting a new message. This approach -can be very useful for improving the accuracy of the model's instruction-following when you know how you want -it to start its replies. +The model will generate text that continues the JSON string, rather than starting a new message. This approach can be very useful for improving the accuracy of the model's instruction-following when you know how you want it to start its replies. 
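+
+If you prefer to work at the pipeline level, the same prefill pattern can be expressed there too. The snippet below is a minimal sketch that reuses the Zephyr checkpoint from the earlier examples; as described further down, the pipeline will also infer `continue_final_message=True` on its own when the final message has the "assistant" role, so passing it explicitly is optional.
+
+```python
+from transformers import pipeline
+
+pipe = pipeline("text-generation", "HuggingFaceH4/zephyr-7b-beta")
+chat = [
+    {"role": "user", "content": "Can you format the answer in JSON?"},
+    # The trailing assistant message acts as a prefill that the model will extend
+    {"role": "assistant", "content": '{"name": "'},
+]
+# Passing continue_final_message here is a sketch of the explicit override described below;
+# the pipeline would normally infer it from the trailing assistant message.
+out = pipe(chat, max_new_tokens=64, continue_final_message=True)
+print(out[0]["generated_text"][-1]["content"])
+```
+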
-Because `add_generation_prompt` adds the tokens that start a new message, and `continue_final_message` removes any -end-of-message tokens from the final message, it does not make sense to use them together. As a result, you'll -get an error if you try! +Because `add_generation_prompt` adds the tokens that start a new message, and `continue_final_message` removes any end-of-message tokens from the final message, it does not make sense to use them together. As a result, you'll get an error if you try! -The default behaviour of `TextGenerationPipeline` is to set `add_generation_prompt=True` so that it starts a new -message. However, if the final message in the input chat has the "assistant" role, it will assume that this message is -a prefill and switch to `continue_final_message=True` instead, because most models do not support multiple -consecutive assistant messages. You can override this behaviour by explicitly passing the `continue_final_message` -argument when calling the pipeline. +The default behaviour of `TextGenerationPipeline` and `ImageTextToTextPipeline` is to set `add_generation_prompt=True` so that it starts a new message. However, if the final message in the input chat has the "assistant" role, it will assume that this message is a prefill and switch to `continue_final_message=True` instead, because most models do not support multiple consecutive assistant messages. You can override this behaviour by explicitly passing the `continue_final_message` argument when calling the pipeline. - ## Can I use chat templates in training? -Yes! This is a good way to ensure that the chat template matches the tokens the model sees during training. -We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you -can simply continue like any other language model training task. When training, you should usually set -`add_generation_prompt=False`, because the added tokens to prompt an assistant response will not be helpful during -training. Let's see an example: +Yes! This is a good way to ensure that the chat template matches the tokens the model sees during training. We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you can simply continue like any other language model training task. When training, you should usually set `add_generation_prompt=False`, because the added tokens to prompt an assistant response will not be helpful during training. Let's see an example: ```python from transformers import AutoTokenizer @@ -248,12 +232,12 @@ from datasets import Dataset tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta") chat1 = [ - {"role": "user", "content": "Which is bigger, the moon or the sun?"}, - {"role": "assistant", "content": "The sun."} +{"role": "user", "content": "Which is bigger, the moon or the sun?"}, +{"role": "assistant", "content": "The sun."} ] chat2 = [ - {"role": "user", "content": "Which is bigger, a virus or a bacterium?"}, - {"role": "assistant", "content": "A bacterium."} +{"role": "user", "content": "Which is bigger, a virus or a bacterium?"}, +{"role": "assistant", "content": "A bacterium."} ] dataset = Dataset.from_dict({"chat": [chat1, chat2]}) @@ -272,15 +256,17 @@ From here, just continue training like you would with a standard language modell -By default, some tokenizers add special tokens like `` and `` to text they tokenize. 
Chat templates should -already include all the special tokens they need, and so additional special tokens will often be incorrect or -duplicated, which will hurt model performance. +By default, some tokenizers add special tokens like `` and `` to text they tokenize. Chat templates should already include all the special tokens they need, and so additional special tokens will often be incorrect or duplicated, which will hurt model performance. -Therefore, if you format text with `apply_chat_template(tokenize=False)`, you should set the argument -`add_special_tokens=False` when you tokenize that text later. If you use `apply_chat_template(tokenize=True)`, you don't need to worry about this! +Therefore, if you format text with `apply_chat_template(tokenize=False)`, you should set the argument `add_special_tokens=False` when you tokenize that text later. If you use `apply_chat_template(tokenize=True)`, you don't need to worry about this! + + +If you are using the new `ImageTextToTextPipeline`, ensure that your chat templates are compatible with both text and image inputs. This will help maintain consistency in the tokens seen by the model during training and inference. + + ## Advanced: Extra inputs to chat templates The only argument that `apply_chat_template` requires is `messages`. However, you can pass any keyword @@ -303,24 +289,24 @@ to a tool-use model, you can simply pass a list of functions to the `tools` argu import datetime def current_time(): - """Get the current local time as a string.""" - return str(datetime.now()) +"""Get the current local time as a string.""" +return str(datetime.now()) def multiply(a: float, b: float): - """ - A function that multiplies two numbers - - Args: - a: The first number to multiply - b: The second number to multiply - """ - return a * b +""" +A function that multiplies two numbers + +Args: +a: The first number to multiply +b: The second number to multiply +""" +return a * b tools = [current_time, multiply] model_input = tokenizer.apply_chat_template( - messages, - tools=tools +messages, +tools=tools ) ``` @@ -370,27 +356,27 @@ Next, let's define a list of tools: ```python def get_current_temperature(location: str, unit: str) -> float: - """ - Get the current temperature at a location. - - Args: - location: The location to get the temperature for, in the format "City, Country" - unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"]) - Returns: - The current temperature at the specified location in the specified units, as a float. - """ - return 22. # A real function should probably actually get the temperature! +""" +Get the current temperature at a location. + +Args: +location: The location to get the temperature for, in the format "City, Country" +unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"]) +Returns: +The current temperature at the specified location in the specified units, as a float. +""" +return 22. # A real function should probably actually get the temperature! def get_current_wind_speed(location: str) -> float: - """ - Get the current wind speed in km/h at a given location. - - Args: - location: The location to get the temperature for, in the format "City, Country" - Returns: - The current wind speed at the given location in km/h, as a float. - """ - return 6. # A real function should probably actually get the wind speed! +""" +Get the current wind speed in km/h at a given location. 
+ +Args: +location: The location to get the temperature for, in the format "City, Country" +Returns: +The current wind speed at the given location in km/h, as a float. +""" +return 6. # A real function should probably actually get the wind speed! tools = [get_current_temperature, get_current_wind_speed] ``` @@ -399,8 +385,8 @@ Now, let's set up a conversation for our bot: ```python messages = [ - {"role": "system", "content": "You are a bot that responds to weather queries. You should reply with the unit used in the queried location."}, - {"role": "user", "content": "Hey, what's the temperature in Paris right now?"} +{"role": "system", "content": "You are a bot that responds to weather queries. You should reply with the unit used in the queried location."}, +{"role": "user", "content": "Hey, what's the temperature in Paris right now?"} ] ``` @@ -446,6 +432,21 @@ messages.append({"role": "assistant", "tool_calls": [{"type": "function", "funct If you're familiar with the OpenAI API, you should pay attention to an important difference here - the `tool_call` is a dict, but in the OpenAI API it's a JSON string. Passing a string may cause errors or strange model behaviour! + + + +Next, let's append the model's tool call to the conversation. + +```python +tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}} +messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]}) +``` + + + +If you're familiar with the OpenAI API, you should pay attention to an important difference here - the `tool_call` is +a dict, but in the OpenAI API it's a JSON string. Passing a string may cause errors or strange model behaviour! + Now that we've added the tool call to the conversation, we can call the function and append the result to the @@ -495,7 +496,6 @@ The current temperature in Paris, France is 22.0 ° Celsius.<|im_end|> Although this was a simple demo with dummy tools and a single call, the same technique works with multiple real tools and longer conversations. This can be a powerful way to extend the capabilities of conversational agents with real-time information, computational tools like calculators, or access to large databases. - ### Understanding tool schemas Each function you pass to the `tools` argument of `apply_chat_template` is converted into a @@ -514,14 +514,14 @@ you can handle the conversion manually. 
Here is an example of a manual schema co from transformers.utils import get_json_schema def multiply(a: float, b: float): - """ - A function that multiplies two numbers - - Args: - a: The first number to multiply - b: The second number to multiply - """ - return a * b +""" +A function that multiplies two numbers + +Args: +a: The first number to multiply +b: The second number to multiply +""" +return a * b schema = get_json_schema(multiply) print(schema) @@ -531,25 +531,25 @@ This will yield: ```json { - "type": "function", - "function": { - "name": "multiply", - "description": "A function that multiplies two numbers", - "parameters": { - "type": "object", - "properties": { - "a": { - "type": "number", - "description": "The first number to multiply" - }, - "b": { - "type": "number", - "description": "The second number to multiply" - } - }, - "required": ["a", "b"] - } - } +"type": "function", +"function": { +"name": "multiply", +"description": "A function that multiplies two numbers", +"parameters": { +"type": "object", +"properties": { +"a": { +"type": "number", +"description": "The first number to multiply" +}, +"b": { +"type": "number", +"description": "The second number to multiply" +} +}, +"required": ["a", "b"] +} +} } ``` @@ -565,42 +565,42 @@ Here is an example of defining schemas by hand, and passing them directly to `ap ```python # A simple function that takes no arguments current_time = { - "type": "function", - "function": { - "name": "current_time", - "description": "Get the current local time as a string.", - "parameters": { - 'type': 'object', - 'properties': {} - } - } +"type": "function", +"function": { +"name": "current_time", +"description": "Get the current local time as a string.", +"parameters": { +'type': 'object', +'properties': {} +} +} } # A more complete function that takes two numerical arguments multiply = { - 'type': 'function', - 'function': { - 'name': 'multiply', - 'description': 'A function that multiplies two numbers', - 'parameters': { - 'type': 'object', - 'properties': { - 'a': { - 'type': 'number', - 'description': 'The first number to multiply' - }, - 'b': { - 'type': 'number', 'description': 'The second number to multiply' - } - }, - 'required': ['a', 'b'] - } - } +'type': 'function', +'function': { +'name': 'multiply', +'description': 'A function that multiplies two numbers', +'parameters': { +'type': 'object', +'properties': { +'a': { +'type': 'number', +'description': 'The first number to multiply' +}, +'b': { +'type': 'number', 'description': 'The second number to multiply' +} +}, +'required': ['a', 'b'] +} +} } model_input = tokenizer.apply_chat_template( - messages, - tools = [current_time, multiply] +messages, +tools = [current_time, multiply] ) ``` @@ -626,37 +626,37 @@ device = model.device # Get the device the model is loaded on # Define conversation input conversation = [ - {"role": "user", "content": "What has Man always dreamed of?"} +{"role": "user", "content": "What has Man always dreamed of?"} ] # Define documents for retrieval-based generation documents = [ - { - "title": "The Moon: Our Age-Old Foe", - "text": "Man has always dreamed of destroying the moon. In this essay, I shall..." - }, - { - "title": "The Sun: Our Age-Old Friend", - "text": "Although often underappreciated, the sun provides several notable benefits..." - } +{ +"title": "The Moon: Our Age-Old Foe", +"text": "Man has always dreamed of destroying the moon. In this essay, I shall..." 
+}, +{ +"title": "The Sun: Our Age-Old Friend", +"text": "Although often underappreciated, the sun provides several notable benefits..." +} ] # Tokenize conversation and documents using a RAG template, returning PyTorch tensors. input_ids = tokenizer.apply_chat_template( - conversation=conversation, - documents=documents, - chat_template="rag", - tokenize=True, - add_generation_prompt=True, - return_tensors="pt").to(device) - -# Generate a response +conversation=conversation, +documents=documents, +chat_template="rag", +tokenize=True, +add_generation_prompt=True, +return_tensors="pt").to(device) + +# Generate a response gen_tokens = model.generate( - input_ids, - max_new_tokens=100, - do_sample=True, - temperature=0.3, - ) +input_ids, +max_new_tokens=100, +do_sample=True, +temperature=0.3, +) # Decode and print the generated text along with generation prompt gen_text = tokenizer.decode(gen_tokens[0]) @@ -683,11 +683,11 @@ one is a little simplified from the actual one! ``` {%- for message in messages %} - {{- '<|' + message['role'] + |>\n' }} - {{- message['content'] + eos_token }} +{{- '<|' + message['role'] + |>\n' }} +{{- message['content'] + eos_token }} {%- endfor %} {%- if add_generation_prompt %} - {{- '<|assistant|>\n' }} +{{- '<|assistant|>\n' }} {%- endif %} ``` @@ -697,10 +697,10 @@ syntax resembles Python. In pure Python, this template would look something like ```python for message in messages: - print(f'<|{message["role"]}|>') - print(message['content'] + eos_token) +print(f'<|{message["role"]}|>') +print(message['content'] + eos_token) if add_generation_prompt: - print('<|assistant|>') +print('<|assistant|>') ``` Effectively, the template does three things: @@ -716,13 +716,13 @@ in your actual code!) ``` {%- for message in messages %} - {%- if message['role'] == 'user' %} - {{- bos_token + '[INST] ' + message['content'] + ' [/INST]' }} - {%- elif message['role'] == 'system' %} - {{- '<>\\n' + message['content'] + '\\n<>\\n\\n' }} - {%- elif message['role'] == 'assistant' %} - {{- ' ' + message['content'] + ' ' + eos_token }} - {%- endif %} +{%- if message['role'] == 'user' %} +{{- bos_token + '[INST] ' + message['content'] + ' [/INST]' }} +{%- elif message['role'] == 'system' %} +{{- '<>\\n' + message['content'] + '\\n<>\\n\\n' }} +{%- elif message['role'] == 'assistant' %} +{{- ' ' + message['content'] + ' ' + eos_token }} +{%- endif %} {%- endfor %} ``` @@ -740,13 +740,13 @@ above and add "[ASST]" and "[/ASST]" to assistant messages: ``` {%- for message in messages %} - {%- if message['role'] == 'user' %} - {{- bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }} - {%- elif message['role'] == 'system' %} - {{- '<>\\n' + message['content'].strip() + '\\n<>\\n\\n' }} - {%- elif message['role'] == 'assistant' %} - {{- '[ASST] ' + message['content'] + ' [/ASST]' + eos_token }} - {%- endif %} +{%- if message['role'] == 'user' %} +{{- bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }} +{%- elif message['role'] == 'system' %} +{{- '<>\\n' + message['content'].strip() + '\\n<>\\n\\n' }} +{%- elif message['role'] == 'assistant' %} +{{- '[ASST] ' + message['content'] + ' [/ASST]' + eos_token }} +{%- endif %} {%- endfor %} ``` @@ -807,7 +807,7 @@ It looks like this: ``` {%- for message in messages %} - {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }} +{{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }} {%- endfor %} ``` @@ -879,7 +879,7 @@ your templates like this: ``` {%- for 
message in messages %} - {{- message['role'] + message['content'] }} +{{- message['role'] + message['content'] }} {%- endfor %} ``` @@ -887,7 +887,7 @@ rather than like this: ``` {% for message in messages %} - {{ message['role'] + message['content'] }} +{{ message['role'] + message['content'] }} {% endfor %} ``` @@ -954,10 +954,10 @@ Here is an example of a template that formats messages ChatML-style, with genera ```text {{- bos_token }} {%- for message in messages %} - {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }} +{{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }} {%- endfor %} {%- if add_generation_prompt %} - {{- '<|im_start|>assistant\n' }} +{{- '<|im_start|>assistant\n' }} {%- endif %} ``` @@ -1013,25 +1013,25 @@ a sample tool JSON schema: ```json { - "type": "function", - "function": { - "name": "multiply", - "description": "A function that multiplies two numbers", - "parameters": { - "type": "object", - "properties": { - "a": { - "type": "number", - "description": "The first number to multiply" - }, - "b": { - "type": "number", - "description": "The second number to multiply" - } - }, - "required": ["a", "b"] - } - } +"type": "function", +"function": { +"name": "multiply", +"description": "A function that multiplies two numbers", +"parameters": { +"type": "object", +"properties": { +"a": { +"type": "number", +"description": "The first number to multiply" +}, +"b": { +"type": "number", +"description": "The second number to multiply" +} +}, +"required": ["a", "b"] +} +} } ``` @@ -1040,13 +1040,13 @@ specific format - your model will probably need different formatting! ```text {%- if tools %} - {%- for tool in tools %} - {{- '' + tool['function']['name'] + '\n' }} - {%- for argument in tool['function']['parameters']['properties'] %} - {{- argument + ': ' + tool['function']['parameters']['properties'][argument]['description'] + '\n' }} - {%- endfor %} - {{- '\n' }} - {%- endif %} +{%- for tool in tools %} +{{- '' + tool['function']['name'] + '\n' }} +{%- for argument in tool['function']['parameters']['properties'] %} +{{- argument + ': ' + tool['function']['parameters']['properties'][argument]['description'] + '\n' }} +{%- endfor %} +{{- '\n' }} +{%- endif %} {%- endif %} ``` @@ -1064,19 +1064,19 @@ the list will usually only have a single element. 
Here is a sample message dict ```json { - "role": "assistant", - "tool_calls": [ - { - "type": "function", - "function": { - "name": "multiply", - "arguments": { - "a": 5, - "b": 6 - } - } - } - ] +"role": "assistant", +"tool_calls": [ +{ +"type": "function", +"function": { +"name": "multiply", +"arguments": { +"a": 5, +"b": 6 +} +} +} +] } ``` @@ -1084,10 +1084,10 @@ And a common pattern for handling them would be something like this: ```text {%- if message['role'] == 'assistant' and 'tool_calls' in message %} - {%- for tool_call in message['tool_calls'] %} - {{- '' + tool_call['function']['name'] + '\n' + tool_call['function']['arguments']|tojson + '\n' }} - {%- endif %} - {%- endfor %} +{%- for tool_call in message['tool_calls'] %} +{{- '' + tool_call['function']['name'] + '\n' + tool_call['function']['arguments']|tojson + '\n' }} +{%- endif %} +{%- endfor %} {%- endif %} ``` @@ -1100,9 +1100,9 @@ of the called function, and a "content" key containing the result of the tool ca ```json { - "role": "tool", - "name": "multiply", - "content": "30" +"role": "tool", +"name": "multiply", +"content": "30" } ``` @@ -1111,7 +1111,7 @@ name to be included in the tool response, then rendering it can be as simple as: ```text {%- if message['role'] == 'tool' %} - {{- "" + message['content'] + "" }} +{{- "" + message['content'] + "" }} {%- endif %} ``` diff --git a/docs/source/en/conversations.md b/docs/source/en/conversations.md index a48c046b4949..80d4b81e8614 100644 --- a/docs/source/en/conversations.md +++ b/docs/source/en/conversations.md @@ -39,8 +39,8 @@ by adding its response. Let's see this in action. First, let's build a chat: ```python chat = [ - {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."}, - {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"} +{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."}, +{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"} ] ``` @@ -69,19 +69,19 @@ print(response[0]['generated_text'][-1]['content']) And you'll get: ```text -(sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright, +(sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright, alright, I'll give you the lowdown. But don't say I didn't warn you, I'm a robot, not a tour guide! -So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million -things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of -Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for -something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got +So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million +things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of +Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for +something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got some wild stuff, like that Warhol guy's soup cans and all that jazz. -And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for +And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for those pesky pigeons, they're like little feathered thieves! 
(laughs) Get it? Thieves? Ah, never mind. -Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might +Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might even catch a glimpse of some up-and-coming comedians... or a bunch of wannabes tryin' to make it big. (winks) And finally, if you're feelin' like a real New Yorker, grab a slice of pizza from one of the many amazing @@ -98,7 +98,23 @@ a message and pass it back: ```python chat = response[0]['generated_text'] chat.append( - {"role": "user", "content": "Wait, what's so wild about soup cans?"} +{"role": "user", "content": "Wait, what's so wild about soup cans?"} +) +response = pipe(chat, max_new_tokens=512) +print(response[0]['generated_text'][-1]['content']) +``` +So, there you have it, pal! That's my expert advice on what to do in New York. Now, if you'll +excuse me, I've got some oil changes to attend to. (winks) +``` + +You can continue the chat by appending your own response to it. The +`response` object returned by the pipeline actually contains the entire chat so far, so we can simply append +a message and pass it back: + +```python +chat = response[0]['generated_text'] +chat.append( +{"role": "user", "content": "Wait, what's so wild about soup cans?"} ) response = pipe(chat, max_new_tokens=512) print(response[0]['generated_text'][-1]['content']) @@ -107,9 +123,9 @@ print(response[0]['generated_text'][-1]['content']) And you'll get: ```text -(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man! -It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's -like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!" +(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man! +It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's +like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!" (sarcastically) Oh, yeah, real original, Andy. But, you know, back in the '60s, it was like, a big deal. People were all about challenging the @@ -122,7 +138,6 @@ But, hey, that's what makes art, art, right? (laughs) The remainder of this tutorial will cover specific topics such as performance and memory, or how to select a chat model for your needs. - ## Choosing a chat model There are an enormous number of different chat models available on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending), @@ -176,8 +191,8 @@ import torch # Prepare the input as before chat = [ - {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."}, - {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"} +{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."}, +{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"} ] # 1: Load the model and tokenizer @@ -212,6 +227,7 @@ the broad ideas, and leave the details for the linked documents. The key steps a 4. We [generate](https://huggingface.co/docs/transformers/en/llm_tutorial) a response from the model. 5. The tokens output by the model are decoded back to a string +Additionally, with the introduction of the `ImageTextToTextPipeline`, you can now handle multi-modal inputs, such as combining images and text to generate responses. 
This expands the capabilities of the pipeline to include tasks like visual question answering and image-based text generation. ## Performance, memory and hardware You probably know by now that most machine learning tasks are run on GPUs. However, it is entirely possible diff --git a/docs/source/en/model_doc/idefics2.md b/docs/source/en/model_doc/idefics2.md index 5ad56b7b5c52..4de9d532bf61 100644 --- a/docs/source/en/model_doc/idefics2.md +++ b/docs/source/en/model_doc/idefics2.md @@ -44,6 +44,7 @@ The original code can be found [here](https://huggingface.co/HuggingFaceM4/idefi - The processor has a `do_image_splitting` option. If `True`, each input image will be split into 4 sub-images, and concatenated with the original to form 5 images. This is useful for increasing model performance. Make sure `processor.image_processor.do_image_splitting` is set to `False` if the model was not trained with this option. - `text` passed to the processor should have the `` tokens where the images should be inserted. And `` at the end of each utterance if the text is a chat message. - The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as `text` to the processor. +- The `post_process_image_text_to_text` method is available for decoding the text output from the model's generated sequences. Example of how to use the processor on chat messages: @@ -63,12 +64,12 @@ image_2 = Image.open(requests.get(url_2, stream=True).raw) images = [image_1, image_2] messages = [{ - "role": "user", - "content": [ - {"type": "text", "text": "What’s the difference between these two images?"}, - {"type": "image"}, - {"type": "image"}, - ], +"role": "user", +"content": [ +{"type": "text", "text": "What’s the difference between these two images?"}, +{"type": "image"}, +{"type": "image"}, +], }] processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b") @@ -83,7 +84,7 @@ print(text) inputs = processor(images=images, text=text, return_tensors="pt").to(device) generated_text = model.generate(**inputs, max_new_tokens=500) -generated_text = processor.batch_decode(generated_text, skip_special_tokens=True)[0] +generated_text = processor.post_process_image_text_to_text(generated_text) print("Generated text:", generated_text) ``` @@ -103,18 +104,18 @@ image_2 = Image.open(requests.get(url_2, stream=True).raw) images = [image_1, image_2] messages = [{ - "role": "user", - "content": [ - {"type": "text", "text": "What’s the difference between these two images?"}, - {"type": "image"}, - {"type": "image"}, - ], +"role": "user", +"content": [ +{"type": "text", "text": "What’s the difference between these two images?"}, +{"type": "image"}, +{"type": "image"}, +], }, { - "role": "assistant", - "content": [ - {"type": "text", "text": "The difference is that one image is about dogs and the other one about cats."}, - ], +"role": "assistant", +"content": [ +{"type": "text", "text": "The difference is that one image is about dogs and the other one about cats."}, +], }] device = "cuda" if torch.cuda.is_available() else "cpu" @@ -130,6 +131,21 @@ labels = inputs.input_ids.clone() labels[labels == processor.tokenizer.pad_token_id] = -100 labels[labels == model.config.image_token_id] = -100 +inputs["labels"] = labels +``` +device = "cuda" if torch.cuda.is_available() else "cpu" + +processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b") +model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b") +model.to(device) + +text = 
processor.apply_chat_template(messages, add_generation_prompt=False) +inputs = processor(images=images, text=text, return_tensors="pt").to(device) + +labels = inputs.input_ids.clone() +labels[labels == processor.tokenizer.pad_token_id] = -100 +labels[labels == model.config.image_token_id] = -100 + inputs["labels"] = labels outputs = model(**inputs) @@ -138,7 +154,6 @@ loss.backward() ``` Do note that when training Idefics2 on multi-turn conversations between a user and an assistant, one typically also sets all the tokens corresponding to the user messages to -100. - ## Model optimizations: Flash Attention The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model. @@ -155,8 +170,8 @@ To load and run a model using Flash Attention-2, simply change the code snippet ```diff model = Idefics2ForConditionalGeneration.from_pretrained( - "HuggingFaceM4/idefics2-8b", -+ torch_dtype=torch.float16, +"HuggingFaceM4/idefics2-8b", ++ torch_dtype=torch.float16, + attn_implementation="flash_attention_2", ).to(device) ``` @@ -177,8 +192,8 @@ Quantizing a model is as simple as passing a `quantization_config` to the model. + bnb_4bit_compute_dtype=torch.float16 + ) model = Idefics2ForConditionalGeneration.from_pretrained( - "HuggingFaceM4/idefics2-8b", -+ torch_dtype=torch.float16, +"HuggingFaceM4/idefics2-8b", ++ torch_dtype=torch.float16, + quantization_config=quantization_config, ).to(device) ``` diff --git a/docs/source/en/model_doc/llava_next.md b/docs/source/en/model_doc/llava_next.md index b9146fbd3347..dfb72e63cade 100644 --- a/docs/source/en/model_doc/llava_next.md +++ b/docs/source/en/model_doc/llava_next.md @@ -63,24 +63,24 @@ from transformers import LlavaNextProcessor processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf") conversation = [ - { - "role": "user", - "content": [ - {"type": "image"}, - {"type": "text", "text": "What’s shown in this image?"}, - ], - }, - { - "role": "assistant", - "content": [{"type": "text", "text": "This image shows a red stop sign."},] - }, - { - - "role": "user", - "content": [ - {"type": "text", "text": "Describe the image in more details."}, - ], - }, +{ +"role": "user", +"content": [ +{"type": "image"}, +{"type": "text", "text": "What’s shown in this image?"}, +], +}, +{ +"role": "assistant", +"content": [{"type": "text", "text": "This image shows a red stop sign."},] +}, +{ + +"role": "user", +"content": [ +{"type": "text", "text": "Describe the image in more details."}, +], +}, ] text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True) @@ -107,6 +107,12 @@ print(text_prompt) "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n" ``` +[llama3-llava-next-8b-hf](https://huggingface.co/llava-hf/llava-next-8b-hf) requires the following format: +[llava-v1.6-34b-hf](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) requires the following format: +```bash +"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n" +``` + [llama3-llava-next-8b-hf](https://huggingface.co/llava-hf/llava-next-8b-hf) requires the following format: ```bash @@ -118,7 +124,6 @@ print(text_prompt) ```bash "<|im_start|>system\nYou are a helpful 
assistant.<|im_end|>\n<|im_start|>user\n\nWhat is shown in this image?<|im_end|>\n<|im_start|>assistant\n" ``` - ## Usage example ### Single image inference @@ -141,13 +146,13 @@ url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247a image = Image.open(requests.get(url, stream=True).raw) conversation = [ - { - "role": "user", - "content": [ - {"type": "image"}, - {"type": "text", "text": "What is shown in this image?"}, - ], - }, +{ +"role": "user", +"content": [ +{"type": "image"}, +{"type": "text", "text": "What is shown in this image?"}, +], +}, ] prompt = processor.apply_chat_template(conversation, add_generation_prompt=True) inputs = processor(image, prompt, return_tensors="pt").to("cuda:0") @@ -184,36 +189,36 @@ image_snowman = Image.open(requests.get(url, stream=True).raw) # Prepare a batch of two prompts, where the first one is a multi-turn conversation and the second is not conversation_1 = [ - { - "role": "user", - "content": [ - {"type": "image"}, - {"type": "text", "text": "What is shown in this image?"}, - ], - }, - { - "role": "assistant", - "content": [ - {"type": "text", "text": "There is a red stop sign in the image."}, - ], - }, - { - "role": "user", - "content": [ - {"type": "image"}, - {"type": "text", "text": "What about this image? How many cats do you see?"}, - ], - }, +{ +"role": "user", +"content": [ +{"type": "image"}, +{"type": "text", "text": "What is shown in this image?"}, +], +}, +{ +"role": "assistant", +"content": [ +{"type": "text", "text": "There is a red stop sign in the image."}, +], +}, +{ +"role": "user", +"content": [ +{"type": "image"}, +{"type": "text", "text": "What about this image? How many cats do you see?"}, +], +}, ] conversation_2 = [ - { - "role": "user", - "content": [ - {"type": "image"}, - {"type": "text", "text": "What is shown in this image?"}, - ], - }, +{ +"role": "user", +"content": [ +{"type": "image"}, +{"type": "text", "text": "What is shown in this image?"}, +], +}, ] prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True) @@ -228,7 +233,6 @@ inputs = processor(images=[image_stop, image_cats, image_snowman], text=prompts, generate_ids = model.generate(**inputs, max_new_tokens=30) processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False) ``` - ## Model optimization ### Quantization using Bitsandbytes @@ -250,9 +254,9 @@ from transformers import AutoModelForImageTextToText, BitsAndBytesConfig # specify how to quantize the model quantization_config = BitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_quant_type="nf4", - bnb_4bit_compute_dtype=torch.float16, +load_in_4bit=True, +bnb_4bit_quant_type="nf4", +bnb_4bit_compute_dtype=torch.float16, ) model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", quantization_config=quantization_config, device_map="auto") @@ -266,10 +270,10 @@ First make sure to install flash-attn. 
Refer to the [original repository of Flas from transformers import AutoModelForImageTextToText model = AutoModelForImageTextToText.from_pretrained( - model_id, - torch_dtype=torch.float16, - low_cpu_mem_usage=True, - use_flash_attention_2=True +model_id, +torch_dtype=torch.float16, +low_cpu_mem_usage=True, +use_flash_attention_2=True ).to(0) ``` diff --git a/docs/source/en/model_doc/mllama.md b/docs/source/en/model_doc/mllama.md index 4a6080ea2ce0..d0e959369e1d 100644 --- a/docs/source/en/model_doc/mllama.md +++ b/docs/source/en/model_doc/mllama.md @@ -30,25 +30,6 @@ The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a - The text passed to the processor should have the `"<|image|>"` tokens where the images should be inserted. - The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as text to the processor. - - - -Mllama has an extra token used as a placeholder for image positions in the text. It means that input ids and an input embedding layer will have an extra token. But since the weights for input and output embeddings are not tied, the `lm_head` layer has one less token and will fail if you want to calculate loss on image tokens or apply some logit processors. In case you are training, make sure to mask out special `"<|image|>"` tokens in the `labels` as the model should not be trained on predicting them. - -Otherwise if you see CUDA-side index erros when generating, use the below code to expand the `lm_head` by one more token. - - -```python -old_embeddings = model.get_output_embeddings() - -num_tokens = model.vocab_size + 1 -resized_embeddings = model._get_resized_lm_head(old_embeddings, new_num_tokens=num_tokens, mean_resizing=True) -resized_embeddings.requires_grad_(old_embeddings.weight.requires_grad) -model.set_output_embeddings(resized_embeddings) -``` - - - ## Usage Example #### Instruct model @@ -63,15 +44,15 @@ model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="aut processor = AutoProcessor.from_pretrained(model_id) messages = [ - [ - { - "role": "user", - "content": [ - {"type": "image"}, - {"type": "text", "text": "What does the image show?"} - ] - } - ], +[ +{ +"role": "user", +"content": [ +{"type": "image"}, +{"type": "text", "text": "What does the image show?"} +] +} +], ] text = processor.apply_chat_template(messages, add_generation_prompt=True) @@ -103,7 +84,6 @@ output = model.generate(**inputs, do_sample=False, max_new_tokens=25) print(processor.decode(output[0], skip_special_tokens=True)) ``` - ## MllamaConfig [[autodoc]] MllamaConfig @@ -112,7 +92,6 @@ print(processor.decode(output[0], skip_special_tokens=True)) [[autodoc]] MllamaProcessor - ## MllamaImageProcessor [[autodoc]] MllamaImageProcessor @@ -141,3 +120,8 @@ print(processor.decode(output[0], skip_special_tokens=True)) [[autodoc]] MllamaVisionModel - forward + +## MllamaForImageTextToText + +[[autodoc]] MllamaForImageTextToText + - forward \ No newline at end of file diff --git a/docs/source/en/model_summary.md b/docs/source/en/model_summary.md index c7efc4c00d9b..7c70f9e223ed 100644 --- a/docs/source/en/model_summary.md +++ b/docs/source/en/model_summary.md @@ -98,6 +98,9 @@ After GPT-2, language models grew even bigger and are now known as *large langua Optical character recognition (OCR) is a long-standing text recognition task that typically involves several components to understand the image and generate the text. 
[TrOCR](model_doc/trocr) simplifies the process using an end-to-end Transformer. The encoder is a ViT-style model for image understanding and processes the image as fixed-size patches. The decoder accepts the encoder's hidden states and autoregressively generates text. [Donut](model_doc/donut) is a more general visual document understanding model that doesn't rely on OCR-based approaches. It uses a Swin Transformer as the encoder and multilingual BART as the decoder. Donut is pretrained to read text by predicting the next word based on the image and text annotations. The decoder generates a token sequence given a prompt. The prompt is represented by a special token for each downstream task. For example, document parsing has a special `parsing` token that is combined with the encoder hidden states to parse the document into a structured output format (JSON). +### ImageTextToTextPipeline + +The `ImageTextToTextPipeline` is a new addition to the multimodal capabilities of the Transformers library. It allows for generating text from both image and text inputs, enhancing tasks such as visual question answering, image captioning, or image-based text generation. This pipeline supports both single and batch processing and can operate in chat mode, making it versatile for various applications. ## Reinforcement learning diff --git a/docs/source/en/pipeline_tutorial.md b/docs/source/en/pipeline_tutorial.md index 3363c68ea417..9c32cf8acb41 100644 --- a/docs/source/en/pipeline_tutorial.md +++ b/docs/source/en/pipeline_tutorial.md @@ -77,10 +77,10 @@ If you have several inputs, you can pass your input as a list: ```py transcriber( - [ - "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", - "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac", - ] +[ +"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", +"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac", +] ) ``` @@ -183,14 +183,14 @@ The pipeline can also run inference on a large dataset. The easiest way we recom ```py def data(): - for i in range(1000): - yield f"My example {i}" +for i in range(1000): +yield f"My example {i}" pipe = pipeline(model="openai-community/gpt2", device=0) generated_characters = 0 for out in pipe(data()): - generated_characters += len(out[0]["generated_text"]) +generated_characters += len(out[0]["generated_text"]) ``` The iterator `data()` yields each result, and the pipeline automatically @@ -212,7 +212,7 @@ pipe = pipeline(model="hf-internal-testing/tiny-random-wav2vec2", device=0) dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]") for out in pipe(KeyDataset(dataset, "audio")): - print(out) +print(out) ``` @@ -292,6 +292,59 @@ pip install pytesseract +### ImageTextToTextPipeline + +The `ImageTextToTextPipeline` is a new addition to the multimodal pipelines, allowing for the generation of text given both an image and text input. This is particularly useful for tasks such as image captioning or generating descriptions based on visual content. The pipeline can handle both single and batch processing and supports chat mode for conversational models. 
+ +Example usage: + +```python +>>> from transformers import pipeline + +>>> pipe = pipeline(task="image-text-to-text", model="Salesforce/blip-image-captioning-base") +>>> pipe("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png", text="A photo of") +[{'generated_text': 'a photo of two birds'}] +``` + +For chat-based models: + +```python +>>> from transformers import pipeline + +>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf") +>>> messages = [ +>>> { +>>> "role": "user", +>>> "content": [ +>>> { +>>> "type": "image", +>>> "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", +>>> }, +>>> {"type": "text", "text": "Describe this image."}, +>>> ], +>>> }, +>>> { +>>> "role": "assistant", +>>> "content": [ +>>> {"type": "text", "text": "There is a dog and"}, +>>> ], +>>> }, +>>> ] +>>> pipe(text=messages, max_new_tokens=20, return_full_text=False) +[{'input_text': [{'role': 'user', + 'content': [{'type': 'image', + 'url': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, + {'type': 'text', 'text': 'Describe this image.'}]}, +{'role': 'assistant', + 'content': [{'type': 'text', 'text': 'There is a dog and'}]}], +'generated_text': ' a person in the image. The dog is sitting on the sand, and the person is sitting on'}] +``` + +Learn more about the basics of using a pipeline in the [pipeline tutorial](../pipeline_tutorial). + +This image-text to text pipeline can currently be loaded from `pipeline()` using the following task identifier: "image-text-to-text". + +See the list of available models on [huggingface.co/models](https://huggingface.co/models?pipeline_tag=image-text-to-text). ## Using `pipeline` on large models with 🤗 `accelerate`: You can easily run `pipeline` on large models using 🤗 `accelerate`! First make sure you have installed `accelerate` with `pip install accelerate`. diff --git a/docs/source/en/tasks/image_text_to_text.md b/docs/source/en/tasks/image_text_to_text.md index 261abf947290..8a79a95c5d08 100644 --- a/docs/source/en/tasks/image_text_to_text.md +++ b/docs/source/en/tasks/image_text_to_text.md @@ -43,9 +43,9 @@ import torch device = torch.device("cuda") model = AutoModelForImageTextToText.from_pretrained( - "HuggingFaceM4/idefics2-8b", - torch_dtype=torch.bfloat16, - attn_implementation="flash_attention_2", +"HuggingFaceM4/idefics2-8b", +torch_dtype=torch.bfloat16, +attn_implementation="flash_attention_2", ).to(device) processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b") @@ -69,9 +69,9 @@ from PIL import Image import requests img_urls =["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png", - "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"] +"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"] images = [Image.open(requests.get(img_urls[0], stream=True).raw), - Image.open(requests.get(img_urls[1], stream=True).raw)] +Image.open(requests.get(img_urls[1], stream=True).raw)] ``` Below is an example of the chat template. We can feed conversation turns and the last message as an input by appending it at the end of the template. @@ -79,26 +79,26 @@ Below is an example of the chat template. 
We can feed conversation turns and the ```python messages = [ - { - "role": "user", - "content": [ - {"type": "image"}, - {"type": "text", "text": "What do we see in this image?"}, - ] - }, - { - "role": "assistant", - "content": [ - {"type": "text", "text": "In this image we can see two cats on the nets."}, - ] - }, - { - "role": "user", - "content": [ - {"type": "image"}, - {"type": "text", "text": "And how about this image?"}, - ] - }, +{ +"role": "user", +"content": [ +{"type": "image"}, +{"type": "text", "text": "What do we see in this image?"}, +] +}, +{ +"role": "assistant", +"content": [ +{"type": "text", "text": "In this image we can see two cats on the nets."}, +] +}, +{ +"role": "user", +"content": [ +{"type": "image"}, +{"type": "text", "text": "And how about this image?"}, +] +}, ] ``` @@ -113,20 +113,35 @@ We can now pass the preprocessed inputs to the model. ```python with torch.no_grad(): - generated_ids = model.generate(**inputs, max_new_tokens=500) +generated_ids = model.generate(**inputs, max_new_tokens=500) generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True) print(generated_texts) ## ['User: What do we see in this image? \nAssistant: In this image we can see two cats on the nets. \nUser: And how about this image? \nAssistant: In this image we can see flowers, plants and insect.'] ``` +To use the new `ImageTextToTextPipeline` for generating text from image and text inputs, you can initialize the pipeline as follows: + +```python +from transformers import pipeline + +pipe = pipeline("image-text-to-text", model="HuggingFaceM4/idefics2-8b") +``` + +You can then use the pipeline to generate text by providing image URLs or PIL images along with text prompts: + +```python +outputs = pipe(images=img_urls, text="What do we see in these images?") +print(outputs) +``` + +This will leverage the new pipeline to handle multimodal tasks, offering greater flexibility in text generation. ## Streaming We can use [text streaming](./generation_strategies#streaming) for a better generation experience. Transformers supports streaming with the [`TextStreamer`] or [`TextIteratorStreamer`] classes. We will use the [`TextIteratorStreamer`] with IDEFICS-8B. Assume we have an application that keeps chat history and takes in the new user input. We will preprocess the inputs as usual and initialize [`TextIteratorStreamer`] to handle the generation in a separate thread. This allows you to stream the generated text tokens in real-time. Any generation arguments can be passed to [`TextIteratorStreamer`]. - ```python import time from transformers import TextIteratorStreamer @@ -179,7 +194,7 @@ def model_inference( acc_text += text_token if acc_text.endswith(""): acc_text = acc_text[:-18] - yield acc_text + yield acc_text thread.join() ``` @@ -195,13 +210,12 @@ generator = model_inference( ) for value in generator: - print(value) + print(value) # In # In this # In this image ... ``` - ## Fit models in smaller hardware VLMs are often large and need to be optimized to fit on smaller hardware. Transformers supports many model quantization libraries, and here we will only show int8 quantization with [Quanto](./quantization/quanto#quanto). int8 quantization offers memory improvements up to 75 percent (if all weights are quantized). However it is no free lunch, since 8-bit is not a CUDA-native precision, the weights are quantized back and forth on the fly, which adds up to latency. 
@@ -220,7 +234,7 @@ from transformers import AutoModelForImageTextToText, QuantoConfig
 model_id = "HuggingFaceM4/idefics2-8b"
 quantization_config = QuantoConfig(weights="int8")
 quantized_model = AutoModelForImageTextToText.from_pretrained(
-    model_id, device_map="cuda", quantization_config=quantization_config
+model_id, device_map="cuda", quantization_config=quantization_config
 )
 ```
diff --git a/docs/source/en/tasks/video_text_to_text.md b/docs/source/en/tasks/video_text_to_text.md
index fcc1c86e8bd7..1e8305a41888 100644
--- a/docs/source/en/tasks/video_text_to_text.md
+++ b/docs/source/en/tasks/video_text_to_text.md
@@ -34,7 +34,7 @@ This guide focuses on inference with an instruction-tuned model, [llava-hf/llava
 
 Let's begin installing the dependencies.
 
 ```bash
-pip install -q transformers accelerate flash_attn
+pip install -q transformers accelerate flash_attn
 ```
 
 Let's initialize the model and the processor.
 
@@ -58,17 +58,16 @@ import requests
 import cv2
 
 def replace_video_with_images(text, frames):
-    return text.replace("