472 changes: 236 additions & 236 deletions docs/source/en/chat_templating.md

Large diffs are not rendered by default.

48 changes: 32 additions & 16 deletions docs/source/en/conversations.md
@@ -39,8 +39,8 @@ by adding its response. Let's see this in action. First, let's build a chat:

```python
chat = [
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]
```
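
We also need a text-generation pipeline to run this chat. A minimal sketch, assuming a chat-tuned checkpoint you have access to (the model choice here is illustrative):

```python
import torch
from transformers import pipeline

# The checkpoint is an example; any chat-tuned model works the same way
pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
response = pipe(chat, max_new_tokens=512)
```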

@@ -69,19 +69,19 @@ print(response[0]['generated_text'][-1]['content'])
And you'll get:

```text
(sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright,
alright, I'll give you the lowdown. But don't say I didn't warn you, I'm a robot, not a tour guide!

So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million
things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of
Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for
something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got
some wild stuff, like that Warhol guy's soup cans and all that jazz.

And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for
those pesky pigeons, they're like little feathered thieves! (laughs) Get it? Thieves? Ah, never mind.

Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might
even catch a glimpse of some up-and-coming comedians... or a bunch of wannabes tryin' to make it big. (winks)

And finally, if you're feelin' like a real New Yorker, grab a slice of pizza from one of the many amazing
@@ -98,7 +98,23 @@ a message and pass it back:
```python
chat = response[0]['generated_text']
chat.append(
{"role": "user", "content": "Wait, what's so wild about soup cans?"}
{"role": "user", "content": "Wait, what's so wild about soup cans?"}
)
response = pipe(chat, max_new_tokens=512)
print(response[0]['generated_text'][-1]['content'])
```
@@ -107,9 +123,9 @@ print(response[0]['generated_text'][-1]['content'])
And you'll get:

```text
(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man!
It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's
like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!"
(sarcastically) Oh, yeah, real original, Andy.

But, you know, back in the '60s, it was like, a big deal. People were all about challenging the
@@ -122,7 +138,6 @@ But, hey, that's what makes art, art, right? (laughs)

The remainder of this tutorial covers specific topics such as performance and memory, as well as how to select a chat model for your needs.

## Choosing a chat model

There are an enormous number of different chat models available on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending),
@@ -176,8 +191,8 @@ import torch

# Prepare the input as before
chat = [
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]

# 1: Load the model and tokenizer
@@ -212,6 +227,7 @@ the broad ideas, and leave the details for the linked documents. The key steps are:
4. We [generate](https://huggingface.co/docs/transformers/en/llm_tutorial) a response from the model.
5. The tokens output by the model are decoded back to a string, as sketched below.
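
For illustration, here is a condensed sketch of those steps, assuming a chat-tuned checkpoint (the model name is an example, and `chat` is the message list built earlier):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; substitute any chat model you have access to
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

# Steps 2-3: format the chat with the model's template and tokenize it
input_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Step 4: generate a continuation
output_ids = model.generate(input_ids, max_new_tokens=512)

# Step 5: decode only the newly generated tokens back into a string
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))
```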

Additionally, with the introduction of the `ImageTextToTextPipeline`, you can now handle multimodal inputs, combining images and text to generate responses. This extends the pipeline API to tasks like visual question answering and image-based text generation.
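
Here is a minimal sketch of that pipeline in use; the checkpoint and image URL are illustrative assumptions, not fixed choices:

```python
from transformers import pipeline

# "image-text-to-text" is the multimodal chat pipeline task
pipe = pipeline("image-text-to-text", model="HuggingFaceM4/idefics2-8b")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

response = pipe(text=messages, max_new_tokens=64)
print(response[0]["generated_text"][-1]["content"])
```
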
## Performance, memory and hardware

You probably know by now that most machine learning tasks are run on GPUs. However, it is entirely possible
59 changes: 37 additions & 22 deletions docs/source/en/model_doc/idefics2.md
@@ -44,6 +44,7 @@ The original code can be found [here](https://huggingface.co/HuggingFaceM4/idefi
- The processor has a `do_image_splitting` option. If `True`, each input image is split into 4 sub-images, which are concatenated with the original to form 5 images. This is useful for increasing model performance. Make sure `processor.image_processor.do_image_splitting` is set to `False` if the model was not trained with this option (see the snippet after this list).
- The `text` passed to the processor should contain the `<image>` tokens where the images should be inserted, and `<end_of_utterance>` at the end of each utterance if the text is a chat message.
- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as `text` to the processor.
- The `post_process_image_text_to_text` method is available for decoding the text output from the model's generated sequences.
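
To illustrate the first point, here is a short sketch of turning image splitting off; whether you need this depends on how your checkpoint was trained:

```python
from transformers import Idefics2Processor

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")

# Disable splitting for checkpoints not trained with it; when enabled,
# each image becomes 4 sub-images plus the original (5 images total)
processor.image_processor.do_image_splitting = False
```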

Example of how to use the processor on chat messages:

Expand All @@ -63,12 +64,12 @@ image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
"role": "user",
"content": [
{"type": "text", "text": "What’s the difference between these two images?"},
{"type": "image"},
{"type": "image"},
],
"role": "user",
"content": [
{"type": "text", "text": "What’s the difference between these two images?"},
{"type": "image"},
{"type": "image"},
],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
@@ -83,7 +84,7 @@ print(text)
inputs = processor(images=images, text=text, return_tensors="pt").to(device)

generated_text = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.post_process_image_text_to_text(generated_text)
print("Generated text:", generated_text)
```

Expand All @@ -103,18 +104,18 @@ image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
},
{
    "role": "assistant",
    "content": [
        {"type": "text", "text": "The difference is that one image is about dogs and the other one about cats."},
    ],
}]

device = "cuda" if torch.cuda.is_available() else "cpu"
@@ -130,6 +131,21 @@ labels = inputs.input_ids.clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
labels[labels == model.config.image_token_id] = -100

inputs["labels"] = labels

outputs = model(**inputs)
@@ -138,7 +154,6 @@ loss.backward()
```

Do note that when training Idefics2 on multi-turn conversations between a user and an assistant, one typically also sets all the tokens corresponding to the user messages to -100.
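
A minimal sketch of that masking, assuming you have already located each user turn's token span (the `user_spans` list is a hypothetical input here, not a processor API):

```python
import torch

def mask_user_turns(input_ids: torch.Tensor, user_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Clone input_ids into labels, setting every user-message span to -100 so the loss ignores it."""
    labels = input_ids.clone()
    for start, end in user_spans:
        labels[:, start:end] = -100
    return labels

# Usage, given spans recovered from the chat template:
# inputs["labels"] = mask_user_turns(inputs.input_ids, user_spans)
```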

## Model optimizations: Flash Attention

The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
@@ -155,8 +170,8 @@ To load and run a model using Flash Attention-2, simply change the code snippet above as follows:

```diff
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    torch_dtype=torch.float16,
+    attn_implementation="flash_attention_2",
).to(device)
```
@@ -177,8 +192,8 @@ Quantizing a model is as simple as passing a `quantization_config` to the model.
+    bnb_4bit_compute_dtype=torch.float16
+ )
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    torch_dtype=torch.float16,
+    quantization_config=quantization_config,
).to(device)
```
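
For reference, a self-contained version of the quantized load might look like this. The `BitsAndBytesConfig` fields other than `bnb_4bit_compute_dtype` are assumptions drawn from common 4-bit setups, and `bitsandbytes` must be installed:

```python
import torch
from transformers import BitsAndBytesConfig, Idefics2ForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # assumed: 4-bit weights
    bnb_4bit_quant_type="nf4",             # assumed: NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # matches the diff above
)

model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
# 4-bit models are placed on the GPU automatically, so no explicit .to(device) is needed
```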