472 changes: 236 additions & 236 deletions docs/source/en/chat_templating.md

Large diffs are not rendered by default.

48 changes: 32 additions & 16 deletions docs/source/en/conversations.md
@@ -39,8 +39,8 @@ by adding its response. Let's see this in action. First, let's build a chat:

```python
chat = [
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]
```
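
We also need a text-generation pipeline to run this chat. A minimal sketch, assuming a chat-tuned checkpoint you have access to (the model choice here is illustrative):

```python
import torch
from transformers import pipeline

# The checkpoint is an example; any chat-tuned model works the same way
pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
response = pipe(chat, max_new_tokens=512)
```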

@@ -69,19 +69,19 @@ print(response[0]['generated_text'][-1]['content'])
And you'll get:

```text
(sigh) Oh boy, you're asking me for advice? You're gonna need a map, pal! Alright,
alright, I'll give you the lowdown. But don't say I didn't warn you, I'm a robot, not a tour guide!

So, you wanna know what's fun to do in the Big Apple? Well, let me tell you, there's a million
things to do, but I'll give you the highlights. First off, you gotta see the sights: the Statue of
Liberty, Central Park, Times Square... you know, the usual tourist traps. But if you're lookin' for
something a little more... unusual, I'd recommend checkin' out the Museum of Modern Art. It's got
some wild stuff, like that Warhol guy's soup cans and all that jazz.

And if you're feelin' adventurous, take a walk across the Brooklyn Bridge. Just watch out for
those pesky pigeons, they're like little feathered thieves! (laughs) Get it? Thieves? Ah, never mind.

Now, if you're lookin' for some serious fun, hit up the comedy clubs in Greenwich Village. You might
even catch a glimpse of some up-and-coming comedians... or a bunch of wannabes tryin' to make it big. (winks)

And finally, if you're feelin' like a real New Yorker, grab a slice of pizza from one of the many amazing
@@ -98,7 +98,23 @@ a message and pass it back:
```python
chat = response[0]['generated_text']
chat.append(
{"role": "user", "content": "Wait, what's so wild about soup cans?"}
{"role": "user", "content": "Wait, what's so wild about soup cans?"}
)
response = pipe(chat, max_new_tokens=512)
print(response[0]['generated_text'][-1]['content'])
```
@@ -107,9 +123,9 @@ print(response[0]['generated_text'][-1]['content'])
And you'll get:

```text
(laughs) Oh, you're killin' me, pal! You don't get it, do you? Warhol's soup cans are like, art, man!
It's like, he took something totally mundane, like a can of soup, and turned it into a masterpiece. It's
like, "Hey, look at me, I'm a can of soup, but I'm also a work of art!"
(sarcastically) Oh, yeah, real original, Andy.

But, you know, back in the '60s, it was like, a big deal. People were all about challenging the
@@ -122,7 +138,6 @@ But, hey, that's what makes art, art, right? (laughs)

The remainder of this tutorial covers specific topics such as performance and memory, as well as how to select a chat model for your needs.

## Choosing a chat model

There are an enormous number of different chat models available on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending),
@@ -176,8 +191,8 @@ import torch

# Prepare the input as before
chat = [
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
{"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
{"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]

# 1: Load the model and tokenizer
@@ -212,6 +227,7 @@ the broad ideas, and leave the details for the linked documents. The key steps are:
4. We [generate](https://huggingface.co/docs/transformers/en/llm_tutorial) a response from the model.
5. The tokens output by the model are decoded back to a string, as sketched below.
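
For illustration, here is a condensed sketch of those steps, assuming a chat-tuned checkpoint (the model name is an example, and `chat` is the message list built earlier):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint; substitute any chat model you have access to
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

# Steps 2-3: format the chat with the model's template and tokenize it
input_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Step 4: generate a continuation
output_ids = model.generate(input_ids, max_new_tokens=512)

# Step 5: decode only the newly generated tokens back into a string
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))
```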

Additionally, with the introduction of the `ImageTextToTextPipeline`, you can now handle multimodal inputs, combining images and text to generate responses. This extends the pipeline API to tasks like visual question answering and image-based text generation.
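
Here is a minimal sketch of that pipeline in use; the checkpoint and image URL are illustrative assumptions, not fixed choices:

```python
from transformers import pipeline

# "image-text-to-text" is the multimodal chat pipeline task
pipe = pipeline("image-text-to-text", model="HuggingFaceM4/idefics2-8b")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

response = pipe(text=messages, max_new_tokens=64)
print(response[0]["generated_text"][-1]["content"])
```
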
## Performance, memory and hardware

You probably know by now that most machine learning tasks are run on GPUs. However, it is entirely possible
59 changes: 37 additions & 22 deletions docs/source/en/model_doc/idefics2.md
@@ -44,6 +44,7 @@ The original code can be found [here](https://huggingface.co/HuggingFaceM4/idefi
- The processor has a `do_image_splitting` option. If `True`, each input image is split into 4 sub-images, which are concatenated with the original to form 5 images. This is useful for increasing model performance. Make sure `processor.image_processor.do_image_splitting` is set to `False` if the model was not trained with this option (see the snippet after this list).
- The `text` passed to the processor should contain the `<image>` tokens where the images should be inserted, and `<end_of_utterance>` at the end of each utterance if the text is a chat message.
- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as `text` to the processor.
- The `post_process_image_text_to_text` method is available for decoding the text output from the model's generated sequences.
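
To illustrate the first point, here is a short sketch of turning image splitting off; whether you need this depends on how your checkpoint was trained:

```python
from transformers import Idefics2Processor

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")

# Disable splitting for checkpoints not trained with it; when enabled,
# each image becomes 4 sub-images plus the original (5 images total)
processor.image_processor.do_image_splitting = False
```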

Example of how to use the processor on chat messages:

Expand All @@ -63,12 +64,12 @@ image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
"role": "user",
"content": [
{"type": "text", "text": "What’s the difference between these two images?"},
{"type": "image"},
{"type": "image"},
],
"role": "user",
"content": [
{"type": "text", "text": "What’s the difference between these two images?"},
{"type": "image"},
{"type": "image"},
],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
@@ -83,7 +84,7 @@ print(text)
inputs = processor(images=images, text=text, return_tensors="pt").to(device)

generated_text = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.post_process_image_text_to_text(generated_text)
print("Generated text:", generated_text)
```

Expand All @@ -103,18 +104,18 @@ image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
},
{
    "role": "assistant",
    "content": [
        {"type": "text", "text": "The difference is that one image is about dogs and the other one about cats."},
    ],
}]

device = "cuda" if torch.cuda.is_available() else "cpu"
@@ -130,6 +131,21 @@ labels = inputs.input_ids.clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
labels[labels == model.config.image_token_id] = -100

inputs["labels"] = labels

outputs = model(**inputs)
@@ -138,7 +154,6 @@ loss.backward()
```

Do note that when training Idefics2 on multi-turn conversations between a user and an assistant, one typically also sets all the tokens corresponding to the user messages to -100.
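
A minimal sketch of that masking, assuming you have already located each user turn's token span (the `user_spans` list is a hypothetical input here, not a processor API):

```python
import torch

def mask_user_turns(input_ids: torch.Tensor, user_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Clone input_ids into labels, setting every user-message span to -100 so the loss ignores it."""
    labels = input_ids.clone()
    for start, end in user_spans:
        labels[:, start:end] = -100
    return labels

# Usage, given spans recovered from the chat template:
# inputs["labels"] = mask_user_turns(inputs.input_ids, user_spans)
```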

## Model optimizations: Flash Attention

The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
@@ -155,8 +170,8 @@ To load and run a model using Flash Attention-2, simply change the code snippet above as follows:

```diff
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    torch_dtype=torch.float16,
+    attn_implementation="flash_attention_2",
).to(device)
```
@@ -177,8 +192,8 @@ Quantizing a model is as simple as passing a `quantization_config` to the model.
+    bnb_4bit_compute_dtype=torch.float16
+ )
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
+    torch_dtype=torch.float16,
+    quantization_config=quantization_config,
).to(device)
```
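
For reference, a self-contained version of the quantized load might look like this. The `BitsAndBytesConfig` fields other than `bnb_4bit_compute_dtype` are assumptions drawn from common 4-bit setups, and `bitsandbytes` must be installed:

```python
import torch
from transformers import BitsAndBytesConfig, Idefics2ForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # assumed: 4-bit weights
    bnb_4bit_quant_type="nf4",             # assumed: NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # matches the diff above
)

model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
# 4-bit models are placed on the GPU automatically, so no explicit .to(device) is needed
```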