Commit 5a4a08a

Merge branch 'main' into kernel_mapping_error_resolve
2 parents: 04e27cb + d08b98b

1,157 files changed: +30,524 additions, -65,638 deletions


.github/workflows/get-pr-info.yml

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ on:
       description: "The sha of the merge commit for the pull request (created by GitHub) in the base repository"
       value: ${{ jobs.get-pr-info.outputs.PR_MERGE_COMMIT_SHA }}
     PR_MERGE_COMMIT_BASE_SHA:
-      description: "The sha of the parent commit of the the merge commit on the target branch in the base repository"
+      description: "The sha of the parent commit of the merge commit on the target branch in the base repository"
       value: ${{ jobs.get-pr-info.outputs.PR_MERGE_COMMIT_BASE_SHA }}
     PR_HEAD_COMMIT_DATE:
       description: "The date of the head sha of the pull request branch in the head repository"

.github/workflows/self-comment-ci.yml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ env:
 jobs:
   get-pr-number:
     name: Get PR number
-    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "remi-or", "itazap"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
+    if: ${{ github.event.issue.state == 'open' && contains(fromJSON('["ydshieh", "ArthurZucker", "zucchini-nlp", "molbap", "gante", "LysandreJik", "Cyrilvallez", "Rocketknight1", "SunMarc", "eustlb", "MekkCyber", "vasqu", "ivarflakstad", "stevhliu", "ebezzam", "remi-or", "itazap", "3outeille"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') || startsWith(github.event.comment.body, 'run slow') || startsWith(github.event.comment.body, 'run_slow')) }}
     uses: ./.github/workflows/get-pr-number.yml

   get-pr-info:

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
@@ -125,9 +125,9 @@ If you're contributing a **vision-language model** (or any multimodal model that
 All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:

 - Use the CLI, [`transformers add-new-model-like`](https://github.com/huggingface/transformers/blob/main/src/transformers/cli/add_new_model_like.py) to generate a modular skeleton and get started
-- All code should be in the modular file if possible. Modeling must be in it, it's better if configuration is in it as well. [Modular guide](./modular_transformers#implementing-a-modular-file) shows a quick way to set up a modular file.
+- All code should be in the modular file if possible. Modeling must be in it, it's better if configuration is in it as well. [Modular guide](./docs/source/en/modular_transformers.md#implementing-a-modular-file) shows a quick way to set up a modular file.
 - Reuse existing patterns from similar models as much as possible
-- You can make the model compatible with inference engines such as vLLM or SGLang, and enable zero-effort integration. See specific requirements for model implementation in ["Transformers modeling backend"](./transformers_as_backend#multimodal-models)
+- You can make the model compatible with inference engines such as vLLM or SGLang, and enable zero-effort integration. See specific requirements for model implementation in ["Transformers modeling backend"](./docs/source/en/transformers_as_backend.md#multimodal-models)

 To verify your modular file is correct, run:

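As a rough illustration of the modular pattern described in the bullets above (not part of this diff), a `modular_<model_name>.py` file typically subclasses components of an existing model and overrides only what differs; the `MyNewModel*` names below are hypothetical placeholders, and the converter expands such a file into full modeling code:

```python
# Hypothetical sketch of a modular_my_new_model.py file. The MyNewModel* class
# names are placeholders; the Llama imports are real transformers classes used
# here only to show the "reuse existing patterns" guidance above.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaAttention, LlamaForCausalLM


class MyNewModelConfig(LlamaConfig):
    # Keeping the configuration in the modular file, as recommended above.
    model_type = "my_new_model"


class MyNewModelAttention(LlamaAttention):
    # Override only the pieces that differ from the reused Llama implementation.
    pass


class MyNewModelForCausalLM(LlamaForCausalLM):
    config_class = MyNewModelConfig
```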
MIGRATION_GUIDE_V5.md

Lines changed: 485 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 1 addition & 1 deletion
@@ -134,7 +134,7 @@ pipeline("the secret to baking a really good cake is ")
 To chat with a model, the usage pattern is the same. The only difference is you need to construct a chat history (the input to `Pipeline`) between you and the system.

 > [!TIP]
-> You can also chat with a model directly from the command line.
+> You can also chat with a model directly from the command line, as long as [`transformers serve` is running](https://huggingface.co/docs/transformers/main/en/serving).
 > ```shell
 > transformers chat Qwen/Qwen2.5-0.5B-Instruct
 > ```
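As a companion to the chat usage described in the context line above, here is a minimal Python sketch (not part of this diff) of passing a chat history to `pipeline`, reusing the same model as the command-line snippet:

```python
# Minimal sketch of chatting through the Python API, mirroring the README text
# above; assumes transformers and a compatible model are available locally.
from transformers import pipeline

chatbot = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

# The chat history is a list of role/content messages between you and the system.
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the secret to baking a really good cake?"},
]

outputs = chatbot(chat, max_new_tokens=128)
# The pipeline returns the extended chat; the last message is the model's reply.
print(outputs[0]["generated_text"][-1]["content"])
```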

benchmark_v2/framework/benchmark_config.py

Lines changed: 37 additions & 37 deletions
@@ -2,9 +2,10 @@
 import itertools
 import json
 import logging
+from functools import lru_cache
 from typing import Any

-from transformers.utils.import_utils import is_flash_attn_2_available
+from transformers.utils.import_utils import is_flash_attn_2_available, is_kernels_available


 KERNELIZATION_AVAILABLE = False
@@ -18,17 +19,36 @@
 logger = logging.getLogger(__name__)


+@lru_cache
+def is_fa2_or_kernel_available() -> bool:
+    """Returns True if the flash_attn_2 or a fallback kernel is available"""
+    # Early return if flash_attn_2 is available
+    if is_flash_attn_2_available():
+        return True
+    # Early return if kernels is not available
+    if not is_kernels_available():
+        logger.warning(
+            "flash_attention_2 is not available. kernels is not installed. Benchmarking flash_attention_2 will not "
+            "be possible."
+        )
+        return False
+    # If kernels is available, try to get the flash_attn_2 kernel
+    try:
+        from kernels import get_kernel
+
+        get_kernel("kernels-community/flash-attn")
+    except Exception as _:
+        logger.warning(
+            "flash_attention_2 is not available. kernels is installed, but the flash_attn kernel is not available."
+            "Benchmarking flash_attention_2 will not be possible."
+        )
+        return False
+
+
 class BenchmarkConfig:
     """Configuration for a single benchmark scenario."""

-    all_attn_implementations = [
-        ("flash_attention_2", None),
-        ("eager", None),
-        ("sdpa", "math"),
-        ("sdpa", "flash_attention"),
-        ("flex_attention", None),
-    ]
-
+    all_attn_implementations = ["flash_attention_2", "eager", "sdpa", "flex_attention"]
     all_compiled_modes = [None, "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs"]

     def __init__(
@@ -41,7 +61,6 @@ def __init__(
         sequence_length: int = 128,
         num_tokens_to_generate: int = 128,
         attn_implementation: str = "eager",
-        sdpa_backend: str | None = None,
         compile_mode: str | None = None,
         compile_options: dict[str, Any] | None = None,
         kernelize: bool = False,
@@ -59,7 +78,6 @@ def __init__(
         self.num_tokens_to_generate = num_tokens_to_generate
         # Generation parameters
         self.attn_implementation = attn_implementation
-        self.sdpa_backend = sdpa_backend
         # Optimization parameters
         self.compile_mode = compile_mode
         self.compile_options = compile_options if compile_options is not None else {}
@@ -75,34 +93,21 @@ def check_validity(self, skip_validity_check: bool = False) -> None:
         if skip_validity_check:
             return
         # Check FA is installed
-        if self.attn_implementation == "flash_attention_2" and not is_flash_attn_2_available():
-            logger.warning(
-                "Flash attention does not support compile mode. Defaulting to SDPA w/ flash attention backend."
-            )
+        is_fa = self.attn_implementation == "flash_attention_2"
+        if is_fa and not is_fa2_or_kernel_available():
+            logger.warning("Flash attention is not available. Defaulting to SDPA.")
             self.attn_implementation = "sdpa"
-            self.sdpa_backend = "flash_attention"
         # Flash attention does not support compile mode, so we turn it off # FIXME: it would be better to support it
-        is_fa = self.attn_implementation == "flash_attention_2"
-        is_fa |= self.attn_implementation == "sdpa" and self.sdpa_backend == "flash_attention"
-        if is_fa:
+        if is_fa and self.compile_mode is not None:
             logger.warning("Flash attention does not support compile mode. Turning off compile mode.")
             self.compile_mode = None
-        # Handle SDPA backend if not determined by the config (needs to be done before skipping duplicates)
-        if self.attn_implementation == "sdpa" and self.sdpa_backend is None:
-            default_backend = "flash_attention"  # FIXME: torch has a _cur_sdpa_kernel_backends but it fails
-            logger.warning(f"No SDPA backend provided, using {default_backend} instead.")
-            self.sdpa_backend = default_backend
+        # Handle continuous batching cases
         if self.continuous_batching:
             if self.attn_implementation == "flex_attention":
                 logger.error(
-                    "disabling continuous batching because of invalid configuration: flex attention is not supported"
+                    "Disabling continuous batching because of invalid configuration: flex attention is not supported."
                 )
                 self.continuous_batching = False
-            elif self.attn_implementation == "sdpa" and self.sdpa_backend is not None:
-                logger.warning(
-                    "when continuous batching is enabled, sdpa_backend must be None because of the attention mask, setting it to None"
-                )
-                self.sdpa_backend = "math"

     @property
     def hash(self) -> str:
@@ -115,7 +120,6 @@ def infer_name(self, compact: bool = True) -> str:
             gpu_monitor_str = "monitored" if self.gpu_monitoring else "unmonitored"
             dimensions_str = f"b{self.batch_size}_s{self.sequence_length}_n{self.num_tokens_to_generate}"
             attn_code = self.attn_implementation
-            attn_code += f"_{self.sdpa_backend}" if self.attn_implementation == "sdpa" else ""
             compile_str = f"compiled_{self.compile_mode}" if self.compile_mode is not None else "uncompiled"
             kernelize_str = "kernelized" if self.kernelize else "unkernelized"
             continuous_batching_str = "cb" if self.continuous_batching else "generate"
@@ -125,7 +129,6 @@ def infer_name(self, compact: bool = True) -> str:
             gpu_monitor_str = ("with" if self.gpu_monitoring else "no") + " GPU monitoring"
             dimensions_str = f"batch size {self.batch_size}, sequence length {self.sequence_length}, {self.num_tokens_to_generate} generated tokens"
             attn_code = f"{self.attn_implementation} attention"
-            attn_code += f" with {self.sdpa_backend} backend" if self.attn_implementation == "sdpa" else ""
             compile_str = "compiled" if self.compile_mode is not None else "not compiled"
             kernelize_str = "kernelized" if self.kernelize else "not kernelized"
             continuous_batching_str = "continuous batching" if self.continuous_batching else "regular generate"
@@ -145,7 +148,6 @@ def to_dict(self) -> dict[str, Any]:
             "sequence_length": self.sequence_length,
             "num_tokens_to_generate": self.num_tokens_to_generate,
             "attn_implementation": self.attn_implementation,
-            "sdpa_backend": self.sdpa_backend,
             "compile_mode": self.compile_mode,
             "compile_options": self.compile_options | {},  # to avoid inplace modification of the original dict
             "kernelize": self.kernelize,
@@ -162,7 +164,6 @@ def from_dict(cls, data: dict[str, Any], skip_validity_check: bool = False) -> "BenchmarkConfig":
             sequence_length=data.get("sequence_length", 128),
             num_tokens_to_generate=data.get("num_tokens_to_generate", 128),
             attn_implementation=data.get("attn_implementation", "eager"),
-            sdpa_backend=data.get("sdpa_backend"),
             compile_mode=data.get("compile_mode"),
             compile_options=data.get("compile_options"),
             kernelize=data.get("kernelize", False),
@@ -213,7 +214,7 @@ def get_config_by_level(level: int) -> list[BenchmarkConfig]:
     configs = []
     # Early return if level is greater than 3: we generate all combinations of configs, maybe even w/ all compile modes
     if level >= 3:
-        for attn_implementation, sdpa_backend in BenchmarkConfig.all_attn_implementations:
+        for attn_implementation in BenchmarkConfig.all_attn_implementations:
             # Usually there is not much to gain by compiling with other modes, but we allow it for level 4
             compile_modes = BenchmarkConfig.all_compiled_modes if level >= 4 else [None, "default"]
             for cm in compile_modes:
@@ -222,7 +223,6 @@ def get_config_by_level(level: int) -> list[BenchmarkConfig]:
                         configs.append(
                             BenchmarkConfig(
                                 attn_implementation=attn_implementation,
-                                sdpa_backend=sdpa_backend,
                                 compile_mode=cm,
                                 kernelize=kernelize_on,
                                 continuous_batching=cb_on,
@@ -240,5 +240,5 @@ def get_config_by_level(level: int) -> list[BenchmarkConfig]:
         configs.append(BenchmarkConfig(attn_implementation="sdpa", compile_mode="default"))
         configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_mode="default", kernelize=True))
         configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", kernelize=True))
-        configs.append(BenchmarkConfig(attn_implementation="paged|sdpa", continuous_batching=True))
+        configs.append(BenchmarkConfig(attn_implementation="sdpa", continuous_batching=True))
     return configs
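To make the shape of the new API concrete, here is a short usage sketch based only on the signatures visible in this diff (the import path is an assumption about running from `benchmark_v2/`, not something this commit specifies): `attn_implementation` is now a plain string, `sdpa_backend` is gone, and flash-attention availability is resolved through the new `is_fa2_or_kernel_available` helper during validity checking.

```python
# Sketch of using BenchmarkConfig after this change; the import path below is an
# assumption (run from benchmark_v2/), not shown in the diff itself.
from framework.benchmark_config import BenchmarkConfig, get_config_by_level

# No sdpa_backend argument anymore; the attention choice is a single string.
config = BenchmarkConfig(attn_implementation="flash_attention_2")
config.check_validity()  # warns and falls back to "sdpa" if neither FA2 nor a kernels fallback is available
print(config.infer_name(compact=True))

# Level 3 expands the simplified all_attn_implementations list shown above.
for cfg in get_config_by_level(3):
    print(cfg.infer_name())
```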
