Add KV Cache Memory Estimator Example Script #29736
Open · ksenthilnathan02 wants to merge 2 commits into vllm-project:main from ksenthilnathan02:feature/kv-cache-memory-estimator
+119 −0
New file, 119 lines:

```python
"""
Motivation:
A frequent question from vLLM users is how to estimate the memory required for
the attention key/value (KV) cache when scaling up context length, batch size,
or model size. While the underlying formulas are simple, there was no clear,
standalone example in the repository that demonstrates how to compute an
approximate KV memory footprint directly from a model's configuration.

What this example provides:
This script extracts the relevant architectural attributes (number of layers and
hidden size) from a Hugging Face model configuration and applies a simple KV
sizing rule to estimate memory usage for a given seq_len, batch_size, and dtype.
The goal is to give users a back-of-the-envelope understanding of how KV cache
memory scales, without requiring them to run inference or inspect GPU memory.

Why this is helpful:
- Helps plan for long-context inference workloads
- Allows users to reason about memory tradeoffs before running vLLM
- Clarifies how KV memory scales with model architecture
- Useful for educational purposes when learning about LLM inference internals

This estimator intentionally abstracts away fragmentation, paged layout
overhead, and other runtime details. It is meant as a planning aid, not a
precise profiler.
"""

import argparse
from dataclasses import dataclass

try:
    from transformers import AutoConfig
except ImportError as e:
    raise SystemExit(
        "This example requires `transformers`. Install it with:\n"
        "  pip install transformers\n"
    ) from e


DTYPE_BYTES = {
    "fp16": 2,
    "bf16": 2,
    "fp32": 4,
    "int8": 1,
}


@dataclass
class KVEstimate:
    model_name: str
    num_layers: int
    hidden_size: int
    seq_len: int
    batch_size: int
    dtype: str

    def total_elements(self) -> int:
        # KV per token per layer = 2 * hidden_size (one key and one value vector)
        return self.batch_size * self.seq_len * self.num_layers * (2 * self.hidden_size)

    def total_bytes(self) -> int:
        return self.total_elements() * DTYPE_BYTES[self.dtype]

    def total_gb(self) -> float:
        return self.total_bytes() / (1024 ** 3)

    def pretty(self) -> str:
        return (
            f"Model: {self.model_name}\n"
            f"Layers: {self.num_layers}\n"
            f"Hidden size: {self.hidden_size}\n"
            f"Batch size: {self.batch_size}\n"
            f"Seq length: {self.seq_len}\n"
            f"Dtype: {self.dtype}\n"
            f"-------------------------------\n"
            f"Approx KV cache memory: {self.total_gb():.2f} GB\n"
        )


def load_model_config(model_name: str):
    cfg = AutoConfig.from_pretrained(model_name)

    num_layers = getattr(cfg, "num_hidden_layers", getattr(cfg, "n_layer", None))
    hidden_size = getattr(cfg, "hidden_size", getattr(cfg, "n_embd", None))

    if num_layers is None or hidden_size is None:
        raise ValueError(
            f"Could not extract num_layers/hidden_size from config for {model_name}."
        )

    return num_layers, hidden_size


def parse_args():
    parser = argparse.ArgumentParser(description="Estimate KV cache memory usage.")
    parser.add_argument("--model", type=str, required=True)
    parser.add_argument("--seq-len", type=int, required=True)
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--dtype", type=str, default="fp16", choices=DTYPE_BYTES.keys())
    return parser.parse_args()


def main():
    args = parse_args()
    num_layers, hidden_size = load_model_config(args.model)

    est = KVEstimate(
        model_name=args.model,
        num_layers=num_layers,
        hidden_size=hidden_size,
        seq_len=args.seq_len,
        batch_size=args.batch_size,
        dtype=args.dtype,
    )

    print(est.pretty())


if __name__ == "__main__":
    main()
```
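As a quick sanity check of the sizing rule the script applies, the snippet below plugs in illustrative dimensions (32 layers, hidden size 4096, roughly a 7B-class dense decoder; these numbers are assumptions, not read from any particular config) and reproduces the same arithmetic as `KVEstimate`:

```python
# Back-of-the-envelope check of the formula used by KVEstimate.total_elements().
# The model dimensions below are illustrative assumptions, not taken from a real
# Hugging Face config.
num_layers = 32
hidden_size = 4096
seq_len = 8192
batch_size = 1
bytes_per_elem = 2  # fp16 / bf16

elements = batch_size * seq_len * num_layers * (2 * hidden_size)
print(f"{elements * bytes_per_elem / (1024 ** 3):.2f} GB")  # -> 4.00 GB
```

Doubling either the sequence length or the batch size doubles the estimate, which is exactly the scaling behaviour the docstring aims to make visible; as noted there, real vLLM allocations will differ because of paged-block granularity and fragmentation.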
The current formula for `total_elements` overestimates the KV cache size for models that use Grouped-Query Attention (GQA) or Multi-Query Attention (MQA). The formula `2 * hidden_size` is equivalent to `2 * num_attention_heads * head_size`, but the KV cache size depends on `num_key_value_heads`. The correct formula for the number of elements per token per layer is `2 * num_key_value_heads * head_size`.

For models like Llama-3-8B, where `num_attention_heads=32` and `num_key_value_heads=8`, this leads to a 4x overestimation of the KV cache memory.

To fix this, the script should be updated to extract `num_key_value_heads` and `head_size` from the model config and use them in the calculation. Here is a suggested refactoring of the `KVEstimate` class, `load_model_config` function, and `main` function to implement this correction.
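The reviewer's suggested refactoring is not rendered in the page above. As a minimal sketch of the GQA-aware count being described (illustrative only, not the reviewer's actual patch; it assumes the common Hugging Face config attributes `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, and `head_dim`, with fallbacks where a config lacks the last two), the per-model element count could look like this:

```python
# Sketch of a GQA/MQA-aware KV element count: uses num_key_value_heads * head_size
# per token per layer instead of hidden_size, so grouped-query models are not
# overestimated. Illustrative only, not the reviewer's suggested patch.
from transformers import AutoConfig


def kv_cache_elements(model_name: str, seq_len: int, batch_size: int = 1) -> int:
    cfg = AutoConfig.from_pretrained(model_name)
    num_layers = cfg.num_hidden_layers
    num_attn_heads = cfg.num_attention_heads
    # Fall back to standard multi-head attention if the config has no GQA field.
    num_kv_heads = getattr(cfg, "num_key_value_heads", num_attn_heads)
    # Some configs expose head_dim directly; otherwise derive it from hidden_size.
    head_size = getattr(cfg, "head_dim", None) or cfg.hidden_size // num_attn_heads
    # KV per token per layer = 2 * num_key_value_heads * head_size
    return batch_size * seq_len * num_layers * 2 * num_kv_heads * head_size
```

For a pure MHA model (`num_key_value_heads == num_attention_heads`) this reduces to the original `2 * hidden_size` per token per layer, and for the Llama-3-8B figures quoted above (8 KV heads vs. 32 attention heads) it is a quarter of the hidden-size-based estimate, consistent with the 4x overestimation the reviewer points out.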