From a899a948a70f8c6fe1a517eecd3b21a98b66cedf Mon Sep 17 00:00:00 2001
From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com>
Date: Fri, 8 Nov 2024 22:19:24 +0000
Subject: [PATCH] Docs update (e7e2dd1)

---
 docs/source/en/quantization/awq.md          | 6 ++++++
 docs/source/en/quantization/contribute.md   | 4 +++-
 docs/source/ja/main_classes/quantization.md | 7 ++++---
 docs/source/zh/main_classes/quantization.md | 5 +++--
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/docs/source/en/quantization/awq.md b/docs/source/en/quantization/awq.md
index ca26844edd02..9cc1f55f89e1 100644
--- a/docs/source/en/quantization/awq.md
+++ b/docs/source/en/quantization/awq.md
@@ -127,6 +127,7 @@ The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7
 Fused module

+
 | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)   |
 |-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
 | 1 | 32 | 32 | 81.4899 | 80.2569 | 4.00 GB (5.05%) |
@@ -180,6 +181,7 @@ model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant

 The parameter `modules_to_fuse` should include:
+
 - `"attention"`: The names of the attention layers to fuse in the following order: query, key, value and output projection layer. If you don't want to fuse these layers, pass an empty list.
 - `"layernorm"`: The names of all the LayerNorm layers you want to replace with a custom fused LayerNorm. If you don't want to fuse these layers, pass an empty list.
 - `"mlp"`: The names of the MLP layers you want to fuse into a single MLP layer in the order: (gate (dense, layer, post-attention) / up / down layers).
@@ -231,7 +233,11 @@ Note this feature is supported on AMD GPUs.

+
+**Important:** The minimum required Python version for using `autoawq` is now 3.9. Ensure your environment meets this requirement to avoid compatibility issues.
+
+
 ## CPU support

 Recent versions of `autoawq` support CPU with ipex op optimizations. To get started, first install the latest version of `autoawq` by running:
diff --git a/docs/source/en/quantization/contribute.md b/docs/source/en/quantization/contribute.md
index fb7ef6992223..5f0d044348de 100644
--- a/docs/source/en/quantization/contribute.md
+++ b/docs/source/en/quantization/contribute.md
@@ -32,7 +32,7 @@ Before integrating a new quantization method into Transformers, ensure the metho
 class Linear4bit(nn.Module):
     def __init__(self, ...):
         ...
-
+
     def forward(self, x):
         return my_4bit_kernel(x, self.weight, self.bias)
 ```
@@ -44,6 +44,7 @@ This way, Transformers models can be easily quantized by replacing some instance
 For some quantization methods, they may require "pre-quantizing" the models through data calibration (e.g., AWQ). In this case, we prefer to only support inference in Transformers and let the third-party library maintained by the ML community deal with the model quantization itself.
+- Ensure that the environment meets the minimum Python version requirement of 3.9.

 ## Build a new HFQuantizer class

 1. Create a new quantization config class inside [src/transformers/utils/quantization_config.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/quantization_config.py) and make sure to expose the new quantization config inside Transformers main `init` by adding it to the [`_import_structure`](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py#L1088) object of [src/transformers/__init__.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py).
@@ -64,6 +65,7 @@ For some quantization methods, they may require "pre-quantizing" the models thro
 6. Write the `_process_model_after_weight_loading` method. This method enables implementing additional features that require manipulating the model after loading the weights.
+
 7. Document everything! Make sure your quantization method is documented by adding a new file under `docs/source/en/quantization` and adding a new row in the table in `docs/source/en/quantization/overview.md`.

 8. Add tests!
 You should add tests by first adding the package in our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out how it is implemented for other quantization methods.
diff --git a/docs/source/ja/main_classes/quantization.md b/docs/source/ja/main_classes/quantization.md
index a93d06b25745..3ac7beba401f 100644
--- a/docs/source/ja/main_classes/quantization.md
+++ b/docs/source/ja/main_classes/quantization.md
@@ -30,6 +30,8 @@ rendered properly in your Markdown viewer.

 以下のコードを実行するには、以下の要件がインストールされている必要があります:

+- Python 3.9 以上が必要です。
+
 - 最新の `AutoGPTQ` ライブラリをインストールする。
 `pip install auto-gptq` をインストールする。
@@ -43,7 +45,6 @@ rendered properly in your Markdown viewer.
 `pip install --upgrade accelerate` を実行する。

 GPTQ統合は今のところテキストモデルのみをサポートしているので、視覚、音声、マルチモーダルモデルでは予期せぬ挙動に遭遇するかもしれないことに注意してください。
-
 ### Load and quantize a model

 GPTQ は、量子化モデルを使用する前に重みのキャリブレーションを必要とする量子化方法です。トランスフォーマー モデルを最初から量子化する場合は、量子化モデルを作成するまでに時間がかかることがあります (`facebook/opt-350m`モデルの Google colab では約 5 分)。
@@ -193,7 +194,7 @@ model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4
 torch.float32
 ```

-### FP4 quantization 
+### FP4 quantization

 #### Requirements
@@ -442,6 +443,6 @@ Hugging Face エコシステムのアダプターの公式サポートにより

 [[autodoc]] BitsAndBytesConfig

-## Quantization with 🤗 `optimum` 
+## Quantization with 🤗 `optimum`

 `optimum`でサポートされている量子化方法の詳細については、[Optimum ドキュメント](https://huggingface.co/docs/optimum/index) を参照し、これらが自分のユースケースに適用できるかどうかを確認してください。
diff --git a/docs/source/zh/main_classes/quantization.md b/docs/source/zh/main_classes/quantization.md
index d303906a9956..09855409c3a3 100644
--- a/docs/source/zh/main_classes/quantization.md
+++ b/docs/source/zh/main_classes/quantization.md
@@ -139,8 +139,9 @@ model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", att
 - 安装最新版本的`accelerate`库: `pip install --upgrade accelerate`

-请注意，目前GPTQ集成仅支持文本模型，对于视觉、语音或多模态模型可能会遇到预期以外结果。
+- Python 版本要求至少为 3.9

+请注意，目前GPTQ集成仅支持文本模型，对于视觉、语音或多模态模型可能会遇到预期以外结果。
 ### 加载和量化模型

 GPTQ是一种在使用量化模型之前需要进行权重校准的量化方法。如果您想从头开始对transformers模型进行量化，生成量化模型可能需要一些时间（在Google Colab上对`facebook/opt-350m`模型量化约为5分钟）。
@@ -307,7 +308,7 @@ torch.float32
 ```

-### FP4 量化 
+### FP4 量化

 #### 要求
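
The `"attention"`, `"layernorm"`, and `"mlp"` keys of `modules_to_fuse` described in the awq.md hunk are easier to follow next to a concrete config. The sketch below is illustrative only and is not part of the patch; the checkpoint name, module names, head counts, and hidden size are assumptions for a Llama-style AWQ model and must be adapted to the architecture being fused.

```python
from transformers import AutoModelForCausalLM, AwqConfig

# Hypothetical AWQ checkpoint, used only for illustration.
model_id = "TheBloke/Yi-34B-AWQ"

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    modules_to_fuse={
        # attention projections in query, key, value, output order
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
        # LayerNorm layers to replace with the fused implementation
        "layernorm": ["ln1", "ln2"],
        # MLP layers in gate / up / down order
        "mlp": ["gate_proj", "up_proj", "down_proj"],
        # architecture-specific values; assumed here, not taken from the patch
        "use_alibi": False,
        "num_attention_heads": 56,
        "num_key_value_heads": 8,
        "hidden_size": 7168,
    },
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quantization_config
).to(0)
```

Passing an empty list for `"attention"` or `"layernorm"` skips fusion for that group, matching the behaviour described in the bullets above.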
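The `Linear4bit` snippet in the contribute.md hunk is intentionally pseudocode. A self-contained stand-in that actually runs might look like the following; `my_4bit_kernel` is faked with a plain matmul purely to make the structural requirement concrete (an `nn.Module` whose `forward` dispatches to the method's kernel), and a real layer would store packed 4-bit weights plus scales rather than a dense parameter.

```python
import torch
import torch.nn as nn


def my_4bit_kernel(x, weight, bias):
    # Placeholder for the real quantized matmul kernel.
    out = x @ weight.t()
    return out if bias is None else out + bias


class Linear4bit(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # A real implementation would keep packed 4-bit weights and quantization metadata here.
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        return my_4bit_kernel(x, self.weight, self.bias)


# Usage: shape-compatible with nn.Linear(16, 32).
layer = Linear4bit(16, 32)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 32])
```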
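The contribute.md hunk also walks through the integration steps (a config class in `quantization_config.py`, the `_process_model_after_weight_loading` hook, docs, tests). A bare-bones skeleton is sketched below under the assumption that the relevant base classes are `QuantizationConfigMixin` and `HfQuantizer` as in the Transformers sources at the referenced commit; the `MyMethod*` names are placeholders, the hook signatures are simplified, and some members of the real base class are omitted.

```python
import torch.nn as nn

from transformers.quantizers.base import HfQuantizer
from transformers.utils.quantization_config import QuantizationConfigMixin


class MyMethodConfig(QuantizationConfigMixin):
    """Step 1: a config class living in src/transformers/utils/quantization_config.py."""

    def __init__(self, bits: int = 4, **kwargs):
        # Real configs set quant_method to a QuantizationMethod enum member; a plain
        # string is used here to keep the sketch short.
        self.quant_method = "my_method"
        self.bits = bits


class MyMethodHfQuantizer(HfQuantizer):
    """Skeleton quantizer wiring the config into the model-loading hooks."""

    # True would mean the method needs pre-quantized (calibrated) weights,
    # i.e. Transformers only supports inference for it.
    requires_calibration = False

    def validate_environment(self, *args, **kwargs):
        # Check backend availability, hardware support, Python >= 3.9, etc.
        pass

    def _process_model_before_weight_loading(self, model: nn.Module, **kwargs):
        # Typically: swap nn.Linear instances for the method's quantized layer.
        return model

    def _process_model_after_weight_loading(self, model: nn.Module, **kwargs):
        # Step 6: extra features that need the already-loaded weights.
        return model
```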