* [SW-184941] INC CI, CD and Promotion
Change-Id: I60c420f9776e1bdab7bb9e02e5bcbdb6891bfe52
* [SW-183320] updated setup.py
Change-Id: I592af89486cb1d9e0b5197521c428920197a9103
* [SW-177474] add HQT FP8 porting code
Change-Id: I4676f13a5ed43c444f2ec68675cc41335e7234dd
Signed-off-by: Zhou Yuwen <zyuwen@habana.ai>
* [SW-189361] Fix white list extend
Change-Id: Ic2021c248798fce37710d28014a6d59259c868a3
* [SW-191317] Raise exception according to hqt config object
Change-Id: I06ba8fa912c811c88912987c11e5c12ef328348a
* [SW-184714] Port HQT code into INC
HQT lib content was copied as is under fp8_quant
Tests were copied to 3.x torch location
Change-Id: Iec6e1fa7ac4bf1df1c95b429524c40e32bc13ac9
* [SW-184714] Add internal folder to fp8 quant
This is a folder used for experiments,
not to be used by users
Change-Id: I9e221ae582794e304e95392c0f37638f7bce69bc
* [SW-177468] Removed unused code + cleanup
Change-Id: I4d27c067e87c1a30eb1da9df16a16c46d092c638
* Fix errors in regression_detection
Change-Id: Iee5318bd5593ba349812516eb5641958ece3c438
* [SW-187731] Save orig module as member of patched module
This allows direct usage of the original module methods,
which solves torch compile issue
Change-Id: I464d8bd1bacdfc3cd1f128a67114e1e43f092632
* [SW-190899] Install packages according to configuration
Change-Id: I570b490658f5d2c5399ba1db93f8f52f56449525
* [SW-184689] use finalize_calibration internally for one step flow
Change-Id: Ie0b8b426c951cf57ed7e6e678c86813fb2d05c89
* [SW-191945] align requirement_pt.txt in gerrit INC with Github INC
Change-Id: If5c0dbf21bf989af37a8e29246e4f8760cd215ef
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* [SW-192358] Remove HQT reference in INC
Change-Id: Ic25f9323486596fa2dc6d909cd568a37ab84dd5e
* [SW-191415] update fp8 maxAbs observer using torch.copy_
Change-Id: I3923c832f9a8a2b14e392f3f4719d233a457702f
* [SW-184943] Enhance INC WOQ model loading
- Support loading huggingface WOQ model
- Abstract WeightOnlyLinear base class. Add INCWeightOnlyLinear and HPUWeightOnlyLinear subclasses
- Load woq linear weight module by module
- Save hpu format tensor to reuse it once load it again
Change-Id: I679a42759b49e1f45f52bbb0bdae8580a23d0bcf
* [SW-190303] Implement HPUWeightOnlyLinear class in INC
Change-Id: Ie05c8787e708e2c3559dce24ef0758d6c498ac41
* [SW-192809] fix json_file bug when instantiating FP8Config class
Change-Id: I4a715d0a706efe20ccdb49033755cabbc729ccdc
Signed-off-by: Zhou Yuwen <zyuwen@habana.ai>
* [SW-192931] align setup.py with github INC and remove fp8_convert
Change-Id: Ibbc157646cfcfad64b323ecfd96b9bbda5ba9e2f
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* [SW-192917] Update all HQT logic files with pre-commit check
Change-Id: I119dc8578cb10932fd1a8a674a8bdbf61f978e42
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* update docstring
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
* add fp8 example and document (#1639)
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* Update settings to be compatible with gerrit
* enhance ut
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
* move fp8 sample to helloworld folder
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
* update torch version of habana docker
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update readme demo
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* update WeightOnlyLinear to INCWeightOnlyLinear
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add docstring for FP8Config
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* fix pylint
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* update fp8 test scripts
Signed-off-by: chensuyue <suyue.chen@intel.com>
* delete deps
Signed-off-by: chensuyue <suyue.chen@intel.com>
* update container into v1.17.0
Signed-off-by: chensuyue <suyue.chen@intel.com>
* update docker version
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* update pt ut
Signed-off-by: chensuyue <suyue.chen@intel.com>
* add lib path
Signed-off-by: chensuyue <suyue.chen@intel.com>
* fix dir issue
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update fp8 test scope
Signed-off-by: chensuyue <suyue.chen@intel.com>
* fix typo
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* update fp8 test scope
Signed-off-by: chensuyue <suyue.chen@intel.com>
* update pre-commit-ci
Signed-off-by: chensuyue <suyue.chen@intel.com>
* work around for hpu
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* fix UT
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* fix parameter
Signed-off-by: chensuyue <suyue.chen@intel.com>
* omit some test
Signed-off-by: chensuyue <suyue.chen@intel.com>
* update main page example to llm loading
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix autotune
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
---------
Signed-off-by: Zhou Yuwen <zyuwen@habana.ai>
Signed-off-by: xinhe3 <xinhe3@hababa.ai>
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
Co-authored-by: Ron Ben Moshe <rbenmoshe@habana.ai>
Co-authored-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Danny Semiat <dsemiat@habana.ai>
Co-authored-by: smarkovichgolan <smarkovich@habana.ai>
Co-authored-by: Dudi Lester <dlester@habana.ai>
After successfully installing these packages, try your first quantization program.
The following example code demonstrates FP8 Quantization; it is supported by the Intel Gaudi2 AI Accelerator.

To try it on Intel Gaudi2, a docker image with the Gaudi Software Stack is recommended; please refer to the following script for environment setup. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).

To try INT4 model inference, please directly use [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers), which leverages Intel Neural Compressor for model quantization.
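A minimal sketch of the FP8 calibration-and-quantization flow described above is shown below. It assumes a Gaudi2 (HPU) environment with the Habana PyTorch bridge installed, the `FP8Config`/`prepare`/`convert` API from `neural_compressor.torch.quantization`, and a user-defined calibration function; exact argument names may differ between releases.

```python
import torch
import torchvision.models as models

from neural_compressor.torch.quantization import FP8Config, convert, prepare


def calib_func(model):
    # User-defined calibration: run a few representative batches so the
    # observers can collect maxabs statistics.
    with torch.no_grad():
        for _ in range(10):
            model(torch.randn(1, 3, 224, 224).to("hpu"))


model = models.resnet18().to("hpu")

config = FP8Config(fp8_config="E4M3")  # assumed kwarg selecting the FP8 format
model = prepare(model, config)         # insert measurement hooks
calib_func(model)                      # calibrate on representative data
model = convert(model)                 # convert the measured model to FP8
```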
### Weight-Only Large Language Model Loading (LLMs)

The following example code demonstrates weight-only large language model loading on the Intel Gaudi2 AI Accelerator.
```python
import torch

from neural_compressor.torch.quantization import load

model_name = "TheBloke/Llama-2-7B-GPTQ"
model = load(
    model_name_or_path=model_name,
    format="huggingface",
    device="hpu",
    torch_dtype=torch.bfloat16,
)
```
Intel Neural Compressor will convert the model format from auto-gptq to hpu format on the first load and save hpu_model.safetensors to the local cache directory for the next load, so the first load may take a while.
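Once loaded, the returned object can be used like a regular `transformers` model on HPU. The snippet below is a usage sketch under that assumption; the prompt and generation settings are illustrative only.

```python
import torch
from transformers import AutoTokenizer

from neural_compressor.torch.quantization import load

model_name = "TheBloke/Llama-2-7B-GPTQ"
model = load(
    model_name_or_path=model_name,
    format="huggingface",
    device="hpu",
    torch_dtype=torch.bfloat16,
)

# Pair the loaded weight-only model with the checkpoint's tokenizer and generate.
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Once upon a time,", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```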
3. [Get Started with FP8 Quantization](#get-started-with-fp8-quantization)
4. [Examples](#examples)

## Introduction
Floating point 8 (FP8) is a promising data type for low-precision quantization; it provides a data distribution that is completely different from INT8, as shown below.
<div align="center">
  <img src="./imgs/fp8_dtype.png" height="250"/>
</div>
Intel Gaudi2, also known as HPU, provides this data type capability for low-precision quantization, which includes `E4M3` and `E5M2`. For more information about these two data types, please refer to [this paper](https://arxiv.org/abs/2209.05433).
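For intuition, `E4M3` trades exponent range for an extra mantissa bit, while `E5M2` covers a wider range with less precision. The short sketch below only inspects the two formats; it assumes PyTorch 2.1 or newer (where `torch.float8_e4m3fn` and `torch.float8_e5m2` are available) and is independent of the Gaudi-specific flow.

```python
import torch

# Compare the numeric properties of the two FP8 formats.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")

# Round-tripping through FP8 shows the precision loss relative to BF16.
x = torch.randn(4, dtype=torch.bfloat16)
x_fp8 = x.to(torch.float8_e4m3fn)
print(x, x_fp8.to(torch.bfloat16))
```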
Intel Neural Compressor provides general quantization APIs to leverage HPU FP8 capability; with simple APIs, users can get an 8-bit model with lower memory usage and lower compute cost.
<td class="tg-0pky">The observer to measure the statistics.</td>
52
+
<td class="tg-0pky">maxabs (default), saves all tensors to files.</td>
53
+
</tr>
54
+
<tr>
55
+
<td class="tg-0pky">allowlist</td>
56
+
<td class="tg-0pky">List of nn.Module names or types to quantize. When setting an empty list, all the supported modules will be quantized by default. See Supported Modules. Not setting the list at all is not recommended as it will set the allowlist to these modules only: torch.nn.Linear, torch.nn.Conv2d, and BMM.</td>
<td class="tg-0pky">The mode, measure or quantize, to run HQT with.</td>
67
+
<td class="tg-0pky">MEASURE - Measure statistics of all modules and emit the results to dump_stats_path.<br>QUANTIZE - Quantize and run the model according to the provided measurements.<br>AUTO (default) - Select from [MEASURE, QUANTIZE] automatically.</td>
68
+
</tr>
69
+
<tr>
70
+
<td class="tg-0pky">dump_stats_path</td>
71
+
<td class="tg-0pky">The path to save and load the measurements. The path is created up until the level before last "/". The string after the last / will be used as prefix to all the measurement files that will be created.</td>
<td class="tg-0pky">The method for calculating the scale from the measurement.</td>
77
+
<td class="tg-0pky">- without_scale - Convert to/from FP8 without scaling.<br>- unit_scale - Always use scale of 1.<br>- maxabs_hw (default) - Scale is calculated to stretch/compress the maxabs measurement to the full-scale of FP8 and then aligned to the corresponding HW accelerated scale.<br>- maxabs_pow2 - Scale is calculated to stretch/compress the maxabs measurement to the full-scale of FP8 and then rounded to the power of 2.<br>- maxabs_hw_opt_weight - Scale of model params (weights) is chosen as the scale that provides minimal mean-square-error between quantized and non-quantized weights, from all possible HW accelerated scales. Scale of activations is calculated the same as maxabs_hw.<br>- act_maxabs_pow2_weights_pcs_opt_pow2 - Scale of model params (weights) is calculated per-channel of the params tensor. The scale per-channel is calculated the same as maxabs_hw_opt_weight. Scale of activations is calculated the same as maxabs_pow2.<br>- act_maxabs_hw_weights_pcs_maxabs_pow2 - Scale of model params (weights) is calculated per-channel of the params tensor. The scale per-channel is calculated the same as maxabs_pow2. Scale of activations is calculated the same as maxabs_hw.</td>
78
+
</tr>
79
+
<tr>
80
+
<td class="tg-0pky">measure_exclude</td>
81
+
<td class="tg-0pky">If this attribute is not defined, the default is OUTPUT. Since most models do not require measuring output tensors, you can exclude it to speed up the measurement process.</td>
82
+
<td class="tg-0pky">NONE - All tensors are measured.<br>OUTPUT (default) - Excludes measurement of output tensors.</td>
83
+
</tr>
84
+
</tbody></table>
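As an illustration of how these attributes fit together, the sketch below builds one config for the measurement pass and one for the quantization pass. It assumes the attributes in the table map one-to-one onto `FP8Config` keyword arguments; the path and prefix are illustrative, so check the installed release for exact names and defaults.

```python
from neural_compressor.torch.quantization import FP8Config

# Measurement pass: collect maxabs statistics and dump them with the prefix
# "measure" under ./fp8_output/ (illustrative path, not a documented default).
measure_config = FP8Config(
    mode="MEASURE",
    observer="maxabs",
    dump_stats_path="./fp8_output/measure",
    measure_exclude="OUTPUT",  # skip output tensors to speed up measurement
)

# Quantization pass: reuse the saved measurements and pick a scaling method.
quant_config = FP8Config(
    mode="QUANTIZE",
    scale_method="maxabs_hw",
    dump_stats_path="./fp8_output/measure",
)
```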
## Get Started with FP8 Quantization

### Demo Usage

```python
from neural_compressor.torch.quantization import (
    FP8Config,
    convert,
    prepare,
)
```
| Large Language Model (LLM) | [Link](https://github.com/HabanaAI/optimum-habana-fork/tree/habana-main/examples/text-generation#running-with-fp8) |

> Note: For LLMs, Optimum-habana provides higher performance based on modified modeling files, so the LLM link above goes to Optimum-habana, which utilizes Intel Neural Compressor for FP8 quantization internally.