Conversation
Summary of Changes

Hello @yueming-yuan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the framework's capabilities for INT4 quantization. It introduces a more sophisticated quantization scheme that supports asymmetric weight representation and provides dedicated utilities for converting Hugging Face models to this new INT4 format. These changes aim to improve efficiency and flexibility when working with low-precision models, particularly in distributed environments, by refining how weights are processed and synchronized.
Code Review
This pull request introduces support for INT4 asymmetric quantization, including new conversion scripts and updates to the core quantization logic. The changes are a valuable addition. My review focuses on improving robustness and maintainability. I've identified a few instances of hardcoded device names that should be dynamic, a significant code duplication issue in one of the new scripts that should be refactored, and a common bug in command-line argument parsing. Addressing these points will make the code more robust and easier to maintain.
```python
awq_linear.bias = linear.bias.clone().half()

pack_num = 32 // awq_linear.w_bit
device = torch.device(f"cuda:{torch.cuda.current_device()}")
```
The device is hardcoded to the current CUDA device. This can cause issues if the model's linear layer is on a different device (e.g., cuda:1 when the current device is cuda:0). It's better to use the device of the input linear layer's weight.
Suggested change:

```diff
-device = torch.device(f"cuda:{torch.cuda.current_device()}")
+device = linear.weight.device
```
```python
qw, s, zp = pack_layer(param, group_size, is_symmetric)
qweight_name = name.replace(".weight", ".weight_packed")
scale_name = name.replace(".weight", ".weight_scale")
weight_shape = torch.tensor(param.shape, dtype=torch.int32, device="cuda")
```
The device for weight_shape is hardcoded to 'cuda'. This will fail if the parameter is on a different device (e.g., CPU). It should use the device of the parameter itself.
Suggested change:

```diff
-weight_shape = torch.tensor(param.shape, dtype=torch.int32, device="cuda")
+weight_shape = torch.tensor(param.shape, dtype=torch.int32, device=param.device)
```
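The same pattern applies anywhere a new tensor is created alongside an existing parameter: derive the device from that parameter rather than assuming CUDA. A minimal illustration under that assumption (names are placeholders, not code from this PR):

```python
import torch

def shape_tensor_like(param: torch.Tensor) -> torch.Tensor:
    # Record the original shape on the same device as the source parameter,
    # so the conversion also works for CPU-resident (e.g. offloaded) weights.
    return torch.tensor(param.shape, dtype=torch.int32, device=param.device)

cpu_weight = torch.randn(4, 8)        # no GPU required
print(shape_tensor_like(cpu_weight))  # tensor([4, 8], dtype=torch.int32)
```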
```python
def pack_to_int32(
    value,
    num_bits,
    packed_dim=1,
    sym=False,
):
    # if value.dtype is not torch.int8:
    #     raise ValueError("Tensor must be quantized to torch.int8 before packing")

    if num_bits > 8:
        raise ValueError("Packing is only supported for less than 8 bits")

    if num_bits < 1:
        raise ValueError(f"num_bits must be at least 1, got {num_bits}")

    # Convert to unsigned range for packing, matching quantization offset
    if sym:
        offset = 1 << (num_bits - 1)
        value = (value + offset).to(torch.uint8)
    device = value.device

    pack_factor = 32 // num_bits

    if packed_dim == 0:
        value = value.transpose(0, 1)

    rows, cols = value.shape
    padded_cols = math.ceil(cols / pack_factor) * pack_factor
    pad_len = padded_cols - cols

    if pad_len > 0:
        value = torch.nn.functional.pad(value, (0, pad_len))

    num_groups = padded_cols // pack_factor

    # Use int32 here
    reshaped = value.view(rows, num_groups, pack_factor).to(torch.int32)
    bit_shifts = torch.arange(pack_factor, device=device, dtype=torch.int32) * num_bits
    packed = (reshaped << bit_shifts).sum(dim=2, dtype=torch.int32)

    if packed_dim == 0:
        packed = packed.transpose(0, 1)

    return packed


def round_to_quantized_type_dtype(
    tensor,
    dtype,
    cast_to_original_dtype=False,
):
    original_dtype = tensor.dtype
    iinfo = torch.iinfo(dtype)
    rounded = torch.round(torch.clamp(tensor, iinfo.min, iinfo.max)).to(dtype)
    if cast_to_original_dtype:
        return rounded.to(original_dtype)
    return rounded


@torch.no_grad()
def quantize(
    x,
    scale,
    zero_point,
    dtype=torch.int8,
):
    group_size = x.shape[-1] // scale.shape[-1]
    output_dtype = dtype
    output = torch.zeros_like(x).to(output_dtype)

    reshaped_dims = (
        math.ceil(x.shape[-1] / group_size),
        group_size,
    )
    x = x.unflatten(-1, reshaped_dims)

    scaled = x / scale.unsqueeze(-1)

    if zero_point is not None:
        zero_point = zero_point.unsqueeze(-1)
        scaled += zero_point.to(x.dtype)

    # clamp and round
    output = round_to_quantized_type_dtype(tensor=scaled, dtype=dtype)

    output = output.flatten(start_dim=-2)
    output = output.to(output_dtype)

    return output


def pack_layer(weight, group_size, sym=True):
    w, scale, zp = fake_int4_quant_cuda.fake_int4_quant_cuda(weight, (1, group_size), sym)
    w = w.view(weight.shape[0], 1, weight.shape[1] // group_size, group_size)
    scale = scale.view(weight.shape[0], 1, weight.shape[1] // group_size, 1)
    zp = zp.view(weight.shape[0], 1, weight.shape[1] // group_size, 1)
    if sym:
        w = w * scale
    else:
        w = (w - zp) * scale
    w = w.view(weight.shape)
    scale = scale.view(weight.shape[0], -1).contiguous()
    if not sym:
        zp = zp.view(weight.shape[0], -1)
        zeros = zp.t().contiguous().to(torch.float32)
        zeros = zeros.to(dtype=torch.int32, device=w.device)
        zeros = zeros.reshape(-1, zeros.shape[1] // 8, 8)
        new_order_map = torch.tensor([0, 4, 1, 5, 2, 6, 3, 7], device=zeros.device) * 4
        zeros = zeros << new_order_map
        packed_zp = torch.sum(zeros, dim=-1).to(torch.int32)
    else:
        zp = None
        packed_zp = None

    quantized_weight = quantize(
        x=w,
        scale=scale,
        zero_point=zp,
        dtype=torch.int8 if sym else torch.uint8,
    )
    packed_weight = pack_to_int32(quantized_weight, 4, sym=sym)
    return packed_weight, scale, packed_zp
```
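For context on the layout produced by `pack_to_int32`, here is a tiny sanity check (illustrative only, assuming the functions above are importable from this module; not part of the PR):

```python
import torch

# Eight unsigned 4-bit values fit into one int32 (pack_factor = 32 // 4 = 8).
vals = torch.arange(16, dtype=torch.uint8).reshape(2, 8)
packed = pack_to_int32(vals, num_bits=4)  # asymmetric path: values are already unsigned
print(packed.shape)              # torch.Size([2, 1])
print(hex(packed[0, 0].item()))  # 0x76543210 -- value i occupies bits 4*i .. 4*i+3
```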
There is significant code duplication here. The functions `pack_to_int32`, `round_to_quantized_type_dtype`, `quantize`, and `pack_layer` are copied from `miles/backends/megatron_utils/megatron_to_hf/processors/quantizer_compressed_tensors.py`. This will make maintenance difficult. This utility code should be extracted into a shared module and imported in both places to avoid duplication.
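A minimal sketch of the suggested refactor, assuming a shared module is introduced (the path `miles/utils/int4_packing.py` and the layout below are hypothetical, not part of this PR):

```python
# miles/utils/int4_packing.py -- hypothetical shared location for the packing helpers.
# The bodies of pack_to_int32, round_to_quantized_type_dtype, quantize, and pack_layer
# would move here verbatim from quantizer_compressed_tensors.py.

__all__ = [
    "pack_to_int32",
    "round_to_quantized_type_dtype",
    "quantize",
    "pack_layer",
]

# Both the Megatron-to-HF processor and the new conversion script would then import
# the single implementation instead of keeping a copy, e.g.:
#   from miles.utils.int4_packing import pack_layer, pack_to_int32, quantize
```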
```python
qw, s, zp = pack_layer(weight, group_size, is_symmetric)
qweight_name = name.replace(".weight", ".weight_packed")
scale_name = name.replace(".weight", ".weight_scale")
weight_shape = torch.tensor(weight.shape, dtype=torch.int32, device="cuda")
```
The device for weight_shape is hardcoded to 'cuda'. This will fail if the weight tensor is on a different device (e.g., CPU). It should use the device of the weight tensor itself. This issue would be resolved by de-duplicating the quantization logic as mentioned in another comment.
Suggested change:

```diff
-weight_shape = torch.tensor(weight.shape, dtype=torch.int32, device="cuda")
+weight_shape = torch.tensor(weight.shape, dtype=torch.int32, device=weight.device)
```
| parser.add_argument("--model-dir", type=str, required=True, help="local BF16 path") | ||
| parser.add_argument("--save-dir", type=str, required=True) | ||
| parser.add_argument("--group-size", type=int, default=32, help="Group Size") | ||
| parser.add_argument("--is-symmetric", type=bool, default=True, help="Is Symmetric") |
Using `type=bool` with argparse does not work as expected for command-line flags. For example, `--is-symmetric False` will be interpreted as `True` because `bool('False')` is `True`. The recommended way to handle boolean flags is with `action='store_true'` / `action='store_false'`, or `action=argparse.BooleanOptionalAction` for Python 3.9+.
| parser.add_argument("--is-symmetric", type=bool, default=True, help="Is Symmetric") | |
| parser.add_argument("--is-symmetric", action=argparse.BooleanOptionalAction, default=True, help="Is Symmetric") |
```python
data_list = []
for _ in range(num_samples):
    i = random.randint(0, encoded.shape[0] - seq_len - 1)
```
- asymmetric quantization
- `convert_hf_to_int4.py` and `convert_hf_to_int4_direct.py` under `utils`
- TODO: `sglang-miles` branch under `sglang`