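A key feature of chatglm.cpp is that it optimizes large models through quantization, so that they can run inference efficiently on a CPU.
This post looks at how chatglm.cpp quantizes a model.
Using chatglm.cpp takes two main steps: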
- Use convert.py to quantize the model and produce a file in ggml format
- Use ./build/bin/main to run the model
convert.py
As of this writing (commit: 7da55260, 231108), chatglm.cpp already supports several LLMs; ChatGLM is used as the example here.
class BaseConverter:
    @classmethod
    def convert(cls, f, model, tokenizer, ggml_type):
        f.write(b"ggml")  # magic
        f.write(struct.pack("ii", cls.MODEL_TYPE.value, 1))  # model type & version
        cls.dump_config(f, model.config, ggml_type)
        cls.dump_tokenizer(f, tokenizer)
        cls.dump_model(f, model, ggml_type)
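After the magic bytes and the model type and version, convert runs three major steps that handle the config, the tokenizer, and the model in turn, all writing to the same file object f.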
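The header written at the start of convert can be read back with the same struct format. The following is a minimal sketch using an in-memory buffer; the helper names write_header and read_header are hypothetical, and the model type value 1 is only a placeholder, not a value taken from convert.py.

```python
import io
import struct

MAGIC = b"ggml"  # same magic bytes that convert() writes


def write_header(f, model_type: int, version: int = 1) -> None:
    """Write the magic, then model type and version as two native ints."""
    f.write(MAGIC)
    f.write(struct.pack("ii", model_type, version))


def read_header(f):
    """Read the header back and return (model_type, version)."""
    magic = f.read(len(MAGIC))
    if magic != MAGIC:
        raise ValueError(f"unexpected magic: {magic!r}")
    model_type, version = struct.unpack("ii", f.read(struct.calcsize("ii")))
    return model_type, version


buf = io.BytesIO()
write_header(buf, model_type=1)  # 1 is a placeholder model type
buf.seek(0)
header = read_header(buf)
```

Because the format string "ii" has no byte-order prefix, the integers use the native byte order of the machine that runs the script.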
Only dump_model is examined here.
Here is dump_model from ChatGLMConverter:
@staticmethod
def dump_model(f, model, ggml_type):
    assert torch.allclose(
        model.state_dict()["transformer.word_embeddings.weight"], model.state_dict()["lm_head.weight"]
    ), "unimplemented: lm_head weight must be tied to input embedding"

    weight_names = ["transformer.word_embeddings.weight"]
    for i in range(model.config.num_layers):
        weight_names += [
            f"transformer.layers.{i}.input_layernorm.weight",
            f"transformer.layers.{i}.input_layernorm.bias",
            f"transformer.layers.{i}.attention.query_key_value.weight",
            f"transformer.layers.{i}.attention.query_key_value.bias",
            f"transformer.layers.{i}.attention.dense.weight",
            f"transformer.layers.{i}.attention.dense.bias",
            f"transformer.layers.{i}.post_attention_layernorm.weight",
            f"transformer.layers.{i}.post_attention_layernorm.bias",
            f"transformer.layers.{i}.mlp.dense_h_to_4h.weight",
            f"transformer.layers.{i}.mlp.dense_h_to_4h.bias",
            f"transformer.layers.{i}.mlp.dense_4h_to_h.weight",
            f"transformer.layers.{i}.mlp.dense_4h_to_h.bias",
        ]
    weight_names += [
        "transformer.final_layernorm.weight",
        "transformer.final_layernorm.bias",
    ]
    dump_state_dict(f, weight_names, model.state_dict(), model.config.quantization_bit, ggml_type)
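Inputs:
- f: the output file object
- model: the loaded ChatGLM model
- ggml_type: the target data type in ggml (see ggml_type); the types whose names contain Q are the quantized types
Following the ChatGLM model structure, the function collects all weight names and hands them to dump_state_dict: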
def dump_state_dict(f, weight_names, state_dict, quantization_bit, ggml_type):
    tensor_info = []
    for name in tqdm(weight_names, desc="Processing model states"):
        tensor = state_dict[name]
        if tensor.ndim == 2:
            # 2d weight: should quantize it if needed

            # step 1: de-quantize it back to float32
            if tensor.dtype == torch.int8:
                assert quantization_bit in [4, 8]
                scale = state_dict[f"{name}_scale"].float()  # channel-wise scale

                if quantization_bit == 4:
                    # convert int4 weight to int8
                    low_bits = ((tensor << 4) & 0xF0) >> 4
                    high_bits = (tensor & 0xF0) >> 4
                    tensor = torch.stack((high_bits, low_bits), dim=-1).view(tensor.shape[0], -1)
                tensor = tensor * scale[:, None]
            else:
                tensor = tensor.float()

            # step 2: quantize it into ggml format
            tensor_ggml_type = ggml_type
        else:
            # 1d weight: convert it to float32
            assert tensor.ndim == 1
            tensor = tensor.float()
            tensor_ggml_type = GGMLType.F32

        dump_tensor(f, name, tensor, tensor_ggml_type)
        tensor_info.append((name, tensor.shape, tensor_ggml_type.name))

    print(tabulate(tensor_info, headers=["name", "shape", "dtype"], tablefmt="psql"))
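Step 1 in the code above unpacks two 4-bit weights from every stored byte and multiplies them by the channel-wise scale. The pure-Python sketch below illustrates that bit manipulation on a single row. It assumes the nibbles are signed two's-complement values, which is what the arithmetic right shift on an int8 tensor would produce; the function names are hypothetical.

```python
def to_signed4(nibble: int) -> int:
    """Interpret a 4-bit value (0..15) as a signed two's-complement int4."""
    return nibble - 16 if nibble >= 8 else nibble


def unpack_int4_pair(byte: int):
    """Split one packed byte (0..255) into (high, low) signed 4-bit values."""
    high = to_signed4((byte >> 4) & 0x0F)
    low = to_signed4(byte & 0x0F)
    return high, low


def dequantize_row(packed_row, scale: float):
    """Unpack a row of packed bytes and apply that row's scale.

    The output order is high nibble first, then low nibble, matching
    torch.stack((high_bits, low_bits), dim=-1) in dump_state_dict.
    """
    values = []
    for byte in packed_row:
        values.extend(unpack_int4_pair(byte))
    return [v * scale for v in values]


row = dequantize_row([0x12, 0xF8], scale=0.5)
```

Each packed byte yields two output values, so a row of n bytes becomes a row of 2n floats before it is re-quantized into the ggml format.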
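In the part the author labels step 2, only 2D tensors are tagged with the requested ggml_type and quantized according to that setting; 1D tensors are always stored as F32.
Every weight is then passed to dump_tensor: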
def dump_tensor(f, name: str, tensor: torch.Tensor, ggml_type: GGMLType):
    assert tensor.dtype == torch.float32

    # tensor name
    f.write(struct.pack("i", len(name.encode())))
    f.write(name.encode())

    # tensor shape & dtype
    f.write(struct.pack("i" * (2 + tensor.ndim), tensor.ndim, *tensor.shape, ggml_type.value))

    # tensor data
    if ggml_type == GGMLType.F32:
        tensor = tensor.float()
    elif ggml_type == GGMLType.F16:
        tensor = tensor.half()
    elif ggml_type == GGMLType.Q8_0:
        tensor = quantize_q8_0(tensor)
    elif ggml_type == GGMLType.Q4_0:
        tensor = quantize_q4_0(tensor)
    elif ggml_type == GGMLType.Q4_1:
        tensor = quantize_q4_1(tensor)
    elif ggml_type == GGMLType.Q5_0:
        tensor = quantize_q5_0(tensor)
    elif ggml_type == GGMLType.Q5_1:
        tensor = quantize_q5_1(tensor)
    else:
        raise NotImplementedError(f"Cannot dump tensor of dtype {tensor.dtype}")

    # align address
    aligned_pos = (f.tell() + (GGML_MEM_ALIGN - 1)) // GGML_MEM_ALIGN * GGML_MEM_ALIGN
    f.seek(aligned_pos)
    tensor.numpy().tofile(f)
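It first writes the weight's basic information (name, number of dimensions, shape, and type) to the file,
then calls the quantization routine that matches the type and writes the resulting tensor to the file at an aligned offset.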
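The per-tensor record and the alignment arithmetic can be sketched as a small writer and reader pair. This is an illustration only: GGML_MEM_ALIGN is assumed to be 16 here, the type id 2 is a placeholder, and write_tensor_header and read_tensor_header are hypothetical helpers rather than functions from chatglm.cpp.

```python
import io
import struct

GGML_MEM_ALIGN = 16  # assumed value for this sketch


def align_up(pos: int, alignment: int = GGML_MEM_ALIGN) -> int:
    """Round pos up to the next multiple of alignment."""
    return (pos + alignment - 1) // alignment * alignment


def write_tensor_header(f, name: str, shape, type_id: int) -> None:
    """Write name length, name, ndim, shape, and type id, then align."""
    encoded = name.encode()
    f.write(struct.pack("i", len(encoded)))
    f.write(encoded)
    f.write(struct.pack("i" * (2 + len(shape)), len(shape), *shape, type_id))
    f.seek(align_up(f.tell()))


def read_tensor_header(f):
    """Read one header back; return (name, shape, type_id, data_offset)."""
    (name_len,) = struct.unpack("i", f.read(4))
    name = f.read(name_len).decode()
    (ndim,) = struct.unpack("i", f.read(4))
    shape = struct.unpack("i" * ndim, f.read(4 * ndim))
    (type_id,) = struct.unpack("i", f.read(4))
    f.seek(align_up(f.tell()))
    return name, shape, type_id, f.tell()


buf = io.BytesIO()
write_tensor_header(buf, "w", (2, 32), type_id=2)  # placeholder type id
buf.write(b"\x01" * 4)  # stand-in for the tensor data
buf.seek(0)
record = read_tensor_header(buf)
```

Seeking past the current end before writing leaves a gap that is filled with zero bytes, which is how the padding between the header and the data appears in the file.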
Take ggml_type == GGMLType.Q4_0 as an example:
def quantize_q4_0(tensor: torch.Tensor) -> torch.CharTensor:
    # equivalent to ggml_quantize_q4_0 in ggml.c
    assert tensor.shape[1] % GGML_QK4_0 == 0  # each row's length must be divisible by 32
    tensor = tensor.view(-1, GGML_QK4_0)  # split into groups of 32
    abs_max_indices = tensor.abs().max(dim=-1, keepdim=True).indices  # position of the largest-magnitude element in each group
    max_values = torch.take_along_dim(tensor, abs_max_indices, dim=-1)  # signed value of that element
    scale = max_values / -8  # build the scale
    tensor = (tensor / scale + 8).round().clamp(min=0, max=15).char()  # round to the nearest quantization level
    # compress two int4 weights into an int8
    tensor = tensor[:, :16] | (tensor[:, 16:] << 4)  # pack two int4 values into one int8
    # add scale into each block
    tensor = torch.cat((scale.half().view(torch.int8), tensor), dim=-1)  # concatenate to match the ggml block layout
    return tensor
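This mirrors quantize_row_q4_0_reference in ggml; see also my other post, ggml的量化处理 (quantization in ggml).
The code above can be read directly.
In short, the values are quantized in groups of 32. Within each group, the element with the largest absolute value determines the scale, and every value is quantized to an int4. The 32 int4 values are split into two halves of 16 and packed pairwise into 16 int8 bytes. The scale is then concatenated with these bytes, and the result is written to file f.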
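To make the block layout concrete, here is a pure-Python round trip for a single block of 32 values. It is a sketch of the same scheme, not the code from convert.py or ggml: the function names are hypothetical, the guard for an all-zero block is an addition of this sketch, and rounding follows Python's round, so results may differ from ggml in exact tie cases.

```python
import struct

QK4_0 = 32  # values per block


def quantize_block_q4_0(values):
    """Quantize 32 floats into 18 bytes: fp16 scale + 16 packed bytes."""
    assert len(values) == QK4_0
    max_value = max(values, key=abs)  # signed value with the largest magnitude
    scale = max_value / -8
    if scale == 0:
        # All-zero block; this guard is an assumption added in this sketch.
        quants = [8] * QK4_0
    else:
        quants = [min(15, max(0, round(v / scale + 8))) for v in values]
    # Low nibble holds element i, high nibble holds element i + 16.
    packed = bytes(quants[i] | (quants[i + 16] << 4) for i in range(16))
    return struct.pack("<e", scale) + packed


def dequantize_block_q4_0(block):
    """Invert quantize_block_q4_0, up to quantization error."""
    (scale,) = struct.unpack("<e", block[:2])
    out = [0.0] * QK4_0
    for i, byte in enumerate(block[2:]):
        out[i] = ((byte & 0x0F) - 8) * scale
        out[i + 16] = ((byte >> 4) - 8) * scale
    return out


values = [float(i - 16) for i in range(QK4_0)]
block = quantize_block_q4_0(values)
restored = dequantize_block_q4_0(block)
```

Each block costs 18 bytes for 32 weights, which works out to 4.5 bits per weight.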
At this point, the model's metadata and weights have been saved.
chatglm.cpp
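When chatglm.cpp builds the model, the inference code does not need to care whether a loaded tensor is quantized. It only has to tag each tensor with the correct type, and arithmetic between quantized tensors is left entirely to ggml.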
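Building the model first requires a ModelConfig. convert.py stores the original model's configuration, so it is read back from the converted file, which yields the weight type of this quantized checkpoint.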
The model base class BaseModelForCausalLM constructs a ModelContext; the ctx's dtype comes from the config's dtype and provides the type annotation for ggml.
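Then, when each network component of ChatGLM is built, for example Linear, the constructor receives this ctx and uses it to specify the ggml_type when creating the weight and bias.
Each weight tensor therefore carries its own ggml_type, and ggml handles tensor operations according to that type.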
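One way to see why the type tag is enough is that the storage size of a tensor follows from its shape and type alone. The Python sketch below illustrates the idea; it is not ggml's C code, the table name and function are hypothetical, and the per-block sizes are assumptions based on the block layouts described in this post (an fp16 scale followed by the packed values).

```python
from math import prod

# type name -> (elements per block, bytes per block); assumed values
GGML_TYPE_LAYOUT = {
    "F32": (1, 4),
    "F16": (1, 2),
    "Q4_0": (32, 18),  # fp16 scale + 16 bytes of packed int4 values
    "Q8_0": (32, 34),  # fp16 scale + 32 int8 values
}


def tensor_nbytes(shape, type_name: str) -> int:
    """Return the number of bytes needed to store a tensor of this type."""
    block_elems, block_bytes = GGML_TYPE_LAYOUT[type_name]
    if shape[-1] % block_elems != 0:
        raise ValueError("row length must be a multiple of the block size")
    return prod(shape) // block_elems * block_bytes


sizes = {name: tensor_nbytes((2, 64), name) for name in GGML_TYPE_LAYOUT}
```

A loader that knows only the shape and the type tag can therefore allocate the buffer and read the data, without any separate flag saying whether the tensor is quantized.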
Further reading: ggml的量化处理 (quantization in ggml)