GLM-4v-9B Source Code Analysis (Part 4)

GLM-4-9B Chat dialogue model fine-tuning

In this demo, you will learn how to fine-tune the GLM-4-9B-Chat open-source model (the visual understanding model is
not supported). Please strictly follow the steps in this document to avoid unnecessary errors.

Hardware check

The data in this document were measured in the hardware environment below. Actual requirements and the GPU memory
used during training vary slightly with the operating environment; please take your own environment as the reference.
The resource usage reported for fine-tuning corresponds to the settings in the configuration files in the
configs folder.

Test hardware information:

  • OS: Ubuntu 22.04
  • Memory: 512GB
  • Python: 3.10.12 / 3.12.3 (currently, NLTK must be installed from its git source if you use Python 3.12.3)
  • CUDA Version: 12.3
  • GPU Driver: 535.104.05
  • GPU: NVIDIA A100-SXM4-80GB * 8
| Fine-tuning Model | Fine-tuning solution | GPU memory usage | Weight save point size |
| --- | --- | --- | --- |
| GLM-4-9B-Chat | lora (PEFT) | 22G | 17M |
| GLM-4-9B-Chat | p-tuning v2 (PEFT) | 21G | 121M |
| GLM-4-9B-Chat | SFT (Zero3 method) | 80G (each GPU, 8 GPUs required) | 20G |
| GLM-4V-9B | lora (PEFT), includes EVA2CLIPModel | 75G | 37M |
| GLM-4V-9B | SFT | Not supported in this code | 28G |

GLM-4V-9B fine-tuning does not work properly with DeepSpeed; the official fine-tuning script only implements the most
basic fine-tuning scheme, and further optimizations are left for developers to explore on their own.

Before starting fine-tuning, please install the dependencies in basic_demo and clone the latest model repos (Hugging
Face) first. You also need to install the dependencies in this directory:

pip install -r requirements.txt

NOTE: Some code in NLTK 3.8.1 may not yet be compatible with Python 3.12. For the adaptation method in that case,
please refer to issue #38.
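
If you hit this with Python 3.12, installing NLTK from its git source is one general approach (shown here as an
assumption, not the exact fix described in the issue):

pip install git+https://github.com/nltk/nltk.git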

Multi-round dialogue format

The multi-round dialogue fine-tuning example follows the GLM-4 dialogue format convention: a different loss_mask is
applied to each role so that the loss over all assistant replies in a conversation is computed in a single pass.
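
To make the idea concrete, here is a minimal sketch of per-role loss masking (illustrative only; the real logic lives
in finetune.py and uses the GLM-4 chat template and tokenizer):

# Minimal sketch: build labels so that only assistant tokens contribute to the loss.
from typing import List, Tuple

IGNORE_INDEX = -100  # tokens with this label are excluded from the loss


def build_labels(token_spans: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """token_spans: list of (role, token_ids) for one conversation, in order."""
    input_ids, labels = [], []
    for role, ids in token_spans:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # learn to produce assistant replies
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask system/user/observation tokens
    return input_ids, labels


# Example: two turns; only the assistant tokens carry a training signal.
ids, lbls = build_labels([("user", [11, 12]), ("assistant", [21, 22, 23]),
                          ("user", [13]), ("assistant", [24, 25])])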

For data files, the sample uses the following format:

[
  {
    "messages": [
      {
        "role": "system",
        "content": "<system prompt text>",
        "tools": [
          {
            "name": "<tool name>",
            "args": {
              "<arg name>": "<arg value>"
            }
          }
          // Add more tools if needed
        ]
      },
      {
        "role": "user",
        "content": "<user prompt text>"
      },
      {
        "role": "assistant",
        "content": "<assistant response text>"
      },
      // If using tools
      {
        "role": "user",
        "content": "<user prompt text>"
      },
      {
        "role": "assistant",
        "content": "<assistant response text>"
      },
      {
        "role": "observation",
        "content": "<observation prompt text>"
      },
      {
        "role": "assistant",
        "content": "<assistant response observation>"
      },
      // Multiple turns
      {
        "role": "user",
        "content": "<user prompt text>"
      },
      {
        "role": "assistant",
        "content": "<assistant response text>"
      }
    ]
  }
]

This is a sample without tools:

{
  "messages": [
    {
      "role": "user",
      "content": "类型#裤*材质#牛仔布*风格#性感"
    },
    {
      "role": "assistant",
      "content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质,其柔然的手感和细腻的质地,在穿着舒适的同时,透露着清纯甜美的个性气质。除此之外,流畅的裤身剪裁将性感的腿部曲线彰显的淋漓尽致,不失为一款随性出街的必备单品。"
    }
  ]
}

This is a sample with tools:

{
  "messages": [
    {
      "role": "system",
      "content": "",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_recommended_books",
            "description": "Get recommended books based on user's interests",
            "parameters": {
              "type": "object",
              "properties": {
                "interests": {
                  "type": "array",
                  "items": {
                    "type": "string"
                  },
                  "description": "The interests to recommend books for"
                }
              },
              "required": [
                "interests"
              ]
            }
          }
        }
      ]
    },
    {
      "role": "user",
      "content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."
    },
    {
      "role": "assistant",
      "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"
    },
    {
      "role": "observation",
      "content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"
    },
    {
      "role": "assistant",
      "content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."
    }
  ]
}
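
The key detail is that the assistant's tool call and the tool's observation are both serialized JSON stored in the
content string. A small sketch of assembling such a sample programmatically (the schema and payload below are
abbreviated placeholders, not a real training sample):

import json

# Sketch: build one tool-calling sample in the format shown above.
tool_call = {"name": "get_recommended_books",
             "arguments": {"interests": ["history", "science fiction"]}}
sample = {
    "messages": [
        {"role": "system", "content": "", "tools": []},  # put the full tool schema from above here
        {"role": "user", "content": "Hi, I am looking for some book recommendations."},
        {"role": "assistant", "content": json.dumps(tool_call)},                              # the call, as a string
        {"role": "observation", "content": json.dumps({"books": ["Dune by Frank Herbert"]})},  # tool output
        {"role": "assistant", "content": "Based on your interests, I would recommend Dune by Frank Herbert."},
    ]
}
print(json.dumps(sample, ensure_ascii=False, indent=2))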

This is a sample with VQA Task:

{
  "messages": [
    {
      "role": "user",
      "content": "图片中的动物是什么?",
      "image": "/root/images/0001.jpg"
    },
    {
      "role": "assistant",
      "content": "图片中有一只猫。"
    },
    {
      "role": "user",
      "content": "图片中的猫在做什么?"
    },
    {
      "role": "assistant",
      "content": "这只猫坐在或站在桌子上,桌上有很多食物。"
    }
  ]
}
  • The system role is optional. If present, it must appear before the user role, and it may appear only once
    in a complete conversation (whether single-round or multi-round).
  • The tools field is optional. If present, it must appear after the system role, and it may appear only once
    in a complete conversation (whether single-round or multi-round). When the tools field is present, the
    system role must exist and its content field must be empty.
  • GLM-4V-9B supports neither the tools field nor the system field, and the image must be placed in the first
    message. The image field must contain the absolute path of the image. (A validation sketch for these rules
    follows this list.)
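
The rules above can be checked mechanically. Below is a small validation sketch (an assumed helper, not part of the
repository):

# Sketch of a sanity check for the data-format rules above (assumed helper).
def validate_sample(sample: dict, is_glm4v: bool = False) -> None:
    msgs = sample["messages"]
    sys_idx = [i for i, m in enumerate(msgs) if m["role"] == "system"]
    assert len(sys_idx) <= 1, "system may appear at most once"
    if sys_idx:
        assert sys_idx[0] == 0, "system must come before the user role"
    tools_msgs = [m for m in msgs if "tools" in m]
    assert len(tools_msgs) <= 1, "tools may appear at most once"
    if tools_msgs:
        assert tools_msgs[0]["role"] == "system" and tools_msgs[0]["content"] == "", \
            "tools requires a system message with an empty content field"
    if is_glm4v:
        assert not sys_idx and not tools_msgs, "GLM-4V-9B supports neither system nor tools"
        assert "image" in msgs[0], "the image (absolute path) must be in the first message"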

Configuration file

The fine-tuning configuration files are located in the configs directory and include the following:

  1. ds_zereo_2 / ds_zereo_3.json: DeepSpeed configuration files.

  2. lora.yaml / ptuning_v2.yaml / sft.yaml: configuration files for the different fine-tuning modes, including model
     parameters, optimizer parameters, training parameters, etc. Some important parameters are explained below.

  • data_config section

  • train_file: File path of training dataset.
  • val_file: File path of validation dataset.
  • test_file: File path of test dataset.
  • num_proc: Number of processes to use when loading data.
  • max_input_length: Maximum length of input sequence.
  • max_output_length: Maximum length of output sequence.
  • training_args section
  • output_dir: Directory for saving model and other outputs.
  • max_steps: Maximum number of training steps.
  • per_device_train_batch_size: Training batch size per device (such as GPU).
  • dataloader_num_workers: Number of worker threads to use when loading data.
  • remove_unused_columns: Whether to remove unused columns in data.
  • save_strategy: Model saving strategy (for example, how many steps to save).
  • save_steps: How many steps to save the model.
  • log_level: Log level (such as info).
  • logging_strategy: logging strategy.
  • logging_steps: how many steps to log at.
  • per_device_eval_batch_size: per-device evaluation batch size.
  • evaluation_strategy: evaluation strategy (e.g. how many steps to evaluate at).
  • eval_steps: how many steps to evaluate at.
  • predict_with_generate: whether to use generation mode for prediction.
  • generation_config section
  • max_new_tokens: maximum number of new tokens to generate.
  • peft_config section (a loading sketch follows this list)
  • peft_type: type of parameter-efficient tuning to use (supports LORA and PREFIX_TUNING).
  • task_type: task type; here it is the causal language model (do not change).
  • LoRA parameters:
  • r: rank of LoRA.
  • lora_alpha: scaling factor of LoRA.
  • lora_dropout: dropout probability used in the LoRA layers.
  • P-Tuning v2 parameters:
  • num_virtual_tokens: 2 is not used here; this is the number of virtual tokens.
  • num_attention_heads: 2: the number of attention heads of P-Tuning v2 (do not change).
  • token_dim: 256: the token dimension of P-Tuning v2 (do not change).
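
As a reference for how the peft_config section is consumed, the sketch below reads configs/lora.yaml with PyYAML and
builds a PEFT LoraConfig from it. The exact YAML layout is an assumption based on the field list above, not a copy of
the repository code:

# Sketch: map the peft_config section of configs/lora.yaml onto a PEFT LoraConfig.
import yaml
from peft import LoraConfig, TaskType

with open("configs/lora.yaml") as f:
    cfg = yaml.safe_load(f)

peft_cfg = cfg["peft_config"]
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,          # causal language model (do not change)
    r=peft_cfg["r"],                        # rank of LoRA
    lora_alpha=peft_cfg["lora_alpha"],      # scaling factor
    lora_dropout=peft_cfg["lora_dropout"],  # dropout in the LoRA layers
)
print(lora_config)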

Start fine-tuning

Run single-machine multi-GPU (or multi-machine multi-GPU) fine-tuning with the following commands, which use DeepSpeed
as the acceleration solution; deepspeed must be installed first.

OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8  finetune.py  data/AdvertiseGen/  THUDM/glm-4-9b-chat  configs/lora.yaml # For Chat Fine-tune
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8  finetune_vision.py  data/CogVLM-311K/  THUDM/glm-4v-9b  configs/lora.yaml  # For VQA Fine-tune

Run single-machine single-GPU fine-tuning with the following commands.

python finetune.py  data/AdvertiseGen/  THUDM/glm-4-9b-chat  configs/lora.yaml # For Chat Fine-tune
python finetune_vision.py  data/CogVLM-311K/  THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune

Fine-tune from a saved point

If you train as described above, each fine-tuning run starts from scratch. To continue fine-tuning from a partially
trained model, add a fourth parameter, which can be passed in two ways:

  1. yes: automatically resume training from the last saved checkpoint.

  2. A checkpoint number, for example 600: resume training from checkpoint 600.

For example, this command continues fine-tuning from the last saved checkpoint:

python finetune.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yaml yes
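
Likewise, to resume from a specific checkpoint (600 in this example), pass its number as the fourth parameter:

python finetune.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yaml 600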

Use the fine-tuned model

Verify the fine-tuned model in inference.py

You can use the fine-tuned model in finetune_demo/inference.py and test it easily with a single command.

python inference.py your_finetune_path

The answers you get this way come from the fine-tuned model.

Use the fine-tuned model in other demos in this repository or external repositories

You can use the LoRA and fully fine-tuned models in any demo; this requires modifying the code yourself, as described
in the following tutorial.

  1. Replace the model-loading code in the demo with the model-loading code from finetune_demo/inference.py.

Please note that for LoRA and P-Tuning v2 the trained weights are not merged into the base model; instead, the
fine-tuning path is recorded in adapter_config.json.
If the location of your original model changes, you must update base_model_name_or_path
in adapter_config.json.
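
For example, pointing an existing adapter at a relocated base model could look like this (paths are placeholders):

# Sketch: update base_model_name_or_path in adapter_config.json after moving the base model.
import json
from pathlib import Path

adapter_dir = Path("/path/to/finetune_adapter_model")
cfg_path = adapter_dir / "adapter_config.json"
cfg = json.loads(cfg_path.read_text())
cfg["base_model_name_or_path"] = "/new/path/to/glm-4-9b-chat"  # placeholder path
cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))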

def load_model_and_tokenizer(
        model_dir: Union[str, Path], trust_remote_code: bool = True
) -> tuple[ModelType, TokenizerType]:
    # ModelType / TokenizerType and _resolve_path are defined earlier in finetune_demo/inference.py
    model_dir = _resolve_path(model_dir)
    if (model_dir / 'adapter_config.json').exists():
        # LoRA / P-Tuning v2 checkpoint: load the adapter and locate its base model
        model = AutoPeftModelForCausalLM.from_pretrained(
            model_dir, trust_remote_code=trust_remote_code, device_map='auto'
        )
        tokenizer_dir = model.peft_config['default'].base_model_name_or_path
    else:
        # Fully fine-tuned (SFT) checkpoint: load it directly
        model = AutoModelForCausalLM.from_pretrained(
            model_dir, trust_remote_code=trust_remote_code, device_map='auto'
        )
        tokenizer_dir = model_dir
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_dir, trust_remote_code=trust_remote_code
    )
    return model, tokenizer
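
Once loaded this way, the fine-tuned model behaves like any other chat checkpoint. A minimal usage sketch follows; the
prompt and generation settings are illustrative only:

# Sketch: query a fine-tuned checkpoint loaded with load_model_and_tokenizer (placeholder path).
model, tokenizer = load_model_and_tokenizer("/path/to/finetune_adapter_model")
inputs = tokenizer.apply_chat_template([{"role": "user", "content": "你好"}],
                                       add_generation_prompt=True, tokenize=True,
                                       return_tensors="pt", return_dict=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))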
  2. Read the fine-tuned model. Note that you should pass the location of the fine-tuned model. For example, if your
    adapter is located at /path/to/finetune_adapter_model
    and the original model is at path/to/base_model, you should use /path/to/finetune_adapter_model
    as model_dir.
  3. After completing the above operations, you can use the fine-tuned model normally; other calling methods remain
    unchanged.
  4. This fine-tuning script has not been tested on long texts of 128K or 1M tokens. Fine-tuning on long texts requires
    GPUs with more memory and more efficient fine-tuning solutions, which developers need to handle on their own.

Reference


@inproceedings{liu2022p,
title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks},
author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie},
booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers)},
pages={61--68},
year={2022}
}

@misc{tang2023toolalpaca,
title={ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases},
author={Qiaoyu Tang and Ziliang Deng and Hongyu Lin and Xianpei Han and Qiao Liang and Le Sun},
year={2023},
eprint={2306.05301},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

.\chatglm4-finetune\intel_device_demo\itrex\itrex_cli_demo.py

"""
该脚本创建一个命令行接口(CLI)演示,使用 transformers 后端,适用于 glm-4-9b 模型,结合 Intel® Extension for Transformers
"""

# 导入操作系统相关模块
import os
# 获取环境变量 'MODEL_PATH' 的值,如果不存在则使用默认值 'THUDM/glm-4-9b-chat'
MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/glm-4-9b-chat')

# 导入 PyTorch 库
import torch
# 从 threading 模块导入 Thread 类
from threading import Thread
# 从 intel_extension_for_transformers 导入 AutoModelForCausalLM 类
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
# 从 transformers 模块导入必要的类
from transformers import TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria, AutoTokenizer


# 定义停止条件类,继承自 StoppingCriteria
class StopOnTokens(StoppingCriteria):
    # 重写 __call__ 方法,检查是否需要停止生成
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # 定义停止的 token ID 列表
        stop_ids = [151329, 151336, 151338]
        # 遍历停止 ID 列表
        for stop_id in stop_ids:
            # 如果当前输入的最后一个 token ID 是停止 ID,则返回 True
            if input_ids[0][-1] == stop_id:
                return True
        # 如果没有匹配的停止 ID,则返回 False
        return False


# 初始化模型和分词器的函数
def initialize_model_and_tokenizer():
    # 从预训练模型路径加载分词器,信任远程代码
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    # 从预训练模型路径加载 causal language model,指定设备为 CPU,信任远程代码,并以 4bit 模式加载
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="cpu",  # 使用 Intel CPU 进行推理
        trust_remote_code=True,
        load_in_4bit=True
    )
    # 返回加载的分词器和模型
    return tokenizer, model


# 获取用户输入的函数
def get_user_input():
    # 提示用户输入并返回输入内容
    return input("\nUser: ")


# 主函数
def main():
    # 初始化模型和分词器
    tokenizer, model = initialize_model_and_tokenizer()

    # 初始化历史记录列表
    history = []
    # 设置最大生成长度
    max_length = 100
    # 设置 top-p 取样参数
    top_p = 0.9
    # 设置温度参数
    temperature = 0.8
    # 实例化停止条件对象
    stop = StopOnTokens()

    # 打印欢迎信息
    print("Welcome to the CLI chat. Type your messages below.")
    # 无限循环,直到用户选择退出
    while True:
        # 获取用户输入
        user_input = get_user_input()
        # 检查用户输入是否为退出指令
        if user_input.lower() in ["exit", "quit"]:
            break
        # 将用户输入添加到历史记录中,模型响应初始化为空
        history.append([user_input, ""])

        # 初始化消息列表,用于存储用户和模型的对话内容
        messages = []
        # 遍历历史记录,获取用户和模型的消息
        for idx, (user_msg, model_msg) in enumerate(history):
            # 如果是最新的用户消息且没有模型消息,添加用户消息到消息列表
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            # 如果用户消息存在,添加到消息列表
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            # 如果模型消息存在,添加到消息列表
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})

        # 应用聊天模板处理消息,并返回模型输入的张量
        model_inputs = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,  # 添加生成提示
            tokenize=True,                # 对内容进行分词
            return_tensors="pt"          # 返回 PyTorch 张量
        )

        # 创建一个文本迭代流处理器,用于流式生成输出
        streamer = TextIteratorStreamer(
            tokenizer=tokenizer,          # 使用的分词器
            timeout=60,                   # 超时设置为60秒
            skip_prompt=True,             # 跳过提示
            skip_special_tokens=True      # 跳过特殊标记
        )

        # 设置生成模型的参数
        generate_kwargs = {
            "input_ids": model_inputs,    # 输入的模型张量
            "streamer": streamer,          # 使用的流处理器
            "max_new_tokens": max_length,  # 生成的最大新标记数量
            "do_sample": True,             # 启用采样
            "top_p": top_p,                # 样本筛选阈值
            "temperature": temperature,     # 温度参数控制生成随机性
            "stopping_criteria": StoppingCriteriaList([stop]),  # 停止生成的条件
            "repetition_penalty": 1.2,     # 重复惩罚系数
            "eos_token_id": model.config.eos_token_id,  # 结束标记的 ID
        }

        # 创建一个线程来生成模型的输出
        t = Thread(target=model.generate, kwargs=generate_kwargs)
        # 启动线程
        t.start()
        # 打印助手的提示,保持在同一行
        print("Assistant:", end="", flush=True)
        # 从流中获取新生成的标记并打印
        for new_token in streamer:
            if new_token:
                print(new_token, end="", flush=True)  # 打印新标记
                history[-1][1] += new_token  # 将新标记添加到最新的历史模型消息

        # 去掉最新模型消息的前后空白
        history[-1][1] = history[-1][1].strip()
# 当脚本作为主程序运行时
if __name__ == "__main__":
    # 调用 main 函数
    main()

Inference of the GLM-4-9B-Chat Model with Intel® Extension for Transformers

This example introduces how to run inference on the GLM-4-9B-Chat model using Intel® Extension for Transformers.

Device and Dependency Check

Relevant Inference Test Data

The data in this document is tested on the following hardware environment. The actual running environment requirements
and memory usage may vary slightly; please refer to your actual running environment.

Test hardware information:

  • OS: Ubuntu 22.04 (this tutorial must be executed in a Linux environment)
  • Memory: 512GB
  • Python: 3.10.12
  • CPU: Intel(R) Xeon(R) Platinum 8358 CPU / 12th Gen Intel i5-12400

Installing Dependencies

Before starting the inference, please install the dependencies in basic_demo; you also need to install the dependencies
in this directory:

pip install -r requirements.txt

Running Model Inference

python itrex_cli_demo.py

If this is your first inference run, the model weights are converted first. The converted weights are stored in the
runtime_outputs folder, which consumes about 60G of disk space.
After the conversion is completed, there are two files in the folder:

  • ne_chatglm2_f32.bin 52G (if you do not use FP32 for inference, you can delete this file)
  • ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin 8.1G

If this is not your first inference run, this step is skipped and the conversation starts directly. An example session
looks as follows:

Welcome to the CLI chat. Type your messages below.

User: 你好
AVX:1 AVX2:1 AVX512F:1 AVX512BW:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.900, continuous_batching: 0, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model_file_loader: loading model from runtime_outs/ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 151552                        
load_ne_hparams  1.hparams.n_embd = 4096                          
load_ne_hparams  2.hparams.n_mult = 0                             
load_ne_hparams  3.hparams.n_head = 32                            
load_ne_hparams  4.hparams.n_head_kv = 0                             
load_ne_hparams  5.hparams.n_layer = 40                            
load_ne_hparams  6.hparams.n_rot = 0                             
load_ne_hparams  7.hparams.ftype = 0                             
load_ne_hparams  8.hparams.max_seq_len = 131072                        
load_ne_hparams  9.hparams.alibi_bias_max = 0.000                         
load_ne_hparams  10.hparams.clip_qkv = 0.000                         
load_ne_hparams  11.hparams.par_res = 0                             
load_ne_hparams  12.hparams.word_embed_proj_dim = 0                             
load_ne_hparams  13.hparams.do_layer_norm_before = 0                             
load_ne_hparams  14.hparams.multi_query_group_num = 2                             
load_ne_hparams  15.hparams.ffn_hidden_size = 13696                         
load_ne_hparams  16.hparams.inner_hidden_size = 0                             
load_ne_hparams  17.hparams.n_experts = 0                             
load_ne_hparams  18.hparams.n_experts_used = 0                             
load_ne_hparams  19.hparams.n_embd_head_k = 0                             
load_ne_hparams  20.hparams.norm_eps = 0.000000                      
load_ne_hparams  21.hparams.freq_base = 5000000.000                   
load_ne_hparams  22.hparams.freq_scale = 1.000                         
load_ne_hparams  23.hparams.rope_scaling_factor = 0.000                         
load_ne_hparams  24.hparams.original_max_position_embeddings = 0                             
load_ne_hparams  25.hparams.use_yarn = 0                             
load_ne_vocab    26.vocab.bos_token_id = 1                             
load_ne_vocab    27.vocab.eos_token_id = 151329                        
load_ne_vocab    28.vocab.pad_token_id = 151329                        
load_ne_vocab    29.vocab.sep_token_id = -1                            
init: hparams.n_vocab         = 151552
init: hparams.n_embd          = 4096
init: hparams.n_mult          = 0
init: hparams.n_head          = 32
init: hparams.n_layer         = 40
init: hparams.n_rot           = 0
init: hparams.ffn_hidden_size = 13696
init: n_parts    = 1
load: ctx size   = 16528.38 MB
load: layers[0].ffn_fusion    = 1
load: scratch0   = 4096.00 MB
load: scratch1   = 2048.00 MB
load: scratch2   = 4096.00 MB
load: mem required  = 26768.38 MB (+ memory per state)
.............................................................................................
model_init_from_file: support_bestla_kv = 1
kv_cache_init: run_mha_reordered = 1
model_init_from_file: kv self size =  690.00 MB
Assistant:
你好👋!我是人工智能助手,很高兴为你服务。有什么可以帮助你的吗?

Using Intel® Extension for Transformers to Run Inference on the GLM-4-9B-Chat Model

This example introduces how to run inference on the GLM-4-9B-Chat model using Intel® Extension for Transformers.

Device and Dependency Check

Relevant Inference Test Data

The data in this document is tested on the following hardware environment. The actual running environment requirements and memory usage may vary slightly. Please refer to the actual running environment.

Test hardware information:

  • OS: Ubuntu 22.04 (This tutorial must be executed in a Linux environment)
  • Memory: 512GB
  • Python: 3.10.12
  • CPU: Intel(R) Xeon(R) Platinum 8358 CPU / 12th Gen Intel i5-12400

Installing Dependencies

Before starting the inference, please install the dependencies in basic_demo, and you need to install the dependencies in this directory:

pip install -r requirements.txt

Running Model Inference

python itrex_cli_demo.py

If this is your first inference, there will be a process of converting model weights. The converted model weights are stored in the runtime_outputs folder, which will consume about 60G of disk space.
After the conversion is completed, there are two files in the folder:

  • ne_chatglm2_f32.bin 52G (If you do not use FP32 for inference, you can delete this file)
  • ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin 8.1G

If this is not your first inference, this step will be skipped, and you will directly start the conversation. The inference result is as follows:

Welcome to the CLI chat. Type your messages below.

User: Hello
AVX:1 AVX2:1 AVX512F:1 AVX512BW:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.900, continuous_batching: 0, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model_file_loader: loading model from runtime_outs/ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin
Loading the bin file with NE format...
load_ne_hparams  0.hparams.n_vocab = 151552                        
load_ne_hparams  1.hparams.n_embd = 4096                          
load_ne_hparams  2.hparams.n_mult = 0                             
load_ne_hparams  3.hparams.n_head = 32                            
load_ne_hparams  4.hparams.n_head_kv = 0                             
load_ne_hparams  5.hparams.n_layer = 40                            
load_ne_hparams  6.hparams.n_rot = 0                             
load_ne_hparams  7.hparams.ftype = 0                             
load_ne_hparams  8.hparams.max_seq_len = 131072                        
load_ne_hparams  9.hparams.alibi_bias_max = 0.000                         
load_ne_hparams  10.hparams.clip_qkv = 0.000                         
load_ne_hparams  11.hparams.multi_query_group_num = 2                             
load_ne_hparams  12.hparams.ffn_hidden_size = 13696                         
load_ne_hparams  13.hparams.inner_hidden_size = 0                             
load_ne_hparams  14.hparams.n_experts = 0                             
load_ne_hparams  15.hparams.n_experts_used = 0                             
load_ne_hparams  16.hparams.n_embd_head_k = 0                             
load_ne_hparams  17.hparams.norm_eps = 0.000000                      
load_ne_hparams  18.hparams.freq_base = 5000000.000                   
load_ne_hparams  19.hparams.freq_scale = 1.000                         
load_ne_hparams  20.hparams.rope_scaling_factor = 0.000                         
load_ne_hparams  21.hparams.original_max_position_embeddings = 0                             
load_ne_hparams  22.hparams.use_yarn = 0                             
load_ne_vocab    23.vocab.bos_token_id = 1                             
load_ne_vocab    24.vocab.eos_token_id = 151329                        
load_ne_vocab    25.vocab.pad_token_id = 151329                        
load_ne_vocab    26.vocab.sep_token_id = -1                            
init: hparams.n_vocab         = 151552
init: hparams.n_embd          = 4096
init: hparams.n_mult          = 0
init: hparams.n_head          = 32
init: hparams.n_layer         = 40
init: hparams.n_rot           = 0
init: hparams.ffn_hidden_size = 13696
init: n_parts    = 1
load: ctx size   = 16528.38 MB
load: layers[0].ffn_fusion    = 1
load: scratch0   = 4096.00 MB
load: scratch1   = 2048.00 MB
load: scratch2   = 4096.00 MB
load: mem required  = 26768.38 MB (+ memory per state)
.............................................................................................
model_init_from_file: support_bestla_kv = 1
kv_cache_init: run_mha_reordered = 1
model_init_from_file: kv self size =  690.00 MB
Assistant:
Hello👋! I am an AI assistant. How can I help you today?

.\chatglm4-finetune\intel_device_demo\openvino\convert.py

"""
该脚本用于将原始模型转换为 OpenVINO IR 格式。
可以查看原始代码 https://github.com/OpenVINO-dev-contest/chatglm3.openvino/blob/main/convert.py
"""
# 从 transformers 库导入自动分词器和配置
from transformers import AutoTokenizer, AutoConfig
# 从 optimum.intel 导入量化配置
from optimum.intel import OVWeightQuantizationConfig
# 从 optimum.intel.openvino 导入 OpenVINO 语言模型类
from optimum.intel.openvino import OVModelForCausalLM

# 导入操作系统模块
import os
# 从 pathlib 导入 Path 类
from pathlib import Path
# 导入参数解析模块
import argparse


# 主程序入口
if __name__ == '__main__':
    # 创建参数解析器,禁用帮助信息自动添加
    parser = argparse.ArgumentParser(add_help=False)
    # 添加帮助选项
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='显示帮助信息并退出。')
    # 添加模型 ID 参数,默认值为指定的模型路径
    parser.add_argument('-m',
                        '--model_id',
                        default='THUDM/glm-4-9b-chat',
                        required=False,
                        type=str,
                        help='原始模型路径')
    # 添加精度参数,默认值为 "int4"
    parser.add_argument('-p',
                        '--precision',
                        required=False,
                        default="int4",
                        type=str,
                        choices=["fp16", "int8", "int4"],
                        help='fp16、int8 或 int4')
    # 添加输出路径参数,默认值为 './glm-4-9b-ov'
    parser.add_argument('-o',
                        '--output',
                        default='./glm-4-9b-ov',
                        required=False,
                        type=str,
                        help='必需。保存 IR 模型的路径')
    # 解析命令行参数
    args = parser.parse_args()

    # 将输出路径转换为 Path 对象
    ir_model_path = Path(args.output)
    # 如果输出路径不存在,则创建该目录
    if ir_model_path.exists() == False:
        os.mkdir(ir_model_path)

    # 创建模型参数字典,包括信任远程代码和模型配置
    model_kwargs = {
        "trust_remote_code": True,
        "config": AutoConfig.from_pretrained(args.model_id, trust_remote_code=True),
    }
    # 创建压缩配置字典
    compression_configs = {
        "sym": False,
        "group_size": 128,
        "ratio": 0.8,
    }

    # 打印导出 IR 的消息
    print("====Exporting IR=====")
    # 根据指定精度加载不同的模型
    if args.precision == "int4":
        # 加载 4 位量化模型
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, quantization_config=OVWeightQuantizationConfig(
                                                          bits=4, **compression_configs), **model_kwargs)
    elif args.precision == "int8":
        # 加载 8 位量化模型
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=True, **model_kwargs)
    else:
        # 加载原始模型(未量化)
        ov_model = OVModelForCausalLM.from_pretrained(args.model_id, export=True,
                                                      compile=False, load_in_8bit=False, **model_kwargs)

    # 将模型保存到指定的路径
    ov_model.save_pretrained(ir_model_path)

    # 打印导出分词器的消息
    print("====Exporting tokenizer=====")
    # 加载分词器
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_id, trust_remote_code=True)
    # 将分词器保存到指定的路径
    tokenizer.save_pretrained(ir_model_path)

.\chatglm4-finetune\intel_device_demo\openvino\openvino_cli_demo.py

# Import argparse for handling command-line arguments
import argparse
# Import List and Tuple type hints from typing
from typing import List, Tuple
# Import Thread from threading for multi-threading
from threading import Thread
# Import PyTorch
import torch
# Import the OpenVINO model class from optimum.intel.openvino
from optimum.intel.openvino import OVModelForCausalLM
# Import the required classes and functions from transformers
from transformers import (AutoTokenizer, AutoConfig,
                          TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria)

# Define the stopping-criteria class
class StopOnTokens(StoppingCriteria):
    # Initialize the class with the token IDs that should stop generation
    def __init__(self, token_ids):
        self.token_ids = token_ids

    # Override __call__ to implement the stopping logic
    def __call__(
            self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
    ) -> bool:
        # Iterate over every stop ID
        for stop_id in self.token_ids:
            # Check whether the last generated token is a stop ID
            if input_ids[0][-1] == stop_id:
                return True  # If so, stop generation
        return False  # Otherwise keep generating


# Program entry point
if __name__ == "__main__":
    # Create the command-line argument parser
    parser = argparse.ArgumentParser(add_help=False)
    # Add the help option
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    # Add the model path argument
    parser.add_argument('-m',
                        '--model_path',
                        required=True,
                        type=str,
                        help='Required. model path')
    # Add the maximum sequence length argument
    parser.add_argument('-l',
                        '--max_sequence_length',
                        default=256,
                        required=False,
                        type=int,
                        help='Required. maximum length of output')
    # Add the device argument
    parser.add_argument('-d',
                        '--device',
                        default='CPU',
                        required=False,
                        type=str,
                        help='Required. device for inference')
    # Parse the command-line arguments
    args = parser.parse_args()
    # Get the model path
    model_dir = args.model_path

    # OpenVINO runtime configuration
    ov_config = {"PERFORMANCE_HINT": "LATENCY",
                 "NUM_STREAMS": "1", "CACHE_DIR": ""}

    # Load the tokenizer from the pretrained model
    tokenizer = AutoTokenizer.from_pretrained(
        model_dir, trust_remote_code=True)

    # Announce model compilation
    print("====Compiling model====")
    # Load the OpenVINO model from the pretrained path
    ov_model = OVModelForCausalLM.from_pretrained(
        model_dir,
        device=args.device,
        ov_config=ov_config,
        config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
        trust_remote_code=True,
    )

    # Create the text iterator streamer
    streamer = TextIteratorStreamer(
        tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
    )
    # Initialize the list of stop-token criteria
    stop_tokens = [StopOnTokens([151329, 151336, 151338])]

    # Convert the conversation history into model inputs
    def convert_history_to_token(history: List[Tuple[str, str]]):
        # Message list holding user and assistant messages
        messages = []
        # Walk through every (user, assistant) pair in the history
        for idx, (user_msg, model_msg) in enumerate(history):
            # For the last entry with no assistant reply yet, add the user message and stop
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            # Add the user message if present
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            # Add the assistant message if present
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})

        # Apply the chat template, add the generation prompt, and return tensors
        model_inputs = tokenizer.apply_chat_template(messages,
                                                     add_generation_prompt=True,
                                                     tokenize=True,
                                                     return_tensors="pt")
        # Return the model inputs
        return model_inputs

    # Initialize the history as an empty list
    history = []
    # Announce the start of the conversation
    print("====Starting conversation====")
    # Loop forever to keep the conversation going
    while True:
        # Read user input
        input_text = input("用户: ")
        # Exit the loop if the user types 'stop'
        if input_text.lower() == 'stop':
            break

        # Clear the conversation history if the user types 'clear'
        if input_text.lower() == 'clear':
            history = []
            print("AI助手: 对话历史已清空")
            continue

        # Print the assistant prompt before generating the reply
        print("GLM-4-9B-OpenVINO:", end=" ")
        # Append the current user input to the history
        history = history + [[input_text, ""]]
        # Convert the conversation history into model inputs
        model_inputs = convert_history_to_token(history)
        # Build the keyword arguments for generation
        generate_kwargs = dict(
            input_ids=model_inputs,  # Input IDs
            max_new_tokens=args.max_sequence_length,  # Maximum number of tokens to generate
            temperature=0.1,  # Generation temperature
            do_sample=True,  # Enable sampling
            top_p=1.0,  # Nucleus sampling parameter
            top_k=50,  # Top-k sampling parameter
            repetition_penalty=1.1,  # Repetition penalty
            streamer=streamer,  # Streaming output handler
            stopping_criteria=StoppingCriteriaList(stop_tokens)  # Criteria that stop generation
        )

        # Generate the model's reply in a separate thread
        t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
        t1.start()  # Start the thread

        # Accumulate the partial reply text
        partial_text = ""
        # Iterate over the streamed output
        for new_text in streamer:
            print(new_text, end="", flush=True)  # Print the newly generated text
            partial_text += new_text  # Append it to the partial reply
        print("\n")  # Print a newline
        # Store the assistant reply in the last history entry
        history[-1][1] = partial_text

Deploying the GLM-4-9B-Chat Model with OpenVINO

OpenVINO
is an open-source toolkit designed by Intel for deep learning inference. It helps developers optimize models, improve
inference performance, and reduce model memory usage.
This example shows how to deploy the GLM-4-9B-Chat model using OpenVINO.

1. Environment configuration

First, you need to install the dependencies

pip install -r requirements.txt

2. Convert the model

Since the Huggingface model needs to be converted into an OpenVINO IR model, you need to download the model and convert it.

python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov

Optional parameters

  • --model_id - Path to the directory where the model is located (absolute path).
  • --output - Path where the converted model is saved.
  • --precision - Precision of the conversion.

The conversion process is as follows:

====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00,  2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 31% (76 / 163)              │ 20% (73 / 160)                         │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│              4 │ 69% (87 / 163)              │ 80% (87 / 160)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

3. Run the GLM-4-9B-Chat model

python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU

Optional parameters

  • --model_path - Path to the directory where the OpenVINO IR model is located.
  • --max_sequence_length - Maximum number of output tokens.
  • --device - The device to run inference on.

Reference code

This code is modified based on the official OpenVINO example.

Deploy the GLM-4-9B-Chat model using OpenVINO

OpenVINO
is an open source toolkit designed by Intel for deep learning inference. It can help developers optimize models, improve inference performance, and reduce model memory usage.
This example will show how to deploy the GLM-4-9B-Chat model using OpenVINO.

1. Environment configuration

First, you need to install the dependencies

pip install -r requirements.txt

2. Convert the model

Since the Huggingface model needs to be converted to an OpenVINO IR model, you need to download the model and convert it.

python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov

The conversion process is as follows:

====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00,  2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│              8 │ 31% (76 / 163)              │ 20% (73 / 160)                         │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│              4 │ 69% (87 / 163)              │ 80% (87 / 160)                         │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Optional parameters

  • --model_id - Path to the directory where the model is located (absolute path).

  • --output - Path to where the converted model is saved.

  • --precision - Precision of the conversion.

3. Run the GLM-4-9B-Chat model

python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU

Optional parameters

  • --model_path - Path to the directory where the OpenVINO IR model is located.

  • --max_sequence_length - Maximum size of the output token.

  • --device - the device to run inference on.

Reference code

This code is modified based on the OpenVINO official example.

GLM-4 Fine-tuning Code: Source Analysis

📄 Report • 🤗 HF Repo • 🤖 ModelScope • 🟣 WiseModel • 🐦 Twitter • 👋 Join our Discord and WeChat

📍Experience and use larger-scale GLM commercial models on the Zhipu AI Open Platform.

Project Updates

  • 🔥 News: 2024/10/12: Added GLM-4v-9B model support for the vllm framework.
  • 🔥 News: 2024/09/06: Added support for an OpenAI-API-compatible server on the GLM-4v-9B model.
  • 🔥 News: 2024/09/05: We open-sourced longcite-glm4-9b, a model that enables LLMs to generate fine-grained
    citations in long-context Q&A, together with the dataset LongCite-45k.
    You are welcome to try it online in the Huggingface Space.
  • 🔥 News: 2024/09/04: Added demo code for using vLLM with a LoRA adapter on the GLM-4-9B-Chat model.
  • 🔥 News: 2024/08/15: We open-sourced longwriter-glm4-9b, a model with long-text output capability (a single-turn
    reply can exceed 10K tokens), together with the dataset LongWriter-6k.
    You are welcome to try it online in the Huggingface Space or the ModelScope community space.
  • 🔥 News: 2024/08/12: The transformers version required by the GLM-4-9B-Chat model has been upgraded to 4.44.0.
    Please pull all files again except the model weights (*.safetensor files and tokenizer.model) and strictly update
    the dependencies as per basic_demo/requirements.txt.
  • 🔥 News: 2024/07/24: We released the latest technical interpretation related to long context. Check here to view
    our technical report on long-context technology in the training of the open-source GLM-4-9B model.
  • 🔥 News: 2024/7/16: The transformers version that the GLM-4-9B-Chat model depends on has been upgraded to 4.42.4.
    Please update the model configuration files and refer to basic_demo/requirements.txt to update the dependencies.
  • 🔥 News: 2024/7/9: The GLM-4-9B-Chat model has been adapted to Ollama and Llama.cpp; you can check the specific
    details in the PR.
  • 🔥 News: 2024/7/1: We updated the fine-tuning of GLM-4V-9B. You need to update the run files and configuration
    files of our model repository to support this feature. For more fine-tuning details (such as dataset format and
    GPU memory requirements), please check here.
  • 🔥 News: 2024/6/28: We worked with the Intel technical team to improve the ITREX and OpenVINO deployment tutorials
    for GLM-4-9B-Chat. You can efficiently deploy the GLM-4-9B open-source model on Intel CPU/GPU devices. Welcome to
    check them out.
  • 🔥 News: 2024/6/24: We updated the run files and configuration files of the model repository to support Flash
    Attention 2. Please update the model configuration files and refer to the sample code in
    basic_demo/trans_cli_demo.py.
  • 🔥 News: 2024/6/19: We updated the run files and configuration files of the model repository and fixed some known
    model inference issues. Welcome to clone the latest model repository.
  • 🔥 News: 2024/6/18: We released a technical report; welcome to check it out.
  • 🔥 News: 2024/6/05: We released the GLM-4-9B series of open-source models.

Model Introduction

GLM-4-9B is the open-source version of the latest generation of pre-trained models in the GLM-4 series launched by
Zhipu AI. In evaluations on datasets covering semantics, mathematics, reasoning, code, and knowledge, GLM-4-9B and its
human-preference-aligned version GLM-4-9B-Chat both show performance surpassing Llama-3-8B. Beyond multi-round
conversation, GLM-4-9B-Chat also offers advanced features such as web browsing, code execution, custom tool calling
(Function Call), and long-context reasoning (up to 128K context). This generation adds multilingual support for 26
languages, including Japanese, Korean, and German. We have also released the GLM-4-9B-Chat-1M model, which supports a
1M context length (about 2 million Chinese characters), and GLM-4V-9B, a multimodal model based on GLM-4-9B. GLM-4V-9B
supports Chinese and English multi-round dialogue at a high resolution of 1120 * 1120 and, in multimodal evaluations
covering comprehensive Chinese/English ability, perception & reasoning, text recognition, and chart understanding,
outperforms GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.

Model List

Model Type Seq Length Download Online Demo
GLM-4-9B Base 8K 🤗 Huggingface 🤖 ModelScope 🟣 WiseModel /
GLM-4-9B-Chat Chat 128K 🤗 Huggingface 🤖 ModelScope 🟣 WiseModel 🤖 ModelScope CPU
🤖 ModelScope vLLM
GLM-4-9B-Chat-1M Chat 1M 🤗 Huggingface 🤖 ModelScope 🟣 WiseModel /
GLM-4V-9B Chat 8K 🤗 Huggingface 🤖 ModelScope 🟣 WiseModel 🤖 ModelScope

Evaluation Results

Typical Tasks (Chat Model)

Model AlignBench MT-Bench IFEval MMLU C-Eval GSM8K MATH HumanEval NaturalCodeBench
Llama-3-8B-Instruct 6.40 8.00 68.6 68.4 51.3 79.6 30.0 62.2 24.7
ChatGLM3-6B 5.18 5.50 28.1 61.4 69.0 72.3 25.7 58.5 11.3
GLM-4-9B-Chat 7.01 8.35 69.0 72.4 75.6 79.6 50.6 71.8 32.2

Typical Tasks (Base Model)

Model MMLU C-Eval GPQA GSM8K MATH HumanEval
Llama-3-8B 66.6 51.2 - 45.8 - 33.5
Llama-3-8B-Instruct 68.4 51.3 34.2 79.6 30.0 62.2
ChatGLM3-6B-Base 61.4 69.0 26.8 72.3 25.7 58.5
GLM-4-9B 74.7 77.1 34.3 84.0 30.4 70.1

Since some math, reasoning, and code-related instruction data was added during GLM-4-9B pre-training, Llama-3-8B-Instruct is also included in the comparison.

Long Context

The needle-in-a-haystack experiment was conducted at a context length of 1M; the results are as follows:

needle

Long-context capability was further evaluated on LongBench-Chat; the results are as follows:

Description text

Multilingual Capability

GLM-4-9B-Chat and Llama-3-8B-Instruct were tested on six multilingual datasets; the results and the languages selected for each dataset are shown in the table below:

Dataset Llama-3-8B-Instruct GLM-4-9B-Chat Languages
M-MMLU 49.6 56.6 all
FLORES 25.0 28.8 ru, es, de, fr, it, pt, pl, ja, nl, ar, tr, cs, vi, fa, hu, el, ro, sv, uk, fi, ko, da, bg, no
MGSM 54.0 65.3 zh, en, bn, de, es, fr, ja, ru, sw, te, th
XWinograd 61.7 73.1 zh, en, fr, jp, ru, pt
XStoryCloze 84.7 90.7 zh, en, ar, es, eu, hi, id, my, ru, sw, te
XCOPA 73.3 80.1 zh, et, ht, id, it, qu, sw, ta, th, tr, vi

Function Call Capability

We tested on the Berkeley Function Calling Leaderboard
and obtained the following results:

Model Overall Acc. AST Summary Exec Summary Relevance
Llama-3-8B-Instruct 58.88 59.25 70.01 45.83
gpt-4-turbo-2024-04-09 81.24 82.14 78.61 88.75
ChatGLM3-6B 57.88 62.18 69.78 5.42
GLM-4-9B-Chat 81.00 80.26 84.40 87.92

Multimodal Capability

GLM-4V-9B is a multimodal language model with visual understanding capabilities. Its evaluation results on related classic tasks are as follows:

MMBench-EN-Test MMBench-CN-Test SEEDBench_IMG MMStar MMMU MME HallusionBench AI2D OCRBench
gpt-4o-2024-05-13 83.4 82.1 77.1 63.9 69.2 2310.3 55.0 84.6 736
gpt-4-turbo-2024-04-09 81.0 80.2 73.0 56.0 61.7 2070.2 43.9 78.6 656
gpt-4-1106-preview 77.0 74.4 72.3 49.7 53.8 1771.5 46.5 75.9 516
InternVL-Chat-V1.5 82.3 80.7 75.2 57.1 46.8 2189.6 47.4 80.6 720
LLaVA-Next-Yi-34B 81.1 79.0 75.7 51.6 48.8 2050.2 34.8 78.9 574
Step-1V 80.7 79.9 70.3 50.0 49.9 2206.4 48.4 79.2 625
MiniCPM-Llama3-V2.5 77.6 73.8 72.3 51.8 45.8 2024.6 42.4 78.4 725
Qwen-VL-Max 77.6 75.7 72.7 49.5 52.0 2281.7 41.2 75.7 684
Gemini 1.0 Pro 73.6 74.3 70.7 38.6 49.0 2148.9 45.7 72.9 680
Claude 3 Opus 63.3 59.2 64.0 45.7 54.9 1586.8 37.8 70.6 694
GLM-4V-9B 81.1 79.4 76.8 58.7 47.2 2163.8 46.6 81.1 786

Quick Start

For hardware configuration and system requirements, please check here.

Use the following methods to quickly call the GLM-4-9B-Chat language model.

Inference with the transformers backend:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0' # set the GPU IDs: one GPU for single-GPU, several IDs for multi-GPU
MODEL_PATH = "THUDM/glm-4-9b-chat"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

query = "你好"

inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto"
).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Inference with the vLLM backend:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4
# If you encounter OOM, reduce max_model_len or increase tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # For GLM-4-9B-Chat-1M, if you encounter OOM, enable the parameters below
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

Use the following methods to quickly call the GLM-4V-9B multimodal model.

Inference with the transformers backend:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0' # set the GPU IDs: one GPU for single-GPU, several IDs for multi-GPU
MODEL_PATH = "THUDM/glm-4v-9b"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

query = '描述这张图片'
image = Image.open("your image").convert('RGB')
inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
                                       add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                       return_dict=True)  # chat mode

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto"
).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

Inference with the vLLM backend:

from PIL import Image
from vllm import LLM, SamplingParams

model_name = "THUDM/glm-4v-9b"

llm = LLM(model=model_name,
          tensor_parallel_size=1,
          max_model_len=8192,
          trust_remote_code=True,
          enforce_eager=True)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=1024,
                                 stop_token_ids=stop_token_ids)

prompt = "What's the content of the image?"
image = Image.open("your image").convert('RGB')
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
        },
        }
outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Complete Project List

If you want to learn more about the GLM-4-9B series of open-source models, this repository provides developers with basic usage and development code for GLM-4-9B through the following content:

  • basic_demo: contains

    • Interaction code using the transformers and vLLM backends
    • OpenAI-API-compatible backend interaction code
    • Batch inference code
  • composite_demo: contains

    • A fully featured demo of the GLM-4-9B-Chat and GLM-4V-9B open-source models, including All Tools capability, long-document interpretation, and multimodal capabilities.
  • finetune_demo: contains

    • PEFT (LoRA, P-Tuning) fine-tuning code
    • SFT fine-tuning code

Friendly Links

  • LLaMA-Factory: an efficient open-source fine-tuning framework that already supports fine-tuning the GLM-4-9B-Chat
    language model.
  • SWIFT: the LLM / multimodal LLM training framework from the ModelScope community, which already supports
    fine-tuning GLM-4-9B-Chat / GLM-4V-9B.
  • Xorbits Inference: a powerful and comprehensive distributed inference framework that lets you deploy your own
    models or built-in state-of-the-art open-source models with one click.
  • LangChain-ChatChat: RAG and Agent applications based on Langchain and language models such as ChatGLM.
  • self-llm: a tutorial on using the GLM-4-9B series of models provided by the Datawhale team.
  • chatglm.cpp: a llama.cpp-style quantized, accelerated inference solution enabling real-time chat on a laptop.

License

  • The use of the GLM-4 model weights must follow the Model License.

  • The code in this open-source repository is released under the Apache 2.0 License.

Please strictly follow the open-source licenses.

Citation

If you find our work helpful, please consider citing the following papers.

@misc{glm2024chatglm,
      title={ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools}, 
      author={Team GLM and Aohan Zeng and Bin Xu and Bowen Wang and Chenhui Zhang and Da Yin and Diego Rojas and Guanyu Feng and Hanlin Zhao and Hanyu Lai and Hao Yu and Hongning Wang and Jiadai Sun and Jiajie Zhang and Jiale Cheng and Jiayi Gui and Jie Tang and Jing Zhang and Juanzi Li and Lei Zhao and Lindong Wu and Lucen Zhong and Mingdao Liu and Minlie Huang and Peng Zhang and Qinkai Zheng and Rui Lu and Shuaiqi Duan and Shudan Zhang and Shulin Cao and Shuxun Yang and Weng Lam Tam and Wenyi Zhao and Xiao Liu and Xiao Xia and Xiaohan Zhang and Xiaotao Gu and Xin Lv and Xinghan Liu and Xinyi Liu and Xinyue Yang and Xixuan Song and Xunkai Zhang and Yifan An and Yifan Xu and Yilin Niu and Yuantao Yang and Yueyan Li and Yushi Bai and Yuxiao Dong and Zehan Qi and Zhaoyu Wang and Zhen Yang and Zhengxiao Du and Zhenyu Hou and Zihan Wang},
      year={2024},
      eprint={2406.12793},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

GLM-4

📄 Report • 🤗 HF Repo • 🤖 ModelScope • 🟣 WiseModel • 🐦 Twitter • 👋 Join Discord and WeChat

📍Experience and use a larger-scale GLM business model on the Zhipu AI Open Platform

Update

  • 🔥 News: 2024/10/12: Add GLM-4v-9B model support for the vllm framework.
  • 🔥 News: 2024/09/06: Add support for an OpenAI API server on the GLM-4v-9B model.
  • 🔥 News: 2024/09/05: We open-sourced a model enabling LLMs to generate fine-grained citations in
    long-context Q&A: longcite-glm4-9b, along with the
    dataset LongCite-45k. You are welcome to experience it online
    at Huggingface Space.
  • 🔥 News: 2024/09/04: Add demo code for using vLLM with a LoRA adapter on the GLM-4-9B-Chat model.
  • 🔥 News: 2024/08/15: We have open-sourced a model with long-text output capability (a single-turn LLM output
    can exceed 10K tokens), longwriter-glm4-9b, and the
    dataset LongWriter-6k. You're welcome to try it online.
  • 🔥 News: 2024/08/12: The transformers version required for the GLM-4-9B-Chat model has been upgraded
    to 4.44.0. Please pull all files again except for the model weights (*.safetensor files and tokenizer.model),
    and strictly update the dependencies as per basic_demo/requirements.txt.
  • 🔥 News: 2024/07/24: We released the latest technical interpretation related to long texts. Check
    out here to view our
    technical report on long-context technology in the training of the open-source GLM-4-9B model.
  • 🔥 News: 2024/7/16: The transformers version that the GLM-4-9B-Chat model depends on has been upgraded
    to 4.42.4. Please update the model configuration file and refer to basic_demo/requirements.txt to update the
    dependencies.
  • 🔥 News: 2024/7/9: The GLM-4-9B-Chat model has been adapted to Ollama
    and Llama.cpp, you can check the specific details
    in PR.
  • 🔥 News: 2024/7/1: We have updated the multimodal fine-tuning of GLM-4V-9B. You need to update the run file and
    configuration file of our model repository to support this feature. For more fine-tuning details (such as dataset
    format, video memory requirements), please go to view.
  • 🔥 News: 2024/6/28: We have worked with the Intel technical team to improve the ITREX and OpenVINO deployment
    tutorials for GLM-4-9B-Chat. You can use Intel CPU/GPU devices to efficiently deploy the GLM-4-9B open source model.
    Welcome to view.
  • 🔥 News: 2024/6/24: We have updated the running files and configuration files of the model repository to
    support Flash Attention 2, Please update the model configuration file and refer to the sample code
    in basic_demo/trans_cli_demo.py.
  • 🔥 News: 2024/6/19: We updated the running files and configuration files of the model repository and fixed some
    model inference issues. Welcome to clone the latest model repository.
  • 🔥 News: 2024/6/18: We released a technical report, welcome to check it
    out.
  • 🔥 News: 2024/6/05: We released the GLM-4-9B series of open source models

Model Introduction

GLM-4-9B is the open-source version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu
AI. In evaluations on datasets covering semantics, mathematics, reasoning, code, and knowledge, GLM-4-9B
and its human preference-aligned version GLM-4-9B-Chat both outperform Llama-3-8B. In
addition to multi-round conversations, GLM-4-9B-Chat also offers advanced features such as web browsing, code execution,
custom tool calls (Function Call), and long-text reasoning (supporting up to 128K context).
This generation of models adds multi-language support, covering 26 languages including Japanese, Korean,
and German. We have also launched the GLM-4-9B-Chat-1M model, which supports a 1M
context length (about 2 million Chinese characters), and the multimodal model GLM-4V-9B based on GLM-4-9B.
GLM-4V-9B supports dialogue in both Chinese and English at a high resolution of 1120×1120.
In various multimodal evaluations, including comprehensive abilities in Chinese and English, perception & reasoning,
text recognition, and chart understanding, GLM-4V-9B demonstrates superior performance compared to
GPT-4-turbo-2024-04-09, Gemini 1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.

Model List

| Model            | Type | Seq Length | Download                                  | Online Demo                            |
|------------------|------|------------|-------------------------------------------|----------------------------------------|
| GLM-4-9B         | Base | 8K         | 🤗 Huggingface 🤖 ModelScope 🟣 WiseModel | /                                      |
| GLM-4-9B-Chat    | Chat | 128K       | 🤗 Huggingface 🤖 ModelScope 🟣 WiseModel | 🤖 ModelScope CPU, 🤖 ModelScope vLLM  |
| GLM-4-9B-Chat-1M | Chat | 1M         | 🤗 Huggingface 🤖 ModelScope 🟣 WiseModel | /                                      |
| GLM-4V-9B        | Chat | 8K         | 🤗 Huggingface 🤖 ModelScope 🟣 WiseModel | 🤖 ModelScope                          |

BenchMark

Typical Tasks

| Model               | AlignBench | MT-Bench | IFEval | MMLU | C-Eval | GSM8K | MATH | HumanEval | NaturalCodeBench |
|---------------------|------------|----------|--------|------|--------|-------|------|-----------|------------------|
| Llama-3-8B-Instruct | 6.40       | 8.00     | 68.58  | 68.4 | 51.3   | 79.6  | 30.0 | 62.2      | 24.7             |
| ChatGLM3-6B         | 5.18       | 5.50     | 28.1   | 66.4 | 69.0   | 72.3  | 25.7 | 58.5      | 11.3             |
| GLM-4-9B-Chat       | 7.01       | 8.35     | 69.0   | 72.4 | 75.6   | 79.6  | 50.6 | 71.8      | 32.2             |

Base Model

| Model               | MMLU | C-Eval | GPQA | GSM8K | MATH | HumanEval |
|---------------------|------|--------|------|-------|------|-----------|
| Llama-3-8B          | 66.6 | 51.2   | -    | 45.8  | -    | 33.5      |
| Llama-3-8B-Instruct | 68.4 | 51.3   | 34.2 | 79.6  | 30.0 | 62.2      |
| ChatGLM3-6B-Base    | 61.4 | 69.0   | 26.8 | 72.3  | 25.7 | 58.5      |
| GLM-4-9B            | 74.7 | 77.1   | 34.3 | 84.0  | 30.4 | 70.1      |

Since GLM-4-9B adds some math, reasoning, and code-related instruction data during pre-training, Llama-3-8B-Instruct
is also included in the comparison range.

Long Context

The needle-in-the-haystack experiment was
conducted with a context length of 1M, and the results are as follows:

(Figure: needle-in-the-haystack evaluation results at 1M context length)

The long text capability was further evaluated on LongBench-Chat, and the results are as follows:

(Figure: LongBench-Chat evaluation results)

Multi Language

The tests for GLM-4-9B-Chat and Llama-3-8B-Instruct are conducted on six multilingual datasets. The test results and the
corresponding languages selected for each dataset are shown in the table below:

| Dataset     | Llama-3-8B-Instruct | GLM-4-9B-Chat | Languages |
|-------------|---------------------|---------------|-----------|
| M-MMLU      | 49.6                | 56.6          | all       |
| FLORES      | 25.0                | 28.8          | ru, es, de, fr, it, pt, pl, ja, nl, ar, tr, cs, vi, fa, hu, el, ro, sv, uk, fi, ko, da, bg, no |
| MGSM        | 54.0                | 65.3          | zh, en, bn, de, es, fr, ja, ru, sw, te, th |
| XWinograd   | 61.7                | 73.1          | zh, en, fr, jp, ru, pt |
| XStoryCloze | 84.7                | 90.7          | zh, en, ar, es, eu, hi, id, my, ru, sw, te |
| XCOPA       | 73.3                | 80.1          | zh, et, ht, id, it, qu, sw, ta, th, tr, vi |

Function Call

Tested on the Berkeley Function Calling Leaderboard.

| Model                  | Overall Acc. | AST Summary | Exec Summary | Relevance |
|------------------------|--------------|-------------|--------------|-----------|
| Llama-3-8B-Instruct    | 58.88        | 59.25       | 70.01        | 45.83     |
| gpt-4-turbo-2024-04-09 | 81.24        | 82.14       | 78.61        | 88.75     |
| ChatGLM3-6B            | 57.88        | 62.18       | 69.78        | 5.42      |
| GLM-4-9B-Chat          | 81.00        | 80.26       | 84.40        | 87.92     |

Multi-Modal

GLM-4V-9B is a multimodal language model with visual understanding capabilities. The evaluation results of its related
classic tasks are as follows:

| Model                  | MMBench-EN-Test | MMBench-CN-Test | SEEDBench_IMG | MMStar | MMMU | MME    | HallusionBench | AI2D | OCRBench |
|------------------------|-----------------|-----------------|---------------|--------|------|--------|----------------|------|----------|
| gpt-4o-2024-05-13      | 83.4            | 82.1            | 77.1          | 63.9   | 69.2 | 2310.3 | 55             | 84.6 | 736      |
| gpt-4-turbo-2024-04-09 | 81.0            | 80.2            | 73.0          | 56.0   | 61.7 | 2070.2 | 43.9           | 78.6 | 656      |
| gpt-4-1106-preview     | 77.0            | 74.4            | 72.3          | 49.7   | 53.8 | 1771.5 | 46.5           | 75.9 | 516      |
| InternVL-Chat-V1.5     | 82.3            | 80.7            | 75.2          | 57.1   | 46.8 | 2189.6 | 47.4           | 80.6 | 720      |
| LLaVA-Next-Yi-34B      | 81.1            | 79              | 75.7          | 51.6   | 48.8 | 2050.2 | 34.8           | 78.9 | 574      |
| Step-1V                | 80.7            | 79.9            | 70.3          | 50.0   | 49.9 | 2206.4 | 48.4           | 79.2 | 625      |
| MiniCPM-Llama3-V2.5    | 77.6            | 73.8            | 72.3          | 51.8   | 45.8 | 2024.6 | 42.4           | 78.4 | 725      |
| Qwen-VL-Max            | 77.6            | 75.7            | 72.7          | 49.5   | 52   | 2281.7 | 41.2           | 75.7 | 684      |
| Gemini 1.0 Pro         | 73.6            | 74.3            | 70.7          | 38.6   | 49   | 2148.9 | 45.7           | 72.9 | 680      |
| Claude 3 Opus          | 63.3            | 59.2            | 64            | 45.7   | 54.9 | 1586.8 | 37.8           | 70.6 | 694      |
| GLM-4V-9B              | 81.1            | 79.4            | 76.8          | 58.7   | 47.2 | 2163.8 | 46.6           | 81.1 | 786      |

Quick call

For hardware configuration and system requirements, please check here.

Use the following method to quickly call the GLM-4-9B-Chat language model

Use the transformers backend for inference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Select which GPU(s) to use; list multiple ids (e.g. '0,1') for multi-GPU inference
MODEL_PATH = "THUDM/glm-4-9b-chat"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

query = "你好"

inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto"
).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Use the vLLM backend for inference:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat
# If you encounter OOM, you can try to reduce max_model_len or increase tp_size
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # if you encounter OOM in GLM-4-9B-Chat-1M, you can try to enable the following parameters
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

Use the following method to quickly call the GLM-4V-9B multimodal model

Use the transformers backend for inference:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Select which GPU(s) to use; list multiple ids (e.g. '0,1') for multi-GPU inference
MODEL_PATH = "THUDM/glm-4v-9b"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

query = '描述这张图片'
image = Image.open("your image").convert('RGB')
inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
                                       add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                       return_dict=True)  # chat mode

inputs = inputs.to(device)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto"
).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

Use the vLLM backend for inference:

from PIL import Image
from vllm import LLM, SamplingParams

model_name = "THUDM/glm-4v-9b"

llm = LLM(model=model_name,
          tensor_parallel_size=1,
          max_model_len=8192,
          trust_remote_code=True,
          enforce_eager=True)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=1024,
                                 stop_token_ids=stop_token_ids)

prompt = "What's the content of the image?"
image = Image.open("your image").convert('RGB')
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Complete project list

If you want to learn more about the GLM-4-9B series of open source models, this open source repository provides
developers with basic GLM-4-9B usage and development code through the following content:

  • basic_demo: Contains
      • Interaction code using the transformers and vLLM backends
      • OpenAI API backend interaction code (a minimal client sketch follows this list)
      • Batch inference code
  • composite_demo: Contains
      • Fully functional demonstration code for the GLM-4-9B and GLM-4V-9B open source models, including All Tools
        capabilities, long document interpretation, and multimodal capabilities
  • finetune_demo: Contains
      • PEFT (LoRA, P-Tuning) fine-tuning code
      • SFT fine-tuning code
  • LLaMA-Factory: An efficient open-source fine-tuning framework that
    already supports GLM-4-9B-Chat language model fine-tuning.
  • SWIFT: An LLM/VLM training framework from ModelScope that supports
    GLM-4-9B-Chat / GLM-4V-9B fine-tuning.
  • Xorbits Inference: A performance-enhanced and comprehensive global inference
    framework; easily deploy your own models or import cutting-edge open source models with one click.
  • LangChain-ChatChat: RAG and Agent applications based on
    language models such as Langchain and ChatGLM.
  • self-llm: Datawhale's self-llm project, which
    includes the GLM-4-9B open source model cookbook.
  • chatglm.cpp: Real-time inference on your laptop accelerated by quantization,
    similar to llama.cpp.
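
As a quick illustration of the OpenAI API interaction mentioned in basic_demo above, the following sketch sends a chat
request with the official openai Python client. It assumes you have already started an OpenAI-compatible server from
basic_demo; the base_url, port, api_key, and model name below are placeholder assumptions rather than values taken from
this repository, so adjust them to match your own deployment.

from openai import OpenAI

# Assumed local endpoint of an OpenAI-compatible server started from basic_demo;
# replace the URL, api_key, and model name with the values of your own deployment.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4",
    messages=[{"role": "user", "content": "你好"}],
    temperature=0.8,
    max_tokens=256,
)
print(response.choices[0].message.content)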

License

  • The use of GLM-4 model weights must follow
    the Model License.

  • The code in this open source repository follows the Apache 2.0 license.

Please strictly follow the open source license.

Reference

If you find our work helpful, please consider citing the following paper.

@misc{glm2024chatglm,
      title={ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools}, 
      author={Team GLM  and Aohan Zeng and Bin Xu and Bowen Wang and Chenhui Zhang and Da Yin and Diego Rojas and Guanyu Feng and Hanlin Zhao and Hanyu Lai and Hao Yu and Hongning Wang and Jiadai Sun and Jiajie Zhang and Jiale Cheng and Jiayi Gui and Jie Tang and Jing Zhang and Juanzi Li and Lei Zhao and Lindong Wu and Lucen Zhong and Mingdao Liu and Minlie Huang and Peng Zhang and Qinkai Zheng and Rui Lu and Shuaiqi Duan and Shudan Zhang and Shulin Cao and Shuxun Yang and Weng Lam Tam and Wenyi Zhao and Xiao Liu and Xiao Xia and Xiaohan Zhang and Xiaotao Gu and Xin Lv and Xinghan Liu and Xinyi Liu and Xinyue Yang and Xixuan Song and Xunkai Zhang and Yifan An and Yifan Xu and Yilin Niu and Yuantao Yang and Yueyan Li and Yushi Bai and Yuxiao Dong and Zehan Qi and Zhaoyu Wang and Zhen Yang and Zhengxiao Du and Zhenyu Hou and Zihan Wang},
      year={2024},
      eprint={2406.12793},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models}, 
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Scan the QR code to follow the official WeChat account and join the "GLM-4 Discussion Group".

.\chatglm4v-9b\configuration_chatglm.py

# 从 transformers 库导入预训练配置类
from transformers import PretrainedConfig


# 定义 ChatGLMConfig 类,继承自 PretrainedConfig
class ChatGLMConfig(PretrainedConfig):
    # 设置模型类型为 "chatglm"
    model_type = "chatglm"

    # 初始化方法,设置模型的各种参数
    def __init__(
            # 定义模型层数,默认为 28
            num_layers=28,
            # 定义填充后的词汇表大小,默认为 65024
            padded_vocab_size=65024,
            # 定义隐藏层的大小,默认为 4096
            hidden_size=4096,
            # 定义前馈网络隐藏层的大小,默认为 13696
            ffn_hidden_size=13696,
            # 定义键值通道的数量,默认为 128
            kv_channels=128,
            # 定义注意力头的数量,默认为 32
            num_attention_heads=32,
            # 定义序列长度,默认为 2048
            seq_length=2048,
            # 定义隐藏层的 dropout 比例,默认为 0.0
            hidden_dropout=0.0,
            # 定义分类器的 dropout 比例,默认为 None
            classifier_dropout=None,
            # 定义注意力层的 dropout 比例,默认为 0.0
            attention_dropout=0.0,
            # 定义 layernorm 的 epsilon 值,默认为 1e-5
            layernorm_epsilon=1e-5,
            # 定义是否使用 rmsnorm,默认为 True
            rmsnorm=True,
            # 定义是否在 layernorm 后应用残差连接,默认为 False
            apply_residual_connection_post_layernorm=False,
            # 定义是否使用后层归一化,默认为 True
            post_layer_norm=True,
            # 定义是否添加线性偏置,默认为 False
            add_bias_linear=False,
            # 定义是否添加 QKV 偏置,默认为 False
            add_qkv_bias=False,
            # 定义是否进行偏置 dropout 融合,默认为 True
            bias_dropout_fusion=True,
            # 定义是否使用多查询注意力,默认为 False
            multi_query_attention=False,
            # 定义多查询组的数量,默认为 1
            multi_query_group_num=1,
            # 定义 ROPE 比例,默认为 1
            rope_ratio=1,
            # 定义是否应用查询-键层缩放,默认为 True
            apply_query_key_layer_scaling=True,
            # 定义是否在 FP32 中进行注意力 softmax,默认为 True
            attention_softmax_in_fp32=True,
            # 定义是否使用 FP32 残差连接,默认为 False
            fp32_residual_connection=False,
            # 定义前序列长度,默认为 None
            pre_seq_len=None,
            # 定义是否使用前缀投影,默认为 False
            prefix_projection=False,
            # 定义 BOI token 的 ID,默认为 None
            boi_token_id=None,
            # 定义 EOI token 的 ID,默认为 None
            eoi_token_id=None,
            # 其他参数,允许扩展
            **kwargs
    ):
        # 将 num_layers 参数赋值给实例属性
        self.num_layers = num_layers
        # 将词汇表大小赋值给实例属性
        self.vocab_size = padded_vocab_size
        # 将填充后的词汇表大小赋值给实例属性
        self.padded_vocab_size = padded_vocab_size
        # 将隐藏层大小赋值给实例属性
        self.hidden_size = hidden_size
        # 将前馈网络隐藏层大小赋值给实例属性
        self.ffn_hidden_size = ffn_hidden_size
        # 将键值通道数量赋值给实例属性
        self.kv_channels = kv_channels
        # 将注意力头数量赋值给实例属性
        self.num_attention_heads = num_attention_heads
        # 将序列长度赋值给实例属性
        self.seq_length = seq_length
        # 将隐藏层 dropout 赋值给实例属性
        self.hidden_dropout = hidden_dropout
        # 将分类器 dropout 赋值给实例属性
        self.classifier_dropout = classifier_dropout
        # 将注意力 dropout 赋值给实例属性
        self.attention_dropout = attention_dropout
        # 将 layernorm epsilon 赋值给实例属性
        self.layernorm_epsilon = layernorm_epsilon
        # 将 rmsnorm 赋值给实例属性
        self.rmsnorm = rmsnorm
        # 将是否应用残差连接后的 layernorm 赋值给实例属性
        self.apply_residual_connection_post_layernorm = apply_residual_connection_post_layernorm
        # 将后层归一化赋值给实例属性
        self.post_layer_norm = post_layer_norm
        # 将是否添加线性偏置赋值给实例属性
        self.add_bias_linear = add_bias_linear
        # 将是否添加 QKV 偏置赋值给实例属性
        self.add_qkv_bias = add_qkv_bias
        # 将偏置 dropout 融合赋值给实例属性
        self.bias_dropout_fusion = bias_dropout_fusion
        # 将多查询注意力赋值给实例属性
        self.multi_query_attention = multi_query_attention
        # 将多查询组的数量赋值给实例属性
        self.multi_query_group_num = multi_query_group_num
        # 将 ROPE 比例赋值给实例属性
        self.rope_ratio = rope_ratio
        # 将查询-键层缩放赋值给实例属性
        self.apply_query_key_layer_scaling = apply_query_key_layer_scaling
        # 将注意力 softmax 在 FP32 中的设置赋值给实例属性
        self.attention_softmax_in_fp32 = attention_softmax_in_fp32
        # 将 FP32 残差连接的设置赋值给实例属性
        self.fp32_residual_connection = fp32_residual_connection
        # 将前序列长度赋值给实例属性
        self.pre_seq_len = pre_seq_len
        # 将前缀投影赋值给实例属性
        self.prefix_projection = prefix_projection
        # 将 BOI token ID 赋值给实例属性
        self.boi_token_id = boi_token_id
        # 将 EOI token ID 赋值给实例属性
        self.eoi_token_id = eoi_token_id
        # 调用父类的初始化方法,传递其他参数
        super().__init__(**kwargs)
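
# Usage sketch (illustrative, not part of the original file): the config mirrors
# padded_vocab_size into vocab_size and forwards any unrecognized kwargs to PretrainedConfig.
#   config = ChatGLMConfig(num_layers=2, hidden_size=1024, padded_vocab_size=32000)
#   assert config.vocab_size == config.padded_vocab_size == 32000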

.\chatglm4v-9b\modeling_chatglm.py

# PyTorch GLM-4V 模型的文档字符串
""" PyTorch GLM-4V model. """
# 导入数学库
import math
# 导入系统库
import sys
# 导入 PyTorch 库
import torch
# 导入用于检查点的工具
import torch.utils.checkpoint
# 导入 PyTorch 的功能性模块
import torch.nn.functional as F
# 从 PyTorch 导入 nn 模块
from torch import nn
# 从 nn 模块导入多种损失函数
from torch.nn import CrossEntropyLoss, LayerNorm, MSELoss, BCEWithLogitsLoss
# 从 nn.utils 导入跳过初始化的工具
from torch.nn.utils import skip_init
# 导入类型提示相关的模块
from typing import Optional, Tuple, Union, List, Dict, Any

# 从 transformers 导入模型输出相关的类
from transformers.modeling_outputs import (
    BaseModelOutputWithPast,
    CausalLMOutputWithPast,
    SequenceClassifierOutputWithPast,
)
# 从 transformers 导入预训练模型的基类
from transformers.modeling_utils import PreTrainedModel
# 从 transformers 导入日志记录和可用性检查
from transformers.utils import logging, is_torch_npu_available
# 从生成模块导入 logits 处理器
from transformers.generation.logits_process import LogitsProcessor
# 从生成工具导入生成相关的类
from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput

# 导入视觉模型
from .visual import EVA2CLIPModel
# 导入 ChatGLM 配置
from .configuration_chatglm import ChatGLMConfig

# 尝试导入 Flash Attention 相关工具
try:
    from transformers.utils import is_flash_attn_greater_or_equal_2_10, is_flash_attn_2_available

    # 如果 Flash Attention 2 可用,导入相关函数
    if is_flash_attn_2_available():
        from flash_attn import flash_attn_func, flash_attn_varlen_func
        from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
# 捕获导入异常
except:
    pass

# 设置 JIT 融合内核所需的标志
# 如果不是在 macOS 上并且不支持 NPU,则设置 JIT 配置
if sys.platform != 'darwin' and not is_torch_npu_available():
    torch._C._jit_set_profiling_mode(False)  # 禁用 JIT 轮廓模式
    torch._C._jit_set_profiling_executor(False)  # 禁用 JIT 轮廓执行器
    torch._C._jit_override_can_fuse_on_cpu(True)  # 允许在 CPU 上融合
    torch._C._jit_override_can_fuse_on_gpu(True)  # 允许在 GPU 上融合

# 获取当前模块的日志记录器
logger = logging.get_logger(__name__)

# 定义语言和视觉的标记类型
LANGUAGE_TOKEN_TYPE = 0
VISION_TOKEN_TYPE = 1

# 定义文档检查点和配置
_CHECKPOINT_FOR_DOC = "THUDM/ChatGLM"
_CONFIG_FOR_DOC = "ChatGLMConfig"


# 默认初始化函数
def default_init(cls, *args, **kwargs):
    # 使用给定的参数初始化类
    return cls(*args, **kwargs)


# 定义无效分数的 logits 处理器
class InvalidScoreLogitsProcessor(LogitsProcessor):
    # 重写调用方法
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # 检查分数是否存在 NaN 或 Inf
        if torch.isnan(scores).any() or torch.isinf(scores).any():
            # 将分数置为零
            scores.zero_()
            # 设置特定索引的分数
            scores[..., 198] = 5e4
        # 返回处理后的分数
        return scores


# 定义前缀编码器
class PrefixEncoder(torch.nn.Module):
    """
    用于编码前缀的 torch.nn 模型
    输入形状: (batch-size, prefix-length)
    输出形状: (batch-size, prefix-length, 2*layers*hidden)
    """
    # 初始化方法,接受一个 ChatGLMConfig 配置对象
    def __init__(self, config: ChatGLMConfig):
        # 调用父类的初始化方法
        super().__init__()
        # 从配置中获取前缀投影的设置
        self.prefix_projection = config.prefix_projection
        # 如果启用了前缀投影
        if self.prefix_projection:
            # 计算用于编码前缀的键值对的大小
            kv_size = config.num_layers * config.kv_channels * config.multi_query_group_num * 2
            # 创建嵌入层,输入大小为 pre_seq_len,输出大小为 kv_size
            self.embedding = torch.nn.Embedding(config.pre_seq_len, kv_size)
            # 创建一个包含两个线性层和一个 Tanh 激活函数的顺序网络
            self.trans = torch.nn.Sequential(
                # 第一层线性变换,输入大小为 kv_size,输出大小为 hidden_size
                torch.nn.Linear(kv_size, config.hidden_size),
                # 应用 Tanh 激活函数
                torch.nn.Tanh(),
                # 第二层线性变换,输入大小为 hidden_size,输出大小为 kv_size
                torch.nn.Linear(config.hidden_size, kv_size)
            )
        else:
            # 如果没有启用前缀投影,直接创建嵌入层
            self.embedding = torch.nn.Embedding(config.pre_seq_len,
                                                config.num_layers * config.kv_channels * config.multi_query_group_num * 2)

    # 前向传播方法,接受一个前缀张量
    def forward(self, prefix: torch.Tensor):
        # 如果启用了前缀投影
        if self.prefix_projection:
            # 将前缀张量通过嵌入层进行嵌入
            prefix_tokens = self.embedding(prefix)
            # 通过转换网络获取过去的键值对
            past_key_values = self.trans(prefix_tokens)
        else:
            # 如果没有前缀投影,直接通过嵌入层获取过去的键值对
            past_key_values = self.embedding(prefix)
        # 返回过去的键值对
        return past_key_values
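
# Shape sketch (illustrative, not part of the original file), assuming prefix_projection=False,
# pre_seq_len=16, num_layers=28, kv_channels=128, multi_query_group_num=2:
#   encoder = PrefixEncoder(config)
#   prefix = torch.arange(16).unsqueeze(0)      # (batch=1, prefix_len=16)
#   past_key_values = encoder(prefix)           # (1, 16, 28 * 128 * 2 * 2) = (1, 16, 14336)
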
# 定义一个函数用于沿最后一个维度拆分张量
def split_tensor_along_last_dim(
        tensor: torch.Tensor,  # 输入的张量
        num_partitions: int,  # 拆分张量的分区数
        contiguous_split_chunks: bool = False,  # 是否要求每个分块在内存中是连续的
) -> List[torch.Tensor]:  # 返回类型为张量列表
    """拆分张量沿其最后一个维度。

    参数:
        tensor: 输入张量。
        num_partitions: 拆分张量的分区数
        contiguous_split_chunks: 如果为 True,则使每个块在内存中连续。

    返回:
        张量列表
    """
    # 获取张量的最后维度索引
    last_dim = tensor.dim() - 1
    # 计算每个分区的大小
    last_dim_size = tensor.size()[last_dim] // num_partitions
    # 使用 torch.split 函数进行拆分
    tensor_list = torch.split(tensor, last_dim_size, dim=last_dim)
    # 注意:torch.split 默认不创建连续的张量
    if contiguous_split_chunks:  # 如果需要连续的分块
        # 返回每个分块的连续版本
        return tuple(chunk.contiguous() for chunk in tensor_list)

    # 返回拆分后的张量列表
    return tensor_list
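
# Example (illustrative, not part of the original file): splitting a fused QKV projection
# of width 3 * hn into three equal chunks along the last dimension.
#   mixed = torch.randn(4, 2, 3 * 8)            # [sq, b, 3 * hn]
#   q, k, v = split_tensor_along_last_dim(mixed, 3)
#   q.shape == k.shape == v.shape == torch.Size([4, 2, 8])   # True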


# 定义一个旋转嵌入类,继承自 nn.Module
class RotaryEmbedding(nn.Module):
    # 初始化函数,设置参数
    def __init__(self, dim, rope_ratio=1, original_impl=False, device=None, dtype=None):
        super().__init__()  # 调用父类初始化
        # 计算反频率并在 buffer 中注册
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim))
        self.register_buffer("inv_freq", inv_freq)  # 注册反频率
        self.dim = dim  # 保存维度信息
        self.original_impl = original_impl  # 保存原始实现标志
        self.rope_ratio = rope_ratio  # 保存旋转比例

    # 实现方法,根据序列长度和维度生成嵌入
    def impl(self, seq_length: int, dim: int, device: torch.device, dtype: torch.dtype):
        base = 10000 * self.rope_ratio  # 计算基础值
        # 计算反频率
        inv_freq = 1.0 / (
                base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
        # 创建序列的张量
        seq = torch.arange(seq_length, device=inv_freq.device, dtype=torch.float32)
        # 计算频率的外积
        freqs = torch.outer(seq, inv_freq)
        # 第一个部分是偶数向量分量,第二个部分是奇数向量分量,
        # 维度大小为 2 * dim
        emb = torch.cat((freqs, freqs), dim=-1)  # 将频率拼接
        return emb  # 返回嵌入

    # 前向实现函数,定义前向传播的逻辑
    def forward_impl(
            self, seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000
    ):
        """增强的 Transformer,带有旋转位置嵌入。

        来源于: https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/
        transformers/rope/__init__.py。MIT 许可证:
        https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/license。
        """
        # 计算旋转嵌入的基础 $\Theta = {\theta_i = 10000^{\frac{2(i-1)}{d}}, i \in [1, 2, ..., \frac{d}{2}]}$
        base = base * self.rope_ratio
        # 计算每个位置的频率 $\theta_i$,用于位置嵌入
        theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, dtype=torch.float, device=device) / n_elem))

        # 创建位置索引 `[0, 1, ..., seq_len - 1]`
        seq_idx = torch.arange(seq_len, dtype=torch.float, device=device)

        # 计算位置索引与频率的外积
        idx_theta = torch.outer(seq_idx, theta).float()

        # 堆叠余弦和正弦值,形成位置嵌入的缓存
        cache = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1)

        # 处理数据类型,模拟 complex32 的行为,避免结果不同
        if dtype in (torch.float16, torch.bfloat16, torch.int8):
            # 将缓存转换为 bfloat16 或 half,根据数据类型
            cache = cache.bfloat16() if dtype == torch.bfloat16 else cache.half()
        # 返回计算得到的缓存
        return cache

    def forward(self, max_seq_len, offset=0):
        # 如果使用原始实现,则调用原始的前向传播方法
        if self.original_impl:
            return self.forward_impl(
                max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device
            )
        # 否则调用自定义实现的前向传播方法
        else:
            return self.impl(max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device)
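
# Shape sketch (illustrative, not part of the original file): with dim=64 and the original
# implementation, the returned cache stores a (cos, sin) pair for every other rotary channel.
#   rope = RotaryEmbedding(64, rope_ratio=1, original_impl=True, dtype=torch.float32)
#   cache = rope(max_seq_len=1024)              # shape (1024, 32, 2)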
# 使用 Torch JIT 编译器将此函数编译为高效的 Torch 脚本
@torch.jit.script
def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    # x: [b, np, sq, hn],其中 b 是批量大小,np 是序列数,sq 是序列长度,hn 是隐藏维度
    b, np, sq, hn = x.size(0), x.size(1), x.size(2), x.size(3)
    # 计算旋转维度,rope_cache 的最后一维的大小乘以 2
    rot_dim = rope_cache.shape[-2] * 2
    # 将 x 分为旋转部分和其他部分
    x, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    # 截断 rope_cache 以支持可变大小
    rope_cache = rope_cache[:, :sq]
    # 将 x 重塑为 [b, np, sq, rot_dim / 2, 2] 的形状
    xshaped = x.reshape(b, np, sq, rot_dim // 2, 2)
    # 将 rope_cache 视图重塑为 [b, 1, sq, xshaped 的最后一维, 2]
    rope_cache = rope_cache.view(-1, 1, sq, xshaped.size(3), 2)
    # 计算输出,使用旋转位置编码的公式
    x_out2 = torch.stack(
        [
            xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
            xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1],
        ],
        -1,
    )
    # 将输出展平,去掉最后一个维度的 3 维信息
    x_out2 = x_out2.flatten(3)
    # 将处理后的输出与未处理部分连接,沿最后一个维度拼接
    return torch.cat((x_out2, x_pass), dim=-1)
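
# Shape sketch (illustrative, not part of the original file): only the first rot_dim features
# of each head are rotated; the remaining features pass through unchanged.
#   x = torch.randn(1, 2, 16, 64)               # [b, np, sq, hn]
#   rope_cache = torch.randn(1, 16, 16, 2)      # [b, sq, rot_dim // 2, 2], so rot_dim = 32
#   y = apply_rotary_pos_emb(x, rope_cache)     # same shape as x: (1, 2, 16, 64)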


# 定义 RMSNorm 类,继承自 torch.nn.Module
class RMSNorm(torch.nn.Module):
    # 初始化 RMSNorm 类的构造函数
    def __init__(self, normalized_shape, eps=1e-5, device=None, dtype=None, **kwargs):
        # 调用父类构造函数
        super().__init__()
        # 创建可学习的权重参数,形状为 normalized_shape
        self.weight = torch.nn.Parameter(torch.empty(normalized_shape, device=device, dtype=dtype))
        # 设置 epsilon 值以避免除以零
        self.eps = eps

    # 定义前向传播方法
    def forward(self, hidden_states: torch.Tensor):
        # 获取输入的 dtype
        input_dtype = hidden_states.dtype
        # 计算方差,取平方后求均值,保持维度
        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
        # 规范化 hidden_states,乘以方差的平方根的倒数
        hidden_states = hidden_states * torch.rsqrt(variance + self.eps)

        # 返回加权的隐藏状态,转换回原始的 dtype
        return (self.weight * hidden_states).to(input_dtype)
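
# Note (illustrative, not part of the original file): the forward pass above computes
#   RMSNorm(x) = weight * x / sqrt(mean(x ** 2, dim=-1) + eps)
# i.e. normalization by the root mean square of the last dimension with no mean subtraction;
# the statistics are computed in float32 and the result is cast back to the input dtype.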


# 定义 CoreAttention 类,继承自 torch.nn.Module
class CoreAttention(torch.nn.Module):
    # 初始化 CoreAttention 类的构造函数
    def __init__(self, config: ChatGLMConfig, layer_number):
        # 调用父类构造函数
        super(CoreAttention, self).__init__()

        # 根据配置设置是否应用查询键层的缩放
        self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
        # 设置注意力 softmax 的数据类型为 FP32
        self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
        # 如果应用查询键层缩放,则强制使用 FP32
        if self.apply_query_key_layer_scaling:
            self.attention_softmax_in_fp32 = True
        # 确保层号至少为 1
        self.layer_number = max(1, layer_number)

        # 计算投影大小
        projection_size = config.kv_channels * config.num_attention_heads

        # 每个注意力头和每个分区的大小
        self.hidden_size_per_partition = projection_size
        # 每个注意力头的隐藏维度
        self.hidden_size_per_attention_head = projection_size // config.num_attention_heads
        # 每个分区的注意力头数量
        self.num_attention_heads_per_partition = config.num_attention_heads

        coeff = None
        # 计算规范化因子,使用每个注意力头的隐藏大小的平方根
        self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
        # 如果应用查询键层缩放,则调整规范化因子
        if self.apply_query_key_layer_scaling:
            coeff = self.layer_number
            self.norm_factor *= coeff
        # 存储缩放系数
        self.coeff = coeff

        # 初始化注意力 dropout
        self.attention_dropout = torch.nn.Dropout(config.attention_dropout)

# 定义 SdpaAttention 类,继承自 CoreAttention
class SdpaAttention(CoreAttention):
    # 定义前向传播函数,接受查询层、键层、值层和注意力掩码作为输入
    def forward(self, query_layer, key_layer, value_layer, attention_mask):
        # 如果没有注意力掩码且查询层的最后一维与键层的最后一维相同
        if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
            # 使用缩放点积注意力计算上下文层,设置为因果模式,并根据训练状态设置丢弃率
            context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
                                                                                 is_causal=True,
                                                                                 dropout_p=self.config.attention_dropout if self.training else 0.0)
        else:
            # 如果存在注意力掩码
            if attention_mask is not None:
                # 反转注意力掩码
                attention_mask = ~attention_mask
            # 使用缩放点积注意力计算上下文层,传入注意力掩码和丢弃率
            context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
                                                                                 attention_mask,
                                                                                 dropout_p=self.config.attention_dropout if self.training else 0.0)
        # 转置上下文层的第1维和第2维,并确保内存连续
        context_layer = context_layer.transpose(1, 2).contiguous()
        # 生成新的上下文层形状,将最后两个维度替换为分区后的隐藏层大小
        new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
        # 按新的形状重塑上下文层
        context_layer = context_layer.reshape(*new_context_layer_shape)
        # 返回处理后的上下文层
        return context_layer
# 获取未填充的注意力数据
def _get_unpad_data(attention_mask):
    # 计算每个样本的序列长度,使用 int32 类型
    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
    # 找到注意力掩码中非零的索引,并扁平化
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    # 计算批次中最长的序列长度
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    # 计算累计序列长度,并在开头填充一个零
    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
    # 返回索引、累计序列长度和最大序列长度
    return (
        indices,
        cu_seqlens,
        max_seqlen_in_batch,
    )
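
# Worked example (illustrative, not part of the original file):
#   attention_mask = torch.tensor([[1, 1, 0],
#                                  [1, 1, 1]])
#   indices, cu_seqlens, max_seqlen = _get_unpad_data(attention_mask)
#   indices     -> tensor([0, 1, 3, 4, 5])      # flat positions of the valid (non-padding) tokens
#   cu_seqlens  -> tensor([0, 2, 5])            # cumulative sequence lengths with a leading zero
#   max_seqlen  -> 3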


# 从 transformers.models.llama.modeling_llama.LlamaFlashAttention2 复制而来
class FlashAttention2(CoreAttention):
    def __init__(self, *args, **kwargs):
        # 初始化基类
        super().__init__(*args, **kwargs)
        # 检查 Flash Attention 的版本以决定是否使用左上角掩码
        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()

    def forward(self, query_states, key_states, value_states, attention_mask):
        # 转置查询状态以符合 Flash Attention 的要求
        query_states = query_states.transpose(1, 2)
        # 转置键状态以符合 Flash Attention 的要求
        key_states = key_states.transpose(1, 2)
        # 转置值状态以符合 Flash Attention 的要求
        value_states = value_states.transpose(1, 2)
        # 获取批次大小和查询长度
        batch_size, query_length = query_states.shape[:2]
        # 根据 Flash Attention 的配置决定 causal 标志
        if not self._flash_attn_uses_top_left_mask:
            causal = self.is_causal
        else:
            # TODO: 一旦 Flash Attention 对 RoCm 的版本提升到 2.1,移除 `query_length != 1` 的检查
            causal = self.is_causal and query_length != 1
        # 设置 dropout 概率,根据训练状态决定
        dropout = self.config.attention_dropout if self.training else 0.0
        # 如果存在注意力掩码,则进行处理
        if attention_mask is not None:
            # 调用输入处理函数以获取未填充的输入
            query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
                query_states, key_states, value_states, attention_mask, query_length
            )

            # 解包累计序列长度
            cu_seqlens_q, cu_seqlens_k = cu_seq_lens
            # 解包最大序列长度
            max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens

            # 调用 Flash Attention 函数进行计算,使用未填充的状态
            attn_output_unpad = flash_attn_varlen_func(
                query_states,
                key_states,
                value_states,
                cu_seqlens_q=cu_seqlens_q,
                cu_seqlens_k=cu_seqlens_k,
                max_seqlen_q=max_seqlen_in_batch_q,
                max_seqlen_k=max_seqlen_in_batch_k,
                dropout_p=dropout,
                softmax_scale=None,
                causal=causal,
            )

            # 将未填充的注意力输出填充为最终输出
            attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
        else:
            # 如果没有注意力掩码,直接计算注意力输出
            attn_output = flash_attn_func(
                query_states, key_states, value_states, dropout, softmax_scale=None, causal=causal
            )
        # 重塑输出的形状以符合批次大小和查询长度
        attn_output = attn_output.reshape(batch_size, query_length, self.hidden_size_per_partition).contiguous()
        # 返回最终的注意力输出
        return attn_output
    # 更新输入的查询层、键层和值层,并处理注意力掩码和查询长度
    def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
        # 获取未填充数据的索引、当前序列长度和批次中最大序列长度
        indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
        # 获取键层的批次大小、键值序列长度、键值头数和头维度
        batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
    
        # 根据索引调整键层的形状
        key_layer = index_first_axis(
            key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
        )
        # 根据索引调整值层的形状
        value_layer = index_first_axis(
            value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
        )
        # 如果查询长度等于键值序列长度
        if query_length == kv_seq_len:
            # 根据索引调整查询层的形状
            query_layer = index_first_axis(
                query_layer.reshape(batch_size * kv_seq_len, self.num_attention_heads_per_partition, head_dim),
                indices_k
            )
            # 设置当前序列长度和最大序列长度为键的值
            cu_seqlens_q = cu_seqlens_k
            max_seqlen_in_batch_q = max_seqlen_in_batch_k
            indices_q = indices_k
        # 如果查询长度为1
        elif query_length == 1:
            # 最大序列长度设为1
            max_seqlen_in_batch_q = 1
            # 生成当前序列长度的范围
            cu_seqlens_q = torch.arange(
                batch_size + 1, dtype=torch.int32, device=query_layer.device
            )  # 这里有一个内存拷贝,性能较差。
            # 获取索引
            indices_q = cu_seqlens_q[:-1]
            # 压缩查询层的维度
            query_layer = query_layer.squeeze(1)
        else:
            # 根据查询长度调整注意力掩码(假设是左填充)
            attention_mask = attention_mask[:, -query_length:]
            # 去填充输入并获取相应的查询层和索引等
            query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
    
        # 返回更新后的查询层、键层、值层及其相关信息
        return (
            query_layer,
            key_layer,
            value_layer,
            indices_q,
            (cu_seqlens_q, cu_seqlens_k),
            (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
        )
# 定义核心注意力类的字典映射
CORE_ATTENTION_CLASSES = {
    "eager": CoreAttention,  # 将 "eager" 映射到 CoreAttention 类
    "sdpa": SdpaAttention,   # 将 "sdpa" 映射到 SdpaAttention 类
    "flash_attention_2": FlashAttention2  # 将 "flash_attention_2" 映射到 FlashAttention2 类
}

# 定义自注意力类,继承自 PyTorch 的模块
class SelfAttention(torch.nn.Module):
    """并行自注意力层抽象类。

    自注意力层接受大小为 [s, b, h] 的输入并返回相同大小的输出。
    """

    # 初始化方法
    def __init__(self, config: ChatGLMConfig, layer_number, device=None):
        super(SelfAttention, self).__init__()  # 调用父类初始化方法
        self.layer_number = max(1, layer_number)  # 确保层编号至少为1

        # 计算投影大小
        self.projection_size = config.kv_channels * config.num_attention_heads

        # 每个注意力头和每个分区的值
        self.hidden_size_per_attention_head = self.projection_size // config.num_attention_heads
        self.num_attention_heads_per_partition = config.num_attention_heads  # 每个分区的注意力头数量

        self.multi_query_attention = config.multi_query_attention  # 是否使用多查询注意力
        self.qkv_hidden_size = 3 * self.projection_size  # QKV的隐藏大小
        self.original_rope = config.original_rope  # 原始旋转位置编码配置
        if self.multi_query_attention:  # 如果使用多查询注意力
            self.num_multi_query_groups_per_partition = config.multi_query_group_num  # 每个分区的多查询组数量
            self.qkv_hidden_size = (  # 更新QKV的隐藏大小
                    self.projection_size + 2 * self.hidden_size_per_attention_head * config.multi_query_group_num
            )
        # 定义线性层以计算QKV
        self.query_key_value = nn.Linear(config.hidden_size, self.qkv_hidden_size,
                                         bias=config.add_bias_linear or config.add_qkv_bias,
                                         device=device, **_config_to_kwargs(config)
                                         )

        # 实例化核心注意力
        self.core_attention = CoreAttention(config, self.layer_number)

        # 定义输出层
        self.dense = nn.Linear(self.projection_size, config.hidden_size, bias=config.add_bias_linear,
                               device=device, **_config_to_kwargs(config)
                               )

    # 分配内存的方法
    def _allocate_memory(self, inference_max_sequence_len, batch_size, device=None, dtype=None):
        if self.multi_query_attention:  # 根据是否使用多查询注意力设置头数
            num_attention_heads = self.num_multi_query_groups_per_partition
        else:
            num_attention_heads = self.num_attention_heads_per_partition  # 设置为每个分区的注意力头数量
        return torch.empty(  # 返回空的张量用于存储注意力值
            inference_max_sequence_len,
            batch_size,
            num_attention_heads,
            self.hidden_size_per_attention_head,
            dtype=dtype,
            device=device,
        )

    # 前向传播方法
    def forward(
            self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True
# 定义配置转化为关键字参数的方法
def _config_to_kwargs(args):
    common_kwargs = {
        "dtype": args.torch_dtype,  # 将 PyTorch 数据类型作为关键字参数
    }
    return common_kwargs

# 定义多层感知器类,继承自 PyTorch 的模块
class MLP(torch.nn.Module):
    """多层感知器。

    MLP 将输入的隐藏状态 h 投影到 4*h 的隐藏维度,进行非线性变换,然后将状态投影回 h 的隐藏维度。
    """
    # 初始化 MLP 类,接受配置和可选设备参数
    def __init__(self, config: ChatGLMConfig, device=None):
        # 调用父类的初始化方法
        super(MLP, self).__init__()
    
        # 设置是否添加线性层的偏置
        self.add_bias = config.add_bias_linear
    
        # 创建一个线性层,将输入从隐层大小映射到 4h
        # 使用 SWIGLU 时,输出宽度加倍,详见相关文献
        self.dense_h_to_4h = nn.Linear(
            config.hidden_size,  # 输入特征数
            config.ffn_hidden_size * 2,  # 输出特征数
            bias=self.add_bias,  # 是否使用偏置
            device=device,  # 指定设备
            **_config_to_kwargs(config)  # 其他配置参数
        )
    
        # 定义 SWIGLU 激活函数
        def swiglu(x):
            # 将输入张量分成两部分
            x = torch.chunk(x, 2, dim=-1)
            # 返回激活函数的输出
            return F.silu(x[0]) * x[1]
    
        # 设置激活函数为 SWIGLU
        self.activation_func = swiglu
    
        # 创建一个线性层,将 4h 的输出映射回隐层大小
        self.dense_4h_to_h = nn.Linear(
            config.ffn_hidden_size,  # 输入特征数
            config.hidden_size,  # 输出特征数
            bias=self.add_bias,  # 是否使用偏置
            device=device,  # 指定设备
            **_config_to_kwargs(config)  # 其他配置参数
        )
    
    # 前向传播方法,处理隐藏状态
    def forward(self, hidden_states):
        # 将隐藏状态通过第一层线性变换
        # 输出形状为 [s, b, 4hp]
        intermediate_parallel = self.dense_h_to_4h(hidden_states)
        # 应用 SWIGLU 激活函数
        intermediate_parallel = self.activation_func(intermediate_parallel)
        # 将激活后的结果通过第二层线性变换
        # 输出形状为 [s, b, h]
        output = self.dense_4h_to_h(intermediate_parallel)
        # 返回最终输出
        return output
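
# Note (illustrative, not part of the original file): with SWIGLU the first linear layer emits
# 2 * ffn_hidden_size features, which are split into two halves a and b, and
#   swiglu([a, b]) = silu(a) * b
# so the effective shape flow is h -> 2 * ffn_hidden (gated down to ffn_hidden) -> h.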
# 定义一个单一的变换器层,继承自 PyTorch 的 Module 类
class GLMBlock(torch.nn.Module):
    """A single transformer layer.

    Transformer layer takes input with size [s, b, h] and returns an
    output of the same size.
    """

    # 初始化方法,设置层配置、层号和设备
    def __init__(self, config: ChatGLMConfig, layer_number, device=None):
        # 调用父类构造函数
        super(GLMBlock, self).__init__()
        # 保存层号
        self.layer_number = layer_number

        # 从配置中获取是否在层归一化后应用残差连接
        self.apply_residual_connection_post_layernorm = config.apply_residual_connection_post_layernorm

        # 从配置中获取是否使用32位浮点的残差连接
        self.fp32_residual_connection = config.fp32_residual_connection

        # 根据配置选择层归一化函数
        LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
        # 对输入数据应用层归一化
        self.input_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
                                             dtype=config.torch_dtype)

        # 自注意力层
        self.self_attention = SelfAttention(config, layer_number, device=device)
        # 隐藏层的丢弃率
        self.hidden_dropout = config.hidden_dropout

        # 在注意力输出后应用层归一化
        self.post_attention_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
                                                      dtype=config.torch_dtype)

        # 多层感知机
        self.mlp = MLP(config, device=device)

    # 前向传播方法
    def forward(
            self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True,
    ):
        # hidden_states: [s, b, h]

        # 在变换器层开始应用层归一化
        layernorm_output = self.input_layernorm(hidden_states)
        # 进行自注意力计算
        attention_output, kv_cache = self.self_attention(
            layernorm_output,
            attention_mask,
            rotary_pos_emb,
            kv_cache=kv_cache,
            use_cache=use_cache
        )

        # 残差连接
        if self.apply_residual_connection_post_layernorm:
            residual = layernorm_output
        else:
            residual = hidden_states

        # 应用丢弃,准备进行层归一化的输入
        layernorm_input = torch.nn.functional.dropout(attention_output, p=self.hidden_dropout, training=self.training)
        layernorm_input = residual + layernorm_input

        # 自注意力后的层归一化
        layernorm_output = self.post_attention_layernorm(layernorm_input)

        # 通过多层感知机计算输出
        mlp_output = self.mlp(layernorm_output)

        # 第二次残差连接
        if self.apply_residual_connection_post_layernorm:
            residual = layernorm_output
        else:
            residual = layernorm_input

        # 应用丢弃并计算最终输出
        output = torch.nn.functional.dropout(mlp_output, p=self.hidden_dropout, training=self.training)
        output = residual + output

        # 返回输出和键值缓存
        return output, kv_cache


# 定义变换器类,继承自 PyTorch 的 Module 类
class GLMTransformer(torch.nn.Module):
    """Transformer class."""
    # 初始化方法,接收配置和设备参数
    def __init__(self, config: ChatGLMConfig, device=None):
        # 调用父类的初始化方法
        super(GLMTransformer, self).__init__()

        # 设置 FP32 残差连接的配置
        self.fp32_residual_connection = config.fp32_residual_connection
        # 设置后层归一化的配置
        self.post_layer_norm = config.post_layer_norm

        # 获取层数
        # Number of layers.
        self.num_layers = config.num_layers

        # 定义构建层的方法
        # Transformer layers.
        def build_layer(layer_number):
            # 创建 GLMBlock 层实例
            return GLMBlock(config, layer_number, device=device)

        # 构建多个层并放入 ModuleList
        self.layers = torch.nn.ModuleList([build_layer(i + 1) for i in range(self.num_layers)])

        # 如果需要后层归一化
        if self.post_layer_norm:
            # 根据配置选择层归一化类型
            LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
            # 创建最终的层归一化实例,作为输出前的归一化
            # Final layer norm before output.
            self.final_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
                                                 dtype=config.torch_dtype)

        # 初始化梯度检查点标志为 False
        self.gradient_checkpointing = False

    # 获取指定层的方法
    def _get_layer(self, layer_number):
        # 返回对应编号的层
        return self.layers[layer_number]

    # 前向传播方法
    def forward(
            # 输入的隐藏状态
            self, hidden_states,
            # 注意力掩码
            attention_mask,
            # 旋转位置嵌入
            rotary_pos_emb,
            # 可选的键值缓存
            kv_caches=None,
            # 是否使用缓存的标志,默认 True
            use_cache: Optional[bool] = True,
            # 是否输出隐藏状态的标志,默认 False
            output_hidden_states: Optional[bool] = False,
    ):
        # 如果 kv_caches 为空,则为每层初始化为 None
        if not kv_caches:
            kv_caches = [None for _ in range(self.num_layers)]
        # 如果使用缓存,则初始化 presents 为空元组,否则为 None
        presents = () if use_cache else None
        # 如果开启梯度检查点并处于训练模式
        if self.gradient_checkpointing and self.training:
            # 如果使用缓存,则发出警告并禁用缓存
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        # 初始化所有自注意力的集合为 None
        all_self_attentions = None
        # 如果需要输出隐藏状态,则初始化为一个空元组,否则为 None
        all_hidden_states = () if output_hidden_states else None
        # 遍历每一层
        for index in range(self.num_layers):
            # 如果需要输出隐藏状态,则将当前隐藏状态添加到所有隐藏状态中
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            # 获取当前层
            layer = self._get_layer(index)
            # 如果开启梯度检查点并处于训练模式
            if self.gradient_checkpointing and self.training:
                # 使用检查点函数来计算当前层的输出
                layer_ret = torch.utils.checkpoint.checkpoint(
                    layer,
                    hidden_states,
                    attention_mask,
                    rotary_pos_emb,
                    kv_caches[index],
                    use_cache,
                    use_reentrant=False
                )
            else:
                # 直接调用当前层计算输出
                layer_ret = layer(
                    hidden_states,
                    attention_mask,
                    rotary_pos_emb,
                    kv_cache=kv_caches[index],
                    use_cache=use_cache
                )
            # 解包当前层的输出和缓存
            hidden_states, kv_cache = layer_ret
            # 如果使用缓存,则将当前缓存添加到 presents 中
            if use_cache:
                presents = presents + (kv_cache,)

        # 如果需要输出隐藏状态,则将最后的隐藏状态添加到所有隐藏状态中
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        # 最终的层归一化
        if self.post_layer_norm:
            hidden_states = self.final_layernorm(hidden_states)

        # 返回隐藏状态、缓存、所有隐藏状态和所有自注意力
        return hidden_states, presents, all_hidden_states, all_self_attentions
# 定义一个抽象类,处理权重初始化和预训练模型的下载与加载接口
class ChatGLMPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and
    a simple interface for downloading and loading pretrained models.
    """

    # 指示模型是否可以并行化
    is_parallelizable = False
    # 指示模型是否支持梯度检查点
    supports_gradient_checkpointing = True
    # 配置类为 ChatGLMConfig
    config_class = ChatGLMConfig
    # 基础模型前缀为 "transformer"
    base_model_prefix = "transformer"
    # 不可分割的模块列表
    _no_split_modules = ["GLMBlock"]
    # 支持 flash attention 2
    _supports_flash_attn_2 = True
    # 支持 SDPA
    _supports_sdpa = True

    # 初始化权重的方法
    def _init_weights(self, module: nn.Module):
        """Initialize the weights."""
        return

    # 获取输入的掩码
    def get_masks(self, input_embeds, past_key_values, padding_mask=None):
        # 获取批大小、序列长度和嵌入维度
        batch_size, seq_length, embed_size = input_embeds.shape
        # 创建全1的注意力掩码
        full_attention_mask = torch.ones(batch_size, seq_length, seq_length, device=input_embeds.device)
        # 变为下三角矩阵
        full_attention_mask.tril_()
        # 初始化过去的长度
        past_length = 0
        # 如果有过去的键值对,获取过去的长度
        if past_key_values:
            past_length = past_key_values[0][0].shape[2]
        # 如果过去的长度存在,拼接注意力掩码
        if past_length:
            full_attention_mask = torch.cat((torch.ones(batch_size, seq_length, past_length,
                                                        device=input_embeds.device), full_attention_mask), dim=-1)
        # 如果有填充掩码,进行相应的操作
        if padding_mask is not None:
            full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
        # 如果没有过去的长度且有填充掩码,调整全注意力掩码
        if not past_length and padding_mask is not None:
            full_attention_mask -= padding_mask.unsqueeze(-1) - 1
        # 将注意力掩码转为布尔类型
        full_attention_mask = (full_attention_mask < 0.5).bool()
        # 增加维度以适应后续操作
        full_attention_mask.unsqueeze_(1)
        # 返回最终的注意力掩码
        return full_attention_mask
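
    # Semantics sketch (illustrative, not part of the original file): the returned mask is boolean
    # with shape (batch, 1, seq_length, past_length + seq_length), and True marks positions that
    # must NOT be attended to. With seq_length=3, no cache and no padding, query 0 masks the two
    # future positions while the last query row is all False (everything visible).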

    # 获取位置 ID
    def get_position_ids(self, input_ids, device):
        # 获取批大小和序列长度
        batch_size, seq_length = input_ids.shape
        # 创建位置 ID 并扩展到批大小
        position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1)
        return position_ids

    # 获取多模态位置 ID
    def get_multimodal_position_ids(self, input_ids, device):
        # 获取批大小和序列长度
        batch_size, seq_length = input_ids.shape
        # 创建位置 ID 并扩展到批大小
        position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1)

# 定义嵌入类,继承自 torch.nn.Module
class Embedding(torch.nn.Module):
    """Language model embeddings."""

    # 初始化方法
    def __init__(self, config: ChatGLMConfig, device=None):
        super(Embedding, self).__init__()

        # 获取隐藏层大小
        self.hidden_size = config.hidden_size
        # 创建单词嵌入层(并行)
        self.word_embeddings = nn.Embedding(
            config.padded_vocab_size,
            self.hidden_size,
            dtype=config.torch_dtype,
            device=device
        )
        # 是否使用 fp32 残差连接
        self.fp32_residual_connection = config.fp32_residual_connection

    # 前向传播方法
    def forward(self, input_ids):
        # 获取单词嵌入
        words_embeddings = self.word_embeddings(input_ids)
        # 设置嵌入值
        embeddings = words_embeddings
        # 如果设置了 fp32 残差连接,将嵌入转换为浮点型
        if self.fp32_residual_connection:
            embeddings = embeddings.float()
        # 返回嵌入
        return embeddings


# 检查图像列表是否为空
def is_empty(images_list: Optional[List[List[torch.Tensor]]]):
    # 检查 images_list 是否为 None 或者为空列表
    if images_list is None or len(images_list) == 0:
        # 如果是,返回 True
        return True
    # 遍历 images_list 中的每个 image_list
    for image_list in images_list:
        # 如果 image_list 不是 None
        if image_list is not None:
            # 返回 False,表示存在有效的 image_list
            return False
    # 如果所有 image_list 都是 None,返回 True
    return True
# 定义 ChatGLMModel 类,继承自 ChatGLMPreTrainedModel
class ChatGLMModel(ChatGLMPreTrainedModel):
    # 初始化方法,接受配置、设备和空初始化标志
    def __init__(self, config: ChatGLMConfig, device=None, empty_init=True):
        # 调用父类的初始化方法,传入配置
        super().__init__(config)
        # 根据空初始化标志选择初始化方法
        if empty_init:
            init_method = skip_init
        else:
            init_method = default_init
        # 初始化关键字参数字典
        init_kwargs = {}
        # 如果设备不为 None,将其加入初始化参数
        if device is not None:
            init_kwargs["device"] = device
        # 使用初始化方法创建嵌入层
        self.embedding = init_method(Embedding, config, **init_kwargs)
        # 获取层数配置
        self.num_layers = config.num_layers
        # 获取多查询组数配置
        self.multi_query_group_num = config.multi_query_group_num
        # 获取 KV 通道数配置
        self.kv_channels = config.kv_channels

        # 旋转位置嵌入
        self.seq_length = config.seq_length
        # 根据注意力头数或 KV 通道数计算旋转维度
        rotary_dim = (
            config.hidden_size // config.num_attention_heads if config.kv_channels is None else config.kv_channels
        )

        # 创建旋转位置嵌入对象
        self.rotary_pos_emb = RotaryEmbedding(rotary_dim // 2, rope_ratio=config.rope_ratio,
                                              original_impl=config.original_rope,
                                              device=device, dtype=config.torch_dtype)
        # 使用初始化方法创建 GLMTransformer 编码器
        self.encoder = init_method(GLMTransformer, config, **init_kwargs)
        # 使用初始化方法创建输出层
        self.output_layer = init_method(nn.Linear, config.hidden_size, config.padded_vocab_size, bias=False,
                                        dtype=config.torch_dtype, **init_kwargs)
        # 获取预序列长度配置
        self.pre_seq_len = config.pre_seq_len
        # 获取前缀投影配置
        self.prefix_projection = config.prefix_projection
        # 如果预序列长度不为 None
        if self.pre_seq_len is not None:
            # 将所有参数的 requires_grad 设置为 False
            for param in self.parameters():
                param.requires_grad = False
            # 创建前缀令牌的张量
            self.prefix_tokens = torch.arange(self.pre_seq_len).long()
            # 创建前缀编码器
            self.prefix_encoder = PrefixEncoder(config)
            # 初始化 Dropout 层
            self.dropout = torch.nn.Dropout(0.1)

        # 创建视觉模型
        self.vision = EVA2CLIPModel(config)

    # 获取输入嵌入的方法
    def get_input_embeddings(self):
        return self.embedding.word_embeddings

    # 设置输入嵌入的方法
    def set_input_embeddings(self, value):
        self.embedding.word_embeddings = value

    # 获取提示的方法,接受批大小、设备和数据类型
    def get_prompt(self, batch_size, device, dtype=torch.half):
        # 扩展前缀令牌的维度以匹配批大小
        prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(device)
        # 通过前缀编码器处理前缀令牌并转换数据类型
        past_key_values = self.prefix_encoder(prefix_tokens).type(dtype)
        # 重新排列过去的关键值的维度
        past_key_values = past_key_values.view(
            batch_size,
            self.pre_seq_len,
            self.num_layers * 2,
            self.multi_query_group_num,
            self.kv_channels
        )
        # 应用 Dropout 层
        past_key_values = self.dropout(past_key_values)
        # 调整维度顺序并分割
        past_key_values = past_key_values.permute([2, 1, 0, 3, 4]).split(2)
        # 返回处理后的过去关键值
        return past_key_values
    # 定义一个前向传播函数,接受多个输入参数
    def forward(
            # 输入 ID,类型为长整型张量,默认为 None
            self,
            input_ids: torch.LongTensor = None,
            # 输入图像,类型为张量,默认为 None
            images: torch.Tensor = None,
            # 位置 ID,类型为可选张量,默认为 None
            position_ids: Optional[torch.Tensor] = None,
            # 注意力掩码,类型为可选布尔张量,默认为 None
            attention_mask: Optional[torch.BoolTensor] = None,
            # 完整注意力掩码,类型为可选布尔张量,默认为 None
            full_attention_mask: Optional[torch.BoolTensor] = None,
            # 过去的键值对,类型为可选元组,默认为 None
            past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
            # 输入嵌入,类型为可选张量,默认为 None
            inputs_embeds: Optional[torch.Tensor] = None,
            # 是否使用缓存,类型为可选布尔值,默认为 None
            use_cache: Optional[bool] = None,
            # 是否输出隐藏状态,类型为可选布尔值,默认为 None
            output_hidden_states: Optional[bool] = None,
            # 是否以字典格式返回结果,类型为可选布尔值,默认为 None
            return_dict: Optional[bool] = None,
# 将历史对话转换为提示字符串,包含用户和助手的对话内容
def _history_to_prompt(history, query):
    # 初始化提示字符串为空
    prompt = ''
    # 标记是否已有历史查询
    flag = False
    # 遍历历史对话,索引和内容
    for i, (old_query, response) in enumerate(history):
        # 添加用户查询和助手响应到提示中,依据标记决定用户标签的添加
        prompt += ('<|user|>' if flag else '') + old_query + "<|assistant|>" + response + "<|endoftext|>"
        # 更新标记为 True,表示已有查询
        flag = True
    # 添加最新查询到提示中,依据标记决定用户标签的添加
    prompt += '{}{}<|assistant|>'.format('<|user|>' if flag else '', query)
    # 返回最终的提示字符串
    return prompt
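
# Example (illustrative, not part of the original file):
#   _history_to_prompt([("Hi", "Hello!")], "Who are you?")
#   -> 'Hi<|assistant|>Hello!<|endoftext|><|user|>Who are you?<|assistant|>'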


# Conditional generation (chat) model, inheriting from the pretrained base class
class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
    # Initialize the model from a config, with optional empty init and target device
    def __init__(self, config: ChatGLMConfig, empty_init=True, device=None):
        # Initialize the pretrained base class
        super().__init__(config)

        # Maximum sequence length used during generation
        self.max_sequence_length = config.max_length
        # Underlying multimodal transformer
        self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
        # Keep a reference to the config
        self.config = config

    # Update model kwargs between generation steps (cache, attention mask, position ids)
    def _update_model_kwargs_for_generation(
            self,
            outputs: ModelOutput,
            model_kwargs: Dict[str, Any],
            is_encoder_decoder: bool = False,
    ) -> Dict[str, Any]:
        # Extract the newly produced key/value cache from the model output
        cache_name, cache = self._extract_past_from_model_output(outputs)
        # Store the cache back into the model kwargs
        model_kwargs[cache_name] = cache

        # Extend the attention mask by one position for the newly generated token
        if "attention_mask" in model_kwargs:
            attention_mask = model_kwargs["attention_mask"]
            model_kwargs["attention_mask"] = torch.cat(
                [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
            )

        # Extend the position ids: copy the last position id and add 1
        if "position_ids" in model_kwargs:
            position_ids = model_kwargs["position_ids"]
            new_position_id = position_ids[..., -1:].clone()
            new_position_id += 1
            model_kwargs["position_ids"] = torch.cat(
                [position_ids, new_position_id], dim=-1
            )

        # Mark subsequent calls as non-first forward passes
        model_kwargs["is_first_forward"] = False
        # Return the updated kwargs
        return model_kwargs
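
    # --- Illustrative sketch (not part of the upstream source): per generation step the
    # attention mask gains one trailing 1 and the position ids gain last + 1,
    # e.g. mask [1, 1, 1] -> [1, 1, 1, 1] and positions [0, 1, 2] -> [0, 1, 2, 3].
    @staticmethod
    def _update_kwargs_growth_sketch():
        mask = torch.ones(1, 3, dtype=torch.long)
        positions = torch.arange(3).unsqueeze(0)
        mask = torch.cat([mask, mask.new_ones((mask.shape[0], 1))], dim=-1)
        positions = torch.cat([positions, positions[..., -1:] + 1], dim=-1)
        assert mask.shape == (1, 4) and positions[0, -1].item() == 3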

    # Prepare the inputs for one generation step
    def prepare_inputs_for_generation(
            self,
            input_ids: torch.LongTensor,
            images: Optional[torch.Tensor] = None,
            past_key_values: Optional[torch.Tensor] = None,
            attention_mask: Optional[torch.Tensor] = None,
            position_ids: Optional[torch.Tensor] = None,
            use_cache: Optional[bool] = None,
            is_first_forward: bool = True,
            **kwargs
    ) -> dict:
        # If no position ids were provided, derive them from the input ids
        if position_ids is None:
            position_ids = self.get_position_ids(input_ids, device=input_ids.device)
        if attention_mask is not None:
            # Image size from the vision config
            image_size: int = self.config.vision_config['image_size']
            # Patch size from the vision config
            patch_size: int = self.config.vision_config['patch_size']
            # Number of image patch positions inserted into the sequence
            num_patches = (image_size // patch_size // 2) ** 2
            new_attention_masks = []  # Collects the expanded per-sample masks

            # Default BOI/EOI positions used when no image is present
            eoi_token_pos = 6  # position of the end-of-image token
            boi_token_pos = 4  # position of the begin-of-image token

            # Expand the attention mask of every sample in the batch
            for i in range(len(input_ids)):
                # Current sample's token ids as a Python list
                input_id = input_ids[i].tolist()
                # If images are provided, locate the BOI and EOI tokens
                if not is_empty(images):
                    boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(
                        self.config.eoi_token_id)
                # The BOI and EOI tokens are expected to be exactly two positions apart
                assert eoi_token_pos - boi_token_pos == 2
                # Splice num_patches ones in between the BOI and EOI positions
                new_attention_masks.append(torch.cat(
                    (attention_mask[i, :boi_token_pos + 1], attention_mask.new_ones(num_patches),
                     attention_mask[i, eoi_token_pos:])
                ))
            # Stack the per-sample masks back into a batch tensor
            attention_mask = torch.stack(new_attention_masks, dim=0)
        # After the first forward pass, only the last token needs to be fed
        if not is_first_forward:
            if past_key_values is not None:
                # Keep only the last position id
                position_ids = position_ids[..., -1:]
                # Keep only the last input id
                input_ids = input_ids[:, -1:]
        # Return everything the forward pass needs
        return {
            "input_ids": input_ids,  # input token ids
            "images": images,  # input images
            "past_key_values": past_key_values,  # past key/values
            "position_ids": position_ids,  # position ids
            "attention_mask": attention_mask,  # attention mask
            "return_last_logit": True,  # return only the last logit
            "use_cache": use_cache  # whether to use the KV cache
        }
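
    # --- Illustrative sketch (not part of the upstream source): assuming the GLM-4V-9B vision
    # config uses image_size=1120 and patch_size=14, the mask is widened by
    # (1120 // 14 // 2) ** 2 = 1600 positions -- matching the 1600 ignore labels used in forward below.
    @staticmethod
    def _num_patches_sketch(image_size: int = 1120, patch_size: int = 14) -> int:
        num_patches = (image_size // patch_size // 2) ** 2
        return num_patches  # 1600 for the assumed config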

    # Forward pass for conditional generation
    def forward(
            self,
            input_ids: Optional[torch.Tensor] = None,  # optional input token ids
            images: List[List[torch.Tensor]] = None,  # optional list of input images
            position_ids: Optional[torch.Tensor] = None,  # optional position ids
            attention_mask: Optional[torch.Tensor] = None,  # optional attention mask
            past_key_values: Optional[Tuple[torch.FloatTensor]] = None,  # optional past key/values
            inputs_embeds: Optional[torch.Tensor] = None,  # optional input embeddings
            labels: Optional[torch.Tensor] = None,  # optional labels
            use_cache: Optional[bool] = None,  # whether to use the KV cache
            output_attentions: Optional[bool] = None,  # whether to output attentions
            output_hidden_states: Optional[bool] = None,  # whether to output hidden states
            return_dict: Optional[bool] = None,  # whether to return a dict
            return_last_logit: Optional[bool] = False,  # whether to return only the last logit
    ):
        # Fall back to the config defaults when use_cache / return_dict are not given
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # Run the underlying transformer with all inputs
        transformer_outputs = self.transformer(
            input_ids=input_ids,
            images=images,
            position_ids=position_ids,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        # Hidden states from the transformer output
        hidden_states = transformer_outputs[0]
        # If only the last logit is needed, keep just the last time step
        if return_last_logit:
            hidden_states = hidden_states[:, -1:]
        # Project hidden states to vocabulary logits through the output layer
        lm_logits = self.transformer.output_layer(hidden_states)

        # Loss defaults to None
        loss = None
        # Compute the loss when labels are provided
        if labels is not None:
            # Build new labels aligned with the image-expanded sequence
            new_labels = []
            for i in range(len(input_ids)):
                # Current sample's token ids as a Python list
                input_id = input_ids[i].tolist()
                # Locate the BOI and EOI tokens
                boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(
                    self.config.eoi_token_id)
                # The BOI and EOI tokens are expected to be exactly two positions apart
                assert eoi_token_pos - boi_token_pos == 2

                # Keep labels before BOI and after EOI; fill the 1600 image-patch positions with -100
                new_labels.append(torch.cat(
                    (
                        labels[i, :boi_token_pos + 1],
                        torch.tensor([-100]).to(labels.device).to(labels.dtype).repeat(1600),
                        labels[i, eoi_token_pos:])))

            # Stack the adjusted labels back into a batch tensor
            labels = torch.stack(new_labels, dim=0)
            # Compute the loss in float32
            lm_logits = lm_logits.to(torch.float32)
            # Shift logits and labels by one position so each token predicts the next one
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Cross-entropy loss that ignores positions labeled -100
            loss_fct = CrossEntropyLoss(ignore_index=-100)
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

            # Cast logits and loss back to the hidden-state dtype
            lm_logits = lm_logits.to(hidden_states.dtype)
            loss = loss.to(hidden_states.dtype)

        # When not returning a dict, assemble a plain tuple
        if not return_dict:
            output = (lm_logits,) + transformer_outputs[1:]
            # Prepend the loss when it exists
            return ((loss,) + output) if loss is not None else output

        # Return a CausalLMOutputWithPast with loss, logits and the remaining transformer outputs
        return CausalLMOutputWithPast(
            loss=loss,
            logits=lm_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
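
    # --- Illustrative sketch (not part of the upstream source): the shift-by-one cross-entropy
    # used above, with -100 positions ignored. Toy sizes only; no model weights involved.
    @staticmethod
    def _shifted_loss_sketch():
        vocab, seq = 11, 5
        logits = torch.randn(1, seq, vocab)
        labels = torch.tensor([[3, -100, -100, 7, 2]])  # -100 marks ignored (e.g. image-patch) positions
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss = CrossEntropyLoss(ignore_index=-100)(
            shift_logits.view(-1, vocab), shift_labels.view(-1))
        return loss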
    
    # Static method used to reorder the cache for beam search
    @staticmethod
    def _reorder_cache(
            past: Tuple[Tuple[torch.Tensor, torch.Tensor], ...],  # past key/value pairs
            beam_idx: torch.LongTensor  # beam indices
    ) -> Tuple[Tuple[torch.Tensor, torch.Tensor], ...]:
        """
        This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
        [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct
        beam_idx at every generation step.

        Output shares the same memory storage as `past`.
        """
        # For every layer, gather the key and value entries belonging to the selected beams
        return tuple(
            (
                layer_past[0].index_select(0, beam_idx.to(layer_past[0].device)),
                layer_past[1].index_select(0, beam_idx.to(layer_past[1].device)),
            )
            for layer_past in past
        )
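
    # --- Illustrative sketch (not part of the upstream source): reordering a one-layer toy cache
    # with index_select, mirroring _reorder_cache above.
    @staticmethod
    def _reorder_cache_sketch():
        past = ((torch.arange(6).view(3, 2), torch.arange(6).view(3, 2) + 10),)  # one layer: (key, value)
        beam_idx = torch.tensor([2, 0, 1])
        reordered = tuple(
            (k.index_select(0, beam_idx), v.index_select(0, beam_idx)) for k, v in past
        )
        assert reordered[0][0][0, 0].item() == 4  # row 2 of the key moved to position 0
        return reordered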
# Sequence classification model, inheriting from the pretrained base class
class ChatGLMForSequenceClassification(ChatGLMPreTrainedModel):
    # Initialize from a config, with optional empty init and target device
    def __init__(self, config: ChatGLMConfig, empty_init=True, device=None):
        # Initialize the pretrained base class
        super().__init__(config)

        # Number of classification labels
        self.num_labels = config.num_labels
        # Underlying transformer model
        self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)

        # Linear classification head: hidden_size -> num_labels
        self.classifier_head = nn.Linear(config.hidden_size, config.num_labels, bias=True, dtype=torch.half)
        # Optional dropout before the classification head
        if config.classifier_dropout is not None:
            self.dropout = nn.Dropout(config.classifier_dropout)
        # Otherwise no dropout is applied
        else:
            self.dropout = None
        # Keep a reference to the config
        self.config = config

    # Forward pass for sequence classification
    def forward(
            self,
            input_ids: Optional[torch.LongTensor] = None,
            position_ids: Optional[torch.LongTensor] = None,
            attention_mask: Optional[torch.Tensor] = None,
            full_attention_mask: Optional[torch.Tensor] = None,
            past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
            inputs_embeds: Optional[torch.LongTensor] = None,
            labels: Optional[torch.LongTensor] = None,
            use_cache: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor, ...], SequenceClassifierOutputWithPast]:
        # Fall back to the config default when return_dict is not given
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # Run the underlying transformer
        transformer_outputs = self.transformer(
            input_ids=input_ids,  # input token ids
            position_ids=position_ids,  # position ids
            attention_mask=attention_mask,  # attention mask
            full_attention_mask=full_attention_mask,  # full attention mask
            past_key_values=past_key_values,  # past key/values
            inputs_embeds=inputs_embeds,  # input embeddings
            use_cache=use_cache,  # whether to use the KV cache
            output_hidden_states=output_hidden_states,  # whether to output hidden states
            return_dict=return_dict,  # whether to return a dict
        )

        # Hidden states from the transformer output
        hidden_states = transformer_outputs[0]
        # Use the last hidden state as the pooled representation
        pooled_hidden_states = hidden_states[-1]
        # Apply dropout to the pooled representation if configured
        if self.dropout is not None:
            pooled_hidden_states = self.dropout(pooled_hidden_states)
        # Compute classification logits through the classifier head
        logits = self.classifier_head(pooled_hidden_states)

        # Loss defaults to None
        loss = None
        # Compute the loss when labels are provided
        if labels is not None:
            # Infer the problem type from num_labels and the label dtype if it is not set
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"  # regression
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"  # single-label classification
                else:
                    self.config.problem_type = "multi_label_classification"  # multi-label classification

            # Pick the loss function according to the problem type
            if self.config.problem_type == "regression":
                loss_fct = MSELoss()  # mean squared error
                if self.num_labels == 1:
                    # Single-target regression
                    loss = loss_fct(logits.squeeze().float(), labels.squeeze())
                else:
                    # Multi-target regression
                    loss = loss_fct(logits.float(), labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()  # cross-entropy
                loss = loss_fct(logits.view(-1, self.num_labels).float(), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()  # binary cross-entropy with logits
                loss = loss_fct(logits.float(), labels.view(-1, self.num_labels))

        # When not returning a dict, assemble a plain tuple
        if not return_dict:
            output = (logits,) + transformer_outputs[1:]
            # Prepend the loss when it exists
            return ((loss,) + output) if loss is not None else output

        # Return a SequenceClassifierOutputWithPast with loss, logits and the remaining outputs
        return SequenceClassifierOutputWithPast(
            loss=loss,  # loss
            logits=logits,  # predicted logits
            past_key_values=transformer_outputs.past_key_values,  # past key/values
            hidden_states=transformer_outputs.hidden_states,  # hidden states
            attentions=transformer_outputs.attentions,  # attentions
        )
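
# --- Illustrative sketch (not part of the upstream source): how the problem_type rule above
# maps (num_labels, label dtype) to a loss, using plain tensors instead of the model.
def _classification_loss_sketch():
    logits = torch.randn(4, 3)                       # batch of 4, 3 labels
    int_labels = torch.tensor([0, 2, 1, 1])          # long labels -> single_label_classification
    multi_hot = torch.randint(0, 2, (4, 3)).float()  # float labels -> multi_label_classification
    ce = CrossEntropyLoss()(logits.view(-1, 3).float(), int_labels.view(-1))
    bce = BCEWithLogitsLoss()(logits.float(), multi_hot.view(-1, 3))
    return ce, bce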