Troubleshooting errors when running Qwen1.5-7B-Chat inference with text-generation-webui
Inference code: text-generation-webui
Inference model: Qwen1.5-7B-Chat
sys info
gpu: Tesla V100-PCIE-32GB
python: 3.10
model: Qwen1.5-7B-Chat
docker: docker run -it --rm --gpus='"device=0,3"' -v /root/wangbing/model/Qwen-7B-Chat/V1/:/data/mlops/modelDir -v /root/wangbing/sftmodel/qwen/V1:/data/mlops/adapterDir/ -p30901:5000 -p7901:7860 dggecr01.huawei.com:80/tbox/text-generation-webui:at-0.0.1 bash
app: python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code
cuda (nvcc --version): 11.8
torch (import torch; print(torch.__version__)): 13.1
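A quick way to confirm the toolkit versions and the GPU's compute capability from inside the container (a minimal sketch using only standard torch APIs). The compute capability matters for the FlashAttention error in section 4 below: V100 reports 7.0, while FlashAttention requires Ampere (8.0) or newer.

```python
import torch

# Minimal environment check (run inside the container).
print(torch.__version__)                    # PyTorch version
print(torch.version.cuda)                   # CUDA version PyTorch was built against
print(torch.cuda.is_available())            # True if the mapped GPUs are visible
print(torch.cuda.get_device_name(0))        # e.g. "Tesla V100-PCIE-32GB"
print(torch.cuda.get_device_capability(0))  # (7, 0) on V100; FlashAttention needs (8, 0)+
```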
1 Path error

root@92c536e270d3:/app# python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/modelDir/qwen --model Qwen1.5-72B-Chat --trust-remote-code
10:17:25-241308 INFO Starting Text generation web UI
10:17:25-244433 WARNING trust_remote_code is enabled. This is dangerous.
10:17:25-245304 WARNING You are potentially exposing the web UI to the entire internet without any access password. You can create one with the "--gradio-auth" flag like this: --gradio-auth username:password. Make sure to replace username:password with your own.
10:17:25-248394 INFO Loading "Qwen1.5-72B-Chat"
10:17:25-250000 ERROR The path to the model does not exist. Exiting.
Traceback (most recent call last)
  /app/server.py:241 in <module>
    shared.model, shared.tokenizer = load_model(model_name)
  /app/modules/models.py:84 in load_model
    logger.error('The path to the model does not exist. Exiting.')
    raise ValueError
ValueError

`--model` must be the name of the last-level model directory, and `--model-dir` must point to the directory one level above it (the second-to-last level), i.e. `--model-dir /data/mlops/ --model modelDir` as in the launch command above.
2 Missing dependency
ImportError: This modeling file requires the following packages that were not found in your environment: transformers_stream_generator. Run `pip install transformers_stream_generator`
Traceback (most recent call last)
  /app/server.py:241 in <module>
    shared.model, shared.tokenizer = load_model(model_name)
  /app/modules/models.py:87 in load_model
    output = load_func_map[loader](model_name)
  /app/modules/models.py:235 in huggingface_loader
    model = LoaderClass.from_pretrained(path_to_model, **params)
  /venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:548 in from_pretrained
    model_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, code_revision=code_revision, ...)
  /venv/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:488 in get_class_from_dynamic_module
    final_module = get_cached_module_file(repo_id, ...)
  /venv/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:314 in get_cached_module_file
    modules_needed = check_imports(resolved_module_file)
  /venv/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:180 in check_imports
    raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: transformers_stream_generator. Run `pip install transformers_stream_generator`

Fix: install the missing package inside the container, as the message suggests: pip install transformers_stream_generator
3 C compiler error
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
  output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/qwen/modeling_qwen.py", line 432, in forward
  query = apply_rotary_pos_emb(query, q_pos_emb)
File "/root/.cache/huggingface/modules/transformers_modules/qwen/modeling_qwen.py", line 1342, in apply_rotary_pos_emb
  return apply_rotary_emb_func(t_float, cos, sin).type_as(t)
File "/venv/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 122, in apply_rotary_emb
  return ApplyRotaryEmb.apply(
File "/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
  return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/venv/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 48, in forward
  out = apply_rotary(
File "/venv/lib/python3.10/site-packages/flash_attn/ops/triton/rotary.py", line 213, in apply_rotary
  rotary_kernel[grid](
File "/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 532, in run
  self.cache[device][key] = compile(
File "/venv/lib/python3.10/site-packages/triton/compiler/compiler.py", line 614, in compile
  so_path = make_stub(name, signature, constants, ids, enable_warp_specialization=enable_warp_specialization)
File "/venv/lib/python3.10/site-packages/triton/compiler/make_launcher.py", line 37, in make_stub
  so = _build(name, src_path, tmpdir)
File "/venv/lib/python3.10/site-packages/triton/common/build.py", line 83, in _build
  raise RuntimeError("Failed to find C compiler. Please specify via CC environment variable.")
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
Output generated in 2.18 seconds (0.00 tokens/s, 0 tokens, context 60, seed 589520803)
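Triton compiles its kernel launcher stub with a host C compiler, so the error means no compiler is visible in the container. Installing gcc (for example via build-essential) or exporting CC to an existing compiler should clear it. A small sketch to confirm the diagnosis, using only the Python standard library:

```python
import os
import shutil

# Triton looks for a host C compiler when building its launcher stub.
print("CC =", os.environ.get("CC"))   # explicit compiler override, if any
print("cc  ->", shutil.which("cc"))   # None means no compiler on PATH
print("gcc ->", shutil.which("gcc"))
```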
4 RuntimeError: FlashAttention only supports Ampere GPUs or newer.
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
  attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
  return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/qwen/modeling_qwen.py", line 191, in forward
  output = flash_attn_func(q, k, v, dropout_p, softmax_scale=self.softmax_scale, causal=self.causal)
File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 825, in flash_attn_func
  return FlashAttnFunc.apply(
File "/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
  return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 507, in forward
  out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
  out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Output generated in 1.67 seconds (0.00 tokens/s, 0 tokens, context 51, seed 872380422)
Fix: in the model's config.json, set use_flash_attn to false, i.e. give up accelerated attention for now.
The V100 used here predates Ampere and does not support flash_attn, so the only option is to turn it off.
https://github.com/oobabooga/text-generation-webui/issues/5985
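A minimal sketch of flipping that flag in config.json programmatically; the path below is an assumption based on the volume mounts used above, so adjust it to your model directory.

```python
import json

cfg_path = "/data/mlops/modelDir/qwen/config.json"  # assumed model directory

with open(cfg_path, "r", encoding="utf-8") as f:
    cfg = json.load(f)

# V100 (compute capability 7.0) is pre-Ampere, so FlashAttention must stay off.
cfg["use_flash_attn"] = False

with open(cfg_path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
```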
5 Generation does not stop
5.1 UI-side settings:
Custom stopping strings: "<|im_start|>", "<|im_end|>", "<|endoftext|>"
Skip special tokens: false
5.2 API: set the same stopping behaviour through the request parameters, as sketched below.
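Sketch of the same stop configuration over the OpenAI-compatible API that --api exposes (port 5000 in the container, mapped to host port 30901 by the docker command above). The endpoint path and the skip_special_tokens passthrough are my assumptions about this webui build; adjust if your version differs.

```python
import requests

url = "http://127.0.0.1:30901/v1/chat/completions"  # assumed host port mapping from the docker command
payload = {
    "messages": [{"role": "user", "content": "你好，介绍一下你自己"}],
    "max_tokens": 512,
    # Stop on the ChatML special tokens so generation does not run past the reply.
    "stop": ["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    # Keep special tokens in the output so the stop strings can match (assumed passthrough parameter).
    "skip_special_tokens": False,
}
resp = requests.post(url, json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```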
6 Error when loading a LoRA for inference
TypeError: LoraConfig.__init__() got an unexpected keyword argument 'layer_replication'
09:07:09-714849 INFO Applying the following LoRAs to modelDir: adapterDir
Traceback (most recent call last)
  /app/server.py:243 in <module>
    add_lora_to_model(shared.args.lora)
  /app/modules/LoRA.py:18 in add_lora_to_model
    add_lora_transformers(lora_names)
  /app/modules/LoRA.py:125 in add_lora_transformers
    shared.model = PeftModel.from_pretrained(shared.model, get_lora_path(lora_names[0]), ...)
  /venv/lib/python3.10/site-packages/peft/peft_model.py:325 in from_pretrained
    config = PEFT_TYPE_TO_CONFIG_MAPPING[PeftConfig._get_peft_type(...)]
  /venv/lib/python3.10/site-packages/peft/config.py:152 in from_pretrained
    return cls.from_peft_type(**kwargs)
  /venv/lib/python3.10/site-packages/peft/config.py:119 in from_peft_type
    return config_cls(**kwargs)
TypeError: LoraConfig.__init__() got an unexpected keyword argument 'layer_replication'
Switch the peft version. The layer_replication field in the adapter's adapter_config.json was written by a different peft release than the one installed, so the installed LoraConfig rejects it; changing the peft version cleared the error here:
pip install peft==0.5.0
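To confirm it really is a version mismatch, you can check the installed peft version against the fields written into the adapter's adapter_config.json. The adapter path below is a placeholder matching the mount used above.

```python
import json
import peft

print(peft.__version__)  # installed peft release

# layer_replication only exists in newer peft releases; an older LoraConfig rejects it.
with open("/data/mlops/adapterDir/adapter_config.json", encoding="utf-8") as f:
    adapter_cfg = json.load(f)
print("layer_replication" in adapter_cfg)
```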
7 Error when loading multiple LoRAs
INFO Starting Text generation web UI
09:33:33-611270 WARNING trust_remote_code is enabled. This is dangerous.
09:33:33-612231 WARNING You are potentially exposing the web UI to the entire internet without any access password. You can create one with the "--gradio-auth" flag like this: --gradio-auth username:password
09:33:33-615970 INFO Loading "Qwen-7B-Chat-sft"
Loading checkpoint shards: 100% 8/8 [00:08<00:00, 1.01s/it]
09:33:42-689837 INFO LOADER: "Transformers"
09:33:42-691038 INFO TRUNCATION LENGTH: 32768
09:33:42-691978 INFO INSTRUCTION TEMPLATE: "ChatML"
09:33:42-692777 INFO Loaded the model in 9.08 seconds
09:33:42-693596 INFO Applying the following LoRAs to Qwen-7B-Chat-sft: lora_adapter, lora_adapter2
Traceback (most recent call last)
  /app/server.py:243 in <module>
    add_lora_to_model(shared.args.lora)
  /app/modules/LoRA.py:18 in add_lora_to_model
    add_lora_transformers(lora_names)
  /app/modules/LoRA.py:130 in add_lora_transformers
    merge_loras()
  /app/modules/LoRA.py:152 in merge_loras
    shared.model.add_weighted_adapter(shared.lora_names, [1] * len(shared.lora_names), "__merged")
  /venv/lib/python3.10/site-packages/peft/tuners/lora.py:660 in add_weighted_adapter
    new_rank = svd_rank or max(adapters_ranks)
ValueError: max() arg is an empty sequence
https://github.com/oobabooga/text-generation-webui/issues/4371
Not solved, so I merged the adapters manually instead.
I am not sure whether merging LoRA adapters ahead of time and loading the LoRAs at inference time give exactly the same result.
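Since the in-webui merge fails, here is a rough offline sketch of folding the adapters into the base weights with peft and then pointing text-generation-webui at the merged directory. All paths are placeholders; the adapters are merged sequentially with full weight, which (as the quoted comment below points out) is not necessarily the combination you actually want.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/data/mlops/modelDir/qwen"          # base model directory (placeholder)
adapter_paths = [
    "/data/mlops/adapterDir/lora_adapter",       # placeholder adapter directories
    "/data/mlops/adapterDir/lora_adapter2",
]
out_path = "/data/mlops/modelDir/qwen-merged"

model = AutoModelForCausalLM.from_pretrained(base_path, trust_remote_code=True, torch_dtype="auto")
for adapter in adapter_paths:
    # Wrap the current weights with one adapter, then fold it in permanently.
    model = PeftModel.from_pretrained(model, adapter)
    model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
model.save_pretrained(out_path)
tokenizer.save_pretrained(out_path)
# Serve the result with --model-dir /data/mlops/modelDir --model qwen-merged
```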
Error when loading multiple LoRAs on transformers adapter · Issue #4371 · oobabooga/text-generation-webui (github.com)
To quote a fellow GitHub user from that issue:
I wrote in (#3120) that the PR was using outdated PEFT code, but it was merged anyway. So ¯\_(ツ)_/¯
The problems with this approach go far beyond making the merge work. For example, the next time you try to merge LoRAs into the same adapter name, add_weighted_adapter silently bails out, leaving the user thinking the new adapter was applied when in fact nothing happened, and so on... none of that is handled in the merge at all.
But I don't know how to convince people that this is the wrong approach.
- The LoRA dropdown in main should only allow adding a single LoRA via from_pretrained - that is the safest way and it always works. Reset the model, then use from_pretrained. No strange hidden merging into a third adapter. This does not behave the way it does in Stable Diffusion.
- A new tab for LoRA merging and switching (though I would prefer an extension) needs to be done somewhere the user has full control over it, otherwise it is useless. It needs to be transparent to the user (e.g. merging two LoRAs actually, physically creates a third LoRA), and it needs to allow changing the weights, because 99.99% of the time merging two LoRAs with weight 1 will not produce the result you want. It also needs to deal with PEFT's quirks...
I know people want this to work like Stable Diffusion, but text is not images. A funny LoRA and a poetry LoRA will not merge into funny poetry. So we should deal with what it is, not with what people imagine it to be.
Edit: I retract my statement (left unedited here). Since LoRA merging works fine with exllama2, points 1 and 2 above are not the solution, because it only applies to Transformers.