Troubleshooting errors when serving Qwen1.5-7B-Chat with text-generation-webui

Inference framework: text-generation-webui

Inference model: Qwen1.5-7B-Chat

sys info

gpu: Tesla V100-PCIE-32GB
python: 3.10
model: Qwen1.5-7B-Chat
docker: docker run -it --rm --gpus='"device=0,3"' -v /root/wangbing/model/Qwen-7B-Chat/V1/:/data/mlops/modelDir -v /root/wangbing/sftmodel/qwen/V1:/data/mlops/adapterDir/ -p30901:5000 -p7901:7860 dggecr01.huawei.com:80/tbox/text-generation-webui:at-0.0.1 bash
app: python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code
cuda: 11.8 (nvcc --version)
torch: 13.1 (import torch; print(torch.__version__))
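
If in doubt, the runtime environment can be double-checked from inside the container with a quick sanity check (not specific to text-generation-webui):

import torch

print(torch.__version__)               # PyTorch build
print(torch.version.cuda)              # CUDA version the build was compiled for
print(torch.cuda.is_available())       # True if the container actually sees the GPUs
print(torch.cuda.get_device_name(0))   # e.g. Tesla V100-PCIE-32GB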

1 Wrong model path

root@92c536e270d3:/app# python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/modelDir/qwen --trust-remote-code --model Qwen1.5-72B-Chat
(--model takes the leaf model directory name; --model-dir should point to the directory one level above it)
10:17:25-241308 INFO     Starting Text generation web UI
10:17:25-244433 WARNING  trust_remote_code is enabled. This is dangerous.
10:17:25-245304 WARNING
You are potentially exposing the web UI to the entire internet without any access password.
You can create one with the "--gradio-auth" flag like this:

--gradio-auth username:password

Make sure to replace username:password with your own.
10:17:25-248394 INFO     Loading "Qwen1.5-72B-Chat"
10:17:25-250000 ERROR    The path to the model does not exist. Exiting.
Traceback (most recent call last):
/app/server.py:241 in <module>
  240    # Load the model
> 241    shared.model, shared.tokenizer = load_model(model_name)
  242    if shared.args.lora:
/app/modules/models.py:84 in load_model
  83    logger.error('The path to the model does not exist. Exiting.')
> 84    raise ValueError
  85
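
With the directory layout mounted by the docker command above, the fix is to point --model-dir at the parent directory and pass only the leaf directory name to --model, i.e. the invocation already shown in the sys info section:

python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code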

 

 

2 Missing dependency

ImportError: This modeling file requires the following packages that were not found in your environment: transformers_stream_generator. Run `pip install transformers_stream_generator`

> 241    shared.model, shared.tokenizer = load_model(model_name)
  242    if shared.args.lora:
/app/modules/models.py:87 in load_model
  86    shared.args.loader = loader
> 87    output = load_func_map[loader](model_name)
  88    if type(output) is tuple:
/app/modules/models.py:235 in huggingface_loader
  234
> 235    model = LoaderClass.from_pretrained(path_to_model, **params)
  236
/venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:548 in from_pretrained
  547    class_ref = config.auto_map[cls.__name__]
> 548    model_class = get_class_from_dynamic_module(
  549        class_ref, pretrained_model_name_or_path, code_revision=code_revision, *
/venv/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:488 in get_class_from_dynamic_module
  487    # And lastly we get the class inside our newly created module
> 488    final_module = get_cached_module_file(
  489        repo_id,
/venv/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:314 in get_cached_module_file
  313    # Check we have all the requirements in our environment
> 314    modules_needed = check_imports(resolved_module_file)
  315
/venv/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:180 in check_imports
  179    if len(missing_packages) > 0:
> 180    raise ImportError(
  181        "This modeling file requires the following packages that were not found in y
ImportError: This modeling file requires the following packages that were not found in your environment: transformers_stream_generator. Run `pip install
transformers_stream_generator`
root@92c536e270d3:/app#
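
The fix is exactly what the error message asks for: install the missing package inside the container and reload the model:

pip install transformers_stream_generator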

 

3 C compiler error

RuntimeError: Failed to find C compiler. Please specify via CC environment variable.

output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/qwen/modeling_qwen.py", line 432, in forward
query = apply_rotary_pos_emb(query, q_pos_emb)
File "/root/.cache/huggingface/modules/transformers_modules/qwen/modeling_qwen.py", line 1342, in apply_rotary_pos_emb
return apply_rotary_emb_func(t_float, cos, sin).type_as(t)
File "/venv/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 122, in apply_rotary_emb
return ApplyRotaryEmb.apply(
File "/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/venv/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 48, in forward
out = apply_rotary(
File "/venv/lib/python3.10/site-packages/flash_attn/ops/triton/rotary.py", line 213, in apply_rotary
rotary_kernel[grid](
File "/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 532, in run
self.cache[device][key] = compile(
File "/venv/lib/python3.10/site-packages/triton/compiler/compiler.py", line 614, in compile
so_path = make_stub(name, signature, constants, ids, enable_warp_specialization=enable_warp_specialization)
File "/venv/lib/python3.10/site-packages/triton/compiler/make_launcher.py", line 37, in make_stub
so = _build(name, src_path, tmpdir)
File "/venv/lib/python3.10/site-packages/triton/common/build.py", line 83, in _build
raise RuntimeError("Failed to find C compiler. Please specify via CC environment variable.")
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
Output generated in 2.18 seconds (0.00 tokens/s, 0 tokens, context 60, seed 589520803)
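
Triton JIT-compiles the rotary kernel at runtime and needs a host C compiler for that, which this image apparently does not ship. A workaround that should apply to a Debian/Ubuntu-based image (an assumption about this particular container) is to install gcc and, if it is still not picked up, point CC at it:

apt-get update && apt-get install -y gcc
export CC=/usr/bin/gcc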

 

4 RuntimeError: FlashAttention only supports Ampere GPUs or newer.

RuntimeError: FlashAttention only supports Ampere GPUs or newer.

attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/qwen/modeling_qwen.py", line 191, in forward
output = flash_attn_func(q, k, v, dropout_p, softmax_scale=self.softmax_scale, causal=self.causal)
File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 825, in flash_attn_func
return FlashAttnFunc.apply(
File "/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 507, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Output generated in 1.67 seconds (0.00 tokens/s, 0 tokens, context 51, seed 872380422)

 

Fix: in the model's config.json, set use_flash_attn to false and give up the accelerated inference for now.

The V100 used here does not support flash_attn, so it has to be turned off.

https://github.com/oobabooga/text-generation-webui/issues/5985
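
A minimal sketch of that edit, assuming config.json sits in the model directory mounted at /data/mlops/modelDir and already contains a use_flash_attn field:

import json

cfg_path = "/data/mlops/modelDir/config.json"  # assumed model directory from the docker command

with open(cfg_path) as f:
    cfg = json.load(f)

# V100 is pre-Ampere, so FlashAttention cannot be used; fall back to the normal attention path
cfg["use_flash_attn"] = False

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)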

5 Generation does not stop

5.1 UI-side settings

Set the following generation options in the web UI:

Custom stopping strings: "<|im_start|>", "<|im_end|>", "<|endoftext|>"

Skip special tokens: false (unchecked)

5.2 API
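
The same stop behaviour can be requested through the API. A rough sketch, assuming the OpenAI-compatible endpoint that --api exposes on container port 5000 (mapped to host port 30901 in the docker command above); the extra parameter names may differ between webui versions:

import requests

url = "http://127.0.0.1:30901/v1/chat/completions"  # assumed host/port mapping

payload = {
    "mode": "instruct",
    "messages": [{"role": "user", "content": "你好，介绍一下你自己"}],
    "max_tokens": 512,
    # same stop strings as configured in the UI: the ChatML special tokens
    "stop": ["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
    "skip_special_tokens": False,
}

resp = requests.post(url, json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])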

6 Error when loading a LoRA for inference

TypeError: LoraConfig.__init__() got an unexpected keyword argument 'layer_replication'

09:07:09-714849 INFO  Applying the following LoRAs to modelDir: adapterDir
Traceback (most recent call last):
/app/server.py:243 in <module>
  242    if shared.args.lora:
> 243    add_lora_to_model(shared.args.lora)
  244
/app/modules/LoRA.py:18 in add_lora_to_model
  17    else:
> 18    add_lora_transformers(lora_names)
  19
/app/modules/LoRA.py:125 in add_lora_transformers
  124    logger.info("Applying the following LoRAs to {}: {}".format(shared.model_name,
> 125    shared.model = PeftModel.from_pretrained(shared.model, get_lora_path(lora_names[
  126    for lora in lora_names[1:]:
/venv/lib/python3.10/site-packages/peft/peft_model.py:325 in from_pretrained
  324    if config is None:
> 325    config = PEFT_TYPE_TO_CONFIG_MAPPING[
  326        PeftConfig._get_peft_type(
/venv/lib/python3.10/site-packages/peft/config.py:152 in from_pretrained
  151    kwargs = {**class_kwargs, **loaded_attributes}
> 152    return cls.from_peft_type(**kwargs)
  153
/venv/lib/python3.10/site-packages/peft/config.py:119 in from_peft_type
  118
> 119    return config_cls(**kwargs)
  120
TypeError: LoraConfig.__init__() got an unexpected keyword argument 'layer_replication'

 

The layer_replication field in the adapter's adapter_config.json comes from a peft release different from the one installed in the image, so the installed LoraConfig rejects it. Switching the peft version works around the mismatch:

pip install peft==0.5.0
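
To see where the mismatch comes from before pinning a version, the adapter config can be compared against the fields the installed peft actually knows. A small diagnostic sketch; the adapter path is the mount point from the docker command and is an assumption:

import json
from dataclasses import fields

import peft
from peft import LoraConfig

adapter_cfg_path = "/data/mlops/adapterDir/adapter_config.json"  # assumed adapter mount point

with open(adapter_cfg_path) as f:
    cfg = json.load(f)

known = {f.name for f in fields(LoraConfig)}   # fields this peft release understands
print("installed peft:", peft.__version__)
print("unknown keys in adapter_config.json:", set(cfg) - known)  # e.g. {'layer_replication'}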

 

7 Error when loading multiple LoRAs

09:33:33-607791 INFO     Starting Text generation web UI
09:33:33-611270 WARNING  trust_remote_code is enabled. This is dangerous.
09:33:33-612231 WARNING
You are potentially exposing the web UI to the entire internet without any access password.
You can create one with the "--gradio-auth" flag like this:

--gradio-auth username:password

Make sure to replace username:password with your own.
09:33:33-615970 INFO     Loading "Qwen-7B-Chat-sft"
Loading checkpoint shards: 100% 8/8 [00:08<00:00, 1.01s/it]
09:33:42-689837 INFO     LOADER:
09:33:42-691038 INFO     TRUNCATION LENGTH: 32768
09:33:42-691978 INFO     INSTRUCTION TEMPLATE: "ChatML"
09:33:42-692777 INFO     Loaded the model in 9.08 seconds
09:33:42-693596 INFO     Applying the following LoRAs to Qwen-7B-Chat-sft: lora_adapter, lora_adapter2
Traceback (most recent call last):
/app/server.py:243 in <module>
  242    if shared.args.lora:
> 243    add_lora_to_model(shared.args.lora)
  244
/app/modules/LoRA.py:18 in add_lora_to_model
  17    else:
  18    add_lora_transformers(lora_names)
  19
/app/modules/LoRA.py:130 in add_lora_transformers
  129    if len(lora_names) > 1:
> 130    merge_loras()
  131
/app/modules/LoRA.py:152 in merge_loras
  151
> 152    shared.model.add_weighted_adapter(shared.lora_names, [1] * len(shared.lora_names), "
  153    shared.model.set_adapter("__merged")
/venv/lib/python3.10/site-packages/peft/tuners/lora.py:660 in add_weighted_adapter
  659    # new rank is the max of all ranks of the adapters if not provided
> 660    new_rank = svd_rank or max(adapters_ranks)
  661    else:
ValueError: max() arg is an empty sequence
root@47780043e992:/app#

 

https://github.com/oobabooga/text-generation-webui/issues/4371

Not solved; I merged the adapters manually instead (see the sketch below).

I am not sure whether merging the LoRA adapters this way gives the same result as loading them at inference time.
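
For the manual merge, the adapters can be combined directly with peft outside the web UI. This is only a sketch under assumptions: the base model is the one mounted at /data/mlops/modelDir, the two adapters live under /data/mlops/adapterDir/lora_adapter and /data/mlops/adapterDir/lora_adapter2 (names taken from the log above), the output path is made up, and the exact add_weighted_adapter arguments depend on the peft version:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_dir = "/data/mlops/modelDir"                 # assumed base model path
lora_a = "/data/mlops/adapterDir/lora_adapter"    # assumed adapter paths
lora_b = "/data/mlops/adapterDir/lora_adapter2"
out_dir = "/data/mlops/merged_model"              # hypothetical output path

tokenizer = AutoTokenizer.from_pretrained(base_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_dir, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

# attach both adapters under their own names
model = PeftModel.from_pretrained(model, lora_a, adapter_name="a")
model.load_adapter(lora_b, adapter_name="b")

# combine them into a new adapter and make it the active one
# (this mirrors what the web UI's merge_loras() tries to do)
model.add_weighted_adapter(["a", "b"], [1.0, 1.0], adapter_name="merged")
model.set_adapter("merged")

# optionally fold the merged adapter into the base weights and save a standalone model
merged = model.merge_and_unload()
merged.save_pretrained(out_dir)
tokenizer.save_pretrained(out_dir)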

Error when loading multiple LoRAs on transformers adapters · Issue #4371 · oobabooga/text-generation-webui (github.com)

Quoting a GitHub user from that thread:

I wrote in (#3120) that the PR uses outdated PEFT code, but it got merged anyway. So ¯\_(ツ)_/¯

The problems with this approach go well beyond getting the merge to work. For example, the next time you try to merge LoRAs into the same adapter name, add_weighted_adapter silently bails out, leaving the user thinking the new adapter was applied when in fact nothing happened, and so on... none of that is handled in the merge.

But I don't know how to convince people that this is the wrong approach.

  1. The LoRA dropdown in main should only allow adding a single LoRA using from_pretrained - that is the safest approach and always works. Reset the model, then use from_pretrained. No strange hidden merging into a third adapter. This does not work the way it does in Stable Diffusion.
  2. A new tab for LoRA merging and switching (though I would prefer an extension) needs to put the user in full control, otherwise it is useless. It needs to be transparent to the user (e.g. merging two LoRAs actually, physically creates a third LoRA), and it needs to allow changing the weights, because 99.99% of the time merging two LoRAs with weight 1 will not give the result you want. It also needs to handle PEFT's quirks...

I know people want this to work like Stable Diffusion, but text is not images. A funny LoRA and a poetry LoRA will not produce a funny-poetry merge. So we should deal with what it is, not with what people imagine it to be.

Edit: I retract my statement (but leave it unedited here). Since LoRA merging works fine on exllama2, points 1 and 2 above are not the solution, because it only applies to Transformers.

posted @ 2024-05-09 11:23 linzm14