Installing the Large-Model Tool KTransformers
Technical Background
I have written several articles about DeepSeek before, covering loading models with Ollama and quantizing models with llama.cpp (llama.cpp can in fact also load models, with functionality similar to Ollama's). Here I introduce another high-performance tool for loading large models, developed in China: KTransformers. Note that this post only covers how to install KTransformers, because my local GPUs are too old to actually run it; the build and installation process itself, however, went through without problems.
Installing KTransformers from Source
First, clone the KTransformers repository from GitHub and build it following the official instructions. Do not use version 0.2.1: according to the maintainers, 0.2.1 degrades the model's intelligence, and the issue has been fixed in newer versions, so it is best to install from the latest source code:
$ git clone https://github.com/kvcache-ai/ktransformers.git
Cloning into 'ktransformers'...
remote: Enumerating objects: 1866, done.
remote: Counting objects: 100% (655/655), done.
remote: Compressing objects: 100% (300/300), done.
remote: Total 1866 (delta 440), reused 359 (delta 355), pack-reused 1211 (from 2)
Receiving objects: 100% (1866/1866), 9.33 MiB | 7.65 MiB/s, done.
Resolving deltas: 100% (990/990), done.
$ cd ktransformers/
$ git submodule init
Submodule 'third_party/llama.cpp' (https://github.com/ggerganov/llama.cpp.git) registered for path 'third_party/llama.cpp'
Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/pybind11'
$ git submodule update
Cloning into '/datb/DeepSeek/ktransformers/third_party/llama.cpp'...
Cloning into '/datb/DeepSeek/ktransformers/third_party/pybind11'...
Submodule path 'third_party/llama.cpp': checked out 'a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f'
Submodule path 'third_party/pybind11': checked out 'bb05e0810b87e74709d9f4c4545f1f57a1b386f5'
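Before running the installer, it may be worth confirming which commit you actually checked out, so you know you are past the problematic 0.2.1 release. These are standard git commands; I have not checked the exact tag names used in the repository, so take the output as illustrative only:
$ git describe --tags
$ git log -1 --oneline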
$ bash install.sh
Successfully built ktransformers
Installing collected packages: wcwidth, zstandard, tomli, tenacity, sniffio, six, pyproject_hooks, pydantic-core, psutil, propcache, orjson, ninja, multidict, jsonpointer, h11, greenlet, frozenlist, exceptiongroup, colorlog, click, attrs, async-timeout, annotated-types, aiohappyeyeballs, yarl, uvicorn, SQLAlchemy, requests-toolbelt, pydantic, jsonpatch, httpcore, build, blessed, anyio, aiosignal, starlette, httpx, aiohttp, langsmith, fastapi, accelerate, langchain-core, langchain-text-splitters, langchain, ktransformers
Successfully installed SQLAlchemy-2.0.38 accelerate-1.3.0 aiohappyeyeballs-2.4.6 aiohttp-3.11.12 aiosignal-1.3.2 annotated-types-0.7.0 anyio-4.8.0 async-timeout-4.0.3 attrs-25.1.0 blessed-1.20.0 build-1.2.2.post1 click-8.1.8 colorlog-6.9.0 exceptiongroup-1.2.2 fastapi-0.115.8 frozenlist-1.5.0 greenlet-3.1.1 h11-0.14.0 httpcore-1.0.7 httpx-0.28.1 jsonpatch-1.33 jsonpointer-3.0.0 ktransformers-0.2.1+cu128torch26fancy langchain-0.3.18 langchain-core-0.3.35 langchain-text-splitters-0.3.6 langsmith-0.3.8 multidict-6.1.0 ninja-1.11.1.3 orjson-3.10.15 propcache-0.2.1 psutil-7.0.0 pydantic-2.10.6 pydantic-core-2.27.2 pyproject_hooks-1.2.0 requests-toolbelt-1.0.0 six-1.17.0 sniffio-1.3.1 starlette-0.45.3 tenacity-9.0.0 tomli-2.2.1 uvicorn-0.34.0 wcwidth-0.2.13 yarl-1.18.3 zstandard-0.23.0
Installation completed successfully
At this point the installation has completed successfully. A quick sanity check that the package is actually visible to Python is shown below; if the build reported errors instead, the notes in the next section may help with debugging.
Troubleshooting
If you see this error:
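A simple check is to ask pip for the installed package metadata (plain pip, nothing KTransformers-specific); the version string should match the one printed by install.sh:
$ python3 -m pip show ktransformers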
FileNotFoundError: [Errno 2] No such file or directory: '/xxx/nvcc'
this is caused by an incorrect CUDA_HOME setting. First use which nvcc to find out where nvcc lives:
$ which nvcc
/home/llama/bin/nvcc
Then set the corresponding path (adjust it for your own environment) in the environment variable:
$ export CUDA_HOME=/home/llama/targets/x86_64-linux
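In a standard CUDA installation, CUDA_HOME can usually be derived straight from the nvcc location. Here is a minimal sketch assuming nvcc lives under $CUDA_HOME/bin, which is not quite the case in my somewhat unusual environment above:
$ export CUDA_HOME="$(dirname "$(dirname "$(which nvcc)")")"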
If you see this error:
sh: 1: cicc: not found
this means cicc is not on the system PATH; add the directory containing the cicc executable to the environment variable:
$ export PATH=$PATH:/home/llama/nvvm/bin
A particularly annoying problem is this one:
make[2]: *** No rule to make target '/home/lib64/libcudart.so'
After fiddling with PATH, CUDA_HOME, and LD_LIBRARY_PATH for a long time, I found that this path hard-codes lib64, whereas my local CUDA installation uses lib. My workaround was simply to copy the libraries over:
$ cp -r /home/lib/ /home/lib64/
Of course, since this touches on how the dynamic libraries are resolved, the approach is not particularly elegant, but it did fix the error and saved me from reinstalling CUDA.
Testing the Installation
After the installation succeeds, you can use the following command to check that everything is in place:
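If you would rather not duplicate several gigabytes of libraries, a symlink should have the same effect. This is my own suggestion using the same paths as above, and I have not tested it in this exact setup:
$ ln -s /home/lib /home/lib64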
$ python -m ktransformers.local_chat --help
NAME
local_chat.py
SYNOPSIS
local_chat.py <flags>
FLAGS
--model_path=MODEL_PATH
Type: Optional[str | None]
Default: None
-o, --optimize_rule_path=OPTIMIZE_RULE_PATH
Type: Optional[str]
Default: None
-g, --gguf_path=GGUF_PATH
Type: Optional[str | None]
Default: None
--max_new_tokens=MAX_NEW_TOKENS
Type: int
Default: 300
-c, --cpu_infer=CPU_INFER
Type: int
Default: 10
-u, --use_cuda_graph=USE_CUDA_GRAPH
Type: bool
Default: True
-p, --prompt_file=PROMPT_FILE
Type: Optional[str | None]
Default: None
--mode=MODE
Type: str
Default: 'normal'
-f, --force_think=FORCE_THINK
Type: bool
Default: False
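As a quick illustration of these flags, a run that presumably reads the initial prompt from a file rather than typing it interactively might look like the following. The model paths here are placeholders (the real ones used in this post are set up in the next section), and prompt.txt is just a hypothetical file containing the question:
$ echo "Briefly introduce yourself." > prompt.txt
$ python -m ktransformers.local_chat --model_path ./some-model/ --gguf_path ./some-model-GGUF/ --prompt_file ./prompt.txt --max_new_tokens 512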
Loading a Model with KTransformers
To run local_chat, you also need to install flash-attn:
$ python3 -m pip install flash-attn --no-build-isolation
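A quick import check confirms that the wheel was built and installed correctly (the package imports as flash_attn):
$ python3 -c "import flash_attn; print(flash_attn.__version__)"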
The officially suggested model list is shown as an image in the original post (omitted here).
To use the KTransformers chat feature, the model directories need to sit under the KTransformers repository root, i.e. the directory you cd into after git clone. When loading a model with KTransformers you need both the original safetensors repository (mainly for its config and tokenizer files) and the GGUF weights. For example, to use the DeepSeek-V2.5-IQ4_XS model from ModelScope, first clone the safetensors repository (run this from the KTransformers root):
$ git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-V2.5.git
Cloning into 'DeepSeek-V2.5'...
remote: Enumerating objects: 81, done.
remote: Counting objects: 100% (81/81), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 81 (delta 6), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (81/81), 1.47 MiB | 1017.00 KiB/s, done.
Filtering content: 100% (55/55), 3.10 GiB | 60.00 KiB/s, done.
Encountered 55 file(s) that may not have been copied correctly on Windows:
model-00013-of-000055.safetensors
model-00052-of-000055.safetensors
model-00050-of-000055.safetensors
model-00027-of-000055.safetensors
model-00053-of-000055.safetensors
model-00016-of-000055.safetensors
model-00026-of-000055.safetensors
model-00051-of-000055.safetensors
model-00028-of-000055.safetensors
model-00039-of-000055.safetensors
model-00015-of-000055.safetensors
model-00014-of-000055.safetensors
model-00049-of-000055.safetensors
model-00040-of-000055.safetensors
model-00038-of-000055.safetensors
model-00021-of-000055.safetensors
model-00020-of-000055.safetensors
model-00025-of-000055.safetensors
model-00024-of-000055.safetensors
model-00037-of-000055.safetensors
model-00046-of-000055.safetensors
model-00047-of-000055.safetensors
model-00045-of-000055.safetensors
model-00012-of-000055.safetensors
model-00011-of-000055.safetensors
model-00023-of-000055.safetensors
model-00044-of-000055.safetensors
model-00048-of-000055.safetensors
model-00010-of-000055.safetensors
model-00035-of-000055.safetensors
model-00022-of-000055.safetensors
model-00033-of-000055.safetensors
model-00036-of-000055.safetensors
model-00043-of-000055.safetensors
model-00034-of-000055.safetensors
model-00031-of-000055.safetensors
model-00032-of-000055.safetensors
model-00054-of-000055.safetensors
model-00042-of-000055.safetensors
model-00030-of-000055.safetensors
model-00019-of-000055.safetensors
model-00002-of-000055.safetensors
model-00008-of-000055.safetensors
model-00018-of-000055.safetensors
model-00009-of-000055.safetensors
model-00007-of-000055.safetensors
model-00001-of-000055.safetensors
model-00041-of-000055.safetensors
model-00055-of-000055.safetensors
model-00005-of-000055.safetensors
model-00006-of-000055.safetensors
model-00004-of-000055.safetensors
model-00003-of-000055.safetensors
model-00017-of-000055.safetensors
model-00029-of-000055.safetensors
See: `git lfs help smudge` for more details.
Then, under the KTransformers root, create a directory for the GGUF model, DeepSeek-V2.5-GGUF, enter it, and download the corresponding GGUF shards:
$ modelscope download --model bartowski/DeepSeek-V2.5-GGUF DeepSeek-V2.5-IQ4_XS/DeepSeek-V2.5-IQ4_XS-00001-of-00004.gguf --local_dir ./
$ modelscope download --model bartowski/DeepSeek-V2.5-GGUF DeepSeek-V2.5-IQ4_XS/DeepSeek-V2.5-IQ4_XS-00002-of-00004.gguf --local_dir ./
$ modelscope download --model bartowski/DeepSeek-V2.5-GGUF DeepSeek-V2.5-IQ4_XS/DeepSeek-V2.5-IQ4_XS-00003-of-00004.gguf --local_dir ./
$ modelscope download --model bartowski/DeepSeek-V2.5-GGUF DeepSeek-V2.5-IQ4_XS/DeepSeek-V2.5-IQ4_XS-00004-of-00004.gguf --local_dir ./
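Since the four shards only differ in their index, a small shell loop saves some repetition; this is just my own shorthand using the same model ID and file names as above:
$ for i in 1 2 3 4; do modelscope download --model bartowski/DeepSeek-V2.5-GGUF DeepSeek-V2.5-IQ4_XS/DeepSeek-V2.5-IQ4_XS-0000${i}-of-00004.gguf --local_dir ./; done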
Once the model files have been downloaded, go back to the KTransformers root and load the model with KTransformers:
$ python -m ktransformers.local_chat --model_path ./DeepSeek-V2.5/ --gguf_path ./DeepSeek-V2.5-GGUF/
The startup log looks roughly like the screenshot in the original post (omitted here).
If your local hardware is adequate and the software is installed correctly, you will reach the chat prompt after a long stream of log output. If you hit errors instead, the debugging notes below may help.
Troubleshooting
The first problem you may run into is GPU memory management:

My local hardware is two 8 GB GPUs; the model needs about 10 GB of VRAM, which in theory should fit across the two cards, but I hit an OOM error:
Traceback (most recent call last):
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/datb/DeepSeek/ktransformers/ktransformers/local_chat.py", line 179, in <module>
fire.Fire(local_chat)
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/datb/DeepSeek/ktransformers/ktransformers/local_chat.py", line 110, in local_chat
optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
File "/datb/DeepSeek/ktransformers/ktransformers/optimize/optimize.py", line 129, in optimize_and_load_gguf
load_weights(module, gguf_loader)
File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 87, in load_weights
module.load()
File "/datb/DeepSeek/ktransformers/ktransformers/operators/base_operator.py", line 60, in load
utils.load_weights(child, self.gguf_loader, self.key+".")
File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 87, in load_weights
module.load()
File "/datb/DeepSeek/ktransformers/ktransformers/operators/base_operator.py", line 60, in load
utils.load_weights(child, self.gguf_loader, self.key+".")
File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 87, in load_weights
module.load()
File "/datb/DeepSeek/ktransformers/ktransformers/operators/linear.py", line 432, in load
self.generate_linear.load(w=w)
File "/datb/DeepSeek/ktransformers/ktransformers/operators/linear.py", line 215, in load
w_ref, marlin_q_w, marlin_s, g_idx, sort_indices, _ = marlin_quantize(
File "/datb/DeepSeek/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/marlin_utils.py", line 93, in marlin_quantize
w_ref, q_w, s, g_idx, rand_perm = quantize_weights(w, num_bits, group_size,
File "/datb/DeepSeek/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/quant_utils.py", line 84, in quantize_weights
q_w = reshape_w(q_w)
File "/datb/DeepSeek/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/quant_utils.py", line 81, in reshape_w
w = w.reshape((size_k, size_n)).contiguous()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 7.78 GiB of which 243.12 MiB is free. Including non-PyTorch memory, this process has 6.96 GiB memory in use. Of the allocated memory 6.25 GiB is allocated by PyTorch, and 623.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This happens because KTransformers only uses a single card's VRAM by default; according to the maintainers, PCIe communication between multiple GPUs adds noticeable overhead and may actually degrade performance. Still, some of us want to try it out on low-end cards anyway. The official repository does ship some multi-GPU optimize-rule YAML files under the optimize directory, mainly for V3 and R1, but I found that V2.5 can apparently reuse the V2 optimize rule directly, so you can run it like this:
$ python -m ktransformers.local_chat --model_path ./DeepSeek-V2.5/ --gguf_path ./DeepSeek-V2.5-GGUF/ -f true --port 11434 --optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu.yaml --cpu_infer 45
This manually places part of the model's layers on the second GPU, and the OOM error goes away.
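To confirm that layers really end up on both cards, it is handy to watch per-GPU memory while the model loads; this is just a standard nvidia-smi query and has nothing KTransformers-specific in it:
$ watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv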
Configuring an HF Mirror
All of the models I use locally are downloaded from ModelScope, but if KTransformers cannot find the model files locally it will try to fetch the corresponding model from Hugging Face. Due to network restrictions in mainland China, that direct download may fail. One workaround is to point Hugging Face at a mirror endpoint:
$ export HF_ENDPOINT=https://hf-mirror.com
I have not tested this method myself; I am just noting it down here, and interested readers can try it on their own.
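If you want the mirror to apply to every new shell rather than just the current session, the usual approach is to append the export to your shell profile (again a general note, nothing KTransformers-specific):
$ echo 'export HF_ENDPOINT=https://hf-mirror.com' >> ~/.bashrc
$ source ~/.bashrc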
Installing via Docker
If you really do not want to set up the environment yourself, or to chase down build problems, you can install via Docker: a prebuilt image, approachingai/ktransformers:0.2.1, is already available. However, pulling from Docker Hub inside China often runs into network problems as well, so you may want to look for a domestic registry mirror; here I used a temporary mirror at docker.1ms.run:
$ docker pull docker.1ms.run/approachingai/ktransformers:0.2.1
0.2.1: Pulling from approachingai/ktransformers
43f89b94cd7d: Pull complete
dd4939a04761: Pull complete
b0d7cc89b769: Pull complete
1532d9024b9c: Pull complete
04fc8a31fa53: Pull complete
a14a8a8a6ebc: Pull complete
7d61afc7a3ac: Pull complete
8bd2762ffdd9: Pull complete
2a5ee6fadd42: Pull complete
22ba0fb08ae2: Pull complete
4d37a6bba88f: Pull complete
4bc954eb910a: Pull complete
165aad6fabd0: Pull complete
c49c630af66e: Pull complete
4f4fb700ef54: Pull complete
ad3e53217d5f: Pull complete
4e5bba86d044: Pull complete
30e549924dd6: Pull complete
d01b314170e5: Pull complete
819a99ec0367: Pull complete
Digest: sha256:7de7da1cca07197aed37aff6e25daeb56370705f0ee05dbe3ad4803c51cc50a3
Status: Downloaded newer image for docker.1ms.run/approachingai/ktransformers:0.2.1
docker.1ms.run/approachingai/ktransformers:0.2.1
After the pull completes, you can list the running containers:
$ docker ps -a
The list is empty for now. Next, check the local images:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.1ms.run/approachingai/ktransformers 0.2.1 614daa66a726 3 days ago 18.5GB
Create a container from the image and mount the local directories you need into it:
$ docker run --gpus all -v /datb/DeepSeek/:/models --name ktransformers -itd docker.1ms.run/approachingai/ktransformers:0.2.1
Check the running containers again:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0d3fb9d57002 docker.1ms.run/approachingai/ktransformers:0.2.1 "tail -f /dev/null" 3 minutes ago Up 3 minutes ktransformers
The container is now up and running; the next step is to get a shell inside it:
$ docker exec -it ktransformers /bin/bash
Check the working directory and the mounted directory:
root@0d3fb9d57002:/workspace# ll
total 12
drwxr-xr-x 1 root root 4096 Feb 15 05:45 ./
drwxr-xr-x 1 root root 4096 Feb 19 00:54 ../
drwxrwxr-x 8 1004 1004 4096 Feb 15 06:40 ktransformers/
root@0d3fb9d57002:/workspace# ll /models/
total 72
drwxr-xr-x 7 1001 1001 4096 Feb 17 06:19 ./
drwxr-xr-x 1 root root 4096 Feb 19 00:54 ../
drwxrwxr-x 3 1001 1001 4096 Feb 6 09:19 DeepSeek32B/
drwxrwxr-x 12 1001 1001 4096 Feb 17 09:14 ktransformers/
drwxrwxr-x 4 1001 1001 4096 Feb 12 02:31 llama/
drwxrwxr-x 7 1001 1001 4096 Feb 18 06:46 models/
-rwxrwxr-x 1 1001 1001 13389 Feb 5 04:16 ollama_install.sh*
drwxrwxr-x 7 1001 1001 4096 Feb 14 08:31 page-share-app/
-rw-rw-r-- 1 1001 1001 22239 Feb 12 02:14 running.log
-rw-rw-r-- 1 1001 1001 183 Feb 12 02:09 test_llama_cpp.py
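Before launching the chat, it is worth confirming that the container actually sees both GPUs. This assumes the NVIDIA Container Toolkit is set up correctly on the host; the check itself is just plain nvidia-smi run inside the container:
root@0d3fb9d57002:/workspace# nvidia-smi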
The usage is the same as above: start the chat directly, just with the paths adjusted for the container:
root@0d3fb9d57002:/models# python3 -m ktransformers.local_chat --gguf_path ktransformers/DeepSeek-V2.5-GGUF --model_path ktransformers/DeepSeek-V2.5 --cpu_infer 45 -f true --port 11434 --optimize_rule_path ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu.yaml
If your local hardware is up to it, this should drop you straight into the chat interface.
The TORCH_USE_CUDA_DSA Error
Finally, a problem with no real workaround. During the kernel compilation and weight loading that happens when local_chat starts, the following error may pop up:
loading token_embd.weight to cpu
/opt/conda/lib/python3.10/site-packages/ktransformers/util/custom_gguf.py:644: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905979055/work/torch/csrc/utils/tensor_numpy.cpp:206.)
data = torch.from_numpy(data)
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/ktransformers/local_chat.py", line 179, in <module>
fire.Fire(local_chat)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/ktransformers/local_chat.py", line 110, in local_chat
optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
File "/opt/conda/lib/python3.10/site-packages/ktransformers/optimize/optimize.py", line 129, in optimize_and_load_gguf
load_weights(module, gguf_loader)
File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 87, in load_weights
module.load()
File "/opt/conda/lib/python3.10/site-packages/ktransformers/operators/base_operator.py", line 60, in load
utils.load_weights(child, self.gguf_loader, self.key+".")
File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 87, in load_weights
module.load()
File "/opt/conda/lib/python3.10/site-packages/ktransformers/operators/base_operator.py", line 60, in load
utils.load_weights(child, self.gguf_loader, self.key+".")
File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 85, in load_weights
load_weights(child, gguf_loader, prefix+name+".")
File "/opt/conda/lib/python3.10/site-packages/ktransformers/util/utils.py", line 87, in load_weights
module.load()
File "/opt/conda/lib/python3.10/site-packages/ktransformers/operators/linear.py", line 430, in load
self.generate_linear.load(w=w)
File "/opt/conda/lib/python3.10/site-packages/ktransformers/operators/linear.py", line 215, in load
w_ref, marlin_q_w, marlin_s, g_idx, sort_indices, _ = marlin_quantize(
File "/opt/conda/lib/python3.10/site-packages/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/marlin_utils.py", line 93, in marlin_quantize
w_ref, q_w, s, g_idx, rand_perm = quantize_weights(w, num_bits, group_size,
File "/opt/conda/lib/python3.10/site-packages/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/quant_utils.py", line 61, in quantize_weights
w = w.reshape((group_size, -1))
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Or, after reaching the chat interface, typing any question may trigger this error:
Chat: who are you?
Traceback (most recent call last):
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/datb/DeepSeek/ktransformers/ktransformers/local_chat.py", line 179, in <module>
fire.Fire(local_chat)
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/datb/DeepSeek/ktransformers/ktransformers/local_chat.py", line 173, in local_chat
generated = prefill_and_generate(
File "/datb/DeepSeek/ktransformers/ktransformers/util/utils.py", line 154, in prefill_and_generate
logits = model(
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/datb/DeepSeek/ktransformers/ktransformers/models/modeling_deepseek.py", line 1731, in forward
outputs = self.model(
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/datb/DeepSeek/ktransformers/ktransformers/operators/models.py", line 722, in forward
layer_outputs = decoder_layer(
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/datb/DeepSeek/ktransformers/ktransformers/models/modeling_deepseek.py", line 1238, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/datb/DeepSeek/ktransformers/ktransformers/operators/attention.py", line 407, in forward
return self.forward_windows(
File "/datb/DeepSeek/ktransformers/ktransformers/operators/attention.py", line 347, in forward_windows
return self.forward_chunck(
File "/datb/DeepSeek/ktransformers/ktransformers/operators/attention.py", line 85, in forward_chunck
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/datb/DeepSeek/ktransformers/ktransformers/models/modeling_deepseek.py", line 113, in forward
hidden_states = hidden_states.to(torch.float32)
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
For TORCH_USE_CUDA_DSA errors like these, the official explanation is that they are most likely caused by the GPU model being too old. There is therefore no real fix other than switching to a newer GPU.
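If you want to check in advance whether your card is likely to be affected, you can query its compute capability from PyTorch. This is the standard torch API, but I have not verified exactly which minimum capability the prebuilt KTransformers kernels require, so treat any particular threshold as an assumption:
$ python3 -c "import torch; print(torch.cuda.get_device_capability(0))"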
Summary
This article mainly covers how to install KTransformers, a high-performance large-model loading tool developed in China. The reason it only covers installation rather than usage is that the tool still places real demands on local hardware: with a GPU that is too old, you may hit a TORCH_USE_CUDA_DSA-related error, and the only remedy is to replace the GPU. For that reason I was not able to fully test it locally; only the source installation and the Docker installation were confirmed to work.
Copyright Notice
This article was first published at: https://www.cnblogs.com/dechinphy/p/ktransformer.html
Author ID: DechinPhy
More original articles: https://www.cnblogs.com/dechinphy/
Buy the author a coffee: https://www.cnblogs.com/dechinphy/gallery/image/379634.html
This work is licensed under the CC BY-NC-SA 4.0 license.