Notes on building Ollama on Windows 11 to run llama3.1 locally
Environment
- Windows 11 (GPU: AMD Radeon RX 6650 XT)
- VS Code (used to search the code and add gfx1032 next to gfx1030)
- Git
- Go:
    $ go version
    go version go1.23.3 windows/amd64
- MinGW (the build needs the make command):
    $ make -v
    GNU Make 4.4.1
    Built for x86_64-w64-mingw32
    Copyright (C) 1988-2023 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
Note: if make -v still fails after MinGW has been added to the environment variables, go to mingw64\bin, make a copy of mingw32-make.exe, and rename the copy to make.exe (the same applies when installing Perl).
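Before starting, it can help to confirm that the tools listed above are reachable from PATH. The following is a minimal sketch, assuming Python is available on the machine (it is not one of the requirements above); `missing_tools` and its `which` parameter are hypothetical names used only for illustration.

```python
import shutil

# Tools that the build steps in this post rely on.
REQUIRED_TOOLS = ("git", "go", "make")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the tools installed.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` from the same shell used for the build returns an empty list when everything is found. If it returns `["make"]`, the rename described in the note above is the likely fix.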
Installing the AMD HIP SDK for Windows
- Download: HIP SDK 6.1.2
- After installation, add the HIP SDK to the environment variables (for example, %HIP_PATH_61%\bin).
- Run hipinfo. The output shows that the gcnArchName of the AMD Radeon RX 6650 XT is gfx1032:
    $ hipinfo
    --------------------------------------------------------------------------------
    device# 0
    Name: AMD Radeon RX 6650 XT
    pciBusID: 3
    pciDeviceID: 0
    pciDomainID: 0
    multiProcessorCount: 16
    maxThreadsPerMultiProcessor: 2048
    isMultiGpuBoard: 0
    clockRate: 2410 Mhz
    memoryClockRate: 1095 Mhz
    memoryBusWidth: 0
    totalGlobalMem: 7.98 GB
    totalConstMem: 2147483647
    sharedMemPerBlock: 64.00 KB
    canMapHostMemory: 1
    regsPerBlock: 0
    warpSize: 32
    l2CacheSize: 4194304
    computeMode: 0
    maxThreadsPerBlock: 1024
    maxThreadsDim.x: 1024
    maxThreadsDim.y: 1024
    maxThreadsDim.z: 1024
    maxGridSize.x: 2147483647
    maxGridSize.y: 65536
    maxGridSize.z: 65536
    major: 10
    minor: 3
    concurrentKernels: 1
    cooperativeLaunch: 0
    cooperativeMultiDeviceLaunch: 0
    isIntegrated: 0
    maxTexture1D: 16384
    maxTexture2D.width: 16384
    maxTexture2D.height: 16384
    maxTexture3D.width: 2048
    maxTexture3D.height: 2048
    maxTexture3D.depth: 2048
    hostNativeAtomicSupported: 1
    isLargeBar: 0
    asicRevision: 0
    maxSharedMemoryPerMultiProcessor: 64.00 KB
    clockInstructionRate: 1000.00 Mhz
    arch.hasGlobalInt32Atomics: 1
    arch.hasGlobalFloatAtomicExch: 1
    arch.hasSharedInt32Atomics: 1
    arch.hasSharedFloatAtomicExch: 1
    arch.hasFloatAtomicAdd: 1
    arch.hasGlobalInt64Atomics: 1
    arch.hasSharedInt64Atomics: 1
    arch.hasDoubles: 1
    arch.hasWarpVote: 1
    arch.hasWarpBallot: 1
    arch.hasWarpShuffle: 1
    arch.hasFunnelShift: 0
    arch.hasThreadFenceSystem: 1
    arch.hasSyncThreadsExt: 0
    arch.hasSurfaceFuncs: 0
    arch.has3dGrid: 1
    arch.hasDynamicParallelism: 0
    gcnArchName: gfx1032
    peers:
    non-peers: device#0
    memInfo.total: 7.98 GB
    memInfo.free: 7.85 GB (98%)
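The hipinfo output is long, and the only field needed for the following steps is gcnArchName. As a rough sketch (the function name `parse_gcn_arch_names` is hypothetical, and the `key: value` layout is assumed to match the output shown above), the value can be pulled out with a regular expression:

```python
import re


def parse_gcn_arch_names(hipinfo_output: str) -> list[str]:
    """Return every gcnArchName value found in hipinfo output, one per device, in order."""
    return re.findall(r"gcnArchName:\s*(\S+)", hipinfo_output)
```

To use it against a live system, the text could come from `subprocess.run(["hipinfo"], capture_output=True, text=True).stdout`, provided hipinfo is on PATH as described above.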
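- The official list of GPUs supported by AMD ROCm does not include the AMD Radeon RX 6650 XT, but prebuilt rocblas libraries can be used instead.
- In ROCmLibs for HIP SDK 6.1.2, find rocm.gfx1032.for.hip.sdk.6.1.2.optimized.Fremont.Dango.Version.7z and download it (I used this one because it is the newer build).
- Extract the archive, then do the following (back up the files that will be overwritten first, so you can roll back if something goes wrong ovo):
  - Copy rocblas.dll to C:\Program Files\AMD\ROCm\6.1\bin
  - Copy the library directory to C:\Program Files\AMD\ROCm\6.1\bin\rocblas (choose to replace all files)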
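The backup-then-replace step above can also be scripted. This is only a sketch under assumptions: it assumes the extracted archive has rocblas.dll and a library directory at its top level, that the script runs with permission to write under Program Files, and `install_rocblas` plus its parameters are hypothetical names.

```python
import shutil
from pathlib import Path


def install_rocblas(extracted: Path, rocm_bin: Path, backup: Path) -> None:
    """Back up the installed rocblas files, then copy the prebuilt ones over them."""
    dll = rocm_bin / "rocblas.dll"
    library = rocm_bin / "rocblas" / "library"

    # Keep a copy of whatever is about to be overwritten, for rollback.
    backup.mkdir(parents=True, exist_ok=True)
    if dll.exists():
        shutil.copy2(dll, backup / "rocblas.dll")
    if library.exists():
        shutil.copytree(library, backup / "library", dirs_exist_ok=True)

    # Copy the prebuilt files, replacing existing files with the same name.
    shutil.copy2(extracted / "rocblas.dll", dll)
    shutil.copytree(extracted / "library", library, dirs_exist_ok=True)
```

For the paths in this post, `rocm_bin` would be `Path(r"C:\Program Files\AMD\ROCm\6.1\bin")`. Rolling back amounts to copying the contents of the backup directory to the same locations.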
Building Ollama
- Clone the Ollama project:
    # Note: the version used in this experiment is ollama 0.4.0
    git clone https://github.com/ollama/ollama.git
- Open the ollama code in VS Code and add gfx1032 in ollama/llama/make/Makefile.rocm (a global search for gfx1030 also finds the file):
    # Original
    HIP_ARCHS_COMMON := gfx900 gfx940 gfx941 gfx942 gfx1010 gfx1012 gfx1030 gfx1100 gfx1101 gfx1102
    # Add gfx1032 so that the compiled ollama_llama_server.exe supports the AMD Radeon RX 6650 XT
    HIP_ARCHS_COMMON := gfx900 gfx940 gfx941 gfx942 gfx1010 gfx1012 gfx1030 gfx1032 gfx1100 gfx1101 gfx1102
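The same Makefile edit can be expressed as a small text transformation, which is handy if the repository is re-cloned later. This is a sketch only: `add_arch` is a hypothetical helper, and it assumes the HIP_ARCHS_COMMON assignment sits on a single line formatted as shown above.

```python
def add_arch(makefile_text: str, new_arch: str = "gfx1032", after: str = "gfx1030") -> str:
    """Insert new_arch right after `after` on the HIP_ARCHS_COMMON line."""
    lines = []
    for line in makefile_text.splitlines():
        if line.startswith("HIP_ARCHS_COMMON"):
            # tokens looks like ["HIP_ARCHS_COMMON", ":=", "gfx900", ...]
            tokens = line.split()
            if new_arch not in tokens and after in tokens:
                tokens.insert(tokens.index(after) + 1, new_arch)
                line = " ".join(tokens)
        lines.append(line)
    return "\n".join(lines) + "\n"
```

The function leaves the text unchanged when gfx1032 is already present, so running it twice is harmless.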
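- Run the following commands in order:
    $ export CGO_ENABLED="1"
    $ go generate ./...
    $ go build .
  Note: run the commands from the root of the cloned ollama repository. I used the Git Bash command line, so each command is shown with a leading $; remove the $ when copying.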
- When the build finishes, an ollama.exe file is generated in the ollama root directory. Start the server to test it:
    $ ./ollama.exe serve
    2024/11/08 21:38:18 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\cphovo\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
    time=2024-11-08T21:38:18.488+08:00 level=INFO source=images.go:755 msg="total blobs: 5"
    time=2024-11-08T21:38:18.488+08:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
    [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
    [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
     - using env: export GIN_MODE=release
     - using code: gin.SetMode(gin.ReleaseMode)
    [GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
    [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
    [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
    [GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
    [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
    [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
    [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
    [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
    [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
    [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
    [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
    [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
    [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
    [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
    [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
    [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
    [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
    [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
    [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
    [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
    [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
    [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
    [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
    [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
    time=2024-11-08T21:38:18.490+08:00 level=INFO source=routes.go:1240 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
    time=2024-11-08T21:38:18.491+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 rocm]"
    time=2024-11-08T21:38:18.491+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
    time=2024-11-08T21:38:18.492+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
    time=2024-11-08T21:38:18.492+08:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
    time=2024-11-08T21:38:18.492+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=12 efficiency=4 threads=20
    time=2024-11-08T21:38:18.937+08:00 level=INFO source=types.go:123 msg="inference compute" id=0 library=rocm variant="" compute=gfx1032 driver=6.1 name="AMD Radeon RX 6650 XT" total="8.0 GiB" available="7.8 GiB"
  Note: a log line like the following indicates that the compiled ollama can use the AMD Radeon RX 6650 XT for acceleration:
    time=2024-11-08T21:38:18.937+08:00 level=INFO source=types.go:123 msg="inference compute" id=0 library=rocm variant="" compute=gfx1032 driver=6.1 name="AMD Radeon RX 6650 XT" total="8.0 GiB" available="7.8 GiB"
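Rather than scanning the startup log by eye, the "inference compute" line can be checked programmatically. The sketch below assumes the `key=value` / `key="quoted value"` layout seen in the log above; `parse_log_fields` and `uses_rocm` are hypothetical helper names, not part of ollama.

```python
import re


def parse_log_fields(log_line: str) -> dict:
    """Extract key=value pairs (quoted or bare) from one log line."""
    pairs = re.findall(r'(\w+)=("([^"]*)"|\S+)', log_line)
    # Use the text inside the quotes when the value was quoted, else the bare token.
    return {key: quoted if raw.startswith('"') else raw for key, raw, quoted in pairs}


def uses_rocm(log_line: str) -> bool:
    """True if this is an "inference compute" line reporting the rocm library."""
    fields = parse_log_fields(log_line)
    return fields.get("msg") == "inference compute" and fields.get("library") == "rocm"
```

Applied to the line quoted in the note above, `parse_log_fields` yields `compute` as `gfx1032` and `name` as `AMD Radeon RX 6650 XT`.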
- Download and run the model:
    # The first run downloads the model
    $ ./ollama.exe run llama3.1
  At this point the following error may appear:
    The code execution cannot proceed because ggml_rocm.dll was not found. Reinstalling the program may fix this problem.
  However, ggml_rocm.dll does exist under dist\windows-amd64\lib\ollama. Fix: make a copy of dist\windows-amd64\lib\ollama\ggml_rocm.dll and put it in dist\windows-amd64\lib\ollama\runners\rocm:
    $ cd dist/windows-amd64/lib/ollama/runners/rocm/
    $ ls -al
    total 349176
    drwxr-xr-x 1 cphovo 197121         0 11月  8 21:45 .
    drwxr-xr-x 1 cphovo 197121         0 11月  8 20:36 ..
    -rwxr-xr-x 1 cphovo 197121 348145152 11月  8 20:34 ggml_rocm.dll
    -rwxr-xr-x 1 cphovo 197121   9406976 11月  8 20:36 ollama_llama_server.exe
  Run ./ollama.exe run llama3.1 again, and the prompt appears:
    $ ./ollama.exe run llama3.1
    >>> Send a message (/? for help)
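The copy described above can be made repeatable with a few lines of Python. This is a minimal sketch, assuming the directory layout shown in the listing (a runners/rocm directory under lib/ollama); `ensure_dll_next_to_runner` is a hypothetical name.

```python
import shutil
from pathlib import Path


def ensure_dll_next_to_runner(lib_dir: Path, dll_name: str = "ggml_rocm.dll") -> bool:
    """Copy dll_name from lib_dir into runners/rocm if it is not there yet.

    Returns True if a copy was made, False if the file was already present.
    """
    target = lib_dir / "runners" / "rocm" / dll_name
    if target.exists():
        return False
    shutil.copy2(lib_dir / dll_name, target)
    return True
```

From the repository root, `lib_dir` would be `Path("dist/windows-amd64/lib/ollama")`. The original file is left in place, matching the manual step.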
- Test:
    $ ./ollama.exe run llama3.1
    >>> Implement binary search in Python; give only the code
```python
def binary_search(arr, low, high, x):
    if high >= low:
        mid = (high + low) // 2
        if arr[mid] == x:
            return mid
        elif arr[mid] > x:
            return binary_search(arr, low, mid - 1, x)
        else:
            return binary_search(arr, mid + 1, high, x)
    else:
        return -1

arr = [2, 3, 4, 10, 40]
x = 10
result = binary_search(arr, 0, len(arr)-1, x)
if result != -1:
    print("Element is present at index", str(result))
else:
    print("Element is not present in array")
```
Task Manager now shows that the GPU is being used, so llama3.1 is running on the GPU rather than the CPU, and generation is many times faster than on the CPU.
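To put a number on the speed-up instead of relying on Task Manager, the running server can be queried over HTTP. The sketch below uses the POST /api/generate route listed in the startup log and the default address 127.0.0.1:11434. It assumes, based on my reading of Ollama's API documentation, that a non-streaming response contains `eval_count` (generated tokens) and `eval_duration` (nanoseconds); `generate` and `tokens_per_second` are hypothetical helper names.

```python
import json
import urllib.request


def generate(prompt, model="llama3.1", host="http://127.0.0.1:11434"):
    """POST a non-streaming request to /api/generate and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        host + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tokens_per_second(result):
    """Generation speed; eval_duration is assumed to be in nanoseconds."""
    return result["eval_count"] * 1e9 / result["eval_duration"]
```

With `./ollama.exe serve` running, `tokens_per_second(generate("Hello"))` gives a figure that can be compared between a GPU run and a CPU-only run.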
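Issues encountered
- I originally had HIP SDK 5.7 installed. After following the same steps and starting the ollama server, it still ran on the CPU. After uninstalling 5.7 and installing HIP SDK 6.1, the experiment succeeded.
- As for why the error "The code execution cannot proceed because ggml_rocm.dll was not found. Reinstalling the program may fix this problem." appears: I found no explanation in the project's issues. However, in the ollama_orcm files downloaded from some Bilibili videos, I noticed a llama.dll file in the same directory as ollama_llama_server.exe, so I tried copying the compiled ggml_rocm.dll into the directory containing ollama_llama_server.exe. It felt like black magic, but the problem went away (which saved me from filing an issue, happy ovo). A plausible explanation is that Windows looks in the executable's own directory first when loading a DLL.
- The wiki I followed says Strawberry Perl must be installed for the build, but on my machine the only problem was a missing make command when running go generate ./... . After renaming mingw32-make.exe in mingw64 to make.exe, the build succeeded, so I am not sure whether Perl is actually required.
- The model you choose should preferably not exceed the GPU's VRAM.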
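The last point can be turned into a rough pre-check. This is a sketch under stated assumptions: it compares a model's file size against the "available" value printed in the ollama startup log (for example "7.8 GiB"), the 10% `headroom` is an arbitrary illustrative margin, and `parse_size` and `fits_in_vram` are hypothetical names. Actual memory use also depends on context length and other runtime buffers, so the file size is only a lower bound.

```python
import re

_UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}


def parse_size(text: str) -> int:
    """Convert a size string such as "7.8 GiB" to bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(KiB|MiB|GiB)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def fits_in_vram(model_bytes: int, available: str, headroom: float = 0.1) -> bool:
    """True if the model file plus a safety margin fits in the available VRAM."""
    return model_bytes * (1 + headroom) <= parse_size(available)
```

For example, `fits_in_vram(os.path.getsize(path_to_model), "7.8 GiB")` would check a downloaded model file, where `path_to_model` is a placeholder for wherever the model is stored.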