Cut Costs by 50%: Deploying Large Language Models on EKS with Karpenter

Original article:
https://aws.amazon.com/cn/blogs/containers/scaling-a-large-language-model-with-nvidia-nim-on-amazon-eks-with-karpenter
Authors: Shawn Zhang, Praseeda Sathaye, Vara Bonthu
Translation: cloudpilot.ai

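Many enterprises in content creation, customer service, and data analytics are building AI applications on large language models (LLMs) to deliver new product experiences to their customers.

However, these models require large amounts of storage and consume substantial compute resources, which makes configuring, deploying, and scaling them effectively on GPUs challenging. Enterprises also want low-latency, high-performance inference that is efficient and affordable.

NVIDIA introduced NVIDIA Inference Microservices (NIM) containers to help enterprises deploy LLMs. These containers use Kubernetes capabilities to simplify and accelerate LLM deployment, and they offer the following benefits:

Simplified deployment and management of AI models for developers and IT teams
Optimized performance and resource utilization on NVIDIA hardware
Control and security of AI deployments retained by the enterprise

In this post, we show how to deploy NVIDIA NIM on Amazon EKS and how to manage and scale LLMs such as Meta's Llama-3-8B on EKS. We cover prerequisites, installation, load testing, performance analysis, and observability.

Notably, the post relies on the combination of EKS and Karpenter to scale these workloads dynamically and manage them efficiently, which substantially lowers inference cost and can save more than 50% of the spend.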

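Solution overview

The solution deploys NVIDIA NIM containers serving the Llama-3-8B model on two g5.2xlarge Amazon EC2 instances for high availability. Each instance hosts one replica of the NIM container, because each g5.2xlarge instance has only one GPU. Karpenter provisions these nodes when the Llama-3-8B model is deployed through the NIM Helm chart. This setup uses resources efficiently and scales in and out dynamically with demand.

Karpenter can launch and manage Spot Instances, and g5.2xlarge Spot Instances are discounted by nearly 50% compared with On-Demand Instances.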
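As a rough illustration of what a discount of about 50% means for this two-node setup, the following sketch compares monthly On-Demand and Spot cost. The hourly price of 1.00 USD is a placeholder, not the actual g5.2xlarge price, and the function names and the 730-hour month are assumptions made for this example. It also assumes every instance runs on Spot for the whole month.

```python
def monthly_cost(hourly_price: float, instance_count: int, hours: float = 730.0) -> float:
    """Cost of running instance_count instances for one month at a flat hourly price."""
    return hourly_price * instance_count * hours


def spot_savings(on_demand_price: float, spot_discount: float, instance_count: int) -> float:
    """Monthly savings when every instance runs on Spot at the given discount (0..1)."""
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    on_demand = monthly_cost(on_demand_price, instance_count)
    spot = monthly_cost(on_demand_price * (1.0 - spot_discount), instance_count)
    return on_demand - spot


if __name__ == "__main__":
    # Placeholder price of 1.00 USD/hour; look up the real price for your Region.
    saved = spot_savings(on_demand_price=1.00, spot_discount=0.5, instance_count=2)
    print(f"Estimated monthly savings: {saved:.2f} USD")
```

Real savings vary, because Spot prices change over time and differ between Regions and Availability Zones.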

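However, a Spot Instance can be reclaimed at any time. AWS sends an interruption notice 2 minutes in advance, and after receiving it Karpenter automatically drains the node and launches replacement capacity, falling back to an On-Demand Instance when no suitable Spot capacity is available.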

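As you might expect, with dozens, hundreds, or even thousands of nodes, 2 minutes is very little time. To use Spot Instances safely and reliably, you can use the Spot Predict feature from CloudPilot AI (云妙算, cloudpilot.ai). Spot Predict uses machine learning to predict when AWS Spot Instances around the world will be interrupted and performs the fallback 60 minutes in advance, which greatly reduces the impact of Spot interruption events on your workloads. Combined with an enhanced version of Karpenter underneath, this keeps Spot usage stable.

Visit cloudpilot.ai for a free trial: two steps and 5 minutes to cut your cloud costs in half.

The Horizontal Pod Autoscaler (HPA) can scale these replicas further based on throughput or other metrics. Prometheus supplies the metrics, and Grafana is used to monitor and visualize them.

To expose the LLM endpoint, the solution uses a Kubernetes Service, the NGINX ingress controller, and a Network Load Balancer (NLB). Users send inference requests to the NLB endpoint, while the NIM pods pull container images from the NVIDIA NGC container registry and use Amazon Elastic File System (Amazon EFS) for storage shared across nodes. The following figure shows the architecture: [architecture diagram]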

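Deploying the Llama-3-8B model with NIM

To simplify deploying and managing NVIDIA NIM and the Llama-3-8B model, you can use Terraform to deploy the Data on EKS Blueprints. This infrastructure as code (IaC) approach keeps the deployment consistent and repeatable, and lays a solid foundation for scalable, maintainable model serving on EKS.

Prerequisites

Before you start the deployment, you need: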

AWS CLI
curl
Python3 and pip
kubectl
Terraform
An NVIDIA NGC account and API key

Setup

1. Configure the NGC API key

Obtain an NGC API key from NVIDIA and set it as an environment variable:

export TF_VAR_ngc_api_key=<replace-with-your-NGC-API-KEY>

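2. Install

Before deploying the blueprint, update the region in the variables.tf file to the AWS Region you want to deploy to. Also confirm that your local AWS Region setting matches it, for example by running export AWS_DEFAULT_REGION="<REGION>" with the target Region.

Next, clone the repository and run the installation script: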

git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/ai-ml/nvidia-triton-server
export TF_VAR_enable_nvidia_nim=true
export TF_VAR_enable_nvidia_triton_server=false
./install.sh

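Installation takes about 20 minutes. If it fails, try rerunning install.sh to reapply the Terraform templates.

3. Verify the installation

When the installation finishes, you can find the configure_kubectl command in the Terraform output. Run the following command to create a kubeconfig file for your cluster, replacing region-code with the Region your cluster is in.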

aws eks --region <region-code> update-kubeconfig --name nvidia-triton-server

Run the following command to check that the nim-llm pods are running:

kubectl get all -n nim

You should see output similar to the following:

NAME                               READY   STATUS    RESTARTS   AGE
pod/nim-llm-llama3-8b-instruct-0   1/1     Running   0          4h2m

NAME                                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/nim-llm-llama3-8b-instruct       ClusterIP   172.20.5.230   <none>        8000/TCP   4h2m
service/nim-llm-llama3-8b-instruct-sts   ClusterIP   None           <none>        8000/TCP   4h2m

NAME                                          READY   AGE
statefulset.apps/nim-llm-llama3-8b-instruct   1/1     4h2m

NAME                                                             REFERENCE                                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/nim-llm-llama3-8b-instruct   StatefulSet/nim-llm-llama3-8b-instruct   2/5       1         5         1          4h2m

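The llama3-8b-instruct model is deployed as a StatefulSet in the nim namespace. While it starts up, Karpenter provisions a GPU instance. To check the EC2 instances that Karpenter provisioned, run the following command: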

kubectl get node -l type=karpenter -L node.kubernetes.io/instance-type

You should see output similar to the following:

NAME                                         STATUS   ROLES    AGE     VERSION               INSTANCE-TYPE
ip-100-64-77-39.us-west-2.compute.internal   Ready    <none>   4m46s   v1.30.0-eks-036c24b   g5.2xlarge

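Testing NIM with an example prompt

For demonstration purposes, we port-forward the Kubernetes Service instead of exposing the load balancer endpoint. This lets you reach the service locally without making the NLB publicly accessible.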

kubectl port-forward -n nim service/nim-llm-llama3-8b-instruct 8000

Then open another terminal and call the deployed model over HTTP with curl:

curl -X 'POST' \
    "http://localhost:8000/v1/completions" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64
    }'

You should get output similar to the following:

{
 "id": "cmpl-xxxxxxxxxxxxxxxxxxxxxxxxx",
 "object": "text_completion",
 "created": 1719742336,
 "model": "meta/llama3-8b-instruct",
 "choices": [
   {
     "index": 0,
     "text": ", there was a young man named Jack who lived in a small village at the foot of a vast and ancient forest. Jack was a curious and adventurous soul, always eager to explore the world beyond his village. One day, he decided to venture into the forest, hoping to discover its secrets.\nAs he wandered deeper into",
     "logprobs": null,
     "finish_reason": "length",
     "stop_reason": null
   }
 ],
 "usage": {
   "prompt_tokens": 5,
   "total_tokens": 69,
   "completion_tokens": 64
 }
}

This means the deployed Llama3 model is up and running and can serve requests.
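The same request can also be sent from Python. The sketch below uses only the standard library and assumes the port-forward from the previous step is still active on localhost:8000. The helper names build_completion_request and complete are illustrative and not part of NIM.

```python
import json
import urllib.request


def build_completion_request(prompt, max_tokens=64,
                             base_url="http://localhost:8000",
                             model="meta/llama3-8b-instruct"):
    """Build the same POST request as the curl example."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64):
    """Send the request and return the text of the first choice."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the port-forward running, calling complete("Once upon a time") should return a continuation like the one in the JSON response shown earlier.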

Autoscaling with Karpenter

Now that you have verified the deployed model works, it is time to test how it scales. First, set up a test environment:

cd gen-ai/inference/nvidia-nim/nim-client

python3 -m venv .venv
source .venv/bin/activate
pip install openai

We prepared a file named prompts.txt that contains 20 prompts. Run them and verify the generated output with the following command:

python3 client.py --input-prompts prompts.txt --results-file results.txt

You should see output similar to the following:

Loading inputs from `prompts.txt`...
Model meta/llama3-8b-instruct - Request 14: 4.68s (4678.46ms)
Model meta/llama3-8b-instruct - Request 10: 6.43s (6434.32ms)
Model meta/llama3-8b-instruct - Request 3: 7.82s (7824.33ms)
Model meta/llama3-8b-instruct - Request 1: 8.54s (8540.69ms)
Model meta/llama3-8b-instruct - Request 5: 8.81s (8807.52ms)
Model meta/llama3-8b-instruct - Request 12: 8.95s (8945.85ms)
Model meta/llama3-8b-instruct - Request 18: 9.77s (9774.75ms)
Model meta/llama3-8b-instruct - Request 16: 9.99s (9994.51ms)
Model meta/llama3-8b-instruct - Request 6: 10.26s (10263.60ms)
Model meta/llama3-8b-instruct - Request 0: 10.27s (10274.35ms)
Model meta/llama3-8b-instruct - Request 4: 10.65s (10654.39ms)
Model meta/llama3-8b-instruct - Request 17: 10.75s (10746.08ms)
Model meta/llama3-8b-instruct - Request 11: 10.86s (10859.91ms)
Model meta/llama3-8b-instruct - Request 15: 10.86s (10857.15ms)
Model meta/llama3-8b-instruct - Request 8: 11.07s (11068.78ms)
Model meta/llama3-8b-instruct - Request 2: 12.11s (12105.07ms)
Model meta/llama3-8b-instruct - Request 19: 12.64s (12636.42ms)
Model meta/llama3-8b-instruct - Request 9: 13.37s (13370.75ms)
Model meta/llama3-8b-instruct - Request 13: 13.57s (13571.28ms)
Model meta/llama3-8b-instruct - Request 7: 14.90s (14901.51ms)
Storing results into `results.txt`...
Accumulated time for all requests: 206.31 seconds (206309.73 milliseconds)
PASS: NVIDIA NIM example
Actual execution time used with concurrency 20 is: 14.92 seconds (14.92 milliseconds)
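The log reports two different totals: the accumulated time is the sum of the individual request latencies, while the execution time with concurrency 20 is the wall-clock time of the whole batch. The sketch below illustrates the difference with a thread pool. It is not the client.py from the repository: time.sleep stands in for an inference request, and the function names are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(task, arg):
    """Run task(arg) and return how long it took in seconds."""
    start = time.perf_counter()
    task(arg)
    return time.perf_counter() - start


def run_concurrently(task, args, concurrency):
    """Return (accumulated time of all calls, wall-clock time of the batch)."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(lambda a: timed_call(task, a), args))
    wall = time.perf_counter() - wall_start
    return sum(durations), wall


if __name__ == "__main__":
    # 20 simulated requests of 50 ms each, all in flight at once.
    accumulated, wall = run_concurrently(time.sleep, [0.05] * 20, concurrency=20)
    print(f"accumulated: {accumulated:.2f}s, wall clock: {wall:.2f}s")
```

Because the requests overlap, the wall-clock time stays close to the slowest single request, which matches the log above (14.90 s for the slowest request, 14.92 s for the batch).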

You can inspect the generated responses in results.txt, which contains output similar to the following:

The key differences between traditional machine learning models and very large language models (vLLM) are:

1. **Scale**: vLLMs are massive, with billions of parameters, whereas traditional models typically have millions.
2. **Training data**: vLLMs are trained on vast amounts of text data, often sourced from the internet, whereas traditional models are trained on smaller, curated datasets.
3. **Architecture**: vLLMs often use transformer architectures, which are designed for sequential data like text, whereas traditional models may use feedforward networks or recurrent neural networks.
4. **Training objectives**: vLLMs are often trained using masked language modeling or next sentence prediction tasks, whereas traditional models may use classification, regression, or clustering objectives.
5. **Evaluation metrics**: vLLMs are typically evaluated using metrics like perplexity, accuracy, or fluency, whereas traditional models may use metrics like accuracy, precision, or recall.
6. **Interpretability**: vLLMs are often less interpretable due to their massive size and complex architecture, whereas traditional models may be more interpretable due to their smaller size and simpler architecture.

These differences enable vLLMs to excel in tasks like language translation, text generation, and conversational AI, whereas traditional models are better suited for tasks like image classification or regression.

=========

TensorRT (Triton Runtime) optimizes LLM (Large Language Model) inference on NVIDIA hardware by:

1. **Model Pruning**: Removing unnecessary weights and connections to reduce model size and computational requirements.
2. **Quantization**: Converting floating-point models to lower-precision integer formats (e.g., INT8) to reduce memory bandwidth and improve performance.
3. **Kernel Fusion**: Combining multiple kernel launches into a single launch to reduce overhead and improve parallelism.
4. **Optimized Tensor Cores**: Utilizing NVIDIA's Tensor Cores for matrix multiplication, which provides significant performance boosts.
5. **Batching**: Processing multiple input batches concurrently to improve throughput.
6. **Mixed Precision**: Using a combination of floating-point and integer precision to balance accuracy and performance.
7. **Graph Optimization**: Reordering and reorganizing the computation graph to minimize memory access and optimize data transfer.

By applying these optimizations, TensorRT can significantly accelerate LLM inference on NVIDIA hardware, achieving faster inference times and improved performance.

=========

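You may still see only one pod, because the current pod can still handle the incoming load. To increase the load further, add the --iterations flag to the command with the number of iterations to run. For example, to run 5 iterations: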

python3 client.py \
  --input-prompts prompts.txt \
  --results-file results.txt \
  --iterations 5

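You can also repeat the run several times. Meanwhile, use the following command to watch for new pods; they take some time to become ready after starting: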

kubectl get po,hpa -n nim

After a while you may see output similar to the following:

NAME                               READY   STATUS    RESTARTS   AGE
pod/nim-llm-llama3-8b-instruct-0   1/1     Running   0          35m
pod/nim-llm-llama3-8b-instruct-1   1/1     Running   0          7m39s
pod/nim-llm-llama3-8b-instruct-2   1/1     Running   0          7m39s
pod/nim-llm-llama3-8b-instruct-3   1/1     Running   0          7m39s

NAME                                                             REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/nim-llm-llama3-8b-instruct   StatefulSet/nim-llm   18/5      1         5         4          9d

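An HPA resource named nim-llm-llama3-8b-instruct is deployed together with the nim-llm Helm chart. Autoscaling is driven by the num_requests_running metric that NIM exposes. We preconfigured the Prometheus Adapter so that the HPA can consume this custom metric, which lets the NIM pods scale automatically with real-time demand.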

$ kubectl describe hpa nim-llm-llama3-8b-instruct -n nim

…
Reference:                         StatefulSet/nim-llm-llama3-8b-instruct
Metrics:                           ( current / target )
  "num_requests_running" on pods:  1 / 5
Min replicas:                      1
Max replicas:                      5
Behavior:
  Scale Up:
    Stabilization Window: 0 seconds
    Select Policy: Max
    Policies:
      - Type: Pods     Value: 4    Period: 15 seconds
      - Type: Percent  Value: 100  Period: 15 seconds
  Scale Down:
    Stabilization Window: 300 seconds
    Select Policy: Max
    Policies:
      - Type: Percent  Value: 100  Period: 15 seconds
StatefulSet pods:      4 current / 4 desired
…
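The replica counts can be checked against the scaling rule in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). With one replica, an average num_requests_running of 18, and a target of 5, as in the earlier kubectl get output, the rule gives 4 replicas. The sketch below is a simplified model that ignores pod readiness, missing metrics, and the stabilization window; the function names are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA rule; current_metric is the per-pod average value."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def scale_up_limit(current_replicas, pods_policy=4, percent_policy=100):
    """Upper bound per 15-second period under the two scale-up policies above.

    With Select Policy "Max", the policy that allows the larger change wins.
    """
    by_pods = current_replicas + pods_policy
    by_percent = math.ceil(current_replicas * (1 + percent_policy / 100))
    return max(by_pods, by_percent)


if __name__ == "__main__":
    print(desired_replicas(1, 18, 5, min_replicas=1, max_replicas=5))
    print(scale_up_limit(1))
```

Under these assumptions the jump from 1 to 4 replicas fits within a single scale-up period, since the policies allow up to 5 replicas from a starting point of 1.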

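At the instance level, Karpenter automatically launches instances for you when pods are unschedulable and match a NodePool definition. GPU instances (g5) were launched for the NIM pods because the NodePool is configured as follows: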

nodePool:
  labels:
    - type: karpenter
    - NodeGroupType: g5-gpu-karpenter
  taints:
    - key: nvidia.com/gpu
      value: "Exists"
      effect: "NoSchedule"
  requirements:
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["g5"]
    - key: "karpenter.k8s.aws/instance-size"
      operator: In
      values: [ "2xlarge", "4xlarge", "8xlarge", "16xlarge", "12xlarge", "24xlarge"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]

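Karpenter lets you define a flexible range of instance specifications instead of being tied to fixed instance types. When both Spot and On-Demand are allowed, Karpenter prioritizes Spot Instances using the price-capacity-optimized allocation strategy. The strategy first identifies the pools that are least likely to be interrupted in the near term, and then requests Spot Instances from the lowest-priced of those pools.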
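To make the NodePool requirements concrete, the sketch below checks whether a given instance type satisfies them. It is a simplified illustration of how In requirements filter candidates, not Karpenter's implementation. The helper names are hypothetical, and deriving family and size by splitting the instance type name is an assumption that holds for names like g5.2xlarge.

```python
# Requirements copied from the NodePool configuration above.
REQUIREMENTS = [
    {"key": "karpenter.k8s.aws/instance-family", "operator": "In", "values": ["g5"]},
    {"key": "karpenter.k8s.aws/instance-size", "operator": "In",
     "values": ["2xlarge", "4xlarge", "8xlarge", "16xlarge", "12xlarge", "24xlarge"]},
    {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
    {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["spot", "on-demand"]},
]


def labels_for(instance_type, arch, capacity_type):
    """Derive the labels used in the requirements from an instance type name."""
    family, size = instance_type.split(".", 1)
    return {
        "karpenter.k8s.aws/instance-family": family,
        "karpenter.k8s.aws/instance-size": size,
        "kubernetes.io/arch": arch,
        "karpenter.sh/capacity-type": capacity_type,
    }


def satisfies(labels, requirements):
    """Return True when the labels meet every In / NotIn requirement."""
    for req in requirements:
        value = labels.get(req["key"])
        if req["operator"] == "In":
            if value not in req["values"]:
                return False
        elif req["operator"] == "NotIn":
            if value in req["values"]:
                return False
        else:
            raise ValueError(f"unsupported operator: {req['operator']}")
    return True


if __name__ == "__main__":
    for name in ["g5.2xlarge", "g5.xlarge", "p4d.24xlarge"]:
        print(name, satisfies(labels_for(name, "amd64", "spot"), REQUIREMENTS))
```

Here g5.2xlarge passes, while g5.xlarge fails on size and p4d.24xlarge fails on family.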

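Observability

To monitor the deployment, we use the Prometheus stack, which includes the Prometheus server and Grafana.

First, verify the services deployed by the Kube Prometheus stack with the following command: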

kubectl get svc -n kube-prometheus-stack

The command lists the following services:

NAME                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
kube-prometheus-stack-grafana                    ClusterIP   172.20.225.77    <none>        80/TCP              10m
kube-prometheus-stack-kube-state-metrics         ClusterIP   172.20.237.248   <none>        8080/TCP            10m
kube-prometheus-stack-operator                   ClusterIP   172.20.118.163   <none>        443/TCP             10m
kube-prometheus-stack-prometheus                 ClusterIP   172.20.132.214   <none>        9090/TCP,8080/TCP   10m
kube-prometheus-stack-prometheus-node-exporter   ClusterIP   172.20.213.178   <none>        9100/TCP            10m
prometheus-adapter                               ClusterIP   172.20.171.163   <none>        443/TCP             10m
prometheus-operated                              ClusterIP   None             <none>        9090/TCP            10m

The NVIDIA NIM LLM service exposes metrics at the /metrics endpoint of the nim-llm service on port 8000. Run the following commands to verify:

kubectl get svc -n nim
kubectl port-forward -n nim svc/nim-llm-llama3-8b-instruct 8000

Open another terminal and run:

curl localhost:8000/metrics

You should see many metrics in Prometheus format exposed by the NIM service, such as num_requests_running and time_to_first_token_seconds.
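To pick a single value out of that output, a few lines of Python are enough. The sketch below is a minimal parser for the Prometheus text format that assumes lines carry no trailing timestamp. The sample text, including its label and value, is made up for illustration and is not captured from a real NIM endpoint.

```python
# Made-up sample in Prometheus text format; real NIM output has many more series.
SAMPLE = """\
# HELP num_requests_running Number of requests currently running.
# TYPE num_requests_running gauge
num_requests_running{model_name="meta/llama3-8b-instruct"} 3.0
"""


def parse_metrics(text):
    """Map each metric name to the list of sample values found in the text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, if any
        samples.setdefault(name, []).append(float(value))
    return samples


if __name__ == "__main__":
    print(parse_metrics(SAMPLE)["num_requests_running"])
```

To use it against the live service, pass in the body returned by the curl command above instead of SAMPLE.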

Grafana dashboard

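We set up a preconfigured Grafana dashboard that displays several key metrics:

Time to first token (TTFT): the latency between sending the initial inference request to the model and receiving the first token.
Inter-token latency (ITL): the latency between each token after the first one.
Total throughput: the total number of tokens NIM generates per second.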
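As a worked example of these definitions, the sketch below derives TTFT, ITL, and tokens per second for a single request from token arrival timestamps. The timestamps are invented for illustration and the function name is an assumption. The dashboard's total throughput aggregates over all requests, which this single-request sketch does not do.

```python
def latency_metrics(request_sent, token_times):
    """Return (ttft, mean_itl, tokens_per_second) for one request.

    request_sent: time the request was sent, in seconds
    token_times:  arrival time of each generated token, in seconds
    """
    if not token_times:
        raise ValueError("token_times must not be empty")
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    total_time = token_times[-1] - request_sent
    return ttft, mean_itl, len(token_times) / total_time


if __name__ == "__main__":
    # Hypothetical timeline: first token after 0.5 s, then one token every 0.25 s.
    ttft, itl, tps = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
    print(f"TTFT={ttft:.2f}s ITL={itl:.2f}s throughput={tps:.2f} tokens/s")
```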

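For details on these metrics, see the NVIDIA documentation.

To view the Grafana dashboard, see the link at the end of this post.

Performance testing with NVIDIA GenAI-Perf

GenAI-Perf is a command-line tool for measuring the throughput and latency of generative AI models as they are served through an inference server. It is a standard benchmarking tool that can be used to compare the performance of different models deployed on an inference server.

To simplify testing, especially because the tool requires a GPU, we provide a preconfigured manifest, genaiperf-deploy.yaml, that lets you deploy and run GenAI-Perf in your own environment. With this setup you can quickly evaluate your AI model's performance and confirm that it meets your latency and throughput requirements.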

cd gen-ai/inference/nvidia-nim
kubectl apply -f genaiperf-deploy.yaml

When the pod is ready (READY shows 1/1), first run the following commands to open a shell in the pod:

export POD_NAME=$(kubectl get po -l app=tritonserver -ojsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD_NAME -- bash

Then run the following command to test the deployed NIM Llama3 model:

genai-perf \
  -m meta/llama3-8b-instruct \
  --service-kind openai \
  --endpoint v1/completions \
  --endpoint-type completions \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --tokenizer hf-internal-testing/llama-tokenizer \
  --concurrency 10 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url nim-llm-llama3-8b-instruct.nim:8000

You should get results similar to the following:

2024-07-18 07:11 [INFO] genai_perf.parser:166 - Model name 'meta/llama3-8b-instruct' cannot be used to create artifact directory. Instead, 'meta_llama3-8b-instruct' will be used.
2024-07-18 07:12 [INFO] genai_perf.wrapper:137 - Running Perf Analyzer : 'perf_analyzer -m meta/llama3-8b-instruct --async --input-data artifacts/meta_llama3-8b-instruct-openai-completions-concurrency10/llm_inputs.json --endpoint v1/completions --service-kind openai -u nim-llm.nim:8000 --measurement-interval 4000 --stability-percentage 999 --profile-export-file artifacts/meta_llama3-8b-instruct-openai-completions-concurrency10/my_profile_export.json -i http --concurrency-range 10'
                                                      LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃            Statistic ┃           avg ┃           min ┃           max ┃           p99 ┃           p90 ┃           p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Request latency (ns) │ 3,946,813,567 │ 3,917,276,037 │ 3,955,037,532 │ 3,955,012,078 │ 3,954,685,886 │ 3,954,119,635 │
│     Num output token │           112 │           105 │           119 │           119 │           117 │           115 │
│      Num input token │           200 │           200 │           200 │           200 │           200 │           200 │
└──────────────────────┴───────────────┴───────────────┴───────────────┴───────────────┴───────────────┴───────────────┘
Output token throughput (per sec): 284.85
Request throughput (per sec): 2.53

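You can review the metrics GenAI-Perf collects, such as request latency, output token throughput, and request throughput. For details on the available GenAI-Perf command-line options, see the official documentation.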
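The reported figures can be sanity-checked against each other. With a fixed number of requests in flight, request throughput is roughly concurrency divided by average latency (Little's law). The sketch below applies this to the numbers in the table above; the function names are illustrative, and the check is an approximation rather than how GenAI-Perf computes its results.

```python
def ns_to_seconds(nanoseconds):
    """Convert a latency reported in nanoseconds to seconds."""
    return nanoseconds / 1e9


def expected_request_throughput(concurrency, avg_latency_s):
    """Little's law: requests in flight divided by average time per request."""
    return concurrency / avg_latency_s


if __name__ == "__main__":
    avg_latency_s = ns_to_seconds(3_946_813_567)  # avg request latency from the table
    requests_per_s = expected_request_throughput(10, avg_latency_s)
    tokens_per_s = requests_per_s * 112           # avg output tokens from the table
    print(f"latency={avg_latency_s:.2f}s requests/s={requests_per_s:.2f} "
          f"tokens/s={tokens_per_s:.0f}")
```

The estimate of about 2.53 requests per second matches the reported request throughput, and about 284 output tokens per second is close to the reported 284.85.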

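Conclusion

This post walked through deploying NVIDIA NIM with the Llama-3-8B model on EKS, using Karpenter and AWS services such as Amazon EFS and Elastic Load Balancing to build a scalable, cost-effective infrastructure. Karpenter's dynamic scaling of worker nodes allocates resources efficiently based on demand. We also benchmarked performance with NVIDIA's GenAI-Perf tool to show what the system can do.

To simplify deployment, Data on EKS provides ready-to-deploy IaC templates that let organizations stand up their infrastructure within hours.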

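About the company

CloudPilot AI (云妙算, CloudPilot.ai) is a leading global provider of managed Karpenter cloud services. It helps enterprises maximize cloud resource utilization through intelligent, automated scheduling and orchestration of cloud resources. Guided by our mission to "make every cent spent on the cloud worth more", we raise customers' resource efficiency tenfold while cutting cloud costs by more than 50%.

Karpenter currently runs in production at more than 500 well-known companies worldwide, including Adidas, Anthropic, Slack, and Figma. CloudPilot AI has served dozens of leading technology companies, saving customers more than 300,000 USD in total, with an average saving of 67%. Choose CloudPilot AI to make every dollar you spend smarter.

Free trial: 2 steps and 5 minutes to lower your cloud costs by 50%: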
cloudpilot.ai

Posted on 2024-10-24 14:18 by CloudPilotAI
