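Infrastructure and Compute Management Layer (L1)
The unified base layer covering GPU/NPU clusters, Kubernetes orchestration, distributed storage, high-performance networking, and monitoring and alerting.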
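2. Layer Positioning and Boundaries
L1 is the lowest layer of the AI foundation platform. It abstracts all physical hardware into a compute pool that is programmable, schedulable, and observable. To the layer above (L2, the model deployment layer) it exposes standardized K8s and IaaS APIs, hiding hardware heterogeneity and operational complexity.
3. Boundary Specification
Not provided: knowledge of model weight paths, inference engine configuration files, model version management APIs, business-layer authorization policies.
Not responsible for: hardware procurement, physical racking, hardware repair (RMA), or physical facilities such as data-center cooling and power. These belong to the physical operations team (Data Center Ops).
Compute ↔ Network: interaction goes through the CNI plugins (Calico + Multus), which give each Pod its own network namespace.
Monitoring ↔ all modules: every submodule exposes a Prometheus metrics endpoint, scraped centrally by the Prometheus server.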
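4. Core Modules
4.1 GPU/NPU Compute Management
4.1.1 Heterogeneous Accelerator Management
L1 must manage heterogeneous accelerators from multiple vendors and generations in a unified way. Each GPU node reports its accelerator details to K8s through standardized node labels.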
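| Accelerator | Key specifications | Ecosystem / role |
|---|---|---|
| NVIDIA A100-SXM-80GB | 3rd-gen Tensor Cores · MIG (up to 7 instances) · NVLink 3.0 at 600 GB/s · PCIe 4.0 · 80 GB HBM2e · 312 TFLOPS (FP16) | Ampere architecture · current mainstay |
| NVIDIA H100-SXM-80GB | 4th-gen Tensor Cores · Transformer Engine · MIG (up to 7 instances) · NVLink 4.0 at 900 GB/s · PCIe 5.0 · 1979 TFLOPS (FP8) | Hopper architecture · high priority |
| AMD MI300X | CDNA 3 architecture · 192 GB HBM3 · 5.3 TB/s memory bandwidth · 896 GB/s Infinity Fabric · 1307 TFLOPS (FP16) | ROCm ecosystem |
| Huawei Ascend 910B | DaVinci architecture · 64 GB HBM2e · HCCS interconnect (392 GB/s) · CANN software stack · 320 TFLOPS (FP16) | CANN driver · domestic |
| Cambricon MLU370-S4 | MLUarch03 architecture · 24 GB GDDR6 · 256 TOPS (INT8) · Cambricon Neuware software stack | MLU ecosystem · domestic |
| Iluvatar CoreX Tiangai 100 (天垓100) | 7 nm general-purpose GPU · 32 GB HBM2e · 147 TFLOPS (FP16) · supports PyTorch/TensorFlow | Iluvatar CoreX |
4.1.2 GPU Node Label Specification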
All GPU nodes report their compute capabilities to the cluster through standardized labels. This specification underpins resource scheduling and quota management.
# Example labels for an NVIDIA A100 node
apiVersion: v1
kind: Node
metadata:
  labels:
    accelerator: "nvidia"
    accelerator-model: "A100-SXM-80GB"
    accelerator-memory: "80Gi"
    accelerator-count: "8"
    accelerator-topology: "nvlink-fullmesh"
    accelerator-mig-enabled: "true"
    # Label values cannot contain commas, so profiles are joined with "_"
    accelerator-mig-profiles: "1g.10gb_2g.20gb_3g.40gb_7g.80gb"
    accelerator-driver-version: "550.54.15"
    accelerator-cuda-version: "12.4"
    node-type: "gpu-compute"
    network.bandwidth: "100Gbps"
    network.interface: "mlx5_0"
    storage.local-nvme: "3.5Ti"
---
# Example labels for a Huawei Ascend 910B node
apiVersion: v1
kind: Node
metadata:
  labels:
    accelerator: "huawei-ascend"
    accelerator-model: "Ascend-910B"
    accelerator-memory: "64Gi"
    accelerator-count: "8"
    accelerator-topology: "hccs-ring"
    accelerator-driver-version: "23.0.rc1"
    accelerator-cann-version: "7.0.0"
    node-type: "gpu-compute"
    network.bandwidth: "100Gbps"
    storage.local-nvme: "1.8Ti"
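Because scheduling and quota logic both depend on these labels, it can help to lint them before they reach the cluster. The sketch below checks values against the Kubernetes label-value syntax (at most 63 characters, alphanumeric at both ends, with `-`, `_`, `.` allowed in between). The function names and the sample dictionary are illustrative assumptions, not part of the platform.

```python
import re

# Kubernetes label-value syntax: empty, or at most 63 characters that start and
# end with an alphanumeric and otherwise contain only alphanumerics, "-", "_", ".".
_LABEL_VALUE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """Return True if the string is acceptable as a Kubernetes label value."""
    return len(value) <= 63 and _LABEL_VALUE.fullmatch(value) is not None


def invalid_labels(labels: dict[str, str]) -> list[str]:
    """Return the sorted keys whose values violate the label-value syntax."""
    return sorted(k for k, v in labels.items() if not is_valid_label_value(v))


# Hypothetical sample: a comma-separated profile list is not a legal label value.
sample = {
    "accelerator-model": "A100-SXM-80GB",
    "accelerator-memory": "80Gi",
    "accelerator-mig-profiles": "1g.10gb,2g.20gb",
}
print(invalid_labels(sample))
```

Values that need richer structure, such as lists, are usually better carried in annotations or joined with a permitted separator.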
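4.1.3 GPU Virtualization
The platform supports three GPU sharing/virtualization schemes, chosen per workload:
| Scheme | Technology | Isolation granularity | Typical use |
|---|---|---|---|
| MIG (Multi-Instance GPU) | Native on NVIDIA A100/H100 | Hardware-level isolation of memory, cache, and compute units | Production multi-tenant inference, SLA-sensitive |
| vGPU (Time-Slicing) | NVIDIA vGPU Manager | Time-slice rotation, dedicated GPU memory | Virtual desktops, development and debugging environments |
| MPS (Multi-Process Service) | NVIDIA CUDA MPS | Shared CUDA context, shared GPU memory | Small-batch inference, training data preprocessing |
4.1.4 Topology-Aware Scheduling
Multi-GPU training performance depends heavily on the interconnect topology between GPUs. L1 implements topology-aware scheduling with Volcano plus a custom scheduler plugin: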
# Topology-aware GPU scheduling example: Pod template
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
  annotations:
    # Topology constraint: require all 8 GPUs within a single NUMA domain
    scheduling.volcano.sh/topology-hint: "numa-single"
    # PCIe switch affinity
    scheduling.volcano.sh/pcie-switch-affinity: "required"
    # NVLink domain restriction
    scheduling.volcano.sh/nvlink-domain: "full-mesh"
spec:
  containers:
    - name: trainer
      image: pytorch:2.3.0-cuda12.4
      resources:
        requests:
          nvidia.com/gpu: 8
        limits:
          nvidia.com/gpu: 8
  affinity:
    nodeAffinity:
      # Full field name required by the Pod API
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator-topology
                operator: In
                values: ["nvlink-fullmesh"]
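4.1.5 Fault Detection and Automatic Isolation
GPU faults are among the most common failure types in distributed training. L1 builds a multi-layer fault detection system:
DCGM (Data Center GPU Manager) samples fault signals such as XID errors, ECC errors, and GPU hangs every 5 seconds. The Node Problem Detector configuration is shown below. The rule and action fields use this platform's own schematic format: upstream Node Problem Detector only reports node conditions and events, so the cordon, drain, and alert actions are assumed to be carried out by a separate remediation controller.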
# Node Problem Detector GPU 故障规则
apiVersion: v1
kind: ConfigMap
metadata:
name: node-problem-detector-config
namespace: kube-system
data:
gpu-problem.yaml: |
conditions:
- type: "GPUError"
reason: "GPUCardFailure"
message: "GPU card is experiencing hardware failures"
rules:
- type: "permanent"
condition: "XidError"
metric: "dcgm_gpu_xid_errors_total"
threshold: 1
window: 5m
actions:
- action: cordon
- action: drain
timeout: 30m
- action: alert
severity: critical
channel: "feishu"
4.1.6 Resource Pooling and Quota Management
Volcano Queues provide multi-tenant GPU pooling and quota management:
# Volcano Queue: GPU resource pools split by team
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ai-platform-team
spec:
  weight: 10
  capability:
    nvidia.com/gpu: "128"
    cpu: "512"
    memory: "4096Gi"
  reclaimable: true
  guarantee:
    # Volcano expresses guarantees as absolute resource quantities
    resource:
      nvidia.com/gpu: "64"   # 50% of 128: at least 64 GPUs guaranteed
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research-lab
spec:
  weight: 5
  capability:
    nvidia.com/gpu: "32"
    cpu: "128"
    memory: "1024Gi"
  reclaimable: false   # dedicated queue, resources cannot be reclaimed
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: inference-prod
spec:
  weight: 20
  capability:
    nvidia.com/gpu: "256"
    cpu: "1024"
    memory: "4096Gi"
  reclaimable: false
  guarantee:
    resource:
      nvidia.com/gpu: "204"  # about 80% of 256 GPUs guaranteed for production inference
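The percentage guarantees quoted for the queues translate into absolute GPU counts. The sketch below does that arithmetic and checks whether the guarantees fit a cluster of a given size. The helper names and the 416-GPU cluster size are assumptions chosen for illustration (416 is simply the sum of the three queue capabilities).

```python
def guaranteed_gpus(capability: int, percentage: int) -> int:
    """Whole GPUs covered by a percentage guarantee, rounded down."""
    return capability * percentage // 100


def guarantees_fit(guarantees: dict[str, int], cluster_gpus: int) -> bool:
    """Guarantees can only all be honoured if their sum fits in the cluster."""
    return sum(guarantees.values()) <= cluster_gpus


guarantees = {
    "ai-platform-team": guaranteed_gpus(128, 50),  # 64
    "inference-prod": guaranteed_gpus(256, 80),    # 204 (204.8 rounded down)
}
print(guarantees, guarantees_fit(guarantees, cluster_gpus=416))
```

Rounding down keeps the guarantee from promising a fractional GPU that cannot be allocated.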
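4.2 Container Orchestration Platform
4.2.1 Kubernetes Base Cluster
Kubernetes 1.28+ is the unified base for resource scheduling and container orchestration. All GPU nodes, storage services, and network components are managed as native K8s resources. The cluster runs at least 3 control plane nodes and stores cluster state in an etcd cluster.
| Component | Version | Notes |
|---|---|---|
| Kubernetes | 1.28+ | Unified scheduling base, using the Containerd runtime |
| etcd | 3.5 | Cluster state store, deployed on the control plane nodes |
| CoreDNS | 1.11 | Cluster DNS service |
| Containerd | 1.7+ | Container runtime with GPU runtime extension support |
| Helm | 3.14 | Package manager; all components are deployed through Helm charts |
4.2.2 Volcano: Batch Scheduling and Queue Management
Volcano is a Kubernetes-native batch computing engine that provides the Gang Scheduling, queue management, fair scheduling, and resource reservation that AI/ML workloads need.
| Feature | Description | Example use |
|---|---|---|
| Gang Scheduling | All Pods are allocated together (all-or-nothing) | Distributed training where 8 or 16 GPUs must be ready at once |
| Queue management | Resource pools split by team or project | 128 GPUs for the AI platform team, 32 for the research group |
| Fair scheduling | DRF (Dominant Resource Fairness) | Multi-tenant clusters, avoiding resource starvation |
| Preemption | High-priority jobs preempt low-priority jobs | Production inference preempting offline training |
| Resource reservation | Reservation mechanism | Reserving cluster resources for important jobs |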
# Volcano Gang Scheduling configuration: 8-GPU distributed training
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-train-8gpu
spec:
  schedulerName: volcano
  minAvailable: 8          # Gang Scheduling: at least 8 Pods must be schedulable together
  queue: ai-platform-team
  tasks:
    - replicas: 8
      name: trainer
      template:
        spec:
          containers:
            - name: pytorch-train
              image: pytorch:2.3.0-cuda12.4
              command: ["torchrun", "--nnodes=8", "train.py"]
              resources:
                requests:
                  nvidia.com/gpu: 1
                  cpu: 16
                  memory: "64Gi"
                limits:
                  nvidia.com/gpu: 1
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      # Volcano labels a Job's Pods with volcano.sh/job-name
                      - key: volcano.sh/job-name
                        operator: In
                        values: ["distributed-train-8gpu"]
                  topologyKey: kubernetes.io/hostname   # at most one Pod of this Job per node
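The all-or-nothing behaviour of Gang Scheduling, combined with one Pod per node, can be reasoned about with a small admission check. This is a simplified sketch, not Volcano's actual algorithm; the function name and the node inventory are hypothetical.

```python
from typing import Optional


def gang_placement(
    free_gpus_by_node: dict[str, int],
    replicas: int,
    gpus_per_pod: int,
    min_available: int,
) -> Optional[list[str]]:
    """Pick one node per Pod, or return None if the gang cannot start.

    Hostname anti-affinity means every eligible node hosts at most one Pod,
    so the gang needs at least `min_available` distinct eligible nodes.
    """
    eligible = sorted(
        node for node, free in free_gpus_by_node.items() if free >= gpus_per_pod
    )
    if len(eligible) < min_available:
        return None  # all-or-nothing: nothing is bound, the job keeps waiting
    return eligible[:replicas]


# Hypothetical inventory: seven nodes with a free GPU, one fully occupied.
nodes = {f"gpu-node-{i:02d}": 1 for i in range(1, 8)}
nodes["gpu-node-08"] = 0
print(gang_placement(nodes, replicas=8, gpus_per_pod=1, min_available=8))
```

With only seven eligible nodes the check returns `None`, which mirrors why a partially schedulable training job holds no GPUs while it waits.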
4.2.3 HAMi: GPU Sharing Scheduler
HAMi (formerly k8s-vGPU-scheduler) adds device-level GPU sharing on top of Volcano, with fine-grained partitioning of both GPU memory and compute (SM utilization):
# HAMi GPU 共享配置
apiVersion: v1
kind: ConfigMap
metadata:
name: hami-device-config
namespace: kube-system
data:
config.yaml: |
nodes:
- selector:
accelerator-model: "A100-SXM-80GB"
devices:
- index: 0
type: nvidia.com/gpu
memory: 81920 # MB
cores: 100 # SM 百分比
- index: 1
type: nvidia.com/gpu
memory: 81920
cores: 100
# 显存虚拟化策略
memory:
enable: true
strategy: "isolation" # isolation | sharing
# 算力限制
cores:
enable: true
---
# Pod sharing a GPU through HAMi
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
    - name: inference
      image: nvcr.io/nvidia/tritonserver:24.02-py3
      resources:
        # HAMi takes the slice size from extended resources; extended resources must be set under limits
        limits:
          nvidia.com/gpu: 1          # one physical GPU, only partly used
          nvidia.com/gpumem: 16384   # 16 GiB of GPU memory, in MiB
          nvidia.com/gpucores: 30    # 30% of SM compute
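When a GPU is split on two dimensions, the tighter one decides how many identical Pods fit on a card. The sketch below works this out for the 16 GiB / 30% slice on an 80 GiB device; the function name is an illustrative assumption and the calculation ignores any memory HAMi or the driver reserves.

```python
def max_slices_per_gpu(gpu_mem_mib: int, slice_mem_mib: int, slice_cores_pct: int) -> int:
    """Number of identical slices one GPU can host, limited by memory and SM share."""
    by_memory = gpu_mem_mib // slice_mem_mib
    by_cores = 100 // slice_cores_pct
    return min(by_memory, by_cores)


# 81920 MiB card, 16384 MiB and 30% of SMs per Pod.
print(max_slices_per_gpu(81920, 16384, 30))  # memory allows 5, compute allows 3
```

Here compute is the binding constraint, so a fourth Pod would need a smaller core share even though memory is still available.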
4.2.4 NVIDIA GPU Operator
NVIDIA GPU Operator 24.3+ automates driver installation, container runtime configuration, DCGM Exporter deployment, and device plugin registration on GPU nodes:
# Deploy the GPU Operator with Helm
helm install gpu-operator nvidia/gpu-operator \
  --namespace kube-system \
  --set driver.enabled=true \
  --set driver.version="550.54.15" \
  --set toolkit.enabled=true \
  --set toolkit.version="v1.16.0-ubuntu22.04" \
  --set devicePlugin.enabled=true \
  --set migManager.enabled=true \
  --set migManager.default=none \
  --set dcgmExporter.enabled=true \
  --set dcgmExporter.collectInterval=5s
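4.2.5 Karmada: Multi-Cluster Management (Optional)
For cross-data-center or hybrid-cloud deployments, Karmada provides K8s-native multi-cluster management and federated resource scheduling: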
# Karmada 多集群分发策略
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
name: gpu-workload-propagation
spec:
resourceSelectors:
- apiVersion: batch.volcano.sh/v1alpha1
kind: Job
placement:
clusterAffinity:
clusterNames:
- dc-shanghai-gpu
- dc-beijing-gpu
clusterTolerations:
- key: "gpu-type"
operator: "Equal"
value: "A100"
spreadConstraints:
- spreadByField: "cluster"
maxGroups: 1 # 每个 Job 只调度到一个集群
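4.2.6 Istio: East-West Traffic Governance
Istio 1.21+ manages east-west traffic between inference services, providing traffic splitting for canary releases, metrics collection, and mTLS: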
# Istio VirtualService:推理服务灰度
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: inference-gray
spec:
hosts:
- llm-service
http:
- match:
- headers:
x-version:
exact: v2
route:
- destination:
host: llm-service-v2
- route:
- destination:
host: llm-service-v1
weight: 90
- destination:
host: llm-service-v2
weight: 10
4.2.7 KEDA: GPU-Aware Autoscaling
KEDA scales the inference Deployment on GPU utilization and request rate through Prometheus triggers:
# KEDA ScaledObject: Pod autoscaling driven by GPU utilization and QPS
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-inference-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: llm-inference-deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      # Value: the query already returns an average, so it is compared with the threshold directly
      metricType: Value
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        # The Prometheus scaler expects a single-element result, so there is no "by (...)" grouping
        query: |
          avg(avg_over_time(
            DCGM_FI_DEV_GPU_UTIL{namespace="inference"}[2m]
          ))
        threshold: "70"            # scale out above 70% GPU utilization
        activationThreshold: "10"
    - type: prometheus
      # AverageValue: the total is divided by the replica count before comparison
      metricType: AverageValue
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        query: |
          sum(rate(
            istio_requests_total{destination_service_namespace="inference"}[1m]
          ))
        threshold: "100"           # scale out above 100 QPS per replica
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 50
              periodSeconds: 60
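KEDA hands its triggers to the Horizontal Pod Autoscaler, which sizes the Deployment differently for the `Value` and `AverageValue` metric types. The sketch below shows both formulas and the min/max clamp. It ignores the HPA tolerance band and stabilization window, and the sample numbers are hypothetical.

```python
import math


def replicas_for_value(current: int, metric: float, target: float) -> int:
    """`Value` target: scale the current replica count by metric / target."""
    return math.ceil(current * metric / target)


def replicas_for_average_value(metric_total: float, target_per_replica: float) -> int:
    """`AverageValue` target: the total metric is shared across replicas."""
    return math.ceil(metric_total / target_per_replica)


def clamp(replicas: int, min_replicas: int, max_replicas: int) -> int:
    """Apply minReplicaCount / maxReplicaCount."""
    return max(min_replicas, min(max_replicas, replicas))


# 4 replicas at 85% average GPU utilization against a 70% target.
by_gpu = replicas_for_value(current=4, metric=85, target=70)
# 950 requests/s in total against 100 requests/s per replica.
by_qps = replicas_for_average_value(metric_total=950, target_per_replica=100)
# With several triggers, the largest proposal wins.
print(clamp(max(by_gpu, by_qps), min_replicas=2, max_replicas=20))
```

An averaged utilization paired with `AverageValue` would be divided by the replica count a second time, which is why an average-style query generally belongs with `Value`.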
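4.3 Storage System
Data on the AI platform is clearly tiered: each data type has its own access pattern, throughput requirement, and durability need. L1 uses a three-tier storage architecture (object, file, block), complemented by node-local NVMe cache, with each tier serving a different access scenario:
4.3.1 Object Storage: Model Weights and Data Lake
MinIO (for high-performance use) and Ceph RGW (for very large scale) provide S3-compatible object storage: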
# MinIO Tenant 配置
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
name: ai-platform-minio
namespace: storage
spec:
image: quay.io/minio/minio:RELEASE.2024-06-11T01-11-02Z
pools:
- servers: 4
volumesPerServer: 8
volumeClaimTemplate:
metadata:
name: data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 8Ti
storageClassName: ceph-rbd-ssd
mountPath: /export
credsSecret:
name: minio-creds-secret
serviceMetadata:
consoleService:
type: ClusterIP
minioService:
type: ClusterIP
console:
image: quay.io/minio/console:v0.44.0
bucketDedup: true
buckets:
- name: model-weights
region: us-east-1
quota: 100Ti
- name: training-data
region: us-east-1
quota: 500Ti
- name: checkpoints
region: us-east-1
quota: 200Ti
- name: containers
region: us-east-1
quota: 50Ti
4.3.2 File Storage: Shared Training Data
JuiceFS mounts object storage as a POSIX file system, giving training jobs high-throughput data reads:
# JuiceFS CSI 驱动存储类
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: juicefs-cache
provisioner: csi.juicefs.com
parameters:
csi.storage.k8s.io/node-publish-secret-name: juicefs-secret
csi.storage.k8s.io/node-publish-secret-namespace: storage
juicefs/minio-server: "http://minio.storage:9000"
juicefs/bucket: "training-data"
juicefs/cache-size: "51200" # 50GB 本地缓存
juicefs/cache-dir: "/mnt/nvme/juicefs-cache"
juicefs/cache-evict-threads: "8"
juicefs/atime-mode: "noatime" # 禁用 atime 提升性能
juicefs/meta-cache-ttl: "3600"
reclaimPolicy: Retain
---
# PVC 声明示例
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: training-data-pvc
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Ti
storageClassName: juicefs-cache
4.3.3 Block Storage: Database Persistence
Ceph RBD provides high-performance block storage for stateful services such as etcd, MySQL, and ES:
# Ceph RBD StorageClass (SSD pool)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-ssd
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: "ceph-ai-platform"
  pool: "ssd-pool"
  imageFeatures: "layering"
  csi.storage.k8s.io/fstype: "ext4"
  csi.storage.k8s.io/controller-expand-secret-name: csi-ceph-secret
  csi.storage.k8s.io/node-stage-secret-name: csi-ceph-secret
  mounter: "rbd-nbd"   # use NBD to support expansion
allowVolumeExpansion: true
reclaimPolicy: Delete
---
# Ceph RBD StorageClass (HDD pool)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-hdd
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: "ceph-ai-platform"
  pool: "hdd-pool"
  imageFeatures: "layering"
  csi.storage.k8s.io/fstype: "xfs"
  mounter: "rbd"
allowVolumeExpansion: true
reclaimPolicy: Delete
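4.3.4 Local Cache: Lazy Loading with Stargz Snapshotter
GPU nodes carry NVMe SSDs (3.5 TB to 7 TB) as a local cache tier. Stargz Snapshotter lazy-loads container images, which substantially shortens startup for model-serving Pods: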
# Containerd 配置 Stargz Snapshotter
version = 2
[plugins]
[plugins."io.containerd.snapshotter.v1.stargz"]
root_path = "/var/lib/containerd-stargz-grpc"
# 远程镜像仓库配置
[plugins."io.containerd.snapshotter.v1.stargz".config]
noprefetch = false
convert_image_layer_to_stargz = true
# 本地缓存
[plugins."io.containerd.snapshotter.v1.stargz".cache]
max_cache_fds = 10000
cache_on_root = true
# 并发预取
[plugins."io.containerd.snapshotter.v1.stargz".prefetch]
prefetch_size = 524288000 # 500MB 预取
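4.3.5 Storage Access Pattern Matrix
| Data type | Storage | Access method | Capacity | IO requirement |
|---|---|---|---|---|
| Model weights (5-200 GB) | MinIO / Ceph RGW | Pulled at Pod initialization, then cached locally | 50-200 TiB | Read bandwidth > 5 GB/s |
| Training datasets | JuiceFS + MinIO | POSIX mount, read directly | 200-500 TiB | Sequential read > 10 GB/s |
| Training checkpoints | MinIO | S3 API upload/download | 50-200 TiB | Write bandwidth > 2 GB/s |
| Database volumes | Ceph RBD SSD | Block device mount | 5-20 TiB | Random IOPS > 50k |
| Container images | Harbor + MinIO | Stargz lazy loading | 10-50 TiB | Startup < 10 s |
| Logs / monitoring | Ceph RGW | S3 API | 50-100 TiB | Write bandwidth > 1 GB/s |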
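The bandwidth floors in the matrix translate directly into best-case transfer times. The sketch below is a back-of-the-envelope estimate that assumes the full bandwidth is sustained with no protocol or metadata overhead; the function name is illustrative.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Best-case time to move `size_gb` at a sustained `bandwidth_gb_per_s`."""
    return size_gb / bandwidth_gb_per_s


# Largest model weights in the matrix (200 GB) at the 5 GB/s read floor.
print(transfer_seconds(200, 5))   # seconds to pull the weights
# A 200 GB checkpoint at the 2 GB/s write floor.
print(transfer_seconds(200, 2))   # seconds to persist the checkpoint
```

At these floors a 200 GB model takes at least 40 s to load and a checkpoint of the same size at least 100 s to write, which is useful context when setting Pod startup and checkpoint-interval expectations.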
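4.4 Network Architecture
4.4.1 Three-Plane Network Design
L1 uses three physically isolated network planes that carry management, storage, and compute traffic respectively:
| Network plane | Bandwidth | Protocol | Traffic carried | Isolation |
|---|---|---|---|---|
| Management | 1 Gbps | TCP/IP | SSH administration, K8s API, DNS, NTP, BMC/IPMI | Dedicated physical NIC + VLAN |
| Storage | 25 Gbps | TCP/IP + RDMA | Ceph OSD replication, MinIO reads and writes, NFS/JuiceFS traffic | Dedicated physical NIC + VLAN + QoS |
| Compute | 100 Gbps / 200 Gbps | InfiniBand NDR / RoCEv2 | GPU-to-GPU communication (NCCL/RCCL), distributed training AllReduce, inference request distribution | Dedicated physical NIC / IB fabric |
4.4.2 Compute Network: InfiniBand and RoCEv2
Multi-node distributed training is extremely sensitive to inter-GPU bandwidth and latency. L1 supports two high-performance network options:
| Aspect | InfiniBand (HDR / NDR) | RoCEv2 (RDMA over Converged Ethernet) |
|---|---|---|
| Per-link bandwidth | 200 Gbps (HDR) / 400 Gbps (NDR) | 100 Gbps / 200 Gbps |
| Latency (node to node) | < 1.2 μs | < 2.0 μs |
| Congestion control | Hardware-level (IB CC) | DCQCN + ECN |
| Latency guarantee | Deterministic latency (lossless) | Adaptive lossy/lossless |
| GPU Direct RDMA | Native | Supported (requires configuration) |
| Cost | High (dedicated fabric) | Medium (standard Ethernet) |
| Recommended for | Large-scale training, > 64 GPUs | Mid-scale training mixed with inference |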
# RoCEv2 switch configuration example (Mellanox / NVIDIA Spectrum)
# QoS: give RDMA traffic its own buffer pool
mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0   # enable Priority Flow Control on priority 3
mlnx_qos -i eth0 --trust dscp            # trust DSCP markings
# ECN: marking thresholds
echo "24" > /sys/class/net/eth0/ecn/redp/red_min_threshold
echo "24" > /sys/class/net/eth0/ecn/redp/red_max_threshold
echo "100" > /sys/class/net/eth0/ecn/redp/red_probability
# GPU Direct RDMA verification
nvidia-smi topo -m
#        GPU0  GPU1  GPU2  GPU3  CPU Affinity
# GPU0    X    NV1   NV1   NV2   0-31
# GPU1   NV1    X    NV2   NV1   0-31
# GPU2   NV1   NV2    X    NV1   64-95
# GPU3   NV2   NV1   NV1    X    64-95
# NCCL environment variables
NCCL_IB_DISABLE=0
NCCL_IB_GID_INDEX=3
NCCL_SOCKET_IFNAME=ib0
NCCL_IB_HCA=mlx5_0,mlx5_1
NCCL_IB_TIMEOUT=22
NCCL_IB_RETRY_CNT=7
NCCL_IB_SL=3
NCCL_IB_QPS_PER_CONNECTION=8
NCCL_NET_GDR_LEVEL=3
NCCL_NET_GDR_READ=1
NCCL_P2P_DISABLE=0
NCCL_DEBUG=INFO
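To see why link bandwidth matters for AllReduce, the standard ring all-reduce cost model is a useful first approximation: each of the N participants sends and receives 2(N-1)/N times the buffer size. The sketch below applies that bandwidth-only model. It ignores latency and overlap with compute, and the 10 GB gradient size is a hypothetical example.

```python
def ring_allreduce_seconds(buffer_gb: float, participants: int, link_gbps: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce.

    Each participant transfers 2 * (N - 1) / N of the buffer over its link.
    """
    link_gb_per_s = link_gbps / 8  # gigabits per second -> gigabytes per second
    traffic_gb = 2 * (participants - 1) / participants * buffer_gb
    return traffic_gb / link_gb_per_s


# 10 GB of gradients across 8 nodes.
print(ring_allreduce_seconds(10, 8, link_gbps=200))  # 200 Gbps link
print(ring_allreduce_seconds(10, 8, link_gbps=100))  # 100 Gbps link
```

Halving the link speed doubles the estimated synchronization time, which is the main argument for the faster fabric on communication-bound training jobs.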
4.4.3 Network Isolation and Security Policy
Calico is the CNI plugin, and NetworkPolicy provides fine-grained network isolation for GPU Pods:
# NetworkPolicy: GPU Pods may only talk within training namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gpu-pod-isolation
  namespace: training
spec:
  podSelector:
    matchExpressions:
      - key: accelerator
        operator: In
        values: ["nvidia", "huawei-ascend"]
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              purpose: training
        - podSelector:
            matchLabels:
              app: training-controller
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - port: 443    # HTTPS access (Harbor/MinIO)
          protocol: TCP
        - port: 6443   # K8s API Server
          protocol: TCP
        - port: 53     # DNS
          protocol: UDP
    # Allow NCCL traffic between GPU Pods on any port.
    # No "ports" list here: omitting it permits all ports (0 is not a valid port value).
    - to:
        - podSelector:
            matchExpressions:
              - key: accelerator
                operator: Exists
        - namespaceSelector:
            matchLabels:
              purpose: training
---
# NetworkPolicy: restrict ingress to the GPU Operator Pods
# (a NetworkPolicy selects Pods, not nodes)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-external-gpu-access
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      app: gpu-operator
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/8   # internal network only
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
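4.4.4 Multus: Multiple NICs and Network Planes
Multus CNI attaches several network interfaces to a GPU Pod, giving it access to the management, storage, and compute planes: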
# Multus NetworkAttachmentDefinition
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: ib-net
namespace: default
spec:
config: |
{
"cniVersion": "0.3.1",
"type": "ib-sriov",
"master": "ib0",
"mode": "bridge",
"deviceID": "15b3:101b",
"linkState": "auto",
"capabilities": {
"ips": true
},
"ipam": {
"type": "whereabouts",
"range": "192.168.200.0/24"
}
}
---
# Pod 声明多网络
apiVersion: v1
kind: Pod
metadata:
name: multi-net-training
annotations:
k8s.v1.cni.cncf.io/networks: |
[
{"name": "ib-net", "namespace": "default"},
{"name": "storage-net", "namespace": "default"}
]
spec:
containers:
- name: trainer
image: pytorch:2.3.0-cuda12.4
resources:
limits:
nvidia.com/gpu: 8
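4.5 Monitoring and Alerting
4.5.1 Monitoring Stack
| Component | Version | Function |
|---|---|---|
| Prometheus | 2.52+ | Metrics collection and alert rule evaluation, deployed in the monitoring namespace |
| DCGM Exporter | 3.3+ | NVIDIA GPU metrics: utilization, memory, temperature, power, PCIe throughput |
| AMD ROCm Exporter | 0.15+ | Metrics for AMD MI-series GPUs |
| ASCEND Exporter | 1.8+ | Metrics for Huawei Ascend chips (the DCGM equivalent) |
| Node Exporter | 1.8+ | Node resource metrics: CPU, memory, disk, network |
| Blackbox Exporter | 0.25+ | Endpoint health probing |
| Grafana | 11.0+ | Dashboards, aggregating multiple data sources |
| Loki | 3.0+ | Log aggregation, integrated with Grafana |
| Tempo / Jaeger | 2.5+ / 1.57 | Distributed tracing |
| Alertmanager | 0.27+ | Alert deduplication, grouping, and routing, with Feishu/DingTalk/WeCom support |
| Node Problem Detector | 0.8+ | Node fault detection and reporting; feeds the automatic remediation pipeline |
| VictoriaMetrics | 1.101+ | Prometheus-compatible time-series database (long-term storage alternative) |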
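4.5.2 Core GPU Metrics
DCGM Exporter exposes the following key GPU metrics, covering utilization, memory, temperature, and power:
| Metric | Type | Meaning | Alert threshold |
|---|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | Gauge (0-100) | GPU core utilization | < 10% for 5 min: low-efficiency alert |
| DCGM_FI_DEV_MEM_COPY_UTIL | Gauge (0-100) | Memory bandwidth utilization | Sustained > 95%: bandwidth alert |
| DCGM_FI_DEV_FB_USED | Gauge (MiB) | GPU memory in use | > 90% of total: memory alert |
| DCGM_FI_DEV_GPU_TEMP | Gauge (°C) | GPU temperature | > 85°C warning, > 95°C critical |
| DCGM_FI_DEV_POWER_USAGE | Gauge (W) | Instantaneous power draw | > 400 W (per card) |
| DCGM_FI_DEV_XID_ERRORS | Gauge | Code of the most recent XID error | Any non-zero value alerts |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | Counter | Correctable ECC errors | Increase > 100 per day |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Counter | Uncorrectable ECC errors | Any value > 0 alerts |
| DCGM_FI_DEV_PCIE_TX_THROUGHPUT | Gauge (bytes/s) | PCIe transmit throughput | Abnormal drop versus the same period |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Gauge (bytes/s) | Total NVLink bandwidth | Abnormal drop versus the same period |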
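The two-level temperature threshold in the table maps naturally onto a small classifier. The sketch below mirrors those thresholds (above 85°C warning, above 95°C critical); the function name and return labels are illustrative assumptions rather than part of the monitoring stack.

```python
def temperature_severity(temp_c: float) -> str:
    """Classify a GPU temperature reading using the table's thresholds."""
    if temp_c > 95:
        return "critical"
    if temp_c > 85:
        return "warning"
    return "ok"


for reading in (72, 85, 88, 96):
    print(reading, temperature_severity(reading))
```

Checking the critical threshold first ensures a reading above 95°C is never reported as a mere warning.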
4.5.3 Prometheus Alert Rules
# PrometheusRule: GPU fault alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: gpu-alerts
namespace: monitoring
labels:
prometheus: k8s
role: alert-rules
spec:
groups:
- name: gpu-failure
interval: 30s
rules:
- alert: GPUXidError
# DCGM_FI_DEV_XID_ERRORS is a gauge holding the last XID code, so test it directly
expr: DCGM_FI_DEV_XID_ERRORS > 0
for: 1m
labels:
severity: critical
team: infra
annotations:
summary: "GPU XID Error on {{ $labels.instance }}"
description: "GPU {{ $labels.gpu }} on node {{ $labels.kubernetes_io_hostname }} encountered XID error. GPU index: {{ $labels.gpu }}. Last XID code: {{ $value }}"
runbook_url: "https://runbook.internal/gpu-xid-error"
action: "Inspect GPU logs: nvidia-smi -q -d HEALTH; dmesg | grep -i nvidia"
- alert: GPUHighTemperature
expr: DCGM_FI_DEV_GPU_TEMP > 85
for: 5m
labels:
severity: warning
team: infra
annotations:
summary: "GPU High Temperature on {{ $labels.instance }}"
description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C (threshold: 85°C)"
- alert: GPUHighTemperatureCritical
expr: DCGM_FI_DEV_GPU_TEMP > 95
for: 1m
labels:
severity: critical
team: infra
annotations:
summary: "GPU Critical Temperature on {{ $labels.instance }}"
description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C (threshold: 95°C). Immediate action needed!"
- alert: GPUECCUncorrectableError
expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[15m]) > 0
for: 1m
labels:
severity: critical
team: infra
annotations:
summary: "GPU ECC Uncorrectable Error on {{ $labels.instance }}"
description: "GPU {{ $labels.gpu }} has uncorrectable ECC errors. This indicates potential hardware failure."
- alert: GPUMemoryUsageHigh
expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) * 100 > 90
for: 10m
labels:
severity: warning
team: mlops
annotations:
summary: "GPU Memory Usage High on {{ $labels.instance }}"
description: "GPU {{ $labels.gpu }} memory usage at {{ $value }}%. OOM risk."
action: "Check per-process GPU memory consumption: nvidia-smi pmon -c 1"
- alert: GPUPowerLimitReached
expr: DCGM_FI_DEV_POWER_USAGE > DCGM_FI_DEV_MAX_POWER_USAGE * 0.95
for: 5m
labels:
severity: warning
team: infra
annotations:
summary: "GPU Power Limit Near Threshold"
description: "GPU {{ $labels.gpu }} power usage {{ $value }}W approaching max power limit."
- alert: GPUNodeUnavailable
expr: up{job="gpu-node"} == 0
for: 2m
labels:
severity: critical
team: infra
annotations:
summary: "GPU Node Unreachable"
description: "GPU node {{ $labels.instance }} is down/unreachable for > 2 minutes."
- alert: GPULowUtilizationWarning
expr: avg by(kubernetes_io_hostname) (DCGM_FI_DEV_GPU_UTIL) < 10
for: 30m
labels:
severity: warning
team: mlops
annotations:
summary: "GPU Node Underutilized"
description: "Average GPU utilization on node {{ $labels.kubernetes_io_hostname }} is {{ $value }}%. Consider releasing unused resources."
- alert: GPUNodeDraining
expr: kube_node_status_condition{condition="NodeProblemDetected",status="true"} == 1
for: 1m
labels:
severity: warning
team: infra
annotations:
summary: "GPU Node Auto-Draining"
description: "Node {{ $labels.node }} is being drained due to detected problems."
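4.5.4 Grafana Dashboards
The preconfigured Grafana dashboards contain the following key panels:
| Dashboard | Core panels | Data sources |
|---|---|---|
| GPU cluster overview | GPU count by model (pie) · cluster average utilization (time series) · faulty GPU count (stat) · queued job count (stat) · free/used GPU memory (gauge) | Prometheus + VictoriaMetrics |
| Single-node detail | Utilization of the 8 GPUs (heatmap) · GPU temperature (time series) · GPU power (time series) · NVLink bandwidth (time series) · PCIe throughput (time series) · CPU/memory/network overlay | Prometheus |
| Training job monitoring | Job GPU utilization · GPU memory allocation curve · NCCL communication bandwidth · job throughput (samples/s) · GPU timeline (Gantt) · job queueing time | Prometheus + Loki |
| Cluster cost | GPU-hours used (by team/project) · utilization-based cost allocation · share of Spot instances · estimated wasted resources | Prometheus + MySQL |
| Alert event stream | Active alerts · alert history · alert response time · alerts by severity | Alertmanager + Loki |
4.5.5 Alertmanager Notification Configuration
# Alertmanager configuration: Feishu notifications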
apiVersion: v1
kind: ConfigMap
metadata:
name: alertmanager-config
namespace: monitoring
data:
alertmanager.yml: |
global:
resolve_timeout: 5m
slack_api_url: ''
route:
receiver: 'feishu-critical'
group_wait: 10s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: feishu-critical
repeat_interval: 1h
- match:
severity: warning
receiver: feishu-warning
repeat_interval: 4h
- match:
severity: info
receiver: feishu-info
repeat_interval: 24h
receivers:
- name: feishu-critical
webhook_configs:
- url: 'https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxxx'
send_resolved: true
http_config:
headers:
Content-Type: application/json
- name: feishu-warning
webhook_configs:
- url: 'https://open.feishu.cn/open-apis/bot/v2/hook/yyyyyyyyy'
send_resolved: true
- name: feishu-info
webhook_configs:
- url: 'https://open.feishu.cn/open-apis/bot/v2/hook/zzzzzzzzz'
send_resolved: false
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal: ['alertname', 'cluster', 'service']
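The routing tree above sends each alert to the first child route whose matcher fits, falling back to the root receiver. The sketch below models that lookup for the three severity routes. It is a simplification of Alertmanager's routing (no `continue`, grouping, or inhibition), and the data structure is an illustrative assumption.

```python
from datetime import timedelta

# (severity, receiver, repeat_interval), in the order the routes are declared.
ROUTES = [
    ("critical", "feishu-critical", timedelta(hours=1)),
    ("warning", "feishu-warning", timedelta(hours=4)),
    ("info", "feishu-info", timedelta(hours=24)),
]
# Root route: used when no child route matches.
DEFAULT_ROUTE = ("feishu-critical", timedelta(hours=4))


def resolve_route(labels: dict[str, str]) -> tuple[str, timedelta]:
    """Return (receiver, repeat_interval) for an alert's label set."""
    for severity, receiver, repeat in ROUTES:
        if labels.get("severity") == severity:
            return receiver, repeat  # first match wins
    return DEFAULT_ROUTE


print(resolve_route({"alertname": "GPUHighTemperature", "severity": "warning"}))
print(resolve_route({"alertname": "Unlabelled"}))
```

An alert without a severity label lands on the root receiver, so unlabelled alerts end up in the critical channel with the 4-hour repeat interval.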
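5. Data Flow and API Specification
5.1 How L2 Calls L1
The L2 model deployment layer interacts with L1 through the K8s API. Typical operations by L2 operators are shown below:
5.1.1 L2 Requests a GPU Pod
# L2 creates a GPU Pod (calling the L1 K8s API)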
POST /api/v1/namespaces/inference/pods
Content-Type: application/json
{
"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"name": "vllm-llama3-70b-8gpu",
"namespace": "inference",
"annotations": {
"sidecar.istio.io/inject": "true",
"prometheus.io/scrape": "true",
"prometheus.io/port": "8000"
}
},
"spec": {
"schedulerName": "volcano",
"containers": [{
"name": "vllm",
"image": "registry.internal/vllm:v0.5.0-cuda12.4",
"command": ["python3", "-m", "vllm.entrypoints.openai.api_server",
"--model", "/models/llama3-70b",
"--tensor-parallel-size", "8",
"--gpu-memory-utilization", "0.95",
"--max-model-len", "8192",
"--dtype", "bfloat16",
"--port", "8000"],
"ports": [{"containerPort": 8000, "name": "http"}],
"env": [
{"name": "NCCL_IB_DISABLE", "value": "0"},
{"name": "NCCL_NET_GDR_LEVEL", "value": "3"}
],
"resources": {
"requests": {"nvidia.com/gpu": "8", "cpu": "64", "memory": "512Gi"},
"limits": {"nvidia.com/gpu": "8", "cpu": "64", "memory": "512Gi"}
},
"volumeMounts": [{
"name": "model-weights",
"mountPath": "/models"
}]
}],
"volumes": [{
"name": "model-weights",
"persistentVolumeClaim": {"claimName": "llama3-70b-weights"}
}],
"affinity": {
"nodeAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": {
"nodeSelectorTerms": [{
"matchExpressions": [
{"key": "accelerator-model", "operator": "In", "values": ["A100-SXM-80GB"]},
{"key": "accelerator-count", "operator": "Gt", "values": ["7"]}
]
}]
}
},
"podAntiAffinity": {
"preferredDuringSchedulingIgnoredDuringExecution": [{
"podAffinityTerm": {
"labelSelector": {"matchLabels": {"app": "vllm"}},
"topologyKey": "kubernetes.io/hostname"
},
"weight": 100
}]
}
}
}
}
# Response (K8s API)
HTTP/1.1 201 Created
{
"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"name": "vllm-llama3-70b-8gpu",
"namespace": "inference",
"uid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"creationTimestamp": "2026-06-02T10:30:00Z"
},
"status": {
"phase": "Pending",
"conditions": [
{"type": "PodScheduled", "status": "False", "reason": "SchedulingGpu"},
{"type": "Initialized", "status": "False"}
]
}
}
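A client on the L2 side would normally assemble this request body programmatically rather than by hand. The sketch below builds a minimal Pod manifest with the same node-affinity rule using only the standard library. The helper names are hypothetical, and the manifest omits the volumes, environment, and anti-affinity shown in the full request.

```python
import json


def gpu_node_affinity(model: str, min_gpus: int) -> dict:
    """Node affinity requiring a given accelerator model and at least `min_gpus` GPUs."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [
                        {"key": "accelerator-model", "operator": "In", "values": [model]},
                        # "Gt" is strict, so "at least N" is expressed as "> N - 1".
                        {"key": "accelerator-count", "operator": "Gt",
                         "values": [str(min_gpus - 1)]},
                    ]
                }]
            }
        }
    }


def gpu_pod(name: str, namespace: str, image: str, gpus: int, model: str) -> dict:
    """Minimal Pod manifest requesting `gpus` NVIDIA GPUs via the Volcano scheduler."""
    gpu = {"nvidia.com/gpu": str(gpus)}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedulerName": "volcano",
            "containers": [{
                "name": "vllm",
                "image": image,
                "resources": {"requests": gpu, "limits": gpu},
            }],
            "affinity": gpu_node_affinity(model, gpus),
        },
    }


pod = gpu_pod("vllm-llama3-70b-8gpu", "inference",
              "registry.internal/vllm:v0.5.0-cuda12.4", 8, "A100-SXM-80GB")
print(json.dumps(pod, indent=2))
```

The resulting JSON can be sent as the body of the `POST /api/v1/namespaces/inference/pods` call.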
5.1.2 Querying GPU Node Status
# L2 queries the GPU state of a node
GET /api/v1/nodes/gpu-node-01/proxy/metrics?gpu=0,1
# Query node topology
GET /api/v1/nodes/gpu-node-01
# Response (K8s Node)
HTTP/1.1 200 OK
{
"apiVersion": "v1",
"kind": "Node",
"metadata": {
"name": "gpu-node-01",
"labels": {
"accelerator": "nvidia",
"accelerator-model": "A100-SXM-80GB",
"accelerator-count": "8",
"accelerator-topology": "nvlink-fullmesh",
"accelerator-mig-enabled": "true",
"node-type": "gpu-compute",
"network.bandwidth": "100Gbps"
}
},
"status": {
"capacity": {
"nvidia.com/gpu": "8",
"cpu": "128",
"memory": "2048Gi",
"ephemeral-storage": "3.5Ti"
},
"allocatable": {
"nvidia.com/gpu": "8",
"cpu": "124",
"memory": "1980Gi",
"ephemeral-storage": "3.4Ti"
},
"conditions": [
{"type": "Ready", "status": "True", "lastHeartbeatTime": "2026-06-02T10:29:00Z"},
{"type": "NetworkUnavailable", "status": "False"},
{"type": "GPUHealthy", "status": "True"}
],
"images": [
{"names": ["vllm:v0.5.0-cuda12.4"], "sizeBytes": 8589934592}
]
}
}
5.2 Key CRD Definitions
# GPUQuota CRD: GPU quota management
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: gpuquotas.ai.internal
spec:
group: ai.internal
scope: Namespaced
names:
plural: gpuquotas
singular: gpuquota
kind: GPUQuota
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
team:
type: string
totalGPUs:
type: integer
minimum: 0
priority:
type: integer
enum: [1, 2, 3, 4, 5]
preemptible:
type: boolean
validUntil:
type: string
format: date-time
allowedModels:
type: array
items:
type: string
excludedNodes:
type: array
items:
type: string
required: [team, totalGPUs]
---
# GPUQuota instance: quota for the AI platform team
apiVersion: ai.internal/v1
kind: GPUQuota
metadata:
name: ai-platform-quota
namespace: default
spec:
team: ai-platform
totalGPUs: 256
priority: 3
preemptible: false
allowedModels:
- "A100-SXM-80GB"
- "H100-SXM-80GB"
5.3 Key API Endpoint Summary
| Path | Method | Description | Component |
|---|---|---|---|
| /api/v1/namespaces/{ns}/pods | POST | Create a Pod (including GPU resource requests) | K8s API |
| /api/v1/nodes/{node} | GET | Query node resources and status | K8s API |
| /apis/scheduling.volcano.sh/v1beta1/queues | POST/GET | Manage Volcano resource queues | Volcano |
| /apis/scheduling.volcano.sh/v1beta1/namespaces/{ns}/podgroups | POST | Create a PodGroup (Gang Scheduling) | Volcano |
| /apis/storage.k8s.io/v1/storageclasses | GET | Query StorageClass definitions | K8s API |
| /api/v1/namespaces/{ns}/persistentvolumeclaims | POST | Create a PVC | K8s API |
| /apis/networking.k8s.io/v1/namespaces/{ns}/networkpolicies | POST | Create a network isolation policy | K8s API |
| /metrics | GET | Prometheus metrics endpoint | DCGM Exporter |
| /api/v1/query | GET | Prometheus instant query | Prometheus |
| /api/v2/alerts | POST | Receive alert events | Alertmanager |
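6. SLA / SLO Targets
As the lowest layer of the platform, L1's SLOs directly affect the availability and performance of every service above it. The L1 team commits to the following key SLOs:
| Category | SLI | SLO target | Measurement | Window |
|---|---|---|---|---|
| Availability | Overall GPU cluster availability | ≥ 99.9% | Prometheus Node Exporter `up` metric | Monthly |
| Availability | K8s API Server availability | ≥ 99.95% | API Server request success rate | Monthly |
| Availability | Storage system availability | ≥ 99.99% | Ceph/MinIO health checks | Monthly |
| Performance | GPU Pod scheduling latency (P50) | < 30 s | Difference between Pod creation and PodScheduled timestamps | Weekly |
| Performance | GPU Pod scheduling latency (P99) | < 120 s | Difference between Pod creation and PodScheduled timestamps | Weekly |
| Performance | Storage read bandwidth (per node) | ≥ 5 GB/s | FIO / JuiceFS bench | Monthly |
| Performance | Storage write bandwidth (per node) | ≥ 2 GB/s | FIO / JuiceFS bench | Monthly |
| Fault recovery | GPU fault detection time | < 1 min | DCGM XID event → Prometheus alert | Monthly |
| Fault recovery | GPU fault isolation time | < 5 min | Node cordon → drain complete | Monthly |
| Fault recovery | Node replacement time | < 30 min | New node registered and Ready in the cluster | Quarterly |
| Fault recovery | PV auto-provisioning time | < 10 s | PVC created → PV Bound | Weekly |
| Capacity | GPU utilization target | ≥ 70% | Average of DCGM GPU utilization | Weekly |
| Capacity | Cluster fragmentation rate | < 15% | Unallocatable GPUs / total GPUs | Weekly |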
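An availability SLO is easier to reason about as an error budget, that is, the downtime the target permits within its window. The sketch below converts the three monthly availability targets into minutes, assuming a 30-day month; the function name is illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (100 - slo_percent) / 100 * window_minutes


for name, slo in [("GPU cluster", 99.9), ("K8s API Server", 99.95), ("Storage", 99.99)]:
    print(f"{name}: {error_budget_minutes(slo):.2f} min/month")
```

This gives roughly 43.2, 21.6, and 4.32 minutes per month respectively, which shows how little headroom the 99.99% storage target leaves for maintenance or failover.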
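7. Technology Selection
The table below lists every L1 component with its chosen technology, version, alternatives, and rationale:
| Area | Choice | Version | Alternatives | Rationale |
|---|---|---|---|---|
| Container orchestration | Kubernetes | 1.28+ | Nomad · Slurm | Richest ecosystem, end-to-end GPU scheduling support, most active community. The de facto standard for AI platforms. |
| Batch scheduling | Volcano | 1.9+ | YuniKorn · Koordinator | Native Gang Scheduling, queue management, and GPU topology scheduling. CNCF Incubating. Deeply integrated with K8s. |
| GPU sharing | HAMi | 2.3+ | Run:ai · MIG · time-slicing | Open source, lock-free design, limits on both GPU memory and compute, mixed MIG and vGPU scheduling. |
| GPU driver management | NVIDIA GPU Operator | 24.3+ | Manual deployment | Automates GPU node initialization and lowers operational complexity. MIG configuration and DCGM deployment in one step. |
| Multi-cluster management | Karmada (optional) | 1.10+ | Clusternet · OCM · KubeFed | K8s-native API, no agent required, multi-cluster scheduling policies, active community. Used only for cross-DC deployments. |
| Service mesh | Istio | 1.21+ | Linkerd · Consul Connect | Richest feature set, unified management of ingress and east-west traffic, Envoy proxy ecosystem. Suited to canary releases of inference traffic. |
| GPU autoscaling | KEDA | 2.14+ | HPA · VPA | Prometheus triggers allow scaling on GPU utilization or QPS. Plain HPA needs a separate metrics adapter to consume external metrics. |
| Object storage | MinIO + Ceph RGW | 2024-06+ / Reef 18.2 | SeaweedFS · Swift | MinIO for high-performance model access, Ceph RGW for the unified storage pool. S3-compatible and mature. |
| File storage | JuiceFS | 1.1+ | Lustre · GPFS · NFS | POSIX-compatible and built on object storage. Separate metadata engine, better performance than traditional NFS, local cache acceleration. |
| Block storage | Ceph RBD | Reef 18.2 | Longhorn · OpenEBS | Production-grade reliability, three-replica data safety, snapshots and clones, mature CSI driver. |
| Local cache | Stargz Snapshotter | 0.15+ | Nydus · OverlayBD | Native Containerd integration; lazy loading speeds up startup of large model images and reduces image distribution time by up to 90%. |
| Network (compute) | InfiniBand NDR / RoCEv2 | NDR200 / 100GbE | Slingshot (HPE Cray) | IB's deterministic low latency suits large-scale training; RoCEv2's moderate cost suits mixed workloads. Both support GPU Direct RDMA. |
| CNI plugin | Calico + Multus | 3.28+ / 4.0+ | Cilium · Flannel | Calico supports NetworkPolicy and eBPF; Multus attaches multiple NICs to implement the three-plane network. |
| Monitoring | Prometheus + VictoriaMetrics | 2.52+ / 1.101+ | Thanos · Mimir | Prometheus is the standard suite; VictoriaMetrics handles long-term storage, is PromQL-compatible, and supports a million metrics on a single node. |
| GPU monitoring | DCGM Exporter | 3.3+ | nvidia-smi · custom NVML collection | Prometheus format out of the box, covers all key GPU metrics, exposes hardware fault signals such as XID and ECC. |
| Logging | Loki | 3.0+ | ELK · ClickHouse | Native Grafana integration, no separate log store required, label-based indexing lowers operational complexity. |
| Tracing | Tempo | 2.5+ | Jaeger · SigNoz | Consistent with the Grafana ecosystem, low-cost object storage backend, suited to tracing inference request paths. |
| Alerting | Alertmanager | 0.27+ | Grafana OnCall · PagerDuty | Native to the Prometheus ecosystem, supports Feishu/DingTalk/WeCom webhooks, flexible routing rules. |
| Fault detection | Node Problem Detector | 0.8+ | In-house inspection scripts | K8s-native problem detection with custom detectors; integrates with the GPU fault-handling pipeline that cordons and drains nodes. |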
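8. Relationship Diagram
The diagram below shows how L1 relates to L2 (the layer above), to the physical hardware (the layer below), and how the L1 submodules relate to one another:
8.1 Module Dependency Overview
| Source module | Target module | Interaction | Description |
|---|---|---|---|
| Compute management | Orchestration platform | Device Plugin API | GPU devices are registered with K8s and allocated at scheduling time |
| Orchestration platform | Storage system | CSI Driver | A Pod declares a PVC → a PV is created and mounted automatically |
| Orchestration platform | Network architecture | CNI Plugin | IP addresses and multiple network planes are assigned when a Pod is created |
| Compute management | Monitoring and alerting | DCGM Exporter | GPU health metrics are exposed to Prometheus |
| Storage system | Monitoring and alerting | Ceph Exporter | Storage cluster state is exposed to Prometheus |
| Network architecture | Monitoring and alerting | Node Exporter | Network interface metrics are exposed |
| Orchestration platform | L2 (upper layer) | K8s API | L2 manages the inference Pod lifecycle through the K8s API |
| Monitoring and alerting | L2 (upper layer) | Prometheus API | L2 can query metrics such as GPU utilization to drive scaling decisions |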