Configuring NVIDIA drivers for GPU nodes in Kubernetes
First, confirm which nodes need the NVIDIA driver installed.
Check the VGA controller model:
lspci -vnn | grep VGA
Look up the hexadecimal device ID from the output in the PCI ID database:
https://admin.pci-ids.ucw.cz/mods/PC/10de
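The ID to look up is the bracketed `vendor:device` hex pair in the lspci output; a small sketch of pulling it out (the sample line below is hypothetical — substitute the real line from your own `lspci -vnn | grep VGA`):

```shell
# Hypothetical lspci line; your real output will differ.
line='01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)'
# Grab the [vendor:device] hex pair; 10de is NVIDIA's PCI vendor ID.
echo "$line" | grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}\]' | tr -d '[]'
# → 10de:2204
```

The class code `[0300]` does not match because the pattern requires two four-digit hex groups separated by a colon.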
Uninstall any existing NVIDIA driver:
apt remove --purge nvidia-*
Disable nouveau and install the NVIDIA binary driver downloaded from the official site:
NVIDIA_DRIVER_VERSION=   # set to the version of the .run file you downloaded
cp /etc/modprobe.d/blacklist.conf /etc/modprobe.d/blacklist.conf.bak
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist lbm-nouveau" >> /etc/modprobe.d/blacklist.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist.conf
echo "alias nouveau off" >> /etc/modprobe.d/blacklist.conf
echo "alias lbm-nouveau off" >> /etc/modprobe.d/blacklist.conf
echo "options nouveau modeset=0" | tee -a /etc/modprobe.d/nouveau-kms.conf
update-initramfs -u
service lightdm stop
init 3
chmod 755 NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run
./NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run --no-x-check --no-nouveau-check
reboot
Install the NVIDIA Container Toolkit (this assumes the NVIDIA container toolkit apt repository has already been configured on the node):
apt install -y nvidia-container-toolkit
If the container runtime is Docker:
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
If the container runtime is containerd:
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/g' /etc/containerd/config.toml
# /etc/containerd/config.toml
# [plugins."io.containerd.grpc.v1.cri"]
#   [plugins."io.containerd.grpc.v1.cri".containerd]
#     default_runtime_name = "nvidia-container-runtime"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
#       runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
#       runtime_type = "io.containerd.runtime.v1.linux"
#       runtime_engine = "/usr/bin/nvidia-container-runtime"
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
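If the nvidia runtime is not made containerd's default, pods must opt into it through a RuntimeClass; a minimal sketch, assuming `nvidia-ctk` registered the runtime under its usual name `nvidia` (check your config.toml for the actual name):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia      # pods reference this via runtimeClassName: nvidia
handler: nvidia     # must match the runtime name in containerd's config.toml
```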
If the cluster runs k3s:
nvidia-ctk runtime configure --runtime=containerd --set-as-default --config /var/lib/rancher/k3s/agent/etc/containerd/config.toml
sudo cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo systemctl restart k3s
Deploy the NVIDIA device plugin:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
kubectl get pods -n kube-system | grep nvidia-device-plugin
Pod spec for requesting an NVIDIA GPU in Kubernetes:
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.0.0-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
Label the GPU nodes:
kubectl label nodes <node-name> nvidia.com/gpu=true
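The label can then steer GPU workloads onto labeled nodes via a nodeSelector; a sketch of the relevant pod-spec fragment:

```yaml
# Pod spec fragment: restrict scheduling to nodes labeled nvidia.com/gpu=true
spec:
  nodeSelector:
    nvidia.com/gpu: "true"
```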
Install the GPU Operator:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace
After the Helm install, export the ClusterPolicy for inspection:
kubectl get clusterpolicies.nvidia.com cluster-policy -n gpu-operator -o yaml > cluster-policy.yaml
Create the time-slicing ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
Point the device plugin at the config (each physical GPU is then advertised as 4 schedulable GPUs):
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}'
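After the patch, `kubectl describe node <gpu-node>` should show the time-sliced count under Allocatable. A sketch of pulling the count out of that output (the sample line below is hypothetical; run the real command against your cluster):

```shell
# Hypothetical Allocatable line from `kubectl describe node` after time slicing
# with replicas: 4 on a node with one physical GPU.
sample='  nvidia.com/gpu:  4'
# Extract the advertised GPU count (second whitespace-separated field).
echo "$sample" | awk '{print $2}'
# → 4
```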