当前位置: 首页 > news >正文

[kubernetes]-k8s通过psp限制nvidia-plugin插件的使用

导语: k8s通过psp限制nvidia-plugin插件的使用。刚开始接触psp 记录一下 后续投入生产测试了再完善。

通过apiserver开启psp 静态pod会自动更新

# PSP(Pod Security Policy) 在默认情况下并不会开启。通过将PodSecurityPolicy关键词添加到 --enbale-admission-plugins 配置数组后,可以开启PSP权限认证功能。
# /etc/kubernetes/manifests/kube-apiserver.yaml   在NodeRestriction后添加PodSecurityPolicy
- --enable-admission-plugins=NodeRestriction,PodSecurityPolicy

直接创建容器测试

lung.yaml

apiVersion: apps/v1
kind: Deployment
metadata:name: lunglabels:k8s-app: lungk8s-med-type: biz-internel
spec:strategy:type: Recreatereplicas: 1selector:matchLabels:k8s-app: lungtemplate:metadata:labels:k8s-app: lungspec:
#      runtimeClassName: nvidia
#      hostPID: truecontainers:- name: lungimage: nvidia/cuda:11.3.0-base-ubi8command: ["sh","-c","tail -f /dev/null "]#command: ["sh","-c","for i in `ls /srv/conf-drwise220531`;do rm -rf /root/lung/$i/conf && ln -s  /srv/conf-drwise220531/$i/conf /root/lung/$i/  ;done  && rm -rf /root/lung/Release/path.conf /root/lung/path.conf  && ln -s /srv/conf-drwise220531/Release/path.conf /root/lung/Release/ && ln -s /root/lung/Release/path.conf  /root/lung/ &&  sh /root/aiclassifier/startup.sh &&  sh /root/lung/startup.sh "]
#        securityContext:
#          privileged: trueenv:- name: NVIDIA_DRIVER_CAPABILITIESvalue: compute,utility,video,graphics,display- name: NVIDIA_VISIBLE_DEVICESvalue: allvolumeMounts:- mountPath: /dev/shmname: dshmvolumes:- name: dshmemptyDir:medium: MemorysizeLimit: 1Gi
#deepwise-operator
#      serviceAccountName: deepwise-operator
# 创建deployment测试  发下会有psp的问题
# 注意:开启PodSecurityPolicy功能后,即使没有使用任何安全策略,都会使得创建pods(包括调度任务重新创建pods)失败kubectl apply -f lung -n deepwise 

创建对应的资源限制策略

4.nvidia-plugin.yaml

# 显卡驱动的限制
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:name: psp-nvidia
spec:privileged: falsefsGroup:rule: RunAsAnyrunAsUser:rule: RunAsAnyseLinux:rule: RunAsAnysupplementalGroups:rule: RunAsAnyvolumes:- "*"hostPID: falsehostIPC: falsehostNetwork: false---
apiVersion: v1
kind: ServiceAccount
metadata:namespace: kube-systemname: nvidiaoperator---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:name: psp-permissive-nvidianamespace: kube-system
rules:- apiGroups:- extensionsresources:- podsecuritypoliciesresourceNames:- psp-nvidiaverbs:- use
---apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:name: psp-permissive-nvidianamespace: kube-system
roleRef:apiGroup: rbac.authorization.k8s.iokind: Rolename: psp-permissive-nvidia
subjects:- kind: ServiceAccountname: nvidiaoperatornamespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:name: nvidia-device-plugin-daemonsetnamespace: kube-system
spec:selector:matchLabels:name: nvidia-device-plugin-dsupdateStrategy:type: RollingUpdatetemplate:metadata:# This annotation is deprecated. Kept here for backward compatibility# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/annotations:scheduler.alpha.kubernetes.io/critical-pod: ""labels:name: nvidia-device-plugin-dsspec:runtimeClassName: nvidiatolerations:# This toleration is deprecated. Kept here for backward compatibility# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/- key: CriticalAddonsOnlyoperator: Exists- key: nvidia.com/gpuoperator: Existseffect: NoSchedule# Mark this pod as a critical add-on; when enabled, the critical add-on# scheduler reserves resources for critical add-on pods so that they can# be rescheduled after a failure.# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/priorityClassName: "system-node-critical"containers:- image: harbor.deepwise.com/terra-k8s/k8s-device-plugin:v0.10.0name: nvidia-device-plugin-ctrargs: ["--fail-on-init-error=false"]securityContext:allowPrivilegeEscalation: falsecapabilities:drop: ["ALL"]volumeMounts:- name: device-pluginmountPath: /var/lib/kubelet/device-pluginsvolumes:- name: device-pluginhostPath:path: /var/lib/kubelet/device-pluginsserviceAccountName: nvidiaoperator

5.nvida-psp.yaml

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:name: psp-nvidia
spec:privileged: falsefsGroup:rule: RunAsAnyrunAsUser:rule: RunAsAnyseLinux:rule: RunAsAnysupplementalGroups:rule: RunAsAnyvolumes:- "*"hostPID: falsehostIPC: falsehostNetwork: false---
apiVersion: v1
kind: ServiceAccount
metadata:namespace: kube-systemname: nvidiaoperator---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:name: psp-permissive-nvidianamespace: kube-system
rules:- apiGroups:- extensionsresources:- podsecuritypoliciesresourceNames:- psp-nvidiaverbs:- use
---apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:name: psp-permissive-nvidianamespace: kube-system
roleRef:apiGroup: rbac.authorization.k8s.iokind: Rolename: psp-permissive-nvidia
subjects:- kind: ServiceAccountname: nvidiaoperatornamespace: kube-system

6.deepwise-psp.yaml

# 用户的限制
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:name: psp-deepwise
spec:privileged: falsefsGroup:rule: RunAsAnyrunAsUser:rule: RunAsAnyseLinux:rule: RunAsAnysupplementalGroups:rule: RunAsAnyvolumes:- "*"hostPID: falsehostIPC: falsehostNetwork: false---
apiVersion: v1
kind: ServiceAccount
metadata:namespace: deepwisename: deepwise-operator---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:name: psp-permissive-deepwisenamespace: deepwise
rules:- apiGroups:- extensionsresources:- podsecuritypoliciesresourceNames:- psp-deepwiseverbs:- use
---apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:name: psp-permissive-deepwisenamespace: deepwise
roleRef:apiGroup: rbac.authorization.k8s.iokind: Rolename: psp-permissive-deepwise
subjects:- kind: ServiceAccountname: deepwise-operatornamespace: deepwise

7.runtimeclass.yaml

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:name: "nvidia"
handler: "nvidia"

如果是docker运行时,handler需要调整为docker。使用containerd则不需要调整。参考https://opni.io/setup/gpu/

重新创建lung的deployment

# 加上runtimeClassName: nvidia
# 加上serviceAccountName: deepwise-operator
apiVersion: apps/v1
kind: Deployment
metadata:name: lunglabels:k8s-app: lungk8s-med-type: biz-internel
spec:strategy:type: Recreatereplicas: 1selector:matchLabels:k8s-app: lungtemplate:metadata:labels:k8s-app: lungspec:runtimeClassName: nvidia
#      hostPID: truecontainers:- name: lungimage: nvidia/cuda:11.3.0-base-ubi8command: ["sh","-c","tail -f /dev/null "]#command: ["sh","-c","for i in `ls /srv/conf-drwise220531`;do rm -rf /root/lung/$i/conf && ln -s  /srv/conf-drwise220531/$i/conf /root/lung/$i/  ;done  && rm -rf /root/lung/Release/path.conf /root/lung/path.conf  && ln -s /srv/conf-drwise220531/Release/path.conf /root/lung/Release/ && ln -s /root/lung/Release/path.conf  /root/lung/ &&  sh /root/aiclassifier/startup.sh &&  sh /root/lung/startup.sh "]
#        securityContext:
#          privileged: trueenv:- name: NVIDIA_DRIVER_CAPABILITIESvalue: compute,utility,video,graphics,display- name: NVIDIA_VISIBLE_DEVICESvalue: allvolumeMounts:- mountPath: /dev/shmname: dshmvolumes:- name: dshmemptyDir:medium: MemorysizeLimit: 1Gi
#deepwise-operatorserviceAccountName: deepwise-operator

参考文档

https://kubernetes.io/zh-cn/docs/reference/access-authn-authz/admission-controllers/

https://blog.csdn.net/tushanpeipei/article/details/121940757

https://blog.csdn.net/weixin_45081220/article/details/125407608

http://www.lryc.cn/news/22332.html

相关文章:

  • 简单易懂又非常牛逼的Spring源码解析,推断构造与bean的实例化
  • Win11的两个实用技巧系列清理磁盘碎片、设置系统还原点的方法
  • 嵌入式 STM32 红外遥控
  • 【java web篇】使用JDBC操作数据库
  • 华为OD机试题,用 Java 解【最小步骤数】问题
  • JAVA中 throw 和 throws 的区别含案例
  • 基于SpringCloud的可靠消息最终一致性05:保存并发送事务消息
  • SQL语句大全(详解)
  • 视频营销活动中7个常见的错误
  • MapReduce小试牛刀
  • 2023年全国最新工会考试精选真题及答案7
  • 13-mvc框架原理与实现方式
  • 弹性盒子布局
  • C# Sqlite数据库加密
  • 高压放大器在声波谐振电小天线收发测试系统中的应用
  • 锁相环的组成和原理及应用
  • [C++]string类模拟实现
  • 一个更适合Java初学者的轻量级开发工具:BlueJ
  • 从程序员到项目组长,要经历六重修炼
  • 我的 System Verilog 学习记录(5)
  • 多芯片设计 Designing For Multiple Die
  • 2022年全国职业院校技能大赛(中职组)网络安全竞赛试题A(10)
  • 数据结构-简介
  • python装饰器及其用法
  • Appium自动化测试之启动时跳过初始化设置
  • JavaScript DOM【快速掌握知识点】
  • 不需要高深技术,只需要Python:创建一个可定制的HTTP服务器!
  • 渗透测试常用浏览器插件汇总
  • 社区1月月报|OceanBase 4.1 即将发版,哪些功能将会更新?
  • Javascript的API基本内容(二)