Deploying Stable Diffusion on Kubernetes

Deploying the Stable Diffusion stack (including the Stable Diffusion web UI and automatic model fetching) on my Kubernetes Atlas quickly completes a Machine Learning Atlas case built on NVIDIA GPUs (I am using an NVIDIA Tesla P10 compute card). In addition, my deployment runs on a private cloud architecture virtualized with OVMF-based passthrough of the GPU and NVMe storage, which also validates the cloud-computing approach.

Features

  • Automatic model fetching

  • Works with the NVIDIA GPU Operator, uses the NVIDIA CUDA libraries, and offers a versatile interactive UI

  • GFPGAN for face restoration, RealESRGAN for upscaling

  • Textual Inversion

Prerequisites

Installation

  • Add the stable-diffusion-k8s Helm repository:

Add the stable-diffusion-k8s Helm repository
helm repo add amithkk-sd https://amithkk.github.io/stable-diffusion-k8s
helm repo update
  • (Optional) Create a values.yaml with customized settings:

Customized values.yaml
# Default values for stable-diffusion.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

replicaCount: 1

image:
  repository: amithkk/stable-diffusion
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "latest"

# Stable Diffusion and optional companion models. Change these out if you'd like
models:
  sd14: https://www.googleapis.com/storage/v1/b/aai-blog-files/o/sd-v1-4.ckpt?alt=media
  gfpGan13: https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth
  realEsrGanx4p10: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth
  realEsrGanx4Animep02: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
#cliFlags: "--extra-models-cpu --optimized-turbo"
cliFlags: "--extra-models-cpu --no-half"

persistence:
  annotations: {}
  ## If defined, storageClass: <storageClass>
  ## If set to "-", storageClass: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClass spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  storageClass: manual
  accessMode: ReadWriteOnce
  size: 8Gi


serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

podAnnotations: {}

podSecurityContext: {}
  # fsGroup: 2000

securityContext: {}
  # capabilities:
  #   drop:
  #   - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: true
  className: "cilium"
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

resources: {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80
  # targetMemoryUtilizationPercentage: 80

nodeSelector: 
  nvidia.com/gpu.present: "true"

tolerations: []

affinity: {}


  • In values.yaml, edit cliFlags to pass arguments to the web UI. The default flags are --extra-models-cpu --optimized-turbo, which targets about 6GB of GPU memory

  • To enable Textual Inversion, remove the --optimized-turbo flag and add --no-half to cliFlags (see my values.yaml configuration above)

  • If the output is always a solid green image, use the flags --precision full --no-half
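The flag string is simply forwarded to the container: cliFlags ends up verbatim in a CLI_FLAGS environment variable via a ConfigMap (visible in the rendered manifest further down), and is then split into process arguments. A minimal sketch, with a hypothetical render_configmap_data helper standing in for the Helm template:

```python
import shlex

# Hypothetical helper mirroring the chart's template logic:
# values.yaml's cliFlags string becomes the CLI_FLAGS env var.
def render_configmap_data(values: dict) -> dict:
    return {"CLI_FLAGS": values.get("cliFlags", "")}

data = render_configmap_data({"cliFlags": "--extra-models-cpu --no-half"})
# Inside the container the string is split into individual arguments.
argv = shlex.split(data["CLI_FLAGS"])
print(argv)  # ['--extra-models-cpu', '--no-half']
```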

Running the installation

Install stable-diffusion-k8s
helm install --generate-name amithkk-sd/stable-diffusion -f values.yaml

The output:

Output from installing stable-diffusion-k8s
NAME: stable-diffusion-1673539037
LAST DEPLOYED: Thu Jan 12 23:57:31 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
  http://chart-example.local/

Helm Install Complete! It may take a while for the models & container to download (~8GiB)
http://github.com/amithkk/stable-diffusion-k8s

Troubleshooting

Troubleshooting stable-diffusion pod scheduling

  • The pod failed to schedule and stayed in Pending:

  • Inspect the pod:

    kubectl describe pods stable-diffusion-1673539037-0
    

The scheduling failure is visible:

stable-diffusion pod scheduling failure
...
QoS Class:                   Burstable
Node-Selectors:              nvidia.com/gpu.present=true
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  4m42s  default-scheduler  0/8 nodes are available: 8 pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
  • Checked the labels on the GPU-equipped node z-k8s-n-1: the label nvidia.com/gpu.present=true is present

  • The events show scheduling failed because of an unbound PersistentVolumeClaim; in other words, the cluster first needs a volume

The Node-Selectors: nvidia.com/gpu.present=true entry here can be confirmed with kubectl get nodes --show-labels: the node z-k8s-n-1, which has the NVIDIA GPU installed, does carry this label
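A nodeSelector is a plain equality match against node labels: the scheduler only considers nodes whose labels contain every key/value pair in the selector. A minimal sketch (z-k8s-n-2 is a hypothetical non-GPU node for contrast):

```python
# Minimal sketch of nodeSelector matching: a node is eligible only if
# its labels contain every key/value pair of the pod's nodeSelector.
def selector_matches(node_labels: dict, node_selector: dict) -> bool:
    return all(node_labels.get(k) == v for k, v in node_selector.items())

gpu_node = {"kubernetes.io/hostname": "z-k8s-n-1",
            "nvidia.com/gpu.present": "true"}
plain_node = {"kubernetes.io/hostname": "z-k8s-n-2"}  # hypothetical
selector = {"nvidia.com/gpu.present": "true"}

print(selector_matches(gpu_node, selector))    # True
print(selector_matches(plain_node, selector))  # False
```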

My hunch about this "Unbound Immediate PersistentVolumeClaims" error was disk space: every one of my nodes has only a 9.5G disk allocated as the containerd runtime storage directory, and after several installs only 1.x GB of space was left:

/dev/vdb1 running out of space
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           392M   68M  325M  18% /run
/dev/vda2        16G  9.5G  6.3G  61% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/vda1       240M  5.3M  235M   3% /boot/efi
/dev/vdb1       9.4G  7.9G  1.5G  85% /var/lib/containerd

The stable-diffusion image alone is an ~8G download, and the volume claim needs space on top of that
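The arithmetic alone rules this disk out: with roughly 1.5G available, neither the ~8GiB image nor the 8Gi volume claim fits. A quick sanity check:

```python
# Quick arithmetic: can the remaining space on /var/lib/containerd
# hold the ~8GiB stable-diffusion image plus the 8Gi model volume?
GIB = 1024 ** 3
available = int(1.5 * GIB)   # from df: 1.5G avail on /dev/vdb1
image_size = 8 * GIB         # ~8GiB container image + models download
pvc_request = 8 * GIB        # persistence.size: 8Gi

needed = image_size + pvc_request
print(needed // GIB, "GiB needed")  # 16 GiB needed
print(available >= needed)          # False
```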

Check:

$ kubectl get pvc
NAME                                                                    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
stable-diffusion-1673539037-model-store-stable-diffusion-1673539037-0   Pending                                                     16m

$ kubectl get pv
No resources found
  • Expanded the Ceph RBD disk vdb1 online, growing /var/lib/containerd on the VM z-k8s-n-1 to 50G

  • The same scheduling problem remained, however. Careful investigation of the GPU node scheduling failure showed that the passthrough GPU device can only be assigned to a single pod, and the sample pod from the NVIDIA GPU Operator installation had never been deleted, so the GPU was still occupied. After deleting the test pod holding the GPU... what, still not scheduled...

  • Check the PVC carefully:

    $ kubectl get pvc
    NAME                                                                    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    stable-diffusion-1673539037-model-store-stable-diffusion-1673539037-0   Pending                                                     13h
    
    $ kubectl describe pvc stable-diffusion-1673539037-model-store-stable-diffusion-1673539037-0
    Name:          stable-diffusion-1673539037-model-store-stable-diffusion-1673539037-0
    Namespace:     default
    StorageClass:
    Status:        Pending
    Volume:
    Labels:        app.kubernetes.io/instance=stable-diffusion-1673539037
                   app.kubernetes.io/name=stable-diffusion
    Annotations:   <none>
    Finalizers:    [kubernetes.io/pvc-protection]
    Capacity:
    Access Modes:
    VolumeMode:    Filesystem
    Used By:       stable-diffusion-1673539037-0
    Events:
      Type    Reason         Age                     From                         Message
      ----    ------         ----                    ----                         -------
      Normal  FailedBinding  4m19s (x3262 over 13h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
    

describe pod shows the following storage definition:

stable-diffusion-1673589055-model-store:
  Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
  ClaimName:  stable-diffusion-1673589055-model-store-stable-diffusion-1673589055-0
  ReadOnly:   false
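The unwieldy claim name is just the StatefulSet convention: <claim-template-name>-<pod-name>, where the pod name is <statefulset-name>-<ordinal>. Sketched:

```python
# StatefulSet PVC names follow <claim-template>-<pod>, where the pod
# is <statefulset>-<ordinal>; that is why the claim name is so long.
def pvc_name(template: str, statefulset: str, ordinal: int) -> str:
    return f"{template}-{statefulset}-{ordinal}"

name = pvc_name("stable-diffusion-1673589055-model-store",
                "stable-diffusion-1673589055", 0)
print(name)
# stable-diffusion-1673589055-model-store-stable-diffusion-1673589055-0
```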

An issue in the Stable Diffusion on Kubernetes with Helm project, Issues with "storageClassName" #1, mentions that storageClassName defines the persistent storage; a merged change documents:

persistence:
  annotations: {}
  ## If defined, storageClass: <storageClass>
  ## If set to "-", storageClass: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClass spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  accessMode: ReadWriteOnce
  size: 8Gi

To simplify, first temporarily disable dynamic provisioning by setting storageClass to "-":

persistence:
  annotations: {}
  ## If defined, storageClass: <storageClass>
  ## If set to "-", storageClass: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClass spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  storageClass: "-"
  accessMode: ReadWriteOnce
  size: 8Gi
  • Then try installing in debug mode:

    $ helm install --debug --generate-name amithkk-sd/stable-diffusion -f values.yaml
    

The install failed because the chart could not be downloaded:

helm install --debug fails to download stable-diffusion
install.go:178: [debug] Original chart version: ""
Error: INSTALLATION FAILED: Get "https://objects.githubusercontent.com/github-production-release-asset-2e65be/531492005/ed53d251-396e-481c-830b-f4aaf978e997?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230113%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230113T060427Z&X-Amz-Expires=300&X-Amz-Signature=2ec2c1d958e6260e2db265c6fd623a73230e3ce3639c493fdd1f01f809386e30&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=531492005&response-content-disposition=attachment%3B%20filename%3Dstable-diffusion-1.0.6.tgz&response-content-type=application%2Foctet-stream": read tcp 192.168.6.101:41078->185.199.108.133:443: read: connection reset by peer
helm.go:84: [debug] Get "https://objects.githubusercontent.com/github-production-release-asset-2e65be/531492005/ed53d251-396e-481c-830b-f4aaf978e997?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230113%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230113T060427Z&X-Amz-Expires=300&X-Amz-Signature=2ec2c1d958e6260e2db265c6fd623a73230e3ce3639c493fdd1f01f809386e30&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=531492005&response-content-disposition=attachment%3B%20filename%3Dstable-diffusion-1.0.6.tgz&response-content-type=application%2Foctet-stream": read tcp 192.168.6.101:41078->185.199.108.133:443: read: connection reset by peer
INSTALLATION FAILED
main.newInstallCmd.func2
	helm.sh/helm/v3/cmd/helm/install.go:127
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/cobra@v1.4.0/command.go:902
main.main
	helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
	runtime/proc.go:255
runtime.goexit
	runtime/asm_amd64.s:1581

After starting a proxy to get past the GFW, helm works normally:

With the proxy enabled, helm install --debug installs stable-diffusion
install.go:178: [debug] Original chart version: ""
install.go:195: [debug] CHART PATH: /home/huatai/.cache/helm/repository/stable-diffusion-1.0.6.tgz

client.go:128: [debug] creating 5 resource(s)
NAME: stable-diffusion-1673590319
LAST DEPLOYED: Fri Jan 13 14:12:03 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
USER-SUPPLIED VALUES:
affinity: {}
autoscaling:
  enabled: false
  maxReplicas: 100
  minReplicas: 1
  targetCPUUtilizationPercentage: 80
cliFlags: --extra-models-cpu --no-half
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: amithkk/stable-diffusion
  tag: latest
imagePullSecrets: []
ingress:
  annotations: {}
  className: cilium
  enabled: true
  hosts:
  - host: chart-example.local
    paths:
    - path: /
      pathType: ImplementationSpecific
  tls: []
models:
  gfpGan13: https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth
  realEsrGanx4Animep02: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth
  realEsrGanx4p10: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth
  sd14: https://www.googleapis.com/storage/v1/b/aai-blog-files/o/sd-v1-4.ckpt?alt=media
nameOverride: ""
nodeSelector:
  nvidia.com/gpu.present: "true"
persistence:
  accessMode: ReadWriteOnce
  annotations: {}
  size: 8Gi
  storageClass: '-'
podAnnotations: {}
podSecurityContext: {}
replicaCount: 1
resources: {}
securityContext: {}
service:
  port: 80
  type: ClusterIP
serviceAccount:
  annotations: {}
  create: true
  name: ""
tolerations: []

COMPUTED VALUES:
affinity: {}
autoscaling:
  enabled: false
  maxReplicas: 100
  minReplicas: 1
  targetCPUUtilizationPercentage: 80
cliFlags: --extra-models-cpu --no-half
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: amithkk/stable-diffusion
  tag: latest
imagePullSecrets: []
ingress:
  annotations: {}
  className: cilium
  enabled: true
  hosts:
  - host: chart-example.local
    paths:
    - path: /
      pathType: ImplementationSpecific
  tls: []
models:
  gfpGan13: https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth
  realEsrGanx4Animep02: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth
  realEsrGanx4p10: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth
  sd14: https://www.googleapis.com/storage/v1/b/aai-blog-files/o/sd-v1-4.ckpt?alt=media
nameOverride: ""
nodeSelector:
  nvidia.com/gpu.present: "true"
persistence:
  accessMode: ReadWriteOnce
  annotations: {}
  size: 8Gi
  storageClass: '-'
podAnnotations: {}
podSecurityContext: {}
replicaCount: 1
resources: {}
securityContext: {}
service:
  port: 80
  type: ClusterIP
serviceAccount:
  annotations: {}
  create: true
  name: ""
tolerations: []

HOOKS:
MANIFEST:
---
# Source: stable-diffusion/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: stable-diffusion-1673590319
  labels:
    helm.sh/chart: stable-diffusion-1.0.6
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
    app.kubernetes.io/version: "1.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: stable-diffusion/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: stable-diffusion-1673590319-config
data:
  CLI_FLAGS: "--extra-models-cpu --no-half"
---
# Source: stable-diffusion/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: stable-diffusion-1673590319
  labels:
    helm.sh/chart: stable-diffusion-1.0.6
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
    app.kubernetes.io/version: "1.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
---
# Source: stable-diffusion/templates/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stable-diffusion-1673590319
  labels:
    helm.sh/chart: stable-diffusion-1.0.6
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
    app.kubernetes.io/version: "1.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  serviceName: stable-diffusion-1673590319
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: stable-diffusion
      app.kubernetes.io/instance: stable-diffusion-1673590319
  volumeClaimTemplates:
    - metadata:
        name: stable-diffusion-1673590319-model-store
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        storageClassName: "-"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: stable-diffusion
        app.kubernetes.io/instance: stable-diffusion-1673590319
    spec:
      serviceAccountName: stable-diffusion-1673590319
      securityContext:
        {}
      initContainers:
        - name: ensure-stable-models
          image: busybox:1.35
          command: ["/bin/sh"]
          args:
            - -c
            - >-
                wget -nc https://www.googleapis.com/storage/v1/b/aai-blog-files/o/sd-v1-4.ckpt?alt=media -O /models/model.ckpt;
                wget -nc https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth -P /models;
                wget -nc https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth -P /models;
                wget -nc https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth -P /models
          volumeMounts:
            - mountPath: /models
              name: stable-diffusion-1673590319-model-store
      containers:
        - name: stable-diffusion-stable-diffusion
          securityContext:
            {}
          image: "amithkk/stable-diffusion:latest"
          imagePullPolicy: IfNotPresent
          envFrom:
            - configMapRef:
                name: stable-diffusion-1673590319-config
          ports:
            - name: http
              containerPort: 7860
              protocol: TCP
          volumeMounts:
            - mountPath: /models
              name: stable-diffusion-1673590319-model-store
          # Todo - Implement an efficient readiness and liveness check
          resources:
            {}
      nodeSelector:
        nvidia.com/gpu.present: "true"
---
# Source: stable-diffusion/templates/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: stable-diffusion-1673590319
  labels:
    helm.sh/chart: stable-diffusion-1.0.6
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
    app.kubernetes.io/version: "1.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  ingressClassName: cilium
  rules:
    - host: "chart-example.local"
      http:
        paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: stable-diffusion-1673590319
                port:
                  number: 80

NOTES:
1. Get the application URL by running these commands:
  http://chart-example.local/

Helm Install Complete! It may take a while for the models & container to download (~8GiB)
http://github.com/amithkk/stable-diffusion-k8s

Strangely, though, this time kubectl get pods shows no stable-diffusion pod at all

Revert the storageClass: "-" setting and verify with a dry run:

$ helm install --debug --dry-run --generate-name amithkk-sd/stable-diffusion -f values.yaml

Fixing scheduling with PVC and PV configuration

  • Check the PVC that stable-diffusion requests:

    kubectl get pvc stable-diffusion-1673591786-model-store-stable-diffusion-1673591786-0 -o yaml
    

This shows:

The stable-diffusion PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  creationTimestamp: "2023-01-13T06:36:29Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/instance: stable-diffusion-1673591786
    app.kubernetes.io/name: stable-diffusion
  name: stable-diffusion-1673591786-model-store-stable-diffusion-1673591786-0
  namespace: default
  resourceVersion: "50346196"
  uid: 9dbe56a9-c741-4bd3-9914-df2794647ee6
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  volumeMode: Filesystem
status:
  phase: Pending

The reference Configure a Pod to Use a PersistentVolume for Storage contains one crucial sentence:

If the control plane finds a suitable PersistentVolume with the same StorageClass, it binds the claim to the volume.
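That binding rule can be sketched as a simple match: the controller looks for a PV whose storageClassName equals the claim's, whose access modes cover the claim's, and whose capacity is at least the requested size; with no such PV (and no storage class set), the claim stays Pending. A simplified sketch (real Kubernetes quantity parsing is much richer than the helper here):

```python
def parse_gi(size: str) -> int:
    # Simplified size parser for values like "8Gi" (assumption: only
    # Gi quantities appear, as in this chart).
    return int(size.removesuffix("Gi"))

def find_matching_pv(claim: dict, pvs: list):
    # A claim binds to the first PV with the same storageClassName,
    # access modes covering the claim's, and enough capacity.
    for pv in pvs:
        if (pv["storageClassName"] == claim["storageClassName"]
                and set(claim["accessModes"]) <= set(pv["accessModes"])
                and parse_gi(pv["capacity"]) >= parse_gi(claim["request"])):
            return pv
    return None

claim = {"storageClassName": "manual",
         "accessModes": ["ReadWriteOnce"], "request": "8Gi"}
pv = {"name": "stable-diffusion-local-pv", "storageClassName": "manual",
      "accessModes": ["ReadWriteOnce"], "capacity": "10Gi"}

print(find_matching_pv(claim, []))            # None -> claim stays Pending
print(find_matching_pv(claim, [pv])["name"])  # stable-diffusion-local-pv
```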

Now it makes sense:

  • stable-diffusion declares a PVC, which must be matched by a corresponding PV; the PVC-to-PV binding is established through storageClassName

  • The Stable Diffusion on Kubernetes with Helm project issue Issues with "storageClassName" #1 mentions that storageClassName defines the persistent storage

    • It is not configured by default; for a self-hosted deployment you need to define a storageClass that ties a PV to the PVC

    • Local storage can be used (which is exactly what I do next)

  • stable-diffusion-pv.yaml:

Create stable-diffusion-pv.yaml defining local storage
apiVersion: v1
kind: PersistentVolume
metadata:
  name: stable-diffusion-local-pv
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/var/lib/containerd/models"

Note: the storage class defined here is named manual, so values.yaml must be updated to match by adding one line:

Add storageClass: manual to values.yaml
persistence:
  annotations: {}
  ## If defined, storageClass: <storageClass>
  ## If set to "-", storageClass: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClass spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  storageClass: manual
  accessMode: ReadWriteOnce
  size: 8Gi
  • Running helm install ... -f values.yaml again now schedules the pod correctly onto z-k8s-n-1:

    $ kubectl get pods -o wide
    NAME                              READY   STATUS      RESTARTS        AGE     IP           NODE        NOMINATED NODE   READINESS GATES
    ...
    stable-diffusion-1673593163-0     0/2     Init:1/3    36 (16m ago)    169m    10.0.3.164   z-k8s-n-1   <none>           <none>
    

Troubleshooting stable-diffusion container startup failures

Scheduling is now solved, but the pod stays stuck in the init state Init:1/3

  • Inspect the pod:

    kubectl describe pods stable-diffusion-1673593163-0
    

A container start failure is visible:

Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  15m (x628 over 170m)  kubelet  Back-off restarting failed container
  • Check the logs:

    kubectl logs stable-diffusion-1673593163-0
    

Which shows:

Error from server (BadRequest): container "stable-diffusion-stable-diffusion" in pod "stable-diffusion-1673593163-0" is waiting to start: PodInitializing

Image pulls had previously hit TLS connection timeouts because of the GFW, so I suspected the image download was the cause.

Solution

  • containerd proxy (similar to configuring Docker to use a proxy): configure a proxy for the containerd service running in the VMs so image downloads can get past the GFW

  • Use Kubernetes environment variables (env:) to inject the proxy configuration into the pod's containers:

    env:
    - name: HTTP_PROXY
      value: "http://192.168.6.200:3128"
    - name: HTTPS_PROXY
      value: "http://192.168.6.200:3128"
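For the containerd side, the usual approach on a systemd host is an environment drop-in for the containerd service. A sketch: the proxy address is the one from my environment above, and the NO_PROXY list is an assumption you should adjust to your own node and pod CIDRs:

```ini
# /etc/systemd/system/containerd.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://192.168.6.200:3128"
Environment="HTTPS_PROXY=http://192.168.6.200:3128"
# Assumed local ranges: node network 192.168.6.0/24, pod network 10.0.0.0/8
Environment="NO_PROXY=localhost,127.0.0.1,192.168.6.0/24,10.0.0.0/8"
```

Apply it with systemctl daemon-reload followed by systemctl restart containerd.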
    

References