Deploying Stable Diffusion on Kubernetes

Deploying the Stable Diffusion stack (including the Stable Diffusion web UI and automatic model fetching) on my Kubernetes Atlas quickly completes a Machine Learning Atlas case built on NVIDIA GPUs (I am using an NVIDIA Tesla P10 compute card). In addition, my deployment runs on a private cloud architecture virtualized with OVMF-based passthrough of the GPU and NVMe storage, which also validates the cloud-computing approach.

Features

  • Automatic model fetching

  • Works with the NVIDIA GPU Operator, uses the NVIDIA CUDA libraries, and offers a versatile interactive UI

  • GFPGAN for face restoration, RealESRGAN for upscaling

  • Textual Inversion

Prerequisites

Installation

  • Add the stable-diffusion-k8s Helm repository:

Add the stable-diffusion-k8s Helm repository
helm repo add amithkk-sd https://amithkk.github.io/stable-diffusion-k8s
helm repo update
  • (Optional) Create a values.yaml with customized settings:

Customized values.yaml
# Default values for stable-diffusion.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

replicaCount: 1

image:
  repository: amithkk/stable-diffusion
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "latest"

# Stable Diffusion and optional companion models. Change these out if you'd like
models:
  sd14: https://www.googleapis.com/storage/v1/b/aai-blog-files/o/sd-v1-4.ckpt?alt=media
  gfpGan13: https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth
  realEsrGanx4p10: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth
  realEsrGanx4Animep02: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
#cliFlags: "--extra-models-cpu --optimized-turbo"
cliFlags: "--extra-models-cpu --no-half"

persistence:
  annotations: {}
  ## If defined, storageClass: <storageClass>
  ## If set to "-", storageClass: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClass spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  storageClass: manual
  accessMode: ReadWriteOnce
  size: 8Gi


serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

podAnnotations: {}

podSecurityContext: {}
  # fsGroup: 2000

securityContext: {}
  # capabilities:
  #   drop:
  #   - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: true
  className: "cilium"
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

resources: {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80
  # targetMemoryUtilizationPercentage: 80

nodeSelector: 
  nvidia.com/gpu.present: "true"

tolerations: []

affinity: {}


  • In values.yaml, edit cliFlags to pass arguments to the web UI. The default flags are --extra-models-cpu --optimized-turbo, which targets about 6GB of GPU memory

  • To enable Textual Inversion, remove the --optimized-turbo flag and add --no-half to cliFlags (see my values.yaml configuration above)

  • If the output is always a solid green image, use the flags --precision full --no-half
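The flag string is simply forwarded to the container: cliFlags ends up verbatim in a CLI_FLAGS environment variable via a ConfigMap (visible in the rendered manifest further down), and is then split into process arguments. A minimal sketch, with a hypothetical render_configmap_data helper standing in for the Helm template:

```python
import shlex

# Hypothetical helper mirroring the chart's template logic:
# values.yaml's cliFlags string becomes the CLI_FLAGS env var.
def render_configmap_data(values: dict) -> dict:
    return {"CLI_FLAGS": values.get("cliFlags", "")}

data = render_configmap_data({"cliFlags": "--extra-models-cpu --no-half"})
# Inside the container the string is split into individual arguments.
argv = shlex.split(data["CLI_FLAGS"])
print(argv)  # ['--extra-models-cpu', '--no-half']
```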

Running the installation

Install stable-diffusion-k8s
helm install --generate-name amithkk-sd/stable-diffusion -f values.yaml

The output:

Output from installing stable-diffusion-k8s
NAME: stable-diffusion-1673539037
LAST DEPLOYED: Thu Jan 12 23:57:31 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
  http://chart-example.local/

Helm Install Complete! It may take a while for the models & container to download (~8GiB)
http://github.com/amithkk/stable-diffusion-k8s

Troubleshooting

Troubleshooting stable-diffusion pod scheduling

  • The pod failed to schedule and stayed in Pending:

  • Inspect the pod:

    kubectl describe pods stable-diffusion-1673539037-0
    

The scheduling failure is visible:

stable-diffusion pod scheduling failure
...
QoS Class:                   Burstable
Node-Selectors:              nvidia.com/gpu.present=true
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  4m42s  default-scheduler  0/8 nodes are available: 8 pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
  • Checked the labels on the GPU-equipped node z-k8s-n-1: the label nvidia.com/gpu.present=true is present

  • The events show scheduling failed because of an unbound PersistentVolumeClaim; in other words, the cluster first needs a volume

The Node-Selectors: nvidia.com/gpu.present=true entry here can be confirmed with kubectl get nodes --show-labels: the node z-k8s-n-1, which has the NVIDIA GPU installed, does carry this label
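A nodeSelector is a plain equality match against node labels: the scheduler only considers nodes whose labels contain every key/value pair in the selector. A minimal sketch (z-k8s-n-2 is a hypothetical non-GPU node for contrast):

```python
# Minimal sketch of nodeSelector matching: a node is eligible only if
# its labels contain every key/value pair of the pod's nodeSelector.
def selector_matches(node_labels: dict, node_selector: dict) -> bool:
    return all(node_labels.get(k) == v for k, v in node_selector.items())

gpu_node = {"kubernetes.io/hostname": "z-k8s-n-1",
            "nvidia.com/gpu.present": "true"}
plain_node = {"kubernetes.io/hostname": "z-k8s-n-2"}  # hypothetical
selector = {"nvidia.com/gpu.present": "true"}

print(selector_matches(gpu_node, selector))    # True
print(selector_matches(plain_node, selector))  # False
```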

My hunch about this "Unbound Immediate PersistentVolumeClaims" error was disk space: every one of my nodes has only a 9.5G disk allocated as the containerd runtime storage directory, and after several installs only 1.x GB of space was left:

/dev/vdb1 running out of space
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           392M   68M  325M  18% /run
/dev/vda2        16G  9.5G  6.3G  61% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/vda1       240M  5.3M  235M   3% /boot/efi
/dev/vdb1       9.4G  7.9G  1.5G  85% /var/lib/containerd

The stable-diffusion image alone is an ~8G download, and the volume claim needs space on top of that
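The arithmetic alone rules this disk out: with roughly 1.5G available, neither the ~8GiB image nor the 8Gi volume claim fits. A quick sanity check:

```python
# Quick arithmetic: can the remaining space on /var/lib/containerd
# hold the ~8GiB stable-diffusion image plus the 8Gi model volume?
GIB = 1024 ** 3
available = int(1.5 * GIB)   # from df: 1.5G avail on /dev/vdb1
image_size = 8 * GIB         # ~8GiB container image + models download
pvc_request = 8 * GIB        # persistence.size: 8Gi

needed = image_size + pvc_request
print(needed // GIB, "GiB needed")  # 16 GiB needed
print(available >= needed)          # False
```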

Check:

$ kubectl get pvc
NAME                                                                    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
stable-diffusion-1673539037-model-store-stable-diffusion-1673539037-0   Pending                                                     16m

$ kubectl get pv
No resources found
  • Expanded the Ceph RBD disk vdb1 online, growing /var/lib/containerd on the VM z-k8s-n-1 to 50G

  • The same scheduling problem remained, however. Careful investigation of the GPU node scheduling failure showed that the passthrough GPU device can only be assigned to a single pod, and the sample pod from the NVIDIA GPU Operator installation had never been deleted, so the GPU was still occupied. After deleting the test pod holding the GPU... what, still not scheduled...

  • Check the PVC carefully:

    $ kubectl get pvc
    NAME                                                                    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    stable-diffusion-1673539037-model-store-stable-diffusion-1673539037-0   Pending                                                     13h
    
    $ kubectl describe pvc stable-diffusion-1673539037-model-store-stable-diffusion-1673539037-0
    Name:          stable-diffusion-1673539037-model-store-stable-diffusion-1673539037-0
    Namespace:     default
    StorageClass:
    Status:        Pending
    Volume:
    Labels:        app.kubernetes.io/instance=stable-diffusion-1673539037
                   app.kubernetes.io/name=stable-diffusion
    Annotations:   <none>
    Finalizers:    [kubernetes.io/pvc-protection]
    Capacity:
    Access Modes:
    VolumeMode:    Filesystem
    Used By:       stable-diffusion-1673539037-0
    Events:
      Type    Reason         Age                     From                         Message
      ----    ------         ----                    ----                         -------
      Normal  FailedBinding  4m19s (x3262 over 13h)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
    

describe pod shows the following storage definition:

stable-diffusion-1673589055-model-store:
  Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
  ClaimName:  stable-diffusion-1673589055-model-store-stable-diffusion-1673589055-0
  ReadOnly:   false
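The unwieldy claim name is just the StatefulSet convention: <claim-template-name>-<pod-name>, where the pod name is <statefulset-name>-<ordinal>. Sketched:

```python
# StatefulSet PVC names follow <claim-template>-<pod>, where the pod
# is <statefulset>-<ordinal>; that is why the claim name is so long.
def pvc_name(template: str, statefulset: str, ordinal: int) -> str:
    return f"{template}-{statefulset}-{ordinal}"

name = pvc_name("stable-diffusion-1673589055-model-store",
                "stable-diffusion-1673589055", 0)
print(name)
# stable-diffusion-1673589055-model-store-stable-diffusion-1673589055-0
```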

An issue in the Stable Diffusion on Kubernetes with Helm project, Issues with "storageClassName" #1, mentions that storageClassName defines the persistent storage; a merged change documents:

persistence:
  annotations: {}
  ## If defined, storageClass: <storageClass>
  ## If set to "-", storageClass: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClass spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  accessMode: ReadWriteOnce
  size: 8Gi

To simplify, first temporarily disable dynamic provisioning by setting storageClass to "-":

persistence:
  annotations: {}
  ## If defined, storageClass: <storageClass>
  ## If set to "-", storageClass: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClass spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  storageClass: "-"
  accessMode: ReadWriteOnce
  size: 8Gi
  • Then try installing in debug mode:

    $ helm install --debug --generate-name amithkk-sd/stable-diffusion -f values.yaml
    

The install failed because the chart could not be downloaded:

helm install --debug fails to download stable-diffusion
install.go:178: [debug] Original chart version: ""
Error: INSTALLATION FAILED: Get "https://objects.githubusercontent.com/github-production-release-asset-2e65be/531492005/ed53d251-396e-481c-830b-f4aaf978e997?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230113%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230113T060427Z&X-Amz-Expires=300&X-Amz-Signature=2ec2c1d958e6260e2db265c6fd623a73230e3ce3639c493fdd1f01f809386e30&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=531492005&response-content-disposition=attachment%3B%20filename%3Dstable-diffusion-1.0.6.tgz&response-content-type=application%2Foctet-stream": read tcp 192.168.6.101:41078->185.199.108.133:443: read: connection reset by peer
helm.go:84: [debug] Get "https://objects.githubusercontent.com/github-production-release-asset-2e65be/531492005/ed53d251-396e-481c-830b-f4aaf978e997?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230113%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230113T060427Z&X-Amz-Expires=300&X-Amz-Signature=2ec2c1d958e6260e2db265c6fd623a73230e3ce3639c493fdd1f01f809386e30&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=531492005&response-content-disposition=attachment%3B%20filename%3Dstable-diffusion-1.0.6.tgz&response-content-type=application%2Foctet-stream": read tcp 192.168.6.101:41078->185.199.108.133:443: read: connection reset by peer
INSTALLATION FAILED
main.newInstallCmd.func2
	helm.sh/helm/v3/cmd/helm/install.go:127
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/cobra@v1.4.0/command.go:902
main.main
	helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
	runtime/proc.go:255
runtime.goexit
	runtime/asm_amd64.s:1581

After starting a proxy to get past the GFW, helm works normally:

With the proxy enabled, helm install --debug installs stable-diffusion
install.go:178: [debug] Original chart version: ""
install.go:195: [debug] CHART PATH: /home/huatai/.cache/helm/repository/stable-diffusion-1.0.6.tgz

client.go:128: [debug] creating 5 resource(s)
NAME: stable-diffusion-1673590319
LAST DEPLOYED: Fri Jan 13 14:12:03 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
USER-SUPPLIED VALUES:
affinity: {}
autoscaling:
  enabled: false
  maxReplicas: 100
  minReplicas: 1
  targetCPUUtilizationPercentage: 80
cliFlags: --extra-models-cpu --no-half
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: amithkk/stable-diffusion
  tag: latest
imagePullSecrets: []
ingress:
  annotations: {}
  className: cilium
  enabled: true
  hosts:
  - host: chart-example.local
    paths:
    - path: /
      pathType: ImplementationSpecific
  tls: []
models:
  gfpGan13: https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth
  realEsrGanx4Animep02: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth
  realEsrGanx4p10: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth
  sd14: https://www.googleapis.com/storage/v1/b/aai-blog-files/o/sd-v1-4.ckpt?alt=media
nameOverride: ""
nodeSelector:
  nvidia.com/gpu.present: "true"
persistence:
  accessMode: ReadWriteOnce
  annotations: {}
  size: 8Gi
  storageClass: '-'
podAnnotations: {}
podSecurityContext: {}
replicaCount: 1
resources: {}
securityContext: {}
service:
  port: 80
  type: ClusterIP
serviceAccount:
  annotations: {}
  create: true
  name: ""
tolerations: []

COMPUTED VALUES:
affinity: {}
autoscaling:
  enabled: false
  maxReplicas: 100
  minReplicas: 1
  targetCPUUtilizationPercentage: 80
cliFlags: --extra-models-cpu --no-half
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: amithkk/stable-diffusion
  tag: latest
imagePullSecrets: []
ingress:
  annotations: {}
  className: cilium
  enabled: true
  hosts:
  - host: chart-example.local
    paths:
    - path: /
      pathType: ImplementationSpecific
  tls: []
models:
  gfpGan13: https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth
  realEsrGanx4Animep02: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth
  realEsrGanx4p10: https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth
  sd14: https://www.googleapis.com/storage/v1/b/aai-blog-files/o/sd-v1-4.ckpt?alt=media
nameOverride: ""
nodeSelector:
  nvidia.com/gpu.present: "true"
persistence:
  accessMode: ReadWriteOnce
  annotations: {}
  size: 8Gi
  storageClass: '-'
podAnnotations: {}
podSecurityContext: {}
replicaCount: 1
resources: {}
securityContext: {}
service:
  port: 80
  type: ClusterIP
serviceAccount:
  annotations: {}
  create: true
  name: ""
tolerations: []

HOOKS:
MANIFEST:
---
# Source: stable-diffusion/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: stable-diffusion-1673590319
  labels:
    helm.sh/chart: stable-diffusion-1.0.6
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
    app.kubernetes.io/version: "1.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: stable-diffusion/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: stable-diffusion-1673590319-config
data:
  CLI_FLAGS: "--extra-models-cpu --no-half"
---
# Source: stable-diffusion/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: stable-diffusion-1673590319
  labels:
    helm.sh/chart: stable-diffusion-1.0.6
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
    app.kubernetes.io/version: "1.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
---
# Source: stable-diffusion/templates/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stable-diffusion-1673590319
  labels:
    helm.sh/chart: stable-diffusion-1.0.6
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
    app.kubernetes.io/version: "1.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  serviceName: stable-diffusion-1673590319
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: stable-diffusion
      app.kubernetes.io/instance: stable-diffusion-1673590319
  volumeClaimTemplates:
    - metadata:
        name: stable-diffusion-1673590319-model-store
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        storageClassName: "-"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: stable-diffusion
        app.kubernetes.io/instance: stable-diffusion-1673590319
    spec:
      serviceAccountName: stable-diffusion-1673590319
      securityContext:
        {}
      initContainers:
        - name: ensure-stable-models
          image: busybox:1.35
          command: ["/bin/sh"]
          args:
            - -c
            - >-
                wget -nc https://www.googleapis.com/storage/v1/b/aai-blog-files/o/sd-v1-4.ckpt?alt=media -O /models/model.ckpt;
                wget -nc https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth -P /models;
                wget -nc https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth -P /models;
                wget -nc https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth -P /models
          volumeMounts:
            - mountPath: /models
              name: stable-diffusion-1673590319-model-store
      containers:
        - name: stable-diffusion-stable-diffusion
          securityContext:
            {}
          image: "amithkk/stable-diffusion:latest"
          imagePullPolicy: IfNotPresent
          envFrom:
            - configMapRef:
                name: stable-diffusion-1673590319-config
          ports:
            - name: http
              containerPort: 7860
              protocol: TCP
          volumeMounts:
            - mountPath: /models
              name: stable-diffusion-1673590319-model-store
          # Todo - Implement an efficient readiness and liveness check
          resources:
            {}
      nodeSelector:
        nvidia.com/gpu.present: "true"
---
# Source: stable-diffusion/templates/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: stable-diffusion-1673590319
  labels:
    helm.sh/chart: stable-diffusion-1.0.6
    app.kubernetes.io/name: stable-diffusion
    app.kubernetes.io/instance: stable-diffusion-1673590319
    app.kubernetes.io/version: "1.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  ingressClassName: cilium
  rules:
    - host: "chart-example.local"
      http:
        paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: stable-diffusion-1673590319
                port:
                  number: 80

NOTES:
1. Get the application URL by running these commands:
  http://chart-example.local/

Helm Install Complete! It may take a while for the models & container to download (~8GiB)
http://github.com/amithkk/stable-diffusion-k8s

Strangely, though, this time kubectl get pods shows no stable-diffusion pod at all

Revert the storageClass: "-" setting and verify with a dry run:

$ helm install --debug --dry-run --generate-name amithkk-sd/stable-diffusion -f values.yaml

Fixing scheduling with PVC and PV configuration

  • Check the PVC that stable-diffusion requests:

    kubectl get pvc stable-diffusion-1673591786-model-store-stable-diffusion-1673591786-0 -o yaml
    

This shows:

The stable-diffusion PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  creationTimestamp: "2023-01-13T06:36:29Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/instance: stable-diffusion-1673591786
    app.kubernetes.io/name: stable-diffusion
  name: stable-diffusion-1673591786-model-store-stable-diffusion-1673591786-0
  namespace: default
  resourceVersion: "50346196"
  uid: 9dbe56a9-c741-4bd3-9914-df2794647ee6
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  volumeMode: Filesystem
status:
  phase: Pending

The reference Configure a Pod to Use a PersistentVolume for Storage contains one crucial sentence:

If the control plane finds a suitable PersistentVolume with the same StorageClass, it binds the claim to the volume.
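That binding rule can be sketched as a simple match: the controller looks for a PV whose storageClassName equals the claim's, whose access modes cover the claim's, and whose capacity is at least the requested size; with no such PV (and no storage class set), the claim stays Pending. A simplified sketch (real Kubernetes quantity parsing is much richer than the helper here):

```python
def parse_gi(size: str) -> int:
    # Simplified size parser for values like "8Gi" (assumption: only
    # Gi quantities appear, as in this chart).
    return int(size.removesuffix("Gi"))

def find_matching_pv(claim: dict, pvs: list):
    # A claim binds to the first PV with the same storageClassName,
    # access modes covering the claim's, and enough capacity.
    for pv in pvs:
        if (pv["storageClassName"] == claim["storageClassName"]
                and set(claim["accessModes"]) <= set(pv["accessModes"])
                and parse_gi(pv["capacity"]) >= parse_gi(claim["request"])):
            return pv
    return None

claim = {"storageClassName": "manual",
         "accessModes": ["ReadWriteOnce"], "request": "8Gi"}
pv = {"name": "stable-diffusion-local-pv", "storageClassName": "manual",
      "accessModes": ["ReadWriteOnce"], "capacity": "10Gi"}

print(find_matching_pv(claim, []))            # None -> claim stays Pending
print(find_matching_pv(claim, [pv])["name"])  # stable-diffusion-local-pv
```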

Now it makes sense:

  • stable-diffusion declares a PVC, which must be matched by a corresponding PV; the PVC-to-PV binding is established through storageClassName

  • The Stable Diffusion on Kubernetes with Helm project issue Issues with "storageClassName" #1 mentions that storageClassName defines the persistent storage

    • It is not configured by default; for a self-hosted deployment you need to define a storageClass that ties a PV to the PVC

    • Local storage can be used (which is exactly what I do next)

  • stable-diffusion-pv.yaml:

Create stable-diffusion-pv.yaml defining local storage
apiVersion: v1
kind: PersistentVolume
metadata:
  name: stable-diffusion-local-pv
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/var/lib/containerd/models"

Note: the storage class defined here is named manual, so values.yaml must be updated to match by adding one line:

Add storageClass: manual to values.yaml
persistence:
  annotations: {}
  ## If defined, storageClass: <storageClass>
  ## If set to "-", storageClass: "", which disables dynamic provisioning
  ## If undefined (the default) or set to null, no storageClass spec is
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on
  ##   GKE, AWS & OpenStack)
  ##
  storageClass: manual
  accessMode: ReadWriteOnce
  size: 8Gi
  • Running helm install ... -f values.yaml again now schedules the pod correctly onto z-k8s-n-1:

    $ kubectl get pods -o wide
    NAME                              READY   STATUS      RESTARTS        AGE     IP           NODE        NOMINATED NODE   READINESS GATES
    ...
    stable-diffusion-1673593163-0     0/2     Init:1/3    36 (16m ago)    169m    10.0.3.164   z-k8s-n-1   <none>           <none>
    

Troubleshooting stable-diffusion container startup failures

Scheduling is now solved, but the pod stays stuck in the init state Init:1/3

  • Inspect the pod:

    kubectl describe pods stable-diffusion-1673593163-0
    

A container start failure is visible:

Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  15m (x628 over 170m)  kubelet  Back-off restarting failed container
  • Check the logs:

    kubectl logs stable-diffusion-1673593163-0
    

Which shows:

Error from server (BadRequest): container "stable-diffusion-stable-diffusion" in pod "stable-diffusion-1673593163-0" is waiting to start: PodInitializing

Image pulls had previously hit TLS connection timeouts because of the GFW, so I suspected the image download was the cause.

Solution

  • containerd proxy (similar to configuring Docker to use a proxy): configure a proxy for the containerd service running in the VMs so image downloads can get past the GFW

  • Use Kubernetes environment variables (env:) to inject the proxy configuration into the pod's containers:

    env:
    - name: HTTP_PROXY
      value: "http://192.168.6.200:3128"
    - name: HTTPS_PROXY
      value: "http://192.168.6.200:3128"
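For the containerd side, the usual approach on a systemd host is an environment drop-in for the containerd service. A sketch: the proxy address is the one from my environment above, and the NO_PROXY list is an assumption you should adjust to your own node and pod CIDRs:

```ini
# /etc/systemd/system/containerd.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://192.168.6.200:3128"
Environment="HTTPS_PROXY=http://192.168.6.200:3128"
# Assumed local ranges: node network 192.168.6.0/24, pod network 10.0.0.0/8
Environment="NO_PROXY=localhost,127.0.0.1,192.168.6.0/24,10.0.0.0/8"
```

Apply it with systemctl daemon-reload followed by systemctl restart containerd.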
    

References