Install kubeflow with a single command

Prerequisites

  • kustomize: run the official binary install script (requires unrestricted network access); it places the kustomize binary for the current OS in the current directory:
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"  | bash
  • kubectl
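
Before installing, it is worth a quick sanity check that both tools are on the PATH and new enough for the manifests (the exact minimum versions are documented in the kubeflow/manifests README and not restated here):

# verify the prerequisites are installed and report their versions
./kustomize version      # or just `kustomize version` if installed system-wide
kubectl version --client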

Installation

  • Clone the repository and change into the manifests directory:

Install kubeflow with a single command
git clone git@github.com:kubeflow/manifests.git
cd manifests

# the following single command is all that is needed to install
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Note

The installation is so clean and simple that it deserves applause… a stark contrast with the software delivery at my company…

  • After the installation completes, it may take a while for all the pods to become ready; confirm with the following commands:

Check that all the installed kubeflow-related pods are ready
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com
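
As a shortcut, the same check can be expressed as a wait loop; this is just a convenience sketch, assuming every pod in these namespaces is expected to reach the Ready condition:

# wait up to 10 minutes per namespace for all pods to report Ready
for ns in cert-manager istio-system auth knative-eventing knative-serving kubeflow kubeflow-user-example-com; do
  kubectl wait --for=condition=Ready pods --all -n $ns --timeout=600s
done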

Note

The virtual machines hosting my Kubernetes cluster (y-k8s) were built with minimal virtual disks, and I ran into an awkward problem: node-pressure eviction, i.e. running pods being evicted because the nodes ran out of disk space. When the readiness checks above revealed problems, I fixed it by growing the disks as described in "Online resizing of Ceph RBD devices with libvirt and XFS" (in practice an offline resize, during which I also migrated /var/lib/docker to /var/lib/containerd).
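
If you suspect the same issue, the node conditions and eviction events make it easy to confirm; a quick check:

# DiskPressure=True in the node conditions indicates the kubelet is evicting pods
kubectl describe nodes | grep -A 8 "Conditions:"
# evicted pods leave Failed pods and Evicted events behind
kubectl get events -A --field-selector reason=Evicted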

Note that this simple deployment approach is probably only suitable for test environments: at least from what I can see, the deployments are all single-replica and provide no redundancy. I will study this more carefully later.

Troubleshooting

After fixing the disk-space shortage on the y-k8s cluster, I cleaned up the pods stuck in ContainerStatusUnknown, then went through the namespaces listed above checking whether their pods were running normally.
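
For reference, the cleanup amounted to the following; this is a sketch that assumes the leftover pods are reported with phase Failed (which was the case for the evicted ContainerStatusUnknown pods here):

# delete pods left behind by the node-pressure eviction (phase Failed)
kubectl delete pods -A --field-selector=status.phase=Failed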

oidc-authservice pending

  • kubectl get pods -n istio-system shows oidc-authservice-0 stuck in Pending; kubectl -n istio-system describe pods oidc-authservice-0 outputs the following:

describe pods oidc-authservice-0 shows that scheduling fails because the corresponding PVC is unbound
Name:             oidc-authservice-0
Namespace:        istio-system
Priority:         0
Service Account:  authservice
Node:             <none>
Labels:           app=authservice
                  controller-revision-hash=oidc-authservice-7bd6b4b965
                  statefulset.kubernetes.io/pod-name=oidc-authservice-0
Annotations:      sidecar.istio.io/inject: false
Status:           Pending
IP:
IPs:              <none>
Controlled By:    StatefulSet/oidc-authservice
Containers:
  authservice:
    Image:      gcr.io/arrikto/kubeflow/oidc-authservice:e236439
    Port:       8080/TCP
    Host Port:  0/TCP
    Readiness:  http-get http://:8081/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      oidc-authservice-client      Secret     Optional: false
      oidc-authservice-parameters  ConfigMap  Optional: false
    Environment:                   <none>
    Mounts:
      /var/lib/authservice from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-khhmb (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  authservice-pvc
    ReadOnly:   false
  kube-api-access-khhmb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m30s (x48 over 4h1m)  default-scheduler  0/5 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod..

  • Check the PVC: kubectl -n istio-system get pvc shows authservice-pvc is Pending, so inspect it with kubectl -n istio-system get pvc authservice-pvc -o yaml, which outputs the following:

get pvc authservice-pvc
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"name":"authservice-pvc","namespace":"istio-system"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"10Gi"}}}}
  creationTimestamp: "2023-08-30T14:46:40Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: authservice-pvc
  namespace: istio-system
  resourceVersion: "13270831"
  uid: 425e321b-20cd-44b4-a797-28f092bfc42a
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  volumeMode: Filesystem
status:
  phase: Pending
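
A quick side check that helps here is to list the cluster's StorageClasses and read the claim's events:

# list StorageClasses; the cluster default (if any) is tagged "(default)" after its name
kubectl get storageclass
# the Events section of the claim explains why binding is stuck
kubectl describe pvc authservice-pvc -n istio-system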

At first I assumed this was a simple Kubernetes PV/PVC binding (similar to the kube-prometheus-stack persistent volumes I had set up before), and I planned to use the opportunity to practice exporting ZFS over NFS as described in "Deploying NFS in Kubernetes".

A closer look at authservice-pvc, however, shows that it differs from a static PV/PVC configuration: authservice-pvc does not specify a storageClassName that would bind it to a pre-created PV. In other words, what is expected here is Kubernetes Dynamic Volume Provisioning.

Since I am not deploying on a cloud provider's platform (cloud vendors usually provide a Kubernetes Container Storage Interface (CSI) driver, and once the DefaultStorageClass admission plugin is configured, PVs can be created without specifying a storage class), I have to deploy an implementation myself:

and then, with the DefaultStorageClass admission plugin configured, provide Kubernetes Dynamic Volume Provisioning for the kubeflow manifests.
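
For completeness, once a provisioner is running, the remaining step is marking its StorageClass as the cluster default so that the unmodified kubeflow manifests (which omit storageClassName) get their claims bound automatically. A minimal sketch, assuming an NFS provisioner; the StorageClass name and provisioner string below are placeholders for whatever is actually deployed:

# the DefaultStorageClass admission plugin (enabled by default in kube-apiserver)
# injects the default StorageClass into PVCs that omit storageClassName
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-default                     # placeholder name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: cluster.local/nfs-subdir-external-provisioner   # placeholder: the provisioner you deployed
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF

# verify the default marker, then re-check the pending claim
kubectl get storageclass
kubectl get pvc -n istio-system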

References