Integrating Hubble with OpenTelemetry

Prerequisites

Deployment overview

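This walkthrough deploys OpenTelemetry in a Kubernetes cluster that uses Cilium networking. It also deploys the Jaeger distributed tracing system and cert-manager (X.509 certificate management). CiliumNetworkPolicy and CiliumClusterwideNetworkPolicy are then used to enable DNS and HTTP visibility for a simple demo deployment:

  • The first OpenTelemetryCollector deploys the Hubble adaptor and configures a Hubble receiver that reads L7 flow data from every node. It then writes the resulting trace data to Jaeger.

  • The second OpenTelemetryCollector deploys the upstream OpenTelemetry distribution as a sidecar for the demo application.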

Basic setup

Install cert-manager in the Cilium Kubernetes cluster
kubectl apply -k github.com/cilium/kustomize-bases/cert-manager
  • Run the following commands to make sure cert-manager is fully ready:

Wait for cert-manager to become ready
(
  set -e
  kubectl wait deployment --namespace="cert-manager" --for="condition=Available" cert-manager-webhook cert-manager-cainjector cert-manager --timeout=3m
  kubectl wait pods --namespace="cert-manager" --for="condition=Ready" --all --timeout=3m
  kubectl wait apiservice --for="condition=Available" v1.cert-manager.io v1.acme.cert-manager.io --timeout=3m
  until kubectl get secret --namespace="cert-manager" cert-manager-webhook-ca 2> /dev/null ; do sleep 0.5 ; done
)

Output:

cert-manager ready
deployment.apps/cert-manager-webhook condition met
deployment.apps/cert-manager-cainjector condition met
deployment.apps/cert-manager condition met
pod/cert-manager-7b4f4986bb-6sdlt condition met
pod/cert-manager-cainjector-6b9d8b7d57-5fw2r condition met
pod/cert-manager-webhook-d7bc6f65d-25kwr condition met
apiservice.apiregistration.k8s.io/v1.cert-manager.io condition met
apiservice.apiregistration.k8s.io/v1.acme.cert-manager.io condition met
NAME                      TYPE     DATA   AGE
cert-manager-webhook-ca   Opaque   3      115s
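The `kubectl wait` calls in the wait script above all stop after `--timeout=3m`. Its last command is different: it polls for the `cert-manager-webhook-ca` Secret with `until ... sleep 0.5` and never gives up. Below is a minimal POSIX sh sketch of a bounded version. The helper name `wait_until` is hypothetical and not part of any tool.

```shell
#!/bin/sh
# wait_until LIMIT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every 0.5 s until it exits 0, or returns 1 once
# LIMIT_SECONDS have elapsed. Uses `date +%s` as the clock.
wait_until() {
  wait_limit="$1"
  shift
  deadline=$(( $(date +%s) + wait_limit ))
  until "$@" > /dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${wait_limit}s waiting for: $*" >&2
      return 1
    fi
    sleep 0.5
  done
}

# Bounded replacement for the last line of the wait script:
# wait_until 180 kubectl get secret --namespace=cert-manager cert-manager-webhook-ca
```

The limit of 180 seconds in the comment mirrors the `--timeout=3m` used by the other wait commands.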
Install the Jaeger operator in the Cilium Kubernetes cluster
kubectl apply -k github.com/cilium/kustomize-bases/jaeger

Output:

Output of installing the Jaeger operator
namespace/jaeger created
customresourcedefinition.apiextensions.k8s.io/jaegers.jaegertracing.io created
serviceaccount/jaeger-operator created
role.rbac.authorization.k8s.io/jaeger-operator created
clusterrole.rbac.authorization.k8s.io/jaeger-operator created
rolebinding.rbac.authorization.k8s.io/jaeger-operator created
clusterrolebinding.rbac.authorization.k8s.io/jaeger-operator created
deployment.apps/jaeger-operator created
Configure a Jaeger instance with an in-memory storage backend
cat > jaeger.yaml << EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-default
  namespace: jaeger
spec:
  strategy: allInOne
  storage:
    type: memory
    options:
      memory:
        max-traces: 100000
  ingress:
    enabled: false
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
EOF
kubectl apply -f jaeger.yaml
Install the OpenTelemetry operator in the Cilium Kubernetes cluster
kubectl apply -k github.com/cilium/kustomize-bases/opentelemetry
  • Configure the Hubble receiver and the Jaeger exporter:

Configure the Hubble receiver and Jaeger exporter in the Cilium Kubernetes cluster
cat > otelcol.yaml << EOF
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otelcol-hubble
  namespace: kube-system
spec:
  mode: daemonset
  image: ghcr.io/cilium/hubble-otel/otelcol:v0.1.1
  env:
    # set the NODE_IP environment variable using the downward API
    - name: NODE_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
  volumes:
    # this example connects to the Hubble socket of the Cilium agent
    # using the host port and TLS
    - name: hubble-tls
      projected:
        defaultMode: 256
        sources:
          - secret:
              name: hubble-relay-client-certs
              items:
                - key: tls.crt
                  path: client.crt
                - key: tls.key
                  path: client.key
                - key: ca.crt
                  path: ca.crt
    # it is also possible to use the UNIX socket, for which
    # the following volume will be needed
    # - name: cilium-run
    #   hostPath:
    #     path: /var/run/cilium
    #     type: Directory
  volumeMounts:
    # - name: cilium-run
    #   mountPath: /var/run/cilium
    - name: hubble-tls
      mountPath: /var/run/hubble-tls
      readOnly: true
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:55690
      hubble:
        # NODE_IP is substituted by the collector at runtime
        # the '\' prefix is needed only because this config is inlined
        # in a shell heredoc in the guide for easy pasting, i.e. to stop
        # the shell from substituting it
        endpoint: \${NODE_IP}:4244 # unix:///var/run/cilium/hubble.sock
        buffer_size: 100
        include_flow_types:
          # this sets an L7 flow filter; removing this section will
          # disable filtering and result in all types of flows being
          # turned into spans;
          # other type filters can be set, the names are same as what's
          # used in 'hubble observe -t <type>'
          traces: ["l7"]
        tls:
          insecure_skip_verify: true
          ca_file: /var/run/hubble-tls/ca.crt
          cert_file: /var/run/hubble-tls/client.crt
          key_file: /var/run/hubble-tls/client.key
    processors:
      batch:
        timeout: 30s
        send_batch_size: 100

    exporters:
      jaeger:
        endpoint: jaeger-default-collector.jaeger.svc.cluster.local:14250
        tls:
          insecure: true

    service:
      telemetry:
        logs:
          level: info
      pipelines:
        traces:
          receivers: [hubble, otlp]
          processors: [batch]
          exporters: [jaeger]
EOF
kubectl apply -f otelcol.yaml
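The `otelcol.yaml` heredoc uses an unquoted delimiter (`<< EOF`). That means the shell expands every `${...}` before `kubectl apply` sees the file, which is why the Hubble endpoint is written as `\${NODE_IP}`. Without the backslash, `NODE_IP` would be expanded on the workstation, usually to an empty string, which gives `endpoint: :4244`.

The sketch below compares three cases using an arbitrary example address. Quoting the delimiter (`<< 'EOF'`) disables expansion entirely. In that case the backslash must be dropped, or it would end up in the YAML.

```shell
#!/bin/sh
# Shows how a heredoc treats ${NODE_IP} depending on quoting.
NODE_IP="10.0.0.1"   # example value only; in the pod it comes from status.hostIP

# Unquoted delimiter + backslash: ${NODE_IP} is kept for the collector.
escaped=$(cat << EOF
endpoint: \${NODE_IP}:4244
EOF
)

# Unquoted delimiter, no backslash: the shell expands it immediately.
expanded=$(cat << EOF
endpoint: ${NODE_IP}:4244
EOF
)

# Quoted delimiter: no expansion at all, so no backslash is needed.
quoted=$(cat << 'EOF'
endpoint: ${NODE_IP}:4244
EOF
)

printf '%s\n' "$escaped" "$expanded" "$quoted"
```

This prints `endpoint: ${NODE_IP}:4244`, `endpoint: 10.0.0.1:4244`, and `endpoint: ${NODE_IP}:4244`. Only the first and third leave the placeholder for the collector to substitute at runtime.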

Then check that the collector is running correctly as a DaemonSet:

Check that otelcol-hubble-collector is running
kubectl get pod -n kube-system -l app.kubernetes.io/name=otelcol-hubble-collector

If everything is working, one collector pod is running on each worker node:

otelcol-hubble-collector check output
NAME                             READY   STATUS    RESTARTS   AGE
otelcol-hubble-collector-2g82m   1/1     Running   0          5m15s
otelcol-hubble-collector-46gtj   1/1     Running   0          5m15s
otelcol-hubble-collector-97pwj   1/1     Running   0          5m15s
otelcol-hubble-collector-qhzkn   1/1     Running   0          5m15s
otelcol-hubble-collector-xt7xl   1/1     Running   0          5m15s
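Instead of reading the pod list by eye, the check can be scripted. The sketch below assumes the default `kubectl get pod` table layout (NAME, READY, STATUS, ...). The hypothetical helper `count_ready` prints how many collector pods are Running with all containers ready. You can then compare that number with the number of nodes the DaemonSet should cover.

```shell
#!/bin/sh
# count_ready: reads `kubectl get pod` table output on stdin and prints
# "<ready>/<total>". A pod counts as ready when STATUS is Running and
# the READY column shows all containers ready (e.g. 1/1).
count_ready() {
  awk 'NR > 1 && NF > 0 {
         total++
         split($2, c, "/")
         if ($3 == "Running" && c[1] == c[2]) ready++
       }
       END { printf "%d/%d\n", ready, total }'
}
```

For example, `kubectl get pod -n kube-system -l app.kubernetes.io/name=otelcol-hubble-collector | count_ready` prints `5/5` for the output above.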
  • You can now check the logs:

Check the otelcol-hubble-collector logs
kubectl logs -n kube-system -l app.kubernetes.io/name=otelcol-hubble-collector
Access the Jaeger UI through kubectl port-forward (then open http://localhost:16686 in a browser)
kubectl port-forward svc/jaeger-default-query -n jaeger 16686

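Deploy the podinfo demo application

The following example verifies the tracing pipeline deployed above with a DNS and HTTP tracing demo.

  • Create a namespace for the demo application: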

    kubectl create ns podinfo
    
  • Enable HTTP visibility for the podinfo application, and DNS visibility for all traffic:

Configure the Cilium visibility policies
cat > visibility-policies.yaml << EOF
---
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: default-allow
spec:
  endpointSelector: {}
  egress:
    - toEntities:
        - cluster
        - world
    - toEndpoints:
        - {}
---
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: dns-visibility
spec:
  endpointSelector: {}
  egress:
    - toEndpoints:
      - matchLabels:
          k8s:io.kubernetes.pod.namespace: kube-system
          k8s:k8s-app: kube-dns
      toPorts:
      - ports:
        - port: "53"
          protocol: ANY
        rules:
          dns:
            - matchPattern: "*"
    - toFQDNs:
      - matchPattern: "*"
    - toEndpoints:
      - {}
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: http-visibility
  namespace: podinfo
spec:
  endpointSelector: {}
  egress:
    - toPorts:
      - ports:
        - port: "9898"
          protocol: TCP
        rules:
          http:
          - method: ".*"
    - toEndpoints:
      - {}
EOF
kubectl apply -f visibility-policies.yaml

Output:

ciliumclusterwidenetworkpolicy.cilium.io/default-allow created
ciliumclusterwidenetworkpolicy.cilium.io/dns-visibility created
ciliumnetworkpolicy.cilium.io/http-visibility created
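With these policies in place, Cilium's proxy records DNS and HTTP requests as L7 flows. This is the same flow type the collector's Hubble receiver selects with `traces: ["l7"]`. Before looking in Jaeger, it helps to confirm that such flows exist at all.

The sketch below assumes the `hubble` CLI is installed and can reach Hubble Relay, for example while `cilium hubble port-forward` runs in another terminal. The wrapper name `observe_l7` is hypothetical.

```shell
#!/bin/sh
# observe_l7 [NAMESPACE]: prints the 20 most recent L7 flows (HTTP, DNS)
# that Hubble recorded for NAMESPACE (default: podinfo).
observe_l7() {
  hubble observe --namespace "${1:-podinfo}" -t l7 --last 20
}
```

Run `observe_l7` after sending traffic to the demo application. If it prints nothing, the Hubble receiver has no L7 flows to turn into spans either.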
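  • The podinfo application is instrumented with the OpenTelemetry SDK. One way to export its traces is through a collector sidecar, so add the sidecar configuration:

Configure the OpenTelemetry sidecar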
cat > otelcol-podinfo.yaml << EOF
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otelcol-podinfo
  namespace: podinfo
spec:
  mode: sidecar
  config: |
    receivers:
      otlp:
        protocols:
          http: {}
    exporters:
      logging:
        loglevel: info
      otlp:
        endpoint: otelcol-hubble-collector.kube-system.svc.cluster.local:55690
        tls:
          insecure: true

    service:
      telemetry:
        logs:
          level: info
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp, logging]

EOF
kubectl apply -f otelcol-podinfo.yaml
  • Now deploy the podinfo application:

Deploy the podinfo example application
kubectl apply -k github.com/cilium/kustomize-bases/podinfo
  • Check the Deployments and Services:

Check the podinfo Deployments and Services
kubectl get -n podinfo deployments,services

Output:

podinfo Deployments and Services check output
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/podinfo-backend    2/2     2            2           81s
deployment.apps/podinfo-client     1/2     2            1           81s
deployment.apps/podinfo-frontend   1/2     2            1           81s

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/podinfo-backend    ClusterIP   10.106.77.139   <none>        9898/TCP,9999/TCP   81s
service/podinfo-client     ClusterIP   10.102.155.20   <none>        9898/TCP,9999/TCP   81s
service/podinfo-frontend   ClusterIP   10.99.183.236   <none>        9898/TCP,9999/TCP   81s
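In the output above, `podinfo-client` and `podinfo-frontend` report `1/2`, meaning only one of two replicas was ready when the listing was taken. It is also worth confirming that the OpenTelemetry operator injected the collector sidecar.

The operator injects it into pods whose template carries the `sidecar.opentelemetry.io/inject` annotation. The podinfo kustomize base is assumed, not verified here, to set that annotation. In the sketch below, the helper `missing_sidecar` is hypothetical. The sidecar container name `otc-container` is the operator's usual default rather than something stated in this guide.

```shell
#!/bin/sh
# missing_sidecar: reads "<pod> <container> [<container>...]" lines on
# stdin and prints each pod that has no container named otc-container
# (assumed default name of the operator-injected sidecar).
missing_sidecar() {
  awk 'NF > 0 {
         found = 0
         for (i = 2; i <= NF; i++) if ($i == "otc-container") found = 1
         if (!found) print $1
       }'
}

# Feed it one line per pod (name, then container names), e.g.:
# kubectl get pods -n podinfo \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[*].name}{"\n"}{end}' \
#   | missing_sidecar
```

An empty result means every listed pod has the sidecar.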
  • Send traffic to the application:
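A minimal sketch for generating this traffic, sending POST requests from the client pod to the frontend Service. Several details are assumptions rather than details taken from this guide: the helper name `send_load`, the container name `podinfod`, the availability of `curl` in the image, and the use of podinfo's `/echo` endpoint.

```shell
#!/bin/sh
# send_load [COUNT]: sends COUNT POST requests (default 10) from inside the
# podinfo-client pod to podinfo-frontend. Container name podinfod, curl in
# the image, and the /echo endpoint are assumptions.
send_load() {
  count="${1:-10}"
  i=0
  while [ "$i" -lt "$count" ]; do
    # -D - prints the response headers together with the response body
    kubectl -n podinfo exec deployment/podinfo-client -c podinfod -- \
      curl -s -D - -d "Hubble+OpenTelemetry=ROCKS" podinfo-frontend:9898/echo
    i=$((i + 1))
  done
}
```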

The terminal then prints trace information similar to the following:

traceparent: 00-ad4981da48a7957d2c3ec1f7f722ba87-964601c0a0298a04-01
[
  "Hubble+OpenTelemetry=ROCKS"
]

Here ad4981da48a7957d2c3ec1f7f722ba87 is the trace ID, the second dash-separated field of the W3C traceparent value.
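The `traceparent` value follows the W3C Trace Context format `version-traceid-parentid-flags`. The sketch below extracts the trace ID and builds a direct Jaeger UI link. Both function names are hypothetical, and the URL assumes the `kubectl port-forward` to port 16686 is still running.

```shell
#!/bin/sh
# trace_id_from_traceparent: prints the trace-id field of a W3C traceparent
# value; an optional leading "traceparent:" label is stripped first.
trace_id_from_traceparent() {
  id=$(printf '%s\n' "$1" | sed 's/^traceparent:[[:space:]]*//' | cut -d- -f2)
  # A valid trace-id is exactly 32 lowercase hex characters.
  case "$id" in
    ""|*[!0-9a-f]*) return 1 ;;
  esac
  [ "${#id}" -eq 32 ] || return 1
  printf '%s\n' "$id"
}

# jaeger_trace_url: builds a Jaeger UI trace link, assuming the UI is
# reachable on localhost:16686 through kubectl port-forward.
jaeger_trace_url() {
  id=$(trace_id_from_traceparent "$1") || return 1
  printf 'http://localhost:16686/trace/%s\n' "$id"
}
```

For example, `jaeger_trace_url 'traceparent: 00-ad4981da48a7957d2c3ec1f7f722ba87-964601c0a0298a04-01'` prints `http://localhost:16686/trace/ad4981da48a7957d2c3ec1f7f722ba87`.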

In the Jaeger UI, you can search for this trace ID to find the trace of the request:

[Figure: podinfo trace shown in the Jaeger UI]

References