kubernetes 应用程序自检和调试 - Kubernetes(K8S)教程

1 使用 kubectl describe pod 来获取有关 pod 的详细信息
2 示例：调试 Pending 状态的 Pod
3 示例：调试一个关闭（或者无法到达）的节点

一旦你的应用程序运行起来了，你将不可避免地需要对它进行调试。之前我们介绍过如何使用 kubectl get pod 来检索有关您的 pod 的简单状态信息。但还有很多方法可以获得有关应用程序的更多信息。

使用 kubectl describe pod 来获取有关 pod 的详细信息

在这个例子中，我们将使用 Deployment 来创建两个 pod，与前面的示例类似。

nginx-dep.yaml
apiVersion: apps/v1 kind: Deployment metadata: name: nginx-deployment spec: selector: matchLabels: app: nginx replicas: 2 template: metadata: labels: app: nginx spec: containers: - name: nginx image: nginx resources: limits: memory: "128Mi" cpu: "500m" ports: - containerPort: 80

nginx-dep.yaml Copy nginx-dep.yaml to clipboard

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
        ports:
        - containerPort: 80

使用如下命令来创建 deployment：

$ kubectl create -f https://k8s.io/docs/tasks/debug-application-cluster/nginx-dep.yaml
deployment "nginx-deployment" created

$ kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-1006230814-6winp   1/1       Running   0          11s
nginx-deployment-1006230814-fmgu3   1/1       Running   0          11s

我们可以使用 kubectl describe pod 获取每个 pod 的更多信息。例如：

$ kubectl describe pod nginx-deployment-1006230814-6winp
Name:		nginx-deployment-1006230814-6winp
Namespace:	default
Node:		kubernetes-node-wul5/10.240.0.9
Start Time:	Thu, 24 Mar 2016 01:39:49 +0000
Labels:		app=nginx,pod-template-hash=1006230814
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind"                           :"ReplicaSet","namespace":"default","name":"nginx-deployment-1956810328","uid":"14e607e7-8ba1-11e7-b5cb-fa16"                             ...
Status:		Running
IP:		10.244.0.6
Controllers:	ReplicaSet/nginx-deployment-1006230814
Containers:
  nginx:
    Container ID:	docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
    Image:		nginx
    Image ID:		docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
    Port:		80/TCP
    QoS Tier:
      cpu:	Guaranteed
      memory:	Guaranteed
    Limits:
      cpu:	500m
      memory:	128Mi
    Requests:
      memory:		128Mi
      cpu:		500m
    State:		Running
      Started:		Thu, 24 Mar 2016 01:39:51 +0000
    Ready:		True
    Restart Count:	0
    Environment:        <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5kdvl (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  default-token-4bcbi:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-4bcbi
    Optional:   false
QoS Class:      Guaranteed
Node-Selectors: <none>
Tolerations:    <none>
Events:
  FirstSeen	LastSeen	Count	From					SubobjectPath		Type		Reason		Message
  ---------	--------	-----	----					-------------		--------	------		-------
  54s		54s		1	{default-scheduler }						Normal		Scheduled	Successfully assigned nginx-deployment-1006230814-6winp to kubernetes-node-wul5
  54s		54s		1	{kubelet kubernetes-node-wul5}	spec.containers{nginx}	Normal		Pulling		pulling image "nginx"
  53s		53s		1	{kubelet kubernetes-node-wul5}	spec.containers{nginx}	Normal		Pulled		Successfully pulled image "nginx"
  53s		53s		1	{kubelet kubernetes-node-wul5}	spec.containers{nginx}	Normal		Created		Created container with docker id 90315cc9f513
  53s		53s		1	{kubelet kubernetes-node-wul5}	spec.containers{nginx}	Normal		Started		Started container with docker id 90315cc9f513

在这里您可以看到有关容器和 Pod 的配置信息（标签，资源需求等），以及有关容器和 Pod 的状态信息（状态，准备情况，重新启动次数，事件等）。

容器状态是 Waiting，Running 或 Terminated 之一。根据状态，可以获得更多信息 – 在这里您可以看到，对于处于运行状态的容器，系统会告诉您何时启动的容器。

Ready 告诉您容器是否通过了最后一次准备就绪探测。（在这种情况下，容器没有配置就绪探针；如果未配置准备就绪探针，则假定容器已准备就绪。）

重启数量会告诉您容器重新启动的次数; 此信息可用于检测重启策略为 ‘always’ 的容器的循环崩溃。

目前，与 Pod 相关的唯一条件是二进制 Ready 状态，这表明该 Pod 可以处理请求，并且应该添加到所有匹配服务的负载均衡池中。

最后，您会看到与您的 Pod 有关的最近事件日志。系统压缩多个相同的事件，只显示第一次和最后一次出现的时间以及出现的次数。”From” 表示记录事件的组件，”SubobjectPath” 告诉您哪个对象（例如容器内的容器）被引用，”Reason” 和 “Message” 告诉您发生了什么。

示例：调试 Pending 状态的 Pod

通过事件排查的一种常见情况是创建了不适合任何节点的 Pod。例如，Pod 可能会请求比任何节点上的空闲资源更多的资源，或者可能会指定一个不匹配任何节点的标签选择器。假设我们在上面的 Deployment 例子中创建 5 个 replicas（而不是 2 个），并请求 600 millicores 而不是 500 millicores，集群拥有 4 个节点，每个（虚拟）机器有 1 个 CPU。在这种情况下，其中一个 Pod 将无法调度。（请注意，由于在每个节点上运行了集群附加 pod，例如 fluentd 和 skydns 等，如果我们请求 1000 millicores，则没有任何一个 pod 可以成功调度。）

$ kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-1006230814-6winp   1/1       Running   0          7m
nginx-deployment-1006230814-fmgu3   1/1       Running   0          7m
nginx-deployment-1370807587-6ekbw   1/1       Running   0          1m
nginx-deployment-1370807587-fg172   0/1       Pending   0          1m
nginx-deployment-1370807587-fz9sd   0/1       Pending   0          1m

要找出 nginx-deployment-1370807587-fz9sd pod 未运行的原因，我们可以在待处理的 Pod 上使用 kubectl describe pod 并查看其事件：

$ kubectl describe pod nginx-deployment-1370807587-fz9sd
  Name:		nginx-deployment-1370807587-fz9sd
  Namespace:	default
  Node:		/
  Labels:		app=nginx,pod-template-hash=1370807587
  Status:		Pending
  IP:
  Controllers:	ReplicaSet/nginx-deployment-1370807587
  Containers:
    nginx:
      Image:	nginx
      Port:	80/TCP
      QoS Tier:
        memory:	Guaranteed
        cpu:	Guaranteed
      Limits:
        cpu:	1
        memory:	128Mi
      Requests:
        cpu:	1
        memory:	128Mi
      Environment Variables:
  Volumes:
    default-token-4bcbi:
      Type:	Secret (a volume populated by a Secret)
      SecretName:	default-token-4bcbi
  Events:
    FirstSeen	LastSeen	Count	From			        SubobjectPath	Type		Reason			    Message
    ---------	--------	-----	----			        -------------	--------	------			    -------
    1m		    48s		    7	    {default-scheduler }			        Warning		FailedScheduling	pod (nginx-deployment-1370807587-fz9sd) failed to fit in any node
  fit failure on node (kubernetes-node-6ta5): Node didn't have enough resource: CPU, requested: 1000, used: 1420, capacity: 2000
  fit failure on node (kubernetes-node-wul5): Node didn't have enough resource: CPU, requested: 1000, used: 1100, capacity: 2000

在这里，您可以看到 scheduler 生成的事件，表明由于 FailedScheduling（可能还有其他原因），Pod 无法调度。该消息告诉我们没有任何节点能够满足 Pod 的需求。

要解决这种情况，可以使用 kubectl scale 来更新您的部署以指定 4 个或更少的 replicas。（或者您可以让一个Pod 保持 pending，这是无害的。）

在 etcd 中存储了类似于 kubectl describe pod 结尾处看到的事件，并提供有关集群中正在发生的事情的高级信息。您可以使用如下命令列出所有事件：

kubectl get events

但是您需要记住事件是具有命名空间的。这意味着如果您对某些命名空间对象的事件感兴趣（例如，命名空间 my-namespace 中的 Pod 发生了什么），则需要明确地为命令提供一个命名空间：

kubectl get events --namespace=my-namespace

要查看来自所有命名空间的事件，可以使用 --all-namespaces 参数。

除 kubectl describe pod 之外，另一种获得关于 pod 额外信息的方法（超出了 kubectl get pod 提供的内容）是将 -o yaml 输出格式标志传递给 kubectl get pod。这会给你 YAML 格式的信息，甚至比 kubectl describe pod 更多的信息 – 基本上是系统拥有的 Pod 的所有信息。在这里，您将看到类似注解（这是没有标签限制的键值元数据，给 Kubernetes 系统组件内部使用）、重新启动策略、端口和卷。

$ kubectl get pod nginx-deployment-1006230814-6winp -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1006230814","uid":"4c84c175-f161-11e5-9a78-42010af00005","apiVersion":"extensions","resourceVersion":"133434"}}
  creationTimestamp: 2016-03-24T01:39:50Z
  generateName: nginx-deployment-1006230814-
  labels:
    app: nginx
    pod-template-hash: "1006230814"
  name: nginx-deployment-1006230814-6winp
  namespace: default
  resourceVersion: "133447"
  selfLink: /api/v1/namespaces/default/pods/nginx-deployment-1006230814-6winp
  uid: 4c879808-f161-11e5-9a78-42010af00005
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: 500m
        memory: 128Mi
      requests:
        cpu: 500m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-4bcbi
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: kubernetes-node-wul5
  restartPolicy: Always
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: default-token-4bcbi
    secret:
      secretName: default-token-4bcbi
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2016-03-24T01:39:51Z
    status: "True"
    type: Ready
  containerStatuses:
  - containerID: docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
    image: nginx
    imageID: docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
    lastState: {}
    name: nginx
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2016-03-24T01:39:51Z
  hostIP: 10.240.0.9
  phase: Running
  podIP: 10.244.0.6
  startTime: 2016-03-24T01:39:49Z

示例：调试一个关闭（或者无法到达）的节点

有时，在调试时，查看节点的状态可能很有用 – 例如，您已经注意到节点上运行的 Pod 的奇怪行为，或想查明 Pod 不调度到节点上的原因。与 Pod 一样，可以使用 kubectl describe node 和 kubectl get node -o yaml 来检索有关节点的详细信息。例如，如果某个节点关闭（从网络断开连接，或 kubelet 死亡并不会重新启动等），您将看到以下内容。注意显示节点为 NotReady 的事件，并且还注意到 Pod 不再运行（它们在 NotReady 状态五分钟后被驱逐）。

$ kubectl get nodes
NAME                     STATUS        AGE     VERSION
kubernetes-node-861h     NotReady      1h      v1.6.0+fff5156
kubernetes-node-bols     Ready         1h      v1.6.0+fff5156
kubernetes-node-st6x     Ready         1h      v1.6.0+fff5156
kubernetes-node-unaj     Ready         1h      v1.6.0+fff5156

$ kubectl describe node kubernetes-node-861h
Name:			kubernetes-node-861h
Role
Labels:		 beta.kubernetes.io/arch=amd64
           beta.kubernetes.io/os=linux
           kubernetes.io/hostname=kubernetes-node-861h
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:	Mon, 04 Sep 2017 17:13:23 +0800
Phase:
Conditions:
  Type		Status		LastHeartbeatTime			LastTransitionTime			Reason					Message
  ----    ------    -----------------     ------------------      ------          -------
  OutOfDisk             Unknown         Fri, 08 Sep 2017 16:04:28 +0800         Fri, 08 Sep 2017 16:20:58 +0800         NodeStatusUnknown       Kubelet stopped posting node status.
  MemoryPressure        Unknown         Fri, 08 Sep 2017 16:04:28 +0800         Fri, 08 Sep 2017 16:20:58 +0800         NodeStatusUnknown       Kubelet stopped posting node status.
  DiskPressure          Unknown         Fri, 08 Sep 2017 16:04:28 +0800         Fri, 08 Sep 2017 16:20:58 +0800         NodeStatusUnknown       Kubelet stopped posting node status.
  Ready                 Unknown         Fri, 08 Sep 2017 16:04:28 +0800         Fri, 08 Sep 2017 16:20:58 +0800         NodeStatusUnknown       Kubelet stopped posting node status.
Addresses:	10.240.115.55,104.197.0.26
Capacity:
 cpu:           2
 hugePages:     0
 memory:        4046788Ki
 pods:          110
Allocatable:
 cpu:           1500m
 hugePages:     0
 memory:        1479263Ki
 pods:          110
System Info:
 Machine ID:                    8e025a21a4254e11b028584d9d8b12c4
 System UUID:                   349075D1-D169-4F25-9F2A-E886850C47E3
 Boot ID:                       5cd18b37-c5bd-4658-94e0-e436d3f110e0
 Kernel Version:                4.4.0-31-generic
 OS Image:                      Debian GNU/Linux 8 (jessie)
 Operating System:              linux
 Architecture:                  amd64
 Container Runtime Version:     docker://1.12.5
 Kubelet Version:               v1.6.9+a3d1dfa6f4335
 Kube-Proxy Version:            v1.6.9+a3d1dfa6f4335
ExternalID:                     15233045891481496305
Non-terminated Pods:            (9 in total)
  Namespace                     Name                                            CPU Requests    CPU Limits      Memory Requests Memory Limits
  ---------                     ----                                            ------------    ----------      --------------- -------------
......
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits      Memory Requests         Memory Limits
  ------------  ----------      ---------------         -------------
  900m (60%)    2200m (146%)    1009286400 (66%)        5681286400 (375%)
Events:         <none>

$ kubectl get node kubernetes-node-861h -o yaml
apiVersion: v1
kind: Node
metadata:
  creationTimestamp: 2015-07-10T21:32:29Z
  labels:
    kubernetes.io/hostname: kubernetes-node-861h
  name: kubernetes-node-861h
  resourceVersion: "757"
  selfLink: /api/v1/nodes/kubernetes-node-861h
  uid: 2a69374e-274b-11e5-a234-42010af0d969
spec:
  externalID: "15233045891481496305"
  podCIDR: 10.244.0.0/24
  providerID: gce://striped-torus-760/us-central1-b/kubernetes-node-861h
status:
  addresses:
  - address: 10.240.115.55
    type: InternalIP
  - address: 104.197.0.26
    type: ExternalIP
  capacity:
    cpu: "1"
    memory: 3800808Ki
    pods: "100"
  conditions:
  - lastHeartbeatTime: 2015-07-10T21:34:32Z
    lastTransitionTime: 2015-07-10T21:35:15Z
    reason: Kubelet stopped posting node status.
    status: Unknown
    type: Ready
  nodeInfo:
    bootID: 4e316776-b40d-4f78-a4ea-ab0d73390897
    containerRuntimeVersion: docker://Unknown
    kernelVersion: 3.16.0-0.bpo.4-amd64
    kubeProxyVersion: v0.21.1-185-gffc5a86098dc01
    kubeletVersion: v0.21.1-185-gffc5a86098dc01
    machineID: ""
    osImage: Debian GNU/Linux 7 (wheezy)
    systemUUID: ABE5F6B4-D44B-108B-C46A-24CCE16C8B6E

译者：tianshapjq / 原文链接