4 Common Causes of Kubernetes Pod Scheduling Failures: A Hands-On Troubleshooting Guide

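Is your Pod stuck in Pending? Is kubectl describe showing a wall of FailedScheduling events? Don't panic: most scheduling problems come down to the 4 causes below. This article walks you from symptom to fix.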

The pain: a Pod sits in Pending for half an hour while the business waits

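During a production scale-out, a Deployment went from 3 replicas to 10, and all 7 new Pods got stuck in Pending. kubectl get pods showed a sea of yellow, and the ops chat started @-mentioning you.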

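Real-world scenario: the team runs Java microservices on AWS EKS. Nodes are m5.xlarge (4 vCPU / 16 GiB), and each Pod requests 2 CPU / 4 GiB. A node's allocatable CPU is slightly below its 4 vCPU capacity, so only one 2-CPU Pod fits per node. Before the scale-out, 3 nodes were just enough; at 10 replicas the cluster was completely full.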
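To make the numbers in this scenario concrete, here is a back-of-the-envelope capacity check in Python. It is only a sketch: the allocatable values (3920m CPU, 15000 MiB memory) are assumptions for an m5.xlarge, not figures from this article, and the function name is made up for illustration. Read the real values from kubectl describe node on your own cluster.

```python
def pods_per_node(alloc_cpu_m, alloc_mem_mi, req_cpu_m, req_mem_mi):
    """How many identical Pods fit on one node, limited by the scarcer resource."""
    return min(alloc_cpu_m // req_cpu_m, alloc_mem_mi // req_mem_mi)


# Assumed allocatable values for one m5.xlarge (hypothetical, check your cluster)
ALLOC_CPU_M = 3920
ALLOC_MEM_MI = 15000

# Each Pod requests 2 CPU (2000m) and 4 GiB (4096 MiB), as in the scenario
per_node = pods_per_node(ALLOC_CPU_M, ALLOC_MEM_MI, 2000, 4096)
capacity = per_node * 3            # 3 nodes before the scale-out
pending = max(0, 10 - capacity)    # 10 desired replicas

print(f"pods per node: {per_node}, cluster capacity: {capacity}, pending: {pending}")
```

CPU is the bottleneck here, not memory. The sketch ignores DaemonSet and system Pods, which shrink the usable headroom further.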

The fix: 4 steps to locate the problem, then treat the cause

Step 1: Read the events to pick a direction

kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

A scheduling failure always leaves a trace in Events. Focus on the message of the FailedScheduling event. Common keywords:

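Keyword	Meaning
`Insufficient cpu` / `Insufficient memory`	Not enough resources
`node(s) had taint` (newer versions: `node(s) had untolerated taint`)	Taint not tolerated
`node(s) didn't match Pod's node affinity`	Node affinity not satisfied
`0/N nodes are available` + `unschedulable`	Node has been cordoned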
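If you triage many Pending Pods, the keyword lookup can be scripted. The sketch below is a rough illustration, not an official parser: the keyword list mirrors the table, the sample messages are made up, and the exact wording of scheduler messages varies between Kubernetes versions.

```python
# (keyword, cause) pairs; "taint" is matched loosely because the wording
# around it differs between Kubernetes versions.
CAUSES = [
    ("Insufficient cpu", "insufficient resources"),
    ("Insufficient memory", "insufficient resources"),
    ("taint", "taint not tolerated"),
    ("didn't match Pod's node affinity", "node affinity not satisfied"),
    ("unschedulable", "node cordoned"),
]


def classify(message):
    """Return the distinct likely causes found in a FailedScheduling message."""
    found = []
    for keyword, cause in CAUSES:
        if keyword in message and cause not in found:
            found.append(cause)
    return found


# Illustrative message, not copied from a real cluster
sample = "0/3 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu."
print(classify(sample))
```

A single message can name several causes at once, which is why the function returns a list rather than one label.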

Step 2: Insufficient resources, the most common culprit

Check the cluster's allocatable resources:

# Per node: allocated requests/limits as a share of Allocatable
kubectl describe nodes | grep -A 5 "Allocated resources"

# More readable: the kubectl-view-allocations plugin
kubectl view-allocations

Quick calculation:

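# Sum the CPU requests of all Pods, in millicores.
# "500m" is already millicores; "2" or "0.5" means cores and is multiplied by 1000.
kubectl get pods -A -o json | \
  jq '[.items[].spec.containers[].resources.requests.cpu // "0"
       | if endswith("m") then rtrimstr("m") | tonumber
         else tonumber * 1000 end] | add'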
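One thing to watch when summing CPU requests is the unit: Kubernetes accepts both cores ("2", "0.5") and millicores ("500m"), so every value has to be normalized before adding. Here is a minimal Python sketch of that normalization. The helper name and the sample values are made up, and it handles only these two notations.

```python
def cpu_to_millicores(quantity):
    """Convert a CPU quantity such as "500m", "2" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


# Hypothetical requests collected from several containers
requests = ["2", "500m", "0.5", "250m"]
total = sum(cpu_to_millicores(q) for q in requests)
print(f"total CPU requests: {total}m")
```

Stripping the "m" suffix without the multiplication would count "2" as 2 millicores instead of 2000 and badly understate the total.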

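Fixes:

Short term: raise the node group's maxSize so Cluster Autoscaler can add new nodes automatically
Long term: audit whether resource requests are inflated. Many teams request 2 CPU while actually using about 0.3 CPU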

# Before
resources:
  requests:
    cpu: "2"
    memory: "4Gi"

# After: based on observed P95 usage + 20% buffer
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"
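For the "P95 + 20% buffer" rule, here is one way the arithmetic could be done. Treat it as a sketch under assumptions: the usage samples are invented, the nearest-rank percentile is one of several common definitions, and rounding up to a 50m step is purely a readability choice.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def suggested_request(samples_m, buffer_pct=20, step=50):
    """P95 usage plus a buffer, rounded up to the next `step` millicores."""
    target = p95(samples_m) * (100 + buffer_pct) / 100
    return math.ceil(target / step) * step


# Hypothetical CPU usage samples in millicores (e.g. exported from your metrics system)
usage_m = [180, 220, 250, 300, 310, 290, 270, 400, 330, 260]
print(f"P95 = {p95(usage_m)}m, suggested request = {suggested_request(usage_m)}m")
```

In practice you would feed this days or weeks of samples rather than ten points, and size memory separately, since running short of memory gets a container OOM-killed rather than throttled.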

Step 3: Taint/Toleration mismatch

A node carries a taint, but the Pod has no matching toleration.

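# List node taints (keep the custom-columns value as one unbroken word)
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Common cases: dedicated GPU nodes, control-plane (master) nodes
# The node has the taint gpu=true:NoSchedule
# The Pod needs a toleration before it can be scheduled there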

Add the toleration to the Pod:

tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
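To check by hand whether a toleration really covers a taint, it helps to spell out the matching rules. The Python sketch below is a simplified model of them, not the scheduler's actual code: the function names are made up, and it ignores tolerationSeconds.

```python
def tolerates(toleration, taint):
    """True if one toleration matches one taint (simplified model)."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    if operator == "Exists":
        # An empty key combined with Exists matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(tolerations, taints):
    """A Pod passes if every NoSchedule/NoExecute taint is tolerated."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
gpu_toleration = {"key": "gpu", "operator": "Equal",
                  "value": "true", "effect": "NoSchedule"}

print(schedulable([], [gpu_taint]))                # Pod without tolerations
print(schedulable([gpu_toleration], [gpu_taint]))  # Pod with the toleration above
```

Keep in mind that a toleration only allows a Pod onto the tainted node. It does not steer the Pod there; for that you still need a nodeSelector or node affinity.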

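Step 4: NodeAffinity / PodAntiAffinity misconfigured

A wrong affinity rule is a well-hidden trap, especially the hard constraint requiredDuringSchedulingIgnoredDuringExecution: if no node satisfies it, the Pod stays Pending until a matching node appears.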

# Inspect the Pod's affinity configuration
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 20 affinity

Troubleshooting tip: temporarily change required to preferred. If the Pod then schedules, the affinity rule was too strict.
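Before relaxing a rule, you can also check offline which nodes a required node-affinity rule would accept. The sketch below models nodeSelectorTerms as "any term may match, and every expression inside a term must match". It is a simplified illustration: the Gt/Lt operators and matchFields are left out, and the node names and labels are hypothetical.

```python
def expr_matches(expr, labels):
    """Evaluate one matchExpressions entry against a node's labels."""
    key, operator = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not handled in this sketch: {operator}")


def node_matches(terms, labels):
    """Terms are ORed; expressions inside one term are ANDed."""
    return any(
        bool(term.get("matchExpressions"))
        and all(expr_matches(e, labels) for e in term["matchExpressions"])
        for term in terms
    )


# Hypothetical nodes and labels
nodes = {
    "node-a": {"topology.kubernetes.io/zone": "us-east-1b"},
    "node-b": {"topology.kubernetes.io/zone": "us-east-1c"},
}
terms = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-east-1a"]},
]}]

candidates = [name for name, labels in nodes.items() if node_matches(terms, labels)]
print(candidates)  # an empty list means the Pod cannot be placed anywhere
```

An empty candidate list for a required rule is exactly the situation in which the Pod stays Pending.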

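3 common pitfalls

Pitfall 1: Confusing requests and limits

requests are what the scheduler uses for placement; limits are the runtime ceiling. If you set only limits and no requests, Kubernetes defaults requests to the limits value, so the Pod reserves far more than it needs.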
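The defaulting behavior is easy to model. The Python sketch below shows what the scheduler effectively sees for a container spec. It is a simplified illustration with a made-up function name, and it ignores LimitRange defaults, which can also inject values.

```python
def effective_requests(resources):
    """Requests as the scheduler sees them: a missing request falls back to the limit."""
    limits = resources.get("limits", {})
    requests = dict(resources.get("requests", {}))
    for name, limit in limits.items():
        requests.setdefault(name, limit)
    return requests


# Only limits are set: the Pod is scheduled as if it requested the full limit
print(effective_requests({"limits": {"cpu": "1", "memory": "2Gi"}}))

# Explicit requests are kept; only the missing resource falls back to its limit
print(effective_requests({
    "requests": {"cpu": "500m"},
    "limits": {"cpu": "1", "memory": "2Gi"},
}))
```

The fallback is applied per resource, so setting a CPU request but leaving out the memory request still reserves the full memory limit.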

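Pitfall 2: A PDB blocks eviction and node scale-down gets stuck

With a PodDisruptionBudget set to minAvailable: 100%, Cluster Autoscaler cannot evict Pods when it tries to scale down, so the nodes keep idling and burning money.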

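# Don't do this
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <pdb-name>        # placeholder
spec:
  minAvailable: "100%"  # ← no Pod can ever be evicted
  selector:
    matchLabels:
      app: <app-label>    # placeholder

# Recommended
spec:
  minAvailable: "60%"   # allows roughly 40% of the Pods to be evicted at once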
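How many Pods a percentage-based minAvailable actually lets go depends on rounding: Kubernetes rounds the required count up. A short Python sketch of that arithmetic follows, with a made-up function name and illustrative replica counts.

```python
import math


def allowed_disruptions(healthy, expected, min_available_pct):
    """Pods that may be evicted right now under a percentage minAvailable."""
    # The required number of available Pods is rounded up.
    desired_healthy = math.ceil(expected * min_available_pct / 100)
    return max(0, healthy - desired_healthy)


print(allowed_disruptions(10, 10, 100))  # 100%: nothing can be evicted
print(allowed_disruptions(10, 10, 60))   # 60% of 10 replicas
print(allowed_disruptions(3, 3, 60))     # 60% of 3 replicas rounds up to 2 required
```

With small replica counts the rounding matters: at 3 replicas, 60% leaves room for only one eviction, and at 2 replicas it leaves none.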

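Pitfall 3: DaemonSets eat the resource budget

DaemonSets such as kube-proxy, fluent-bit, and node-exporter run on every node and count toward requests too. With 3 DaemonSets requesting 200m CPU each, an 8-node cluster loses 3 × 200m × 8 = 4.8 CPU, more than a whole m5.xlarge.

Recommendation: audit DaemonSet requests regularly. For most monitoring agents, 50m CPU is often enough.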
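The DaemonSet overhead is simple multiplication, but seeing it as a function makes it easy to rerun with your own numbers. This is a sketch with a hypothetical helper name, using the figures from this section.

```python
def daemonset_overhead_m(requests_m, node_count):
    """Total CPU (millicores) reserved cluster-wide by per-node DaemonSet Pods."""
    return sum(requests_m) * node_count


current = daemonset_overhead_m([200, 200, 200], 8)  # 3 DaemonSets at 200m each
trimmed = daemonset_overhead_m([50, 50, 50], 8)     # same DaemonSets at 50m each

print(f"current: {current}m, trimmed: {trimmed}m, saved: {current - trimmed}m")
```

Because the cost scales with node count, the same per-Pod request that looks negligible on 8 nodes becomes significant on 80.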

Summary

Problem	Diagnostic command	Fix
Insufficient resources	`kubectl describe nodes`	Tune the Autoscaler / right-size requests
Taint mismatch	`kubectl get nodes -o custom-columns=...TAINTS`	Add a toleration
Affinity too strict	`kubectl get pod -o yaml | grep affinity`	required → preferred
Node cordoned	`kubectl get nodes`	`kubectl uncordon <node>`

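Pod scheduling problems look scary, but the playbook is short. Build the habit: describe the Pod and read the events first, then check resources, and finally check affinity and taints. Those three moves resolve the vast majority of Pending Pods.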