Argo Rollouts in Practice: Canary and Blue-Green Releases on Kubernetes in 3 Steps

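Pain Points

A native Kubernetes Deployment supports only two strategies, RollingUpdate and Recreate. For core production services, RollingUpdate has obvious shortcomings:

No control over traffic share: as soon as new-version Pods are ready they receive traffic in proportion to their Pod count, with no gradual validation phase
Rollback is not fast enough: after a problem is found, kubectl rollout undo has to wait for another rolling update to replace the Pods, and during a P0 incident every second costs money
No automated verdict: there is no way to decide from Prometheus metrics whether the new version is healthy, so someone has to watch the dashboards

What production needs is this: send 5% of traffic to the new version, observe for 5 minutes, then step up to 50% and 100% if the metrics look good, and roll back to the old version automatically within 10 seconds if they do not. That is the problem Argo Rollouts solves.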

Solution

Argo Rollouts is a sub-project of the Argo project that provides Kubernetes-native progressive delivery. Core capabilities:

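Capability	Description
Canary	Shifts traffic by percentage; supports multiple steps, pauses, and automatic promotion
Blue-Green	Runs two versions side by side and switches over in one step after verification
Analysis	Integrates with Prometheus, Datadog, and New Relic; rolls back automatically when metrics miss their targets
Traffic management	Integrates with Istio, Nginx Ingress, ALB, and Traefik for precise traffic splitting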
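
The steps in this article focus on canary. For comparison, a blue-green Rollout might look roughly like the following. This is a minimal sketch rather than part of the original setup: the Service names web-api-active and web-api-preview are hypothetical, and both Services would have to exist in the same namespace.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
      - name: web-api
        image: registry.example.com/web-api:v2.1.0
        ports:
        - containerPort: 8080
  strategy:
    blueGreen:
      activeService: web-api-active     # hypothetical Service carrying production traffic
      previewService: web-api-preview   # hypothetical Service pointing at the new version before cutover
      autoPromotionEnabled: false       # wait for a manual promote instead of switching automatically
      scaleDownDelaySeconds: 30         # keep the old ReplicaSet for 30s after the switch
```

With autoPromotionEnabled: false, the new version stays behind the preview Service until someone runs kubectl argo rollouts promote web-api, at which point the active Service is switched over.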

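It complements Argo CD: Argo CD handles GitOps synchronization and Argo Rollouts handles the release strategy. Using the two together is a recommended production practice.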

Hands-On Steps

Step 1: Install Argo Rollouts

# Install the controller
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install the kubectl plugin (handy for watching rollout status)
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts

# Verify the installation
kubectl argo rollouts version

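Step 2: Define the Canary Rollout Resource

Replace the existing Deployment with a Rollout resource. The spec is largely compatible, so changing apiVersion, kind, and strategy is usually enough: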

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-api
spec:
  replicas: 10
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
      - name: web-api
        image: registry.example.com/web-api:v2.1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
  strategy:
    canary:
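      # Canary step definitions
      steps:
      - setWeight: 5        # send 5% of traffic to the new version
      - pause: {duration: 5m}  # observe for 5 minutes
      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 80
      - pause: {duration: 5m}
      # After the last step the rollout is promoted to 100% automatically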
      canaryService: web-api-canary
      stableService: web-api-stable
      trafficRouting:
        nginx:
          stableIngress: web-api-ingress
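
If deleting and recreating the workload is undesirable, a Rollout can also point at an existing Deployment instead of embedding the Pod template. The following is a minimal sketch, assuming a Deployment named web-api already exists in the same namespace; the step list is shortened for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-api
  # Reuse the Pod template of an existing Deployment (assumed to be named web-api)
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  strategy:
    canary:
      canaryService: web-api-canary
      stableService: web-api-stable
      trafficRouting:
        nginx:
          stableIngress: web-api-ingress
      steps:
      - setWeight: 5
      - pause: {duration: 5m}
```

In this mode the Deployment itself is typically scaled down to zero once the Rollout's Pods are serving, so that the two controllers do not run duplicate Pods.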

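Supporting Services. The Ingress referenced by stableIngress (web-api-ingress) must also exist and use web-api-stable as its backend: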

---
apiVersion: v1
kind: Service
metadata:
  name: web-api-stable
spec:
  selector:
    app: web-api
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web-api-canary
spec:
  selector:
    app: web-api
  ports:
  - port: 80
    targetPort: 8080
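
The Rollout refers to an Ingress named web-api-ingress, which is not shown above. A minimal sketch of what it could look like, assuming the ingress class is called nginx and using a placeholder hostname:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-api-ingress            # must match trafficRouting.nginx.stableIngress
spec:
  ingressClassName: nginx          # assumption: class name of the Nginx Ingress Controller
  rules:
  - host: web-api.example.com      # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-api-stable   # must be the Rollout's stableService
            port:
              number: 80
```

The backend has to be the stable Service; the controller checks for it when validating the Rollout's Nginx traffic routing.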

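Step 3: Configure Automated Analysis (AnalysisTemplate)

This is the killer feature of Argo Rollouts. It uses Prometheus metrics to decide automatically whether to keep promoting or to roll back: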

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
spec:
  metrics:
  - name: success-rate
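    # Take a measurement every 60 seconds
    interval: 60s
    # Stop after 5 measurements; with interval set and no count, the metric runs
    # indefinitely and an inline analysis step would never complete
    count: 5
    # The metric fails once more than 3 measurements have failed (they need not be consecutive)
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="web-api",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{app="web-api"}[5m]))
    # Success rate must be >= 95%
    successCondition: result[0] >= 0.95
  - name: p99-latency
    interval: 60s
    count: 5
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{app="web-api"}[5m])) by (le)
          )
    # P99 latency must be < 500 ms (the query returns seconds)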
    successCondition: result[0] < 0.5
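
The queries above select on app="web-api", which matches stable and canary Pods alike. One way to scope a check to the canary is to pass an argument into the template. The sketch below assumes the metrics carry a service label whose value is the Kubernetes Service name; that label and the template name canary-success-rate are assumptions for illustration and depend on how the metrics are scraped.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-success-rate        # hypothetical template name
spec:
  args:
  - name: service-name             # value is supplied by the Rollout's analysis step
  metrics:
  - name: success-rate
    interval: 60s
    count: 5
    failureLimit: 3
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        # Assumption: a "service" label distinguishes canary from stable traffic
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

A Rollout step that passes service-name: web-api-canary would then evaluate only the canary's requests.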

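Reference the analysis from the Rollout as shown next. The args entry only matters if the AnalysisTemplate declares a matching service-name argument under spec.args and references it as {{args.service-name}} in its queries. The template above does neither, so its queries cover all web-api Pods rather than only the canary.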

  strategy:
    canary:
      steps:
      - setWeight: 5
      - analysis:
          templates:
          - templateName: success-rate-check
          args:
          - name: service-name
            value: web-api-canary
      - setWeight: 50
      - pause: {duration: 10m}
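
An inline analysis step holds the rollout at that step until the AnalysisRun finishes. Argo Rollouts can also run the analysis in the background while the steps proceed, aborting the rollout if it fails. A minimal sketch reusing the names above, with a shortened step list for illustration:

```yaml
  strategy:
    canary:
      # Background analysis: runs alongside the steps instead of as a step
      analysis:
        templates:
        - templateName: success-rate-check
        startingStep: 1        # start once step index 1 is reached, i.e. after the 5% weight is set
      steps:
      - setWeight: 5
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
```

The fragment sits under spec, like the one before it. Whether inline or background analysis fits better depends on whether the check should gate one specific step or guard the whole rollout.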

Release and observe:

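# Update the image to trigger a canary release
kubectl argo rollouts set image web-api web-api=registry.example.com/web-api:v2.2.0

# Watch rollout progress live in the terminal
kubectl argo rollouts get rollout web-api --watch

# Promote manually (needed when a step is an indefinite pause: {})
kubectl argo rollouts promote web-api

# Emergency rollback: abort shifts traffic back to the stable version.
# The Rollout then stays Degraded until the spec is reverted or the update is retried.
kubectl argo rollouts abort web-api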

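Pitfalls

Pitfall 1: Traffic splitting with Nginx Ingress depends on canary annotations

When Nginx Ingress handles traffic routing, the split is driven by canary annotations on an Ingress. Argo Rollouts creates a separate canary Ingress based on the one named in stableIngress and manages these annotations on it, so do not add them to the stable Ingress by hand. If traffic is not split as expected, check that the generated canary Ingress exists and carries annotations like these:

metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "0"  # managed dynamically by the Rollouts controller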

Also confirm that the Nginx Ingress Controller version is ≥ 0.48.0; the canary feature in earlier versions has bugs.

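Pitfall 2: The Analysis query window must cover the traffic warm-up period

With a PromQL window of [5m], make sure the canary Pods have already served enough requests before the result is trusted. A pause of at least twice the query window (10 minutes here) is recommended; otherwise too few samples can lead to a wrong verdict.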
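
Besides lengthening the pause, the first measurement itself can be delayed. The sketch below adds initialDelay to the success-rate metric so that the first query runs only after a full 5-minute window of canary traffic. The 5m and count values are assumptions to tune per service.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
spec:
  metrics:
  - name: success-rate
    initialDelay: 5m       # assumption: wait one query window before the first measurement
    interval: 60s
    count: 5               # assumption: five measurements, then the metric completes
    failureLimit: 3
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{app="web-api",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{app="web-api"}[5m]))
```

This keeps early, low-volume samples out of the verdict without stretching every pause in the step list.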

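Pitfall 3: Do not set revisionHistoryLimit too low

A revisionHistoryLimit of 3 to 5 is recommended for a Rollout. With 1 or 0, the ReplicaSet of an earlier revision may already have been deleted when a rollback is needed, and the rollback fails. kubectl argo rollouts undo likewise depends on the historical revision still existing.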

Summary

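Dimension	Native Deployment	Argo Rollouts
Traffic control	None (proportional to Pod count)	Exact percentages
Automatic rollback	Not supported	Decided automatically by Analysis
Release pace	Coarse (maxSurge / maxUnavailable only)	Custom steps and pauses
Observability	Requires extra development	Built-in terminal view and Dashboard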

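Adoption recommendations:
1. Pilot on a non-critical service first to get familiar with the Rollout resource syntax
2. In the first week, promote manually with plain pause steps and verify that traffic splitting is correct
3. In the second week, add an AnalysisTemplate for fully automated progressive delivery
4. Pair it with Argo CD and bring the Rollout YAML under GitOps management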
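
For the manual first week, the timed pauses can be replaced with indefinite ones. A sketch of the strategy block under that assumption, reusing the Service and Ingress names from Step 2 (the weights are illustrative):

```yaml
  strategy:
    canary:
      canaryService: web-api-canary
      stableService: web-api-stable
      trafficRouting:
        nginx:
          stableIngress: web-api-ingress
      steps:
      - setWeight: 5
      - pause: {}          # no duration: waits until promoted manually
      - setWeight: 50
      - pause: {}
```

Each pause: {} holds the rollout until kubectl argo rollouts promote web-api is run, which leaves time to verify the traffic split at every weight.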

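Canary releases are not a silver bullet, but for API services, web applications, and other workloads with clear SLIs, Argo Rollouts is one of the most mature and lightweight progressive delivery options in the Kubernetes ecosystem today.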