Velero in Practice: Kubernetes Cluster Backup and Disaster Recovery in 3 Steps

Pain Points

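When a production Kubernetes cluster runs into trouble (an accidentally deleted Namespace, etcd corruption, a cross-region migration), having no backup leaves you fully exposed. A traditional etcd snapshot covers cluster state only and ignores PV data. Hand-written CronJobs that export YAML are costly to maintain, and the restore order is easy to get wrong.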

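Velero (formerly Heptio Ark) is an open-source Kubernetes backup, restore, and migration tool from VMware. It supports:
- Declarative backup of cluster resources (Deployment, ConfigMap, CRD, and so on)
- PersistentVolume data snapshots (through CSI or cloud-provider plugins)
- Scheduled backups, automatic TTL expiry, and cross-cluster restore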

The walkthrough below uses AWS EKS with S3 as the storage backend and goes from zero to a production-ready backup setup.

Solution Overview

┌─────────────┐       ┌───────────────┐       ┌─────────────────┐
│  EKS Cluster│──────▶│  Velero Server│──────▶│  S3 Bucket      │
│  (Resources)│       │  (namespace:  │       │  (backup store) │
│  + PV Data  │       │   velero)     │       │  + CSI Snapshot │
└─────────────┘       └───────────────┘       └─────────────────┘

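Core components:
- Velero Server: a controller running inside the cluster that watches Backup/Restore CRs
- BackupStorageLocation (BSL): the backup storage location, pointing at S3, GCS, or Azure Blob
- VolumeSnapshotLocation (VSL): the cloud-provider region configuration for PV snapshots
- velero CLI: the local command-line tool for creating backups, checking status, and triggering restores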

Hands-On Steps

Step 1: Install Velero

First create the S3 bucket and the IAM permissions:

# Create a dedicated S3 bucket
aws s3 mb s3://my-cluster-velero-backups --region ap-northeast-1

# Create the IAM policy (least privilege)
cat > velero-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-cluster-velero-backups/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-cluster-velero-backups"
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name VeleroBackupPolicy \
  --policy-document file://velero-policy.json

Install Velero with Helm (the recommended approach):

helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

# Create the credentials file
cat > credentials-velero << 'EOF'
[default]
aws_access_key_id=<YOUR_ACCESS_KEY>
aws_secret_access_key=<YOUR_SECRET_KEY>
EOF

helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set configuration.backupStorageLocation[0].name=default \
  --set configuration.backupStorageLocation[0].provider=aws \
  --set configuration.backupStorageLocation[0].bucket=my-cluster-velero-backups \
  --set configuration.backupStorageLocation[0].config.region=ap-northeast-1 \
  --set configuration.volumeSnapshotLocation[0].name=default \
  --set configuration.volumeSnapshotLocation[0].provider=aws \
  --set configuration.volumeSnapshotLocation[0].config.region=ap-northeast-1 \
  --set initContainers[0].name=velero-plugin-for-aws \
  --set initContainers[0].image=velero/velero-plugin-for-aws:v1.10.0 \
  --set initContainers[0].volumeMounts[0].mountPath=/target \
  --set initContainers[0].volumeMounts[0].name=plugins \
  --set credentials.useSecret=true \
  --set-file credentials.secretContents.cloud=credentials-velero

# Verify the installation
kubectl get pods -n velero
velero backup-location get

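Production tip: prefer IRSA (IAM Roles for Service Accounts) or EKS Pod Identity over static AK/SK, so that the Velero ServiceAccount gets its permissions from an IAM role instead of long-lived keys.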
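
As a rough sketch of the IRSA route: assuming an IAM role with VeleroBackupPolicy attached and a trust policy for the cluster's OIDC provider already exists, the static-key settings could be overridden with a values file like the one below. The role ARN and the file name are placeholders. The `serviceAccount.server.annotations` and `credentials.useSecret` keys are assumed to match your chart version and are worth checking against its values.yaml.

```shell
# Hypothetical role ARN; replace with the role that carries VeleroBackupPolicy.
VELERO_ROLE_ARN="arn:aws:iam::111122223333:role/velero-backup-role"

# Write a Helm values override that annotates Velero's ServiceAccount for IRSA
# and stops the chart from creating a static-credentials Secret.
cat > velero-irsa-values.yaml << EOF
serviceAccount:
  server:
    create: true
    name: velero
    annotations:
      eks.amazonaws.com/role-arn: ${VELERO_ROLE_ARN}
credentials:
  useSecret: false
EOF

# Apply on top of the existing release (not executed here):
#   helm upgrade velero vmware-tanzu/velero -n velero --reuse-values -f velero-irsa-values.yaml
```

With this in place the credentials-velero file is no longer needed and can be deleted from the workstation.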

Step 2: Configure Scheduled Backup Policies

# Full backup every day at 02:00, kept for 7 days
velero schedule create daily-full-backup \
  --schedule="0 2 * * *" \
  --ttl 168h \
  --include-namespaces='*' \
  --snapshot-volumes=true

# Hourly backup of the critical business Namespaces, kept for 48 hours
velero schedule create hourly-prod-backup \
  --schedule="0 * * * *" \
  --ttl 48h \
  --include-namespaces=production,payment \
  --snapshot-volumes=true \
  --include-resources=deployments,services,configmaps,secrets,persistentvolumeclaims

# Check the schedule status
velero schedule get
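
The same policy can also be kept declaratively. The sketch below writes a Schedule manifest that should be equivalent to the daily-full-backup command above, assuming Velero runs in the velero namespace. The file name is a placeholder.

```shell
# Write a Schedule manifest mirroring the daily-full-backup CLI command,
# so the policy can live in Git and be applied with kubectl.
cat > daily-full-backup.yaml << 'EOF'
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"      # every day at 02:00
  template:
    ttl: 168h0m0s            # keep each backup for 7 days
    includedNamespaces:
      - "*"
    snapshotVolumes: true
EOF

# Apply it (not executed here):
#   kubectl apply -f daily-full-backup.yaml
```

The cron expression is evaluated in the Velero server's time zone, which is typically UTC, so "02:00" may need shifting to match local off-peak hours.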

Control the backup scope precisely by Namespace or label:

# Back up only resources carrying a specific label
velero backup create tagged-backup \
  --selector app.kubernetes.io/part-of=core-platform \
  --snapshot-volumes=true

# Exclude Namespaces that do not need to be backed up
velero schedule create daily-excluding-system \
  --schedule="0 2 * * *" \
  --ttl 168h \
  --exclude-namespaces=kube-system,monitoring,velero

Step 3: Restore Operations

Scenario A: Restore an accidentally deleted Namespace

# List the available backups
velero backup get

# Restore the chosen Namespace
velero restore create --from-backup daily-full-backup-20260623020000 \
  --include-namespaces=production

# Monitor restore progress
velero restore describe <restore-name> --details
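
For scripted recoveries it can help to block until the restore reaches a terminal phase. The helper below is a minimal sketch, not part of Velero: `wait_for_restore` is a hypothetical name, and it assumes Velero is installed in the velero namespace and that kubectl points at the right cluster.

```shell
# Poll a Restore object until it reaches a terminal phase.
# Usage: wait_for_restore <restore-name> [timeout-seconds] [poll-interval-seconds]
# Returns 0 on Completed, 1 on a failure phase, 2 on timeout.
wait_for_restore() {
  local name="$1" timeout="${2:-600}" interval="${3:-10}" elapsed=0 phase
  while [ "$elapsed" -lt "$timeout" ]; do
    # Restore objects live in the namespace Velero was installed into.
    phase="$(kubectl -n velero get restores.velero.io "$name" \
      -o jsonpath='{.status.phase}' 2>/dev/null || true)"
    case "$phase" in
      Completed)
        echo "restore $name: Completed"
        return 0 ;;
      PartiallyFailed|Failed|FailedValidation)
        echo "restore $name: $phase" >&2
        return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "restore $name: timed out after ${timeout}s (last phase: ${phase:-unknown})" >&2
  return 2
}
```

A call such as `wait_for_restore <restore-name> 900` could then gate follow-up checks in a runbook script.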

Scenario B: Cross-cluster migration

# Install Velero on the target cluster, pointing at the same S3 bucket
# Then restore directly from the backup
velero restore create full-migration \
  --from-backup daily-full-backup-20260623020000

# Verify after the restore
kubectl get pods --all-namespaces
kubectl get pvc --all-namespaces
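
When two clusters share one bucket, a common precaution is to switch the target cluster's BackupStorageLocation to read-only while restoring, so it cannot create or delete backup objects by accident. The sketch below assumes the BSL is named default and lives in the velero namespace. `set_bsl_access_mode` is a hypothetical helper name.

```shell
# Toggle a BackupStorageLocation between ReadOnly and ReadWrite.
# Usage: set_bsl_access_mode <ReadOnly|ReadWrite> [bsl-name]
set_bsl_access_mode() {
  local mode="$1" bsl="${2:-default}"
  case "$mode" in
    ReadOnly|ReadWrite) ;;
    *)
      echo "access mode must be ReadOnly or ReadWrite" >&2
      return 1 ;;
  esac
  # Merge-patch only the accessMode field of the BSL spec.
  kubectl -n velero patch backupstoragelocation "$bsl" \
    --type merge \
    --patch "{\"spec\":{\"accessMode\":\"${mode}\"}}"
}

# On the target cluster (not executed here):
#   set_bsl_access_mode ReadOnly
#   velero restore create full-migration --from-backup daily-full-backup-20260623020000
#   set_bsl_access_mode ReadWrite
```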

Scenario C: Restore only specific resources

# Restore only Deployments and Services
velero restore create partial-restore \
  --from-backup daily-full-backup-20260623020000 \
  --include-namespaces=production \
  --include-resources=deployments,services

Pitfalls to Avoid

1. PV snapshot restore fails across AZs

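Problem: EBS volumes are AZ-scoped (the snapshots themselves are regional). A volume restored from a snapshot is created in one specific AZ, so a Pod scheduled onto a node in a different AZ cannot attach it, and the restore reports volume not found.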

Solution:

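# The VolumeSnapshotLocation only sets the region, so it cannot move a volume to another AZ
# Use Restic/Kopia file-level backup in place of volume snapshots instead
# (--reuse-values keeps the install-time settings; file-level backup needs the node agent)
helm upgrade velero vmware-tanzu/velero \
  --namespace velero \
  --reuse-values \
  --set deployNodeAgent=true \
  --set configuration.defaultVolumesToFsBackup=true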

The --default-volumes-to-fs-backup flag swaps cloud snapshots for file-system-level backup (Kopia), which supports cross-AZ and cross-Region restores.
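
If switching every volume to file-system backup is too broad, Velero also supports opting in individual pod volumes through the backup.velero.io/backup-volumes annotation. The helper below is a minimal sketch of that opt-in. `opt_in_fs_backup` is a hypothetical name, and it assumes the node agent is already deployed.

```shell
# Opt specific pod volumes in to file-system backup instead of enabling it globally.
# Usage: opt_in_fs_backup <namespace> <pod> <volume>[,<volume>...]
opt_in_fs_backup() {
  local ns="$1" pod="$2" volumes="$3"
  if [ -z "$ns" ] || [ -z "$pod" ] || [ -z "$volumes" ]; then
    echo "usage: opt_in_fs_backup <namespace> <pod> <volume>[,<volume>...]" >&2
    return 1
  fi
  # The value lists volume names from the pod spec, not PVC names.
  kubectl -n "$ns" annotate "pod/${pod}" \
    "backup.velero.io/backup-volumes=${volumes}" --overwrite
}
```

An annotation set on a running Pod is lost when the Pod is recreated, so for anything long-lived it belongs in the workload's pod template.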

2. Backup size balloons and storage costs get out of control

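Problem: snapshot-volumes creates an EBS snapshot for every PV in scope. Those snapshots are billed as EBS snapshot storage rather than as objects in the S3 bucket. With file-system backup (Kopia), the volume data itself lands in S3. Either way, large data volumes can drive storage costs up sharply.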

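Solution:
- Set a sensible TTL so expired backups are cleaned up automatically: --ttl 168h
- Exclude large PVs (such as Elasticsearch data disks) with the velero.io/exclude-from-backup=true label
- Enable an S3 Lifecycle Policy to tier older objects down to cold storage

# Label the PVCs that should not be backed up (Velero reads this key as a label, not an annotation)
kubectl label pvc elasticsearch-data-0 \
  velero.io/exclude-from-backup=true \
  -n logging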

3. CRD restore order causes resource creation to fail

Problem: during a restore, a CR that depends on a CRD is created before the CRD definition is ready and fails with no matches for kind.

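Solution: Velero restores resources in priority order by default (CRD → Namespace → everything else), but CRs can still be applied before a freshly restored CRD is fully established. A workaround: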

# Restore in two steps: CRDs first, then everything else
velero restore create step1-crds \
  --from-backup daily-full-backup-20260623020000 \
  --include-resources=customresourcedefinitions

# Once the CRDs are ready
velero restore create step2-all \
  --from-backup daily-full-backup-20260623020000 \
  --exclude-resources=customresourcedefinitions
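
The "once the CRDs are ready" step can be made explicit rather than eyeballed. The sketch below wraps kubectl wait. `wait_for_crds` is a hypothetical helper name, and it waits on every CRD in the cluster, which is broader than strictly necessary.

```shell
# Block until every CRD in the cluster reports the Established condition.
# Usage: wait_for_crds [timeout]   (kubectl duration syntax, e.g. 120s)
wait_for_crds() {
  local timeout="${1:-120s}"
  kubectl wait --for=condition=Established \
    customresourcedefinitions --all --timeout="$timeout"
}

# Between the two restores (not executed here):
#   velero restore create step1-crds --wait \
#     --from-backup daily-full-backup-20260623020000 \
#     --include-resources=customresourcedefinitions
#   wait_for_crds 180s
#   velero restore create step2-all \
#     --from-backup daily-full-backup-20260623020000 \
#     --exclude-resources=customresourcedefinitions
```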

Production Checklist

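Item	Recommended value
Backup frequency	Hourly for critical Namespaces, daily for full backups
TTL	48h for hourly backups, 7d (168h) for daily backups, 30d (720h) for weekly backups
PV backup method	CSI snapshots (same AZ) or Kopia (cross-AZ/Region)
Restore drills	Restore to staging at least once a month to verify
Alerting	Trigger a PagerDuty/DingTalk notification when a backup fails
Encryption	Enable SSE-KMS on S3 and use HTTPS in transit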
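
For the alerting row, one lightweight option is a periodic check that lists backups in a failure phase and hands the result to whatever notifier is in use. This is a minimal sketch assuming Velero runs in the velero namespace. `failed_backups` is a hypothetical helper, and the notification step is left out.

```shell
# Print the names of backups that ended in a failure phase.
# Returns 1 if any such backup exists, 0 otherwise.
failed_backups() {
  local failed
  # Emit "<name><TAB><phase>" per backup, then keep only the failure phases.
  failed="$(kubectl -n velero get backups.velero.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' |
    awk -F'\t' '$2 == "Failed" || $2 == "PartiallyFailed" || $2 == "FailedValidation" { print $1 }')"
  if [ -n "$failed" ]; then
    printf '%s\n' "$failed"
    return 1
  fi
  return 0
}
```

Run from cron or a CronJob, a non-zero exit status can trigger the PagerDuty or DingTalk webhook of your choice.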

Summary

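Velero is one of the most mature backup solutions in the Kubernetes ecosystem, and it takes 3 steps to roll out: install, configure schedules, verify restores. Key takeaways: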

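- Least privilege: bind the S3 and EC2 snapshot permissions through IRSA rather than using admin AK/SK
- Tiered backup policy: frequent backups for critical workloads, infrequent full backups as a safety net, and large data disks excluded as needed
- Regular drills: an unverified backup is as good as no backup, so run a DR drill that restores to staging once a month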