V2EX  ›  Kubernetes

Without HPA enabled, under what circumstances would a Deployment automatically have its replicas set to 0?

    rabbitz · 2022-05-09 15:55:46 +08:00 · 1485 views
    This is a topic created 908 days ago; the information in it may have changed since then.

    I recently ran into a problem: a Deployment running nexus3 gets its replicas set to 0 after running for a while. HPA is not enabled, restartPolicy is Always, the kube-apiserver audit log shows it was not a manual operation, and the nexus3 logs contain no errors.
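
    For anyone who wants to run the same check, a filter along these lines works against a JSON-format audit log (a sketch; the log path is an assumption, substitute your cluster's --audit-log-path):

    # Every write to this Deployment, including the scale subresource:
    jq -c 'select(.objectRef.resource == "deployments"
                  and .objectRef.name == "service-nexus3-deployment"
                  and (.verb == "update" or .verb == "patch"))
           | {time: .requestReceivedTimestamp,
              user: .user.username,
              verb: .verb,
              subresource: .objectRef.subresource}' \
      /var/log/kubernetes/audit/audit.log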

    deployment yaml:

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: service-nexus3-deployment
      namespace: service
      annotations:
        deployment.kubernetes.io/revision: '6'
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: service-nexus3
          environment: test
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: service-nexus3
            environment: test
          annotations:
            kubesphere.io/restartedAt: '2022-02-16T01:11:44.479Z'
        spec:
          volumes:
            - name: service-nexus3-volume
              persistentVolumeClaim:
                claimName: service-nexus3-pvc
            - name: docker-proxy
              configMap:
                name: docker-proxy
                defaultMode: 493
          containers:
            - name: nexus3
              # Alibaba Cloud image registry; the repository name has been removed
              image: 'registry.cn-hangzhou.aliyuncs.com/nexus3-latest'
              ports:
                - name: tcp8081
                  containerPort: 8081
                  protocol: TCP
              resources:
                limits:
                  cpu: '4'
                  memory: 8Gi
                requests:
                  cpu: 500m
                  memory: 1Gi
              volumeMounts:
                - name: service-nexus3-volume
                  mountPath: /data/server/nexus3/
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              imagePullPolicy: Always
            - name: docker-proxy
              # Alibaba Cloud image registry; the repository name has been removed
              image: 'registry.cn-hangzhou.aliyuncs.com/nginx-latest'
              ports:
                - name: tcp80
                  containerPort: 80
                  protocol: TCP
              resources:
                limits:
                  cpu: '2'
                  memory: 4Gi
                requests:
                  cpu: 500m
                  memory: 1Gi
              volumeMounts:
                - name: docker-proxy
                  mountPath: /usr/local/nginx/conf/vhosts/
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              imagePullPolicy: Always
          restartPolicy: Always
          terminationGracePeriodSeconds: 30
          dnsPolicy: ClusterFirst
          nodeSelector:
            disktype: raid1
          securityContext: {}
          imagePullSecrets:
            - name: registrysecret
          schedulerName: default-scheduler
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0
          maxSurge: 1
      revisionHistoryLimit: 10
      progressDeadlineSeconds: 600
    

    HPA:

    # kubectl get hpa -A
    No resources found
    

    deployment describe:

    ...
    ...
    Events:
      Type    Reason             Age                From                   Message
      ----    ------             ----               ----                   -------
      Normal  ScalingReplicaSet  34m (x2 over 38h)  deployment-controller  Scaled down replica set service-nexus3-deployment-57995fcd76 to 0
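
    (While an occurrence is still fresh, the same scale-down events can be pulled directly; a sketch, keeping in mind that events are garbage-collected after roughly an hour by default:)

    # Recent scaling events in the namespace, oldest first:
    kubectl get events -n service \
      --field-selector reason=ScalingReplicaSet \
      --sort-by=.lastTimestamp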
    

    kube-controller-manager logs:

    # kubectl logs kube-controller-manager-k8s-130 -n kube-system|grep nexus
    I0509 10:49:11.687356       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"service", Name:"service-nexus3-deployment", UID:"e0c4abba-bbe5-4c19-9853-de63ee571124", APIVersion:"apps/v1", ResourceVersion:"126342143", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set service-nexus3-deployment-57995fcd76 to 0
    I0509 10:49:11.701642       1 event.go:281] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"service", Name:"service-nexus3-deployment-57995fcd76", UID:"9f96fdf1-1e20-4c83-ad18-1b3640d52493", APIVersion:"apps/v1", ResourceVersion:"126342151", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: service-nexus3-deployment-57995fcd76-t6bhx
    

    kube-apiserver audit logs: (attached as a screenshot in the original post)

    nexus3 logs: (attached as a screenshot in the original post)

    This has happened several times now. I have gone through all the logs and am out of ideas, so I am asking here; I would appreciate any pointers.

    12 replies · last reply 2022-05-10 11:45:26 +08:00

    #1 · anonydmer · 2022-05-09 16:53:58 +08:00
    Check whether the service is unstable and the container keeps failing and restarting.

    #2 · rabbitz (OP) · 2022-05-09 17:10:33 +08:00
    RESTARTS was 0 the whole time before replicas went to 0.

    #3 · rabbitz (OP) · 2022-05-09 17:11:35 +08:00
    Sorry, the image above was the wrong one; the one below is correct.

    #4 · wubowen · 2022-05-09 17:38:14 +08:00
    I'm a bit skeptical that the audit-log screenshot proves this wasn't a manual operation. Even a manual scale would still end with the replicaset controller deleting the pod, right? Maybe search the audit log directly for operations tied to kubeconfig users and see whether anyone scaled it by hand.

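    (One concrete way to answer "who last set spec.replicas" is the object's managed fields, which record the client that last wrote each field; a sketch, assuming a cluster recent enough to support --show-managed-fields:)

    # The manager that last wrote spec.replicas appears in this list
    # (a kubectl invocation, a dashboard, and a controller each show
    # up under a different manager name):
    kubectl get deployment service-nexus3-deployment -n service \
      --show-managed-fields -o json \
      | jq '.metadata.managedFields[] | {manager, operation, time}'
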
    #5 · defunct9 · 2022-05-09 17:46:28 +08:00
    Open up SSH and let me take a look.

    #6 · basefas · 2022-05-09 17:48:48 +08:00
    Monitor this deployment's replicas, alert when the value changes, then look at the events.

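    (A crude stand-in until proper alerting is set up, as a sketch: watch the Deployment and timestamp every change, then check the log after the next scale-down; the output path is arbitrary:)

    # Append a timestamped line every time the Deployment object changes:
    kubectl get deployment service-nexus3-deployment -n service -w --no-headers \
      -o custom-columns=REPLICAS:.spec.replicas \
      | while read -r replicas; do
          echo "$(date -Is) replicas=$replicas"
        done >> /tmp/nexus3-replicas.log
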
    #7 · hwdef · 2022-05-09 17:53:01 +08:00
    Hello Kitty, huh? Not bad, kind of cute.

    #8 · rabbitz (OP) · 2022-05-10 10:25:48 +08:00
    @wubowen #4 True, you can't tell from that. I'll turn on fuller logging while monitoring.

    #9 · rabbitz (OP) · 2022-05-10 10:28:45 +08:00
    @basefas #6 I've now added collection of all the related log information to the monitoring.

    #10 · rabbitz (OP) · 2022-05-10 10:30:03 +08:00
    @defunct9 #5 Can't open SSH on a production workload.

    #11 · basefas · 2022-05-10 11:11:43 +08:00
    @rabbitz #9 If cluster events are already being collected and stored, you can look up the events from that time based on the details above.

    #12 · rabbitz (OP) · 2022-05-10 11:45:26 +08:00
    @basefas #11 There's a problem with the cluster event storage configuration, so I can't query historical data for now.