2022-04-27 study log: high IO on a K8S cluster took services down — troubleshooting record



Time to celebrate: today I finally solved a problem that had been nagging me for over a month!
[Background]: A K8S cluster deployed via minikube on a 2-core / 2 GB Tencent Cloud S5 server, running two deployments (nodejs and grafana) plus one DaemonSet (filebeat).
[Symptoms]:
1. At first I noticed grafana was frequently unreachable, and the master node was NotReady.
2. docker ps showed the apiserver and scheduler containers restarting frequently (roughly every 6 minutes, without affecting the services).
3. docker ps showed the proxy container had restart records too (the root cause of the avalanche: once proxy crashed, the master went NotReady).
4. IO was constantly maxed out, sometimes so badly that I could not even log in to the server; after a reboot things would stay good for about an hour, and then the whole cycle repeated.
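For the record, the symptoms above were confirmed with checks along these lines. The live commands need the cluster and sysstat, so the runnable part below parses a captured iostat sample instead; the device name and numbers are illustrative, not from the original notes.

```shell
# Live checks used to observe the symptoms (need the cluster / sysstat installed):
#   kubectl get nodes               # master flapping to NotReady
#   docker ps -a | grep -i exited   # restart history of the control-plane containers
#   iostat -x 1 3                   # %util near 100 means the disk is saturated
# Offline stand-in: pull the %util column out of a captured `iostat -x` sample.
sample='Device            r/s     w/s  %util
vda             12.00  340.00  99.80'
util=$(printf '%s\n' "$sample" | awk 'NR==2 {print $NF}')
echo "disk utilization: ${util}%"
```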
[Abnormal logs, summarized]:
The logs are all very long, so I have only kept the representative, non-duplicate errors.
apiserver: at first I only looked at the pod level (kubectl logs on the apiserver pod) and forgot to look at the container level.

http: TLS handshake error from 172.17.0.36:37502: read tcp 172.17.0.36:8443->172.17.0.36:37502: read: connection reset by peer
{"log":"I0423 21:50:56. 1 log.go:172] http: TLS handshake error from 172.17.0.36:37756: EOF\n","stream":"stderr","time":"2022-04-23T21:50:56.Z"}
{"log":"E0423 21:50:56. 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:\"context canceled\"}\n","stream":"stderr","time":"2022-04-23T21:50:56.Z"}
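Reading those container-level logs means going through docker directly, since kubectl itself times out when the apiserver is flapping. A sketch; the offline part extracts the container ID from captured docker ps output (the IDs match the listing later in this post):

```shell
# kubectl logs only covers the pod level (and dies along with the apiserver);
# the container-level stderr lives in docker:
#   docker ps -a | grep kube-apiserver      # find the latest (possibly exited) container
#   docker logs --tail 100 <container-id>   # read its stderr directly
# Offline stand-in: pick the apiserver container ID out of captured output.
sample='395c8890dd51  "kube-apiserver --ad..."  19 minutes ago  Exited (0)
7fbee5fd1757  "kube-scheduler --au..."  20 minutes ago  Exited (255)'
cid=$(printf '%s\n' "$sample" | grep kube-apiserver | awk '{print $1}')
echo "apiserver container: $cid"
```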


scheduler: same story as the apiserver. These two containers restarted extremely frequently; somehow I ignored this at first and simply assumed they were flaky.

E0423 22:54:15. 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: Get https://control-plane.minikube.internal:8443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=10s: context deadline exceeded
I0423 22:54:17. 1 leaderelection.go:277] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
F0423 22:54:17. 1 server.go:244] leaderelection lost

controller:

W0423 22:52:23. 1 garbagecollector.go:644] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
E0423 22:52:41. 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: Get https://control-plane.minikube.internal:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded
I0423 22:52:43. 1 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F0423 22:52:43. 1 controllermanager.go:279] leaderelection lost

etcd: this is where I got misled. I kept assuming the "range request took too long" messages meant etcd's log writes were more than the disk could handle,
and I fell into a big trap here: I kept trying to optimize etcd, and even downloaded etcd-client.
ETCD optimization measures
(A single node does not actually need Raft, but I pushed ahead with it for the sake of learning.)
I optimized along three lines:
1. Raise the disk IO priority.
2. Snapshot more frequently, to reduce log accumulation and speed up compaction.
3. Network tuning, which I will skip here.
Well, it all turned out to be useless in the end; the problem had nothing to do with etcd tuning.
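Concretely, measures 1 and 2 come down to commands like these. A sketch only (and, as just said, none of it addressed the real cause): the commands are shown in comments since they need a live etcd, and the PID below is a placeholder.

```shell
# 1) Best-effort IO class, highest priority, for the running etcd process:
#      ionice -c2 -n0 -p "$(pgrep -x etcd)"
# 2) Snapshot every 10000 applied entries instead of etcd's default 100000,
#    so the WAL gets truncated and compacted sooner:
#      etcd --snapshot-count=10000 ...
ETCD_PID=12345   # placeholder; a real run would use: pgrep -x etcd
cmd="ionice -c2 -n0 -p $ETCD_PID"
echo "$cmd"
echo "etcd --snapshot-count=10000"
```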



{"log":"2022-04-23 21:47:38. I | embed: rejected connection from \"127.0.0.1:44528\" (error \"read tcp 127.0.0.1:2379->127.0.0.1:44528: read: connection reset by peer\", ServerName \"\")\n","stream":"stderr","time":"2022-04-23T21:47:38.Z"}
{"log":"2022-04-23 21:47:47. W | etcdserver: read-only range request \"key:\\\"/registry/services/endpoints/kube-system/kube-scheduler\\\" \" with result \"range_response_count:1 size:590\" took too long (1.0s) to execute\n","stream":"stderr","time":"2022-04-23T21:47:47.Z"}

Judging the timeline by the containers' exit and creation times reveals some of the dependencies between the failures:
the earlier a replacement container was created, the earlier that component crashed in the previous round.
From this, setting proxy aside, the crash order was: controller > scheduler > apiserver.

[root@VM-0-36-centos ~]# docker ps -a | grep xit
1bf9080b4c48  "/dashboard --insecu…"   15 minutes ago  Exited (2) 15 minutes ago    k8s_kubernetes-dashboard_kubernetes-dashboard-696dbcc666-42fns_kubernetes-dashboard_2568a5cd-72d1-433b-b5b2-27885d2d943e_42
395c8890dd51  "kube-apiserver --ad…"   19 minutes ago  Exited (0) 15 minutes ago    k8s_kube-apiserver_kube-apiserver-vm-0-36-centos_kube-system_e83e2dbe21a35b9d31ad_91
7fbee5fd1757  "kube-scheduler --au…"   20 minutes ago  Exited (255) 15 minutes ago  k8s_kube-scheduler_kube-scheduler-vm-0-36-centos_kube-system_c63aea358d14eb11f27c64756f_240
0e7d81be25d9  "kube-controller-man…"   27 minutes ago  Exited (255) 15 minutes ago  k8s_kube-controller-manager_kube-controller-manager-vm-0-36-centos_kube-system_0d5c3746cb0a798a6fc95c8dab3bff0b_245
7aee6e        "/usr/bin/dumb-init …"   3 hours ago     Exited (137) 2 hours ago     k8s_nginx-ingress-controller_nginx-ingress-controller-6d746cd945-f67xn_ingress-nginx_0c01ad47-3f93-4d82-892e-e87b6a361db5_0
28bb20e1ce79  "/coredns -conf /etc…"   3 hours ago     Exited (0) 20 minutes ago    k8s_coredns_coredns-c-4ct42_kube-system_c389cc98-3f18-4351-8e5e-346b374a47a2_54
1e61dd0aef3d  "/coredns -conf /etc…"   3 hours ago     Exited (0) 20 minutes ago    k8s_coredns_coredns-c-2sfml_kube-system_c3870b1b-66cb-4179-808e-6956ecc92ebe_51
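The same ordering can be pulled out mechanically instead of by eyeballing: the --filter/--format flags shown in comments are standard docker options, and the runnable part sorts the "minutes ago" ages from the listing above.

```shell
# Live: list only exited containers, with creation time and name:
#   docker ps -a --filter status=exited --format '{{.CreatedAt}}\t{{.Names}}'
# Offline stand-in: sort the ages from the listing above. The largest
# "minutes ago" was created earliest, i.e. that component crashed first.
ages='15 kubernetes-dashboard
19 kube-apiserver
20 kube-scheduler
27 kube-controller-manager'
oldest=$(printf '%s\n' "$ages" | sort -rn | awk 'NR==1 {print $2}')
echo "crashed first: $oldest"
```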

The most important piece was the proxy container's log. I did not save it at the time and cannot reproduce it now, but it was roughly:

dial tcp 172.* connect reset

In fact, the apiserver logs had already hinted at this: the internal 172.* traffic was broken.
Subnet conflict
At this point an article saved me.
My server's private subnet overlapped with the container network! I had not taken it seriously at first, figuring it had no direct connection to the IO problem, so what good would fixing it do?
But in a nothing-left-to-lose, might-as-well-try spirit, I switched the server to a different private subnet. (After the switch, the configmaps and other resources I had created were all gone and had to be recreated.)
minikube also threw an error when starting back up; a minikube delete fixed it, and I started over.

couldn't retrieve DNS addon deployments: 
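For the record, the overlap itself can be demonstrated with nothing but shell arithmetic. The server's address 172.17.0.36 (visible in the apiserver logs) sits inside docker's default bridge network 172.17.0.0/16; the /20 host prefix below is my assumption, since the exact VPC mask is not in the original notes, and the live values would come from `ip -o -4 addr show` and `ip route`.

```shell
# Convert a dotted-quad IP to a 32-bit integer.
ip2int() {
  oldifs=$IFS; IFS=.; set -- $1; IFS=$oldifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}
# Network address of an IP under a given prefix length.
net() { echo $(( $(ip2int "$1") & (0xffffffff << (32 - $2)) & 0xffffffff )); }

host_net=$(net 172.17.0.36 20)   # server's VPC subnet (prefix length assumed)
dock_net=$(net 172.17.0.1 16)    # docker0 default bridge, 172.17.0.0/16
# The /16 contains the /20 iff the host network, masked to /16, equals it.
if [ $(( host_net & (0xffffffff << 16) & 0xffffffff )) -eq "$dock_net" ]; then
  echo "CONFLICT: host subnet overlaps docker bridge 172.17.0.0/16"
fi
```

The fix in my case was exactly what this detects: moving the server onto a VPC subnet outside 172.17.0.0/16 so pod traffic and node traffic stopped colliding.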