Error message or symptom
- A report was received that the screen where metrics are drawn takes a very long time to load when accessing the product.
Cause
Summary: the worker node went NotReady due to a "no space left on device" condition.
At that point, Prometheus terminated abnormally and its meta.json block files were damaged.
Result: Prometheus stopped working properly.
- Check the sodaflow-app log
- Confirmed that errors are raised when calling Prometheus:
2022-12-29 09:29:08,162 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94e0dff590>: Failed to establish a new connection: [Errno 110] Connection timed out')': //api/v1/query?query=kube_pod_labels%7Bnamespace%3D%27sodaflow%27%2Clabel_sodaflow_project%3D%27spsd%27%7D
2022-12-29 09:29:49,122 | 409ecf08e53bd5a6712b7c365e540903 | superuser | app.mod_prometheus.services | ERROR | Prometheus metrics exception raised: HTTPConnectionPool(host='kube-prometheus-stack-prometheus.monitoring', port=9090): Max retries exceeded with url: //api/v1/query?query=kube_pod_labels%7Bnamespace%3D%27sodaflow%27%2Clabel_sodaflow_project%3D%27spsd%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94e0ef6cd0>: Failed to establish a new connection: [Errno 110] Connection timed out'))
2022-12-29 09:29:49,123 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94e0e16f10>: Failed to establish a new connection: [Errno 110] Connection timed out')': //api/v1/query?query=kube_pod_labels%7Bnamespace%3D%27sodaflow%27%2Clabel_sodaflow_project%3D%27spsd%27%7D
2022-12-29 09:29:49,125 | 409ecf08e53bd5a6712b7c365e540903 | superuser | app.common.utils | DEBUG | DB rollback...
2022-12-29 09:29:49,249 | 409ecf08e53bd5a6712b7c365e540903 | superuser | app.common.utils | ERROR | Traceback (most recent call last):
- Check the sodaflow-app-ui log
- Confirmed that communication with the UI is failing:
[HPM] Proxy created: / -> http://sprk.abllife.co.kr/service
UI Server Running on http://0.0.0.0:9001, Timeout: 600 seconds
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/experiments/get/params/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
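For reference, the two application logs above can be tailed with kubectl. This is only a sketch: the deployment names sodaflow-app and sodaflow-app-ui and the sodaflow namespace are assumptions inferred from the log output, so adjust them to the actual resource names in the cluster.
# Hypothetical workload names -- adjust to the real Deployment/Pod names
kubectl logs -f deploy/sodaflow-app -n sodaflow
kubectl logs -f deploy/sodaflow-app-ui -n sodaflow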
- Check the kube-prometheus-stack-prometheus.monitoring log
- Traced here from the "HTTPConnectionPool: kube-prometheus-stack-prometheus.monitoring" errors seen in the sodaflow-app log.
- Error: level=error ts=2022-12-29T04:22:48.928Z caller=repair.go:53 component=tsdb msg="failed to read meta.json for a block during repair process; skipping" dir=/prometheus/01GFMPC5G3GX84KKHV3CAKX3J9 err="open /prometheus/01GFMPC5G3GX84KKHV3CAKX3J9/meta.json: no such file or directory"
=> Confirmed that the meta.json files were damaged when the worker1 node rebooted.
kubectl logs -f prometheus-kube-prometheus-stack-prometheus-0 -nmonitoring prometheus
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:364 msg="Starting Prometheus" version="(version=2.24.0, branch=HEAD, revision=02e92236a8bad3503ff5eec3e04ac205a3b8e4fe)"
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:369 build_context="(go=go1.15.6, user=root@d9f90f0b1f76, date=20210106-13:48:37)"
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:370 host_details="(Linux 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Wed Nov 3 10:29:07 UTC 2021 x86_64 prometheus-kube-prometheus-stack-prometheus-0 (none))"
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:371 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:372 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-12-29T04:22:48.871Z caller=web.go:530 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-12-29T04:22:48.871Z caller=main.go:738 msg="Starting TSDB ..."
level=info ts=2022-12-29T04:22:48.873Z caller=tls_config.go:192 component=web msg="TLS is disabled." http2=false
level=info ts=2022-12-29T04:22:48.928Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1666057325037 maxt=1666058400000 ulid=01GFMPC5DW43DKR6RE526D1BJC
level=error ts=2022-12-29T04:22:48.928Z caller=repair.go:53 component=tsdb msg="failed to read meta.json for a block during repair process; skipping" dir=/prometheus/01GFMPC5G3GX84KKHV3CAKX3J9 err="open /prometheus/01GFMPC5G3GX84KKHV3CAKX3J9/meta.json: no such file or directory"
level=error ts=2022-12-29T04:22:48.929Z caller=repair.go:53 component=tsdb msg="failed to read meta.json for a block during repair process; skipping" dir=/prometheus/01GFMPC7Z2JAJG0AZRGGSHZ4ZY err="open /prometheus/01GFMPC7Z2JAJG0AZRGGSHZ4ZY/meta.json: no such file or directory"
level=error ts=2022-12-29T04:22:51.870Z caller=db.go:1061 component=tsdb msg="Failed to read meta.json for a block during reloadBlocks. Skipping" dir=/prometheus/01GFMPG7QJWZXE6X76TW84N5PJ err="open /prometheus/01GFMPG7QJWZXE6X76TW84N5PJ/meta.json: no such file or directory"
level=error ts=2022-12-29T04:22:51.870Z caller=db.go:1061 component=tsdb msg="Failed to read meta.json for a block during reloadBlocks. Skipping" dir=/prometheus/01GFMPJ3HFDPM503DF575N55F4 err="open /prometheus/01GFMPJ3HFDPM503DF575N55F4/meta.json: no such file or directory"
level=error ts=2022-12-29T04:22:51.870Z caller=db.go:1061 component=tsdb msg="Failed to read meta.json for a block during reloadBlocks. Skipping" dir=/prometheus/01GFMPKZJFMW4SAMMCF33V7HTE err="open /prometheus/01GFMPKZJFMW4SAMMCF33V7HTE/meta.json: no such file or directory"
level=info ts=2022-12-29T04:22:51.976Z caller=head.go:645 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2022-12-29T04:24:06.524Z caller=head.go:659 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=1m14.548796983s
level=info ts=2022-12-29T04:24:06.524Z caller=head.go:665 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2022-12-29T04:24:08.921Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=309
level=info ts=2022-12-29T04:24:09.043Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=309
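As a quick way to see which TSDB blocks are missing their meta.json, the block directories can be checked inside the Prometheus container. This is a sketch: the /prometheus data path and the prometheus container name come from the log and command above, and it assumes the container image provides /bin/sh and that block directories match the 01* ULID pattern seen in the log.
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  sh -c 'for d in /prometheus/01*; do [ -f "$d/meta.json" ] || echo "missing meta.json: $d"; done'
# Any directory printed here corresponds to a block Prometheus skips during the repair/reload steps above.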
- Check the node status
- Confirmed that worker1 is NotReady
: The node ran short of memory and could not operate normally, which caused this.
kubectl get nodes
NAME      STATUS     ROLES    AGE   VERSION
master1   Ready      master   73d   v1.18.19
master2   Ready      master   55d   v1.18.19
master3   Ready      master   73d   v1.18.19
worker1   NotReady   <none>   73d   v1.18.19
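To confirm why the node went NotReady, the node conditions and remaining disk space can be checked as follows (a sketch using standard commands; direct shell access to worker1 is assumed for the df check).
kubectl describe node worker1   # inspect the Conditions section (Ready, MemoryPressure, DiskPressure)
# On worker1 itself, check free space on the partitions used by the kubelet and container runtime:
df -h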
Resolution
- The resolution steps are documented below, together with the attempts that failed.
- Back up the Prometheus data
: Back up the PVC to keep the data accumulated so far.
cd <PVC ROOT>
tar zcvf monitoring.tar.gz monitoring-prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0-pvc-<UUID>
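Before moving on, it may be worth a quick sanity check that the archive was written correctly (assuming it was created in <PVC ROOT> as above).
tar -tzf monitoring.tar.gz | head   # list the first few archived paths
du -sh monitoring.tar.gz            # rough check on the archive size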
- Take down the Prometheus pod
kubectl delete pod prometheus-kube-prometheus-stack-prometheus-0 -nmonitoring
: If Prometheus still does not work normally after the command above, run the following and continue through step 4 in order.
cd <SETUP ROOT>/setup/06_prometheus_grafana/
kubectl delete -f yamls/
kubectl delete -f yamls/crds
kubectl delete ns monitoring
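If the full teardown above was needed, it can help to confirm that the monitoring resources are actually gone before reinstalling (a sketch using standard kubectl commands).
kubectl get pods -n monitoring   # should return nothing once the pods are removed
kubectl get ns monitoring        # confirms whether the namespace deletion has finished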
- Delete the Prometheus PVC
cd <PVC ROOT>
# If deleting feels risky, renaming with the mv command is recommended instead.
rm -r monitoring.tar.gz monitoring-prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0-pvc-<UUID>
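If deletion is too risky, the rename approach mentioned in the comment above might look like this (the .bak suffix is only an illustration).
cd <PVC ROOT>
mv monitoring-prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0-pvc-<UUID> \
   monitoring-prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0-pvc-<UUID>.bak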
- Apply the Prometheus pod again
cd <SETUP ROOT>/setup/06_prometheus_grafana/
bash install_prometheus.sh
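After the install script finishes, the pods can be watched until they reach Running (standard kubectl; exact pod names depend on the chart).
kubectl get pods -n monitoring -w
# Expect prometheus-kube-prometheus-stack-prometheus-0 to reach the Running state.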
Check the logs
kubectl logs -f prometheus-kube-prometheus-stack-prometheus-0 -nmonitoring prometheus
level=info ts=2022-12-29T05:08:50.580Z caller=main.go:364 msg="Starting Prometheus" version="(version=2.24.0, branch=HEAD, revision=02e92236a8bad3503ff5eec3e04ac205a3b8e4fe)"
level=info ts=2022-12-29T05:08:50.580Z caller=main.go:369 build_context="(go=go1.15.6, user=root@d9f90f0b1f76, date=20210106-13:48:37)"
level=info ts=2022-12-29T05:08:50.580Z caller=main.go:370 host_details="(Linux 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Wed Nov 3 10:29:07 UTC 2021 x86_64 prometheus-kube-prometheus-stack-prometheus-0 (none))"
level=info ts=2022-12-29T05:08:50.580Z caller=main.go:371 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2022-12-29T05:08:50.581Z caller=main.go:372 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-12-29T05:08:50.585Z caller=web.go:530 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-12-29T05:08:50.585Z caller=main.go:738 msg="Starting TSDB ..."
level=info ts=2022-12-29T05:08:50.586Z caller=tls_config.go:192 component=web msg="TLS is disabled." http2=false
level=info ts=2022-12-29T05:08:50.602Z caller=head.go:645 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2022-12-29T05:08:50.602Z caller=head.go:659 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.28µs
level=info ts=2022-12-29T05:08:50.602Z caller=head.go:665 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2022-12-29T05:08:50.603Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2022-12-29T05:08:50.603Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=338.751µs wal_replay_duration=714.256µs total_replay_duration=1.076226ms
level=warn ts=2022-12-29T05:08:50.604Z caller=main.go:756 fs_type=NFS_SUPER_MAGIC msg="This filesystem is not supported and may lead to data corruption and data loss. Please carefully read https://prometheus.io/docs/prometheus/latest/storage/ to learn more about supported filesystems."
level=info ts=2022-12-29T05:08:50.604Z caller=main.go:761 msg="TSDB started"
level=info ts=2022-12-29T05:08:50.604Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-12-29T05:08:50.606Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:08:50.607Z caller=kubernetes.go:264 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:08:50.609Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=4.95112ms remote_storage=2.037µs web_handler=738ns query_engine=1.075µs scrape=572.033µs scrape_sd=1.754355ms notify=31.236µs notify_sd=1.941503ms rules=41.851µs
level=info ts=2022-12-29T05:08:50.609Z caller=main.go:710 msg="Server is ready to receive web requests."
level=info ts=2022-12-29T05:08:55.388Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-12-29T05:08:55.389Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:08:55.390Z caller=kubernetes.go:264 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:08:55.392Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=3.424107ms remote_storage=2.787µs web_handler=846ns query_engine=1.322µs scrape=78.071µs scrape_sd=1.21689ms notify=17.889µs notify_sd=1.347041ms rules=68.197µs
level=info ts=2022-12-29T05:10:17.562Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-12-29T05:10:17.570Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.571Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.572Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.573Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.574Z caller=kubernetes.go:264 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.646Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=84.05544ms remote_storage=3.848µs web_handler=883ns query_engine=1.496µs scrape=137.768µs scrape_sd=4.549424ms notify=23.889µs notify_sd=2.01647ms rules=70.00254ms
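As a final check that queries work again, Prometheus can be port-forwarded and queried directly. The service name below is an assumption derived from the kube-prometheus-stack-prometheus.monitoring host seen in the sodaflow-app log; adjust it if the Service is named differently.
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# In another shell, a simple query should return a success response with data:
curl 'http://localhost:9090/api/v1/query?query=up'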