Error message or symptom
- A report was received that the screen where metrics are drawn takes a very long time to load when accessing the product.
Cause
Summary: the worker node went NotReady due to a "no space left on device" condition.
At that point, Prometheus terminated abnormally and its meta.json block files were damaged.
Result: Prometheus stopped working properly.
- Check the sodaflow-app log
- Confirmed that errors are raised when calling Prometheus:
2022-12-29 09:29:08,162 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94e0dff590>: Failed to establish a new connection: [Errno 110] Connection timed out')': //api/v1/query?query=kube_pod_labels%7Bnamespace%3D%27sodaflow%27%2Clabel_sodaflow_project%3D%27spsd%27%7D
2022-12-29 09:29:49,122 | 409ecf08e53bd5a6712b7c365e540903 | superuser | app.mod_prometheus.services | ERROR | Prometheus metrics exception raised: HTTPConnectionPool(host='kube-prometheus-stack-prometheus.monitoring', port=9090): Max retries exceeded with url: //api/v1/query?query=kube_pod_labels%7Bnamespace%3D%27sodaflow%27%2Clabel_sodaflow_project%3D%27spsd%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94e0ef6cd0>: Failed to establish a new connection: [Errno 110] Connection timed out'))
2022-12-29 09:29:49,123 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f94e0e16f10>: Failed to establish a new connection: [Errno 110] Connection timed out')': //api/v1/query?query=kube_pod_labels%7Bnamespace%3D%27sodaflow%27%2Clabel_sodaflow_project%3D%27spsd%27%7D
2022-12-29 09:29:49,125 | 409ecf08e53bd5a6712b7c365e540903 | superuser | app.common.utils | DEBUG | DB rollback...
2022-12-29 09:29:49,249 | 409ecf08e53bd5a6712b7c365e540903 | superuser | app.common.utils | ERROR | Traceback (most recent call last):
- Check the sodaflow-app-ui log
- Confirmed that communication with the UI is failing:
[HPM] Proxy created: / -> http://sprk.abllife.co.kr/service
UI Server Running on http://0.0.0.0:9001, Timeout: 600 seconds
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/experiments/get/params/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while proxying request sprk.abllife.co.kr:8443/api/prometheus/project/metrics/ to http://sprk.abllife.co.kr/service [ECONNRESET] (https://nodejs.org/api/errors.html#errors_common_system_errors)
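For reference, the two application logs above can be tailed with kubectl. This is only a sketch: the deployment names sodaflow-app and sodaflow-app-ui and the sodaflow namespace are assumptions inferred from the log output, so adjust them to the actual resource names in the cluster.
# Hypothetical workload names -- adjust to the real Deployment/Pod names
kubectl logs -f deploy/sodaflow-app -n sodaflow
kubectl logs -f deploy/sodaflow-app-ui -n sodaflow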
- Check the kube-prometheus-stack-prometheus.monitoring log
- Traced here from the "HTTPConnectionPool: kube-prometheus-stack-prometheus.monitoring" errors seen in the sodaflow-app log.
- Error: level=error ts=2022-12-29T04:22:48.928Z caller=repair.go:53 component=tsdb msg="failed to read meta.json for a block during repair process; skipping" dir=/prometheus/01GFMPC5G3GX84KKHV3CAKX3J9 err="open /prometheus/01GFMPC5G3GX84KKHV3CAKX3J9/meta.json: no such file or directory"
=> Confirmed that the meta.json files were damaged when the worker1 node rebooted.
kubectl logs -f prometheus-kube-prometheus-stack-prometheus-0 -nmonitoring prometheus
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:364 msg="Starting Prometheus" version="(version=2.24.0, branch=HEAD, revision=02e92236a8bad3503ff5eec3e04ac205a3b8e4fe)"
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:369 build_context="(go=go1.15.6, user=root@d9f90f0b1f76, date=20210106-13:48:37)"
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:370 host_details="(Linux 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Wed Nov 3 10:29:07 UTC 2021 x86_64 prometheus-kube-prometheus-stack-prometheus-0 (none))"
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:371 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2022-12-29T04:22:48.848Z caller=main.go:372 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-12-29T04:22:48.871Z caller=web.go:530 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-12-29T04:22:48.871Z caller=main.go:738 msg="Starting TSDB ..."
level=info ts=2022-12-29T04:22:48.873Z caller=tls_config.go:192 component=web msg="TLS is disabled." http2=false
level=info ts=2022-12-29T04:22:48.928Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1666057325037 maxt=1666058400000 ulid=01GFMPC5DW43DKR6RE526D1BJC
level=error ts=2022-12-29T04:22:48.928Z caller=repair.go:53 component=tsdb msg="failed to read meta.json for a block during repair process; skipping" dir=/prometheus/01GFMPC5G3GX84KKHV3CAKX3J9 err="open /prometheus/01GFMPC5G3GX84KKHV3CAKX3J9/meta.json: no such file or directory"
level=error ts=2022-12-29T04:22:48.929Z caller=repair.go:53 component=tsdb msg="failed to read meta.json for a block during repair process; skipping" dir=/prometheus/01GFMPC7Z2JAJG0AZRGGSHZ4ZY err="open /prometheus/01GFMPC7Z2JAJG0AZRGGSHZ4ZY/meta.json: no such file or directory"
level=error ts=2022-12-29T04:22:51.870Z caller=db.go:1061 component=tsdb msg="Failed to read meta.json for a block during reloadBlocks. Skipping" dir=/prometheus/01GFMPG7QJWZXE6X76TW84N5PJ err="open /prometheus/01GFMPG7QJWZXE6X76TW84N5PJ/meta.json: no such file or directory"
level=error ts=2022-12-29T04:22:51.870Z caller=db.go:1061 component=tsdb msg="Failed to read meta.json for a block during reloadBlocks. Skipping" dir=/prometheus/01GFMPJ3HFDPM503DF575N55F4 err="open /prometheus/01GFMPJ3HFDPM503DF575N55F4/meta.json: no such file or directory"
level=error ts=2022-12-29T04:22:51.870Z caller=db.go:1061 component=tsdb msg="Failed to read meta.json for a block during reloadBlocks. Skipping" dir=/prometheus/01GFMPKZJFMW4SAMMCF33V7HTE err="open /prometheus/01GFMPKZJFMW4SAMMCF33V7HTE/meta.json: no such file or directory"
level=info ts=2022-12-29T04:22:51.976Z caller=head.go:645 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2022-12-29T04:24:06.524Z caller=head.go:659 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=1m14.548796983s
level=info ts=2022-12-29T04:24:06.524Z caller=head.go:665 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2022-12-29T04:24:08.921Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=309
level=info ts=2022-12-29T04:24:09.043Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=309
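As a quick way to see which TSDB blocks are missing their meta.json, the block directories can be checked inside the Prometheus container. This is a sketch: the /prometheus data path and the prometheus container name come from the log and command above, and it assumes the container image provides /bin/sh and that block directories match the 01* ULID pattern seen in the log.
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  sh -c 'for d in /prometheus/01*; do [ -f "$d/meta.json" ] || echo "missing meta.json: $d"; done'
# Any directory printed here corresponds to a block Prometheus skips during the repair/reload steps above.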
- Check the node status
- Confirmed that worker1 is NotReady
: The node ran short of memory and could not operate normally, which caused this.
kubectl get nodes
NAME      STATUS     ROLES    AGE   VERSION
master1   Ready      master   73d   v1.18.19
master2   Ready      master   55d   v1.18.19
master3   Ready      master   73d   v1.18.19
worker1   NotReady   <none>   73d   v1.18.19
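To confirm why the node went NotReady, the node conditions and remaining disk space can be checked as follows (a sketch using standard commands; direct shell access to worker1 is assumed for the df check).
kubectl describe node worker1   # inspect the Conditions section (Ready, MemoryPressure, DiskPressure)
# On worker1 itself, check free space on the partitions used by the kubelet and container runtime:
df -h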
Resolution
- The resolution steps are documented below, together with the attempts that failed.
- Back up the Prometheus data
: Back up the PVC to keep the data accumulated so far.
cd <PVC ROOT>
tar zcvf monitoring.tar.gz monitoring-prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0-pvc-<UUID>
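Before moving on, it may be worth a quick sanity check that the archive was written correctly (assuming it was created in <PVC ROOT> as above).
tar -tzf monitoring.tar.gz | head   # list the first few archived paths
du -sh monitoring.tar.gz            # rough check on the archive size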
- Take down the Prometheus pod
kubectl delete pod prometheus-kube-prometheus-stack-prometheus-0 -nmonitoring
: If Prometheus still does not work normally after the command above, run the following and continue through step 4 in order.
cd <SETUP ROOT>/setup/06_prometheus_grafana/
kubectl delete -f yamls/
kubectl delete -f yamls/crds
kubectl delete ns monitoring
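If the full teardown above was needed, it can help to confirm that the monitoring resources are actually gone before reinstalling (a sketch using standard kubectl commands).
kubectl get pods -n monitoring   # should return nothing once the pods are removed
kubectl get ns monitoring        # confirms whether the namespace deletion has finished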
- Delete the Prometheus PVC
cd <PVC ROOT>
# If deleting feels risky, renaming with the mv command is recommended instead.
rm -r monitoring.tar.gz monitoring-prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0-pvc-<UUID>
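If deletion is too risky, the rename approach mentioned in the comment above might look like this (the .bak suffix is only an illustration).
cd <PVC ROOT>
mv monitoring-prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0-pvc-<UUID> \
   monitoring-prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0-pvc-<UUID>.bak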
- Apply the Prometheus pod again
cd <SETUP ROOT>/setup/06_prometheus_grafana/
bash install_prometheus.sh
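After the install script finishes, the pods can be watched until they reach Running (standard kubectl; exact pod names depend on the chart).
kubectl get pods -n monitoring -w
# Expect prometheus-kube-prometheus-stack-prometheus-0 to reach the Running state.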
Check the logs
kubectl logs -f prometheus-kube-prometheus-stack-prometheus-0 -nmonitoring prometheus
level=info ts=2022-12-29T05:08:50.580Z caller=main.go:364 msg="Starting Prometheus" version="(version=2.24.0, branch=HEAD, revision=02e92236a8bad3503ff5eec3e04ac205a3b8e4fe)"
level=info ts=2022-12-29T05:08:50.580Z caller=main.go:369 build_context="(go=go1.15.6, user=root@d9f90f0b1f76, date=20210106-13:48:37)"
level=info ts=2022-12-29T05:08:50.580Z caller=main.go:370 host_details="(Linux 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Wed Nov 3 10:29:07 UTC 2021 x86_64 prometheus-kube-prometheus-stack-prometheus-0 (none))"
level=info ts=2022-12-29T05:08:50.580Z caller=main.go:371 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2022-12-29T05:08:50.581Z caller=main.go:372 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-12-29T05:08:50.585Z caller=web.go:530 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-12-29T05:08:50.585Z caller=main.go:738 msg="Starting TSDB ..."
level=info ts=2022-12-29T05:08:50.586Z caller=tls_config.go:192 component=web msg="TLS is disabled." http2=false
level=info ts=2022-12-29T05:08:50.602Z caller=head.go:645 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2022-12-29T05:08:50.602Z caller=head.go:659 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.28µs
level=info ts=2022-12-29T05:08:50.602Z caller=head.go:665 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2022-12-29T05:08:50.603Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2022-12-29T05:08:50.603Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=338.751µs wal_replay_duration=714.256µs total_replay_duration=1.076226ms
level=warn ts=2022-12-29T05:08:50.604Z caller=main.go:756 fs_type=NFS_SUPER_MAGIC msg="This filesystem is not supported and may lead to data corruption and data loss. Please carefully read https://prometheus.io/docs/prometheus/latest/storage/ to learn more about supported filesystems."
level=info ts=2022-12-29T05:08:50.604Z caller=main.go:761 msg="TSDB started"
level=info ts=2022-12-29T05:08:50.604Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-12-29T05:08:50.606Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:08:50.607Z caller=kubernetes.go:264 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:08:50.609Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=4.95112ms remote_storage=2.037µs web_handler=738ns query_engine=1.075µs scrape=572.033µs scrape_sd=1.754355ms notify=31.236µs notify_sd=1.941503ms rules=41.851µs
level=info ts=2022-12-29T05:08:50.609Z caller=main.go:710 msg="Server is ready to receive web requests."
level=info ts=2022-12-29T05:08:55.388Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-12-29T05:08:55.389Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:08:55.390Z caller=kubernetes.go:264 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:08:55.392Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=3.424107ms remote_storage=2.787µs web_handler=846ns query_engine=1.322µs scrape=78.071µs scrape_sd=1.21689ms notify=17.889µs notify_sd=1.347041ms rules=68.197µs
level=info ts=2022-12-29T05:10:17.562Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-12-29T05:10:17.570Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.571Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.572Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.573Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.574Z caller=kubernetes.go:264 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-12-29T05:10:17.646Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=84.05544ms remote_storage=3.848µs web_handler=883ns query_engine=1.496µs scrape=137.768µs scrape_sd=4.549424ms notify=23.889µs notify_sd=2.01647ms rules=70.00254ms
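As a final check that queries work again, Prometheus can be port-forwarded and queried directly. The service name below is an assumption derived from the kube-prometheus-stack-prometheus.monitoring host seen in the sodaflow-app log; adjust it if the Service is named differently.
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# In another shell, a simple query should return a success response with data:
curl 'http://localhost:9090/api/v1/query?query=up'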