Minor fix #243

Closed · wants to merge 23 commits
Commits (23)
2cbcfe0
policy: add a decapod app for policies
Oct 4, 2023
0923ba4
Merge pull request #218 from openinfradev/main
ktkfree Nov 13, 2023
4b43b37
Merge pull request #226 from openinfradev/main
ktkfree Nov 17, 2023
7128398
Merge pull request #178 from openinfradev/policy-serving
intelliguy Nov 24, 2023
49558f0
fluentbit: do not store as default over every logs
Dec 4, 2023
e008a7d
Merge pull request #230 from openinfradev/fluentbit
bluejayA Dec 5, 2023
50d7082
Merge pull request #231 from openinfradev/main
ktkfree Jan 15, 2024
a8653f6
feature. add alert ruler for tks_policy
ktkfree Apr 19, 2024
10f40d2
Merge pull request #233 from openinfradev/policy_ruler
intelliguy Apr 23, 2024
9d2964c
feature. remove thanos ruler from all stack_templates
ktkfree Apr 24, 2024
b298550
Merge pull request #235 from openinfradev/remove_thanos_ruller
intelliguy Apr 24, 2024
b7816bd
feature. change service type LoadBalancer for thanos-ruler
ktkfree Apr 25, 2024
5ab7d8e
Merge pull request #236 from openinfradev/change_servicetype_ruler
intelliguy Apr 25, 2024
f321e70
feature. add policy to byoh-reference
ktkfree May 3, 2024
15e62a7
Merge pull request #237 from openinfradev/byoh_fix
intelliguy May 3, 2024
eb5b524
Merge pull request #238 from openinfradev/develop
ktkfree May 17, 2024
7bd3a5f
fluentbit: add collecting targets for policy-serving
May 20, 2024
aaf00ca
Merge pull request #239 from openinfradev/policy-serving
ktkfree May 21, 2024
345bc1b
Merge pull request #240 from openinfradev/develop
ktkfree May 21, 2024
447a84d
Merge pull request #241 from openinfradev/release
ktkfree Jun 4, 2024
0273123
user-logging: add loki for non-platform-logs as loki-user
Jun 24, 2024
8031cc4
Merge pull request #242 from openinfradev/user-logging
intelliguy Jun 25, 2024
c07f57f
trivial. remove service type LoadBalaner from thanos-ruler
ktkfree Jul 16, 2024
99 changes: 42 additions & 57 deletions aws-msa-reference/lma/site-values.yaml
@@ -16,6 +16,8 @@ global:

lokiHost: loki-loki-distributed-gateway
lokiPort: 80
lokiuserHost: loki-user-loki-distributed-gateway
lokiuserPort: 80
s3Service: "minio.lma.svc:9000" # depends on $lmaNameSpace (ex. minio.taco-system.svc)

lmaNameSpace: lma
@@ -148,19 +150,23 @@ charts:
- name: taco-loki
host: $(lokiHost)
port: $(lokiPort)
lokiuser:
- name: taco-loki-user
host: $(lokiuserHost)
port: $(lokiuserPort)
targetLogs:
- tag: kube.*
bufferChunkSize: 2M
bufferMaxSize: 5M
do_not_store_as_default: false
index: container
loki_name: taco-loki
loki_name: taco-loki-user
memBufLimit: 20MB
multi_index:
- index: platform
loki_name: taco-loki
key: $kubernetes['namespace_name']
value: kube-system|$(lmaNameSpace)|taco-system|argo
value: kube-system|$(lmaNameSpace)|taco-system|gatekeeper-system|argo
parser: docker
path: /var/log/containers/*.log
type: kubernates
@@ -274,6 +280,8 @@ charts:
# - --deduplication.replica-label="replica"
storegateway.persistence.size: 8Gi
ruler.nodeSelector: $(nodeSelector)
ruler.service.type: LoadBalancer
ruler.service.annotations: $(awsNlbAnnotation)
ruler.alertmanagers:
- http://alertmanager-operated:9093
ruler.persistence.size: 8Gi
@@ -283,61 +291,7 @@
rules:
- alert: "PrometheusDown"
expr: absent(up{prometheus="lma/lma-prometheus"})
- alert: node-cpu-high-load
annotations:
message: The idle CPU share on node ({{ $labels.instance }}) of cluster ({{ $labels.taco_cluster }}) has been 0% for 3 minutes. (current usage {{$value}})
description: The worker node CPU is overloaded. This can have many causes, such as a temporary increase in service traffic, a workload software error, or a server hardware fan failure.
Checkpoint: If no temporary increase in service traffic has been observed, check the configuration of the pods on the alerting node that consume the most CPU. For example, a CPU limit in the pod spec can prevent excessive CPU usage.
summary: Cpu resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: (avg by (taco_cluster, instance) (rate(node_cpu_seconds_total{mode="idle"}[60s]))) < 0 #0.1 # really 0?
for: 3m
labels:
severity: warning
- alert: node-memory-high-utilization
annotations:
message: Memory usage on node ({{ $labels.instance }}) of cluster ({{ $labels.taco_cluster }}) has exceeded 80% for 3 minutes. (current usage {{$value}})
description: Worker node memory usage has exceeded 80%. This can have many causes, such as a temporary increase in service load or a software error.
Checkpoint: If no temporary increase in service traffic has been observed, inspect the pods with high memory usage on the alerting node.
summary: Memory resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes) < 0.2
for: 3m
labels:
severity: warning
- alert: node-disk-full
annotations:
message: Based on the trend of the last 6 hours, the root volume of node ({{ $labels.instance }}) in cluster ({{ $labels.taco_cluster }}) is expected to be full within 24 hours.
description: At the current disk-usage trend, the disk is expected to fill up within 24 hours.
Checkpoint: Optimize disk usage (deletion and backup). If there is nothing to delete, plan a capacity expansion.
summary: Memory resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
for: 30m
labels:
severity: critical
- alert: pvc-full
annotations:
message: Based on the trend of the last 6 hours, the volume ({{ $labels.persistentvolumeclaim }}) in cluster ({{ $labels.taco_cluster }}) is expected to be full within 24 hours.
description: At the current disk-usage trend, the volume is expected to fill up within 24 hours. (cluster {{ $labels.taco_cluster }}, PVC {{ $labels.persistentvolumeclaim }})
Checkpoint: Optimize disk usage (deletion and backup). If there is nothing to delete, plan a capacity expansion.
summary: Disk resources of the volume(pvc) {{ $labels.persistentvolumeclaim }} are running low.
discriminative: $labels.taco_cluster, $labels.persistentvolumeclaim
expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 24*3600) < 0 # kubelet_volume_stats_capacity_bytes
for: 30m
labels:
severity: critical
- alert: pod-restart-frequently
annotations:
message: The pod ({{ $labels.pod }}) in cluster ({{ $labels.taco_cluster }}) has restarted 5 or more times within 30 minutes ({{ $value }} restarts).
description: A pod is restarting frequently and needs to be inspected. (cluster {{ $labels.taco_cluster }}, pod {{ $labels.pod }})
Checkpoint: Review the pod spec, and check the pod's logs and status.
discriminative: $labels.taco_cluster, $labels.pod, $labels.namespace
expr: increase(kube_pod_container_status_restarts_total{namespace!="kube-system"}[60m:]) > 2 # how many restarts should the threshold be?
for: 30m
labels:
severity: critical


- name: thanos-config
override:
objectStorage:
@@ -393,6 +347,37 @@ charts:
aws:
s3: http://$(defaultUser):$(defaultPassword)@$(s3Service)/minio

- name: loki-user
override:
global.dnsService: kube-dns
# global.clusterDomain: $(clusterName) # commented out because the cluster domain is still cluster.local regardless of the cluster
gateway.service.type: LoadBalancer
gateway.service.annotations: $(awsNlbAnnotation)
ingester.persistence.storageClass: $(storageClassName)
distributor.persistence.storageClass: $(storageClassName)
queryFrontend.persistence.storageClass: $(storageClassName)
ruler.persistence.storageClass: $(storageClassName)
indexGateway.persistence.storageClass: $(storageClassName)
# select target node's label
ingester.nodeSelector: $(nodeSelector)
distributor.nodeSelector: $(nodeSelector)
querier.nodeSelector: $(nodeSelector)
queryFrontend.nodeSelector: $(nodeSelector)
queryScheduler.nodeSelector: $(nodeSelector)
tableManager.nodeSelector: $(nodeSelector)
gateway.nodeSelector: $(nodeSelector)
compactor.nodeSelector: $(nodeSelector)
ruler.nodeSelector: $(nodeSelector)
indexGateway.nodeSelector: $(nodeSelector)
memcachedChunks.nodeSelector: $(nodeSelector)
memcachedFrontend.nodeSelector: $(nodeSelector)
memcachedIndexQueries.nodeSelector: $(nodeSelector)
memcachedIndexWrites.nodeSelector: $(nodeSelector)
loki:
storageConfig:
aws:
s3: http://$(defaultUser):$(defaultPassword)@$(s3Service)/minio

- name: lma-bucket
override:
s3.enabled: true
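
To make the log-routing change above easier to read, here is a sketch of the relevant fluent-bit values with the $(...) references substituted from the globals in this file (lmaNameSpace: lma). This is illustrative only, not the rendered chart output: container logs now default to the new taco-loki-user endpoint, while the platform namespaces — now including gatekeeper-system — keep flowing to the platform Loki via multi_index.

# Illustrative resolution of the routing values above (assumption: plain
# string substitution of the $(...) references by the transformer).
loki:
  - name: taco-loki                      # platform Loki (unchanged)
    host: loki-loki-distributed-gateway
    port: 80
lokiuser:
  - name: taco-loki-user                 # new Loki instance for user workloads
    host: loki-user-loki-distributed-gateway
    port: 80
targetLogs:
  - tag: kube.*
    loki_name: taco-loki-user            # default sink is now the user Loki
    index: container
    multi_index:
      - index: platform
        loki_name: taco-loki             # platform namespaces stay on the platform Loki
        key: $kubernetes['namespace_name']
        value: kube-system|lma|taco-system|gatekeeper-system|argo
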
5 changes: 5 additions & 0 deletions aws-msa-reference/policy/kustomization.yaml
@@ -0,0 +1,5 @@
resources:
- ../base

transformers:
- site-values.yaml
26 changes: 26 additions & 0 deletions aws-msa-reference/policy/site-values.yaml
@@ -0,0 +1,26 @@
apiVersion: openinfradev.github.com/v1
kind: HelmValuesTransformer
metadata:
name: site

global:
nodeSelector:
taco-lma: enabled
clusterName: cluster.local
storageClassName: taco-storage
repository: https://openinfradev.github.io/helm-repo/

charts:
- name: opa-gatekeeper
override:
postUpgrade.nodeSelector: $(nodeSelector)
postInstall.nodeSelector: $(nodeSelector)
preUninstall.nodeSelector: $(nodeSelector)
controllerManager.nodeSelector: $(nodeSelector)
audit.nodeSelector: $(nodeSelector)
crds.nodeSelector: $(nodeSelector)

enableDeleteOperations: true

- name: policy-resources
override: {}
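
As a quick illustration of what the dotted override paths above amount to, the following is a hypothetical rendering of the opa-gatekeeper values after the HelmValuesTransformer substitutes $(nodeSelector) with the global map (taco-lma: enabled) and nests the dotted keys. The exact output is up to the transformer, so treat this as a sketch rather than the chart's actual rendered values.

# Hypothetical rendered values for the opa-gatekeeper chart (sketch only):
controllerManager:
  nodeSelector:
    taco-lma: enabled
audit:
  nodeSelector:
    taco-lma: enabled
crds:
  nodeSelector:
    taco-lma: enabled
postInstall:
  nodeSelector:
    taco-lma: enabled
postUpgrade:
  nodeSelector:
    taco-lma: enabled
preUninstall:
  nodeSelector:
    taco-lma: enabled
enableDeleteOperations: true   # passed through unchanged
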
100 changes: 42 additions & 58 deletions aws-reference/lma/site-values.yaml
@@ -16,6 +16,8 @@ global:

lokiHost: loki-loki-distributed-gateway
lokiPort: 80
lokiuserHost: loki-user-loki-distributed-gateway
lokiuserPort: 80
s3Service: "minio.lma.svc:9000" # depends on $lmaNameSpace (ex. minio.taco-system.svc)

lmaNameSpace: lma
@@ -148,19 +150,23 @@ charts:
- name: taco-loki
host: $(lokiHost)
port: $(lokiPort)
lokiuser:
- name: taco-loki-user
host: $(lokiuserHost)
port: $(lokiuserPort)
targetLogs:
- tag: kube.*
bufferChunkSize: 2M
bufferMaxSize: 5M
do_not_store_as_default: false
index: container
loki_name: taco-loki
loki_name: taco-loki-user
memBufLimit: 20MB
multi_index:
- index: platform
loki_name: taco-loki
key: $kubernetes['namespace_name']
value: kube-system|$(lmaNameSpace)|taco-system|argo
value: kube-system|$(lmaNameSpace)|taco-system|gatekeeper-system|argo
parser: docker
path: /var/log/containers/*.log
type: kubernates
@@ -244,7 +250,6 @@ charts:
consoleIngress.nodeSelector: $(nodeSelector)
postJob.nodeSelector: $(nodeSelector)


- name: thanos
override:
global.storageClass: $(storageClassName)
@@ -274,6 +279,8 @@ charts:
# - --deduplication.replica-label="replica"
storegateway.persistence.size: 8Gi
ruler.nodeSelector: $(nodeSelector)
ruler.service.type: LoadBalancer
ruler.service.annotations: $(awsNlbAnnotation)
ruler.alertmanagers:
- http://alertmanager-operated:9093
ruler.persistence.size: 8Gi
@@ -283,61 +290,7 @@
rules:
- alert: "PrometheusDown"
expr: absent(up{prometheus="lma/lma-prometheus"})
- alert: node-cpu-high-load
annotations:
message: The idle CPU share on node ({{ $labels.instance }}) of cluster ({{ $labels.taco_cluster }}) has been 0% for 3 minutes. (current usage {{$value}})
description: The worker node CPU is overloaded. This can have many causes, such as a temporary increase in service traffic, a workload software error, or a server hardware fan failure.
Checkpoint: If no temporary increase in service traffic has been observed, check the configuration of the pods on the alerting node that consume the most CPU. For example, a CPU limit in the pod spec can prevent excessive CPU usage.
summary: Cpu resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: (avg by (taco_cluster, instance) (rate(node_cpu_seconds_total{mode="idle"}[60s]))) < 0 #0.1 # really 0?
for: 3m
labels:
severity: warning
- alert: node-memory-high-utilization
annotations:
message: Memory usage on node ({{ $labels.instance }}) of cluster ({{ $labels.taco_cluster }}) has exceeded 80% for 3 minutes. (current usage {{$value}})
description: Worker node memory usage has exceeded 80%. This can have many causes, such as a temporary increase in service load or a software error.
Checkpoint: If no temporary increase in service traffic has been observed, inspect the pods with high memory usage on the alerting node.
summary: Memory resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes) < 0.2
for: 3m
labels:
severity: warning
- alert: node-disk-full
annotations:
message: Based on the trend of the last 6 hours, the root volume of node ({{ $labels.instance }}) in cluster ({{ $labels.taco_cluster }}) is expected to be full within 24 hours.
description: At the current disk-usage trend, the disk is expected to fill up within 24 hours.
Checkpoint: Optimize disk usage (deletion and backup). If there is nothing to delete, plan a capacity expansion.
summary: Memory resources of the node {{ $labels.instance }} are running low.
discriminative: $labels.taco_cluster, $labels.instance
expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24*3600) < 0
for: 30m
labels:
severity: critical
- alert: pvc-full
annotations:
message: Based on the trend of the last 6 hours, the volume ({{ $labels.persistentvolumeclaim }}) in cluster ({{ $labels.taco_cluster }}) is expected to be full within 24 hours.
description: At the current disk-usage trend, the volume is expected to fill up within 24 hours. (cluster {{ $labels.taco_cluster }}, PVC {{ $labels.persistentvolumeclaim }})
Checkpoint: Optimize disk usage (deletion and backup). If there is nothing to delete, plan a capacity expansion.
summary: Disk resources of the volume(pvc) {{ $labels.persistentvolumeclaim }} are running low.
discriminative: $labels.taco_cluster, $labels.persistentvolumeclaim
expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 24*3600) < 0 # kubelet_volume_stats_capacity_bytes
for: 30m
labels:
severity: critical
- alert: pod-restart-frequently
annotations:
message: The pod ({{ $labels.pod }}) in cluster ({{ $labels.taco_cluster }}) has restarted 5 or more times within 30 minutes ({{ $value }} restarts).
description: A pod is restarting frequently and needs to be inspected. (cluster {{ $labels.taco_cluster }}, pod {{ $labels.pod }})
Checkpoint: Review the pod spec, and check the pod's logs and status.
discriminative: $labels.taco_cluster, $labels.pod, $labels.namespace
expr: increase(kube_pod_container_status_restarts_total{namespace!="kube-system"}[60m:]) > 2 # how many restarts should the threshold be?
for: 30m
labels:
severity: critical


- name: thanos-config
override:
objectStorage:
@@ -393,6 +346,37 @@ charts:
aws:
s3: http://$(defaultUser):$(defaultPassword)@$(s3Service)/minio

- name: loki-user
override:
global.dnsService: kube-dns
# global.clusterDomain: $(clusterName) # commented out because the cluster domain is still cluster.local regardless of the cluster
gateway.service.type: LoadBalancer
gateway.service.annotations: $(awsNlbAnnotation)
ingester.persistence.storageClass: $(storageClassName)
distributor.persistence.storageClass: $(storageClassName)
queryFrontend.persistence.storageClass: $(storageClassName)
ruler.persistence.storageClass: $(storageClassName)
indexGateway.persistence.storageClass: $(storageClassName)
# select target node's label
ingester.nodeSelector: $(nodeSelector)
distributor.nodeSelector: $(nodeSelector)
querier.nodeSelector: $(nodeSelector)
queryFrontend.nodeSelector: $(nodeSelector)
queryScheduler.nodeSelector: $(nodeSelector)
tableManager.nodeSelector: $(nodeSelector)
gateway.nodeSelector: $(nodeSelector)
compactor.nodeSelector: $(nodeSelector)
ruler.nodeSelector: $(nodeSelector)
indexGateway.nodeSelector: $(nodeSelector)
memcachedChunks.nodeSelector: $(nodeSelector)
memcachedFrontend.nodeSelector: $(nodeSelector)
memcachedIndexQueries.nodeSelector: $(nodeSelector)
memcachedIndexWrites.nodeSelector: $(nodeSelector)
loki:
storageConfig:
aws:
s3: http://$(defaultUser):$(defaultPassword)@$(s3Service)/minio

- name: lma-bucket
override:
s3.enabled: true
5 changes: 5 additions & 0 deletions aws-reference/policy/kustomization.yaml
@@ -0,0 +1,5 @@
resources:
- ../base

transformers:
- site-values.yaml
26 changes: 26 additions & 0 deletions aws-reference/policy/site-values.yaml
@@ -0,0 +1,26 @@
apiVersion: openinfradev.github.com/v1
kind: HelmValuesTransformer
metadata:
name: site

global:
nodeSelector:
taco-lma: enabled
clusterName: cluster.local
storageClassName: taco-storage
repository: https://openinfradev.github.io/helm-repo/

charts:
- name: opa-gatekeeper
override:
postUpgrade.nodeSelector: $(nodeSelector)
postInstall.nodeSelector: $(nodeSelector)
preUninstall.nodeSelector: $(nodeSelector)
controllerManager.nodeSelector: $(nodeSelector)
audit.nodeSelector: $(nodeSelector)
crds.nodeSelector: $(nodeSelector)

enableDeleteOperations: true

- name: policy-resources
override: {}