Make sure prometheus PVs of all our clusters have deletionPolicy set to retain #2717

yuvipanda · 2023-06-26T21:32:27Z

We don't want to lose prometheus data, so we should make sure every PV created for prometheus has its reclaim policy (`persistentVolumeReclaimPolicy`) set to Retain. Follow-up to #2688
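For a PV that already exists, the setting lives in `spec.persistentVolumeReclaimPolicy` on the PersistentVolume object. A minimal sketch of a patch body, assuming the change is applied with `kubectl patch`; the file name and the PV name placeholder are hypothetical:

```yaml
# retain-patch.yaml (hypothetical file name)
# Patch body for an existing PersistentVolume. It could be applied with
# something like:
#   kubectl patch pv <pv-name> --patch-file retain-patch.yaml
# where <pv-name> is the PV bound to the prometheus data PVC.
spec:
  persistentVolumeReclaimPolicy: Retain
```

A StorageClass's `reclaimPolicy` is only copied onto PVs when they are provisioned, so PVs that already exist would need a direct change like this one.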

Helps us recover prometheus data in case of accidental deletion. Fixes 2i2c-org#2717

We add this unconditionally to all clusters for simplicity, so we can set `storageClass: gp3` for new clusters that come up on AWS without issue. This doesn't change the default, and does not change the storageclass in existing clusters. In addition to using gp3, it also sets reclaimPolicy to Retain, so deleting the PVC does not delete the PV or the underlying EBS volume.

Ref 2i2c-org#2906 Ref 2i2c-org#2717
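A StorageClass along these lines might look roughly like the sketch below. Only the name `gp3`, the gp3 volume type, and `reclaimPolicy: Retain` come from the description above. The provisioner (which assumes the AWS EBS CSI driver is installed), the binding mode, and the expansion setting are assumptions:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  # No "is-default-class" annotation, so the cluster default is untouched.
# Assumption: the AWS EBS CSI driver is installed in the cluster.
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
# Keep the PV and the underlying EBS volume when the PVC is deleted.
reclaimPolicy: Retain
# Assumptions, not stated in the description above:
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Because the class is not marked as default, only PVCs that explicitly request `gp3` are affected.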

- Set up our own StorageClass for GKE clusters, specifically for use with prometheus data.
- Sets reclaimPolicy to 'Retain', so we don't accidentally kill the disk and lose all the data.
- Sets the disk type to 'Balanced', which is backed by SSDs and *much* faster than spinning disks. No more grafana timeouts!
- Move the existing data by manually attaching the disk to a small VM I created, and then copying the data over to the new PVC.
- Reduction in size, as 2i2c-org#3093 drastically reduced the size of the data! We went from about 512GB to only about 150GB after that. The size explosion has been solved! 512GB here still gives us enough room to grow.

Once this lands, I'll manually go through and do this for every single GCP cluster. Grafana timeouts BE GONE.

Ref 2i2c-org#2934 Ref 2i2c-org#2717 Ref 2i2c-org#2847 Fixes 2i2c-org#3111
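As a rough sketch, a StorageClass matching the first three points above could look like this. The class name is hypothetical, and the provisioner assumes the GKE Persistent Disk CSI driver is enabled. Only the 'Balanced' disk type and the Retain policy come from the description:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  # Hypothetical name; the actual class name is not given above.
  name: balanced-rwo-retain
# Assumption: the GKE Persistent Disk CSI driver is enabled.
provisioner: pd.csi.storage.gke.io
parameters:
  # 'Balanced' persistent disks, which are SSD-backed.
  type: pd-balanced
# Keep the PV and the underlying disk when the PVC is deleted.
reclaimPolicy: Retain
# Assumptions, not stated in the description above:
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

The prometheus data PVC would then need to request this class by name through its `storageClassName`.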

github-project-automation bot added this to DEPRECATED Engineering and Product Backlog Jun 26, 2023

github-project-automation bot moved this to Needs Shaping / Refinement in DEPRECATED Engineering and Product Backlog Jun 26, 2023

yuvipanda mentioned this issue Jun 26, 2023

pangeo-hubs: recover prometheus-metrics #2688

Closed

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 26, 2023

Set prometheus' data PV's reclaimPolicy to Retain

726a0a1

Helps us recover prometheus data in case of accidental deletion. Fixes 2i2c-org#2717

yuvipanda mentioned this issue Jun 26, 2023

Set prometheus' data PV's reclaimPolicy to Retain #2718

Closed

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 26, 2023

Set prometheus' data PV's reclaimPolicy to Retain

fa91600

Helps us recover prometheus data in case of accidental deletion. Fixes 2i2c-org#2717

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 28, 2023

Set prometheus' data PV's reclaimPolicy to Retain

9de7716

Helps us recover prometheus data in case of accidental deletion. Fixes 2i2c-org#2717

damianavila added this to Sprint Board Jul 4, 2023

damianavila moved this from Needs Shaping / Refinement to In progress in DEPRECATED Engineering and Product Backlog Jul 4, 2023

damianavila assigned yuvipanda Jul 4, 2023

damianavila moved this to Review / QA 👀 in Sprint Board Jul 4, 2023

damianavila moved this from Review / QA 👀 to In Progress ⚡ in Sprint Board Jul 5, 2023

damianavila removed this from Sprint Board Jul 12, 2023

yuvipanda mentioned this issue Aug 7, 2023

Add gp3 storageclass #2941

Closed

yuvipanda mentioned this issue Sep 9, 2023

Move 2i2c shared prometheus to pd-balanced disk #3112

Merged

consideRatio added the tech:prometheus label Sep 9, 2023

yuvipanda added the nominated-to-be-resolved-during-q4-2023 label (Nomination to be resolved during q4 goal of reducing the technical debt) Oct 16, 2023

GeorgianaElena mentioned this issue Nov 28, 2023

Q4 Reduced workload goal - Nov 29 Sprint 4 tracking issue #3469

Closed

yuvipanda removed their assignment Jan 25, 2024

yuvipanda added the allocation:internal-eng label Mar 20, 2024
