
K8ssandra-operator memory leak after failed MedusaBackupJob #1312

Open
adziura-tcloud opened this issue May 13, 2024 · 1 comment · May be fixed by thelastpickle/cassandra-medusa#786
Labels: bug (Something isn't working), review (Issues in the state 'review')
adziura-tcloud commented May 13, 2024

What happened?
After upgrading k8ssandra-operator from 1.7.0 to 1.16.0, I noticed periodic restarts of the operator due to a memory leak.
It has happened twice, each time right after a failed MedusaBackupJob:

```yaml
spec:
  backupType: full
  cassandraDatacenter: gke-dc1
status:
  finished:
  - example-test-1-gke-dc1-r1-sts-2
  - example-test-1-gke-dc1-r1-sts-3
  - example-test-1-gke-dc1-r1-sts-1
  inProgress:
  - example-test-1-gke-dc1-r1-sts-0
  startTime: "2024-05-12T00:30:15Z"
```

The MedusaBackupJob failed because of a pod OOM restart (we are using fairly small instances in our test environment).

Also, MedusaBackupSchedule stops working after a failed backup: it no longer creates new MedusaBackupJobs.
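For context, the schedule that produced these jobs is a MedusaBackupSchedule roughly like the following (the name and cron expression here are illustrative, not our exact values):

```yaml
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackupSchedule
metadata:
  name: daily
  namespace: cassandra
spec:
  cronSchedule: "30 0 * * *"   # would produce the daily-<timestamp> job names seen below
  backupSpec:
    backupType: full
    cassandraDatacenter: gke-dc1
```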

Did you expect to see something different?
A failed backup should not affect the operator or subsequent backups.

How to reproduce it (as minimally and precisely as possible):
I think killing one pod during a backup should reproduce it.
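A possible reproduction sketch (the pod and namespace names are taken from the status above; the operator namespace is a placeholder, and this assumes kubectl access to the cluster):

```shell
# Wait for a MedusaBackupJob to show a pod in status.inProgress.
kubectl -n cassandra get medusabackupjob

# Kill one Cassandra pod mid-backup (name taken from the status in this report).
kubectl -n cassandra delete pod example-test-1-gke-dc1-r1-sts-0

# Then watch the operator's memory usage and its logs for the repeating
# "MedusaBackupJob is still being processed" message.
kubectl -n <operator-namespace> top pod
```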

  • K8ssandra Operator version: 1.16.0
  • Kubernetes version information: 1.29
  • Kubernetes cluster kind: GKE
  • K8ssandra Operator Logs: the log is flooded with "MedusaBackupJob is still being processed" INFO messages:

```
2024-05-13T07:07:34.916Z	INFO	MedusaBackupJob is still being processed	{"controller": "medusabackupjob", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaBackupJob", "MedusaBackupJob": {"name":"daily-1715473800","namespace":"cassandra"}, "namespace": "cassandra", "name": "daily-1715473800", "reconcileID": "ace193e0-a3b6-4935-80f7-736b7fbae2a0", "medusabackupjob": "cassandra/daily-1715473800", "Backup": {"namespace": "cassandra", "name": "daily-1715473800"}}
```
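The repeating log line suggests the reconciler keeps requeueing the job as long as any pod remains in `status.inProgress`. A minimal, illustrative Go sketch of that completion check (`backupDone` is a hypothetical helper mirroring the CRD status fields, not the operator's actual code):

```go
package main

import "fmt"

// backupDone reports whether every pod that started the backup has finished.
// Illustrative simplification: field names mirror the MedusaBackupJob status
// (finished / inProgress) shown in this report.
func backupDone(finished, inProgress []string) bool {
	return len(inProgress) == 0 && len(finished) > 0
}

func main() {
	finished := []string{
		"example-test-1-gke-dc1-r1-sts-2",
		"example-test-1-gke-dc1-r1-sts-3",
		"example-test-1-gke-dc1-r1-sts-1",
	}
	// The OOM-killed pod never reports back, so it stays in progress.
	inProgress := []string{"example-test-1-gke-dc1-r1-sts-0"}

	// The reconciler requeues while the job is "in progress"; if the stuck
	// pod never clears, the requeueing never terminates.
	for attempt := 1; attempt <= 3; attempt++ {
		if backupDone(finished, inProgress) {
			fmt.Println("backup complete")
			return
		}
		fmt.Printf("attempt %d: MedusaBackupJob is still being processed\n", attempt)
	}
	fmt.Println("still stuck after 3 reconciles; would requeue indefinitely")
}
```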

┆Issue is synchronized with this Jira Story by Unito
┆Fix Versions: 2024-10,2024-11
┆Issue Number: K8OP-26

@adziura-tcloud adziura-tcloud added the bug Something isn't working label May 13, 2024
@rzvoncek rzvoncek moved this to In Progress in K8ssandra Jun 18, 2024
@rzvoncek rzvoncek self-assigned this Jun 18, 2024
@adejanovski adejanovski added the in-progress Issues in the state 'in-progress' label Jun 18, 2024
@rzvoncek rzvoncek moved this from In Progress to Review in K8ssandra Jun 19, 2024
@adejanovski adejanovski added review Issues in the state 'review' and removed in-progress Issues in the state 'in-progress' labels Jun 19, 2024
@adejanovski adejanovski moved this from Review to Ready For Review in K8ssandra Jul 16, 2024
@adejanovski adejanovski added ready-for-review Issues in the state 'ready-for-review' and removed review Issues in the state 'review' labels Jul 16, 2024
@adejanovski adejanovski moved this from Ready For Review to Review in K8ssandra Jul 17, 2024
@adejanovski adejanovski added review Issues in the state 'review' and removed ready-for-review Issues in the state 'ready-for-review' labels Jul 17, 2024

sync-by-unito bot commented Oct 14, 2024

➤ Radovan Zvoncek commented:

PR to review: thelastpickle/cassandra-medusa#786
