
Failed to Report logs error in experiment Pod on latest/edge #108

Closed
NohaIhab opened this issue Aug 8, 2023 · 32 comments
Labels: bug (Something isn't working), Kubeflow 1.8 (This issue affects the Charmed Kubeflow 1.8 release)

Comments

NohaIhab (Contributor) commented Aug 8, 2023

Steps to reproduce

  • deploy the katib bundle from latest/edge
  • create an experiment with the grid example
  • wait until a trial pod finishes

Result:
Trial pods go to Error at the final stage of communicating results, with the logs:

F0808 07:54:54.977885      20 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.152.183.167:65535: i/o timeout"

10.152.183.167 is the IP of the katib-db-manager ClusterIP Service.
The same error can be seen in the katib-controller container logs.
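A quick way to narrow this down is to test TCP reachability of the Service from a pod in the cluster (or any host that can route to the ClusterIP). A minimal bash sketch, assuming an image with bash (/dev/tcp support) and coreutils `timeout`; the address and ports below are the values from the logs in this thread, and note that later logs here show katib-db-manager listening on port 6789 while the trial dials 65535:

```shell
#!/usr/bin/env bash
# probe HOST PORT: report whether a TCP connection opens within 2 seconds.
probe() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open: ${host}:${port}"
  else
    echo "closed/unreachable: ${host}:${port}"
    return 1
  fi
}

# Values from the trial log: the katib-db-manager ClusterIP, the port the
# trial tried to dial, and the port the service reportedly listens on.
probe 10.152.183.167 65535 || true
probe 10.152.183.167 6789 || true
```

If the probe times out on 65535 but succeeds on 6789, the problem is the port the trial is handed, not the Service itself.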

@NohaIhab NohaIhab added the bug Something isn't working label Aug 8, 2023
DnPlas (Contributor) commented Aug 8, 2023

From some quick research, it seems these kinds of issues are associated with katib-db-manager not being up and ready when experiments run. Do you know whether it was in fact active and idle, and whether the pods were also active and ready, when you tried running the experiment?

Similar issue: kubeflow/katib#1517

i-chvets (Contributor) commented Aug 14, 2023

Could not deploy katib from latest/edge. Will verify in a complete Kubeflow deployment.

Model     Controller          Cloud/Region        Version  SLA          Timestamp
kubeflow  microk8s-localhost  microk8s/localhost  2.9.44   unsupported  16:32:59-04:00

App               Version                Status   Scale  Charm             Channel  Rev  Address         Exposed  Message
katib-controller  res:oci-image@111495a  active       1  katib-controller  edge     341  10.152.183.245  no       
katib-db          mariadb/server:10.3    active       1  mariadb-k8s       stable    35  10.152.183.96   no       ready
katib-db-manager                         waiting      1  katib-db-manager  edge     309  10.152.183.198  no       installing agent
katib-ui                                 waiting      1  katib-ui          edge     320  10.152.183.80   no       installing agent

Unit                 Workload  Agent  Address     Ports             Message
katib-controller/0*  active    idle   10.1.59.81  443/TCP,8080/TCP  
katib-db-manager/0*  error     idle   10.1.59.76                    hook failed: "install"
katib-db/0*          active    idle   10.1.59.80  3306/TCP          ready
katib-ui/0*          blocked   idle   10.1.59.78                    kubernetes resource creation failed

@NohaIhab NohaIhab added the Kubeflow 1.8 This issue affects the Charmed Kubeflow 1.8 release label Oct 4, 2023
@NohaIhab NohaIhab changed the title Error in experiment Pod on latest/edge Failed to Report logs error in experiment Pod on latest/edge Oct 4, 2023
NohaIhab (Contributor, Author) commented Oct 4, 2023

@DnPlas the katib-db-manager charm is active/idle and the pod is ready.
Logs from the katib-db-manager container:

2023-10-04T08:44:54.097Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-10-04T08:44:54.102Z [katib-db-manager] I1004 08:44:54.102216      43 db.go:32] Using MySQL
2023-10-04T08:44:59.122Z [katib-db-manager] I1004 08:44:59.122380      43 init.go:27] Initializing v1beta1 DB schema
2023-10-04T08:44:59.127Z [katib-db-manager] I1004 08:44:59.127916      43 main.go:113] Start Katib manager: 0.0.0.0:6789
2023-10-04T08:49:25.614Z [pebble] GET /v1/plan?format=yaml 235.252µs 200
2023-10-04T08:54:27.665Z [pebble] GET /v1/plan?format=yaml 149.054µs 200
2023-10-04T08:59:30.039Z [pebble] GET /v1/plan?format=yaml 543.314µs 200
2023-10-04T09:05:06.070Z [pebble] GET /v1/plan?format=yaml 265.883µs 200
2023-10-04T09:09:56.200Z [pebble] GET /v1/plan?format=yaml 172.92µs 200
2023-10-04T09:15:28.309Z [pebble] GET /v1/plan?format=yaml 152.975µs 200
2023-10-04T09:20:14.027Z [pebble] GET /v1/plan?format=yaml 193.914µs 200
2023-10-04T09:26:01.311Z [pebble] GET /v1/plan?format=yaml 195.693µs 200
2023-10-04T09:31:34.287Z [pebble] GET /v1/plan?format=yaml 289.235µs 200
2023-10-04T09:35:34.871Z [pebble] GET /v1/plan?format=yaml 204.887µs 200
2023-10-04T09:40:25.212Z [pebble] GET /v1/plan?format=yaml 337.025µs 200
2023-10-04T09:46:19.174Z [pebble] GET /v1/plan?format=yaml 749.551µs 200
2023-10-04T09:51:20.886Z [pebble] GET /v1/plan?format=yaml 160.222µs 200
2023-10-04T09:57:04.651Z [pebble] GET /v1/plan?format=yaml 201.391µs 200
2023-10-04T10:02:33.420Z [pebble] GET /v1/plan?format=yaml 181.847µs 200
2023-10-04T10:08:26.583Z [pebble] GET /v1/plan?format=yaml 135.547µs 200
2023-10-04T10:13:08.665Z [pebble] GET /v1/plan?format=yaml 154.588µs 200
2023-10-04T10:18:45.564Z [pebble] GET /v1/plan?format=yaml 185.666µs 200
2023-10-04T10:24:19.338Z [pebble] GET /v1/plan?format=yaml 180.573µs 200
2023-10-04T10:28:55.284Z [pebble] GET /v1/plan?format=yaml 139.591µs 200
2023-10-04T10:33:46.465Z [pebble] GET /v1/plan?format=yaml 155.926µs 200
2023-10-04T10:38:54.247Z [pebble] GET /v1/plan?format=yaml 204.563µs 200
2023-10-04T10:44:02.575Z [pebble] GET /v1/plan?format=yaml 193.604µs 200
2023-10-04T10:49:00.315Z [pebble] GET /v1/plan?format=yaml 233.411µs 200
2023-10-04T10:53:06.184Z [pebble] GET /v1/plan?format=yaml 188.678µs 200
2023-10-04T10:58:02.907Z [pebble] GET /v1/plan?format=yaml 172.183µs 200
2023-10-04T11:02:47.260Z [pebble] GET /v1/plan?format=yaml 392.891µs 200
2023-10-04T11:07:02.473Z [pebble] GET /v1/plan?format=yaml 190.738µs 200
2023-10-04T11:11:08.045Z [pebble] GET /v1/plan?format=yaml 196.818µs 200
2023-10-04T11:16:59.296Z [pebble] GET /v1/plan?format=yaml 161.081µs 200
2023-10-04T11:21:03.021Z [pebble] GET /v1/plan?format=yaml 223.954µs 200
2023-10-04T11:25:12.536Z [pebble] GET /v1/plan?format=yaml 207.933µs 200
2023-10-04T11:29:30.744Z [pebble] GET /v1/plan?format=yaml 178.44µs 200
2023-10-04T11:33:54.612Z [pebble] GET /v1/plan?format=yaml 186.62µs 200
2023-10-04T11:39:29.139Z [pebble] GET /v1/plan?format=yaml 319.879µs 200
2023-10-04T11:44:26.934Z [pebble] GET /v1/plan?format=yaml 198.789µs 200
2023-10-04T11:49:00.185Z [pebble] GET /v1/plan?format=yaml 453.48µs 200
2023-10-04T11:54:35.465Z [pebble] GET /v1/plan?format=yaml 202.096µs 200
2023-10-04T11:58:45.953Z [pebble] GET /v1/plan?format=yaml 185.96µs 200
2023-10-04T12:04:30.993Z [pebble] GET /v1/plan?format=yaml 228.332µs 200
2023-10-04T12:10:28.627Z [pebble] GET /v1/plan?format=yaml 16.24712ms 200
2023-10-04T12:16:04.205Z [pebble] GET /v1/plan?format=yaml 261.961µs 200
2023-10-04T12:21:57.655Z [pebble] GET /v1/plan?format=yaml 343.916µs 200

NohaIhab (Contributor, Author) commented:

I was not able to reproduce this issue. We should re-open it if we hit it again and document the steps to reproduce.

orfeas-k (Contributor) commented:

Came across this again when running the katib UAT notebook. I then tested with the grid example as well, and the same thing happened there. Note, though, that this happened after I had scaled the cluster down to 0 nodes before EOD yesterday and scaled it up again today.

Environment

Deployed the latest/edge bundle to EKS 1.25 using Juju 3.1 (following this guide)

Reproduce

I redeployed to a new cluster with the same setup, and the issue did not come up there.

Logs/Debugging

Trial logs

This is the last line in the logs of every trial pod spun up by the katib experiment.

 F1122 09:45:46.259588      21 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.8.221:65535: i/o timeout"

As @DnPlas pointed out, this is the pod trying to contact katib-db-manager (10.100.8.221).

Full trial logs
I1122 09:45:12.241887      21 main.go:396] Trial Name: cmaes-example-d7xk5bv5 
 I1122 09:45:14.134274      21 main.go:139] 2023-11-22T09:45:14Z INFO     start with arguments Namespace(num_classes=10, num_examples=60000, add_stn=False, image_shape='1, 28, 28', network='mlp', num_layers=3, gpus=None, kv_store='device', num_epochs=1, lr=0.04511033252270099, lr_factor=0.1, lr_step_epochs='10', initializer='default', optimizer='adam', mom=0.9, wd=0.0001, batch_size=64, disp_batches=100, model_prefix=None, save_period=1, monitor=0, load_epoch=None, top_k=0, loss='', test_io=0, dtype='float32', gc_type='none', gc_threshold=0.5, macrobatch_size=0, warmup_epochs=5, warmup_strategy='linear', profile_worker_suffix='', profile_server_suffix='', use_imagenet_data_augmentation=0) 
 I1122 09:45:14.147084      21 main.go:139] 2023-11-22T09:45:14Z DEBUG    Starting new HTTP connection (1): data.mxnet.io:80 
 I1122 09:45:14.165668      21 main.go:139] 2023-11-22T09:45:14Z DEBUG    http://data.mxnet.io:80 "GET /data/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 301 290 
 I1122 09:45:14.167552      21 main.go:139] 2023-11-22T09:45:14Z DEBUG    Starting new HTTP connection (1): data.mxnet.io.s3-website-us-west-1.amazonaws.com:80 
 I1122 09:45:14.494427      21 main.go:139] 2023-11-22T09:45:14Z DEBUG    http://data.mxnet.io.s3-website-us-west-1.amazonaws.com:80 "GET /data/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 200 28881 
 I1122 09:45:14.640593      21 main.go:139] 2023-11-22T09:45:14Z INFO     downloaded http://data.mxnet.io/data/mnist/train-labels-idx1-ubyte.gz into train-labels-idx1-ubyte.gz successfully 
 I1122 09:45:14.640618      21 main.go:139] 2023-11-22T09:45:14Z DEBUG    Starting new HTTP connection (1): data.mxnet.io:80 
 I1122 09:45:14.654983      21 main.go:139] 2023-11-22T09:45:14Z DEBUG    http://data.mxnet.io:80 "GET /data/mnist/train-images-idx3-ubyte.gz HTTP/1.1" 301 290 
 I1122 09:45:14.657541      21 main.go:139] 2023-11-22T09:45:14Z DEBUG    Starting new HTTP connection (1): data.mxnet.io.s3-website-us-west-1.amazonaws.com:80 
 I1122 09:45:14.973142      21 main.go:139] 2023-11-22T09:45:14Z DEBUG    http://data.mxnet.io.s3-website-us-west-1.amazonaws.com:80 "GET /data/mnist/train-images-idx3-ubyte.gz HTTP/1.1" 200 9912422 
 I1122 09:45:16.658386      21 main.go:139] 2023-11-22T09:45:16Z INFO     downloaded http://data.mxnet.io/data/mnist/train-images-idx3-ubyte.gz into train-images-idx3-ubyte.gz successfully 
 I1122 09:45:17.142916      21 main.go:139] 2023-11-22T09:45:17Z DEBUG    Starting new HTTP connection (1): data.mxnet.io:80 
 I1122 09:45:17.154624      21 main.go:139] 2023-11-22T09:45:17Z DEBUG    http://data.mxnet.io:80 "GET /data/mnist/t10k-labels-idx1-ubyte.gz HTTP/1.1" 301 289 
 I1122 09:45:17.156007      21 main.go:139] 2023-11-22T09:45:17Z DEBUG    Starting new HTTP connection (1): data.mxnet.io.s3-website-us-west-1.amazonaws.com:80 
 I1122 09:45:17.480217      21 main.go:139] 2023-11-22T09:45:17Z DEBUG    http://data.mxnet.io.s3-website-us-west-1.amazonaws.com:80 "GET /data/mnist/t10k-labels-idx1-ubyte.gz HTTP/1.1" 200 4542 
 I1122 09:45:17.481015      21 main.go:139] 2023-11-22T09:45:17Z INFO     downloaded http://data.mxnet.io/data/mnist/t10k-labels-idx1-ubyte.gz into t10k-labels-idx1-ubyte.gz successfully 
 I1122 09:45:17.482824      21 main.go:139] 2023-11-22T09:45:17Z DEBUG    Starting new HTTP connection (1): data.mxnet.io:80 
 I1122 09:45:17.494636      21 main.go:139] 2023-11-22T09:45:17Z DEBUG    http://data.mxnet.io:80 "GET /data/mnist/t10k-images-idx3-ubyte.gz HTTP/1.1" 301 289 
 I1122 09:45:17.498002      21 main.go:139] 2023-11-22T09:45:17Z DEBUG    Starting new HTTP connection (1): data.mxnet.io.s3-website-us-west-1.amazonaws.com:80 
 I1122 09:45:17.818124      21 main.go:139] 2023-11-22T09:45:17Z DEBUG    http://data.mxnet.io.s3-website-us-west-1.amazonaws.com:80 "GET /data/mnist/t10k-images-idx3-ubyte.gz HTTP/1.1" 200 1648877 
 I1122 09:45:18.822646      21 main.go:139] 2023-11-22T09:45:18Z INFO     downloaded http://data.mxnet.io/data/mnist/t10k-images-idx3-ubyte.gz into t10k-images-idx3-ubyte.gz successfully 
 I1122 09:45:19.249433      21 main.go:139] [09:45:19] ../src/executor/graph_executor.cc:1991: Subgraph backend MKLDNN is activated. 
 I1122 09:45:19.760514      21 main.go:139] 2023-11-22T09:45:19Z INFO     Epoch[0] Batch [0-100]	Speed: 15160.65 samples/sec	accuracy=0.751238 
 I1122 09:45:20.027658      21 main.go:139] 2023-11-22T09:45:20Z INFO     Epoch[0] Batch [100-200]	Speed: 23985.89 samples/sec	accuracy=0.869375 
 I1122 09:45:20.577454      21 main.go:139] 2023-11-22T09:45:20Z INFO     Epoch[0] Batch [200-300]	Speed: 11675.95 samples/sec	accuracy=0.888594 
 I1122 09:45:21.317147      21 main.go:139] 2023-11-22T09:45:21Z INFO     Epoch[0] Batch [300-400]	Speed: 8653.84 samples/sec	accuracy=0.892813 
 I1122 09:45:21.893307      21 main.go:139] 2023-11-22T09:45:21Z INFO     Epoch[0] Batch [400-500]	Speed: 11108.21 samples/sec	accuracy=0.907188 
 I1122 09:45:22.529823      21 main.go:139] 2023-11-22T09:45:22Z INFO     Epoch[0] Batch [500-600]	Speed: 10061.93 samples/sec	accuracy=0.892969 
 I1122 09:45:23.233631      21 main.go:139] 2023-11-22T09:45:23Z INFO     Epoch[0] Batch [600-700]	Speed: 9106.64 samples/sec	accuracy=0.901250 
 I1122 09:45:23.743594      21 main.go:139] 2023-11-22T09:45:23Z INFO     Epoch[0] Batch [700-800]	Speed: 12528.56 samples/sec	accuracy=0.896875 
 I1122 09:45:24.311487      21 main.go:139] 2023-11-22T09:45:24Z INFO     Epoch[0] Batch [800-900]	Speed: 11280.20 samples/sec	accuracy=0.901563 
 I1122 09:45:24.444156      21 main.go:139] 2023-11-22T09:45:24Z INFO     Epoch[0] Train-accuracy=0.878531 
 I1122 09:45:24.444248      21 main.go:139] 2023-11-22T09:45:24Z INFO     Epoch[0] Time cost=5.185 
 I1122 09:45:24.667477      21 main.go:139] 2023-11-22T09:45:24Z INFO     Epoch[0] Validation-accuracy=0.907245 
 F1122 09:45:46.259588      21 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.8.221:65535: i/o timeout"

However, katib-db-manager is up and running. From the logs below, we can see that katib-db-manager itself has a hard time reaching the database.

katib-db-manager-0 pod's logs
$ kf logs katib-db-manager-0 -c katib-db-manager -f
2023-11-22T08:44:52.297Z [pebble] HTTP API server listening on ":38813".
2023-11-22T08:44:52.297Z [pebble] Started daemon.
2023-11-22T08:47:59.413Z [pebble] GET /v1/plan?format=yaml 161.67µs 200
2023-11-22T08:47:59.417Z [pebble] POST /v1/layers 459.331µs 200
2023-11-22T08:47:59.680Z [pebble] POST /v1/services 250.852481ms 202
2023-11-22T08:47:59.732Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-11-22T08:47:59.745Z [katib-db-manager] I1122 08:47:59.745811      15 db.go:32] Using MySQL
2023-11-22T08:48:00.752Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.070723238s 200
2023-11-22T08:48:04.759Z [katib-db-manager] E1122 08:48:04.752333      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:09.754Z [katib-db-manager] E1122 08:48:09.754499      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:14.753Z [katib-db-manager] E1122 08:48:14.753136      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:19.749Z [katib-db-manager] E1122 08:48:19.749860      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:24.754Z [katib-db-manager] E1122 08:48:24.754108      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:29.751Z [katib-db-manager] E1122 08:48:29.751660      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:34.754Z [katib-db-manager] E1122 08:48:34.754112      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:39.749Z [katib-db-manager] E1122 08:48:39.749735      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:44.751Z [katib-db-manager] E1122 08:48:44.751812      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:49.755Z [katib-db-manager] E1122 08:48:49.755129      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:54.755Z [katib-db-manager] E1122 08:48:54.755890      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:59.751Z [katib-db-manager] E1122 08:48:59.751137      15 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:48:59.751Z [katib-db-manager] F1122 08:48:59.751353      15 main.go:104] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
2023-11-22T08:48:59.753Z [pebble] Service "katib-db-manager" stopped unexpectedly with code 255
2023-11-22T08:48:59.753Z [pebble] Service "katib-db-manager" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-22T08:49:00.283Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-11-22T08:49:00.291Z [katib-db-manager] I1122 08:49:00.291683      20 db.go:32] Using MySQL
2023-11-22T08:49:05.298Z [katib-db-manager] E1122 08:49:05.298234      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:10.297Z [katib-db-manager] E1122 08:49:10.297747      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:15.295Z [katib-db-manager] E1122 08:49:15.295277      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:20.295Z [katib-db-manager] E1122 08:49:20.295061      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:25.296Z [katib-db-manager] E1122 08:49:25.296885      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:30.297Z [katib-db-manager] E1122 08:49:30.297648      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:35.294Z [katib-db-manager] E1122 08:49:35.294778      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:40.302Z [katib-db-manager] E1122 08:49:40.302537      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:45.302Z [katib-db-manager] E1122 08:49:45.302010      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:50.296Z [katib-db-manager] E1122 08:49:50.295829      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:49:55.297Z [katib-db-manager] E1122 08:49:55.297364      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:00.299Z [katib-db-manager] E1122 08:50:00.299450      20 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:00.299Z [katib-db-manager] F1122 08:50:00.299482      20 main.go:104] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
2023-11-22T08:50:00.301Z [pebble] Service "katib-db-manager" stopped unexpectedly with code 255
2023-11-22T08:50:00.301Z [pebble] Service "katib-db-manager" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-22T08:50:00.820Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-11-22T08:50:00.825Z [katib-db-manager] I1122 08:50:00.825169      26 db.go:32] Using MySQL
2023-11-22T08:50:05.831Z [katib-db-manager] E1122 08:50:05.831504      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:10.830Z [katib-db-manager] E1122 08:50:10.830619      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:15.831Z [katib-db-manager] E1122 08:50:15.831916      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:20.832Z [katib-db-manager] E1122 08:50:20.832687      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:25.831Z [katib-db-manager] E1122 08:50:25.831669      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:30.830Z [katib-db-manager] E1122 08:50:30.830287      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:35.833Z [katib-db-manager] E1122 08:50:35.832970      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:40.830Z [katib-db-manager] E1122 08:50:40.830215      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:45.834Z [katib-db-manager] E1122 08:50:45.834781      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:50.832Z [katib-db-manager] E1122 08:50:50.832182      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:50:55.829Z [katib-db-manager] E1122 08:50:55.829946      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:00.833Z [katib-db-manager] E1122 08:51:00.833062      26 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:00.833Z [katib-db-manager] F1122 08:51:00.833090      26 main.go:104] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
2023-11-22T08:51:00.834Z [pebble] Service "katib-db-manager" stopped unexpectedly with code 255
2023-11-22T08:51:00.834Z [pebble] Service "katib-db-manager" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-22T08:51:01.367Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-11-22T08:51:01.375Z [katib-db-manager] I1122 08:51:01.375353      32 db.go:32] Using MySQL
2023-11-22T08:51:06.378Z [katib-db-manager] E1122 08:51:06.378505      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:11.379Z [katib-db-manager] E1122 08:51:11.379054      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:16.379Z [katib-db-manager] E1122 08:51:16.379887      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:21.378Z [katib-db-manager] E1122 08:51:21.378435      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:26.382Z [katib-db-manager] E1122 08:51:26.382196      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:31.381Z [katib-db-manager] E1122 08:51:31.381184      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:36.383Z [katib-db-manager] E1122 08:51:36.383100      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:41.379Z [katib-db-manager] E1122 08:51:41.379286      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:46.381Z [katib-db-manager] E1122 08:51:46.381156      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:51.380Z [katib-db-manager] E1122 08:51:51.380327      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:51:56.378Z [katib-db-manager] E1122 08:51:56.378803      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:01.381Z [katib-db-manager] E1122 08:52:01.381104      32 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:01.381Z [katib-db-manager] F1122 08:52:01.381131      32 main.go:104] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
2023-11-22T08:52:01.382Z [pebble] Service "katib-db-manager" stopped unexpectedly with code 255
2023-11-22T08:52:01.382Z [pebble] Service "katib-db-manager" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-22T08:52:01.931Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-11-22T08:52:01.937Z [katib-db-manager] I1122 08:52:01.937396      39 db.go:32] Using MySQL
2023-11-22T08:52:06.942Z [katib-db-manager] E1122 08:52:06.942269      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:11.942Z [katib-db-manager] E1122 08:52:11.942390      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:16.942Z [katib-db-manager] E1122 08:52:16.942928      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:21.942Z [katib-db-manager] E1122 08:52:21.942191      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:26.946Z [katib-db-manager] E1122 08:52:26.946266      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:31.945Z [katib-db-manager] E1122 08:52:31.945875      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:36.949Z [katib-db-manager] E1122 08:52:36.949689      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:41.941Z [katib-db-manager] E1122 08:52:41.941713      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:46.945Z [katib-db-manager] E1122 08:52:46.945358      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:51.942Z [katib-db-manager] E1122 08:52:51.942058      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:52:56.942Z [katib-db-manager] E1122 08:52:56.942284      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:01.942Z [katib-db-manager] E1122 08:53:01.942660      39 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:01.942Z [katib-db-manager] F1122 08:53:01.942765      39 main.go:104] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
2023-11-22T08:53:01.944Z [pebble] Service "katib-db-manager" stopped unexpectedly with code 255
2023-11-22T08:53:01.944Z [pebble] Service "katib-db-manager" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-22T08:53:02.456Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-11-22T08:53:02.462Z [katib-db-manager] I1122 08:53:02.462338      44 db.go:32] Using MySQL
2023-11-22T08:53:07.466Z [katib-db-manager] E1122 08:53:07.466080      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:10.943Z [pebble] GET /v1/plan?format=yaml 429.068µs 200
2023-11-22T08:53:12.469Z [katib-db-manager] E1122 08:53:12.469373      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:17.466Z [katib-db-manager] E1122 08:53:17.466365      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:22.469Z [katib-db-manager] E1122 08:53:22.469226      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:27.467Z [katib-db-manager] E1122 08:53:27.467448      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:32.471Z [katib-db-manager] E1122 08:53:32.471646      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:37.468Z [katib-db-manager] E1122 08:53:37.468823      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:42.467Z [katib-db-manager] E1122 08:53:42.467445      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:47.468Z [katib-db-manager] E1122 08:53:47.468342      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:52.466Z [katib-db-manager] E1122 08:53:52.465966      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:53:57.473Z [katib-db-manager] E1122 08:53:57.472735      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:54:02.470Z [katib-db-manager] E1122 08:54:02.469959      44 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:54:02.470Z [katib-db-manager] F1122 08:54:02.469990      44 main.go:104] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
2023-11-22T08:54:02.472Z [pebble] Service "katib-db-manager" stopped unexpectedly with code 255
2023-11-22T08:54:02.472Z [pebble] Service "katib-db-manager" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2023-11-22T08:54:03.016Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-11-22T08:54:03.020Z [katib-db-manager] I1122 08:54:03.020600      50 db.go:32] Using MySQL
2023-11-22T08:54:08.026Z [katib-db-manager] E1122 08:54:08.026124      50 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:54:13.025Z [katib-db-manager] E1122 08:54:13.025398      50 connection.go:40] Ping to Katib db failed: dial tcp 10.100.51.25:3306: connect: connection refused
2023-11-22T08:54:18.044Z [katib-db-manager] I1122 08:54:18.042132      50 init.go:27] Initializing v1beta1 DB schema
2023-11-22T08:54:18.050Z [katib-db-manager] I1122 08:54:18.050549      50 main.go:113] Start Katib manager: 0.0.0.0:6789
2023-11-22T08:58:07.099Z [pebble] GET /v1/plan?format=yaml 173.401µs 200
2023-11-22T09:03:30.999Z [pebble] GET /v1/plan?format=yaml 160.392µs 200
2023-11-22T09:08:01.169Z [pebble] GET /v1/plan?format=yaml 243.954µs 200
2023-11-22T09:12:52.018Z [pebble] GET /v1/plan?format=yaml 203.666µs 200
2023-11-22T09:18:03.036Z [pebble] GET /v1/plan?format=yaml 143.63µs 200
2023-11-22T09:22:20.945Z [pebble] GET /v1/plan?format=yaml 166.369µs 200
2023-11-22T09:27:40.746Z [pebble] GET /v1/plan?format=yaml 149.338µs 200
2023-11-22T09:32:57.568Z [pebble] GET /v1/plan?format=yaml 180.716µs 200
2023-11-22T09:37:26.881Z [pebble] GET /v1/plan?format=yaml 238.442µs 200
2023-11-22T09:43:06.819Z [pebble] GET /v1/plan?format=yaml 382.865µs 200
2023-11-22T09:49:01.062Z [pebble] GET /v1/plan?format=yaml 206.435µs 200
2023-11-22T09:53:32.310Z [pebble] GET /v1/plan?format=yaml 365.722µs 200
2023-11-22T09:57:35.483Z [pebble] GET /v1/plan?format=yaml 176.831µs 200
2023-11-22T10:02:42.040Z [pebble] GET /v1/plan?format=yaml 214.828µs 200
2023-11-22T10:07:16.546Z [pebble] GET /v1/plan?format=yaml 148.969µs 200
2023-11-22T10:12:06.438Z [pebble] GET /v1/plan?format=yaml 180.211µs 200
2023-11-22T10:17:53.290Z [pebble] GET /v1/plan?format=yaml 256.823µs 200
2023-11-22T10:21:09.667Z [katib-db-manager] [mysql] 2023/11/22 10:21:09 packets.go:122: closing bad idle connection: EOF
2023-11-22T10:21:09.668Z [katib-db-manager] [mysql] 2023/11/22 10:21:09 connection.go:158: driver: bad connection

Note that the following errors are present even in a healthy cluster:

2023-11-22T15:33:40.738Z [container-agent] 2023-11-22 15:33:40 ERROR juju.worker.dependency engine.go:695 "log-sender" manifold worker returned unexpected error: sending log message: websocket: close 1006 (abnormal closure): unexpected EOF: use of closed network connection

I also noticed katib-db going into Maintenance state with the message offline. It eventually returned to active by itself. Looking at its logs:

katib-db-0 pod's logs
2023-11-22T09:30:27.272Z [container-agent] 2023-11-22 09:30:27 WARNING juju-log No relation: certificates
2023-11-22T09:30:27.662Z [container-agent] 2023-11-22 09:30:27 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2023-11-22T09:36:25.451Z [container-agent] 2023-11-22 09:36:25 ERROR juju-log Uncaught exception while in charm code:
2023-11-22T09:36:25.451Z [container-agent] Traceback (most recent call last):
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
2023-11-22T09:36:25.451Z [container-agent]     h.request(req.get_method(), req.selector, req.data, headers,
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1283, in request
2023-11-22T09:36:25.451Z [container-agent]     self._send_request(method, url, body, headers, encode_chunked)
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
2023-11-22T09:36:25.451Z [container-agent]     self.endheaders(body, encode_chunked=encode_chunked)
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
2023-11-22T09:36:25.451Z [container-agent]     self._send_output(message_body, encode_chunked=encode_chunked)
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
2023-11-22T09:36:25.451Z [container-agent]     self.send(msg)
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 976, in send
2023-11-22T09:36:25.451Z [container-agent]     self.connect()
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 272, in connect
2023-11-22T09:36:25.451Z [container-agent]     self.sock.connect(self.socket_path)
2023-11-22T09:36:25.451Z [container-agent] FileNotFoundError: [Errno 2] No such file or directory
2023-11-22T09:36:25.451Z [container-agent] 
2023-11-22T09:36:25.451Z [container-agent] During handling of the above exception, another exception occurred:
2023-11-22T09:36:25.451Z [container-agent] 
2023-11-22T09:36:25.451Z [container-agent] Traceback (most recent call last):
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1599, in _request_raw
2023-11-22T09:36:25.451Z [container-agent]     response = self.opener.open(request, timeout=self.timeout)
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 519, in open
2023-11-22T09:36:25.451Z [container-agent]     response = self._open(req, data)
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
2023-11-22T09:36:25.451Z [container-agent]     result = self._call_chain(self.handle_open, protocol, protocol +
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
2023-11-22T09:36:25.451Z [container-agent]     result = func(*args)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 286, in http_open
2023-11-22T09:36:25.451Z [container-agent]     return self.do_open(_UnixSocketConnection, req,  # type:ignore
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
2023-11-22T09:36:25.451Z [container-agent]     raise URLError(err)
2023-11-22T09:36:25.451Z [container-agent] urllib.error.URLError: <urlopen error [Errno 2] No such file or directory>
2023-11-22T09:36:25.451Z [container-agent] 
2023-11-22T09:36:25.451Z [container-agent] During handling of the above exception, another exception occurred:
2023-11-22T09:36:25.451Z [container-agent] 
2023-11-22T09:36:25.451Z [container-agent] Traceback (most recent call last):
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/./src/charm.py", line 786, in <module>
2023-11-22T09:36:25.451Z [container-agent]     main(MySQLOperatorCharm)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/main.py", line 441, in main
2023-11-22T09:36:25.451Z [container-agent]     _emit_charm_event(charm, dispatcher.event_name)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2023-11-22T09:36:25.451Z [container-agent]     event_to_emit.emit(*args, **kwargs)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/framework.py", line 342, in emit
2023-11-22T09:36:25.451Z [container-agent]     framework._emit(event)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/framework.py", line 839, in _emit
2023-11-22T09:36:25.451Z [container-agent]     self._reemit(event_path)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/framework.py", line 928, in _reemit
2023-11-22T09:36:25.451Z [container-agent]     custom_handler(event)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/./src/charm.py", line 717, in _on_update_status
2023-11-22T09:36:25.451Z [container-agent]     if self._handle_potential_cluster_crash_scenario():
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/./src/charm.py", line 641, in _handle_potential_cluster_crash_scenario
2023-11-22T09:36:25.451Z [container-agent]     state, role = self._mysql.get_member_state()
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/tenacity/__init__.py", line 289, in wrapped_f
2023-11-22T09:36:25.451Z [container-agent]     return self(f, *args, **kw)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/tenacity/__init__.py", line 379, in __call__
2023-11-22T09:36:25.451Z [container-agent]     do = self.iter(retry_state=retry_state)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/tenacity/__init__.py", line 325, in iter
2023-11-22T09:36:25.451Z [container-agent]     raise retry_exc.reraise()
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/tenacity/__init__.py", line 158, in reraise
2023-11-22T09:36:25.451Z [container-agent]     raise self.last_attempt.result()
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
2023-11-22T09:36:25.451Z [container-agent]     return self.__get_result()
2023-11-22T09:36:25.451Z [container-agent]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
2023-11-22T09:36:25.451Z [container-agent]     raise self._exception
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/tenacity/__init__.py", line 382, in __call__
2023-11-22T09:36:25.451Z [container-agent]     result = fn(*args, **kwargs)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/lib/charms/mysql/v0/mysql.py", line 1972, in get_member_state
2023-11-22T09:36:25.451Z [container-agent]     output = self._run_mysqlcli_script(
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 705, in _run_mysqlcli_script
2023-11-22T09:36:25.451Z [container-agent]     process = self.container.exec(command, timeout=timeout)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/model.py", line 2571, in exec
2023-11-22T09:36:25.451Z [container-agent]     return self._pebble.exec(
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 2374, in exec
2023-11-22T09:36:25.451Z [container-agent]     resp = self._request('POST', '/v1/exec', body=body)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1564, in _request
2023-11-22T09:36:25.451Z [container-agent]     response = self._request_raw(method, path, query, headers, data)
2023-11-22T09:36:25.451Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1612, in _request_raw
2023-11-22T09:36:25.451Z [container-agent]     raise ConnectionError(e.reason)
2023-11-22T09:36:25.451Z [container-agent] ops.pebble.ConnectionError: [Errno 2] No such file or directory
2023-11-22T09:36:25.451Z [container-agent] 2023-11-22 09:36:25 ERROR juju.worker.dependency engine.go:695 "log-sender" manifold worker returned unexpected error: sending log message: websocket: close 1006 (abnormal closure): unexpected EOF: use of closed network connection
2023-11-22T09:36:25.865Z [container-agent] 2023-11-22 09:36:25 ERROR juju.worker.uniter.operation runhook.go:180 hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
2023-11-22T09:36:27.244Z [container-agent] 2023-11-22 09:36:27 ERROR juju-log Uncaught exception while in charm code:
2023-11-22T09:36:27.244Z [container-agent] Traceback (most recent call last):
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 1348, in do_open
2023-11-22T09:36:27.244Z [container-agent]     h.request(req.get_method(), req.selector, req.data, headers,
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1283, in request
2023-11-22T09:36:27.244Z [container-agent]     self._send_request(method, url, body, headers, encode_chunked)
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
2023-11-22T09:36:27.244Z [container-agent]     self.endheaders(body, encode_chunked=encode_chunked)
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
2023-11-22T09:36:27.244Z [container-agent]     self._send_output(message_body, encode_chunked=encode_chunked)
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
2023-11-22T09:36:27.244Z [container-agent]     self.send(msg)
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/http/client.py", line 976, in send
2023-11-22T09:36:27.244Z [container-agent]     self.connect()
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 272, in connect
2023-11-22T09:36:27.244Z [container-agent]     self.sock.connect(self.socket_path)
2023-11-22T09:36:27.244Z [container-agent] FileNotFoundError: [Errno 2] No such file or directory
2023-11-22T09:36:27.244Z [container-agent] 
2023-11-22T09:36:27.244Z [container-agent] During handling of the above exception, another exception occurred:
2023-11-22T09:36:27.244Z [container-agent] 
2023-11-22T09:36:27.244Z [container-agent] Traceback (most recent call last):
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1599, in _request_raw
2023-11-22T09:36:27.244Z [container-agent]     response = self.opener.open(request, timeout=self.timeout)
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 519, in open
2023-11-22T09:36:27.244Z [container-agent]     response = self._open(req, data)
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
2023-11-22T09:36:27.244Z [container-agent]     result = self._call_chain(self.handle_open, protocol, protocol +
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
2023-11-22T09:36:27.244Z [container-agent]     result = func(*args)
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 286, in http_open
2023-11-22T09:36:27.244Z [container-agent]     return self.do_open(_UnixSocketConnection, req,  # type:ignore
2023-11-22T09:36:27.244Z [container-agent]   File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
2023-11-22T09:36:27.244Z [container-agent]     raise URLError(err)
2023-11-22T09:36:27.244Z [container-agent] urllib.error.URLError: <urlopen error [Errno 2] No such file or directory>
2023-11-22T09:36:27.244Z [container-agent] 
2023-11-22T09:36:27.244Z [container-agent] During handling of the above exception, another exception occurred:
2023-11-22T09:36:27.244Z [container-agent] 
2023-11-22T09:36:27.244Z [container-agent] Traceback (most recent call last):
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/./src/charm.py", line 786, in <module>
2023-11-22T09:36:27.244Z [container-agent]     main(MySQLOperatorCharm)
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/main.py", line 441, in main
2023-11-22T09:36:27.244Z [container-agent]     _emit_charm_event(charm, dispatcher.event_name)
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2023-11-22T09:36:27.244Z [container-agent]     event_to_emit.emit(*args, **kwargs)
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/framework.py", line 342, in emit
2023-11-22T09:36:27.244Z [container-agent]     framework._emit(event)
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/framework.py", line 839, in _emit
2023-11-22T09:36:27.244Z [container-agent]     self._reemit(event_path)
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/framework.py", line 928, in _reemit
2023-11-22T09:36:27.244Z [container-agent]     custom_handler(event)
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/src/rotate_mysql_logs.py", line 50, in _rotate_mysql_logs
2023-11-22T09:36:27.244Z [container-agent]     self.charm._mysql._execute_commands(["logrotate", "-f", LOG_ROTATE_CONFIG_FILE])
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 622, in _execute_commands
2023-11-22T09:36:27.244Z [container-agent]     process = self.container.exec(
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/model.py", line 2571, in exec
2023-11-22T09:36:27.244Z [container-agent]     return self._pebble.exec(
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 2374, in exec
2023-11-22T09:36:27.244Z [container-agent]     resp = self._request('POST', '/v1/exec', body=body)
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1564, in _request
2023-11-22T09:36:27.244Z [container-agent]     response = self._request_raw(method, path, query, headers, data)
2023-11-22T09:36:27.244Z [container-agent]   File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1612, in _request_raw
2023-11-22T09:36:27.244Z [container-agent]     raise ConnectionError(e.reason)
2023-11-22T09:36:27.244Z [container-agent] ops.pebble.ConnectionError: [Errno 2] No such file or directory
2023-11-22T09:36:27.419Z [container-agent] 2023-11-22 09:36:27 INFO juju.util.exec exec.go:209 run result: exit status 1
2023-11-22T09:36:27.427Z [container-agent] 2023-11-22 09:36:27 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "update-status" hook
2023-11-22T09:36:32.437Z [container-agent] 2023-11-22 09:36:32 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "update-status" hook
2023-11-22T09:36:33.507Z [container-agent] 2023-11-22 09:36:33 WARNING juju-log No relation: certificates
2023-11-22T09:36:34.771Z [container-agent] 2023-11-22 09:36:34 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2023-11-22T09:37:35.923Z [container-agent] 2023-11-22 09:37:35 INFO juju-log Starting the log rotate manager
2023-11-22T09:37:35.941Z [container-agent] 2023-11-22 09:37:35 INFO juju-log Started log rotate manager process with PID 5269
2023-11-22T09:37:36.276Z [container-agent] 2023-11-22 09:37:36 INFO juju-log Setting up the logrotate configurations
2023-11-22T09:37:36.366Z [container-agent] 2023-11-22 09:37:36 INFO juju-log Adding pebble layer
2023-11-22T09:37:43.555Z [container-agent] 2023-11-22 09:37:43 INFO juju-log Unit workload member-state is offline with member-role unknown
2023-11-22T09:37:43.594Z [container-agent] 2023-11-22 09:37:43 INFO juju-log Attempting reboot from complete outage.
2023-11-22T09:37:47.575Z [container-agent] 2023-11-22 09:37:47 INFO juju.worker.uniter.operation runhook.go:186 ran "mysql-pebble-ready" hook (via hook dispatching script: dispatch)
2023-11-22T09:41:14.745Z [container-agent] 2023-11-22 09:41:14 INFO juju-log Unit workload member-state is online with member-role primary
2023-11-22T09:41:14.745Z [container-agent] 2023-11-22 09:41:14 ERROR juju.worker.dependency engine.go:695 "log-sender" manifold worker returned unexpected error: sending log message: websocket: close 1006 (abnormal closure): unexpected EOF: use of closed network connection
2023-11-22T09:41:16.739Z [container-agent] 2023-11-22 09:41:16 WARNING juju-log No relation: certificates
2023-11-22T09:41:17.148Z [container-agent] 2023-11-22 09:41:17 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)

Here are also the full Kubernetes logs. I'm not sure what we can conclude from the above; this could be a mysql-k8s charm issue, but we can't be sure.

@orfeas-k orfeas-k reopened this Nov 22, 2023
@Daard

Daard commented Nov 23, 2023

I have also faced a similar issue with Katib.

These are the last log outputs from the Katib trial UI:

I1123 08:46:42.370564 14 main.go:139] 2023-11-23T08:46:42Z INFO loss=0.49476566910743713
I1123 08:46:42.370576 14 main.go:139] 2023-11-23T08:46:42Z INFO categorical_accuracy=0.8163449764251709
I1123 08:46:42.370581 14 main.go:139] 2023-11-23T08:46:42Z INFO same_precision=0.6819637417793274
I1123 08:46:42.370621 14 main.go:139] 2023-11-23T08:46:42Z INFO same_recall=0.7400115728378296
I1123 08:46:42.370631 14 main.go:139] 2023-11-23T08:46:42Z INFO val_loss=0.4788762629032135
I1123 08:46:42.370650 14 main.go:139] 2023-11-23T08:46:42Z INFO val_categorical_accuracy=0.8287093043327332
I1123 08:46:42.370728 14 main.go:139] 2023-11-23T08:46:42Z INFO val_same_precision=0.6882216930389404
I1123 08:46:42.370845 14 main.go:139] 2023-11-23T08:46:42Z INFO val_same_recall=0.7760416865348816
F1123 08:47:06.643975 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.152.183.250:65535: i/o timeout"
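A note on the failure above: the dial target port 65535 stands out, since katib-db-manager's own startup logs report it listening on 6789. As a quick triage sketch (namespace and Service name are assumptions based on this thread), one can pull the dial target out of the fatal log line and compare it against the Service spec:

```shell
# Extract the failing dial target from the metrics collector's fatal log line
# (log line copied from this report).
err='F1123 08:47:06.643975 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.152.183.250:65535: i/o timeout"'
target=$(printf '%s' "$err" | grep -oE '[0-9]+(\.[0-9]+){3}:[0-9]+')
echo "$target"   # 10.152.183.250:65535

# Compare with what the Service actually exposes (requires cluster access):
#   kubectl -n kubeflow get svc katib-db-manager -o jsonpath='{.spec.ports[0].port}'
```

If the Service port does not match the port the manager listens on, that mismatch alone would explain the i/o timeout.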

@orfeas-k
Contributor

@Daard Could you let us know a bit more about the environment and deployment you had during the above error?

@Daard

Daard commented Nov 23, 2023

Environment

I used this guide to deploy Charmed Kubeflow.

Logs

I already added logs from the trial above.

What I did

I have created a custom TFJob, which runs and completes successfully on its own, but it does not work as a Katib experiment.

I have also tested several other experiment configurations from the kubeflow/katib documentation, but they show the same behaviour.

@orfeas-k Do you need any additional logs to understand the cause?

@Daard

Daard commented Nov 23, 2023

@orfeas-k After deleting the Katib experiment, the trials remain in my namespace and I can't delete them, even after deleting all resources connected to the experiment (pods, experiments, suggestions).
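Trials that survive the deletion of their experiment are usually held back by a Kubernetes finalizer on the Trial object. A minimal sketch of clearing it so deletion can proceed — the resource name is taken from this thread, and the finalizer name shown is illustrative, not Katib's actual one (check with `kubectl get trial <name> -o yaml` first):

```shell
# A Trial stuck after deletion typically carries a finalizer in its metadata,
# e.g. (finalizer name here is a made-up placeholder):
manifest='{"metadata":{"name":"exp-resell-dd-qp9fnnzt","finalizers":["example-cleanup-finalizer"]}}'

# Emptying the finalizer list is what lets the API server finish the delete.
patched=$(printf '%s' "$manifest" | sed 's/"finalizers":\[[^]]*\]/"finalizers":[]/')
echo "$patched"

# Applied to a live cluster this would be (requires cluster access):
#   kubectl -n my-namespace patch trial exp-resell-dd-qp9fnnzt \
#     --type=merge -p '{"metadata":{"finalizers":[]}}'
```

Forcing finalizers off skips whatever cleanup the controller intended (here, likely deleting the trial's rows from the Katib DB), so it is a last resort once the controller itself is known to be broken.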

@orfeas-k
Contributor

Could you post all logs from the katib-db-manager-0 and katib-db-0 pods? You can use:

kubectl -nkubeflow logs katib-db-manager-0  --all-containers
kubectl -nkubeflow logs katib-db-0  --all-containers

@Daard

Daard commented Nov 23, 2023

@orfeas-k
Contributor

orfeas-k commented Nov 23, 2023

Thank you @Daard. We would like to understand better which component exactly is trying to contact katib-db-manager and failing. I think the trial pod has a metrics-collector container as well, and we would like to see if that is the one sending the request that times out. Could you post the logs from the trial pod using --all-containers?

@Daard

Daard commented Nov 23, 2023

@orfeas-k When I tried to get the logs with kubectl, I got this:

(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME                                               READY   STATUS     RESTARTS         AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86            2/2     Running    16 (5d15h ago)   16d
ml-pipeline-visualizationserver-677c86b748-gqw4k   2/2     Running    12 (5d16h ago)   16d
tboard-77d56648ff-dqfz8                            2/2     Running    6 (5d16h ago)    7d23h
resell-lab-0                                       2/2     Running    0                20h
exp-resell-dd-grid-6b5bc56d94-tdt8k                1/1     Running    0                27m
exp-resell-dd-qp9fnnzt-worker-0                    2/2     Running    0                2m12s
exp-resell-dd-dmnkgt8f-worker-0                    1/2     NotReady   0                2m14s
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs exp-resell-dd-dmnkgt8f-worker-0 > worker.txt
Defaulted container "tensorflow" out of: tensorflow, metrics-logger-and-collector

But I can see logs from the TFJob, and they are similar to the logs in the Katib trial UI:

 I1123 10:53:59.433666      14 main.go:396] Trial Name: exp-resell-dd-qp9fnnzt 
 I1123 10:54:02.806239      14 main.go:139] 2023-11-23T10:54:02Z INFO     Feature label has a shape dim { 
 I1123 10:54:02.806274      14 main.go:139]   size: 1 
 I1123 10:54:02.806314      14 main.go:139] } 
 I1123 10:54:02.806333      14 main.go:139] . Setting to DenseTensor. 
 I1123 10:54:02.806387      14 main.go:139] 2023-11-23T10:54:02Z INFO     Feature left_agency has a shape dim { 
 I1123 10:54:02.806405      14 main.go:139]   size: 1 
 I1123 10:54:02.806419      14 main.go:139] } 
 I1123 10:54:02.806425      14 main.go:139] . Setting to DenseTensor. 
 I1123 10:54:02.806438      14 main.go:139] 2023-11-23T10:54:02Z INFO     Feature left_area_kitchen has a shape dim { 
 I1123 10:54:02.806450      14 main.go:139]   size: 1 
 I1123 10:54:02.806459      14 main.go:139] } 
 I1123 10:54:02.806465      14 main.go:139] . Setting to DenseTensor. 
 I1123 10:54:02.806474      14 main.go:139] 2023-11-23T10:54:02Z INFO     Feature left_area_living has a shape dim { 
 I1123 10:54:02.806489      14 main.go:139]   size: 1 
 I1123 10:54:02.806499      14 main.go:139] } 
 I1123 10:54:02.806505      14 main.go:139] . Setting to DenseTensor. 
 I1123 10:54:02.806513      14 main.go:139] 2023-11-23T10:54:02Z INFO     Feature left_area_total has a shape dim { 
 I1123 10:54:02.806518      14 main.go:139]   size: 1 
 I1123 10:54:02.806525      14 main.go:139] } 
 I1123 10:54:02.806541      14 main.go:139] . Setting to DenseTensor. 
 I1123 10:54:02.806557      14 main.go:139] 2023-11-23T10:54:02Z INFO     Feature left_building has a shape dim { 
 I1123 10:54:02.806567      14 main.go:139]   size: 1 
 I1123 10:54:02.806580      14 main.go:139] } 
 I1123 10:54:02.806584      14 main.go:139] . Setting to DenseTensor. 
 I1123 10:54:02.806728      14 main.go:139] 2023-11-23T10:54:02Z INFO     Feature left_built_year has a shape dim { 
 I1123 10:54:02.806747      14 main.go:139]   size: 1 
 I1123 10:54:02.806762      14 main.go:139] } 
 I1123 10:54:02.806775      14 main.go:139] . Setting to DenseTensor. 
 I1123 10:54:20.135699      14 main.go:139] /usr/local/lib/python3.8/dist-packages/keras/src/engine/functional.py:639: UserWarning: Input dict contained keys ['left_page_id', 'left_site', 'right_page_id', 'right_site'] which did not match any model input. They will be ignored by the model. 
 I1123 10:54:20.135737      14 main.go:139]   inputs = self._flatten_to_reference_inputs(inputs) 
 I1123 10:54:20.135760      14 main.go:139] WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0300s vs `on_train_batch_end` time: 0.0376s). Check your callbacks. 
 I1123 10:54:23.616630      14 main.go:139] 2023-11-23T10:54:23Z INFO     epoch 1: 
 I1123 10:55:22.620524      14 main.go:139] 2023-11-23T10:55:22Z INFO     categorical_accuracy=0.8045838475227356 
 I1123 10:55:22.620544      14 main.go:139] 2023-11-23T10:55:22Z INFO     same_precision=0.6760939359664917 
 I1123 10:55:22.620563      14 main.go:139] 2023-11-23T10:55:22Z INFO     same_recall=0.727324903011322 
 I1123 10:55:22.620576      14 main.go:139] 2023-11-23T10:55:22Z INFO     val_loss=0.5009434223175049 
 I1123 10:55:22.620695      14 main.go:139] 2023-11-23T10:55:22Z INFO     val_categorical_accuracy=0.8238841891288757 
 I1123 10:55:22.620801      14 main.go:139] 2023-11-23T10:55:22Z INFO     val_same_precision=0.607390284538269 
 I1123 10:55:22.620865      14 main.go:139] 2023-11-23T10:55:22Z INFO     val_same_recall=0.7758111953735352 
 I1123 10:55:24.845616      14 main.go:139] 2023-11-23T10:55:24Z INFO     epoch 27: 
 I1123 10:55:24.845646      14 main.go:139] 2023-11-23T10:55:24Z INFO     loss=0.519189178943634 
 I1123 10:55:24.845659      14 main.go:139] 2023-11-23T10:55:24Z INFO     categorical_accuracy=0.8094089031219482 
 I1123 10:55:24.845672      14 main.go:139] 2023-11-23T10:55:24Z INFO     same_precision=0.6472785472869873 
 I1123 10:55:24.845743      14 main.go:139] 2023-11-23T10:55:24Z INFO     same_recall=0.7418960332870483 
 I1123 10:55:24.845861      14 main.go:139] 2023-11-23T10:55:24Z INFO     val_loss=0.49683240056037903 
 I1123 10:55:24.845900      14 main.go:139] 2023-11-23T10:55:24Z INFO     val_categorical_accuracy=0.8262967467308044 
 I1123 10:55:24.845914      14 main.go:139] 2023-11-23T10:55:24Z INFO     val_same_precision=0.6697459816932678 
 I1123 10:55:24.846021      14 main.go:139] 2023-11-23T10:55:24Z INFO     val_same_recall=0.7651715278625488 
 I1123 10:55:27.488730      14 main.go:139] 2023-11-23T10:55:27Z INFO     epoch 28: 
 I1123 10:55:27.488755      14 main.go:139] 2023-11-23T10:55:27Z INFO     loss=0.517208993434906 
 I1123 10:55:27.488795      14 main.go:139] 2023-11-23T10:55:27Z INFO     categorical_accuracy=0.8075995445251465 
 I1123 10:55:27.488806      14 main.go:139] 2023-11-23T10:55:27Z INFO     same_precision=0.6600853800773621 
 I1123 10:55:27.488857      14 main.go:139] 2023-11-23T10:55:27Z INFO     same_recall=0.7438364624977112 
 I1123 10:55:27.488940      14 main.go:139] 2023-11-23T10:55:27Z INFO     val_loss=0.49721571803092957 
 I1123 10:55:27.489047      14 main.go:139] 2023-11-23T10:55:27Z INFO     val_categorical_accuracy=0.8250904679298401 
 I1123 10:55:27.489056      14 main.go:139] 2023-11-23T10:55:27Z INFO     val_same_precision=0.6697459816932678 
 I1123 10:55:27.489163      14 main.go:139] 2023-11-23T10:55:27Z INFO     val_same_recall=0.7591623067855835 
 I1123 10:55:29.691004      14 main.go:139] 2023-11-23T10:55:29Z INFO     epoch 29: 
 I1123 10:55:29.691056      14 main.go:139] 2023-11-23T10:55:29Z INFO     loss=0.5149072408676147 
 I1123 10:55:29.691077      14 main.go:139] 2023-11-23T10:55:29Z INFO     categorical_accuracy=0.8060916662216187 
 I1123 10:55:29.691171      14 main.go:139] 2023-11-23T10:55:29Z INFO     same_precision=0.6648879647254944 
 I1123 10:55:29.691262      14 main.go:139] 2023-11-23T10:55:29Z INFO     same_recall=0.7316500544548035 
 I1123 10:55:29.691357      14 main.go:139] 2023-11-23T10:55:29Z INFO     val_loss=0.49219810962677 

@orfeas-k
Contributor

Hmm, could you rerun a Katib experiment and provide logs using --all-containers while the pod is up and in the Error state? IIRC, once the Katib experiment has completed running, the trial's pod is deleted, so you can no longer see it in your namespace.

@Daard

Daard commented Nov 23, 2023

Do I understand correctly that the trial pod is a separate pod, and not the worker pod that I built myself and that produces these metrics?

 I1123 10:54:23.616630      14 main.go:139] 2023-11-23T10:54:23Z INFO     epoch 1: 
 I1123 10:54:23.616663      14 main.go:139] 2023-11-23T10:54:23Z INFO     loss=1.7035995721817017 
 I1123 10:54:23.616686      14 main.go:139] 2023-11-23T10:54:23Z INFO     categorical_accuracy=0.5815742015838623 
 I1123 10:54:23.616696      14 main.go:139] 2023-11-23T10:54:23Z INFO     same_precision=0.46905016899108887 
 I1123 10:54:23.616722      14 main.go:139] 2023-11-23T10:54:23Z INFO     same_recall=0.40713292360305786 
 I1123 10:54:23.616879      14 main.go:139] 2023-11-23T10:54:23Z INFO     val_loss=99.96876525878906 
 I1123 10:54:23.616904      14 main.go:139] 2023-11-23T10:54:23Z INFO     val_categorical_accuracy=0.6779252290725708 
 I1123 10:54:23.617044      14 main.go:139] 2023-11-23T10:54:23Z INFO     val_same_precision=0.18244802951812744 
 I1123 10:54:23.617147      14 main.go:139] 2023-11-23T10:54:23Z INFO     val_same_recall=0.585185170173645 

Because while the experiment is running, there are only these pods:

ml-pipeline-ui-artifact-c4969b95b-6bj86            2/2     Running    16 (5d15h ago)   16d
ml-pipeline-visualizationserver-677c86b748-gqw4k   2/2     Running    12 (5d16h ago)   16d
tboard-77d56648ff-dqfz8                            2/2     Running    6 (5d16h ago)    7d23h
resell-lab-0                                       2/2     Running    0                20h
exp-resell-dd-grid-6b5bc56d94-tdt8k                1/1     Running    0                27m
exp-resell-dd-qp9fnnzt-worker-0                    2/2     Running    0                2m12s
exp-resell-dd-dmnkgt8f-worker-0                    1/2     NotReady   0                2m14s

There are also Trial resources like this:

(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs trial.kubeflow.org/exp-resell-dd-qp9fnnzt  
error: no kind "Trial" is registered for version "kubeflow.org/v1beta1" in scheme "pkg/scheme/scheme.go:28"
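A side note on that error: `kubectl logs` only works on pods, so pointing it at the Trial custom resource fails. To inspect the Trial object itself, one can fetch or describe it instead (a sketch; the namespace and trial name are taken from the output above):

```shell
# Inspect the Trial custom resource rather than asking for its "logs"
kubectl -n my-namespace get trial exp-resell-dd-qp9fnnzt -o yaml
kubectl -n my-namespace describe trial exp-resell-dd-qp9fnnzt
```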

@orfeas-k
Contributor

@Daard I'll get back to your previous comment soon. In the meantime, could you post the output of the juju status katib-db command?

@Daard

Daard commented Nov 23, 2023

juju status katib-db

Model     Controller     Cloud/Region     Version  SLA          Timestamp
kubeflow  my-controller  myk8s/localhost  3.1.6    unsupported  13:10:12Z

App       Version                  Status  Scale  Charm      Channel     Rev  Address        Exposed  Message
katib-db  8.0.34-0ubuntu0.22.04.1  active      1  mysql-k8s  8.0/stable   99  10.152.183.35  no       

Unit         Workload  Agent  Address      Ports  Message
katib-db/0*  active    idle   10.1.183.72         Primary

@orfeas-k
Contributor

Thank you for your effort in debugging this @Daard. The trial pods are pods spun up by the experiment while it's running. IIUC, in your case the experiment pod is exp-resell-dd-grid-6b5bc56d94-tdt8k and the trial pods are:

exp-resell-dd-qp9fnnzt-worker-0                    2/2     Running    0                2m12s
exp-resell-dd-dmnkgt8f-worker-0                    1/2     NotReady   0                2m14s

Are those completing successfully, or do they go into Error status? We want the logs from those once they are in Error state.

@Daard

Daard commented Nov 23, 2023

I will restart the experiment soon, but it needs 20-30 minutes to start my workers. Is that normal behaviour, by the way? A TFJob on its own starts almost instantly.

After a trial fails, my workers are gone, and in the UI I see this message:

Failed to find logs for this Trial. 
Make sure you've set "spec.trialTemplate.retain" field to "true" in the Experiment definition.
If this error persists then the Pod's logs are not currently persisted in the cluster.

The YAML output says this:

- type: Failed
      status: 'True'
      reason: 'TrialFailed. Job reason: TFJobFailed'
      message: >-
        Trial has failed. Job message: TFJob my-namespace/exp-resell-dd-dmnkgt8f
        has failed because 1 Worker replica(s) failed.
      lastUpdateTime: '2023-11-23T11:02:53Z'
      lastTransitionTime: '2023-11-23T11:02:53Z'

I will try increasing the replica count. Maybe that will help capture the error logs.
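For reference, the "spec.trialTemplate.retain" field that the UI message mentions lives in the Experiment definition. A minimal, hypothetical fragment (names are placeholders; everything except the retain line is elided):

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: exp-resell-dd      # placeholder
  namespace: my-namespace  # placeholder
spec:
  trialTemplate:
    retain: true   # keep trial pods after they finish so their logs stay available
    # ... rest of the trial template unchanged ...
```

With retain set to true, finished or failed trial pods are not cleaned up, so their logs remain accessible in the UI and via kubectl logs.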

@Daard

Daard commented Nov 23, 2023

I have got some logs:

(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME                                               READY   STATUS      RESTARTS         AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86            2/2     Running     16 (5d19h ago)   16d
ml-pipeline-visualizationserver-677c86b748-gqw4k   2/2     Running     12 (5d19h ago)   16d
tboard-77d56648ff-dqfz8                            2/2     Running     6 (5d19h ago)    8d
exp-resell-ee-grid-667cfd4548-btbnh                1/1     Running     0                24m
exp-resell-ee-pdrnc575-worker-1                    0/1     Completed   0                3m3s
exp-resell-ee-2pxhvm8q-worker-1                    1/1     Running     0                2m59s
exp-resell-ee-pdrnc575-worker-0                    0/2     Error       2 (28s ago)      3m4s
exp-resell-ee-2pxhvm8q-worker-0                    0/2     Error       3 (30s ago)      3m1s
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs exp-resell-ee-pdrnc575-worker-0
Defaulted container "tensorflow" out of: tensorflow, metrics-logger-and-collector
NAME                                               READY   STATUS             RESTARTS         AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86            2/2     Running            16 (5d19h ago)   16d
ml-pipeline-visualizationserver-677c86b748-gqw4k   2/2     Running            12 (5d19h ago)   16d
tboard-77d56648ff-dqfz8                            2/2     Running            6 (5d19h ago)    8d
exp-resell-ee-grid-667cfd4548-btbnh                1/1     Running            0                25m
exp-resell-ee-pdrnc575-worker-1                    0/1     Completed          0                3m47s
exp-resell-ee-2pxhvm8q-worker-1                    1/1     Running            0                3m43s
exp-resell-ee-pdrnc575-worker-0                    0/2     CrashLoopBackOff   3 (30s ago)      3m48s
exp-resell-ee-2pxhvm8q-worker-0                    0/2     Error              4 (48s ago)      3m45s
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs exp-resell-ee-pdrnc575-worker-0
Defaulted container "tensorflow" out of: tensorflow, metrics-logger-and-collector
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME                                               READY   STATUS             RESTARTS         AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86            2/2     Running            16 (5d19h ago)   16d
ml-pipeline-visualizationserver-677c86b748-gqw4k   2/2     Running            12 (5d19h ago)   16d
tboard-77d56648ff-dqfz8                            2/2     Running            6 (5d19h ago)    8d
exp-resell-ee-grid-667cfd4548-btbnh                1/1     Running            0                28m
exp-resell-ee-pdrnc575-worker-1                    0/1     Completed          0                7m36s
exp-resell-ee-2pxhvm8q-worker-1                    0/1     Completed          0                7m32s
exp-resell-ee-2pxhvm8q-worker-0                    0/2     CrashLoopBackOff   5 (2m25s ago)    7m34s
exp-resell-ee-pdrnc575-worker-0                    0/2     CrashLoopBackOff   5 (2m16s ago)    7m37s

After I increased the replica count, some trials completed. Their output is similar to the TFJob log.

But in the UI I got this output at the end:

I1123 14:38:53.968930    5790 main.go:139] 2023-11-23T14:32:41Z INFO     epoch 31: 
 I1123 14:38:53.968937    5790 main.go:139] 2023-11-23T14:32:41Z INFO     loss=0.5217770934104919 
 I1123 14:38:53.968944    5790 main.go:139] 2023-11-23T14:32:41Z INFO     categorical_accuracy=0.8083534240722656 
 I1123 14:38:53.968949    5790 main.go:139] 2023-11-23T14:32:41Z INFO     same_precision=0.6734258532524109 
 I1123 14:38:53.968956    5790 main.go:139] 2023-11-23T14:32:41Z INFO     same_recall=0.7432273030281067 
 I1123 14:38:53.968963    5790 main.go:139] 2023-11-23T14:32:41Z INFO     val_loss=0.5091827511787415 
 I1123 14:38:53.968975    5790 main.go:139] 2023-11-23T14:32:41Z INFO     val_categorical_accuracy=0.8226779103279114 
 I1123 14:38:53.968989    5790 main.go:139] 2023-11-23T14:32:41Z INFO     val_same_precision=0.6974595785140991 
 F1123 14:38:53.968994    5790 main.go:421] Failed to wait for worker container: unable to find main pid from the process list [1 5790] 
 I1123 14:38:53.969127    5790 main.go:139] 2023-11-23T14:32:41Z INFO     val_same_recall=0.7512437701225281 
 I1123 14:38:53.969137    5790 main.go:139] 2023-11-23T14:32:43Z INFO     epoch 32: 

Now the experiment is stuck.

@orfeas-k
Contributor

What we need are the logs from pods that are in Error state, collected with --all-containers, i.e. running kubectl -n my-namespace logs pod-name --all-containers.

@Daard

Daard commented Nov 23, 2023

I did not catch the worker in Error state, only CrashLoopBackOff.

Does that matter? Or can you work with these logs?

(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME                                               READY   STATUS             RESTARTS         AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86            2/2     Running            16 (5d20h ago)   16d
ml-pipeline-visualizationserver-677c86b748-gqw4k   2/2     Running            12 (5d21h ago)   16d
tboard-77d56648ff-dqfz8                            2/2     Running            6 (5d21h ago)    8d
exp-resell-ff-grid-77955875cb-sl7j4                1/1     Running            0                40m
resell-lab-0                                       2/2     Running            0                16m
exp-resell-ff-5crjfw69-worker-1                    0/1     Pending            0                10m
exp-resell-ff-cc4mjp6j-worker-1                    0/1     Completed          0                10m
exp-resell-ff-cc4mjp6j-worker-0                    0/2     CrashLoopBackOff   6 (2m52s ago)    11m
exp-resell-ff-5crjfw69-worker-0                    0/2     CrashLoopBackOff   6 (22s ago)      10m

kubectl -n my-namespace logs exp-resell-ff-cc4mjp6j-worker-0 --all-containers > error.logs
error.txt

@Daard

Daard commented Nov 23, 2023

I have caught the Error state.

err2.txt

@Daard

Daard commented Nov 24, 2023

@orfeas-k Hello again.
I am not familiar with k8s and Kubeflow, so I did not realize that adding the --all-containers argument would make a difference.
But I have added all the logs from the failing containers to the err2.txt file. I hope it helps to debug Katib.

I have faced a similar issue with this example.

The trial pods are now stuck in Pending state.

I hope you will find the problem. If you need any additional logs, I will gladly send them.

@orfeas-k
Contributor

@Daard could you try removing the katib-db charm and redeploying it with the following commands? Let's see if this unblocks new experiments. Please note that this will delete any data related to Katib experiments you've already run, so avoid doing it if you have important data from past experiments.

juju remove-application katib-db 
# Wait for it to be removed
juju deploy mysql-k8s katib-db --channel 8.0/stable --trust --constraints mem=2G
juju relate katib-db-manager:relational-db katib-db:database
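Once redeployed, relation status can be confirmed with something like (a sketch; application names as in the commands above):

```shell
juju status katib-db katib-db-manager --relations
```

Both applications should eventually report active/idle, with the relational-db/database relation listed between them.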

@Daard

Daard commented Nov 24, 2023

@orfeas-k
Is this normal?
juju_status.txt

@Daard

Daard commented Nov 24, 2023

The logs from katib-db-manager:

2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken     data = self._lazy_data = self._load()
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 1378, in _load
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken     return self._backend.relation_get(self.relation.id, self._entity.name, self._is_app)
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 2697, in relation_get
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken     raw_data_content = self._run(*args, return_output=True, use_json=True)
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 2618, in _run
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken     raise ModelError(e.stderr)
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken ops.model.ModelError: ERROR permission denied
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken 
2023-11-24T13:05:31.810Z [container-agent] 2023-11-24 13:05:31 ERROR juju.worker.uniter.operation runhook.go:180 hook "relational-db-relation-broken" (via hook dispatching script: dispatch) failed: exit status 1
2023-11-24T13:05:31.812Z [container-agent] 2023-11-24 13:05:31 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:05:37.228Z [container-agent] 2023-11-24 13:05:37 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:05:37.228Z [container-agent] 2023-11-24 13:05:37 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:05:37.299Z [container-agent] 2023-11-24 13:05:37 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:09:27.243Z [container-agent] 2023-11-24 13:09:27 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook

@orfeas-k
Contributor

Filed an issue in mysql-k8s canonical/mysql-k8s-operator#341

@orfeas-k
Contributor

orfeas-k commented Dec 6, 2023

Update

Received a response in canonical/mysql-k8s-operator#341 (comment) mentioning that mysql-k8s has an issue scaling back up after being scaled to 0 units, and that they're working on a solution. If the root cause of the Katib issue is katib-db not being responsive to katib-db-manager, then it will be resolved once they push a fix.

@orfeas-k
Contributor

orfeas-k commented Jun 13, 2024

In the end, this appears to be the same issue we hit in canonical/bundle-kubeflow#893, described in detail in canonical/bundle-kubeflow#893 (comment).


Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5873.

This message was autogenerated

@NohaIhab
Contributor Author

Hit this issue while defining tests for CKF 1.8/stable in an air-gapped environment (canonical/bundle-kubeflow#918); PR #192 should resolve it.

orfeas-k added a commit that referenced this issue Jun 13, 2024
* fix: Explicitly set `KATIB_DB_MANAGER_SERVICE_PORT` in katib-controller
* katib-db-manager: Remove port config option

Closes canonical/bundle-kubeflow#893, #108
Addressed also part of #184
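Given that the original failure dialed port 65535 instead of the real katib-db-manager port, a quick sanity check after upgrading is to compare the Service port with the port a trial's metrics collector sidecar resolves (a sketch; the namespace, pod name placeholder, and exact env var wiring are assumptions):

```shell
# Port exposed by the katib-db-manager Service
kubectl -n kubeflow get svc katib-db-manager -o jsonpath='{.spec.ports[0].port}{"\n"}'

# Port seen inside a running trial pod's metrics collector sidecar
kubectl -n my-namespace exec <trial-pod> -c metrics-logger-and-collector -- \
  printenv KATIB_DB_MANAGER_SERVICE_PORT
```

The two values should match; a value of 65535, as in the error above, points at the misconfiguration this fix addresses.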