Failed to Report logs error in experiment Pod on latest/edge #108
Comments
From some quick research, it seems these kinds of issues are associated with katib-db-manager not being up and ready when experiments are run. Do you know whether it was in fact active and idle, and whether the pods were also active and ready, when you tried running the experiment? Similar issue: kubeflow/katib#1517 |
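A minimal sketch of that check before starting an experiment (the `kubeflow` model/namespace and application names are assumptions for a typical Charmed Kubeflow deployment):

```shell
# Confirm the Katib database components are active/idle on the Juju side
juju status katib-db-manager katib-db

# Confirm the corresponding pods are Running and Ready on the Kubernetes side
kubectl get pods -n kubeflow | grep katib-db
```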
I was not able to reproduce this issue; we should re-open it if we hit it again and document the steps to reproduce. |
Came across this again when running the Katib UAT notebook. I then also tested with the grid example, and the same thing happened there as well. Note, though, that this happened after I had scaled the cluster down to 0 nodes yesterday before EOD and scaled it up again today.

Environment
Reproduce
I redeployed to a new cluster with the same setup, and it looks like the issue didn't come up there either.

Logs/Debugging
Trial logs
This is the last line in the logs of every trial pod spun up by the Katib experiment.
As @DnPlas pointed out, this is the pod trying to contact katib-db-manager (10.100.8.221).
Full trial logs
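For reference, a sketch of the commands used to pull such trial-pod logs; the namespace, experiment label, and metrics-collector sidecar container name are assumptions and may differ per deployment:

```shell
# List the trial pods spun up by the experiment (label is an assumption)
kubectl get pods -n <namespace> -l katib.kubeflow.org/experiment=<experiment-name>

# Read the training container and the metrics-collector sidecar separately
kubectl logs <trial-pod-name> -n <namespace> -c <training-container>
kubectl logs <trial-pod-name> -n <namespace> -c metrics-logger-and-collector
```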
However, katib-db-manager is up and running, as we can see from the logs below.
katib-db-manager-0 pod's logs
Note that the following errors are present even in the healthy cluster
I also noticed the following in the katib-db-0 pod's logs:
Here are also the full Kubernetes logs. I'm not sure what we can conclude from the above. This could be a mysql-k8s charm issue, but we can't be sure. |
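One more check that could narrow this down, assuming the Service is named `katib-db-manager` in the `kubeflow` namespace: verify that it actually has endpoints, since an empty endpoints list would explain trials failing to connect even though the pod itself looks healthy.

```shell
# A Service with no ENDPOINTS means no Ready pod is backing it,
# which would explain the connection errors seen in the trial pods
kubectl get svc katib-db-manager -n kubeflow
kubectl get endpoints katib-db-manager -n kubeflow
```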
I have also faced a similar issue with Katib. These are the last log outputs from the Katib UI trials:
|
@Daard Could you let us know a bit more about the environment and deployment you had during the above error? |
Environment
Used this guide to deploy Charmed Kubeflow.
Logs
Added logs from the trial already.
What have I done?
I have created a custom TFJob which runs and completes successfully, but it does not work as a Katib experiment. I have also tested several other experiment configurations from the kubeflow/katib documentation, but they show the same behaviour. @orfeas-k Do you need any additional logs to understand the reason? |
@orfeas-k After deleting the Katib experiment, the trials remain in my namespace, and I can't delete them either, even after deleting all the resources connected to the experiment (pods, experiments, suggestions). |
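A leftover finalizer on the Trial objects is one common reason they cannot be deleted; a minimal sketch of checking for and clearing it (trial name and namespace are placeholders):

```shell
# List the leftover trials and check whether a finalizer is blocking deletion
kubectl get trials -n <namespace>
kubectl get trial <trial-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'

# If a finalizer is the blocker, clear it so the object can be garbage-collected
kubectl patch trial <trial-name> -n <namespace> \
  --type merge -p '{"metadata":{"finalizers":null}}'
```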
Could you post all logs from
|
Thank you @Daard. We would like to understand better who exactly is trying to contact |
@orfeas-k When I tried to get logs with kubectl, I got this:
But I can see logs from the TFJob, and they are similar to the logs from the kubeflow.katib.trial UI:
|
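For completeness, a sketch of how the TFJob worker logs mentioned above can be fetched directly; the `training.kubeflow.org/job-name` label and the names are assumptions for a training-operator TFJob:

```shell
# List the worker pods created for the TFJob and read their logs
kubectl get pods -n <namespace> -l training.kubeflow.org/job-name=<tfjob-name>
kubectl logs <worker-pod-name> -n <namespace>
```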
Hmm, could you try to rerun a Katib experiment and provide logs using |
Do I understand correctly that the trial pod is a separate pod, and not the worker pod that I built and that has these metrics?
Because while the experiment is running, there are only pods like these:
There are also trials like this:
|
@Daard I'll get back to your previous comment soon. In the meantime, could you post the output of
juju status katib-db
|
Thank you for your effort in debugging this @Daard. The trial pods are pods spun up by the experiment while it's running. IIUC, in your case the experiment pod is
Are those completing successfully, or do they go into Error status? We want the logs from those pods once they are in the Error state. |
I will restart the experiment soon, but it needs 20-30 minutes to start my workers. Is that normal behaviour, by the way? Because the TFJob runs almost instantly. After the trial has failed, my workers are gone, and in the UI I can see this message:
The YAML output says this:
I will try to increase the replica count. Maybe it will help to get error logs. |
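On the 20-30 minute startup question, the pod events usually show what the workers are waiting on (image pulls, scheduling, volumes); a sketch with placeholder names:

```shell
# Inspect the worker pod's events and the most recent events in the namespace
kubectl describe pod <worker-pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -n 20
```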
I have got some logs:
After I increased the replica count, some trials completed. Their outputs are similar to the TFJob log, but in the UI I got this output at the end:
Now the experiment is stuck. |
What we need are logs from pods that are in the Error state, using |
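A sketch of collecting exactly that, with a placeholder namespace: list the pods that ended up Failed and grab their logs, including the previous container run if they restarted.

```shell
# List pods whose phase is Failed
kubectl get pods -n <namespace> --field-selector=status.phase=Failed

# Fetch their logs; --previous helps when the container has restarted
kubectl logs <failed-pod-name> -n <namespace>
kubectl logs <failed-pod-name> -n <namespace> --previous
```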
I did not catch the Error state of the worker, only a CrashLoopBackOff. Is that crucial? Or can you help with such logs?
|
I have caught the Error state. |
@orfeas-k Hello again. I have faced a similar issue with this example. The trial pods are now stuck in the Pending state. I hope you will find the problem. If you need any additional logs, I can send them easily. |
@Daard could you try removing the
|
@orfeas-k |
The logs from katib-db-manager:
|
Filed an issue in |
Update
Received a response in canonical/mysql-k8s-operator#341 (comment) that mentions that |
After all, this should be the same issue we hit in canonical/bundle-kubeflow#893, which is described in detail in canonical/bundle-kubeflow#893 (comment). |
Thank you for reporting your feedback to us! The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5873.
|
Hit this issue while defining tests for CKF 1.8/stable in an air-gapped environment (canonical/bundle-kubeflow#918); PR #192 should resolve it. |
* fix: Explicitly set `KATIB_DB_MANAGER_SERVICE_PORT` in katib-controller
* katib-db-manager: Remove port config option

Closes canonical/bundle-kubeflow#893, #108
Also addresses part of #184
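As a sanity check after picking up this fix, one can confirm the variable is now set on the controller workload; a sketch assuming a Charmed Kubeflow install where the workload pod is `katib-controller-0` in the `kubeflow` namespace (pod and container names are assumptions):

```shell
# Print the env var from the katib-controller workload container
kubectl exec -n kubeflow katib-controller-0 -c katib-controller -- \
  printenv KATIB_DB_MANAGER_SERVICE_PORT
```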
Steps to reproduce
Result:
Pods go into Error at the final stage of communicating results, with the following logs:
`10.152.183.167` is the IP of the `katib-db-manager` ClusterIP Service. The same logs can be viewed from the `katib-controller` container logs.