AWX and Execution Environment not synchronized #10211
Comments
Could you give more details from the job? Visit the endpoint.
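For reference, the job details and events the maintainers are asking about can be pulled straight from the AWX REST API. A minimal sketch (the host, credentials, and job id are placeholders for your environment):

```shell
# Hypothetical values; substitute your AWX host, credentials, and job id.
AWX_HOST="https://awx.example.com"
JOB_ID=128

# Job detail: status, elapsed time, result_traceback.
curl -s -u admin:password "$AWX_HOST/api/v2/jobs/$JOB_ID/"

# Job events, paginated; comparing the event count here with the stdout
# shown in the UI helps show whether events stopped being recorded.
curl -s -u admin:password "$AWX_HOST/api/v2/jobs/$JOB_ID/job_events/?page_size=200"
```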
Both are empty.
I can't post the whole JSON. I'd have to clean up some things first, if you really need it.
Maybe this is useful for you.
I also double-checked the content. Can you help me find the reason so I can provide useful info? FYI: English is not my first language; I'm doing my best ;) Thx!
This smells a lot like #9961 - we're actively working on it.
Can I help by doing something in my environment? I plan to centralize our k8s logs in ELK, then start a big playbook, collect everything, analyze it, and share the results. Would that be useful?
I believe this should be fixed now. Can you try on 19.2.0+?
Do I have to use the default EE that comes with awx-operator, or can I use my own? Is there a relation, or is it mainly an ansible.awx concern? I will run the test next week with 19.2.1 deployed via awx-operator 0.11. Thx!
@shanemcd I deployed 19.2.2 with awx-operator 0.12.0. (Fresh Install) The bug seems to persist. If I can do more to help solve this problem, please do not hesitate. Here are the details of the job from the API point of view.
Thank you and have a nice day!
Hi, exactly the same issue as described here on awx. A job with 78 tasks and 39 hosts runs fine in the GUI for about 10 minutes, up to line ~1780; after that it halts in the GUI but keeps running in the automation container.
Same issue here. We basically cannot run any long jobs with an external postgres cluster:
Any ideas to try for further troubleshooting? This problem is now quite old (first reported in March in #9594). It seems to be well outlined and reproducible. There are duplicate candidates in #9594 and #11027.
From my experience, there are a couple of scenarios in which the problem occurs, and only with the awx-web UI. OPERATOR_VERSION: 0.10.0. In any case, all the long-running jobs do complete, which can be confirmed from the awx-task logs, and the error is the same for all of them. The weird part is that the logs expose the ssh_password, as you can see from the awx-task logs:
2022-01-21 08:09:06,836 DEBUG [929d30f54a6344f995ca21e6e4a0b419] awx.main.tasks job 4670 (running) finished running, producing 1869 events.
2022-01-20 11:35:00,500 ERROR [034450d53ea44ad4b00fcefa2ff1b261] awx.main.tasks job 4654 (running) Post run hook errored.
I'd appreciate knowing whether a fix has been identified or is in progress.
I'm facing a similar problem on K8s (kubespray). We use a playbook on an inventory with 460 hosts. If we limit the execution to fewer hosts, everything works. If we run on the entire inventory, after a few lines the output in the AWX interface hangs. It recovers after approx. 10 minutes, but when the job finishes, the output is incomplete (without the play recap) and information is missing on the job via the API.
I found something interesting in the container log that may help find the cause: the line "awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(x)" appears constantly, until something happens and the line stops showing, at the same moment the output stops in AWX. After approx. 10 minutes it starts appearing again, but this time with a much higher number of events (awx.main.commands.run_callback_receiver JobEvent.objects.bulk_create(45)), as if it is trying to compensate.
Does anyone know what generates these lines? It looks like the task container is receiving, but who is sending? The logs of the automation-job-xxx-yyyyy pod never stop, and show no errors. Thanks for the attention, and sorry for the bad English. Note: this is not the entire log for the job, just enough to show the lines I mentioned ...
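One way to watch those batch sizes over time (a sketch; the log format is taken from the lines quoted above, and the file name is a placeholder for a saved copy of the awx-task container log):

```shell
# Extract the JobEvent.objects.bulk_create(N) batch sizes from a saved
# awx-task log. A long silence followed by unusually large N values
# matches the stall-then-catch-up behaviour described above.
grep -o 'JobEvent.objects.bulk_create([0-9]*)' awx-task.log \
  | grep -o '[0-9][0-9]*'
```

Piping a live `kubectl logs -f` of the task container into the same filter works the same way.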
I ran a test in another k8s cluster, this time using docker, and the problem did not occur. Could this problem be related to containerd?
Just wanted to report that I tried the latest version of AWX, 20.0.1, with AWX-Operator 0.18.0, and the issue still persists. OPERATOR_VERSION: 0.18.0.
All jobs running for more than two hours show as FAILED (in the earlier version they stayed RUNNING, never reaching FAILED or SUCCESS), even though the playbook run is successful. The incomplete log output issue is fixed by applying the log container size setting on K8s.
Log output of the awx-task container, failed job (2 hours 15 minutes):
2022-03-18 09:53:03,196 DEBUG [3a5ec3009d054c0d9b4ca736c74f464b] awx.main.dispatch task 510b6172-a41b-41fe-b919-00171ed24b7c starting awx.main.tasks.system.handle_success_and_failure_notifications([164])
Log output of the awx-task container, successful job (2 hours):
2022-03-17 09:19:57,357 DEBUG [9aabd791dc484662ad15bfc38b59c82f] awx.main.dispatch task c3e572e3-521d-4f6c-90c3-b73407e3bd27 starting awx.main.tasks.system.handle_success_and_failure_notifications(*[160])
@chris93111, I found the solution just before your reply, problem solved! Thanks a lot!!
Closing this in favor of #11338 |
Hi,
Just wanted to ask whether the job status is showing correctly in the GUI
once it is completed. For me it shows "running" even though the job is
finished in the background and we need to cancel it to make it stop.
I see this problem even on the latest version, do we have any fix or
workaround for this?
Regards,
Vibin
On Sat, 5 Mar 2022, 01:35 chris93111 wrote:
@chicoraf <https://github.com/chicoraf> try container-log-max-size=500Mi
in the kubelet args; see #11338
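The kubelet workaround quoted above amounts to raising the per-container log rotation size so the output stream feeding AWX isn't truncated. A sketch of what setting it looks like on a node — the 500Mi value comes from the comment above; the file path, the extra max-files flag, and its value are illustrative assumptions and vary by distribution:

```shell
# Illustrative: append the flags to the kubelet invocation on each node,
# e.g. via KUBELET_EXTRA_ARGS in /etc/default/kubelet on kubeadm-style
# installs (the exact mechanism differs per distribution).
KUBELET_EXTRA_ARGS="--container-log-max-size=500Mi --container-log-max-files=5"
```

The same settings are also available as `containerLogMaxSize` and `containerLogMaxFiles` in the kubelet configuration file, which is the preferred route on newer clusters.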
ISSUE TYPE
SUMMARY
When executing a Job Template, it seems that the execution environment continues its work in the background, but AWX is no longer up to date, and at the end of the execution the job template is in error.
ENVIRONMENT
STEPS TO REPRODUCE
I only encounter this problem when I run a job template against many hosts; in my case, 100+. I do not encounter it with a small number of hosts.
My environments are functional and I am available to troubleshoot and find the exact cause.
EXPECTED RESULTS
I expect to receive the complete log of the execution so I can review it and make sure it went well.
ACTUAL RESULTS
The job template is in error and the stdout is incomplete. However, I have confirmed that the playbook continues by using the command
kubectl logs awx-job-128-qmx5p -n ansible-awx
and I see that the job template continues, but the output does not return to AWX.