Jobs with a lot of output/hosts stops with status "Error" #12685

Open · DaDenniX opened this issue Aug 18, 2022 · 15 comments

@DaDenniX
Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

Hi there,

I have two different playbooks:

  • one with a few tasks (5-10 tasks) but on a lot of hosts (> 500 hosts)
  • one with a lot of tasks which produce a lot of output (due to a lot of big loops) on one host

Both jobs end with status "Error". After some research and investigation (I really don't have any experience with Kubernetes) I found out that the worker container in the automation pod (automation-job-xxxx-yyyyy) stops producing logs, so the output in the AWX GUI freezes as well.
The job then keeps running for some time after the output freezes, but after a few minutes it fails with status "Error".

It doesn't seem to be related to the rotating log size (10 MB) mentioned in existing issues, because the current logfile is only 6 MB.

deansible01-t:/var/log/pods/awx_awx-7ffd96b58b-9xrmt_2e587bd5-d91a-4c8e-96bc-19ae9d1ca4b9/awx-task # ls -lah 1.log
-rw-r----- 1 root root 6.1M Aug 18 15:14 1.log

The logfile of awx_task said:
2022-08-18T14:59:40.239996397+02:00 stderr F 2022-08-18 12:59:40,239 WARNING [008db9d212c94a569aabe6fac548d42d] awx.main.dispatch job 1018 (error) encountered an error (rc=None), please see task stdout for details.
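For reference, the freeze can also be observed by following the worker container of the job pod directly (a rough sketch; the namespace and pod name below are placeholders):

kubectl -n awx get pods | grep automation-job
# the job pod's container is called "worker" in the default container group pod spec
kubectl -n awx logs -f automation-job-xxxx-yyyyy -c worker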

Our settings in AWX are:
[Screenshot: AWX settings, 2022-08-18 15:18]

I really don't have a clue what is causing this error. But we really need those playbooks working :-(

AWX version

21.4.0

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

Firefox

Steps to reproduce

Create a playbook with 4-5 tasks and run it against over 500 hosts. For example:

- name: clean up files
  ansible.builtin.file:
    path: /tmp/ansible_os_infos
    state: absent
  delegate_to: somehost.blafoo.com
  run_once: True

- name: add headers to file
  ansible.builtin.lineinfile:
    path: /tmp/ansible_os_infos
    regexp: "^System"
    line: "System;Distribution;OSMajor;OSRelease"
    create: yes
  delegate_to: somehost.blafoo.com
  run_once: True

# SLES
- set_fact:
    os_information_string: "SUSE Linux Enterprise;{{ ansible_distribution_major_version }} SP{{ ansible_distribution_release }};"
  when: ansible_distribution == "SLES" or ansible_distribution == "SLES_SAP"

# Oracle Linux
- set_fact:
    os_information_string: "Oracle Linux;{{ ansible_distribution_major_version }};{{ ansible_distribution_version }}"
  when: ansible_distribution == "OracleLinux"

- name: save information about system
  ansible.builtin.lineinfile:
    path: /tmp/ansible_os_infos
    regexp: "^{{ inventory_hostname }}"
    line: "{{ inventory_hostname }};{{ os_information_string }}"
  delegate_to: somehost.blafoo.com

Expected results

Gather the facts of all hosts and write the OS information into one CSV file on somehost.
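
For illustration, the resulting file on somehost.blafoo.com should end up with lines like these (hostnames and versions are made up):

System;Distribution;OSMajor;OSRelease
host01.blafoo.com;SUSE Linux Enterprise;15 SP4;
host02.blafoo.com;Oracle Linux;8;8.6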

Actual results

Job failed with status "Error"

Additional information

No response

@DaDenniX
Author

Issue still present in 21.5.0

@CWollinger
Contributor

Same with AWX 19.4.0

@AdityaVishwekar

I'm experiencing the same issue.

@stanislav-zaprudskiy
Contributor

This could be ansible/ansible-runner#998

@fosterseth
Member

@AdityaVishwekar @DaDenniX @CWollinger

the log rotation issue is about the logs of the automation-job* pod that the job itself runs in, not the logs from the awx-task container

you'll want to make sure your Docker max container log size is larger than the amount of output this playbook is expected to produce
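
on docker-based nodes that usually means the json-file log driver options, for example (illustrative values only, not an official recommendation):

# example /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
# restart docker afterwards for the change to take effect
sudo systemctl restart docker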

when this problem occurs, do you see the automation-job pod still hanging around, or does it get cleaned up?

AWX Team

@AdityaVishwekar

AdityaVishwekar commented Oct 28, 2022

automation-job pod gets cleaned up. Where is the configuration to set docker container size for this pod?

@fosterseth
Member

This sounds like either the timeout issue or log rotation issue we are hoping to address with this PR ansible/receptor#683

@AdityaVishwekar you can change the log rotation size with the docker config, see my comment here on how I did it with minikube, but other k8s clusters might be slightly different

#12644 (comment)
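
For reference, on minikube the kubelet's rotation limit can be raised with something along these lines (a sketch with illustrative values; see the linked comment for the exact steps):

minikube start --extra-config=kubelet.container-log-max-size=100Mi
# on other clusters the equivalent KubeletConfiguration field is containerLogMaxSize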

@Amrish-Sharma

We are also facing this issue with v 21.1.0

Is there any workaround to fix this?

@edvinaskairys

edvinaskairys commented Aug 16, 2023

Hello, it seems I have the same problem with AWX-EE 22.4.0. The web version is a bit older: AWX 20.0.1.
After processing tasks with bigger output, the job output stops appearing, but the task itself keeps running. After the tasks complete, the job status changes to ERROR.

But the automation-job pod itself completes:

kubectl -n awx logs -f automation-job-1779-pzf2w

=0 unreachable=0 \u001b[0;31mfailed=1 \u001b[0m skipped=0 rescued=0 ignored=0 \r\n\u001b[0;31mJAYNET01B\u001b[0m : \u001b[0;32mok=4 \u001b[0m changed=0 unreachable=0 \u001b[0;31mfailed=1 \u001b[0m skipped=0 rescued=0 ignored=0 \r\n\u001b[0;33mTEONET01A\u001b[0m : \u001b[0;32mok=12 \u001b[0m \u001b[0;33mchanged=1 \u001b[0m unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 \r\n\u001b[0;32mTEONET01B\u001b[0m : \u001b[0;32mok=5 \u001b[0m changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 ", "start_line": 3723, "end_line": 3730, "runner_ident": "1779", "event": "playbook_on_stats", "job_id": 1779, "pid": 17, "created": "2023-08-16T12:43:42.041638", "parent_uuid": "e246bf20-0ba0-4c5e-a0a3-c1da066a07c8", "event_data": {"playbook": "network_netbox.yml", "playbook_uuid": "e246bf20-0ba0-4c5e-a0a3-c1da066a07c8", "changed": {"TEONET01A": 1}, "dark": {}, "failures": {"JAYNET01A": 1, "JAYNET01B": 1}, "ignored": {}, "ok": {"TEONET01A": 12, "TEONET01B": 5, "JAYNET01A": 4, "JAYNET01B": 4}, "processed": {"TEONET01A": 1, "TEONET01B": 1, "JAYNET01A": 1, "JAYNET01B": 1}, "rescued": {}, "skipped": {}, "artifact_data": {}, "uuid": "67ac4489-acc8-4dea-aa37-069c95c097e6"}} {"status": "failed", "runner_ident": "1779"} {"zipfile": 1333} UEsDBBQAAAAIAKBjEFdYOKgdsAMAAJIKAAAHAAAAY29tbWFuZKVWW5OaSBT+KymfVy5eZtSqrVqEdiQiTTWtibW11dUgRjIILuBcksr+9u3mNoA6qRgfkO5z/853mv7ecaPDgYbbzuTD3x0aJr4TeN1jQF+dKHrs/PGh0+26e8/NX0/8WSjlMpo8do80SfLV1t/t8rddFD8mf/ayhc+fYnwKQy8W/fDJC9MofhX3UZLmdpmvv0oNL3wSvZc0pk80zuShlz4zd4T9O9GL8HoIOv+wbfeZJ105PsbRV89NuQHzwCTfO4vVFCATYGATG6C1rgJiQYTJHGPL5raDQZ/rX9GraSifPhNoAaRgiIgKTYygYQBEloqpPPB/gJGuvpmX/3NoZ25kSRj0BflO6MkD7o/vm8oScBk9pdGBpn4Udr9GTle+vx93j992vecycCP3kUSwanG71D1ORLHwLI2FXv9uMpK4kfVJqyFzS/55KFY8D1aBMboRjZY3BDEsCmiX2IY/r2cOc6RqBdValnkvnLdxkQR5Ukv6ApJE0TRUa1EO5CWD9zG/BZMzjxk/JqOfZNwE51ZqNifh1taeNazwY8+NtcE3pHe6dUaFa3rNHrGmXuNNa96azbyZsr/QpzN03xp14ZRpplvU1QLhHUpfhbV5eFkKnmfjc0piMYhcGoiJ44eT2rpavgmyl3zJHr8/9q05q47CqyxvccNQiYo3VnYQqMIKz7qjzNy09akByAwi5kCFBsyi4Pjk1cUcabIAG6LOgbrQzQeuNKNB0tDSzTUwWXUbsjItBdlAIzNFN4B2ySVTUJb6AhIEVIi0KoTddM3Ks5C+VjAgmoIVoulZgmJ6OIr0+YXw857s+70g2gYut/gIp0TPInIR33lLK98fZl1F8CNQMYu+1m0dmlyw242G4/FIdvoulWRneN8butLWdeSt1JfGnuMMZW/sSr16GYh1bUNmrEqbAFOZFtVWBSyVzwTw+EwzK+1e4r+ytpLC+zTlHGWXA1aUQLfsAnAQvkWhN+nLd4NGRNuel/QhnJsVJMWXvPgsFNoqJ5mKWYV2pm3XVWPv35Mfewd2pUiIGwUBuwGwD2ky+U8UinuKWN/OOb6nsSdeEDdggRyQanQuxYujwGtEyjcuxCgEXuo2txp1KoYxVdQFsYzVg27arYkNfEc8vqb7KOwLYzHxU3ZLo+4j/eIlpU9SJLn1E36BI8wscJiOWL40moA1uMJV1OwiwthY2Jbd1W1oMOZqTeoWcWic+jvqpolYEhWtTJMdBHCp45wzzVkoxaaxKebqTOvHj/8BUEsDBBQAAAAIAHVlEFd3o4ueCAAAAAYAAAAGAAAAc3RhdHVzS0vMzElNAQBQSwMEFAAAAAgAdWUQVw2+1RoDAAAAAQAAAAIAAAByYzMCAFBLAwQUAAAAAAB1ZRBXAAAAAAAAAAAAAAAACwAAAGpvYl9ldmVudHMvUEsBAhQDFAAAAAgAoGMQV1g4qB2wAwAAkgoAAAcAAAAAAAAAAAAAAICBAAAAAGNvbW1hbmRQSwECFAMUAAAACAB1ZRBXd6OLnggAAAAGAAAABgAAAAAAAAAAAAAAgIHVAwAAc3RhdHVzUEsBAhQDFAAAAAgAdWUQVw2+1RoDAAAAAQAAAAIAAAAAAAAAAAAAAICBAQQAAHJjUEsBAhQDFAAAAAAAdWUQVwAAAAAAAAAAAAAAAAsAAAAAAAAAAAAQAMBBJAQAAGpvYl9ldmVudHMvUEsFBgAAAAAEAAQA0gAAAE0EAAAAAA=={"eof": true}

Is there any timeline for this bug?

@EsDmitrii

I'm here with an update for you.
Updated to https://quay.io/repository/ansible/awx-ee?tab=tags&tag=22.7.0
Everything looks fine at first sight.
We will watch and test it for a few days.
Will be back with news.

@jrgoldfinemiddleton

@EsDmitrii things still seeming to work perfectly?

@EsDmitrii

@EsDmitrii things still seeming to work perfectly?

Hi!
50/50... some hosts started working well after the upgrade, some hosts keep failing :(
Don't know why; still trying to find out.

@jrgoldfinemiddleton

Is it the same hosts/jobs that fail each time for you? Or it changes?

@EsDmitrii

EsDmitrii commented Sep 9, 2023

Is it the same hosts/jobs that fail each time for you? Or it changes?

Hi, sorry for the late response.
Yes, I still face the issue with one specific huge job that runs on a huge inventory.
Generally the upgrade solved the issue on multiple hosts.
I tested the change on two hosts and, once it helped, upgraded the other hosts.
But some hosts still fail with the same issue.
Tried multiple ways to fix it, no luck.
I couldn't find any pattern explaining why it fixed the issue on some hosts but not on others.

@EsDmitrii

Hi all!
Sorry for the late response.
My team and I divided the huge jobs into several smaller ones.
Now everything works well.
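
For anyone in the same situation, one way to do such a split is a launch-time limit or AWX job slicing, e.g. (template name is hypothetical, and --limit needs "Prompt on launch" enabled for the limit field):

awx job_templates launch "big-template" --limit "dc1_hosts" --monitor
awx job_templates launch "big-template" --limit "dc2_hosts" --monitor
# or let AWX slice the inventory across several parallel jobs
awx job_templates modify "big-template" --job_slice_count 4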
