Post Run actions/checkout@v4 failed randomly #10609

Closed
2 of 13 tasks
korrem opened this issue Sep 13, 2024 · 6 comments

Comments

@korrem

korrem commented Sep 13, 2024

Description

Hi,

For at least two months we have noticed a problem that occurs randomly in our nightly runs. Sometimes the last step, Post Run actions/checkout@v4, takes a very long time, up to 15 minutes, after which the workflow is either skipped or failed.

For skipped runs we get the error message: Hosted runner encountered an error while running your job. (Error type: Disconnect). An example can be found here: https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/runs/10765405236

For failed runs we get the error message: Hosted runner: GitHub Actions 94 has lost communication with the server. Anything in the workflow that terminates the runner's process, deprives it of CPU/memory or blocks network access can cause this error. An example can be seen here: https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/runs/10712726268

We have added a step that monitors CPU and RAM consumption. So far, CPU consumption has peaked at 10%, and about 6 GB of RAM is still available after the tests have completed.
Our workflow file is here: https://github.com/IMGARENA/multisport-fastpath-scoring-app/blob/develop/.github/workflows/run-e2e-tests.yml and the nightly workflow is here: https://github.com/IMGARENA/multisport-fastpath-scoring-app/blob/develop/.github/workflows/nightly-e2e-tests-without-comparator.yml
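
For readers who cannot open the linked workflow files, a monitoring step of the kind described above might look roughly like the sketch below. It is not the step from the reporter's workflow: the step name and the chosen commands are assumptions, and it presumes a Linux (Ubuntu) hosted runner.

```yaml
# Hypothetical monitoring step; assumes a Linux (Ubuntu) hosted runner.
- name: Log CPU and memory usage
  if: always()          # run even when an earlier step failed
  run: |
    echo "== Load average =="
    uptime
    echo "== Memory (MiB) =="
    free -m
    echo "== Top processes by CPU =="
    ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 10
```

A step like this only records a snapshot at the moment it runs, so a short spike between two snapshots would not show up in the log.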

Could you please help us resolve this issue?

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • Ubuntu 24.04
  • macOS 12
  • macOS 13
  • macOS 13 Arm64
  • macOS 14
  • macOS 14 Arm64
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Version: 20240908.1.0

Is it regression?

https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/runs/10821818651

Expected behavior

The Post Run actions/checkout@v4 step shouldn't take so much time and should finish successfully.

Actual behavior

The Post Run actions/checkout@v4 step at the end of the workflow sometimes takes as long as 15 minutes, after which the whole workflow fails or is skipped.

Repro steps

  1. Go to https://github.com/IMGARENA/multisport-fastpath-scoring-app/actions/workflows/nightly-e2e-tests.yml
  2. Click the Run workflow button
  3. Select the develop branch
  4. Click Run workflow

@hemanthmanga
Contributor

Hi @korrem, thank you for bringing this issue to us. We are looking into it and will update you after investigating.

@Prabhatkumar59

Hi @korrem, I am unable to open the URLs you provided; they return a 404 error. However, from your description, the issue with the Post Run actions/checkout@v4 step, which randomly takes a long time or fails due to runner disconnection, could be related to factors such as runner resource limitations, network instability, or GitHub service issues.

Here are some recommendations to help mitigate the problem:

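A. You can add a retry mechanism to the actions/checkout@v4 step to handle random failures. GitHub Actions supports continue-on-error, which keeps the job from failing when the step fails. It has no built-in retry option for a step, so a retry has to be added separately (for example, as a second, conditional step).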

- name: Checkout Code
  uses: actions/checkout@v4
  with:
    fetch-depth: 0
  continue-on-error: true
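
One way to script a retry using only documented workflow features is a second checkout step that runs when the first one fails. This is a minimal sketch under assumptions: the step names and the `checkout1` id are illustrative and do not come from the reporter's workflow.

```yaml
steps:
  - name: Checkout code (first attempt)
    id: checkout1                 # hypothetical id, used by the condition below
    uses: actions/checkout@v4
    continue-on-error: true       # do not fail the job if this attempt fails

  - name: Checkout code (second attempt)
    # outcome is the step result before continue-on-error is applied
    if: steps.checkout1.outcome == 'failure'
    uses: actions/checkout@v4
```

Note that this only covers the main checkout step. It does not change how the Post Run actions/checkout@v4 cleanup behaves, which is where the failure in this issue was reported.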

B. You can also check whether the job timeout is set too aggressively. Increasing it might prevent early termination.

timeout-minutes: 30 # Example to increase timeout if needed

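
For context, `timeout-minutes` can be set on a whole job or on an individual step. The sketch below shows both placements; the job name and the test command are hypothetical, and the values are only examples. When no job-level value is given, the default is 360 minutes.

```yaml
jobs:
  e2e-tests:                      # hypothetical job name
    runs-on: ubuntu-latest
    timeout-minutes: 30           # job-level limit (default is 360)
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: ./run-e2e-tests.sh   # hypothetical command
        timeout-minutes: 20       # step-level limit for this step only
```
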
C. If the issue persists and is critical, consider using a self-hosted runner, which gives more control over resource allocation and network stability. This might avoid disconnection errors.

D. Git Shallow Clone: To reduce the time spent in the checkout step, ensure that you're not fetching unnecessary history.

- uses: actions/checkout@v4
  with:
    fetch-depth: 1  # Fetch only the latest commit

E. Since the error mentions loss of communication with the server, add network-related logging or monitoring to see whether spikes in network latency or connection drops might be affecting the workflow.
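
A network logging step could be as simple as timing a request to GitHub before and after the tests. The following is a rough sketch, assuming a Linux runner with curl available; the step name and the choice of https://api.github.com as the probe target are assumptions, not something taken from this thread.

```yaml
# Hypothetical network check; place copies before and after the test steps.
- name: Log network diagnostics
  if: always()                    # run even when an earlier step failed
  run: |
    date -u +"%Y-%m-%dT%H:%M:%SZ"
    curl -sS -o /dev/null --max-time 15 \
      -w 'api.github.com: HTTP %{http_code}, connect %{time_connect}s, total %{time_total}s\n' \
      https://api.github.com || echo "Request to api.github.com failed"
```

The trailing `|| echo` keeps the step from failing the job when the probe itself cannot connect, so the log line is recorded either way.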

Hopefully, these changes will improve the stability of the actions/checkout@v4 step.

@korrem
Author

korrem commented Oct 2, 2024

@Prabhatkumar59, thanks for your message. I'll try options A and D first, and the rest if they don't help. I will let you know whether it helped.

@Prabhatkumar59

Hi @korrem, sure, let me know. Hopefully the changes I suggested will help improve stability.

@Prabhatkumar59

Hi @korrem - Since we haven't heard back, we'll assume your issue is resolved and will close this issue for now. Feel free to reach out to us for any other queries. Thanks.

@korrem
Author

korrem commented Oct 25, 2024

Hi Prabhatkumar59, apologies for the long wait for the results. Unfortunately, none of your advice helped.

  • The continue-on-error and retry mechanism didn’t work.
  • Increasing timeout-minutes only increased the running time of the entire workflow. The post-run checkout action usually takes at most 2 seconds, so a longer timeout didn’t help; it only prolonged the agony.
  • We are blocked by the company from using a self-hosted runner, so this option wasn’t viable.
  • Setting fetch-depth to 1 also didn’t help.
  • We added monitoring steps to check whether the network connection was dropping, but the steps before and after the tests indicate that everything is fine.
  • In addition, I removed unnecessary steps, which are usually skippable, but this has not helped either.

The strange thing is that all steps passed except the last one (Post Run actions/checkout@v4), yet the logs look as if the hosted runner never started the job at all.
Logs: [image]
Workflow screenshot: [image]

I would be grateful for any other help.
