Dealing with flaky GitHub Actions

Author

Hugo Gruson

Published

April 11, 2022

Our team’s work relies a lot on GitHub Actions. Besides the usual workflows to check our code for errors after each push 1, we also have many workflows set up to run on a schedule.

For example, we have scheduled workflow to:

However, with time, we became frustrated because these workflows were unreliable and flaky: they were valid workflows but were failing from time to time for seemingly random reasons. Most of the time, just re-running them fixed the issue. In this blog post, I detail how to limit the number of false-positive failures in your GitHub Action workflows.

For demonstration purposes, let’s look at a simple workflow we might have used before reading this blog post:

on:
  schedule:
    - cron: "0 12 * * *"

jobs:
  scheduled-job:
    runs-on: ubuntu-20.04
    steps:
    - uses: actions/checkout@v2
    
    - uses: r-lib/actions/setup-r@v2

    - name: Install R dependencies
      run: Rscript -e 'install.packages("tidyverse")'

    - run: Rscript 'script.R'
        
    - name: Commit files
      run: |
        git config user.email "action@github.com"
        git config user.name "GitHub Actions"
        git add --all
        git commit -m "New results"
        git push 

Dealing with failing workflows in the moment

Notify the whole team when a scheduled workflow fails

While workflows set up to run on pushes or pull requests will notify the user who committed the changes, scheduled workflows will notify the latest user who modified this workflow, as indicated in the official documentation:

Notifications for scheduled workflows are sent to the user who last modified the cron syntax in the workflow file. For more information, see “Notifications for workflow runs”.

This behaviour is often not desirable when working collaboratively as a team on a project. In this situation, you would like every member of team to be notified. So that everybody can contribute to fix the issue.

There are many ways to circumvent this behaviour, such as adding a step to notify failures on a mailing list or a slack channel 2. In the Epiforecasts team, we decided to keep everything in the open and automatically open an issue when one of our scheduled workflow is failing. This is achieved by creating a file named action-issue-template.md in your .github folder with the following content:

---
title: "{{ env.GITHUB_WORKFLOW }} GitHub Action is failing"
---

See [the action log](https://github.com/{{ env.GITHUB_ACTION_REPOSITORY }}/actions/runs/{{ env.GITHUB_RUN_ID }}) 
for more details.

and then appending the following instruction at the end of all your workflows:

    - name: Create issue about failure
      if: failure()
      uses: JasonEtco/create-an-issue@v2.5.0
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      with:
        filename: .github/action-issue-template.md

You can see an example of this used in the wild with this issue.

Note

I recommend that you always specify the reason for the failure (and the fix if it’s not a spurious failure as detailed below) when closing the issue. It will serve as a log and with time, it will help you identify which parts of your workflows should be improved.

Re-running workflows manually

When your workflows fail, you might want to re-run them. You have two options here:

Dealing with flaky workflows at the root: possible sources of flakiness and how to fix them

Failure during initial set up

R installation

By default, r-lib/actions/setup-r@v2 installs R from various sources depending on the exact version and operating system:

Any of these URLs can fail for any reason and cause your R installation, and therefore your whole action to fail.

It is possible to reduce this possible source of breakage, at the expense of some flexibility (you cannot install the R version of your choice). Setting the install-r to false will use the R version provided in the GitHub Actions container and not try to install it from external sources:

- uses: r-lib/actions/setup-r@v1
  with:
    install-r: false

But this alone is not enough to remove all calls to external resources. Even when install-r is set to false, the setup-r action checks if the requested version matches the installed version. And, unless specified otherwise, the R version requested by default is 'release', which means an call to an external resource (in this case api.r-hub.io) is required to convert this version ‘number’ into an actual number such as R 4.2.0. If you want to avoid all external calls, you then also have to specify a numeric version number such as:

- uses: r-lib/actions/setup-r@v1
  with:
    install-r: false
    r-version: 4
Note

You can specify a more precise version number but it might be good to only specify the major version number to limit the breakages due to mismatches during the requested and available version. R is very stable within major versions so you’re not likely to have failure due to API changes even if you specify the minor or patch version number.

R packages installation

R packages installation is a common source of failures. This can be caused by an incompatibility between package new versions or by intermittent failure while trying to reach the CRAN-like server.

A good solution to both source of issues if to pin the exact version number and install/load packages from a local cache. This is easily achieved thanks to the renv package.

In practice, rather than manually installing package or using the r-lib/actions/setup-r-dependencies action, you should create a lockfile 3 and use the r-lib/actions/setup-renv action:

- uses: r-lib/actions/setup-renv@v2

Unaccessible HTTP resources

In addition to the R install & cran-like servers, you might use some internet resources in your script. And these resources might be unavailable for a number of reasons. In this case, it is good practice to retry your request. But in a polite way! The web server might be unavailable because it’s already overloaded with requests. Repeatedly retrying would just make the situation worse in this case.

The polite way to retry HTTP requests is to use exponential back off. Each time you one of your request fails, you increase the waiting time until you make a new one.

Fortunately, you do not have to code the retry feature & the exponential back off yourself as it is already implemented in common R packages, such as httr2, via the req_retry() function:

library(httr2)

request("https://httpbin.org/status/500") |>
  req_verbose() |>
  req_retry(max_tries = 3) %>%
  req_perform()
-> GET /status/500 HTTP/2
-> Host: httpbin.org
-> user-agent: httr2/0.2.3 r-curl/5.0.2 libcurl/7.81.0
-> accept: */*
-> accept-encoding: deflate, gzip, br, zstd
-> 
<- HTTP/2 500 
<- date: Tue, 05 Mar 2024 09:54:15 GMT
<- content-type: text/html; charset=utf-8
<- content-length: 0
<- server: gunicorn/19.9.0
<- access-control-allow-origin: *
<- access-control-allow-credentials: true
<- 
Error in `req_perform()`:
! HTTP 500 Internal Server Error.

git repository out of sync

If your workflow takes a long time to run, you might get the following message when you try to commit your results:

To https://github.com/covid19-forecast-hub-europe/covid19-forecast-hub-europe !
[rejected] main -> main (fetch first)
error: failed to push some refs to 'https://github.com/covid19-forecast-hub-europe/covid19-forecast-hub-europe'
hint: Updates were rejected because the remote contains work that you do 
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
Error: Process completed with exit code 1.

As helpfully mentioned in the error message, you need to run git pull ... before pushing to make sure your local git copy is up-to-date. However, if you do this while you have local commits, the default git set-up will create an ugly merge commit. To avoid the merge commit, instead of running a simple git pull ..., you should run git pull --rebase .... Just note that this will not save you if you have merge conflicts.

GitHub itself is out of service

One last option is that GitHub itself, or at least one of its services, is down. You can check this by visiting the dedicated status page: https://www.githubstatus.com/ or even be proactive by subscribing to GitHub status alerts.

This situation should be exceptional and your best bet is probably to wait until everything is back to normal and re-run your failing workflows. If the scheduled job is time sensitive, you can also run it locally.

If this kind of service interruption happens too frequently for your taste but you still like the GitHub Actions syntax, you might want to try spinning your own self-hosted runner.

Final summary: the new and improved workflow

on:
  workflow_dispatch:
  schedule:
    - cron: "0 12 * * *"

jobs:
  scheduled-job:
    runs-on: ubuntu-20.04
    steps:
    - uses: actions/checkout@v2

    - uses: r-lib/actions/setup-r@v2
      with:
        install-r: false
        r-version: 4
        use-public-rspm: true

    - uses: r-lib/actions/setup-renv@v2

    - run: Rscript 'script.R'

    - name: Commit files
      run: |
        git config user.email "action@github.com"
        git config user.name "GitHub Actions"
        git add --all
        git commit -m "New results" || echo "No changes to commit"
        git pull --rebase origin main
        git push

    - name: Create issue about failure
      if: failure() && github.event_name != 'workflow_dispatch'
      uses: JasonEtco/create-an-issue@v2.5.0
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      with:
        filename: .github/action-issue-template.md

Footnotes

  1. You can visit https://github.com/r-lib/actions for a great list of such actions.↩︎

  2. Another good approach is implemented in the cransays repository.↩︎

  3. You can use renv directly (e.g., by calling renv::init()), or other derived packages such as capsule↩︎

Citation

BibTeX citation:
@online{gruson2022,
  author = {Gruson, Hugo},
  title = {Dealing with Flaky {GitHub} {Actions}},
  date = {2022-04-11},
  url = {https://epiforecasts.io/posts/2022-04-11-robust-actions},
  langid = {en}
}
For attribution, please cite this work as:
Gruson, Hugo. 2022. “Dealing with Flaky GitHub Actions.” April 11, 2022. https://epiforecasts.io/posts/2022-04-11-robust-actions.