Our team relies heavily on GitHub Actions. Besides the usual workflows that check our code for errors after each push, we also have many workflows set up to run on a schedule.
However, over time, we became frustrated because these workflows were flaky: although valid, they failed from time to time for seemingly random reasons. Most of the time, just re-running them fixed the issue. In this blog post, I detail how to limit the number of false-positive failures in your GitHub Actions workflows.
For demonstration purposes, let’s look at a simple workflow we might have used before reading this blog post:
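As a rough sketch only, a workflow of this kind could look like the following. The workflow name, schedule, package list, script path and `results/` folder are all hypothetical placeholders, not the team's actual configuration.

```yaml
# Hypothetical baseline workflow: every name and path below is a placeholder
name: Update data

on:
  schedule:
    - cron: "0 6 * * *" # every day at 06:00 UTC

jobs:
  update-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Downloads and installs R from external servers
      - uses: r-lib/actions/setup-r@v2

      # Installs the latest package versions from a CRAN-like server
      - name: Install dependencies
        run: Rscript -e 'install.packages(c("httr2", "readr"))'

      # The script itself may also query internet resources
      - name: Run the script
        run: Rscript scripts/update-data.R

      - name: Commit and push results
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git add results/
          git commit -m "Update data"
          git push
```

Each step here depends on something that can fail intermittently: the R download, the package server, the resources queried by the script, and the state of the remote git repository.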
Notify the whole team when a scheduled workflow fails
While workflows set up to run on pushes or pull requests notify the user who committed the changes, scheduled workflows only notify the user who last modified the cron syntax in the workflow file, as indicated in the official documentation:
Notifications for scheduled workflows are sent to the user who last modified the cron syntax in the workflow file. For more information, see “Notifications for workflow runs”.
This behaviour is often not desirable when working collaboratively as a team on a project. In this situation, you would like every member of the team to be notified, so that everybody can contribute to fixing the issue.
There are many ways to circumvent this behaviour, such as adding a step to notify failures on a mailing list or a Slack channel. In the Epiforecasts team, we decided to keep everything in the open and automatically open an issue when one of our scheduled workflows fails. This is achieved by creating a file named action-issue-template.md in your .github folder with the following content:
---
title: "{{ env.GITHUB_WORKFLOW }} GitHub Action is failing"
---
See [the action log](https://github.com/{{ env.GITHUB_ACTION_REPOSITORY }}/actions/runs/{{ env.GITHUB_RUN_ID }}) for more details.
and then appending the following instruction at the end of all your workflows:
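One possible form of this step is sketched below. It assumes the `JasonEtco/create-an-issue` action, whose templates use the same `{{ env.… }}` syntax as the file above; the step name is arbitrary, and the job needs permission to write issues for the default token to work.

```yaml
# Sketch of a final step, assuming the JasonEtco/create-an-issue action.
# The job must be allowed to create issues, e.g. with
#   permissions:
#     issues: write
- name: Open an issue if the workflow failed
  if: failure() # only runs when a previous step has failed
  uses: JasonEtco/create-an-issue@v2
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  with:
    filename: .github/action-issue-template.md
```

Because of the `if: failure()` condition, the step is skipped on successful runs, so it can be added to every workflow without creating noise.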
You can see an example of this used in the wild with this issue.
Note
I recommend that you always specify the reason for the failure (and the fix if it’s not a spurious failure as detailed below) when closing the issue. It will serve as a log and with time, it will help you identify which parts of your workflows should be improved.
Re-running workflows manually
When your workflows fail, you might want to re-run them. You have two options here:
Any of these URLs can fail for any reason and cause your R installation, and therefore your whole action, to fail.
You can reduce this source of breakage, at the expense of some flexibility (you cannot install the R version of your choice). Setting install-r to false will use the R version provided in the GitHub Actions container and not try to install it from external sources:
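A minimal sketch of such a step, assuming the `r-lib/actions/setup-r` action at its `v2` tag:

```yaml
- uses: r-lib/actions/setup-r@v2
  with:
    # Use the R installation already present on the runner
    # instead of downloading one
    install-r: false
```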
But this alone is not enough to remove all calls to external resources. Even when install-r is set to false, the setup-r action checks if the requested version matches the installed version. And, unless specified otherwise, the R version requested by default is 'release', which means a call to an external resource (in this case api.r-hub.io) is required to convert this version 'number' into an actual number such as R 4.2.0. If you want to avoid all external calls, you then also have to specify a numeric version number such as:
You can specify a more precise version number, but it might be good to only specify the major version number to limit breakages due to mismatches between the requested and available versions. R is very stable within major versions, so you are not likely to see failures due to API changes even if you do not specify the minor or patch version number.
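As an illustration, again assuming `r-lib/actions/setup-r@v2`, the step could combine both settings. The value `'4'` is only an example of a numeric version; how a given release of the action resolves a major-only version is worth checking against its documentation.

```yaml
- uses: r-lib/actions/setup-r@v2
  with:
    install-r: false
    # A numeric version instead of the default 'release' alias
    # (example value, to be adapted to the R version on the runner)
    r-version: '4'
```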
R packages installation
R package installation is a common source of failures. These can be caused by incompatibilities between new package versions or by intermittent failures while trying to reach the CRAN-like server.
A good solution to both sources of issues is to pin the exact version numbers and install/load packages from a local cache. This is easily achieved thanks to the renv package.
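In a workflow, this could be sketched as follows, assuming an `renv.lock` file is committed at the root of the repository and using the `r-lib/actions/setup-renv` action to restore and cache the pinned packages:

```yaml
- uses: r-lib/actions/setup-r@v2

# Restores the package versions recorded in renv.lock,
# re-using a cache from previous runs when one is available
- uses: r-lib/actions/setup-renv@v2
```

The lockfile itself is created locally with `renv::init()` and refreshed with `renv::snapshot()` whenever you deliberately update your dependencies.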
In addition to the R installation and CRAN-like servers, you might use some internet resources in your script, and these resources might be unavailable for a number of reasons. In this case, it is good practice to retry your request. But in a polite way! The web server might be unavailable because it's already overloaded with requests. Repeatedly retrying would just make the situation worse in this case.
The polite way to retry HTTP requests is to use exponential backoff. Each time one of your requests fails, you increase the waiting time before making a new one.
Fortunately, you do not have to code the retry feature and the exponential backoff yourself, as they are already implemented in common R packages, such as httr2, via the req_retry() function:
Error in `req_perform()`:
! HTTP 500 Internal Server Error.
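As a sketch of how this could look inside a workflow step: the URL and output file are hypothetical, httr2 is assumed to be installed already, and treating HTTP 500 as a transient error is a choice made for this example (by default, httr2 only retries on 429 and 503). The `|>` pipe requires R 4.1 or later.

```yaml
- name: Download data, retrying politely on transient errors
  shell: Rscript {0}
  run: |
    library(httr2)

    # Hypothetical URL, to be replaced with the resource you need
    resp <- request("https://example.org/data.csv") |>
      req_retry(
        max_tries = 5,
        # Assumption: also retry on HTTP 500, not only on 429 and 503
        is_transient = function(resp) resp_status(resp) %in% c(429, 500, 503),
        # Wait 2, 4, 8, ... seconds between successive attempts
        backoff = function(attempt) 2^attempt
      ) |>
      req_perform()

    writeLines(resp_body_string(resp), "data.csv")
```

If the last attempt still fails, `req_perform()` raises an error like the one above and the step fails, which is the desired outcome for a persistent outage.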
git repository out of sync
If your workflow takes a long time to run, you might get the following message when you try to commit your results:
To https://github.com/covid19-forecast-hub-europe/covid19-forecast-hub-europe
 ! [rejected]        main -> main (fetch first)
error: failed to push some refs to 'https://github.com/covid19-forecast-hub-europe/covid19-forecast-hub-europe'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
Error: Process completed with exit code 1.
As helpfully mentioned in the error message, you need to run git pull ... before pushing to make sure your local git copy is up-to-date. However, if you do this while you have local commits, the default git set-up will create an ugly merge commit. To avoid the merge commit, instead of running a simple git pull ..., you should run git pull --rebase .... Just note that this will not save you if you have merge conflicts.
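A commit step following this advice might look like the sketch below. The `results/` folder and the commit message are hypothetical, the branch name `main` is taken from the error message above, and the job is assumed to have been checked out with a token that is allowed to push.

```yaml
- name: Commit and push results
  run: |
    git config user.name "github-actions[bot]"
    git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
    git add results/
    # Only commit when the run actually changed something
    git diff --cached --quiet || git commit -m "Update results"
    # Replay the local commit on top of any commits pushed while the job ran
    git pull --rebase origin main
    git push origin main
```

There is still a small window between the pull and the push in which another commit can land, so this reduces the likelihood of a rejected push rather than eliminating it.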
GitHub itself is out of service
One last option is that GitHub itself, or at least one of its services, is down. You can check this by visiting the dedicated status page: https://www.githubstatus.com/ or even be proactive by subscribing to GitHub status alerts.
This situation should be exceptional and your best bet is probably to wait until everything is back to normal and re-run your failing workflows. If the scheduled job is time sensitive, you can also run it locally.
If this kind of service interruption happens too frequently for your taste but you still like the GitHub Actions syntax, you might want to try spinning up your own self-hosted runner.