Merge pull request #13 from kingdonb/test-prerelease
Testing link-checker-gpt 0.1.0-beta
Kingdon Barrett authored Aug 21, 2023
2 parents 4fe0593 + 69a5083 commit 9212eb0
Showing 29 changed files with 403 additions and 175 deletions.
2 changes: 2 additions & 0 deletions .dockerignore
@@ -0,0 +1,2 @@
Dockerfile
action.yml
10 changes: 0 additions & 10 deletions .github/scripts/check_summary.sh

This file was deleted.

9 changes: 0 additions & 9 deletions .github/scripts/comment_on_pr.sh

This file was deleted.

37 changes: 37 additions & 0 deletions .github/workflows/docker-image.yml
@@ -0,0 +1,37 @@
name: Docker Image CI

on:
  push:
    tags:
      - '*'

jobs:
  build:
    runs-on: ubuntu-latest

    permissions:
      packages: write

    steps:
      -
        name: Checkout
        uses: actions/checkout@v3
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      -
        name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      -
        name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ghcr.io/kingdonb/link-checker-gpt:${{ github.ref_name }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
21 changes: 9 additions & 12 deletions .github/workflows/pull-request.yml
@@ -1,13 +1,9 @@
name: Example Link Checker Action

# This would currently only work if your org is fluxcd, and you're trying to
# check the website repo (and even then it's still kind of a longshot, because
# it is meant to run after the preview build is online, but there's no order
# between jobs AFAICT on GitHub Actions.)
#
# But as an example, this probably has all the essential elements. We could
# test it out straightforwardly in a fork, if forks did netlify builds (sadly
# there is no easy way to make that work without additional netlify accounts)
# As an example, this probably has all the essential elements. We could test it
# out straightforwardly in a fork, if forks did netlify builds (sadly there is
# no easy way to make that work without additional netlify accounts - we can
# try Fermyon Cloud tho, I hear bartholomew docs include a sitemap rhai script)

on: [pull_request]

@@ -25,8 +21,9 @@ jobs:
uses: actions/checkout@v3

- name: Check Links using Link Checker Action
uses: ./action/ # say instead "uses: kingdonb/link-checker-gpt/action@main"
uses: ./ # say instead "uses: kingdonb/link-checker-gpt@main" or @v1
with:
token: ${{ secrets.GITHUB_TOKEN }} # pass a github token so we can comment
# prNumber: ${{ github.event.pull_request.number }} # the current PR number
prNumber: 1573 # We're testing against this PR until the merge is finished
productionDomain: fluxcd.io
previewDomain: deploy-preview-1573--fluxcd.netlify.app # usually: deploy-preview-${{ github.event.pull_request.number }}--fluxcd.netlify.app
prNumber: 1573 # usually set this to: ${{ github.event.pull_request.number}}
githubToken: ${{ secrets.GITHUB_TOKEN }} # pass a github token so we can comment, check status
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,7 @@
/action/cache/**
/action/links_data.json
/action/report.csv
/action/preview-report.csv
cache/**
links_data.json
report.csv
27 changes: 27 additions & 0 deletions Dockerfile
@@ -0,0 +1,27 @@
FROM ruby:3.0

# Install the gh cli (TODO: make the action comment on the PR)
RUN curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg \
&& chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg \
&& echo 'deb [signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main' | tee /etc/apt/sources.list.d/github-cli.list > /dev/null && \
apt-get update && \
apt-get install -y gh

# Do not: Set the working directory in the container
# per https://docs.github.com/en/actions/creating-actions/dockerfile-support-for-github-actions#workdir
# WORKDIR /linkchecker

# Copy over your application
WORKDIR /opt/link-checker
COPY Gemfile Gemfile.lock /opt/link-checker/

# Install Ruby dependencies
RUN gem install bundler -v 2.4.10 && bundle install

COPY . /opt/link-checker/

# Copies your code file from your action repository to the filesystem path `/` of the container
# COPY entrypoint.sh /entrypoint.sh

# Executes `entrypoint.sh` when the Docker container starts up
ENTRYPOINT ["/opt/link-checker/entrypoint.sh"]
17 changes: 11 additions & 6 deletions Makefile
@@ -4,21 +4,21 @@ SED := $(shell command -v gsed 2> /dev/null || echo sed)
all: main clean-cache preview normalize summary

main:
ruby ./main.rb
bundle exec ruby ./main.rb

clean-cache:
@echo "Cleaning cache and progress data..."
@rm -rf cache
@rm -f links_data.json

preview:
ruby ./main.rb fluxcd.io deploy-preview-1573--fluxcd.netlify.app preview-report.csv false
bundle exec ruby ./main.rb fluxcd.io deploy-preview-1573--fluxcd.netlify.app preview-report.csv false

run_with_preview: preview-report.csv

preview-report.csv:
@echo "Running with preview URL: $(PREVIEW_URL)"
ruby ./main.rb fluxcd.io $(PREVIEW_URL) preview-report.csv false
bundle exec ruby ./main.rb $(PRODUCTION_URL) $(PREVIEW_URL) preview-report.csv false

clean: clean-cache
@rm -f report.csv preview-report.csv pr-summary.csv baseline-unresolved.csv
@@ -27,13 +27,18 @@ clean: clean-cache
normalize: report.csv preview-report.csv
@# Normalize the main report.csv
@$(SED) -i '1d' report.csv
@PREVIEW_DOMAIN=$(shell if [ -z "$(PREVIEW_URL)" ]; then echo "deploy-preview-1573--fluxcd.netlify.app"; else echo "$(PREVIEW_URL)"; fi) ;\
$(SED) -i "s/fluxcd.io/$$PREVIEW_DOMAIN/1; s/fluxcd.io/$$PREVIEW_DOMAIN/1" report.csv
@PRODUCTION_DOMAIN=$(shell if [ -z "$(PRODUCTION_URL)" ]; then echo "fluxcd.io"; else echo "$(PRODUCTION_URL)"; fi) ;\
PREVIEW_DOMAIN=$(shell if [ -z "$(PREVIEW_URL)" ]; then echo "deploy-preview-1573--fluxcd.netlify.app"; else echo "$(PREVIEW_URL)"; fi) ;\
$(SED) -i "s/https:\/\/$$PRODUCTION_DOMAIN/https:\/\/$$PREVIEW_DOMAIN/1" report.csv ;\
$(SED) -i "s/https:\/\/$$PRODUCTION_DOMAIN/https:\/\/$$PREVIEW_DOMAIN/1" report.csv
@sort -o report.csv report.csv

@# Normalize the preview-report.csv
@$(SED) -i '1d' preview-report.csv
@sort -o preview-report.csv preview-report.csv

summary:
ruby ./lib/summary.rb
bundle exec ruby ./lib/summary.rb

check-summary:
./scripts/check_summary.sh
105 changes: 87 additions & 18 deletions README.md
@@ -1,31 +1,100 @@
# Link-Checker GPT

This link checker is so-named because it was mostly written by ChatGPT.
Welcome to the Link-Checker GPT! Crafted with the assistance of ChatGPT, this link checker ensures the integrity of links in your website's content. Although primarily designed for the FluxCD website's preview environments, it's versatile enough to work with most platforms, including Netlify.

It is designed for use with the FluxCD website preview environments:
## Integration as a CI Check

```ruby
Link-Checker GPT is ready to be integrated as a CI check within the fluxcd/website repository. When a PR check flags an error, it's an invitation to refine your links. An associated report is available as a downloadable CSV to guide the necessary corrections. In the future, our bot might also add a comment to your PR, providing a gentle nag that aims to cajole us into eventually reducing the number of bad links in the repo all the way down to zero.

## Integration Guide for `fluxcd/website`

Integrating the Link-Checker GPT into your existing workflow is straightforward. Here's how you can integrate it into the `fluxcd/website` repository:

### Step 1: Add the Action

In your `.github/workflows/` directory (create it if it doesn't exist), add a new workflow file, for instance, `link-check.yml`. You can also add this in an existing workflow.

Within this file, add the following content:

```yaml
name: Link Checker

on:
  pull_request:
    branches:
      - main

jobs:
  check-links:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Link Checker GPT
        uses: kingdonb/link-checker-gpt@v1-beta # (the v1 tag is still unreleased, we need to test)
        with:
          productionDomain: fluxcd.io
          previewDomain: deploy-preview-${{ github.event.pull_request.number }}--fluxcd.netlify.app
          prNumber: ${{ github.event.pull_request.number }}
          githubToken: ${{ github.token }}
```
WIP - **TODO**: make this work for other consumers besides fluxcd.io. We have yet to test this on any other site. It should work anywhere that publishes a `sitemap.xml` (which should be pretty much every important CMS including Jekyll, Hugo, Docsy, Bartholomew, ...)

### Step 2: Configuration

The required parameters are `productionDomain`, the production domain used to create the baseline report, and `previewDomain`, the domain of the PR's preview environment, which is the URL the link checker scans for the PR. By convention, the preview domain can usually be inferred from the PR number. The `prNumber` and `githubToken` inputs are also marked as required in `action.yml`.

Both domains must publish a populated `sitemap.xml`.
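
As a rough illustration of why the sitemap matters, here is a minimal sketch of pulling page URLs out of a `sitemap.xml` document. The `sitemap_urls` helper and the sample document are hypothetical and are not taken from this repository's `main.rb`. A regular expression keeps the sketch free of dependencies; a real XML parser would be more robust.

```ruby
# Hypothetical helper (not from main.rb): list the <loc> entries of a sitemap.
# Assumption: the sitemap is a flat <urlset>, not a sitemap index.
def sitemap_urls(xml)
  xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten
end

# Made-up sample input for illustration only.
SAMPLE_SITEMAP = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://fluxcd.io/</loc></url>
    <url><loc>https://fluxcd.io/flux/</loc></url>
  </urlset>
XML

puts sitemap_urls(SAMPLE_SITEMAP)
```

If a domain serves an empty or missing sitemap, a helper like this returns an empty list and there is nothing to check, which is why both domains need a populated one.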

### Step 3: Commit and Test

Commit the new workflow file and create a new pull request. The Link Checker GPT action should automatically run and validate the links within the website content associated with the PR.

If there are any bad links in the production site, they will be captured in a baseline report for follow-up later. Those links are not counted against a PR. If there are any new bad links in the PR then the check will fail.

(Create a link to an invalid anchor in your PR to test this works, then revert the change before merging it!)

## How it Works

Familiarize yourself with the moving parts in a local clone. This action is Dockerized, but it was not designed to run only in Docker. It is a Ruby program and can run on your local workstation. Just run `bundle install` first, then type `make`!

(By default this runs against PR#1573. If you want to check a different PR for problems, you can edit the Makefile, or keep reading to learn how to use this as a GitHub Action.)

To check the links of a preview environment on Netlify, simply run:

```bash
ruby main.rb deploy-preview-1573--fluxcd.netlify.app
```

It may behave differently when run against `fluxcd.io` and the preview site,
but any differences are bugs. We either fix it here, or we fix the reason in
the website itself (probably by replacing an absolute link with a hard domain
reference to fluxcd.io in it.)
This checks for bad links in your PR. But this is only half a check. We don't want you to get blamed for bad links that already were on the site, just because you opened a PR.

So the tool needs to check `fluxcd.io` first, count up those bad links, then discount them from the PR so we can get a valid check output. This way we should guarantee that no new PR ever adds bad links to the FluxCD.io website. Any discrepancies between the reports are considered bugs: either they represent an error in this tool or they can be addressed directly in the website by modifying the links.

There is a baseline report as well as a PR review report. Together they tell you which bad links were found and whether they were already on the site or were created by your PR. The pre-existing ones should be fixed eventually as well, but they will not count against your PR.
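
The discounting step described above can be sketched as a set difference. This is only an illustration under stated assumptions: the row layout (source page, target link) and the helper names `normalize` and `new_bad_links` are hypothetical, and the real logic lives in the Makefile's `normalize` target and `lib/summary.rb`.

```ruby
require 'set'

# Rewrite production URLs to the preview domain so rows from both reports
# compare equal. Assumption: each row is [source_page, target_link].
def normalize(rows, production_domain, preview_domain)
  rows.map do |row|
    row.map { |cell| cell.sub("https://#{production_domain}", "https://#{preview_domain}") }
  end
end

# Keep only the preview rows that are not already present in the baseline.
def new_bad_links(baseline_rows, preview_rows, production_domain, preview_domain)
  known = normalize(baseline_rows, production_domain, preview_domain).to_set
  preview_rows.reject { |row| known.include?(row) }
end

preview = 'deploy-preview-1573--fluxcd.netlify.app'
baseline_rows = [['https://fluxcd.io/a/', 'https://fluxcd.io/b/#gone']]
preview_rows = [
  ["https://#{preview}/a/", "https://#{preview}/b/#gone"], # already broken in production
  ["https://#{preview}/c/", "https://#{preview}/d/#typo"]  # introduced by the PR
]

new_bad_links(baseline_rows, preview_rows, 'fluxcd.io', preview).each { |row| puts row.join(',') }
```

Only rows that survive this filter would count against the PR; everything else belongs to the baseline.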

After a single successful run, a report detailing the link statuses is generated in `report.csv`. You can import this CSV into tools like Google Drive for further analysis and action. The `make summary` step takes the normalized output of the two checks described above, and the `check_summary.sh` script returns an error if the build should fail.
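
The pass/fail decision can be pictured as follows. This is a sketch, not the contents of `check_summary.sh`, which are not shown here. It assumes the PR summary is a headerless CSV in which every non-empty line is one new bad link; `summary_exit_status` is a hypothetical name.

```ruby
# Hypothetical check: exit status 0 when the PR summary is empty, 1 otherwise.
# Assumption: the summary has no header row.
def summary_exit_status(summary_text)
  bad_rows = summary_text.lines.map(&:strip).reject(&:empty?)
  bad_rows.empty? ? 0 : 1
end

puts summary_exit_status('')
puts summary_exit_status("https://example.invalid/page/,https://example.invalid/other/#missing\n")
```

A wrapper script could pass this value to `exit` so that the CI job fails only when the PR itself introduced bad links.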

## Note on UX: Report Download

In the event of a PR check failure, you can read the report in the failed job output. Initially this workflow was designed to give the user access to a detailed report in the form of a zipped CSV. This was originally built as a composite workflow, and you can still find remnants of this in the commented section of `action.yml`.

Instead, the report now goes to the workflow/action job log. You can read all the bad links created by your PR there. Any links from the baseline site will not be included in the report unless your PR is spotless. A later version might emit the baseline report when there is no issue created by the PR, to encourage tidying. Then the report will show the baseline issues, but since they were not caused by your PR, they will not fail the check.

The primary goal is to maximize the signal-to-noise ratio so that users are not tempted to uninstall this workflow. It should be easy to adopt, and it should never fail the workflow to nag the contributor about issues that their PR didn't create.

**TODO**: We still need to figure out a way to expose those baseline errors.

Assuming it runs to completion, it will produce a report in report.csv
## Cache Management

I can import this report into Google Drive and mark it up as I fix the links.
The tool incorporates caching initially intended to expedite repeated runs. This could be particularly useful for iterative development. Most runtime errors, especially those from the validate method and anchor checkers, can be debugged efficiently using cached data without re-fetching anything.

This nearly works as a CI check, but we will need to fix many of the links
first, and find a way to make exceptions for any more that cannot be fixed.
However, there's a known issue: the cache isn't always reliable. To ensure accuracy, always run `make clean-cache` between separate executions. The cache is still used to prevent repeated calls out and to avoid the repeated loading of HTML files into memory. As a result, a lot of memory can be used.

### Broken feature: Sitemap Caching
**TODO**: We're considering refining the cache management system. The cache should always be invalidated unless its validity is assured. This feature's primary purpose is for one-time use and might be phased out or redesigned in future versions.
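
One possible direction for the invalidation idea, sketched under assumptions: a cache entry is reused only while it is younger than a chosen maximum age, and is refetched otherwise. `FreshnessCache` is a hypothetical class, not part of this repository, and the age-based rule is just one way to decide that an entry's validity is assured.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical cache wrapper: reuse an entry only while it is younger than
# max_age seconds. Assumption: keys are safe to use as file names.
class FreshnessCache
  def initialize(dir, max_age:)
    @dir = dir
    @max_age = max_age
  end

  # Returns the cached value when fresh, otherwise stores and returns the block's result.
  def fetch(key, now: Time.now)
    path = File.join(@dir, "#{key}.json")
    if File.exist?(path)
      entry = JSON.parse(File.read(path))
      return entry['value'] if now.to_f - entry['stored_at'] < @max_age
    end
    value = yield
    File.write(path, JSON.generate('stored_at' => now.to_f, 'value' => value))
    value
  end
end

Dir.mktmpdir do |dir|
  cache = FreshnessCache.new(dir, max_age: 60)
  puts cache.fetch('sitemap') { 'fetched once' }
  puts cache.fetch('sitemap') { 'not fetched again while fresh' }
end
```

With a short maximum age, repeated runs during local debugging would still benefit from the cache, while a run started after a new preview deploy would refetch.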

There is a cache, so if you have run the script before the "Visiting links"
step will not be repeated unless you run `make clean` first. This is to help
with iterative development, since most of the runtime errors come from the
validate method and anchor checker, they can be debugged easily from a cache.
The primary issue to grapple with now is that we can wait for the preview environment's deploy to become ready once, but cannot guarantee that subsequent runs of the checker are always looking at the latest version. There is no synchronization or coordination between independent jobs, and there is no job configuration for the Netlify preview build (not even sure how this works, since it is an externally provided action).

However, it doesn't work. So make sure if you are running this more than one
time, you always run at least `make clean-cache` between separate executions.
Perhaps we can read the check statuses and wait to proceed with the scan of the preview domain until the Netlify deploy check shows itself as ready.
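
The waiting idea could look roughly like the polling loop below. This is a sketch under assumptions: `wait_until_ready` is a hypothetical helper, and the block that reports readiness is left to the caller (for example, it might shell out to the `gh` CLI to read the PR's check statuses). Nothing here is taken from the action's actual entrypoint.

```ruby
# Hypothetical helper: call the block until it returns true or attempts run out.
# The sleeper is injectable so the loop can be exercised without real delays.
def wait_until_ready(attempts:, interval:, sleeper: ->(seconds) { sleep(seconds) })
  attempts.times do |i|
    return true if yield
    sleeper.call(interval) if i < attempts - 1
  end
  false
end

# Simulated status source for illustration: ready on the third poll.
statuses = [false, false, true]
ready = wait_until_ready(attempts: 5, interval: 0) { statuses.shift }
puts ready ? 'preview deploy is ready' : 'gave up waiting'
```

Returning `false` on timeout lets the caller decide whether to fail the check or skip the preview scan.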
28 changes: 28 additions & 0 deletions action.yml
@@ -0,0 +1,28 @@
name: 'Link Checker Action'
description: 'Checks the integrity of links in the PR'
inputs:
  # TODO: make this action comment on the PR
  # token:
  #   description: 'GitHub Token'
  #   required: true
  # TODO: take a preview URL as input instead
  prNumber:
    description: 'Pull Request Number'
    required: true
  productionDomain:
    description: 'Live production site hostname'
    required: true
  previewDomain:
    description: 'Preview site deployment hostname'
    required: true
  githubToken:
    description: 'The gh cli checks preview build deploy status'
    required: true
outputs:
  pr-summary:
    description: 'Summary CSV for problematic links'
  baseline-unresolved:
    description: 'Baseline unresolved links CSV'
runs:
  using: 'docker'
  image: 'docker://ghcr.io/kingdonb/link-checker-gpt:v1-beta'