This repository has been archived by the owner on Apr 5, 2024. It is now read-only.

Fetch Apache GitHub Actions Statistics



Context and motivation

For the Apache Software Foundation (ASF), the limit for concurrent jobs in GitHub Actions (GA) is 180 (usage limits). GitHub does not provide statistics about GA usage, so this repository was created to collect some basic data.

Statistics

Statistics data is fetched by the scheduled action Fetch GitHub Action queue. This action takes a series of "snapshots" of GA workflow runs for every ASF repository that uses GA (the list of them is stored in matrix.json, described here).

The statistics consist of:

  • JSON files - workflow runs for every repository, in separate files (described here)
  • CSV file - simple statistics in a single file (described here)

These files are uploaded as workflow artifacts.

Json files

The JSON files contain the list of a repository's workflow runs in the queued and in_progress states. Each file name contains the timestamp at which fetching of this list started. The JSON schema is described in the GitHub API documentation here.
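As an illustration of what one snapshot holds, here is a minimal sketch that counts runs per status in a single response. It assumes the payload follows the shape of GitHub's "list workflow runs for a repository" response (a `workflow_runs` list whose items carry a `status` field). The function name and the sample payload are hypothetical and are not taken from this repository's code.

```python
from collections import Counter

# Statuses this project snapshots, per the description above.
TRACKED_STATUSES = ("queued", "in_progress")


def count_runs_by_status(payload):
    """Count workflow runs per tracked status in one response payload.

    Hypothetical helper: assumes `payload` looks like GitHub's
    "list workflow runs" response, i.e. {"workflow_runs": [{"status": ...}]}.
    """
    counts = Counter(run.get("status") for run in payload.get("workflow_runs", []))
    return {status: counts.get(status, 0) for status in TRACKED_STATUSES}


# Hand-written sample in the assumed response shape (not real data).
sample_payload = {
    "total_count": 3,
    "workflow_runs": [
        {"id": 1, "status": "queued"},
        {"id": 2, "status": "in_progress"},
        {"id": 3, "status": "in_progress"},
    ],
}

print(count_runs_by_status(sample_payload))
```

Statuses other than the two tracked ones are ignored, and an empty payload yields zero for both.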

CSV file

A single bq.csv file is created; it contains simple statistics for all fetched repositories. The Fetch GitHub Action queue action uses this file to upload data to the BigQuery table efficiently.

CSV file headers: repository_owner, repository_name, queued, in_progress, timestamp.

Example content:

repository_owner,repository_name,queued,in_progress,timestamp
apache,airflow,1,3,2020-11-19 17:53:24.139806+00:00
apache,beam,0,1,2020-11-19 17:53:39.171882+00:00
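Rows in this layout could be produced with Python's standard `csv` module. The sketch below is an assumption about how such a file might be written, not the repository's actual code; `rows_to_csv` and the sample row are hypothetical. The timestamp is a timezone-aware UTC `datetime`, whose default string form matches the format shown in the example content.

```python
import csv
import io
from datetime import datetime, timezone

# Column order taken from the example content above.
HEADERS = ["repository_owner", "repository_name", "queued", "in_progress", "timestamp"]


def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADERS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical snapshot row; the values mirror the first example line.
snapshot_time = datetime(2020, 11, 19, 17, 53, 24, 139806, tzinfo=timezone.utc)
example_rows = [
    {
        "repository_owner": "apache",
        "repository_name": "airflow",
        "queued": 1,
        "in_progress": 3,
        "timestamp": snapshot_time,
    },
]

csv_text = rows_to_csv(example_rows)
print(csv_text)
```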

Processing existing json files to csv and pushing it to BigQuery

The helper script scripts/parse_existing_json_files.py can be used to process existing JSON files into a single CSV file.

Example use:

gsutil -m cp -r gs://example-bucket-name/apache gcs

python parse_existing_json_files.py \
    --input-dir gcs \
    --output bq_csv.csv

bq load --autodetect \
    --source_format=CSV \
    dataset.table bq_csv.csv

Determining ASF repositories which use GitHub Actions (matrix.json)

There is no single endpoint that returns the list of ASF repositories which use GA, and since the ASF has 2000+ repositories, obtaining this list is not a trivial task.
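One possible per-repository check, sketched below, is to ask GitHub's "list repository workflows" endpoint for each repository and keep those whose `total_count` is greater than zero. This is an assumption about the approach, not a description of this repository's scripts. The sketch skips the network calls and starts from already collected counts; the function name, the sample counts, and the `include` layout of the resulting matrix are all hypothetical.

```python
import json


def repositories_using_actions(workflow_counts):
    """Return sorted names of repositories that have at least one workflow.

    `workflow_counts` maps a repository name to the `total_count` value
    assumed to come from GitHub's "list repository workflows" response.
    """
    return sorted(name for name, count in workflow_counts.items() if count > 0)


# Hypothetical counts, written by hand for illustration (not real data).
workflow_counts = {"airflow": 12, "beam": 30, "example-without-actions": 0}

# Hypothetical matrix layout; the real matrix.json schema may differ.
matrix = {
    "include": [
        {"repository_name": name}
        for name in repositories_using_actions(workflow_counts)
    ]
}

print(json.dumps(matrix, indent=2))
```

Checking every repository this way costs at least one request per repository, which is why quota limits matter for 2000+ repositories.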

The list of repositories which use GitHub Actions is stored in matrix.json and can be updated in three ways:

Running the Python script or the action sends many requests on behalf of the GitHub access tokens used, which may exceed their quota limits.

GitHub Actions Secrets:

| Secret | Required | Description |
| --- | --- | --- |
| PERSONAL_ACCESS_TOKEN | True | Personal GitHub access token (no additional permissions are needed, so no checkboxes have to be selected) used to authorize requests. It has a bigger quota than the GITHUB_TOKEN secret. |
| GCP_PROJECT_ID | - | Google Cloud Project ID. |
| BQ_TABLE | - | BigQuery table reference to which simple statistics will be pushed (e.g. dataset.table). |
| GCP_SA_KEY | - | Google Cloud Service Account key (Service Account with permissions to Google Cloud Storage and BigQuery). |
| GCP_SA_EMAIL | - | Google Cloud Service Account email (Service Account with permissions to Google Cloud Storage and BigQuery). |

Google Cloud Platform infrastructure

All infrastructure components necessary to store statistics in BigQuery are defined in the ./terraform folder.