Gitlab CI Autoscaling Setup
Since we got rid of our Kubernetes cluster (which was also on AWS, and so still expensive), we now use a fixed-size collection of Gitlab runners living on our Openstack cluster, which we have already paid for.
These are set up using methods developed for the vg project.
We have shared Gitlab runners that run one task at a time on 8 cores, set up like this:
SSH_KEY_NAME=anovak-swords
SERVER_NAME=anovak-gitlab-runner-shared-6
FLAVOR=m1.medium
openstack --os-cloud openstack server create --image ubuntu-22.04-LTS-x86_64 --flavor ${FLAVOR} --key-name ${SSH_KEY_NAME} --wait ${SERVER_NAME}
# There is no way to find a free floating IP that already exists without fighting over it.
# Assignment steals the IP if it was already assigned elsewhere.
# See <https://stackoverflow.com/q/36497218>
IP_ID=$(openstack --os-cloud openstack floating ip create ext-net --format value --column id | head -n1)
openstack --os-cloud openstack server add floating ip ${SERVER_NAME} ${IP_ID}
sleep 60
INSTANCE_IP="$(openstack --os-cloud openstack floating ip show ${IP_ID} --column floating_ip_address --format value)"
ssh-keygen -R ${INSTANCE_IP}
ssh ubuntu@${INSTANCE_IP}
sudo su -
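# Paste the runner registration token (from the Gitlab web UI, for the entity that should own the runner) into the variable below.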
RUNNER_TOKEN=
systemctl stop docker.socket || true
systemctl stop docker.service || true
systemctl stop ephemeral-setup.service || true
rm -Rf /var/lib/docker
cat >/etc/systemd/system/ephemeral-setup.service <<'EOF'
[Unit]
Description=bind mounts ephemeral directories
Before=docker.service
Requires=mnt.mount
After=mnt.mount
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=mkdir -p /mnt/ephemeral/var/lib/docker
ExecStart=mkdir -p /var/lib/docker
ExecStart=mount --bind /mnt/ephemeral/var/lib/docker /var/lib/docker
ExecStop=umount /var/lib/docker
[Install]
RequiredBy=docker.service
EOF
systemctl daemon-reload
systemctl enable ephemeral-setup.service
systemctl start docker.socket || true
systemctl start docker.service || true
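# Optional sanity check (not part of the original steps): once docker.service is
# running, /var/lib/docker should be a bind mount onto the ephemeral disk.
systemctl status ephemeral-setup.service --no-pager || true
findmnt /var/lib/docker || true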
TASK_MEMORY=25G
TASKS_PER_NODE=1
CPUS_PER_TASK=8
bash -c "export DEBIAN_FRONTEND=noninteractive; curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | bash"
DEBIAN_FRONTEND=noninteractive apt update && sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y
DEBIAN_FRONTEND=noninteractive apt install -y docker.io gitlab-runner
gitlab-runner register --non-interactive --url https://ucsc-ci.com --token "${RUNNER_TOKEN}" --limit "${TASKS_PER_NODE}" --executor docker --docker-privileged --docker-memory "${TASK_MEMORY}" --docker-cpus "${CPUS_PER_TASK}" --docker-image docker:dind
sed -i "s/concurrent = 1/concurrent = ${TASKS_PER_NODE}/g" /etc/gitlab-runner/config.toml
echo " output_limit = 40960" >>/etc/gitlab-runner/config.toml
gitlab-runner restart
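To sanity-check the registration (these are standard gitlab-runner subcommands, not part of the original steps), you can list the configured runners and make sure they can still contact the server:
gitlab-runner list
gitlab-runner verify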
New runners should probably be set up as shared runners to prevent resources from being reserved but idle.
We also used to have Toil-dedicated runners on Openstack; these have been destroyed and are no longer used. Server creation for them at one point used:
SSH_KEY_NAME=anovak-swords
SERVER_NAME=anovak-gitlab-runner-toil-2
openstack --os-cloud openstack server create --image ubuntu-22.04-LTS-x86_64 --flavor m1.huge --key-name ${SSH_KEY_NAME} --wait ${SERVER_NAME}
while true ; do
    IP_ID=$(openstack --os-cloud openstack floating ip list --long --status DOWN --network ext-net --format value --column ID | head -n1)
    while [[ "${IP_ID}" == "" ]] ; do
        openstack --os-cloud openstack floating ip create ext-net
        IP_ID=$(openstack --os-cloud openstack floating ip list --long --status DOWN --network ext-net --format value --column ID | head -n1)
    done
    openstack --os-cloud openstack server add floating ip ${SERVER_NAME} ${IP_ID} || continue
    break
done
INSTANCE_IP="$(openstack --os-cloud openstack floating ip show ${IP_ID} --column floating_ip_address --format value)"
sleep 60
ssh ubuntu@${INSTANCE_IP}
TASK_MEMORY=15G
TASKS_PER_NODE=7
bash -c "export DEBIAN_FRONTEND=noninteractive; curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | bash"
DEBIAN_FRONTEND=noninteractive apt update && sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y
DEBIAN_FRONTEND=noninteractive apt install -y docker.io gitlab-runner
gitlab-runner register --non-interactive --url https://ucsc-ci.com --token "${RUNNER_TOKEN}" --limit "${TASKS_PER_NODE}" --executor docker --docker-privileged --docker-memory "${TASK_MEMORY}" --docker-image docker:dind
sed -i "s/concurrent = 1/concurrent = ${TASKS_PER_NODE}/g" /etc/gitlab-runner/config.toml
gitlab-runner restart
This is not recommended anymore (use the simpler code on the vg wiki that can't accidentally steal IP addresses due to a race condition). But note the flavor and the lack of a CPU limit for each task.
Previously, since we were spending too much money with our AWS-based CI setup, I deployed a new set of autoscaling Gitlab runners on our Kubernetes cluster, the compute for which we buy in bulk at a lower cost.
To do this, I followed this tutorial on how to do it using a Helm chart. This is pretty easy with Helm 3, since Helm doesn't actually need to be pre-installed on the cluster. The main problems come from the lack of ability to make arbitrary configuration changes; you can only do what the chart supports.
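If the gitlab chart repository isn't already set up on the machine you're deploying from, adding it looks roughly like this (this step isn't in my notes, but the URL is Gitlab's standard chart repository):
helm repo add gitlab https://charts.gitlab.io
helm repo update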
The basic structure is the same as in the AWS deployment: a persistent runner runs (this time as a Kubernetes pod), signs up to do jobs, and then runs the jobs in their own containers (this time as Kubernetes pods).
With this setup, the ENTRYPOINT of the Docker container that the .gitlab-ci.yml file asks to run in never runs, so I made some changes to quay.io/vgteam/dind and quay.io/vgteam/vg_ci_prebake to provide startdocker and stopdocker commands to start/stop the daemon in the container. I added these to the .gitlab-ci.yml of Toil:
image: quay.io/vgteam/vg_ci_prebake:latest
before_script:
- startdocker || true
...
after_script:
- stopdocker || true
Unfortunately, for reasons I have not yet been able to work out, starting Docker this way in a Kubernetes container requires the container to be privileged. When starting via the ENTRYPOINT on ordinary non-Kubernetes Docker this isn't the case, so in theory we should be able to overcome it, but I just gave up and let the containers run as privileged, which we can do on our own cluster.
I also had to adjust the Toil tests to not rely on having AWS access via an IAM role assigned to the hosting instances. The Kubernetes pods don't run on machines with IAM roles that they can use, but there's also no way to tell Gitlab to mount the Kubernetes secret we usually use for AWS access inside the CI jobs' containers. I switched everything to use Gitlab-managed secret credential files instead, both for AWS access to actually test AWS, and for the credentials we formerly kept in the AWS secrets manager.
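As a rough illustration of the credential-file approach (the variable name here is hypothetical, not necessarily what Toil's CI actually uses): a file-type CI/CD variable expands to a path on disk inside the job, so a job's before_script can just point the AWS tools at it.
# Assumes a Gitlab file-type CI/CD variable named AWS_CREDENTIALS_FILE holding a
# standard AWS credentials file; Gitlab writes it to disk and puts its path in the variable.
export AWS_SHARED_CREDENTIALS_FILE="${AWS_CREDENTIALS_FILE}"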
To actually set up the Gitlab runner, I grabbed the runner registration token for the Gitlab entity I wanted to own the runner (in this case, the DataBiosphere organization) and made a values.yml file with it to configure the Helm chart:
checkInterval: 30
concurrent: 20
imagePullPolicy: Always
rbac:
  create: false
  serviceAccountName: toil-svc
gitlabUrl: https://ucsc-ci.com/
runnerRegistrationToken: "!!!PASTE TOKEN HERE!!!"
runners:
  config: |
    [[runners]]
      name = "Kubernetes Runner"
      output_limit = 40960
      [runners.kubernetes]
        namespace = "toil"
        image = "quay.io/vgteam/vg_ci_prebake"
        poll_timeout = 86400
        privileged = true
        service_account = "toil-svc"
        cpu_limit = "4000m"
        cpu_request = "4000m"
        memory_limit = "15Gi"
        memory_request = "15Gi"
        ephemeral_storage_limit = "20Gi"
        ephemeral_storage_request = "20Gi"
        service_cpu_limit = "4000m"
        service_cpu_request = "4000m"
        service_memory_limit = "8Gi"
        service_memory_request = "8Gi"
        service_ephemeral_storage_limit = "20Gi"
        service_ephemeral_storage_request = "20Gi"
        helper_cpu_limit = "500m"
        helper_cpu_request = "500m"
        helper_memory_limit = "256M"
        helper_memory_request = "256M"
        helper_ephemeral_storage_limit = "20Gi"
        helper_ephemeral_storage_request = "20Gi"
Or in the old format:
cat >values.yml <<EOF
imagePullPolicy: Always
gitlabUrl: "https://ucsc-ci.com/"
runnerRegistrationToken: "!!!PASTE_TOKEN_HERE!!!"
concurrent: 10
checkInterval: 30
rbac:
  create: false
  serviceAccountName: toil-svc
runners:
  image: "quay.io/vgteam/vg_ci_prebake"
  privileged: true
  pollTimeout: 86400
  outputLimit: 40960
  namespace: toil
  serviceAccountName: toil-svc
  builds:
    cpuLimit: 4000m
    memoryLimit: 15Gi
    cpuRequests: 4000m
    memoryRequests: 15Gi
  services:
    cpuLimit: 4000m
    memoryLimit: 15Gi
    cpuRequests: 4000m
    memoryRequests: 15Gi
EOF
I upped the pollTimeout substantially from what was given in the tutorial, because the runner isn't clever enough to work out when the Kubernetes cluster is busy. It will happily sign up for 10 jobs at once from Gitlab, and then not be able to get any of its pods to start because there's no room. Upping the timeout lets it wait for a long time with the pods in Kubernetes' queue.
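When that happens, the extra job pods just sit in Pending until room frees up; a quick way to see them (assuming the runner's namespace is toil) is:
kubectl -n toil get pods --field-selector=status.phase=Pending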
Note that for Helm chart 0.23+ you should, and for Helm chart version 1.0+ you MUST, use a new syntax for the configuration, like this example for the VG runner:
checkInterval: 30
concurrent: 10
imagePullPolicy: Always
rbac:
  create: false
  serviceAccountName: vg-svc
gitlabUrl: https://ucsc-ci.com/
runnerRegistrationToken: "!!!PASTE_TOKEN_HERE!!!"
runners:
  cache:
    secretName: shared-s3-credentials-literal
  config: |
    [[runners]]
      name = "Kubernetes Runner"
      output_limit = 40960
      [runners.kubernetes]
        namespace = "vg"
        image = "quay.io/vgteam/vg_ci_prebake"
        poll_timeout = 86400
        privileged = true
        service_account = "vg-svc"
        cpu_limit = "8000m"
        cpu_request = "8000m"
        memory_limit = "25Gi"
        memory_request = "25Gi"
        ephemeral_storage_limit = "35Gi"
        ephemeral_storage_request = "10Gi"
        service_cpu_limit = "4000m"
        service_cpu_request = "4000m"
        service_memory_limit = "2Gi"
        service_memory_request = "2Gi"
        service_ephemeral_storage_limit = "35Gi"
        service_ephemeral_storage_request = "10Gi"
        helper_cpu_limit = "500m"
        helper_cpu_request = "500m"
        helper_memory_limit = "256M"
        helper_memory_request = "256M"
        helper_ephemeral_storage_limit = "35Gi"
        helper_ephemeral_storage_request = "10Gi"
      [runners.cache]
        Type = "s3"
        Path = "vg_ci/cache"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.amazonaws.com"
          BucketName = "vg-data"
          BucketLocation = "us-west-2"
          Insecure = false
After making the values.yml file, I used Helm 3 to deploy:
helm install --namespace toil gitlab-toil-kubernetes-runner -f values.yml gitlab/gitlab-runner
For this to work, I had to have access to create configmaps resources on the cluster, which Erich hadn't granted yet.
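A quick way to check whether you have that permission (a standard kubectl feature, not something I recorded doing at the time) is:
kubectl -n toil auth can-i create configmaps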
An "Error: Kubernetes cluster unreachable" message can be solved by adding --kubeconfig ~/.kube/path.to.your.config to the helm command.
I had to tweak the values.yml a few times to get it working. To apply the changes, I ran:
helm upgrade --recreate-pods --namespace toil gitlab-toil-kubernetes-runner -f values.yml gitlab/gitlab-runner
Note that every time you do this (or the pod restarts) it registers a new runner with Gitlab and gets rid of the old one. So if you had it paused before, it will become unpaused and start running jobs.
Also, to make this work, I had to get Erich to set a sensible default disk request and limit of 10 GB on the namespace. The Helm chart only allows you to set the CPU and memory requests and limits, and was using a disk limit of 500 Mi, which was much too small. Unfortunately, this setting has to be configured at the namespace level, so it affects anything else in the namespace that doesn't specify its own disk request/limit. This setting is now configured on the toil and vg namespaces.
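For reference, that kind of namespace-level default is expressed as a LimitRange; a minimal sketch of what it might look like (the object Erich actually created may differ) is:
kubectl -n toil apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-ephemeral-storage
spec:
  limits:
    - type: Container
      default:
        ephemeral-storage: 10Gi
      defaultRequest:
        ephemeral-storage: 10Gi
EOF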
If you need to update the runner (for example, to change a registration token, or to upgrade to a new Gitlab), you can:
- Do helm -n toil list to find the release to update. We will use gitlab-toil-kubernetes-runner in our example.
- Get the existing configuration with helm -n toil get values -o yaml gitlab-toil-kubernetes-runner >values.yml.
- Do helm repo update to make sure your gitlab repo is up to date. If you are on a different machine than the one you originally deployed from, you might need to add the gitlab repo as explained in the Gitlab documentation.
- Upgrade to the newest version of the chart:
  helm -n toil upgrade --recreate-pods gitlab-toil-kubernetes-runner -f values.yml gitlab/gitlab-runner
Before the Kubernetes runners, I had set up an autoscaling Gitlab runner on AWS to run multiple tests in parallel, basically following the tutorial at https://docs.gitlab.com/runner/configuration/runner_autoscale_aws/
The tutorial has you create a "bastion" instance, on which you install the Gitlab Runner, using the "docker+machine" runner type. Then the bastion instance uses Docker Machine to create and destroy other instances to do the actual testing, as needed, but from the Gitlab side it looks like a single "runner" executing multiple tests.
I created a t2.micro instance named gitlab-ci-bastion, in the gitlab-ci-runner security group, with the gitlab-ci-runner IAM role, using the Ubuntu 18.04 image. I gave it a 20 GB root volume. I protected it from termination. It got IP address 54.218.250.217.
ssh ubuntu@54.218.250.217
I made sure to authorize the "ci" SSH key to access it, in ~/.ssh/authorized_keys.
Then I installed Gitlab Runner and Docker. I had to run each command separately; copy-pasting the whole block did not work.
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | sudo bash
sudo apt-get -y -q install gitlab-runner
sudo apt-get -y -q install docker.io
sudo usermod -a -G docker gitlab-runner
sudo usermod -a -G docker ubuntu
Then I installed Docker Machine. Version 0.16.1 was current:
curl -L https://github.com/docker/machine/releases/download/v0.16.1/docker-machine-`uname -s`-`uname -m` >/tmp/docker-machine &&
chmod +x /tmp/docker-machine &&
sudo mv /tmp/docker-machine /usr/local/bin/docker-machine
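As a quick check (not in my notes, but harmless), the binary should now report the expected version:
docker-machine version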
Then I disconnected and ssh-ed back in. At that point I could successfully run docker ps.
Then I went and got the Gitlab registration token from the Gitlab web UI. I decided to register the runner to the DataBiosphere group, instead of just the Toil project.
Then I registered the Gitlab Runner with the main Gitlab server, using the actual token in place of ##CENSORED##:
sudo gitlab-ci-multi-runner register -n \
--url https://ucsc-ci.com/ \
--registration-token ##CENSORED## \
--executor docker+machine \
--description "docker-machine-runner" \
--docker-image "quay.io/vgteam/dind" \
--docker-privileged
As soon as the runner registered with the Gitlab server, I found it in the web UI and paused it, so it wouldn't start trying to run jobs until I had it configured properly.
I also at some point updated the packages on the bastion machine:
sudo apt update && sudo apt upgrade -y
I edited the /etc/gitlab-runner/config.toml file to actually configure the runner. After a bit of debugging, I got it looking like this:
# Let the runner run 10 jobs in parallel
concurrent = 10
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "docker-machine-runner"
  url = "https://ucsc-ci.com/"
  # Leave the pre-filled value here from your config.toml, or replace
  # with the registration token you are using if copy-pasting this one.
  token = "##CENSORED##"
  executor = "docker+machine"
  # Run no more than 10 machines at a time.
  limit = 10
  [runners.docker]
    tls_verify = false
    # We reuse this image because it is Ubuntu with Docker
    # available and virtualenv installed.
    image = "quay.io/vgteam/vg_ci_prebake"
    # t2.xlarge has 16 GB
    memory = "15g"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
  [runners.machine]
    IdleCount = 0
    IdleTime = 60
    # Max builds per machine before recreating
    MaxBuilds = 10
    MachineDriver = "amazonec2"
    MachineName = "gitlab-ci-machine-%s"
    MachineOptions = [
      "amazonec2-iam-instance-profile=gitlab-ci-runner",
      "amazonec2-region=us-west-2",
      "amazonec2-zone=a",
      "amazonec2-use-private-address=true",
      # Make sure to fill in your own owner details here!
      "amazonec2-tags=Owner,anovak@soe.ucsc.edu,Name,gitlab-ci-runner-machine",
      "amazonec2-security-group=gitlab-ci-runner",
      "amazonec2-instance-type=t2.xlarge",
      "amazonec2-root-size=80"
    ]
To enable this to work, I had to add some IAM policies to the gitlab-ci-runner role. It already had the AWS built-in AmazonS3ReadOnlyAccess, to let the tests read test data from S3. I gave it the AWS built-in AmazonEC2FullAccess to allow the bastion to create the machines. I also gave it gitlab-ci-runner-passrole, which I had to talk cluster-admin into creating for me, which allows the bastion to pass on the gitlab-ci-runner role to the machines it creates. That policy had the following contents:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::719818754276:role/gitlab-ci-runner"
        }
    ]
}
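For reference, attaching the built-in and custom policies from the command line would look something like this (the custom policy's exact ARN is a guess based on the account ID above, and this isn't necessarily how they were attached at the time):
aws iam attach-role-policy --role-name gitlab-ci-runner --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
aws iam attach-role-policy --role-name gitlab-ci-runner --policy-arn arn:aws:iam::719818754276:policy/gitlab-ci-runner-passrole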
After getting all the policies attached to the role, I rebooted the bastion machine to get it to actually start up the Gitlab Runner daemon:
sudo shutdown -r now
Then when it came back up I unpaused it in the Gitlab web interface, and it started running jobs. A few jobs failed, and to debug them I set the docker image to the vg_ci_prebake that vg uses (to provide packages like python-virtualenv) and added python3-dev to the packages that that image carries.
To make more changes to the image, commit to https://github.com/vgteam/vg_ci_prebake and Quay will automatically rebuild it. If you don't have rights to do that and don't want to wait around for a PR, clone the repo, edit it, and make a new Quay project to build your own version.
One change I have not yet made might be to set a high output_limit, as described in https://stackoverflow.com/a/53541010, in case the CI logs get too long.
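If we do make that change, it's just one more line under the [[runners]] section of /etc/gitlab-runner/config.toml on the bastion (40960 is the value we used for the other runners), e.g.:
[[runners]]
  # ... existing settings ...
  output_limit = 40960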
I also have not yet destroyed the old shell runner. I want to leave it in place until we are confident in the new system.
It's also useful to connect your Github repo to your Gitlab repo with a web hook on the Github side, to speed up pull mirroring. The Gitlab docs for how to do this are here, starting from step 4, where you make a token on Gitlab and configure a hook to be called by Github. We've been using a Gitlab user named "vgbot" with sufficient access to each project to refresh the mirroring, and getting access tokens for it using Gitlab administrator powers.