add documentation
CCOLLOT committed Dec 21, 2023
1 parent c74dbd1 commit fe321da
Showing 22 changed files with 1,882 additions and 6 deletions.
202 changes: 198 additions & 4 deletions README.md
@@ -1,7 +1,201 @@
# Description

An autodiscovery tool to help you know what and when to update.
Features:
+ automatically discover current software versions
+ automatically discover newer versions of that software
+ calculate the obsolescence score SLI

![boat](doc/image/grafana-dashboard.png "dashboard")
*Grafana dashboard displaying upgrade-manager's metrics, pointing out which apps need to be upgraded and their associated versions*

Key Features:
+ **Service discovery**: finds deployed software (see the supported software sources)
+ **Obsolescence Score SLI (Service Level Indicator) Calculation**: computes each software's obsolescence score and exposes it as a Prometheus metric
+ **New eligible release detection**: automatically finds the version that matches a tailor-made selection logic



### Managing IT system obsolescence is like being in a boat full of holes
![boat](doc/image/boat.png "Boat")

Let's pretend your IT systems are a boat:
+ Each piece of software is a new hole in the boat.
+ Each hole gets wider and wider as new software versions are released.
+ In such a situation, you need to decide:
  + When to patch a hole (that is, update software) so that the boat does not sink.
    + This means defining the minimum size (diameter) a hole must reach before the crew considers patching it.
  + When to row (deliver projects with business value) so that the boat makes progress.
  + Which hole to patch first; it probably makes sense to start with the widest.
    + This means you need a simple way to quickly list the widest holes.

### How upgrade-manager responds to this issue
In real life, a software's obsolescence (how big the hole is) is hard to measure, since we don't have a clear metric to quantify it.

This is why upgrade-manager computes an **obsolescence score**, exposing it as a Prometheus metric:
+ It represents **how obsolete a system is**, based on different calculation strategies (see more about calculators). It is essentially an indicator that measures obsolescence at the software level.
+ It can be summed / averaged across all apps to have a company-wide freshness/obsolescence metric.
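
As a rough illustration of the company-wide aggregate, the sketch below builds an instant query against the Prometheus HTTP API and reads the averaged score out of the response. It is a minimal sketch, not part of upgrade-manager: the Prometheus address is a placeholder, the helper names are hypothetical, and the metric name and `isparent` label are the ones used in the alerting examples of this README.

```python
import json
from urllib.parse import urlencode

# Metric name and label as used in the alerting examples of this README.
AVG_QUERY = 'avg(upgrade_manager_software_obsolescence_score{isparent="1"})'


def build_query_url(prometheus_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    base = prometheus_url.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def parse_single_value(payload: str) -> float:
    """Extract the value from an instant-query response that holds one sample."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    result = body["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one sample, got {len(result)}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


if __name__ == "__main__":
    # Hypothetical Prometheus address: replace it with your own.
    print(build_query_url("http://prometheus.example:9090", AVG_QUERY))
    # Hand-written sample response, shaped like a Prometheus instant-vector result.
    sample = (
        '{"status":"success","data":{"resultType":"vector",'
        '"result":[{"metric":{},"value":[1703160000,"42.5"]}]}}'
    )
    print(parse_single_value(sample))
```

Swapping `avg` for `sum` in the query gives the summed variant mentioned above.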

## Installation
You can use the default Helm chart to deploy upgrade-manager as follows:

```bash
# OCI charts are installed directly from the registry (Helm 3.8+);
# `helm repo add` does not accept oci:// URLs.
helm install upgrade-manager oci://public.ecr.aws/qonto/upgrade-manager-chart \
  -n upgrade-manager --create-namespace
```

## Configuration
upgrade-manager uses a YAML configuration file (the default location is `/app/config/config.yaml`).

For more information about each source's configuration block, see [sources](./doc/sources/README.md).

```yaml
global:
  interval: 10m # How often upgrade-manager runs the main loop (discover all software, find new versions, compute scores)
  aws:
    region: us-east-1 # AWS region to look for resources in
sources:
  deployments:
    - <deploymentConfig>
  argocdHelm:
    - <argoCDHelmConfig>
  filesystemHelm:
    - <filesystemHelmConfig>
  aws:
    eks: <eksConfig>
    rds: <rdsConfig>
    msk: <mskConfig>
    elasticache: <elasticacheConfig>
    lambda: <lambdaConfig>
http:
  host: 0.0.0.0 # local server address
  port: 10000 # local port to listen on
  write-timeout: 10
  read-timeout: 10
  read-header-timeout: 10
```
## Required AWS Privileges when running AWS sources
To automatically discover AWS resources and their newer versions, upgrade-manager needs the following AWS privileges (represented in Terraform HCL):
```hcl
data "aws_iam_policy_document" "upgrade-manager" {
  statement {
    sid = "elasticache"

    actions = [
      "elasticache:DescribeCacheClusters",
      "elasticache:DescribeCacheEngineVersions",
    ]

    resources = [
      "*",
    ]
  }

  statement {
    sid = "ecr"

    actions = [
      "ecr:ListImages",
      "ecr:GetAuthorizationToken",
      "ecr:BatchGetImage",
      "ecr:GetDownloadUrlForLayer",
    ]

    resources = [
      "*",
    ]
  }

  statement {
    sid = "eks"

    actions = [
      "eks:DescribeAddon",
      "eks:DescribeAddonVersions",
      "eks:ListClusters",
      "eks:DescribeCluster",
      "eks:ListAddons",
    ]

    resources = [
      "*",
    ]
  }

  statement {
    sid = "rds"

    actions = [
      "rds:DescribeDBInstances",
      "rds:DescribeDBEngineVersions",
    ]

    resources = [
      "*",
    ]
  }

  statement {
    sid = "lambda"

    actions = [
      "lambda:ListFunctions",
    ]

    resources = [
      "*",
    ]
  }

  statement {
    sid = "kafka"

    actions = [
      "kafka:ListClustersV2",
      "kafka:GetCompatibleKafkaVersions",
    ]

    resources = [
      "*",
    ]
  }
}
```
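
If you do not manage IAM with Terraform, the same privileges can be expressed as a plain IAM policy JSON document. The sketch below generates one from the action lists above. It assumes the default `Allow` effect that Terraform applies when a statement sets none; the `build_policy` helper and the `ACTIONS_BY_SID` name are illustrative, not part of upgrade-manager.

```python
import json

# Actions copied from the Terraform policy document above, keyed by statement ID.
ACTIONS_BY_SID = {
    "elasticache": [
        "elasticache:DescribeCacheClusters",
        "elasticache:DescribeCacheEngineVersions",
    ],
    "ecr": [
        "ecr:ListImages",
        "ecr:GetAuthorizationToken",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
    ],
    "eks": [
        "eks:DescribeAddon",
        "eks:DescribeAddonVersions",
        "eks:ListClusters",
        "eks:DescribeCluster",
        "eks:ListAddons",
    ],
    "rds": [
        "rds:DescribeDBInstances",
        "rds:DescribeDBEngineVersions",
    ],
    "lambda": ["lambda:ListFunctions"],
    "kafka": [
        "kafka:ListClustersV2",
        "kafka:GetCompatibleKafkaVersions",
    ],
}


def build_policy(actions_by_sid: dict) -> dict:
    """Build an IAM policy document with one Allow statement per statement ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": sid, "Effect": "Allow", "Action": actions, "Resource": "*"}
            for sid, actions in actions_by_sid.items()
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(ACTIONS_BY_SID), indent=2))
```

You only need the statements for the AWS sources you actually enable; dropping a key from `ACTIONS_BY_SID` removes the matching statement.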


## Problem Statement: why use upgrade-manager?

As we scale, we deploy more and more software. We end up with hundreds of components to maintain across many different platforms.

Each of them needs to be updated regularly to benefit from security fixes, new features, etc.


## Dashboard
To visualize metrics, you can use the dashboard available in the `dashboard/` directory.

## Alerting Patterns: deciding when to update software
As SREs, we like to define SLOs/SLAs based on SLIs. Among other things, it helps us take action based on specific thresholds.
Using our **obsolescence score** metric, we can now define thresholds to help us decide when we need to act and update software.

Each company has its own way to define alerting rules, but a sane default could be:

1. The obsolescence score for each individual software should be < 100
```yaml
- alert: ObsolescenceScoreTooHigh
  expr: upgrade_manager_software_obsolescence_score{isparent="1"} > 99
  for: 5m
  labels:
    team: foo
  annotations:
    summary: "The software {{ $labels.app }} is obsolete (>99) and needs to be updated"
    runbook_url: https://letmegooglethat.com/?q=how+to+update+softwares
```
2. The average obsolescence score for all apps should be < 80
```yaml
- alert: AverageObsolescenceScoreTooHigh
  expr: avg(upgrade_manager_software_obsolescence_score{isparent="1"}) >= 80
  for: 5m
  labels:
    team: foo
  annotations:
    summary: "The average obsolescence score across all software is too high; software needs to be updated"
    runbook_url: https://letmegooglethat.com/?q=how+to+update+softwares
```
![boat](doc/image/score-based-alerting.png "score-based alerting")
When an alert is triggered, an engineer should acknowledge it and upgrade the applications with the highest obsolescence scores.
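
To make the two rules concrete, here is a minimal sketch that applies them to a set of scores and ranks the offenders so the highest score comes first. The application names and scores are made up, and the helper functions are hypothetical; in practice Prometheus evaluates the rules, and this only mirrors their intent (each score should stay below 100, the average below 80).

```python
from statistics import mean

PER_APP_THRESHOLD = 99  # rule 1: an individual score above 99 is too high
AVERAGE_THRESHOLD = 80  # rule 2: the average score should stay below 80


def apps_over_threshold(scores: dict, threshold: float = PER_APP_THRESHOLD) -> list:
    """Return (app, score) pairs above the threshold, highest score first."""
    firing = [(app, score) for app, score in scores.items() if score > threshold]
    return sorted(firing, key=lambda pair: pair[1], reverse=True)


def average_too_high(scores: dict, threshold: float = AVERAGE_THRESHOLD) -> bool:
    """Return True when the average score reaches the threshold."""
    if not scores:
        return False  # nothing discovered, nothing to alert on
    return mean(list(scores.values())) >= threshold


if __name__ == "__main__":
    # Made-up scores for illustration only.
    scores = {"billing-api": 120, "search": 50, "postgres-main": 100}
    for app, score in apps_over_threshold(scores):
        print(f"{app}: {score}")
    print("average too high:", average_too_high(scores))
```

Sorting by score gives the on-call engineer the upgrade order directly: start at the top of the list.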