[release-1.28] Backports for 2024-04 release cycle #9911

Merged · 16 commits · Apr 11, 2024
1 change: 1 addition & 0 deletions cmd/cert/main.go
@@ -16,6 +16,7 @@ func main() {
app := cmds.NewApp()
app.Commands = []cli.Command{
cmds.NewCertCommands(
cert.Check,
cert.Rotate,
cert.RotateCA,
),
1 change: 1 addition & 0 deletions cmd/k3s/main.go
@@ -76,6 +76,7 @@ func main() {
cmds.NewCertCommands(
certCommand,
certCommand,
certCommand,
),
cmds.NewCompletionCommand(internalCLIAction(version.Program+"-completion", dataDir, os.Args)),
}
1 change: 1 addition & 0 deletions cmd/server/main.go
@@ -72,6 +72,7 @@ func main() {
secretsencrypt.RotateKeys,
),
cmds.NewCertCommands(
cert.Check,
cert.Rotate,
cert.RotateCA,
),
30 changes: 30 additions & 0 deletions docs/adrs/cert-expiry-checks.md
@@ -0,0 +1,30 @@
# Add Support for Checking and Alerting on Certificate Expiry

Date: 2024-03-26

## Status

Accepted

## Context

The certificates generated by K3s have two lifecycles:
* Certificate authority certificates expire 3650 days (roughly 10 years) after issuance.
The CA certificates are not automatically renewed, and require manual intervention to extend their validity.
* Leaf certificates (client and server certs) expire 365 days (roughly 1 year) after issuance.
These certificates are automatically renewed if they are within 90 days of expiring when K3s starts; a sketch of such an expiry check follows the list.
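
For a concrete picture of what such a check involves, the following minimal Go sketch (illustrative only, not K3s source) reads a single PEM certificate and reports how close it is to the 90-day renewal window; the file path is just an example of where K3s keeps a server certificate.

```go
// Minimal sketch, not K3s code: read one PEM-encoded certificate and report
// how many days remain before it expires, using the 90-day renewal window
// described above. The path below is an example location only.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	data, err := os.ReadFile("/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt")
	if err != nil {
		log.Fatal(err)
	}
	// Only the first PEM block is inspected; chained certs are ignored in this sketch.
	block, _ := pem.Decode(data)
	if block == nil {
		log.Fatal("no PEM data found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	remaining := time.Until(cert.NotAfter)
	fmt.Printf("%s expires %s (%d days remaining)\n",
		cert.Subject.CommonName, cert.NotAfter.Format(time.RFC3339), int(remaining.Hours()/24))
	if remaining < 90*24*time.Hour {
		fmt.Println("certificate is within the 90-day renewal window")
	}
}
```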

K3s does not currently expose any information about certificate validity.
There are no metrics, CLI tools, or events that an administrator can use to track when certificates must be renewed or rotated to avoid outages when certificates expire.
The best we can do at the moment is recommend that administrators either restart their nodes regularly, to ensure that certificates are renewed within the 90-day window, or manually rotate their certificates yearly.

We do not have any guidance around renewing the CA certs, which will be a major undertaking for users as their clusters approach the 10-year mark. We currently have a bit of runway on this issue, as K3s has not been around for 10 years.

## Decision

* K3s will add a CLI command to print certificate validity. It will be grouped alongside the command used to rotate the leaf certificates (`k3s certificate rotate`).
* K3s will add an internal controller that maintains metrics for certificate expiration and creates Events when certificates are about to expire or have already expired; a simplified sketch of such a controller loop follows.
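
As a rough illustration of the second point, the sketch below shows the shape of a periodic expiry-check loop. The `expiryRecorder` interface is a stand-in for whatever metric gauge and Event writer the real controller uses; none of this is the actual K3s implementation.

```go
// Simplified sketch of the controller loop described above (not the actual
// K3s implementation): periodically walk a set of certificates, record their
// expiry, and log a warning once a certificate is inside the renewal window.
package certmonitor

import (
	"context"
	"crypto/x509"
	"log"
	"time"
)

// expiryRecorder is a placeholder for a Prometheus gauge and/or Kubernetes Event writer.
type expiryRecorder interface {
	RecordExpiry(name string, notAfter time.Time)
}

func runExpiryChecks(ctx context.Context, certs map[string]*x509.Certificate, rec expiryRecorder) {
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()
	for {
		for name, cert := range certs {
			rec.RecordExpiry(name, cert.NotAfter)
			if remaining := time.Until(cert.NotAfter); remaining < 90*24*time.Hour {
				log.Printf("certificate %s expires in %s", name, remaining.Round(time.Hour))
			}
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```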

## Consequences

This will require additional documentation, CLI subcommands, and QA work to validate the process steps.
220 changes: 111 additions & 109 deletions go.mod

Large diffs are not rendered by default.

91 changes: 50 additions & 41 deletions go.sum

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion manifests/coredns.yaml
@@ -120,7 +120,7 @@ spec:
k8s-app: kube-dns
containers:
- name: coredns
image: %{SYSTEM_DEFAULT_REGISTRY}%rancher/mirrored-coredns-coredns:1.10.1
image: "%{SYSTEM_DEFAULT_REGISTRY}%rancher/mirrored-coredns-coredns:1.10.1"
imagePullPolicy: IfNotPresent
resources:
limits:
5 changes: 3 additions & 2 deletions manifests/local-storage.yaml
@@ -67,7 +67,7 @@ spec:
effect: "NoSchedule"
containers:
- name: local-path-provisioner
image: %{SYSTEM_DEFAULT_REGISTRY}%rancher/local-path-provisioner:v0.0.26
image: "%{SYSTEM_DEFAULT_REGISTRY}%rancher/local-path-provisioner:v0.0.26"
imagePullPolicy: IfNotPresent
command:
- local-path-provisioner
@@ -92,6 +92,7 @@ kind: StorageClass
metadata:
name: local-path
annotations:
defaultVolumeType: "local"
storageclass.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer
@@ -155,5 +156,5 @@ data:
spec:
containers:
- name: helper-pod
image: %{SYSTEM_DEFAULT_REGISTRY}%rancher/mirrored-library-busybox:1.36.1
image: "%{SYSTEM_DEFAULT_REGISTRY}%rancher/mirrored-library-busybox:1.36.1"
imagePullPolicy: IfNotPresent
2 changes: 1 addition & 1 deletion manifests/metrics-server/metrics-server-deployment.yaml
@@ -44,7 +44,7 @@ spec:
emptyDir: {}
containers:
- name: metrics-server
image: %{SYSTEM_DEFAULT_REGISTRY}%rancher/mirrored-metrics-server:v0.7.0
image: "%{SYSTEM_DEFAULT_REGISTRY}%rancher/mirrored-metrics-server:v0.7.0"
args:
- --cert-dir=/tmp
- --secure-port=10250
13 changes: 7 additions & 6 deletions manifests/traefik.yaml
@@ -5,29 +5,30 @@ metadata:
name: traefik-crd
namespace: kube-system
spec:
chart: https://%{KUBERNETES_API}%/static/charts/traefik-crd-25.0.2+up25.0.0.tgz
chart: https://%{KUBERNETES_API}%/static/charts/traefik-crd-25.0.3+up25.0.0.tgz
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
name: traefik
namespace: kube-system
spec:
chart: https://%{KUBERNETES_API}%/static/charts/traefik-25.0.2+up25.0.0.tgz
chart: https://%{KUBERNETES_API}%/static/charts/traefik-25.0.3+up25.0.0.tgz
set:
global.systemDefaultRegistry: "%{SYSTEM_DEFAULT_REGISTRY_RAW}%"
valuesContent: |-
podAnnotations:
prometheus.io/port: "8082"
prometheus.io/scrape: "true"
deployment:
podAnnotations:
prometheus.io/port: "8082"
prometheus.io/scrape: "true"
providers:
kubernetesIngress:
publishedService:
enabled: true
priorityClassName: "system-cluster-critical"
image:
repository: "rancher/mirrored-library-traefik"
tag: "2.10.5"
tag: "2.10.7"
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
2 changes: 2 additions & 0 deletions package/Dockerfile
@@ -3,6 +3,8 @@ RUN apk add -U ca-certificates tar zstd tzdata
COPY build/out/data.tar.zst /
RUN mkdir -p /image/etc/ssl/certs /image/run /image/var/run /image/tmp /image/lib/modules /image/lib/firmware && \
tar -xa -C /image -f /data.tar.zst && \
echo "root:x:0:0:root:/:/bin/sh" > /image/etc/passwd && \
echo "root:x:0:" > /image/etc/group && \
cp /etc/ssl/certs/ca-certificates.crt /image/etc/ssl/certs/ca-certificates.crt

FROM scratch as collect
2 changes: 1 addition & 1 deletion pkg/agent/config/config.go
@@ -362,7 +362,7 @@ func get(ctx context.Context, envInfo *cmds.Agent, proxy proxy.Proxy) (*config.N
// If the supervisor and externally-facing apiserver are not on the same port, tell the proxy where to find the apiserver.
if controlConfig.SupervisorPort != controlConfig.HTTPSPort {
isIPv6 := utilsnet.IsIPv6(net.ParseIP([]string{envInfo.NodeIP.String()}[0]))
if err := proxy.SetAPIServerPort(ctx, controlConfig.HTTPSPort, isIPv6); err != nil {
if err := proxy.SetAPIServerPort(controlConfig.HTTPSPort, isIPv6); err != nil {
return nil, errors.Wrapf(err, "failed to setup access to API Server port %d on at %s", controlConfig.HTTPSPort, proxy.SupervisorURL())
}
}
16 changes: 10 additions & 6 deletions pkg/agent/containerd/containerd.go
@@ -353,19 +353,23 @@ func prePullImages(ctx context.Context, client *containerd.Client, imageClient r
scanner := bufio.NewScanner(imageList)
for scanner.Scan() {
name := strings.TrimSpace(scanner.Text())
if _, err := imageClient.ImageStatus(ctx, &runtimeapi.ImageStatusRequest{

if status, err := imageClient.ImageStatus(ctx, &runtimeapi.ImageStatusRequest{
Image: &runtimeapi.ImageSpec{
Image: name,
},
}); err == nil {
}); err == nil && status.Image != nil && len(status.Image.RepoTags) > 0 {
logrus.Infof("Image %s has already been pulled", name)
if image, err := imageService.Get(ctx, name); err != nil {
errs = append(errs, err)
} else {
images = append(images, image)
for _, tag := range status.Image.RepoTags {
if image, err := imageService.Get(ctx, tag); err != nil {
errs = append(errs, err)
} else {
images = append(images, image)
}
}
continue
}

logrus.Infof("Pulling image %s", name)
if _, err := imageClient.PullImage(ctx, &runtimeapi.PullImageRequest{
Image: &runtimeapi.ImageSpec{
35 changes: 13 additions & 22 deletions pkg/agent/loadbalancer/loadbalancer.go
@@ -16,7 +16,10 @@ import (

// server tracks the connections to a server, so that they can be closed when the server is removed.
type server struct {
// This mutex protects access to the connections map. All direct access to the map should be protected by it.
mutex sync.Mutex
address string
healthCheck func() bool
connections map[net.Conn]struct{}
}

@@ -31,7 +34,9 @@ type serverConn struct {
// actually balance connections, but instead fails over to a new server only
// when a connection attempt to the currently selected server fails.
type LoadBalancer struct {
mutex sync.Mutex
// This mutex protects access to servers map and randomServers list.
// All direct access to the servers map/list should be protected by it.
mutex sync.RWMutex
proxy *tcpproxy.Proxy

serviceName string
@@ -123,26 +128,9 @@ func New(ctx context.Context, dataDir, serviceName, serverURL string, lbServerPo
}
logrus.Infof("Running load balancer %s %s -> %v [default: %s]", serviceName, lb.localAddress, lb.ServerAddresses, lb.defaultServerAddress)

return lb, nil
}

func (lb *LoadBalancer) SetDefault(serverAddress string) {
lb.mutex.Lock()
defer lb.mutex.Unlock()

_, hasOriginalServer := sortServers(lb.ServerAddresses, lb.defaultServerAddress)
// if the old default server is not currently in use, remove it from the server map
if server := lb.servers[lb.defaultServerAddress]; server != nil && !hasOriginalServer {
defer server.closeAll()
delete(lb.servers, lb.defaultServerAddress)
}
// if the new default server doesn't have an entry in the map, add one
if _, ok := lb.servers[serverAddress]; !ok {
lb.servers[serverAddress] = &server{connections: make(map[net.Conn]struct{})}
}
go lb.runHealthChecks(ctx)

lb.defaultServerAddress = serverAddress
logrus.Infof("Updated load balancer %s default server address -> %s", lb.serviceName, serverAddress)
return lb, nil
}

func (lb *LoadBalancer) Update(serverAddresses []string) {
@@ -166,15 +154,18 @@ func (lb *LoadBalancer) LoadBalancerServerURL() string {
return lb.localServerURL
}

func (lb *LoadBalancer) dialContext(ctx context.Context, network, address string) (net.Conn, error) {
func (lb *LoadBalancer) dialContext(ctx context.Context, network, _ string) (net.Conn, error) {
lb.mutex.RLock()
defer lb.mutex.RUnlock()

startIndex := lb.nextServerIndex
for {
targetServer := lb.currentServerAddress

server := lb.servers[targetServer]
if server == nil || targetServer == "" {
logrus.Debugf("Nil server for load balancer %s: %s", lb.serviceName, targetServer)
} else {
} else if server.healthCheck() {
conn, err := server.dialContext(ctx, network, targetServer)
if err == nil {
return conn, nil
74 changes: 67 additions & 7 deletions pkg/agent/loadbalancer/servers.go
@@ -8,6 +8,7 @@ import (
"net/url"
"os"
"strconv"
"time"

"github.com/k3s-io/k3s/pkg/version"
http_dialer "github.com/mwitkow/go-http-dialer"
@@ -17,6 +18,7 @@ import (

"github.com/sirupsen/logrus"
"k8s.io/apimachinery/pkg/util/sets"
"k8s.io/apimachinery/pkg/util/wait"
)

var defaultDialer proxy.Dialer = &net.Dialer{}
@@ -73,7 +75,11 @@ func (lb *LoadBalancer) setServers(serverAddresses []string) bool {

for addedServer := range newAddresses.Difference(curAddresses) {
logrus.Infof("Adding server to load balancer %s: %s", lb.serviceName, addedServer)
lb.servers[addedServer] = &server{connections: make(map[net.Conn]struct{})}
lb.servers[addedServer] = &server{
address: addedServer,
connections: make(map[net.Conn]struct{}),
healthCheck: func() bool { return true },
}
}

for removedServer := range curAddresses.Difference(newAddresses) {
@@ -106,8 +112,8 @@
}

func (lb *LoadBalancer) nextServer(failedServer string) (string, error) {
lb.mutex.Lock()
defer lb.mutex.Unlock()
lb.mutex.RLock()
defer lb.mutex.RUnlock()

if len(lb.randomServers) == 0 {
return "", errors.New("No servers in load balancer proxy list")
Expand Down Expand Up @@ -162,10 +168,12 @@ func (s *server) closeAll() {
s.mutex.Lock()
defer s.mutex.Unlock()

logrus.Debugf("Closing %d connections to load balancer server", len(s.connections))
for conn := range s.connections {
// Close the connection in a goroutine so that we don't hold the lock while doing so.
go conn.Close()
if l := len(s.connections); l > 0 {
logrus.Infof("Closing %d connections to load balancer server %s", len(s.connections), s.address)
for conn := range s.connections {
// Close the connection in a goroutine so that we don't hold the lock while doing so.
go conn.Close()
}
}
}

@@ -178,3 +186,55 @@ func (sc *serverConn) Close() error {
delete(sc.server.connections, sc)
return sc.Conn.Close()
}

// SetDefault sets the selected address as the default / fallback address
func (lb *LoadBalancer) SetDefault(serverAddress string) {
lb.mutex.Lock()
defer lb.mutex.Unlock()

_, hasOriginalServer := sortServers(lb.ServerAddresses, lb.defaultServerAddress)
// if the old default server is not currently in use, remove it from the server map
if server := lb.servers[lb.defaultServerAddress]; server != nil && !hasOriginalServer {
defer server.closeAll()
delete(lb.servers, lb.defaultServerAddress)
}
// if the new default server doesn't have an entry in the map, add one
if _, ok := lb.servers[serverAddress]; !ok {
lb.servers[serverAddress] = &server{
address: serverAddress,
healthCheck: func() bool { return true },
connections: make(map[net.Conn]struct{}),
}
}

lb.defaultServerAddress = serverAddress
logrus.Infof("Updated load balancer %s default server address -> %s", lb.serviceName, serverAddress)
}

// SetHealthCheck adds a health-check callback to an address, replacing the default no-op function.
func (lb *LoadBalancer) SetHealthCheck(address string, healthCheck func() bool) {
lb.mutex.Lock()
defer lb.mutex.Unlock()

if server := lb.servers[address]; server != nil {
logrus.Debugf("Added health check for load balancer %s: %s", lb.serviceName, address)
server.healthCheck = healthCheck
} else {
logrus.Errorf("Failed to add health check for load balancer %s: no server found for %s", lb.serviceName, address)
}
}

// runHealthChecks periodically health-checks all servers. Any servers that fail the health-check will have their
// connections closed, to force clients to switch over to a healthy server.
func (lb *LoadBalancer) runHealthChecks(ctx context.Context) {
wait.Until(func() {
lb.mutex.RLock()
defer lb.mutex.RUnlock()
for _, server := range lb.servers {
if !server.healthCheck() {
defer server.closeAll()
}
}
}, time.Second, ctx.Done())
logrus.Debugf("Stopped health checking for load balancer %s", lb.serviceName)
}