Commit 570c531

Authored by olevski, rokroskar, aledegano, Panaetius
feat: gateway refactor RFC (#20)
Co-authored-by: Rok Roškar <rok.roskar@sdsc.ethz.ch>
Co-authored-by: aledegano <40891147+aledegano@users.noreply.github.com>
Co-authored-by: Ralf Grubenmann <ralf.grubenmann@protonmail.com>
1 parent 98b396d commit 570c531

1 file changed

+225
-0
lines changed
- Start Date: 09-12-2022
- Status: Proposed

# Gateway refactor

## Summary

This RFC proposes a full rewrite of the current gateway service. It also proposes
changes to the way access control to Jupyter servers is handled. Currently, each Jupyter
server establishes its own session with the user. Going forward, each user will have a session
with the gateway, which will be used to control access to each Jupyter server.

The main motivation for changing the access control for Jupyter servers is the desire to run
each Jupyter server in its own subdomain. The problem with unique subdomains is that each Jupyter
server's subdomain would have to be registered in Keycloak so that the oauth2-proxy in each
Jupyter server can function properly. If the OAuth session is instead handled by the gateway, we can
keep a single callback address in Keycloak, as we currently do.

## Motivation

Reasons for rewriting the gateway:
- the current code base is hard to maintain
- one service could replace both Traefik and the gateway
- there are problems with managing sessions across different services/clients
- using golang should result in more efficient use of resources and improved performance

## Requirements

The gateway currently plays an integral part in two separate concerns:

- managing access from external entities (browsers, command line tools) to Renku resources
  - we need to ensure that all requests hitting our internal endpoints have appropriate identities and authorizations
- supporting access from (internal) Renku services to external protected entities
  - in this case we use tokens generated by external services and managed by Renku to access these resources

## Design Detail

### Gateway

The gateway has to maintain sessions with different clients. Some of these are:
- User servers (i.e. the sidecar container needs to be able to swap some kind of ID
  or access token for a Gitlab token)
- Actual users logging in through the UI
- Actual users logging in through the CLI

Using JWT access tokens for this purpose is not feasible because access
tokens from Keycloak usually expire within one to several hours. This means
that anyone (i.e. users, services or Jupyter servers) who has to remain authenticated
with the gateway for a long time also needs to get a refresh token and take care of
using it to obtain valid access tokens. Accepting an expired access token is not secure.

The gateway should instead have server-side sessions with every "client". For each such
session the gateway will issue a random, hard-to-guess ID (i.e. a [session ID](https://en.wikipedia.org/wiki/Session_ID))
that will be stored either in a cookie or simply as a secret. This is exactly how
sessions are handled by the ui-server. The cookie or ID is presented with every
request to the gateway. The gateway looks this ID up in Redis and, if there is a match,
any tokens/credentials stored under that key can be added to requests initiated by that ID.
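
The issue-and-lookup flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory map stands in for Redis, and the names `SessionStore`, `Credentials`, `NewSessionID`, and the 32-byte ID length are hypothetical choices, not part of this RFC.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"sync"
)

// Credentials holds whatever the gateway keeps for one session.
// The field names are illustrative and not taken from the RFC.
type Credentials struct {
	KeycloakAccessToken string
	GitAccessToken      string
}

// SessionStore is an in-memory stand-in for the Redis store (assumption).
type SessionStore struct {
	mu       sync.RWMutex
	sessions map[string]Credentials
}

// NewSessionStore returns an empty store.
func NewSessionStore() *SessionStore {
	return &SessionStore{sessions: make(map[string]Credentials)}
}

// NewSessionID returns 32 cryptographically random bytes as a URL-safe string.
func NewSessionID() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// Create issues a new session ID and stores the credentials under it.
func (s *SessionStore) Create(c Credentials) (string, error) {
	id, err := NewSessionID()
	if err != nil {
		return "", err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[id] = c
	return id, nil
}

// Lookup returns the credentials for id, or false if the session is unknown.
func (s *SessionStore) Lookup(id string) (Credentials, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	c, ok := s.sessions[id]
	return c, ok
}

// Destroy removes a session, for example on logout.
func (s *SessionStore) Destroy(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.sessions, id)
}

func main() {
	store := NewSessionStore()
	id, err := store.Create(Credentials{KeycloakAccessToken: "kc-token"})
	if err != nil {
		panic(err)
	}
	creds, ok := store.Lookup(id)
	fmt.Println(ok, creds.KeycloakAccessToken)
}
```

Because the ID carries no information itself, revoking a session is just a matter of deleting the key, which is what `Destroy` does here.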

These sessions are started and managed as follows:
- when a user logs in they get a session
- when a user server is started the server gets a Redis-based session
- when a login happens through the CLI there is also a session
- anonymous users also have Redis-based sessions, but these are empty
  since there are no additional credentials that need to be managed for anonymous users

The gateway then has the following responsibilities:
- store authentication tokens
- update stored authentication tokens
- refresh authentication tokens before expiration
- destroy session IDs upon predetermined events (e.g. logout)

### Centralized Jupyter server authentication

### Diagrams

1. Creating a user server

```mermaid
graph TD
  Ingress -- 1<br>Session ID --> Gateway

  subgraph GatewayGroup[ ]
    Gateway
    DB[(DB)]
    Gateway -- 2<br>Session ID --> DB
    Gateway -- 3<br>Server ID --> DB
    DB -- 4<br>KC access token<br>Server ID --> Gateway
  end

  Gateway -- 5<br>KC access token<br>KC ID token<br>Server ID--> Notebook[Notebook service]
  Notebook -- 6<br>KC user ID<br>Server ID--> Server[User server]
```

2. Accessing a user server

```mermaid
graph TD
  Ingress -- 1<br>Session ID --> Gateway

  subgraph GatewayGroup[ ]
    Gateway
    DB[(DB)]
    Gateway -- 2<br>Session ID --> DB
    DB -- 3<br>KC access token --> Gateway
  end

  Gateway -- 4<br>KC access token --> Proxy

  subgraph Server[User server]
    Proxy -- 5 --> Jupyter
  end
```

3. User server sidecar accessing a protected service

```mermaid
graph TD
  Server[User server sidecar]
  Gitlab

  subgraph GatewayGroup[ ]
    Gateway
    DB[(DB)]
  end

  Server -- 1<br>Server ID --> Gateway
  Gateway -- 2<br>Server ID --> DB
  DB -- 3<br>Git access token --> Gateway
  Gateway -- 4<br>Git access token --> Gitlab
```

4. Sticky sessions for a Renku service

```mermaid
graph TD
  Ingress

  subgraph CoreGroup[ ]
    CoreTraefik[Traefik]
    Core[Core service]
  end

  subgraph GatewayGroup[ ]
    Gateway
    DB[(DB)]
  end

  Ingress -- 1<br>Session ID --> CoreTraefik
  CoreTraefik -- 2<br>Session ID<br>Sticky session --> Gateway
  Gateway -- 3<br>Session ID --> DB
  DB -- 4<br>KC access token<br>Git access token --> Gateway
  Gateway -- 5<br>KC access token<br>Git access token<br>Sticky session --> Core
```

5. Requests from the ui-server

Here the gateway will simply require fully valid Keycloak access tokens
(which the ui-server already has). If this is implemented, the ui-server
will be the only component that comes to the gateway with Keycloak
access tokens. All others will come with a session ID. So, to further
tighten security, the gateway can accept Keycloak access tokens only with the
`ui-server` client as audience and reject all others.
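
The audience restriction could be sketched as below. This is a minimal illustration that inspects only the `aud` claim of an already decoded token payload. It assumes signature and expiry verification happen elsewhere, and the names `Audience`, `Claims`, and `AcceptsAudience` are hypothetical. Per RFC 7519 the `aud` claim may be a single string or a list of strings, so the sketch handles both.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Audience decodes the JWT "aud" claim, which may be a string or a list.
type Audience []string

// UnmarshalJSON accepts both the single-string and the array form of "aud".
func (a *Audience) UnmarshalJSON(data []byte) error {
	var single string
	if err := json.Unmarshal(data, &single); err == nil {
		*a = Audience{single}
		return nil
	}
	var many []string
	if err := json.Unmarshal(data, &many); err != nil {
		return err
	}
	*a = many
	return nil
}

// Claims holds only the claim this sketch cares about.
type Claims struct {
	Audience Audience `json:"aud"`
}

// AcceptsAudience reports whether the decoded token payload lists want as an
// audience. It does NOT verify the token signature or expiry (assumption:
// that has already been done by the caller).
func AcceptsAudience(payload []byte, want string) (bool, error) {
	var c Claims
	if err := json.Unmarshal(payload, &c); err != nil {
		return false, err
	}
	for _, aud := range c.Audience {
		if aud == want {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, err := AcceptsAudience([]byte(`{"aud":["ui-server","account"]}`), "ui-server")
	fmt.Println(ok, err)
	ok, err = AcceptsAudience([]byte(`{"aud":"renku"}`), "ui-server")
	fmt.Println(ok, err)
}
```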

## Drawbacks

- More traffic will go through the gateway
- The gateway will have sessions with multiple clients

## Rationale and Alternatives

1. Why can't we simply use JWT access tokens everywhere?
JWT access tokens from Keycloak expire and it is unsafe to accept expired
access tokens. So if we go this route then everyone who is authenticated
is forced to refresh their access token. But in order to refresh an access token
the services need to receive additional credentials. In addition, the services
must now periodically check and worry about keeping their tokens refreshed. This
should not be the responsibility of Renku components but rather of the gateway.
The only expectation we should have of Renku components is that they verify a provided
access token and confirm that i) it is valid and ii) it has the expected audience and/or claims.

2. Why can't we use the same session ID for users and servers?
The main reason is that these are expected to have different lifetimes.
When a user logs out, their session is removed. But they may leave servers running
that require access to credentials so that they continue operating properly (even
when their creator has logged out). The server session with the gateway will be
removed when the server is shut down.

3. Why not simply require a user/service to present a Keycloak user ID for authentication
instead of the randomly generated session ID?
Using the never-changing Keycloak user IDs for authentication is dangerous. It means
that the moment someone finds out a user's ID they can impersonate that user. Temporary,
randomly generated session IDs are safer: even if an ID is stolen,
the user can be impersonated only for the duration of the session. Renku admins also
have control over these sessions and can force a specific user to be logged out, whereas
with static Keycloak user IDs the admin would have to fully remove or disable the user in
Keycloak in order to log them out.

4. Aren't we overlapping in responsibilities and functionality with the ui-server,
which already has a session with the user?
Yes we are. But I am not sure how much of a problem this is. The functionality described here
has to be implemented somewhere and the ui-server does not provide this currently.

## HTTPS Endpoints Served by the proposed design
- Callback for logging into Keycloak
- Callback for logging into Gitlab
- Login endpoint
- Logout endpoint
- Session refresh endpoint

## Unresolved questions

Issues to be resolved through the RFC process before this is merged:
- ~~agreement on the responsibilities of the gateway~~
- ~~agreement on the design and decisions~~
- ~~whether to use more persistent storage for some sessions that need to be
  longer-lived (i.e. sessions for Jupyter servers)~~

Issues to resolve through the implementation of this RFC:
- potential elimination of Traefik in the gateway (all golang web frameworks support reverse proxying out of the box)
- elimination of the git-proxy in sessions (i.e. by using a session ID
  the gateway can inject the user's Gitlab credentials in requests)

Features out of scope:
- The centralized authentication service for sessions can be implemented as a separate service.
  But I think that the responsibility for token management/swapping and authentication
  should not be spread across multiple services but rather belong to a single centralized entity.
