| 1 | +- Start Date: 09-12-2022 |
| 2 | +- Status: Proposed |
| 3 | + |
| 4 | +# Gateway refactor |
| 5 | + |
| 6 | +## Summary |
| 7 | + |
| 8 | +This RFC proposes a full rewrite of the current gateway service. It also proposes |
| 9 | +changes to the way that access to Jupyter servers is controlled. Currently, each Jupyter |
| 10 | +server establishes its own session with the user. Going forward, each user will have a session |
| 11 | +with the gateway, which will be used to control access to each Jupyter server. |
| 12 | +The main motivation for changing the access control for Jupyter servers is the desire to run |
| 13 | +each Jupyter server in its own subdomain. The problem with unique subdomains is that each Jupyter |
| 14 | +server's subdomain would have to be registered in Keycloak so that the OAuth2 proxy in each |
| 15 | +Jupyter server can function properly. But if the OAuth session is handled by the gateway, we can |
| 16 | +keep a single callback address in Keycloak, as we currently do. |
| 17 | + |
| 18 | +## Motivation |
| 19 | + |
| 20 | +Reasons for rewriting the gateway: |
| 21 | +- the current code base is hard to maintain |
| 22 | +- one service could replace both traefik and the gateway |
| 23 | +- problems with managing sessions across different services/clients |
| 24 | +- using golang should result in more efficient use of resources and improved performance |
| 25 | + |
| 26 | +## Requirements |
| 27 | + |
| 28 | +We currently have two separate issues in which the gateway plays an integral part: |
| 29 | + |
| 30 | +- managing access from external entities (browsers, command line tools) to Renku resources |
| 31 | + - we need to ensure that all requests hitting our internal endpoints have appropriate identities and authorizations |
| 32 | +- supporting access from (internal) Renku services to external protected entities |
| 33 | + - in this case we are using tokens generated by external services and managed by Renku to access these resources |
| 34 | + |
| 35 | +## Design Detail |
| 36 | + |
| 37 | +### Gateway |
| 38 | + |
| 39 | +The gateway has to have sessions with different clients. Some of these are: |
| 40 | +- User servers (i.e. the sidecar container needs to be able to swap some kind of ID |
| 41 | +or access token for a Gitlab token) |
| 42 | +- Actual users logging in through the UI |
| 43 | +- Actual users logging in through the CLI |
| 44 | + |
| 45 | +Using JWT access tokens for this purpose is not feasible because access |
| 46 | +tokens from Keycloak usually expire within one to several hours. This means |
| 47 | +that anyone (i.e. users, services or Jupyter servers) who has to remain authenticated |
| 48 | +with the gateway for a long time also needs to get a refresh token and use it |
| 49 | +to keep obtaining valid access tokens. Accepting an expired access token is not secure. |
| 50 | + |
| 51 | +The gateway should instead have server-side sessions with every "client". For each such |
| 52 | +session the gateway will issue a random, hard to guess ID (i.e. a [session ID](https://en.wikipedia.org/wiki/Session_ID)) |
| 53 | +that will be stored either in a cookie or simply as a secret. This is exactly how |
| 54 | +sessions are handled by the ui-server. The cookie or ID is presented with every |
| 55 | +request to the gateway. The gateway looks this ID up in Redis and, if there is a match, |
| 56 | +adds any tokens/credentials stored under that key to the requests made with that ID. |
| 57 | + |
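The session handling described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the `Credentials` shape and the `SessionStore` type are hypothetical, and an in-memory map stands in for redis. The only point it shows is that the ID is generated from a cryptographically secure source and is used purely as a lookup key.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"sync"
)

// Credentials holds the tokens kept server-side for one session.
// The field names are hypothetical.
type Credentials struct {
	KeycloakAccessToken string
	GitlabAccessToken   string
}

// SessionStore is an in-memory stand-in for redis, keyed by session ID.
type SessionStore struct {
	mu       sync.RWMutex
	sessions map[string]Credentials
}

// NewSessionStore returns an empty store.
func NewSessionStore() *SessionStore {
	return &SessionStore{sessions: make(map[string]Credentials)}
}

// NewSessionID returns 32 bytes from crypto/rand, URL-safe base64 encoded.
func NewSessionID() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// Create starts a session for the given credentials and returns its ID.
func (s *SessionStore) Create(c Credentials) (string, error) {
	id, err := NewSessionID()
	if err != nil {
		return "", err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[id] = c
	return id, nil
}

// Lookup returns the credentials stored under id, if the session exists.
func (s *SessionStore) Lookup(id string) (Credentials, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	c, ok := s.sessions[id]
	return c, ok
}

func main() {
	store := NewSessionStore()
	id, err := store.Create(Credentials{KeycloakAccessToken: "kc-token"})
	if err != nil {
		panic(err)
	}
	creds, ok := store.Lookup(id)
	fmt.Println(len(id), ok, creds.KeycloakAccessToken)
}
```

An anonymous session would simply be created with an empty `Credentials` value.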
| 58 | +These sessions are managed/started as follows: |
| 59 | +- when a user logs in they get a session |
| 60 | +- when a user server is started the server gets a redis-based session |
| 61 | +- when a login happens through the CLI there is also a session |
| 62 | +- anonymous users also have redis-based sessions but these are empty |
| 63 | +since there are no additional credentials that need to be managed for anonymous users |
| 64 | + |
| 65 | +The gateway then has the following responsibilities: |
| 66 | +- store authentication tokens |
| 67 | +- update stored authentication tokens |
| 68 | +- refresh authentication tokens before expiration |
| 69 | +- destroy session IDs upon predetermined events (e.g. logout) |
| 70 | + |
| 71 | +### Centralized Jupyter server authentication |
| 72 | + |
| 73 | +### Diagrams |
| 74 | + |
| 75 | +1. Creating a user server |
| 76 | + |
| 77 | +```mermaid |
| 78 | +graph TD |
| 79 | + Ingress -- 1<br>Session ID --> Gateway |
| 80 | + |
| 81 | + subgraph GatewayGroup[ ] |
| 82 | + Gateway |
| 83 | + DB[(DB)] |
| 84 | + Gateway -- 2<br>Session ID --> DB |
| 85 | + Gateway -- 3<br>Server ID --> DB |
| 86 | + DB -- 4<br>KC access token<br>Server ID --> Gateway |
| 87 | + end |
| 88 | + |
| 89 | + Gateway -- 5<br>KC access token<br>KC ID token<br>Server ID--> Notebook[Notebook service] |
| 90 | + Notebook -- 6<br>KC user ID<br>Server ID--> Server[User server] |
| 91 | +``` |
| 92 | + |
| 93 | +2. Accessing a user server |
| 94 | + |
| 95 | +```mermaid |
| 96 | +graph TD |
| 97 | + Ingress -- 1<br>Session ID --> Gateway |
| 98 | + |
| 99 | + subgraph GatewayGroup[ ] |
| 100 | + Gateway |
| 101 | + DB[(DB)] |
| 102 | + Gateway -- 2<br>Session ID --> DB |
| 103 | + DB -- 3<br>KC access token --> Gateway |
| 104 | + end |
| 105 | + |
| 106 | + Gateway -- 4<br>KC access token --> Proxy |
| 107 | + |
| 108 | + subgraph Server[User server] |
| 109 | + Proxy -- 5 --> Jupyter |
| 110 | + end |
| 111 | +``` |
| 112 | + |
| 113 | +3. User server sidecar accessing a protected service |
| 114 | + |
| 115 | +```mermaid |
| 116 | +graph TD |
| 117 | + Server[User server sidecar] |
| 118 | + Gitlab |
| 119 | + |
| 120 | + subgraph GatewayGroup[ ] |
| 121 | + Gateway |
| 122 | + DB[(DB)] |
| 123 | + end |
| 124 | + |
| 125 | + Server -- 1<br>Server ID --> Gateway |
| 126 | + Gateway -- 2<br>Server ID --> DB |
| 127 | + DB -- 3<br>Git access token --> Gateway |
| 128 | + Gateway -- 4<br>Git access token --> Gitlab |
| 129 | +``` |
| 130 | + |
| 131 | +4. Sticky sessions for a Renku service |
| 132 | + |
| 133 | +```mermaid |
| 134 | +graph TD |
| 135 | + Ingress |
| 136 | + |
| 137 | + subgraph CoreGroup[ ] |
| 138 | + CoreTraefik[Traefik] |
| 139 | + Core[Core service] |
| 140 | + end |
| 141 | + |
| 142 | + subgraph GatewayGroup[ ] |
| 143 | + Gateway |
| 144 | + DB[(DB)] |
| 145 | + end |
| 146 | + |
| 147 | + Ingress -- 1<br>Session ID --> CoreTraefik |
| 148 | + CoreTraefik -- 2<br>Session ID<br>Sticky session --> Gateway |
| 149 | + Gateway -- 3<br>Session ID --> DB |
| 150 | + DB -- 4<br>KC access token<br>Git access token --> Gateway |
| 151 | + Gateway -- 5<br>KC access token<br>Git access token<br>Sticky session --> Core |
| 152 | +``` |
| 153 | + |
| 154 | +5. Requests from the ui-server |
| 155 | + |
| 156 | +Here the gateway will simply require fully valid Keycloak access tokens |
| 157 | +(which the ui-server already has). If this is implemented, the ui-server |
| 158 | +will be the only component that will be coming to the gateway with Keycloak |
| 159 | +access tokens. All others will be coming with a session ID. So to further |
| 160 | +tighten security the gateway can accept only Keycloak access tokens that have the |
| 161 | +`ui-server` client as audience and reject all others. |
| 162 | + |
| 163 | +## Drawbacks |
| 164 | + |
| 165 | +- More traffic will go through the gateway |
| 166 | +- The gateway will have sessions with multiple clients |
| 167 | + |
| 168 | +## Rationale and Alternatives |
| 169 | + |
| 170 | +1. Why can't we simply use JWT access tokens everywhere? |
| 171 | +JWT access tokens from Keycloak expire and it is unsafe to accept expired |
| 172 | +access tokens. So if we go this route then everyone who is authenticated |
| 173 | +is forced to refresh their access token. But in order to refresh an access token |
| 174 | +the services need to receive additional credentials. In addition, the services |
| 175 | +must now periodically check their tokens and keep them refreshed. This |
| 176 | +should not be the responsibility of Renku components but rather of the gateway. |
| 177 | +The only expectation of Renku components we should have is that they verify a provided |
| 178 | +access token and confirm i) it is valid and ii) has the expected audience and/or claims. |
| 179 | + |
| 180 | +2. Why can't we use the same session ID for users and servers? |
| 181 | +The main reason is that these are expected to have different lifetimes. |
| 182 | +When a user logs out then their session is removed. But they may leave servers running |
| 183 | +that require access to credentials so that they continue operating properly (even |
| 184 | +when their creator has logged out). The server session with the gateway will be |
| 185 | +removed when the server is shut down. |
| 186 | + |
| 187 | +3. Why not simply require a user/service to present a Keycloak user ID for authentication |
| 188 | +instead of the randomly generated session ID? |
| 189 | +Using the never-changing Keycloak user IDs for authentication is dangerous: |
| 190 | +the moment someone finds out the ID of a user, they can impersonate that user. |
| 191 | +Temporary, randomly generated session IDs are safer: even if the ID is stolen, |
| 192 | +the user can be impersonated only for the duration of the session. Renku admins also |
| 193 | +have control over these sessions and can force a specific user to be logged out, whereas |
| 194 | +with static Keycloak user IDs the admin would have to fully remove or disable the user in |
| 195 | +Keycloak in order to log them out. |
| 196 | + |
| 197 | +4. Aren't we overlapping in responsibilities and functionality with the ui-server, |
| 198 | +which already has a session with the user? |
| 199 | +Yes, we are. But I am not sure how much of a problem this is. The functionality described here |
| 200 | +has to be implemented somewhere and the ui-server does not currently provide it. |
| 201 | + |
| 202 | +## HTTPS Endpoints Served by the proposed design |
| 203 | +- Callback for logging into Keycloak |
| 204 | +- Callback for logging into Gitlab |
| 205 | +- Login endpoint |
| 206 | +- Logout endpoint |
| 207 | +- Session refresh endpoint |
| 208 | + |
| 209 | +## Unresolved questions |
| 210 | + |
| 211 | +Issues to be resolved through the RFC process before this is merged: |
| 212 | +- ~~agreement on the responsibilities of the gateway~~ |
| 213 | +- ~~agreement on the design and decisions~~ |
| 214 | +- ~~whether to use more persistent storage for some sessions that need to be |
| 215 | +longer-lived (i.e. sessions for Jupyter servers)~~ |
| 216 | + |
| 217 | +Issues to resolve through the implementation of this RFC: |
| 218 | +- potential elimination of traefik in the gateway (Go's standard library and most Go web frameworks support reverse proxying out of the box) |
| 219 | +- elimination of the git-proxy in sessions (i.e. by using a session ID |
| 220 | +the gateway can inject the user's gitlab credentials in requests) |
| 221 | + |
| 222 | +Features out of scope: |
| 223 | +- The centralized authentication service for sessions can be implemented as a separate service. |
| 224 | +But I think that the responsibility for token management/swapping and authentication |
| 225 | +should not be fragmented across multiple services but rather belong to a single centralized entity. |