# Osquery extension, runner, and instance refactor

## Status

2024-10-23: ADR submitted to team.

## Context

We want to be able to run between 0 and n osquery instances; currently, we run exactly one. In the short term, we want to be able to split up where Kolide queries run versus where tenant queries run, so that if a tenant creates a query that causes issues with osquery, the Kolide queries will still run. (This amounts to running 2 osquery instances.) In the longer term, we want to support multitenant installations. This will mean 1 osquery instance for Kolide queries, and 1 osquery instance per tenant.

### Current implementation

```mermaid
---
title: Current state
---
erDiagram
RUNGROUP ||--|| EXTENSION : runs
EXTENSION ||--|| JSONRPCCLIENT : uses
RUNGROUP ||--|| OSQUERYRUNNER : runs
    OSQUERYRUNNER ||--|| OSQUERYINSTANCE : runs
OSQUERYINSTANCE ||--|{ EXTENSIONSERVER : runs
OSQUERYINSTANCE ||--|| EXTENSIONCLIENT : runs
EXTENSIONSERVER ||--|| EXTENSION : forwards
    OSQUERYINSTANCE ||--|| STATS : owns
OSQUERYRUNNER ||--|| STATS : updates
```

In our current implementation, the extension is created by `runLauncher`, run via our rungroup, and passed into the osquery runner for use when creating the instances. To run multiple osquery instances with our _current_ implementation, we have two choices:

1. We could share one osquery extension with all osquery instances
2. We could create multiple extensions inside `runLauncher` and pass them into the runner to distribute one per osquery instance

Ultimately, I rejected both of these options as unsuitable.

We don't want to share the osquery extension among all osquery instances -- if we did, we would be unable to identify which osquery instance is talking to the extension, which makes this option untenable. Additionally, we may run into socket contention issues.

Creating multiple extensions in `runLauncher` and passing them into the runner requires some tedious mapping to make sure we tie the correct extension to the correct osquery instance. This option would also require a full launcher restart if the runner needed to increase the number of osquery instances running (for example, if a single-tenant launcher enrolls with a new tenant to become a multi-tenant launcher); I think requiring a full restart is unnecessary and undesirable in this case.
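
To make the mapping problem concrete, here is a rough sketch of the rejected shape. All names here are hypothetical stand-ins, not the actual launcher code:

```go
package main

// Hypothetical sketch of rejected option 2: runLauncher pre-creates one
// extension per expected osquery instance and hands the full set to the
// runner, which must pair each extension with the right instance.

type extension struct{ name string }

type runner struct {
	extensions map[string]*extension // the tedious mapping lives here
}

func runLauncher() *runner {
	return &runner{
		extensions: map[string]*extension{
			"kolide": {name: "kolide"},
			"tenant": {name: "tenant"},
		},
	}
}

func main() {
	r := runLauncher()
	// If the device enrolls with a second tenant now, there is no way to
	// create its extension here -- extension creation happened up front in
	// runLauncher, so the whole launcher must restart to grow this map.
	_ = r
}
```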

Therefore, we will want to refactor our current implementation with the following goals in mind:

1. We want one osquery extension per osquery instance
2. We want to be able to add new osquery instances without requiring a full launcher restart
3. We do not want onerously complex setup for the osquery instances or their extensions, since our osquery process management and osquery data exchange are already complicated systems

## Decision

To simplify the creation of multiple osquery instances, and to make it possible to increase the number of osquery instances running without performing a full launcher restart, we will move the osquery extension ownership to the osquery instance.

This will also give us the opportunity to make the separation of responsibility between the runner and the instance clearer. The instance will create its own extension, extension server, etc. The runner will be responsible for the instances' state. This means the osquery instance stats should be owned by the runner, since the runner already has the responsibility of updating them.
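
A minimal sketch of that ownership split, again using hypothetical names rather than the real launcher types:

```go
package main

// Hypothetical sketch of the refactor: the instance constructs its own
// extension (and would likewise own its extension server and client),
// while the stats live with the runner, which updates them.

type extension struct{ registrationID string }

type osqueryInstance struct {
	registrationID string
	extension      *extension
}

func newOsqueryInstance(registrationID string) *osqueryInstance {
	return &osqueryInstance{
		registrationID: registrationID,
		// The instance creates its own extension -- nothing is created in
		// runLauncher and handed down through the runner.
		extension: &extension{registrationID: registrationID},
	}
}

type stats struct{ restarts int }

type runner struct {
	instances map[string]*osqueryInstance
	stats     map[string]*stats // owned and updated by the runner
}

func main() {
	r := &runner{
		instances: map[string]*osqueryInstance{},
		stats:     map[string]*stats{},
	}
	r.instances["kolide"] = newOsqueryInstance("kolide")
	r.stats["kolide"] = &stats{}
}
```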

```mermaid
---
title: Refactored (one osquery instance)
---
erDiagram
RUNGROUP ||--|| OSQUERYRUNNER : runs
    OSQUERYRUNNER ||--|| OSQUERYINSTANCE : runs
OSQUERYINSTANCE ||--|| EXTENSION : owns
EXTENSION ||--|| JSONRPCCLIENT : uses
OSQUERYINSTANCE ||--|{ EXTENSIONSERVER : runs
OSQUERYINSTANCE ||--|| EXTENSIONCLIENT : runs
EXTENSIONSERVER ||--|| EXTENSION : forwards
    OSQUERYRUNNER ||--|| STATS : updates
```

The diagram below shows the desired future state -- it is identical to the diagram above except for the relationship cardinalities (e.g., `OSQUERYRUNNER` runs 0 or more `OSQUERYINSTANCE`s rather than exactly 1).

```mermaid
---
title: Future state (0-n osquery instances)
---
erDiagram
RUNGROUP ||--|| OSQUERYRUNNER : runs
    OSQUERYRUNNER ||--o{ OSQUERYINSTANCE : runs
OSQUERYINSTANCE ||--|| EXTENSION : owns
EXTENSION }o--|| JSONRPCCLIENT : uses
OSQUERYINSTANCE ||--|{ EXTENSIONSERVER : runs
OSQUERYINSTANCE ||--|| EXTENSIONCLIENT : runs
EXTENSIONSERVER ||--|| EXTENSION : forwards
    OSQUERYRUNNER ||--o{ STATS : updates
```
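
Building on the hypothetical types sketched in the Decision section, growing from one instance to many becomes a runtime operation on the runner rather than a restart:

```go
// Continues the hypothetical sketch above: because each instance builds
// its own extension, the runner can add an instance at any time.
func (r *runner) addInstance(registrationID string) {
	instance := newOsqueryInstance(registrationID)
	r.instances[registrationID] = instance
	r.stats[registrationID] = &stats{}
	// The instance would launch osqueryd, its extension server, and its
	// extension client from here.
}
```

A single-tenant launcher that enrolls with a second tenant would call something like `addInstance` with the new registration, rather than restarting.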

## Consequences

The osquery instance will now be responsible for managing extension health -- if the extension exits, the instance will need to take corrective action. (This is potentially a better outcome than before, since now extension shutdown can be remediated without a full launcher restart.)
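
A sketch of what that corrective action might look like, continuing the hypothetical types from the Decision section (and assuming the standard `log` package is imported):

```go
// The instance watches its own extension and recreates it in place if it
// exits -- no launcher restart required.
func (i *osqueryInstance) monitorExtension(extensionErrs <-chan error) {
	for err := range extensionErrs {
		// The extension exited unexpectedly; rebuild it (and re-register it
		// with osqueryd) rather than tearing down all of launcher.
		log.Printf("extension for %s exited: %v; recreating", i.registrationID, err)
		i.extension = &extension{registrationID: i.registrationID}
	}
}
```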

This decision does preserve a shared JSONRPC client to talk to Kolide SaaS. If we determine this is not optimal, it would be fairly easy to update to a model where we have one JSONRPC client per extension.

From the PR review:

> **Reviewer:** any known reasons to share vs break out per extension? or we're just not aware of a reason to duplicate the client
>
> **Author:** Basically this! I am imagining it's the sort of thing that I'll have a stronger opinion on once we get further through implementation. But either way, not hard to change after the fact.
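
As the exchange above suggests, the swap would be mechanical. In a hypothetical shape like the one below, moving to one client per extension would just mean constructing the client inside the extension constructor instead of passing a shared one in:

```go
// Hypothetical sketch of the shared-client arrangement: one JSONRPC client
// to Kolide SaaS is handed to every extension at construction time.
type jsonRPCClient struct{ serverURL string }

type extension struct {
	registrationID string
	client         *jsonRPCClient // shared by every extension today
}

func newExtension(registrationID string, shared *jsonRPCClient) *extension {
	return &extension{registrationID: registrationID, client: shared}
}
```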


## Changelog

2024-10-23: Initial draft of ADR.