From ebcb2676611097c5256707508555833a79b95b32 Mon Sep 17 00:00:00 2001 From: Stef Piatek Date: Thu, 12 Dec 2024 14:45:42 +0000 Subject: [PATCH 1/4] Start documenting design choices --- .../0001-multiservice-architecture.md | 28 ++++++++ .../0002-message-processing.md | 47 ++++++++++++ .../decision-record/0003-dicom-processing.md | 44 ++++++++++++ .../0004-multiple-project-configuration.md | 65 +++++++++++++++++ .../0005-project-based-study-routing.md | 50 +++++++++++++ docs/design/decision-record/index.md | 39 ++++++++++ docs/design/decision-record/template.md | 72 +++++++++++++++++++ 7 files changed, 345 insertions(+) create mode 100644 docs/design/decision-record/0001-multiservice-architecture.md create mode 100644 docs/design/decision-record/0002-message-processing.md create mode 100644 docs/design/decision-record/0003-dicom-processing.md create mode 100644 docs/design/decision-record/0004-multiple-project-configuration.md create mode 100644 docs/design/decision-record/0005-project-based-study-routing.md create mode 100644 docs/design/decision-record/index.md create mode 100644 docs/design/decision-record/template.md diff --git a/docs/design/decision-record/0001-multiservice-architecture.md b/docs/design/decision-record/0001-multiservice-architecture.md new file mode 100644 index 000000000..c879f8153 --- /dev/null +++ b/docs/design/decision-record/0001-multiservice-architecture.md @@ -0,0 +1,28 @@ +# Multi-service Architecture + +* Status: accepted +* Deciders: Original PIXL team +* Date: 2023-11-01 (retrospectively) + +## Context and Problem Statement + +We want a software solution that has distinct functionality. How do we structure those services. + +## Decision Drivers + +* Will need to allow for concurrent processing of tasks +* Lingua franca of the team is python + +## Considered Options + +* Multi-service architecture +* Monolith architecture + +## Decision Outcome + +Chosen option: "Multi-service architecture", because it allows us to: + +- Break up our code into logical packages and services +- Get around a single python process being blocked by the global interpreter lock, as each service runs in its own process +- Forces us to consider where code should go, and restrict services from using other services code +- Works well with UCLH's restriction to deploy services using docker and docker compose, open to extend into kubernetes should we want that diff --git a/docs/design/decision-record/0002-message-processing.md b/docs/design/decision-record/0002-message-processing.md new file mode 100644 index 000000000..03601aaa5 --- /dev/null +++ b/docs/design/decision-record/0002-message-processing.md @@ -0,0 +1,47 @@ +# Message-based Processing of Images + +* Status: accepted +* Deciders: Original PIXL team, most recent changes: Stef Piatek & Paul Smith +* Date: 2024-12-12 + +## Context and Problem Statement + +- We need a way to buffer the messages awaiting processing. + We expect hundreds to tens of thousands of imaging studies to be requested per project, + and want to find each study in the source systems individually. + + +## Decision Drivers + +- Be able to process multiple research projects at the same time +- Should be persistent if services are taken down +- Allow for a secondary DICOM source to be used if study isn't found in the primary +- Limit the total number of images that are being processed for a given source system +- Studies that have already been successfully exported for a project should not get processed again + +## Considered Options + +* Use of database alone to schedule run, with the CLI driving a long-running job to process all studies in research project +* Use of queues to buffer requests that the `imaging-api` processes, database to track the status of exported studies + +## Decision Outcome + +Chosen option: `Queue for buffering requests, database to track status of export`, +because it fulfills all requirements and allows us to invest in use of generic technologies. + +## Pros and Cons of the Options + +### Database alone + + +* Good, simple to set up and use. Single solution for buffering requests and tracking status +* Bad, bespoke solution where we could use a technology that can be transferrable to other situations +* Bad, complex setup to limit total queries per source system + +### Queue with database for status + + +* Good, fulfills all requirements fairly easily, creating a queue for primary and another for secondary imaging sources +* Good, because we have previously invested in rabbitmq as a technology +* Bad, extra services to manage and extra development +* Bad, because original implementation was broken and required effort to fix. Though we learned more about the libraries we're using. diff --git a/docs/design/decision-record/0003-dicom-processing.md b/docs/design/decision-record/0003-dicom-processing.md new file mode 100644 index 000000000..b70bcd71c --- /dev/null +++ b/docs/design/decision-record/0003-dicom-processing.md @@ -0,0 +1,44 @@ +# DICOM server and processing + +* Status: accepted +* Deciders: Original PIXL team +* Date: 2023-11-01 (retrospectively) + +## Context and Problem Statement + +We need to have a DICOM server to query DICOM images, store them, anonymise them and export them. + + +## Decision Drivers + +* Will need to have a robust DICOM server that has been in use +* Keep original studies in a cache locally to reduce use of clinical imaging systems if failures in anonymisation or export +* The team's lingua franca is Python +* Per-project anonymisation profiles and custom hashing of fields will require plugins to be written for anonymisation +* UCLH infrastructure allows for running docker, but we don't have admin accounts and cannot install software directly onto the machines. + +## Considered Options + +* XNAT server +* Orthanc server + +## Decision Outcome + +Chosen option: `Orthanc`, +because it's relatively lightweight, under development and allows for python-based extensions to be written. + +## Pros and Cons of the Options + +### XNAT + +* Good, ARC has a history of using this in the medical imaging subgroup. +* Good, widely regarded +* Bad, heavyweight and has many more features than we need to use. May take longer to learn and deploy +* Bad, does not allow for python-based plugins to be used for anonymisation without getting into running docker in docker + +### Orthanc + +* Good, has been battle tested +* Good, has DICOMWeb plugin to allow for export via that modality +* Good, allows for python-based plugins and running in docker +* Bad, no previous usage within ARC. Will be teething problems diff --git a/docs/design/decision-record/0004-multiple-project-configuration.md b/docs/design/decision-record/0004-multiple-project-configuration.md new file mode 100644 index 000000000..5c4f2647e --- /dev/null +++ b/docs/design/decision-record/0004-multiple-project-configuration.md @@ -0,0 +1,65 @@ +# Multiple-project configuration + +* Status: accepted +* Deciders: Milan Malfait, Peter Tsrunchev, Jeremy Stein, Stef Piatek +* Date: 2024-03-05 + +Technical Story: [PIXL can take multiple projects](https://github.com/SAFEHR-data/PIXL/issues/330) + +## Context and Problem Statement + +Each project should be able to define its own anonymisation profile and export destinations. + +## Decision Drivers + +* When hashing a given field for an imaging study for two different projects, each project's hash should be different. + This it to avoid inappropriate linkage of data between research projects. +* Secure storage of secrets, especially for connection to destinations and per-project hashing salts + +## Considered Options + +* file-based: env files for docker or structured files for secrets +* self-hosted secret storage service +* azure keyvault + +## Decision Outcome + +Chosen option: "azure keyvault", because its low hassle and gives us good control of secret access. + +### Positive Consequences + +* Single place for secrets, which multiple deployments can access +* Can have secrets which expire, so even if compromised we limit the amount of secret leakage possible +* Per-project salts stored in a separate keyvault than the export endpoints +* Only need to define the destination type, with the keyvalut defining all the connection details + +### Negative Consequences + +* Requires other implementations to set up their own azure storage accounts or develop new secret management +* Developers also have to update a `.env` file for running the system test +* Slight increase in cost, can be slightly offset by caching credentials + +## Pros and Cons of the Options + +### File-based + +* Good, simple to do +* Bad, will keep on expanding as time goes on which can be a pain to maintain +* Bad, no access control beyond unix permissions + +### Self-hosted secret storage + +* Good, fine-grained access control possible +* Good, free in terms of upfront cost +* Bad, another service to maintain - residual costs + +### Azure keyvault + +[example | description | pointer to more information | …] + +* Good, fine-grained access control possible +* Bad, slight increase in cost + +## Links + +* Routing of studies based on projects in [ADR-0005](0005-project-based-study-routing.md) diff --git a/docs/design/decision-record/0005-project-based-study-routing.md b/docs/design/decision-record/0005-project-based-study-routing.md new file mode 100644 index 000000000..3b2686ead --- /dev/null +++ b/docs/design/decision-record/0005-project-based-study-routing.md @@ -0,0 +1,50 @@ +# Project-based study routing + +* Status: accepted +* Deciders: Stef Piatek, Paul Smith +* Date: 2024-11-27 + +Technical Story: [PIXL can take multiple projects](https://github.com/SAFEHR-data/PIXL/issues/330) + +## Context and Problem Statement + +Each study sent to `orthanc-anon` needs to be de-identified using the project-specific configuration. +We need a way to pass along this information along with the DICOM file. + +## Considered Options + +* DICOM tag: Adding a custom DICOM tag for the project name +* custom REST API endpoint: Creating a custom REST API endpoint for `orthanc-anon` to pull the data from `orthanc-raw` + +## Decision Outcome + +Chosen option: "custom REST API endpoint", because the project tag updating was causing `orthanc-raw` to crash, +and we were no longer using study stability to control export. + +## Pros and Cons of the Options + +### DICOM tag + +Add private creator group to instances as they arrive, and a dummy value in the custom tag. +Once the study has been pulled from the DICOM source, update the tag with the filename stem of the project config. + +* Good, because you can see which study was most recently exported +* Bad, `orthanc-raw` started crashing when updating the DICOM tag via the orthanc API on large studies +* Bad, because we were having to update the DICOM tag without changing the instance UIDs. + So that we can check for missing instances for studies which already exist in `orthanc-raw` +* Bad, studies appeared where instances didn't have their custom DICOM tag updates +* Bad, could have a race condition where the same study is trying to be exported by two projects + +### Custom REST API endpoint + +Once the study has been pulled from the DICOM source, send a REST request to `orthanc-anon` with the study UID and project. +`orthanc-anon` then adds this job to a threadpool and returns a 200 status to the client. + +* Good, keeping original instances unaltered +* Good, thread-pooling allows for faster de-identification +* Good, simpler flow +* Bad, we can't alter the queue of de-identification jobs short of taking down `orthanc-anon` + +## Links + +* Related to multiple project configuration [ADR-0004](0004-multiple-project-configuration.md) diff --git a/docs/design/decision-record/index.md b/docs/design/decision-record/index.md new file mode 100644 index 000000000..57caedd91 --- /dev/null +++ b/docs/design/decision-record/index.md @@ -0,0 +1,39 @@ +# Architectural Decision Log + +This log lists the architectural decisions for PIXL. + +Using the [Markdown Architectural Decision Records](https://adr.github.io/madr/) + +Regenerate the content by using `adr-log -i` from this directory. +You can install it via `npm install -g adr-log`, removing the template + + + +* [ADR-0001](0001-multiservice-architecture.md) - Multi-service Architecture +* [ADR-0002](0002-message-processing.md) - Message-based Processing of Images +* [ADR-0003](0003-dicom-processing.md) - DICOM server and processing +* [ADR-0004](0004-multiple-project-configuration.md) - Multiple-project configuration +* [ADR-0005](0005-project-based-study-routing.md) - Project-based study routing + + + + + + + + + + + + + + + + + + + + + + + diff --git a/docs/design/decision-record/template.md b/docs/design/decision-record/template.md new file mode 100644 index 000000000..b779044d8 --- /dev/null +++ b/docs/design/decision-record/template.md @@ -0,0 +1,72 @@ +# [Template: short title of solved problem and solution] + +* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0105](0105-example.md)] +* Deciders: [list everyone involved in the decision] +* Date: [YYYY-MM-DD when the decision was last updated] + +Technical Story: [description | ticket/issue URL] + +## Context and Problem Statement + +[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.] + +## Decision Drivers + +* [driver 1, e.g., a force, facing concern, …] +* [driver 2, e.g., a force, facing concern, …] +* … + +## Considered Options + +* [option 1] +* [option 2] +* [option 3] +* … + +## Decision Outcome + +Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)]. + +### Positive Consequences + +* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …] +* … + +### Negative Consequences + +* [e.g., compromising quality attribute, follow-up decisions required, …] +* … + +## Pros and Cons of the Options + +### [option 1] + +[example | description | pointer to more information | …] + +* Good, because [argument a] +* Good, because [argument b] +* Bad, because [argument c] +* … + +### [option 2] + +[example | description | pointer to more information | …] + +* Good, because [argument a] +* Good, because [argument b] +* Bad, because [argument c] +* … + +### [option 3] + +[example | description | pointer to more information | …] + +* Good, because [argument a] +* Good, because [argument b] +* Bad, because [argument c] +* … + +## Links + +* [Link type] [Link to ADR] +* … From b60dbbe6456bc909485837e093cd735e776df737 Mon Sep 17 00:00:00 2001 From: Stef Piatek Date: Thu, 12 Dec 2024 14:50:34 +0000 Subject: [PATCH 2/4] Be explicit about secrets for multiple projects --- .../decision-record/0004-multiple-project-configuration.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/design/decision-record/0004-multiple-project-configuration.md b/docs/design/decision-record/0004-multiple-project-configuration.md index 5c4f2647e..5b7510e7a 100644 --- a/docs/design/decision-record/0004-multiple-project-configuration.md +++ b/docs/design/decision-record/0004-multiple-project-configuration.md @@ -8,7 +8,10 @@ Technical Story: [PIXL can take multiple projects](https://github.com/SAFEHR-dat ## Context and Problem Statement -Each project should be able to define its own anonymisation profile and export destinations. +Each project should be able to define its own anonymisation profile and export destinations. +How can we store the secrets. + +![pixl-multi-project-config.png](../diagrams/pixl-multi-project-config.png) ## Decision Drivers From c6cc969b51f7ab76e99a7488cac0493d27787584 Mon Sep 17 00:00:00 2001 From: Stef Piatek Date: Thu, 12 Dec 2024 14:50:47 +0000 Subject: [PATCH 3/4] Add design choices index --- index.md | 43 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 index.md diff --git a/index.md b/index.md new file mode 100644 index 000000000..badac46d3 --- /dev/null +++ b/index.md @@ -0,0 +1,43 @@ + + + + +* [ADR-1](cli/README.md) - PIXL Driver + Command line interface +* [ADR-2](cli/tests/README.md) - +* [ADR-3](CODE_OF_CONDUCT.md) - Contributor Covenant Code of Conduct +* [ADR-4](CONTRIBUTING.md) - Contributing to `PIXL`. +* [ADR-5](docs/design/bigger_picture.md) - The bigger picture +* [ADR-0001](docs/design/decision-record/0001-multiservice-architecture.md) - Multi-service Architecture +* [ADR-0002](docs/design/decision-record/0002-message-processing.md) - Message-based Processing of Images +* [ADR-0003](docs/design/decision-record/0003-dicom-processing.md) - DICOM server and processing +* [ADR-0004](docs/design/decision-record/0004-secrets-storage.md) - Secrets Storage +* [ADR-6](docs/design/decision-record/index.md) - Architectural Decision Log +* [ADR-7](docs/design/decision-record/template.md) - [short title of solved problem and solution] +* [ADR-8](docs/file_types/parquet_files.md) - Parquet files you might encounter throughout PIXL +* [ADR-9](docs/services/ftp-server.md) - FTPS server +* [ADR-10](docs/services/pixl_database.md) - The PIXL database +* [ADR-11](docs/setup/azure-keyvault.md) - Azure Keyvault setup +* [ADR-12](docs/setup/developer.md) - Developer setup +* [ADR-13](docs/setup/uclh-infrastructure-setup.md) - UCLH Infrastructure setup instructions +* [ADR-14](hasher/README.md) - Hasher API +* [ADR-15](orthanc/orthanc-anon/docs/DicomServiceViaAAD.md) - Retrieving an access token for the DICOM service using and Azure AD application +* [ADR-16](orthanc/orthanc-anon/README.md) - Orthanc Anon +* [ADR-17](orthanc/orthanc-raw/README.md) - Orthanc Raw +* [ADR-18](orthanc/README.md) - ORTHANC instances +* [ADR-19](pixl_core/README.md) - Core +* [ADR-20](pixl_dcmd/README.md) - PIXL DICOM de-identifier +* [ADR-21](pixl_export/README.md) - PIXL Export API +* [ADR-22](pixl_imaging/alembic/README.md) - Alembic configuration +* [ADR-23](pixl_imaging/README.md) - PIXL Imaging API +* [ADR-24](postgres/README.md) - PIXL pipeline database +* [ADR-25](pytest-pixl/README.md) - PIXL pytest plugin +* [ADR-26](README.md) - [pixl-ci](https://github.com/SAFEHR-data/PIXL/actions/workflows/main.yml/badge.svg)](https://github.com/SAFEHR-data/PIXL/actions/workflows/main.yml) +* [ADR-27](test/README.md) - PIXL System Tests + + + + + + + + From c5c8046f16411a45a024612e4aca410d9cea09ce Mon Sep 17 00:00:00 2001 From: Stef Piatek Date: Mon, 23 Dec 2024 10:56:37 +0000 Subject: [PATCH 4/4] Add data export decision --- .../decision-record/0006-data-export.md | 45 +++++++++++++++++++ docs/design/decision-record/index.md | 22 +-------- 2 files changed, 46 insertions(+), 21 deletions(-) create mode 100644 docs/design/decision-record/0006-data-export.md diff --git a/docs/design/decision-record/0006-data-export.md b/docs/design/decision-record/0006-data-export.md new file mode 100644 index 000000000..127714c01 --- /dev/null +++ b/docs/design/decision-record/0006-data-export.md @@ -0,0 +1,45 @@ +# Export of parquet files and DICOM data + +* Status: accepted +* Deciders: Haroon Chughtai, Jeremy Stein, Milan Malfait, Ruaridh Gollifer, Stef Piatek +* Date: 2024-02-26 + +## Context and Problem Statement + +The pipeline needs to be able to export DICOM images and structured data files to different endpoints. + +## Decision Drivers + +* We expect that some projects will have more data than we can store locally. Will need a rolling export of images +* We will need to be able to export images and structured data via FTPS in an automated fashion +* We will need to be able to export images via DICOMWeb + + +## Considered Options + +* Shared python library for exporting of data, used in `orthanc-anon` and the `pixl` CLI. +* `export-api` service, which can export both DICOM and structured data files. + +## Decision Outcome + +Chosen options: "`export-api` service", for clear separation of responsibilities. + +## Pros and Cons of the Options + +### Shared python library + +Add private creator group to instances as they arrive, and a dummy value in the custom tag. +Once the study has been pulled from the DICOM source, update the tag with the filename stem of the project config. + +* Good, one less service to maintain +* Good, export via DICOMWeb is using the orthanc API already +* Bad, duplication of implementation for export +* Bad, duplication of areas where secrets are used + +### `export-api` service + +Instead of shared library the code would be in the service alone. + +* Good, single service that will access all secrets and orchestrate exports +* Good, allows caching of export secrets in a long-running service +* Bad, would require extra code for interacting with the service from the CLI for parquet export diff --git a/docs/design/decision-record/index.md b/docs/design/decision-record/index.md index 57caedd91..ef2f7031f 100644 --- a/docs/design/decision-record/index.md +++ b/docs/design/decision-record/index.md @@ -14,26 +14,6 @@ You can install it via `npm install -g adr-log`, removing the template * [ADR-0003](0003-dicom-processing.md) - DICOM server and processing * [ADR-0004](0004-multiple-project-configuration.md) - Multiple-project configuration * [ADR-0005](0005-project-based-study-routing.md) - Project-based study routing +* [ADR-0006](0006-data-export.md) - Export of parquet files and DICOM data - - - - - - - - - - - - - - - - - - - - -