Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor message serialisation and deserialisation #197

Merged
merged 36 commits into from
Dec 20, 2023

Conversation

milanmlft
Copy link
Member

@milanmlft milanmlft commented Dec 19, 2023

I encountered some friction while adding the project_name and omop_es_timestamp fields to the RabbitMQ messages from the log file for UCLH-Foundry/the-rolling-skeleton#159 because the serialise() function was getting quite bloated and ruff started to complain about too many parameters. I also had to change code in many different places, indicating information leakage.

So I decided to create dedicated classes for Message in core (while also getting rid of the utils name). The Message class is responsible for defining which fields we expect to be present and has a serialisation method.
The serialisation and deserialisation are handled by jsonpickle.

In addition, serialisation is only done at the last instance when publishing to the queue, to avoid repeated serialisation/deserialisation steps, by passing around Message objects instead of bytes.

Hope this will make future changes a bit more streamlined.
Would love to receive some feedback on this! 🙏

Warning

This PR is meant to be merged in #186

Copy link
Contributor

@stefpiatek stefpiatek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fun, and agreed one defining a message type for publishing to the queue.

I have some thoughts - let me know what you think (open for discussion, may not necessarily be the best solution)

cli/src/pixl_cli/main.py Outdated Show resolved Hide resolved
pixl_core/src/core/patient_queue/message.py Show resolved Hide resolved
pixl_core/src/core/patient_queue/message.py Outdated Show resolved Hide resolved
pixl_core/src/core/patient_queue/message.py Outdated Show resolved Hide resolved
Copy link
Contributor

@stefpiatek stefpiatek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring this Milan, looks a lot better and out abstractions are nicer

cli/src/pixl_cli/main.py Show resolved Hide resolved
cli/tests/test_messages_from_parquet.py Outdated Show resolved Hide resolved
pixl_core/src/core/patient_queue/producer.py Outdated Show resolved Hide resolved
cli/tests/test_messages_from_parquet.py Outdated Show resolved Hide resolved
pixl_core/src/core/patient_queue/message.py Outdated Show resolved Hide resolved
pixl_core/src/core/patient_queue/message.py Show resolved Hide resolved
Comment on lines -76 to +77
await callback(message.body)
await callback(deserialise(message.body))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

This hasn't been tested as it's not coved by any of the current unit tests.

Requires better integration tests, so should probably be done in https://github.com/UCLH-Foundry/the-rolling-skeleton/issues/70

@milanmlft milanmlft merged commit 90b6672 into rmg/parquet_inp Dec 20, 2023
7 checks passed
@milanmlft milanmlft deleted the mm/serialised-message-class branch December 20, 2023 17:36
milanmlft added a commit that referenced this pull request Dec 21, 2023
* Add -e to pip install command in cli test

* Added flag for csv=True and ran tests for cli

* Added test for parquet directory and messages_from_parquet func

* Use resources fixture to find test parquest files. Make input options
clearer. Tests will still fail due to missing private parquet files.

* Add synthetic "private" OMOP data taken from the issue

* Add dependency for reading parquet files

* Add procedure occurrence id column to message. Use pandas merge instead
of join.

* Remove option for CSV cohort file

* formatting changes

* Linting fixes

* Fix other uses of modified method

* ruff fix

* Downstream class needs new attribute as well

* Add type annotations to `test_messages_from_parquet()`

* Add project name and OMOP ES timestamp to rabbitmq messages

* Update `serialise()` callers with new arguments

* Add `project_name` and `omop_es_timestamp` fields to `ImagingStudy` class

* Refactor message serialisation and deserialisation (#197)

* Refactor message serialization and deserialization

Addiing `Message` and `SerialisedMessage` classes in attempt to improve information hiding and decoupling.

* Rename `utils.py` -> `message.py`

* Add `decode()` method for `SerialisedMessage`

* Update docstring

* Use new classes in message testing

* Refactor message processing in the CLI

* Refactor `process_message` to use `SerialisedMessage` class in EHR API

* Refactor `process_message` to use `SerialisedMessage` class in imaging API

* Fix `ImagingStudy` initalisation in  `ImagingStudy.from_message()`

* Fix imports

* Fix test: access serialised message bodies

* Turn `Message` into a `dataclass`

* Fix failing tests

* Use `jsonpickle` for (de)serializing messages

This also removes the need for the `SerialisedMessage` class

* Fix `test_deserialise_datetime()` so it uses the `Message` class to assert the `study_datetime`

* Add `study_datetime` property for `Message`

* No need to test deserialising individual fields, already covered by `test_deserialise()` which deserialises the entire object

* Remove `study_date_from_serialised()`, use the class attribute `study_datetime` instead

* Revert "Add `study_datetime` property for `Message`"

This reverts commit 8719153.

* Remove `Messages` class, use `list[Message]` instead

* Add type checking for messages parsed from parquet input

* Update `test_messages_from_parquet()` to use JSON strings instead of bytes

* Update `PixlProducer.publish()` to use a list of Message objects and handle serialisation

* Convert JSON string to bytes when serialising

* Revert "Update `test_messages_from_parquet()` to use JSON strings instead of bytes"

This reverts commit 0e4fce4.

* `PixlProducer.publish()` should take a `list[Message]` as input in tests

* Update EHR API to use new `Message` design

* Update imaging API to use new `Message` design

* Update deserialise function to accept bytes-encoded JSON string

* Assert messages against list of `Message`s

* Print dataclass in logs

* `jsonpickle.decode()` can handle bytes so no need to decode first

Also add a note about why we ignore ruff rule S301

* Make `deserialisable` a keyword only argument

* Copilot forgot to convert dates to datetimes 🥲

* Refactor PixlConsumer run method to accept Message object as callback parameter and deserialise

* Update consumer in `test_subscriber` to accept Message object instead of bytes

* Format README

* Update documentation with new parquet inputs

* Update system test to use parquet files

* Move `extract_summary.json` one level up and update docs

* Change `parquet-dir` to a click argument instead of option

* `click.argument` doesn't have a `help` parameter

Document `parquet-dir` in the docstrings instead

* Slim down parquet test files to only those needed

* Raise more appropriate exceptions for missing file or directory

* Print log message about messages created during regular runs

* Update error message for parquet files

---------

Co-authored-by: ruaridhg <ruaridhg@users.noreply.github.com>
Co-authored-by: Jeremy Stein <j.stein@ucl.ac.uk>
Co-authored-by: Milan Malfait <m.malfait@ucl.ac.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants