Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add new covid_symptom__nlp_results_term_exists task #275

Merged
merged 1 commit into from
Sep 29, 2023

Conversation

mikix
Copy link
Contributor

@mikix mikix commented Sep 26, 2023

  • Adds a new covid_symptom__nlp_results_term_exists task, which uses
    the "termexists" model for polarity checking cTAKES rather than the
    previous "negation" model. This task will largely be used to compare
    the performance of the two models.
  • Rename some docker compose targets, like the etl-support profile
    into the covid-symptom profile (the thinking is that we'll have
    study-specific sets of services that you might want on or off,
    depending on what you're doing)
  • Refactors some NLP code to use shared base classes.
  • Remove covid_symptom__nlp_results from the default set of tasks.
    Study-specific tasks should have to be requested.

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

@mikix mikix force-pushed the mikix/term-exists branch 3 times, most recently from 7625b54 to 61e9761 Compare September 29, 2023 12:24
Comment on lines -52 to +55
- etl-support
- etl-support-gpu
- covid-symptom
- covid-symptom-gpu
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK notable change number 1: how do you feel about this direction?

I think I mentioned wanting to toy with this - not a global "run all the services I might ever need" but rather a more targeted "run the services for the studies I'm caring about right now".

It's a more complicated user story, but I feel like the "run everything" approach quickly falls over as we add studies.

You had mentioned anticipating moving to a more comprehensive meta-runner like Kubernetes or something yeah? I'm not opposed, but in the meantime, I'm thinking something this could be a happy medium.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i don't hate this given the addition of a new task, and also that folks may not often be running this now.

class CovidSymptomNlpResultsTask(tasks.EtlTask):
"""Covid Symptom study task, to generate symptom lists from ED notes using NLP"""
class BaseCovidSymptomNlpResultsTask(tasks.BaseNlpTask):
"""Covid Symptom study task, to generate symptom lists from ED notes using cTAKES + a polarity check"""
Copy link
Contributor Author

@mikix mikix Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notable change number 2: A lot of this PR is refactoring, sorry.

I make a new BaseNlpTask for all NLP jobs (shared by the hugging face test task and the covid tasks) PLUS a base covid symptoms test here that both the (nearly-identical) covid tasks inherit from).

There's no meaningful actual change in the refactoring, except that this covid base task can now pass in a different polarity model param to the NLP code. But the end result of the task is identical (though I did bump its task_version number because I also bumped the negation transformer to 0.6.1).

And the tests when you get to them are also refactored a bit - I added a new shared BaseEtlTest class for all the tests that care about running a full ETL and some refactoring to handle the changes from dropping the covid tests from the default list of ETL jobs (keep reading for that change).

@@ -52,7 +54,6 @@ def get_default_tasks() -> list[type[AnyTask]]:
ObservationTask,
ProcedureTask,
ServiceRequestTask,
covid_symptom.CovidSymptomNlpResultsTask, # TODO: remove from default list at some point
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notable change number 3: dropping this default task. It feels natural to drop study-specific tasks from the default set. (I don't know if folks really even use the default vs listing the tasks they care about, but still.)

from cumulus_etl.etl.tasks.base import EtlTask, OutputTable


class BaseNlpTask(EtlTask):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the base NLP task I mentioned before. It basically just handles looping over the DocRefs and extracting the notes. With some boilerplate code for handling group fields if the subclass wants that.

"pyarrow < 13",
"pyarrow < 14",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated, just bumped it while I was in here because I saw they put out a 13.0.0 that dropped python 3.7 support.

@mikix mikix force-pushed the mikix/term-exists branch 5 times, most recently from 44cdbb4 to dc25c5e Compare September 29, 2023 13:31
@mikix mikix changed the title WIP: feat: add new covid_symptom__nlp_results_term_exists task feat: add new covid_symptom__nlp_results_term_exists task Sep 29, 2023
@mikix mikix marked this pull request as ready for review September 29, 2023 13:33
Comment on lines -52 to +55
- etl-support
- etl-support-gpu
- covid-symptom
- covid-symptom-gpu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i don't hate this given the addition of a new task, and also that folks may not often be running this now.

cumulus_etl/etl/tasks/nlp_task.py Show resolved Hide resolved
Comment on lines 49 to 51
There is also a second optional task `--task covid_symptom__nlp_results_term_exists`,
which just uses a different polarity cNLP transformer (`termexists` rather than `negation`).
You likely don't need both, but they may be interesting to compare.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: do we want to treat this as a in development style command? i.e. maybe it doesn't live here until we've decided to use it in a non-experimental way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question.

So this task is not like, in-development or experimental in the sense of a WIP that's likely to change. Once this lands, I feel like it's a full, normal task. But it is "less than" a normal task because it's kind of a special case task to help Tim compare the new transformer's performance. But I guess all study tasks are kind of special case tasks like that.

But here's this task's current status in this PR:

  1. is not run by default
  2. can only be run by name (i.e. not tagged at all -- --task-filter=covid_symptom or --task-filter=gpu do not hit it)
  3. is documented on the study's docs

Is your thinking that 1 and 2 are good (i.e. "lowlightling" this task) and 3 should also not mention the task to lowlight it even further?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm open to that, just trying to clarify what you are thinking about when marking experimental -- or do you want like active barriers like a flag --enable-experimental or something)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think in my head, i'm drawing the line at 'this is something we are using internally while doing study validation' versus 'this is something a partner site might be explicitly asked to run' - and I think this is mostly a 3 level comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair yeah - I dropped mention from the docs 👍

- Adds a new covid_symptom__nlp_results_term_exists task, which uses
  the "termexists" model for polarity checking cTAKES rather than the
  previous "negation" model. This task will largely be used to compare
  the performance of the two models.
- Rename some docker compose targets, like the etl-support profile
  into the covid-symptom profile (the thinking is that we'll have
  study-specific sets of services that you might want on or off,
  depending on what you're doing)
- Refactors some NLP code to use shared base classes.
- Remove covid_symptom__nlp_results from the default set of tasks.
  Study-specific tasks should have to be requested.
@mikix mikix merged commit de2bcb9 into main Sep 29, 2023
2 checks passed
@mikix mikix deleted the mikix/term-exists branch September 29, 2023 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants