-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add new covid_symptom__nlp_results_term_exists task #275
Conversation
7625b54
to
61e9761
Compare
- etl-support | ||
- etl-support-gpu | ||
- covid-symptom | ||
- covid-symptom-gpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK notable change number 1: how do you feel about this direction?
I think I mentioned wanting to toy with this - not a global "run all the services I might ever need" but rather a more targeted "run the services for the studies I'm caring about right now".
It's a more complicated user story, but I feel like the "run everything" approach quickly falls over as we add studies.
You had mentioned anticipating moving to a more comprehensive meta-runner like Kubernetes or something yeah? I'm not opposed, but in the meantime, I'm thinking something this could be a happy medium.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i don't hate this given the addition of a new task, and also that folks may not often be running this now.
class CovidSymptomNlpResultsTask(tasks.EtlTask): | ||
"""Covid Symptom study task, to generate symptom lists from ED notes using NLP""" | ||
class BaseCovidSymptomNlpResultsTask(tasks.BaseNlpTask): | ||
"""Covid Symptom study task, to generate symptom lists from ED notes using cTAKES + a polarity check""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notable change number 2: A lot of this PR is refactoring, sorry.
I make a new BaseNlpTask for all NLP jobs (shared by the hugging face test task and the covid tasks) PLUS a base covid symptoms test here that both the (nearly-identical) covid tasks inherit from).
There's no meaningful actual change in the refactoring, except that this covid base task can now pass in a different polarity model param to the NLP code. But the end result of the task is identical (though I did bump its task_version number because I also bumped the negation transformer to 0.6.1).
And the tests when you get to them are also refactored a bit - I added a new shared BaseEtlTest class for all the tests that care about running a full ETL and some refactoring to handle the changes from dropping the covid tests from the default list of ETL jobs (keep reading for that change).
@@ -52,7 +54,6 @@ def get_default_tasks() -> list[type[AnyTask]]: | |||
ObservationTask, | |||
ProcedureTask, | |||
ServiceRequestTask, | |||
covid_symptom.CovidSymptomNlpResultsTask, # TODO: remove from default list at some point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notable change number 3: dropping this default task. It feels natural to drop study-specific tasks from the default set. (I don't know if folks really even use the default vs listing the tasks they care about, but still.)
from cumulus_etl.etl.tasks.base import EtlTask, OutputTable | ||
|
||
|
||
class BaseNlpTask(EtlTask): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the base NLP task I mentioned before. It basically just handles looping over the DocRefs and extracting the notes. With some boilerplate code for handling group fields if the subclass wants that.
"pyarrow < 13", | ||
"pyarrow < 14", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated, just bumped it while I was in here because I saw they put out a 13.0.0 that dropped python 3.7 support.
44cdbb4
to
dc25c5e
Compare
dc25c5e
to
c97b07f
Compare
- etl-support | ||
- etl-support-gpu | ||
- covid-symptom | ||
- covid-symptom-gpu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i don't hate this given the addition of a new task, and also that folks may not often be running this now.
docs/studies/covid-symptom.md
Outdated
There is also a second optional task `--task covid_symptom__nlp_results_term_exists`, | ||
which just uses a different polarity cNLP transformer (`termexists` rather than `negation`). | ||
You likely don't need both, but they may be interesting to compare. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: do we want to treat this as a in development style command? i.e. maybe it doesn't live here until we've decided to use it in a non-experimental way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question.
So this task is not like, in-development or experimental in the sense of a WIP that's likely to change. Once this lands, I feel like it's a full, normal task. But it is "less than" a normal task because it's kind of a special case task to help Tim compare the new transformer's performance. But I guess all study tasks are kind of special case tasks like that.
But here's this task's current status in this PR:
- is not run by default
- can only be run by name (i.e. not tagged at all --
--task-filter=covid_symptom
or--task-filter=gpu
do not hit it) - is documented on the study's docs
Is your thinking that 1 and 2 are good (i.e. "lowlightling" this task) and 3 should also not mention the task to lowlight it even further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm open to that, just trying to clarify what you are thinking about when marking experimental -- or do you want like active barriers like a flag --enable-experimental
or something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think in my head, i'm drawing the line at 'this is something we are using internally while doing study validation' versus 'this is something a partner site might be explicitly asked to run' - and I think this is mostly a 3 level comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair yeah - I dropped mention from the docs 👍
- Adds a new covid_symptom__nlp_results_term_exists task, which uses the "termexists" model for polarity checking cTAKES rather than the previous "negation" model. This task will largely be used to compare the performance of the two models. - Rename some docker compose targets, like the etl-support profile into the covid-symptom profile (the thinking is that we'll have study-specific sets of services that you might want on or off, depending on what you're doing) - Refactors some NLP code to use shared base classes. - Remove covid_symptom__nlp_results from the default set of tasks. Study-specific tasks should have to be requested.
c97b07f
to
62aa293
Compare
the "termexists" model for polarity checking cTAKES rather than the
previous "negation" model. This task will largely be used to compare
the performance of the two models.
into the covid-symptom profile (the thinking is that we'll have
study-specific sets of services that you might want on or off,
depending on what you're doing)
Study-specific tasks should have to be requested.
Checklist
docs/
) needs to be updated