Make rate limiting work with RxClass #331

Open
jrlegrand opened this issue Nov 22, 2024 · 5 comments

@jrlegrand
Member

jrlegrand commented Nov 22, 2024

Problem Statement

See related branch jrlegrand/rxclass-rework.

RxClass API has a rate limit of 20 calls / second.

There are 123,246 API calls to make.

[2024-11-22, 01:14:28 CST] {logging_mixin.py:137} INFO - URL List created of length: 123246

I'm no mathematician, but 20 calls / second x 60 seconds / minute = 1200 calls / minute. 123,246 calls / 1200 calls / minute ≈ 103 minutes, or about 1 hour and 43 minutes.
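The estimate above can be checked in a few lines of Python. The inputs come from the log line and the stated rate limit; the variable names are only for illustration.

```python
import math

total_calls = 123_246   # from the "URL List created of length" log line
calls_per_second = 20   # RxClass rate limit stated above

# Minimum wall-clock time if every call goes out exactly at the limit.
seconds = total_calls / calls_per_second
minutes = seconds / 60

# Round up to whole minutes and split into hours and minutes.
hours, rem_minutes = divmod(math.ceil(minutes), 60)
print(f"{minutes:.1f} minutes, roughly {hours} h {rem_minutes} min")
```

This is a lower bound: it ignores response latency and any processing between calls.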

When I run my branch locally, it runs for 1 hour and 43 minutes and errors out with the error below.

[2024-11-22, 02:58:01 CST] {taskinstance.py:1768} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/decorators/base.py", line 217, in execute
    return_value = super().execute(context)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 175, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 192, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/rxclass/dag_tasks.py", line 41, in extract_rxclass
    rxclasses = response['rxclassDrugInfoList']['rxclassDrugInfo']
KeyError: 'rxclassDrugInfoList'

As I'm writing this, I think the issue is more about what happens after the API calls have completed - seeing as the time it ran is appropriate based on my #math above and the error seems to be about a KeyError.

Either way, this is not working. Maybe the problem isn't with my rate limiting code, but it would be great to have other eyes on this.
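For anyone reviewing, here is a minimal sketch of one way to throttle calls to 20 per second. It is not the code on the branch: the `RateLimiter` class and its `clock` and `sleep` parameters are hypothetical names chosen for this example, and it simply spaces calls at least 1/20 of a second apart.

```python
import time


class RateLimiter:
    """Spaces calls so that at most `max_per_second` start in any second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        # Reserve the next slot relative to when this call actually starts.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
```

In a loop over the URL list, `limiter = RateLimiter(20)` would be created once and `limiter.wait()` called before each request. Fixed spacing is the simplest choice; it never bursts, so it stays under the limit at the cost of a little throughput.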

Criteria for Success

RxClass DAG runs in about 1 hour 45 minutes and does not error out.

Additional Information

https://lhncbc.nlm.nih.gov/RxNav/TermsofService.html

@saywurdson
Collaborator

@jrlegrand it looks like the code is failing because it does not handle responses that lack the 'rxclassDrugInfoList' key (meaning there is no class data associated with the concept), which raises a KeyError. The problem is happening in the process_concept function. We just need to figure out how to handle this situation more gracefully.

Potential solutions off the top of my head:

  1. We can add logging so that concepts without class data are reported in the terminal.
  2. We can just skip these concepts.
  3. We can add them to the final table, with className and the other class fields left blank.

Let me know which solution you think is best moving forward and I'll see how I can fix the code so that it works.
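As a rough illustration, options 1 and 2 could be combined as below. This is a sketch, not the code on the branch: `extract_drug_info` and its `url` parameter are hypothetical names, and only the two key names are taken from the traceback.

```python
import logging

logger = logging.getLogger(__name__)


def extract_drug_info(response, url=None):
    """Return the rxclassDrugInfo list, or [] when the response has no class data."""
    info_list = response.get("rxclassDrugInfoList")
    if not info_list:
        # Option 1: record which request came back without class data.
        logger.info("No class data in response for %s", url)
        # Option 2: return nothing so the caller skips this concept.
        return []
    return info_list.get("rxclassDrugInfo", [])
```

Option 3 would instead return a single placeholder row with the class fields left empty, which keeps the concept visible in the final table.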

jrlegrand added a commit that referenced this issue Nov 23, 2024
Fixes #331

Checked for rxclassDrugInfoList in response
before trying to process response.

The API piece runs in about 1:45 and overall
runs in about 2:30 which means the loading takes
about 45 min.

Where the key doesn't exist, the object returned
is usually completely empty {}. One weird
thing is spot checking the results in the database
doesn't exactly line up with RxClass UI online.
@jrlegrand
Member Author

I pushed up some code to the branch. It works (see my most recent commit message). It runs in 2.5 hours, which I'm sure could be optimized. I noticed that when the key doesn't exist, the API returns an empty object {}. I also spot checked against RxClass for may_treat "Multiple Myeloma", and SageRx had 3 fewer IN drugs than the RxClass UI online. These were missing in SageRx:
IN 3639 doxorubicin
IN 612937 interferon alfa-n3
IN 72257 interferon beta-1b
https://mor.nlm.nih.gov/RxClass/search?query=Multiple%20Myeloma%7CDISEASE&searchBy=class&sourceIds=&drugSources=atc1-4%7Catcprod%2Cepc%7Cdailymed%2Cdisease%7Cmedrt%2Cchem%7Cdailymed%2Cmoa%7Cdailymed%2Cpe%7Cdailymed%2Cpk%7Cmedrt%2Ctc%7Cfmtsme%2Cva%7Cva%2Cdispos%7Csnomedct%2Cstruct%7Csnomedct%2Ctherap%7Csnomedct%2Cschedule%7Crxnorm

@jrlegrand
Member Author

Hmm... I don't see an IN listed for the may_treat Multiple Myeloma relationship in the API (I'm only seeing the PIN) so maybe it's not an issue with our code. Maybe it's some weird thing with RxClass UI?

API https://rxnav.nlm.nih.gov/REST/rxclass/class/byRxcui.json?rxcui=612937

NOTE: the only may-treat relation is a PIN with RXCUI 72258.

I suspect what the RxClass UI is doing is mapping PIN to IN if an IN doesn't already exist in the list. In other words, I see a lot of PINs that kind of have "sister" INs... except for these 3. They only show up as PINs. But you can map from PIN to IN to get the IN if that's preferred.

RxClass https://mor.nlm.nih.gov/RxClass/search?query=elotuzumab&searchBy=drug&sourceIds=&drugSources=atc1-4%7Catcprod%2Cepc%7Cdailymed%2Cdisease%7Cmedrt%2Cchem%7Cdailymed%2Cmoa%7Cdailymed%2Cpe%7Cdailymed%2Cpk%7Cmedrt%2Ctc%7Cfmtsme%2Cva%7Cva%2Cdispos%7Csnomedct%2Cstruct%7Csnomedct%2Ctherap%7Csnomedct%2Cschedule%7Crxnorm

RxNav https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=72258
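If the IN is preferred, the suspected UI behavior could be imitated along these lines. This is only a sketch of the guess above, not a description of what RxClass actually does: `add_missing_ingredients` is a hypothetical helper, `pin_to_in` is an assumed lookup that would have to be built from RxNorm separately, and the identifiers in the example are placeholders rather than real RXCUIs.

```python
def add_missing_ingredients(members, pin_to_in):
    """Append an IN row for each PIN whose ingredient is not already listed.

    members:   list of (tty, rxcui) tuples for one class relationship.
    pin_to_in: dict mapping a PIN rxcui to its IN rxcui (assumed lookup).
    """
    present_ins = {rxcui for tty, rxcui in members if tty == "IN"}
    result = list(members)
    for tty, rxcui in members:
        if tty != "PIN":
            continue
        ingredient = pin_to_in.get(rxcui)
        if ingredient is not None and ingredient not in present_ins:
            result.append(("IN", ingredient))
            present_ins.add(ingredient)
    return result


# Placeholder identifiers, not real RXCUIs.
members = [("IN", "in-1"), ("PIN", "pin-1"), ("PIN", "pin-2")]
pin_to_in = {"pin-1": "in-1", "pin-2": "in-2"}
filled = add_missing_ingredients(members, pin_to_in)
```

Here `pin-1` already has its "sister" IN in the list, so only the ingredient for `pin-2` is added.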

@jrlegrand
Member Author

[Image: number of rows by rela_source]

@saywurdson
Collaborator

saywurdson commented Nov 25, 2024

#333 - potential optimization
