1972 replace covidcast #2004

Closed

Conversation

aysim319
Contributor

@aysim319 aysim319 commented Jul 26, 2024

Description

Refactored away from covidcast, which under the hood uses a for loop over each day to fetch signals.

Changelog

Itemize code/test/documentation changes and files added/removed.

  • replaced instances of covidcast.signal and covidcast.metadata with respective epidata api calls
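
A minimal sketch of the response handling such a replacement needs, assuming the classic Epidata response shape (a dict with "result", "message", and "epidata") and treating a result of 1 as success, as the check quoted later in this PR does. The helper name `rows_from_response`, the `EpidataError` class, and the sample response are hypothetical and not taken from this PR; the sample dict stands in for a real call such as `Epidata.covidcast(...)` so the snippet runs offline.

```python
from typing import Any


class EpidataError(RuntimeError):
    """Hypothetical exception for a non-success Epidata-style response."""


def rows_from_response(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the data rows of an Epidata-style response, or raise."""
    # Assumption: result == 1 means success; anything else is an error here.
    if response.get("result") != 1:
        raise EpidataError(
            f"result={response.get('result')}: {response.get('message')}"
        )
    return response["epidata"]


# Stand-in for a real API response, so the sketch needs no network access.
fake_response = {
    "result": 1,
    "message": "success",
    "epidata": [{"geo_value": "pa", "time_value": 20240701, "value": 1.5}],
}
rows = rows_from_response(fake_response)
```

Raising on any non-success code keeps a failed fetch from silently turning into an empty dataset further down the validator.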

Associated Issue(s)

  • Addresses #(1972)
  • Addresses #(1931)
  • Addresses #(1987)

@aysim319
Contributor Author

aysim319 commented Jul 30, 2024

The refactored version takes more time than the original, possibly because it fetches the whole dataset instead of truncated data, so the validator takes longer to process it.

cprofile_google_symptoms.txt
refactored_cprofile_google_symptoms.txt

Edited:
Removed the issues param, since it's not used in the other signal calls in google symptoms and sircomplainsalot. The run did go faster afterwards, and the refactored, optimized version has the same result as main.

runs: 171.824 secs
opt_cprofile_google_symptoms.txt
opt_profile_google_symptoms.log

runs: 265.703 secs
profile_google_symptoms.log
cprofile_google_symptoms.txt
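
For reference, a minimal sketch of how a timing plus cProfile report like the attached files could be produced with the standard library. This is not the command actually used for the runs above; `profile_call` and `fake_indicator_run` are hypothetical names, and the toy function stands in for a real indicator run.

```python
import cProfile
import io
import pstats
import time


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, elapsed_seconds, report_text)."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    result = profiler.runcall(func, *args, **kwargs)
    elapsed = time.perf_counter() - start

    # Render the ten most expensive entries by cumulative time into a string.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, elapsed, buffer.getvalue()


def fake_indicator_run(n):
    # Hypothetical stand-in for the real pipeline being profiled.
    return sum(i * i for i in range(n))


result, elapsed, report = profile_call(fake_indicator_run, 10_000)
```

Running the same wrapper on both branches with identical arguments gives wall-clock numbers and profiles that can be compared directly.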

@aysim319 aysim319 force-pushed the 1972-replace-covidcast branch 2 times, most recently from 1ff5866 to 6e22db8 Compare July 30, 2024 22:45
@aysim319 aysim319 linked an issue Jul 31, 2024 that may be closed by this pull request
@melange396
Contributor

@aysim319 can you explain your previous comment a little more? what do each of those files represent? the first two files appear to be telling me that this branch runs a little bit slower (281s) than the dev branch (267s)... but what operation are they performing (ie, what commands and arguments did you use to run these samples)?

_delphi_utils_python/delphi_utils/covidcast_wrapper.py: 4 outdated, resolved review threads
_delphi_utils_python/delphi_utils/validator/datafetcher.py: 1 outdated, resolved review thread
_delphi_utils_python/delphi_utils/validator/dynamic.py: 1 outdated, resolved review thread
_delphi_utils_python/tests/test_covidcast_wrapper.py: 3 outdated, resolved review threads
@aysim319
Contributor Author

@aysim319 can you explain your previous comment a little more? what do each of those files represent? the first two files appear to be telling me that this branch runs a little bit slower (281s) than the dev branch (267s)... but what operation are they performing (ie, what commands and arguments did you use to run these samples)?

The first time I ran the comparison, the profiled run took more time because of the issues param, which I thought I also needed to format in order to make the call to epidata. I later found out I didn't need to pass the issues param at all, which fixed both the speed and the differences in the validator results.

Contributor

@dshemetov dshemetov left a comment

I want to make sure the tests we have are solid, so I made a few change requests.

testing_utils/check_covidcast_port.py: 4 outdated, resolved review threads
google_symptoms/delphi_google_symptoms/run.py: 1 outdated, resolved review thread
@aysim319
Contributor Author

aysim319 commented Aug 9, 2024

Couldn't find a way to elegantly run the whole thing while keeping 2 separate logs, so I ran the first half, where it calls the covidcast API, and saved the results to parquet, with the rest commented out.
covidcast_signal.log: this took about an hour

Then I ran the whole thing to get the logs; apart from the initial metadata run, the only logs are from epidata.
epidata_signal.log: this took about 30 minutes

Contributor

@dshemetov dshemetov left a comment

As far as correctness goes, this is looking good to me. I tested locally and the new port is parsing things identically to the old covidcast function (FWIW, the client logs aren't that important here, we just really want to make sure the DataFrame outputs from our API calls are absolutely identical).
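
A minimal sketch of that kind of identity check, shown on plain lists of records so it runs with the standard library only; on real DataFrames, `pandas.testing.assert_frame_equal` would be the natural tool. The helper name `first_mismatch`, the default key fields, and the sample rows are hypothetical.

```python
def first_mismatch(old_rows, new_rows, key_fields=("geo_value", "time_value")):
    """Return a description of the first difference, or None if identical."""

    def sort_key(row):
        return tuple(row[field] for field in key_fields)

    # Sort both sides by the key so row order does not affect the comparison.
    old_sorted = sorted(old_rows, key=sort_key)
    new_sorted = sorted(new_rows, key=sort_key)

    if len(old_sorted) != len(new_sorted):
        return f"row count differs: {len(old_sorted)} vs {len(new_sorted)}"
    for old, new in zip(old_sorted, new_sorted):
        if old != new:
            return f"row {sort_key(old)} differs: {old} vs {new}"
    return None


old = [
    {"geo_value": "pa", "time_value": 20240701, "value": 1.5},
    {"geo_value": "ny", "time_value": 20240701, "value": 2.0},
]
new = list(reversed(old))  # same rows, different order
mismatch = first_mismatch(old, new)
```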

I convinced myself that we don't need to test every signal: the covidcast response schema is the same for every signal, so testing a single signal API query gets you 99% coverage (as long as the query returns a representative data subset). The only snag is that time_value can be in two different formats (date or epiweek), so as long as we test NCHS along with the other sources, we get full coverage. I tweaked the test to test a single signal per source, so it runs much faster now (thank you for doing the comprehensive runs nonetheless, it's nice to have that extra safety!).
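
To make the two time_value formats concrete, here is a minimal sketch of parsing both, assuming daily values are YYYYMMDD integers and weekly values are YYYYWW epiweeks following the MMWR convention (weeks start on Sunday, and week 1 is the week containing January 4). The function name `parse_time_value` is hypothetical, and this is not the `_parse_datetimes` implementation from the PR.

```python
from datetime import date, datetime, timedelta


def parse_time_value(time_value: int, time_type: str) -> date:
    """Convert a YYYYMMDD day or a YYYYWW epiweek to a date."""
    text = str(time_value)
    if time_type == "day":
        return datetime.strptime(text, "%Y%m%d").date()
    if time_type == "week":
        year, week = int(text[:4]), int(text[4:])
        # MMWR week 1 contains January 4; step back to that week's Sunday.
        jan4 = date(year, 1, 4)
        week1_start = jan4 - timedelta(days=(jan4.weekday() + 1) % 7)
        # Return the Sunday that starts the requested epiweek.
        return week1_start + timedelta(weeks=week - 1)
    raise ValueError(f"unsupported time_type: {time_type!r}")


day_value = parse_time_value(20240726, "day")
week_value = parse_time_value(202001, "week")
```

A test that covers one daily source and one weekly source exercises both branches.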

I also tried fixing some anti-patterns in the covidcast code you ported over, specifically _parse_datetimes, which should have made it a bit faster. However, I made an error there, and fixing it made the code a lot uglier. I think clarity is more important, so I reverted the change.

TODO:

  • fix conflicts
  • test with CI (for some reason CI isn't running?)


response = Epidata.covidcast_meta()

if response["result"] != 1:
Contributor Author

@melange396 would this do the trick? or still add the conditional anyway?

@aysim319
Contributor Author

aysim319 commented Sep 13, 2024

Screwed up by rebasing instead of merging; continuing from the last comment in #2056.

Successfully merging this pull request may close these issues.

replace the python covidcast client in validator
3 participants