[resource indexer] include dataset redirects in index #791
 * Produces a list of dataset redirects. These are defined in a stand-alone
 * function so that they can be easily reused by the resource indexer to
 * associate "old" filenames with their new dataset path.
non-blocking
Why a function and not just a static list?
A few reasons:
- I like the organizational structure it imposes. This includes using a docstring which is clear that it applies to all contents (and thus harder to miss).
- It's easy to add checks later on if we want to, e.g. paths must start with `/`, or must not end with `/`, etc.
- More generally (although not really applicable in this case), I prefer deferred execution rather than execution upon import. Writ large we pay a sizeable performance cost.
Hmm. I don't find those compelling in this case, FWIW.
- You can use a JSDoc comment with plain old variables.
- Checks can be done on a plain old variable too, done immediately after. (Or in the example of slashes handling like you give, the values can just be canonicalized with a `.map()` instead of erroring.)
- Deferred execution can be nice, but as you say is not applicable here at all.
These are defined in a stand-alone function so that they can be easily reused by the resource indexer to associate "old" filenames with their new dataset path.

Express routes can include patterns and regexes, but these are hard to parse outside of Express, so I've converted them to plain strings. It was possible to parse the old routes via `path-to-regexp` or by instantiating a small Express app and iterating over the routing stack, but both felt brittle and more complexity than it's worth.

URL queries are now preserved for these redirects for all datasets. This is a change in behaviour for the monkeypox URLs: as an example, "nextstrain.org/monkeypox/mpxv?c=region" will now redirect to "nextstrain.org/mpox/all-clades?c=region". There is a very sizeable caveat to this, however: if the redirect path described here is subsequently redirected by our `canonicalizeDataset` middleware, the query will be discarded at that point. In practice, all of the ncov paths and 1/4 of the mpox paths described here are subsequently redirected and so the URL query is not preserved. <#792> tracks this.
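The query-preserving behaviour described in this commit message can be sketched without Express, using only the standard `URL` class. This is an illustration under assumptions, not the code in `src/redirects.js`: the `redirects` table and the `redirectTarget` helper are hypothetical names.

```javascript
// Hypothetical table of plain-string redirects (old path -> new path).
const redirects = new Map([
  ["/monkeypox/mpxv", "/mpox/all-clades"],
]);

/**
 * Return the redirect target for a request URL, carrying the original
 * query string across, or null if no redirect applies.
 */
function redirectTarget(requestUrl) {
  // The base is only needed so that URL can parse a relative request URL.
  const url = new URL(requestUrl, "http://localhost");
  const newPath = redirects.get(url.pathname);
  if (newPath === undefined) return null;
  // url.search is "" when there is no query, otherwise "?key=value...".
  return newPath + url.search;
}

console.log(redirectTarget("/monkeypox/mpxv?c=region"));
// prints "/mpox/all-clades?c=region"
```

As the caveat above notes, carrying the query across this first hop does not help if a later redirect rebuilds the URL from the path alone.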
While creating a recent timeline of how S3 object changes end up as (on-server) resource index changes <#790> this is the logging that I really wanted.
Recent work surfaced previous dataset versions across URL redirects¹ by mirroring the dataset-name resolution process we use for requests on the server. However, it neglected to consider the redirects which are handled prior to this in the server. This commit adds that functionality as well. This situation was recently discussed in Slack².

To use a couple of examples representative of the redirects we use:

1. The dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the snapshot history of this dataset from 2023-09-23 to 2022-06-12.
2. The dataset (URL) path "ncov/gisaid/global/6m" (and paths which would be resolved to this, such as "ncov", "ncov/gisaid") contained periods without any snapshots because we were using the URL "ncov/global". These datasets (versions of "ncov_global.json") are now included as snapshots for "ncov/gisaid/global/6m".

¹ <#783>
² <https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939>
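A rough sketch of how an indexer could associate "old" filenames with their new dataset path follows. It assumes the filename convention that `a_b.json` corresponds to the URL path `a/b`; the names `redirects`, `filenameToPath` and `indexedPath` are hypothetical, and the object list is made up using the two dates mentioned above.

```javascript
// Hypothetical redirect table (old URL path -> new URL path).
const redirects = new Map([
  ["monkeypox/mpxv", "mpox/all-clades"],
]);

/** Assumed convention: "a_b.json" corresponds to the URL path "a/b". */
function filenameToPath(filename) {
  return filename.replace(/\.json$/, "").split("_").join("/");
}

/** Resolve a filename to the dataset path it should be indexed under. */
function indexedPath(filename) {
  const path = filenameToPath(filename);
  return redirects.has(path) ? redirects.get(path) : path;
}

// Made-up objects, grouped by their resolved dataset path.
const objects = [
  { filename: "monkeypox_mpxv.json", date: "2022-06-12" },
  { filename: "mpox_all-clades.json", date: "2023-09-23" },
];

const snapshots = new Map();
for (const { filename, date } of objects) {
  const path = indexedPath(filename);
  if (!snapshots.has(path)) snapshots.set(path, []);
  snapshots.get(path).push(date);
}

console.log(snapshots.get("mpox/all-clades"));
```

Both objects end up under "mpox/all-clades", which is the effect described in the first example above. The real indexer also has to mirror the server's later dataset-name resolution, which this sketch leaves out.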
47efb0a to c127c4c
These redirects don't consider snapshot URLs, e.g. "http://localhost:5000/monkeypox/mpxv?c=region" redirects appropriately, but "http://localhost:5000/monkeypox/mpxv@2023-01-26?c=region" 404s (despite there being a [...]).

Update: Functionality added in fc330b1
This functionality should have been implemented as part of <#783>, but that PR didn't consider the hardcoded redirects in `src/redirects.js`.
Recent work surfaced previous dataset versions across URL redirects¹ by
mirroring the dataset-name resolution process we use for requests on the
server. However it neglected to consider the redirects which are handled
prior to this in the server. This commit adds that functionality as
well. This situation was recently discussed in slack².
To use mpox as an example: the dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the snapshot history of this dataset from 2023-09-23 to 2022-06-12.
Our usage of `ncov_global.json` (similarly for other regions) is a lot more complex, because we didn't make clean dataset name switches like monkeypox/mpox.

Looking at the current live index (i.e. before this PR) the `ncov.json` dataset stops being uploaded 2020-04-23 and then starts being uploaded again on 2020-09-03, leaving a large gap in the snapshot history. When we do consider `ncov_global.json` (this PR), we fill in this gap with 99 snapshots. Great!

However `ncov.json` continued to be uploaded through 2021-02-15. In cases such as this the indexer will pick the last uploaded version in a given day. However, looking at the data it's clear `ncov.json` was not being rebuilt, simply re-uploaded. For instance, here are nextstrain.org URLs you can view the data in:

- 2020-09-03: ncov.json, ncov_global.json
- 2021-02-15: ncov.json, ncov_global.json

Looking at this data it's clear that we should drop any `ncov.json` datasets after 2020-04-23, which I'll add to this PR now.

Update: no need to programmatically drop, see next message in this PR.

Closes #784
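The "last uploaded version in a given day" rule mentioned above can be sketched as below. This is not the indexer's actual code: the `lastVersionPerDay` helper, the `{filename, lastModified}` record shape and the timestamps are all assumptions made for illustration.

```javascript
/**
 * Keep only the last-uploaded version for each calendar day (UTC).
 * Each version is assumed to look like {filename, lastModified}, where
 * lastModified is an ISO 8601 UTC timestamp string in a uniform format.
 */
function lastVersionPerDay(versions) {
  const byDay = new Map();
  for (const version of versions) {
    const day = version.lastModified.slice(0, 10); // "YYYY-MM-DD"
    const current = byDay.get(day);
    // Uniformly formatted ISO 8601 UTC timestamps compare correctly as strings.
    if (current === undefined || version.lastModified > current.lastModified) {
      byDay.set(day, version);
    }
  }
  return byDay;
}

// Made-up timestamps: two files resolving to the same dataset on one day.
const versions = [
  { filename: "ncov_global.json", lastModified: "2020-09-03T10:00:00Z" },
  { filename: "ncov.json", lastModified: "2020-09-03T18:30:00Z" },
  { filename: "ncov_global.json", lastModified: "2020-09-04T09:15:00Z" },
];

const picked = lastVersionPerDay(versions);
console.log(picked.get("2020-09-03").filename); // prints "ncov.json"
```

With these made-up inputs, a later re-upload of `ncov.json` shadows the same day's `ncov_global.json`, which is the situation the paragraph above is concerned with.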
¹ #783
² https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939