Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[resource indexer] include dataset redirects in index #791

Merged
merged 4 commits into from
Feb 7, 2024

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Feb 1, 2024

Recent work surfaced previous dataset versions across URL redirects¹ by
mirroring the dataset-name resolution process we use for requests on the
server. However it neglected to consider the redirects which are handled
prior to this in the server. This commit adds that functionality as
well. This situation was recently discussed in slack².

To use mpox as an example: the dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the snapshot history of this dataset from 2023-09-23 to 2022-06-12.

Our usage of ncov_global.json (similarly for other regions) is a lot more complex, because we didn't make clean dataset name switches like monkeypox/mpox.

Looking at the current live index (i.e. before this PR) the ncov.json dataset stops being uploaded 2020-04-23 and then starts being uploaded again on 2020-09-03, leaving a large gap in the snapshot history. When we do consider ncov_global.json (this PR), we fill in this gap with 99 snapshots. Great!

However ncov.json continued to be uploaded through 2021-02-15. In cases such as this the indexer will pick the last uploaded version in a given day. However looking at the data it's clear ncov.json was not being rebuilt, simply re-uploaded. For instance, here are nextstrain.org URLs you can view the data in:

Looking at this data it's clear that we should drop any ncov.json datasets after 2020-04-23, which I'll add to this PR now update: no need to programmatically drop, see next message in this PR

Closes #784

¹ #783
² https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939

@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-inde-fp0zqu February 1, 2024 22:53 Inactive
@jameshadfield
Copy link
Member Author

Here's all the days where we had multiple datasets of different filenames where we have to choose one to represent the "snapshot" for the day. The links will show the dataset in Auspice. The first filename is the one the indexer will use for the day (as it was the most recently updated). The takeaways:

  1. ncov.json never appears in the index after 2020-04-07, despite it being regularly re-uploaded to S3. Good.
  2. 2022-04-30 was the first day we switched from ncov/open/:region to ncov/open/:region/:timeWindow and we uploaded both sets that day. The datasets are quite different, but the indexer is taking the 6m version which is correct.

@jameshadfield jameshadfield marked this pull request as ready for review February 2, 2024 01:28
@jameshadfield jameshadfield requested a review from a team February 2, 2024 01:28
Comment on lines +124 to +129
* Produces a list of dataset redirects. These are defined in a stand-alone
* function so that they can be easily reused by the resource indexer to
* associate "old" filenames with their new dataset path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Why a function and not just a static list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few reasons:

  • I like the organizational structure it imposes. This includes using a docstring which is clear that it applies to all contents (and thus harder to miss).
  • It's easy to add checks later on if we want to - e.g. paths must start with /, or must not end with / etc.
  • More generally (although not really applicable in this case), I prefer deferred execution rather than execution upon import. Writ large we pay a sizeable performance cost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I don't find those compelling in this case, FWIW.

  • You can use a JSDoc comment with plain old variables.
  • Checks can be done on a plain old variable too, done immediately after. (Or in the example of slashes handling like you give, the values can just be canonicalized with a .map() instead of erroring.)
  • Deferred execution can be nice, but as you say is not applicable here at all.

src/redirects.js Outdated Show resolved Hide resolved
These are defined in a stand-alone function so that they can be easily
reused by the resource indexer to associate "old" filenames with their
new dataset path.

Express routes can include patterns and regexes but these are hard to
parse outside of express so I've converted to plain strings. It was
possible to parse the old routes via `path-to-regexp` or by
instantiating a small express app and iterating over the routing stack,
but both felt brittle and more complexity than its worth.

URL queries are now preserved for these redirects for all datasets. This
is a change in behaviour for the monkeypox URLs - as an example
"nextstrain.org/monkeypox/mpxv?c=region" will now redirect to
"nextstrain.org/mpox/all-clades?c=region". There is a very sizeable
caveat to this however, because if the redirect path described here is
subsequently redirected by our `canonicalizeDataset` middleware the
query will be discarded at that point. In practice, all of the ncov
paths and 1/4 of the mpox paths described here are subsequently
redirected and so the URL query is not preserved.
<#792> tracks this.
While creating a recent timeline of how S3 object changes end up as
(on-server) resource index changes
<#790> this is the
logging that I really wanted.
Recent work surfaced previous dataset versions across URL redirects¹ by
mirroring the dataset-name resolution process we use for requests on the
server. However it neglected to consider the redirects which are handled
prior to this in the server. This commit adds that functionality as
well. This situation was recently discussed in slack².

To use a couple of examples representative of the redirects we use:
1. the dataset (URL) path "mpox/all-clades" now includes previous
   versions which were named "monkeypox_mpxv.json", thus extending the
   snapshot history of this dataset from 2023-09-23 to 2022-06-12.
2. the dataset (URL) path "ncov/gisaid/global/6m" (and paths which would
   be resolved to this, such as "ncov", "ncov/gisaid") contained periods
   without any snapshots because we were using the url "ncov/global".
   These datasets (versions of "ncov_global.json") are now included as
   snapshots for "ncov/gisaid/global/6m".

¹ <#783>
² <https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939>
@jameshadfield
Copy link
Member Author

jameshadfield commented Feb 6, 2024

These redirects don't consider snapshot URLs, e.g. "http://localhost:5000/monkeypox/mpxv?c=region" redirects appropriately, "http://localhost:5000/monkeypox/mpxv@2023-01-26?c=region" 404s (despite there being a monkeypox_mpxv.json uploaded that day. I'll fix this up.

Update: Functionality added in fc330b1

This functionality should have been implemented as part of
<#783> but that PR
didn't consider the hardcoded redirects in `src/redirects.js`
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-inde-fp0zqu February 6, 2024 23:52 Inactive
Comment on lines +124 to +129
* Produces a list of dataset redirects. These are defined in a stand-alone
* function so that they can be easily reused by the resource indexer to
* associate "old" filenames with their new dataset path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I don't find those compelling in this case, FWIW.

  • You can use a JSDoc comment with plain old variables.
  • Checks can be done on a plain old variable too, done immediately after. (Or in the example of slashes handling like you give, the values can just be canonicalized with a .map() instead of erroring.)
  • Deferred execution can be nice, but as you say is not applicable here at all.

@jameshadfield jameshadfield merged commit 9bc6c47 into master Feb 7, 2024
5 checks passed
@jameshadfield jameshadfield deleted the james/indexer-redirects branch February 7, 2024 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[indexer] incorrect redirect behaviour
3 participants