[resource indexer] include dataset redirects in index #791

jameshadfield · 2024-02-01T22:53:40Z

Recent work surfaced previous dataset versions across URL redirects¹ by
mirroring the dataset-name resolution process we use for requests on the
server. However it neglected to consider the redirects which are handled
prior to this in the server. This commit adds that functionality as
well. This situation was recently discussed in slack².

To use mpox as an example: the dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the snapshot history of this dataset from 2023-09-23 to 2022-06-12.

Our usage of ncov_global.json (similarly for other regions) is a lot more complex, because we didn't make clean dataset name switches like monkeypox/mpox.

Looking at the current live index (i.e. before this PR) the ncov.json dataset stops being uploaded 2020-04-23 and then starts being uploaded again on 2020-09-03, leaving a large gap in the snapshot history. When we do consider ncov_global.json (this PR), we fill in this gap with 99 snapshots. Great!

However ncov.json continued to be uploaded through 2021-02-15. In cases such as this the indexer will pick the last uploaded version in a given day. However looking at the data it's clear ncov.json was not being rebuilt, simply re-uploaded. For instance, here are nextstrain.org URLs you can view the data in:

2020-09-03: ncov.json, ncov_global.json
2021-02-15: ncov.json, ncov_global.json

Looking at this data it's clear that we should drop any ncov.json datasets after 2020-04-23, which ~~I'll add to this PR now~~ update: no need to programmatically drop, see next message in this PR

Closes #784

¹ #783
² https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939

jameshadfield · 2024-02-02T01:28:00Z

Here's all the days where we had multiple datasets of different filenames where we have to choose one to represent the "snapshot" for the day. The links will show the dataset in Auspice. The first filename is the one the indexer will use for the day (as it was the most recently updated). The takeaways:

ncov.json never appears in the index after 2020-04-07, despite it being regularly re-uploaded to S3. Good.
2022-04-30 was the first day we switched from ncov/open/:region to ncov/open/:region/:timeWindow and we uploaded both sets that day. The datasets are quite different, but the indexer is taking the 6m version which is correct.

2022-04-30| ncov_open_europe_6m.json, ncov_open_europe.json
2022-04-30| ncov_open_global_6m.json, ncov_open_global.json
2022-04-30| ncov_open_north-america_6m.json, ncov_open_north-america.json
2022-04-30| ncov_open_asia_6m.json, ncov_open_asia.json
2022-04-30| ncov_open_oceania_6m.json, ncov_open_oceania.json
2022-04-30| ncov_open_africa_6m.json, ncov_open_africa.json
2022-04-30| ncov_open_south-america_6m.json, ncov_open_south-america.json
2021-02-15| ncov_global.json, ncov.json
2021-02-08| ncov_global.json, ncov.json
2021-02-04| ncov_global.json, ncov.json
2021-02-02| ncov_global.json, ncov.json
2021-01-29| ncov_global.json, ncov.json
2021-01-27| ncov_global.json, ncov.json
2021-01-25| ncov_global.json, ncov.json
2021-01-21| ncov_global.json, ncov.json
2021-01-19| ncov_global.json, ncov.json
2021-01-16| ncov_global.json, ncov.json
2021-01-13| ncov_global.json, ncov.json
2021-01-11| ncov_global.json, ncov.json
2021-01-07| ncov_global.json, ncov.json
2021-01-05| ncov_global.json, ncov.json
2021-01-04| ncov_global.json, ncov.json
2021-01-01| ncov_global.json, ncov.json
2020-12-28| ncov_global.json, ncov.json
2020-12-20| ncov_global.json, ncov.json
2020-12-18| ncov_global.json, ncov.json
2020-12-16| ncov_global.json, ncov.json
2020-12-10| ncov_global.json, ncov.json
2020-12-08| ncov_global.json, ncov.json
2020-12-04| ncov_global.json, ncov.json
2020-11-30| ncov_global.json, ncov.json
2020-11-27| ncov_global.json, ncov.json
2020-11-26| ncov_global.json, ncov.json
2020-11-24| ncov_global.json, ncov.json
2020-11-20| ncov_global.json, ncov.json
2020-11-18| ncov_global.json, ncov.json
2020-11-16| ncov_global.json, ncov.json
2020-11-14| ncov_global.json, ncov.json
2020-11-12| ncov_global.json, ncov.json
2020-11-10| ncov_global.json, ncov.json
2020-11-06| ncov_global.json, ncov.json
2020-11-04| ncov_global.json, ncov.json
2020-11-02| ncov_global.json,`ncov.json
2020-10-29| ncov_global.json, ncov.json
2020-10-27| ncov_global.json, ncov.json
2020-10-23| ncov_global.json, ncov.json
2020-10-21| ncov_global.json, ncov.json
2020-10-19| ncov_global.json, ncov.json
2020-10-15| ncov_global.json, ncov.json
2020-10-13| ncov_global.json, ncov.json
2020-10-09| ncov_global.json, ncov.json
2020-10-07| ncov_global.json, ncov.json
2020-10-05| ncov_global.json, ncov.json
2020-10-01| ncov_global.json, ncov.json
2020-09-29| ncov_global.json, ncov.json
2020-09-25| ncov_global.json, ncov.json
2020-09-23| ncov_global.json, ncov.json
2020-09-21| ncov_global.json, ncov.json
2020-09-03| ncov_global.json, ncov.json
2020-04-23| ncov_global.json, ncov.json
2020-04-21| ncov_global.json, ncov.json
2020-04-20| ncov_global.json, ncov.json
2020-04-19| ncov_global.json, ncov.json
2020-04-18| ncov_global.json, ncov.json
2020-04-17| ncov_global.json, ncov.json
2020-04-16| ncov_global.json, ncov.json
2020-04-15| ncov_global.json, ncov.json
2020-04-14| ncov_global.json, ncov.json
2020-04-12| ncov_global.json, ncov.json
2020-04-11| ncov_global.json, ncov.json
2020-04-10| ncov_global.json, ncov.json
2020-04-09| ncov_global.json, ncov.json
2020-04-08| ncov_global.json, ncov.json

tsibley · 2024-02-05T22:09:49Z

src/redirects.js

+ * Produces a list of dataset redirects. These are defined in a stand-alone
+ * function so that they can be easily reused by the resource indexer to
+ * associate "old" filenames with their new dataset path.


non-blocking

Why a function and not just a static list?

A few reasons:

I like the organizational structure it imposes. This includes using a docstring which is clear that it applies to all contents (and thus harder to miss).

It's easy to add checks later on if we want to - e.g. paths must start with /, or must not end with / etc.

More generally (although not really applicable in this case), I prefer deferred execution rather than execution upon import. Writ large we pay a sizeable performance cost.

Hmm. I don't find those compelling in this case, FWIW.

You can use a JSDoc comment with plain old variables.

Checks can be done on a plain old variable too, done immediately after. (Or in the example of slashes handling like you give, the values can just be canonicalized with a .map() instead of erroring.)

Deferred execution can be nice, but as you say is not applicable here at all.

src/redirects.js

These are defined in a stand-alone function so that they can be easily reused by the resource indexer to associate "old" filenames with their new dataset path. Express routes can include patterns and regexes but these are hard to parse outside of express so I've converted to plain strings. It was possible to parse the old routes via `path-to-regexp` or by instantiating a small express app and iterating over the routing stack, but both felt brittle and more complexity than its worth. URL queries are now preserved for these redirects for all datasets. This is a change in behaviour for the monkeypox URLs - as an example "nextstrain.org/monkeypox/mpxv?c=region" will now redirect to "nextstrain.org/mpox/all-clades?c=region". There is a very sizeable caveat to this however, because if the redirect path described here is subsequently redirected by our `canonicalizeDataset` middleware the query will be discarded at that point. In practice, all of the ncov paths and 1/4 of the mpox paths described here are subsequently redirected and so the URL query is not preserved. <#792> tracks this.

While creating a recent timeline of how S3 object changes end up as (on-server) resource index changes <#790> this is the logging that I really wanted.

Recent work surfaced previous dataset versions across URL redirects¹ by mirroring the dataset-name resolution process we use for requests on the server. However it neglected to consider the redirects which are handled prior to this in the server. This commit adds that functionality as well. This situation was recently discussed in slack². To use a couple of examples representative of the redirects we use: 1. the dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the snapshot history of this dataset from 2023-09-23 to 2022-06-12. 2. the dataset (URL) path "ncov/gisaid/global/6m" (and paths which would be resolved to this, such as "ncov", "ncov/gisaid") contained periods without any snapshots because we were using the url "ncov/global". These datasets (versions of "ncov_global.json") are now included as snapshots for "ncov/gisaid/global/6m". ¹ <#783> ² <https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939>

jameshadfield · 2024-02-06T21:03:42Z

These redirects don't consider snapshot URLs, e.g. "http://localhost:5000/monkeypox/mpxv?c=region" redirects appropriately, "http://localhost:5000/monkeypox/mpxv@2023-01-26?c=region" 404s (despite there being a monkeypox_mpxv.json uploaded that day. I'll fix this up.

Update: Functionality added in fc330b1

This functionality should have been implemented as part of <#783> but that PR didn't consider the hardcoded redirects in `src/redirects.js`

tsibley · 2024-02-07T18:21:23Z

src/redirects.js

+ * Produces a list of dataset redirects. These are defined in a stand-alone
+ * function so that they can be easily reused by the resource indexer to
+ * associate "old" filenames with their new dataset path.


Hmm. I don't find those compelling in this case, FWIW.

You can use a JSDoc comment with plain old variables.

Checks can be done on a plain old variable too, done immediately after. (Or in the example of slashes handling like you give, the values can just be canonicalized with a .map() instead of erroring.)

Deferred execution can be nice, but as you say is not applicable here at all.

nextstrain-bot temporarily deployed to nextstrain-s-james-inde-fp0zqu February 1, 2024 22:53 Inactive

jameshadfield marked this pull request as ready for review February 2, 2024 01:28

jameshadfield requested a review from a team February 2, 2024 01:28

tsibley approved these changes Feb 5, 2024

View reviewed changes

jameshadfield added 3 commits February 7, 2024 09:42

[resource index] improve logging

8ebcb4c

While creating a recent timeline of how S3 object changes end up as (on-server) resource index changes <#790> this is the logging that I really wanted.

jameshadfield force-pushed the james/indexer-redirects branch from 47efb0a to c127c4c Compare February 6, 2024 20:56

nextstrain-bot temporarily deployed to nextstrain-s-james-inde-fp0zqu February 6, 2024 20:56 Inactive

Top-level redirects handle versions

fc330b1

This functionality should have been implemented as part of <#783> but that PR didn't consider the hardcoded redirects in `src/redirects.js`

nextstrain-bot temporarily deployed to nextstrain-s-james-inde-fp0zqu February 6, 2024 23:52 Inactive

tsibley mentioned this pull request Feb 7, 2024

Preserve URL queries across redirects #793

Merged

tsibley approved these changes Feb 7, 2024

View reviewed changes

jameshadfield merged commit 9bc6c47 into master Feb 7, 2024
5 checks passed

jameshadfield deleted the james/indexer-redirects branch February 7, 2024 21:06

jameshadfield mentioned this pull request Feb 7, 2024

[CI] False positive build failures #794

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[resource indexer] include dataset redirects in index #791

[resource indexer] include dataset redirects in index #791

jameshadfield commented Feb 1, 2024 •

edited

Loading

jameshadfield commented Feb 2, 2024

tsibley Feb 5, 2024

jameshadfield Feb 6, 2024

tsibley Feb 7, 2024

jameshadfield commented Feb 6, 2024 •

edited

Loading

tsibley Feb 7, 2024

[resource indexer] include dataset redirects in index #791

[resource indexer] include dataset redirects in index #791

Conversation

jameshadfield commented Feb 1, 2024 • edited Loading

jameshadfield commented Feb 2, 2024

tsibley Feb 5, 2024

Choose a reason for hiding this comment

jameshadfield Feb 6, 2024

Choose a reason for hiding this comment

tsibley Feb 7, 2024

Choose a reason for hiding this comment

jameshadfield commented Feb 6, 2024 • edited Loading

tsibley Feb 7, 2024

Choose a reason for hiding this comment

jameshadfield commented Feb 1, 2024 •

edited

Loading

jameshadfield commented Feb 6, 2024 •

edited

Loading