Surface previous dataset versions across URL redirects #783

jameshadfield · 2024-01-17T22:36:56Z

Closes #777. See that issue or the commit messages for full details.

Example URLs which are enabled via this work:

The earliest nCoV dataset we have is from 2020-01-19, and was back then accessed via /ncov. Now /ncov@2020-01-19 redirects to /ncov/gisaid/global/6m@2020-01-19 and loads that dataset.

We can access the /dengue/denv1 datasets prior to Jan 2024 via either /dengue/denv1@YYYY-MM-DD or /dengue/denv1/genome@YYYY-MM-DD, with the former redirecting to the latter.

Testing

Best done locally, via something like

mkdir -p devData
node resourceIndexer/main.js --output devData/index.json --save-inventories \
  --indent --resourceTypes dataset --collections core # needs AWS creds to access s3://nextstrain-inventories
RESOURCE_INDEX="devData/index.json" node server.js --verbose

The heroku review app will not correctly use this functionality as the live index has not been updated with the changes in this PR. If you want a non-local way to test this let me know and I'll make it happen. (You can test a subset of the functionality, e.g. /dengue/denv1@2024-01-03 works as expected.)

src/endpoints/sources.js

resourceIndexer/core-url-remapping.js

I've supported using local manifest+inventory files from the start as it makes for a much nicer dev experience than fetching from S3 each time. This adds the ability to easily save a local copy of the latest inventory data from S3. This both makes getting started much easier and provides an easy way to update your local data.

This is to allow the resource indexer to import `convertManifestJsonToAvailableDatasetList` without the side-effects of running code on import and setting `global.availableDatasets`. No code changes, just extracting 2 functions into a new file.

following the approach the server uses to redirect URLs. For instance, we change the resource name for "dengue/denv1" to "dengue/denv1/genome" because the former URL is redirected to the latter. This allows the server to access historical datasets for the URL "dengue/denv1/genome" which now include all "dengue/denv1" ("dengue_denv1.json") datasets. Due to the way the server currently works this will not allow historical version access for "dengue/denv1" URLs because the instantiation of the Dataset (Resource) instance happens before the resolve/redirecting happens, and the "dengue/denv1" dataset is considered invalid because it's not present in the index _and_ a version descriptor is being provided.

The automated CI tests were using the correct index

The desired behaviour is that (e.g.) dengue/denv1@version redirects to dengue/denv1/genome@version, similarly to how the non-versioned URL currently behaves. This failed because of three inter-related aspects: 1. The previous commit removed (e.g.) "dengue/denv1" from the index, choosing instead to place that data under the "dengue/denv1/genome" ID. 2. The server instantiates the Resource (Dataset) class for the request before resolving/redirecting the URL. During this instantiation the requested version was checked against the index and since the resource wasn't present in the index this failed (NotFound error). This was solved by deferring the version lookup until after the URL resolving/redirecting stage. 3. The redirect URL didn't include the version descriptor.

tsibley · 2024-01-19T22:22:26Z

test/date_descriptor.test.js

+    /* test Charon API */
+    const charonUrl = (path, sidecar) =>
+      `${BASE_URL}/charon/getDataset?prefix=${path}${sidecar ? `&type=root-sequence` : ''}`;
+
+    /* note that the redirect URL of charon requests is constructed by creating
+    a URL object (the RESTful API does not) which results in '/' and '@' being
+    escaped in the query. This isn't necessary, but it's also fine to do so. */
+
+    it(`Charon main dataset: ${fromUrl} goes to ${toUrl} `, async () => {
+      const res = await fetch(charonUrl(fromUrl, false), {redirect: 'manual'});
+      expect(res.status).toEqual(307);
+      expect(decodeURIComponent(res.headers.get('location'))).toEqual(charonUrl(toUrl, false));


non-blocking

These are only tests, so non-blocking, but this is the wrong way 'round in terms of URL encoding/decoding. The Location header is a URL, not a single URL component, so the more appropriate function is decodeURI not decodeURIComponent. Instead of decoding the Location URL at all though, the right way 'round fix is making charonUrl properly encode the URL it generates.

Changes via <nextstrain/nextstrain.org#783>

Recent work surfaced previous dataset versions across URL redirects¹ by mirroring the dataset-name resolution process we use for requests on the server. However it neglected to consider the redirects which are handled prior to this in the server. This commit adds that functionality as well. This situation was recently discussed in slack². To use a couple of examples representative of the redirects we use: 1. the dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the snapshot history of this dataset from 2023-09-23 to 2022-06-12. 2. the dataset (URL) path "ncov/gisaid/global/6m" (and paths which would be resolved to this, such as "ncov", "ncov/gisaid") contained periods without any snapshots because we were using the url "ncov/global". These datasets (versions of "ncov_global.json") are now included as snapshots for "ncov/gisaid/global/6m". ¹ <#783> ² <https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939>

This functionality should have been implemented as part of <#783> but that PR didn't consider the hardcoded redirects in `src/redirects.js`

nextstrain-bot temporarily deployed to nextstrain-s-james-reso-uw9ge9 January 17, 2024 22:37 Inactive

jameshadfield requested a review from a team January 17, 2024 22:48

tsibley self-requested a review January 18, 2024 21:52

tsibley requested changes Jan 19, 2024

View reviewed changes

jameshadfield added 5 commits January 19, 2024 15:08

[charon] refactor manifest parsing code

331f33e

This is to allow the resource indexer to import `convertManifestJsonToAvailableDatasetList` without the side-effects of running code on import and setting `global.availableDatasets`. No code changes, just extracting 2 functions into a new file.

[tests] Fix typo in comment

ecf2134

The automated CI tests were using the correct index

jameshadfield force-pushed the james/resource-index-url-remapping branch from f9d2000 to c88a87d Compare January 19, 2024 02:51

nextstrain-bot temporarily deployed to nextstrain-s-james-reso-uw9ge9 January 19, 2024 02:51 Inactive

tsibley approved these changes Jan 19, 2024

View reviewed changes

jameshadfield merged commit 2ab0598 into master Jan 21, 2024
5 checks passed

jameshadfield deleted the james/resource-index-url-remapping branch January 21, 2024 20:42

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Jan 22, 2024

Update versions / snapshots docs

92bc452

Changes via <nextstrain/nextstrain.org#783>

jameshadfield mentioned this pull request Jan 22, 2024

Update versions / snapshots docs nextstrain/docs.nextstrain.org#182

Merged

jameshadfield mentioned this pull request Feb 1, 2024

[resource indexer] include dataset redirects in index #791

Merged

jameshadfield added a commit that referenced this pull request Feb 6, 2024

Top-level redirects handle versions

fc330b1

This functionality should have been implemented as part of <#783> but that PR didn't consider the hardcoded redirects in `src/redirects.js`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Surface previous dataset versions across URL redirects #783

Surface previous dataset versions across URL redirects #783

jameshadfield commented Jan 17, 2024 •

edited

Loading

tsibley Jan 19, 2024

Surface previous dataset versions across URL redirects #783

Surface previous dataset versions across URL redirects #783

Conversation

jameshadfield commented Jan 17, 2024 • edited Loading

Example URLs which are enabled via this work:

Testing

tsibley Jan 19, 2024

Choose a reason for hiding this comment

jameshadfield commented Jan 17, 2024 •

edited

Loading