-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Surface previous dataset versions across URL redirects #783
Conversation
I've supported using local manifest+inventory files from the start as it makes for a much nicer dev experience than fetching from S3 each time. This adds the ability to easily save a local copy of the latest inventory data from S3. This both makes getting started much easier and provides an easy way to update your local data.
This is to allow the resource indexer to import `convertManifestJsonToAvailableDatasetList` without the side-effects of running code on import and setting `global.availableDatasets`. No code changes, just extracting 2 functions into a new file.
following the approach the server uses to redirect URLs. For instance, we change the resource name for "dengue/denv1" to "dengue/denv1/genome" because the former URL is redirected to the latter. This allows the server to access historical datasets for the URL "dengue/denv1/genome" which now include all "dengue/denv1" ("dengue_denv1.json") datasets. Due to the way the server currently works this will not allow historical version access for "dengue/denv1" URLs because the instantiation of the Dataset (Resource) instance happens before the resolve/redirecting happens, and the "dengue/denv1" dataset is considered invalid because it's not present in the index _and_ a version descriptor is being provided.
The automated CI tests were using the correct index
The desired behaviour is that (e.g.) dengue/denv1@version redirects to dengue/denv1/genome@version, similarly to how the non-versioned URL currently behaves. This failed because of three inter-related aspects: 1. The previous commit removed (e.g.) "dengue/denv1" from the index, choosing instead to place that data under the "dengue/denv1/genome" ID. 2. The server instantiates the Resource (Dataset) class for the request before resolving/redirecting the URL. During this instantiation the requested version was checked against the index and since the resource wasn't present in the index this failed (NotFound error). This was solved by deferring the version lookup until after the URL resolving/redirecting stage. 3. The redirect URL didn't include the version descriptor.
f9d2000
to
c88a87d
Compare
/* test Charon API */ | ||
const charonUrl = (path, sidecar) => | ||
`${BASE_URL}/charon/getDataset?prefix=${path}${sidecar ? `&type=root-sequence` : ''}`; | ||
|
||
/* note that the redirect URL of charon requests is constructed by creating | ||
a URL object (the RESTful API does not) which results in '/' and '@' being | ||
escaped in the query. This isn't necessary, but it's also fine to do so. */ | ||
|
||
it(`Charon main dataset: ${fromUrl} goes to ${toUrl} `, async () => { | ||
const res = await fetch(charonUrl(fromUrl, false), {redirect: 'manual'}); | ||
expect(res.status).toEqual(307); | ||
expect(decodeURIComponent(res.headers.get('location'))).toEqual(charonUrl(toUrl, false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
These are only tests, so non-blocking, but this is the wrong way 'round in terms of URL encoding/decoding. The Location header is a URL, not a single URL component, so the more appropriate function is decodeURI
not decodeURIComponent
. Instead of decoding the Location URL at all though, the right way 'round fix is making charonUrl
properly encode the URL it generates.
Recent work surfaced previous dataset versions across URL redirects¹ by mirroring the dataset-name resolution process we use for requests on the server. However it neglected to consider the redirects which are handled prior to this in the server. This commit adds that functionality as well. This situation was recently discussed in slack². To use a couple of examples representative of the redirects we use: 1. the dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the snapshot history of this dataset from 2023-09-23 to 2022-06-12. 2. the dataset (URL) path "ncov/gisaid/global/6m" (and paths which would be resolved to this, such as "ncov", "ncov/gisaid") contained periods without any snapshots because we were using the url "ncov/global". These datasets (versions of "ncov_global.json") are now included as snapshots for "ncov/gisaid/global/6m". ¹ <#783> ² <https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939>
Recent work surfaced previous dataset versions across URL redirects¹ by mirroring the dataset-name resolution process we use for requests on the server. However it neglected to consider the redirects which are handled prior to this in the server. This commit adds that functionality as well. This situation was recently discussed in slack². To use a couple of examples representative of the redirects we use: 1. the dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the snapshot history of this dataset from 2023-09-23 to 2022-06-12. 2. the dataset (URL) path "ncov/gisaid/global/6m" (and paths which would be resolved to this, such as "ncov", "ncov/gisaid") contained periods without any snapshots because we were using the url "ncov/global". These datasets (versions of "ncov_global.json") are now included as snapshots for "ncov/gisaid/global/6m". ¹ <#783> ² <https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939>
This functionality should have been implemented as part of <#783> but that PR didn't consider the hardcoded redirects in `src/redirects.js`
Closes #777. See that issue or the commit messages for full details.
Example URLs which are enabled via this work:
The earliest nCoV dataset we have is from 2020-01-19, and was back then accessed via
/ncov
. Now/ncov@2020-01-19
redirects to/ncov/gisaid/global/6m@2020-01-19
and loads that dataset.We can access the
/dengue/denv1
datasets prior to Jan 2024 via either/dengue/denv1@YYYY-MM-DD
or/dengue/denv1/genome@YYYY-MM-DD
, with the former redirecting to the latter.Testing
Best done locally, via something like
The heroku review app will not correctly use this functionality as the live index has not been updated with the changes in this PR. If you want a non-local way to test this let me know and I'll make it happen. (You can test a subset of the functionality, e.g. /dengue/denv1@2024-01-03 works as expected.)