
feat(nf-tower): add dataset:// FileSystem provider #6866

Draft

edmundmiller wants to merge 8 commits into nextflow-io:master from edmundmiller:dataset-filesystem-provider

Conversation

@edmundmiller
Member

Summary

NIO FileSystemProvider for dataset:// URIs in the nf-tower plugin. Resolves Seqera Platform dataset references to their backing cloud storage paths transparently — pipelines use file('dataset://my-samplesheet') with zero code changes.

Replaces the rejected Channel.fromDataset() approach (seqeralabs/nextflow#19) with a platform-agnostic file system abstraction.

Resolution flow

file('dataset://my-samplesheet')
  → FileHelper.asPath() → DatasetPathFactory.parseUri()
  → DatasetFileSystemProvider → DatasetPath (lazy)
  → DatasetResolver: GET /datasets → ID → GET /versions → cloud URL
  → FileHelper.asPath('s3://bucket/data.csv') → delegate to cloud FS
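The first step of this flow, splitting a `dataset://name?version=N` URI into a dataset name and optional version, can be sketched with `java.net.URI`. This is an illustrative sketch, not the PR's actual `DatasetPath` code; the class and field names here are hypothetical.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical sketch: parse dataset://name?version=N into its parts.
class DatasetUri {
    final String name;
    final Optional<Integer> version;

    DatasetUri(String name, Optional<Integer> version) {
        this.name = name;
        this.version = version;
    }

    static DatasetUri parse(String uri) {
        URI u = URI.create(uri);
        if (!"dataset".equals(u.getScheme()))
            throw new IllegalArgumentException("Not a dataset:// URI: " + uri);
        // For dataset://my-samplesheet the name lands in the authority part
        String name = u.getAuthority();
        Optional<Integer> version = Optional.empty();
        String query = u.getQuery();
        if (query != null && query.startsWith("version="))
            version = Optional.of(Integer.parseInt(query.substring("version=".length())));
        return new DatasetUri(name, version);
    }
}
```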

Components

| File | Purpose |
| --- | --- |
| `DatasetFileSystemProvider.java` | NIO SPI provider; delegates I/O to the resolved cloud path |
| `DatasetFileSystem.java` | Minimal read-only `FileSystem` |
| `DatasetPath.java` | `Path` implementation with lazy resolution; parses `dataset://name?version=N` |
| `DatasetResolver.groovy` | Platform API client (`HttpClient` + Bearer auth) |
| `DatasetPathFactory.groovy` | `FileSystemPathFactory` extension point |

Phase 1 scope

  • Read-only — write ops throw ReadOnlyFileSystemException
  • Works with file(), Channel.fromPath(), nf-schema samplesheetToList()
  • Config from existing tower.* settings (endpoint, accessToken, workspaceId)
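The read-only enforcement described above can be sketched as follows. This is not the PR's exact provider code, only an illustration of the pattern: mutating operations throw `ReadOnlyFileSystemException`, while reads are delegated to the resolved cloud path.

```java
import java.nio.file.CopyOption;
import java.nio.file.Path;
import java.nio.file.ReadOnlyFileSystemException;

// Illustrative sketch: every mutating operation is rejected.
class ReadOnlyOps {
    void createDirectory(Path dir) {
        throw new ReadOnlyFileSystemException();
    }
    void delete(Path path) {
        throw new ReadOnlyFileSystemException();
    }
    void copy(Path source, Path target, CopyOption... options) {
        throw new ReadOnlyFileSystemException();
    }
    void move(Path source, Path target, CopyOption... options) {
        throw new ReadOnlyFileSystemException();
    }
}
```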

Tests

  • Unit tests: DatasetPath parsing, provider scheme/read-only enforcement, factory URI matching
  • WireMock integration tests: full resolve → read flow with mocked Platform API, version selection, caching, auth headers
  • All existing nf-tower tests pass unchanged

Related

  • Supersedes seqeralabs/nextflow#19 (Channel.fromDataset)


DatasetFileSystemProvider: NIO SPI for 'dataset' scheme, read-only.
Delegates I/O to the resolved cloud path's provider. Write ops throw
ReadOnlyFileSystemException.

DatasetFileSystem: minimal read-only FileSystem implementation.

DatasetPath: Path wrapping dataset name + optional version. Parses
dataset://name?version=N URIs. Lazy resolution to backing cloud path.
Signed-off-by: Edmund Miller <edmund.miller@seqera.io>

Resolves dataset name → cloud storage Path via Platform API:
1. GET /datasets?workspaceId=X → match by name → dataset ID
2. GET /datasets/{id}/versions → latest or specific version → cloud URL
3. FileHelper.asPath(cloudUrl) → concrete S3/GCS/Azure Path

Uses java.net.http.HttpClient with Bearer token auth. Config from
existing tower.* settings (endpoint, accessToken, workspaceId).
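A minimal sketch of building the authenticated request with `java.net.http`, assuming a `/datasets?workspaceId=` endpoint shape as described above; the class name and exact path are illustrative, not the PR's code.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch of the authenticated Platform API request.
class PlatformRequests {
    static HttpRequest listDatasets(String endpoint, String accessToken, String workspaceId) {
        return HttpRequest.newBuilder()
                .uri(URI.create(endpoint + "/datasets?workspaceId=" + workspaceId))
                .header("Authorization", "Bearer " + accessToken)  // Bearer token auth
                .GET()
                .build();
    }
}
```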

Signed-off-by: Edmund Miller <edmund.miller@seqera.io>
DatasetPathFactory: FileSystemPathFactory extension point that
intercepts dataset:// URIs in parseUri(), making FileHelper.asPath()
and Nextflow.file() work transparently.

Register DatasetFileSystemProvider via META-INF/services SPI.
Add DatasetPathFactory to plugin extensionPoints in build.gradle.
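The SPI registration is a standard Java `ServiceLoader` provider-configuration file: a resource named after the service interface, containing the implementation's fully qualified class name. The package shown here is an assumption for illustration; the actual class name lives in the PR's source tree.

```
# src/main/resources/META-INF/services/java.nio.file.spi.FileSystemProvider
# (fully qualified implementation class name; package is hypothetical)
io.seqera.tower.plugin.DatasetFileSystemProvider
```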

DatasetPathTest: URI/string parsing, Path interface, equality
DatasetFileSystemProviderTest: scheme, FS creation, read-only enforcement
DatasetPathFactoryTest: URI matching, toUriString
DatasetResolverTest: WireMock API error cases, auth, workspace param

WireMock Platform API + local file:// as resolved storage. Tests:
- Full resolve → read file contents
- Specific version selection
- Latest version selection (picks highest)
- Provider newInputStream/readAttributes delegation
- Resolved path caching (API called once across multiple reads)
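The caching behavior tested above (API called once across multiple reads) amounts to memoizing the resolved path. A generic sketch, with hypothetical names, might look like this:

```java
import java.util.function.Supplier;

// Sketch of lazy one-shot resolution: the backing cloud path is
// resolved on first access and cached for all subsequent reads.
class Lazy<T> {
    private final Supplier<T> resolver;
    private volatile T value;

    Lazy(Supplier<T> resolver) {
        this.resolver = resolver;
    }

    synchronized T get() {
        if (value == null)
            value = resolver.get();   // hits the Platform API only once
        return value;
    }
}
```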

edmundmiller force-pushed the dataset-filesystem-provider branch from 7c93913 to 8e16e49 on February 25, 2026 at 22:52
@adamrtalbot
Collaborator

Neat idea but I dislike the name dataset://, which feels too generic. But I can't think of a better one 🤔.

This will remove the requirement for an ephemeral URI for the dataset when running the pipeline, and make it easier to treat datasets as a general data store.

@bentsherman
Member

I think it would be better to do seqera:// and add some qualifiers to the URI, for example:

seqera://<org>/<workspace>/datasets/<dataset>

Maybe some of these can be omitted. But the extra scoping would keep the URI open to future extensions, like accessing data links.
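The proposed scoped URI could be parsed along these lines. This is purely a sketch of the suggestion (not implemented in the PR); all names are hypothetical.

```java
import java.net.URI;

// Sketch of parsing seqera://<org>/<workspace>/<resource-type>/<name>.
class SeqeraUri {
    final String org, workspace, resourceType, resourceName;

    SeqeraUri(String org, String workspace, String type, String name) {
        this.org = org;
        this.workspace = workspace;
        this.resourceType = type;
        this.resourceName = name;
    }

    static SeqeraUri parse(String uri) {
        URI u = URI.create(uri);
        if (!"seqera".equals(u.getScheme()))
            throw new IllegalArgumentException("Not a seqera:// URI: " + uri);
        // authority = org; path = /<workspace>/<resource-type>/<name>
        String[] parts = u.getPath().substring(1).split("/", 3);
        return new SeqeraUri(u.getAuthority(), parts[0], parts[1], parts[2]);
    }
}
```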

@jordeu
Collaborator

jordeu commented Feb 27, 2026

I like Ben's suggestion; it allows future extensions.

Currently we have an experimental Fusion version that maps multiple Seqera resources to a path like this:

/fusion/seqera/
├── <organization>/
│   └── <workspace>/
│       ├── pipelines/                        # read-only
│       │   └── <name>.<id>.json
│       ├── runs/                             # read-only
│       │   ├── running/
│       │   ├── failed/
│       │   └── all/
│       │       └── <name>.<workflowId>/
│       │           ├── workflow.json
│       │           └── tasks/
│       │               └── <name>.<taskId>/
│       │                   ├── stdout.log
│       │                   └── stderr.log
│       ├── compute-envs/                     # read-only
│       │   └── <name>.<id>.json
│       ├── credentials/                      # read-only
│       │   └── <name>.<id>.json
│       ├── datasets/                         # read-only
│       │   └── <name>.<id>/
│       │       ├── metadata.json
│       │       ├── latest.csv
│       │       └── versions/
│       │           └── v<N>.csv
│       ├── datarepos/                        # read/write
│       │   └── <link-name>/
│       │       └── <cloud-storage-contents>...
│       └── studios/                          # read-only
│           └── <name>.<sessionId>/
│               ├── studio.json
│               └── checkpoints/
│                   └── <name>.<id>.json

I think that this new filesystem provider, even if it's only used for datasets now, can be extended to fetch other platform resources in the future.

pditommaso force-pushed the master branch 2 times, most recently from d9fa5cd to d752bc2 on February 28, 2026 at 13:10