
feat(nf-tower): add dataset:// FileSystem provider #6866

Draft

edmundmiller wants to merge 8 commits into nextflow-io:master from edmundmiller:dataset-filesystem-provider

Conversation

@edmundmiller
Member

Summary

NIO FileSystemProvider for dataset:// URIs in the nf-tower plugin. Resolves Seqera Platform dataset references to their backing cloud storage paths transparently — pipelines use file('dataset://my-samplesheet') with zero code changes.

Replaces the rejected Channel.fromDataset() approach (seqeralabs/nextflow#19) with a platform-agnostic file system abstraction.

Resolution flow

file('dataset://my-samplesheet')
  → FileHelper.asPath() → DatasetPathFactory.parseUri()
  → DatasetFileSystemProvider → DatasetPath (lazy)
  → DatasetResolver: GET /datasets → ID → GET /versions → cloud URL
  → FileHelper.asPath('s3://bucket/data.csv') → delegate to cloud FS
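The first step of this flow, splitting a `dataset://name?version=N` URI into a dataset name and optional version, can be sketched with `java.net.URI`. This is an illustrative sketch, not the PR's actual `DatasetPath` code; the class and field names here are hypothetical.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical sketch: parse dataset://name?version=N into its parts.
class DatasetUri {
    final String name;
    final Optional<Integer> version;

    DatasetUri(String name, Optional<Integer> version) {
        this.name = name;
        this.version = version;
    }

    static DatasetUri parse(String uri) {
        URI u = URI.create(uri);
        if (!"dataset".equals(u.getScheme()))
            throw new IllegalArgumentException("Not a dataset:// URI: " + uri);
        // For dataset://my-samplesheet the name lands in the authority part
        String name = u.getAuthority();
        Optional<Integer> version = Optional.empty();
        String query = u.getQuery();
        if (query != null && query.startsWith("version="))
            version = Optional.of(Integer.parseInt(query.substring("version=".length())));
        return new DatasetUri(name, version);
    }
}
```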

Components

| File | Purpose |
| --- | --- |
| `DatasetFileSystemProvider.java` | NIO SPI provider; delegates I/O to the resolved cloud path |
| `DatasetFileSystem.java` | Minimal read-only `FileSystem` |
| `DatasetPath.java` | `Path` implementation with lazy resolution; parses `dataset://name?version=N` |
| `DatasetResolver.groovy` | Platform API client (`HttpClient` + Bearer auth) |
| `DatasetPathFactory.groovy` | `FileSystemPathFactory` extension point |

Phase 1 scope

  • Read-only — write ops throw ReadOnlyFileSystemException
  • Works with file(), Channel.fromPath(), nf-schema samplesheetToList()
  • Config from existing tower.* settings (endpoint, accessToken, workspaceId)
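The read-only enforcement described above can be sketched as follows. This is not the PR's exact provider code, only an illustration of the pattern: mutating operations throw `ReadOnlyFileSystemException`, while reads are delegated to the resolved cloud path.

```java
import java.nio.file.CopyOption;
import java.nio.file.Path;
import java.nio.file.ReadOnlyFileSystemException;

// Illustrative sketch: every mutating operation is rejected.
class ReadOnlyOps {
    void createDirectory(Path dir) {
        throw new ReadOnlyFileSystemException();
    }
    void delete(Path path) {
        throw new ReadOnlyFileSystemException();
    }
    void copy(Path source, Path target, CopyOption... options) {
        throw new ReadOnlyFileSystemException();
    }
    void move(Path source, Path target, CopyOption... options) {
        throw new ReadOnlyFileSystemException();
    }
}
```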

Tests

  • Unit tests: DatasetPath parsing, provider scheme/read-only enforcement, factory URI matching
  • WireMock integration tests: full resolve → read flow with mocked Platform API, version selection, caching, auth headers
  • All existing nf-tower tests pass unchanged

Related

  • Supersedes seqeralabs/nextflow#19 (Channel.fromDataset)


DatasetFileSystemProvider: NIO SPI for 'dataset' scheme, read-only.
Delegates I/O to the resolved cloud path's provider. Write ops throw
ReadOnlyFileSystemException.

DatasetFileSystem: minimal read-only FileSystem implementation.

DatasetPath: Path wrapping dataset name + optional version. Parses
dataset://name?version=N URIs. Lazy resolution to backing cloud path.
Signed-off-by: Edmund Miller <edmund.miller@seqera.io>

Resolves dataset name → cloud storage Path via Platform API:
1. GET /datasets?workspaceId=X → match by name → dataset ID
2. GET /datasets/{id}/versions → latest or specific version → cloud URL
3. FileHelper.asPath(cloudUrl) → concrete S3/GCS/Azure Path

Uses java.net.http.HttpClient with Bearer token auth. Config from
existing tower.* settings (endpoint, accessToken, workspaceId).
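A minimal sketch of building the authenticated request with `java.net.http`, assuming a `/datasets?workspaceId=` endpoint shape as described above; the class name and exact path are illustrative, not the PR's code.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch of the authenticated Platform API request.
class PlatformRequests {
    static HttpRequest listDatasets(String endpoint, String accessToken, String workspaceId) {
        return HttpRequest.newBuilder()
                .uri(URI.create(endpoint + "/datasets?workspaceId=" + workspaceId))
                .header("Authorization", "Bearer " + accessToken)  // Bearer token auth
                .GET()
                .build();
    }
}
```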

Signed-off-by: Edmund Miller <edmund.miller@seqera.io>
DatasetPathFactory: FileSystemPathFactory extension point that
intercepts dataset:// URIs in parseUri(), making FileHelper.asPath()
and Nextflow.file() work transparently.

Register DatasetFileSystemProvider via META-INF/services SPI.
Add DatasetPathFactory to plugin extensionPoints in build.gradle.
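The SPI registration is a standard Java `ServiceLoader` provider-configuration file: a resource named after the service interface, containing the implementation's fully qualified class name. The package shown here is an assumption for illustration; the actual class name lives in the PR's source tree.

```
# src/main/resources/META-INF/services/java.nio.file.spi.FileSystemProvider
# (fully qualified implementation class name; package is hypothetical)
io.seqera.tower.plugin.DatasetFileSystemProvider
```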

DatasetPathTest: URI/string parsing, Path interface, equality
DatasetFileSystemProviderTest: scheme, FS creation, read-only enforcement
DatasetPathFactoryTest: URI matching, toUriString
DatasetResolverTest: WireMock API error cases, auth, workspace param

WireMock Platform API + local file:// as resolved storage. Tests:
- Full resolve → read file contents
- Specific version selection
- Latest version selection (picks highest)
- Provider newInputStream/readAttributes delegation
- Resolved path caching (API called once across multiple reads)
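The caching behavior tested above (API called once across multiple reads) amounts to memoizing the resolved path. A generic sketch, with hypothetical names, might look like this:

```java
import java.util.function.Supplier;

// Sketch of lazy one-shot resolution: the backing cloud path is
// resolved on first access and cached for all subsequent reads.
class Lazy<T> {
    private final Supplier<T> resolver;
    private volatile T value;

    Lazy(Supplier<T> resolver) {
        this.resolver = resolver;
    }

    synchronized T get() {
        if (value == null)
            value = resolver.get();   // hits the Platform API only once
        return value;
    }
}
```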

edmundmiller force-pushed the dataset-filesystem-provider branch from 7c93913 to 8e16e49 on February 25, 2026 at 22:52
@adamrtalbot
Collaborator

Neat idea but I dislike the name dataset://, which feels too generic. But I can't think of a better one 🤔.

This will remove the requirement for an ephemeral URI for the dataset when running the pipeline, and make it easier to treat datasets as a general data store.

@bentsherman
Member

I think it would be better to do seqera:// and add some qualifiers to the URI, for example:

seqera://<org>/<workspace>/datasets/<dataset>

Maybe some of these can be omitted. But the extra scoping would keep the URI open to future extensions, like accessing data links.
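The proposed scoped URI could be parsed along these lines. This is purely a sketch of the suggestion (not implemented in the PR); all names are hypothetical.

```java
import java.net.URI;

// Sketch of parsing seqera://<org>/<workspace>/<resource-type>/<name>.
class SeqeraUri {
    final String org, workspace, resourceType, resourceName;

    SeqeraUri(String org, String workspace, String type, String name) {
        this.org = org;
        this.workspace = workspace;
        this.resourceType = type;
        this.resourceName = name;
    }

    static SeqeraUri parse(String uri) {
        URI u = URI.create(uri);
        if (!"seqera".equals(u.getScheme()))
            throw new IllegalArgumentException("Not a seqera:// URI: " + uri);
        // authority = org; path = /<workspace>/<resource-type>/<name>
        String[] parts = u.getPath().substring(1).split("/", 3);
        return new SeqeraUri(u.getAuthority(), parts[0], parts[1], parts[2]);
    }
}
```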

@jordeu
Collaborator

jordeu commented Feb 27, 2026

I like Ben's suggestion; it allows future extensions.

Currently we have an experimental Fusion version that maps multiple Seqera resources to a path like this:

/fusion/seqera/
├── <organization>/
│   └── <workspace>/
│       ├── pipelines/                        # read-only
│       │   └── <name>.<id>.json
│       ├── runs/                             # read-only
│       │   ├── running/
│       │   ├── failed/
│       │   └── all/
│       │       └── <name>.<workflowId>/
│       │           ├── workflow.json
│       │           └── tasks/
│       │               └── <name>.<taskId>/
│       │                   ├── stdout.log
│       │                   └── stderr.log
│       ├── compute-envs/                     # read-only
│       │   └── <name>.<id>.json
│       ├── credentials/                      # read-only
│       │   └── <name>.<id>.json
│       ├── datasets/                         # read-only
│       │   └── <name>.<id>/
│       │       ├── metadata.json
│       │       ├── latest.csv
│       │       └── versions/
│       │           └── v<N>.csv
│       ├── datarepos/                        # read/write
│       │   └── <link-name>/
│       │       └── <cloud-storage-contents>...
│       └── studios/                          # read-only
│           └── <name>.<sessionId>/
│               ├── studio.json
│               └── checkpoints/
│                   └── <name>.<id>.json

I think that this new filesystem provider, even if it's only used for datasets now, can be extended to fetch other platform resources in the future.

pditommaso force-pushed the master branch 2 times, most recently from d9fa5cd to d752bc2 on February 28, 2026 at 13:10