Description
Context
The /index/{dcc}/{local_id} endpoint currently supports 4DN only. 4DN provides index files (.px2, .bai) as structured entries in its API response (extra_files array), which are stored in extra.extra_files on materialized file documents and served via the index endpoint.
For genomic visualization (e.g., tiling BAM alignments in a genome browser), clients need to fetch the index file first to determine byte ranges, then stream specific tiles from /data/{dcc}/{local_id} using byte-range requests.
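The range-request half of this flow can be sketched as below. This is a minimal illustration, not the service's client: `BASE_URL`, `range_header`, and `build_tile_request` are hypothetical names, and the offsets a real client uses would come from parsing the index file. Only the `/data/{dcc}/{local_id}` path is taken from the description above.

```python
import urllib.request

BASE_URL = "https://example.org"  # hypothetical host, not from this issue


def range_header(start: int, length: int) -> str:
    """Build an HTTP Range header value for `length` bytes starting at `start`.

    HTTP byte ranges are inclusive on both ends, so the last byte requested
    is start + length - 1.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    return f"bytes={start}-{start + length - 1}"


def build_tile_request(dcc: str, local_id: str, start: int, length: int) -> urllib.request.Request:
    """Prepare (but do not send) a ranged GET against the data endpoint."""
    url = f"{BASE_URL}/data/{dcc}/{local_id}"
    return urllib.request.Request(url, headers={"Range": range_header(start, length)})
```

Passing the prepared request to `urllib.request.urlopen` would send it; a server that honors the range responds with `206 Partial Content` and only the requested bytes.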
Investigation
ENCODE was investigated as a candidate for extending the index endpoint. Findings:
ENCODE does not provide genomic index files
- No `.bai` for BAM files: ENCODE releases BAM files without accompanying indexes. Users generate them locally with `samtools index`. See broadinstitute/gdr-ingest#8 for precedent.
- No `.tbi`/`.csi` for VCF/BED: no tabix or CSI indexes are provided.
- No `extra_files` equivalent: unlike 4DN's API, which returns an `extra_files` array with index file metadata, ENCODE file objects have no field for companion files.
- The `index_of` field is unrelated: ENCODE's `index_of` refers to FASTQ index reads (barcode sequences for demultiplexing), not `.bai`/`.tbi` genomic indexes. The ENCODE file schema defines it as linking `output_type: "index reads"` files to parent FASTQs.
Self-indexed formats already work
bigWig, bigBed, and hic files are self-indexed and support random access via HTTP Range requests on the existing /data/encode/{local_id} endpoint — no separate index endpoint needed.
| Format | Index Type | Provided by ENCODE? | Self-Indexed? |
|---|---|---|---|
| BAM | .bai | No | No |
| VCF.gz | .tbi | No | No |
| BED.gz | .tbi | No | No |
| bigWig | — | N/A | Yes |
| bigBed | — | N/A | Yes |
| hic | — | N/A | Yes |
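The table can be expressed as a small lookup, shown here as a sketch. The names `COMPANION_INDEX`, `needs_separate_index`, and `sniff_self_indexed` are hypothetical and not part of the existing service. The sniffing helper is an addition beyond the table: it relies on the published magic numbers for bigWig (`0x888FFC26`) and bigBed (`0x8789F2EB`), which may appear in either byte order, and on the `HIC` null-terminated magic string at the start of hic files.

```python
import struct
from typing import Optional

# Companion index extension per format, transcribed from the table above.
# None marks the self-indexed formats.
COMPANION_INDEX = {
    "bam": ".bai",
    "vcf.gz": ".tbi",
    "bed.gz": ".tbi",
    "bigwig": None,
    "bigbed": None,
    "hic": None,
}

BIGWIG_MAGIC = 0x888FFC26
BIGBED_MAGIC = 0x8789F2EB


def needs_separate_index(file_format: str) -> bool:
    """True if the format needs a companion index file for random access."""
    return COMPANION_INDEX[file_format.lower()] is not None


def sniff_self_indexed(header: bytes) -> Optional[str]:
    """Guess a self-indexed format from the first 4 bytes of a file."""
    if len(header) < 4:
        return None
    if header[:4] == b"HIC\x00":
        return "hic"
    # bigWig/bigBed writers may use either endianness, so try both.
    for byte_order in ("<", ">"):
        (magic,) = struct.unpack(byte_order + "I", header[:4])
        if magic == BIGWIG_MAGIC:
            return "bigwig"
        if magic == BIGBED_MAGIC:
            return "bigbed"
    return None
```

A check like this could let the `/index` endpoint return a clear "not applicable" response for self-indexed files rather than a generic not-found, though that behavior is a design suggestion, not something the issue specifies.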
Possible approaches
- Server-side index generation during sync: run `samtools index` on BAM files and `tabix` on VCF/BED files during ENCODE sync; store the generated index files (e.g., in object storage) along with their metadata and serve them via the `/index` endpoint. Adds compute time to sync and storage cost.
- On-demand index generation: generate the index on the first `/index/encode/{local_id}` request, cache it, and serve subsequent requests from cache. Adds significant latency on first access (the full BAM must be downloaded before it can be indexed). Users may need to request indexing explicitly via a new endpoint for files missing indexes.
- Proxy to cloud storage with convention-based URLs: if ENCODE ever co-locates `.bai` files alongside BAMs in S3 (`s3://encode-public/.../ENCFF*.bam.bai`), we could try fetching from the predicted URL. There is currently no evidence that such files exist.
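The on-demand and convention-based approaches can be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: the function names, the local-directory cache, and the assumption that the BAM has already been downloaded to `bam_path` are all hypothetical. The only external call is the `samtools index <in.bam> <out.index>` command-line form, which requires samtools on `PATH`.

```python
import subprocess
from pathlib import Path
from typing import Callable


def predicted_bai_url(bam_url: str) -> str:
    """Convention-based approach: guess the index URL by appending .bai."""
    return bam_url + ".bai"


def samtools_index(bam_path: Path, index_path: Path) -> None:
    """Build a BAM index with the samtools CLI (must be installed)."""
    subprocess.run(
        ["samtools", "index", str(bam_path), str(index_path)],
        check=True,  # raise CalledProcessError if samtools fails
    )


def get_or_build_index(
    bam_path: Path,
    cache_dir: Path,
    build: Callable[[Path, Path], None] = samtools_index,
) -> Path:
    """On-demand approach: return a cached index, building it on first use."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    index_path = cache_dir / (bam_path.name + ".bai")
    if not index_path.exists():
        # Write to a temporary name first so a failed or partial build
        # is never served as a valid index.
        tmp_path = index_path.with_name(index_path.name + ".tmp")
        build(bam_path, tmp_path)
        tmp_path.replace(index_path)
    return index_path
```

The `build` parameter is injectable so the caching logic can be exercised without samtools. The sketch does not handle concurrent first requests for the same file; a real service would likely need a lock or a job queue around the build step.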