ENCODE index file support for genomic visualization

## Context

The `/index/{dcc}/{local_id}` endpoint currently supports 4DN only. 4DN's API response lists index files (`.px2`, `.bai`) as structured entries in an `extra_files` array. These entries are stored in `extra.extra_files` on materialized file documents and served via the index endpoint.

For genomic visualization (e.g., tiling BAM alignments in a genome browser), clients need to fetch the index file first to determine byte ranges, then stream specific tiles from `/data/{dcc}/{local_id}` using byte-range requests.
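The two-step client flow described above can be sketched as follows. This is a minimal illustration, not the actual client: the host `https://example.org`, the local ID in the usage note, and the helper names are assumptions, and the requests are only constructed here, not sent.

```python
from urllib.request import Request

# Hypothetical host; the real service URL is not given in this issue.
BASE_URL = "https://example.org"


def index_request(dcc: str, local_id: str) -> Request:
    """Step 1: request the companion index file (e.g. a .bai)."""
    return Request(f"{BASE_URL}/index/{dcc}/{local_id}")


def tile_request(dcc: str, local_id: str, start: int, end: int) -> Request:
    """Step 2: request bytes start..end (inclusive) of the data file."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    req = Request(f"{BASE_URL}/data/{dcc}/{local_id}")
    # HTTP byte ranges are inclusive on both ends.
    req.add_header("Range", f"bytes={start}-{end}")
    return req
```

For example, `tile_request("4dn", "some-local-id", 0, 65535)` builds a request for the first 64 KiB. A server that honors the range replies with `206 Partial Content`.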

## Investigation

ENCODE was investigated as a candidate for extending the index endpoint. Findings:

### ENCODE does not provide genomic index files

- **No `.bai` for BAM files** — ENCODE releases BAM files without accompanying indexes. Users generate them locally with `samtools index`. See [broadinstitute/gdr-ingest#8](https://github.com/broadinstitute/gdr-ingest/issues/8) for precedent.
- **No `.tbi`/`.csi` for VCF/BED** — No tabix or CSI indexes are provided.
- **No `extra_files` equivalent** — Unlike 4DN's API which returns an `extra_files` array with index file metadata, ENCODE file objects have no field for companion files.
- **`index_of` field is unrelated** — ENCODE's `index_of` refers to FASTQ index reads (barcode sequences for demultiplexing), not `.bai`/`.tbi` genomic indexes. The [ENCODE file schema](https://www.encodeproject.org/profiles/file.json) defines it as linking `output_type: "index reads"` files to parent FASTQs.

### Self-indexed formats already work

bigWig, bigBed, and hic files are self-indexed and support random access via HTTP Range requests on the existing `/data/encode/{local_id}` endpoint — no separate index endpoint needed.

| Format | Index Type | Provided by ENCODE? | Self-Indexed? |
|--------|-----------|---------------------|---------------|
| BAM | .bai | No | No |
| VCF.gz | .tbi | No | No |
| BED.gz | .tbi | No | No |
| bigWig | — | N/A | Yes |
| bigBed | — | N/A | Yes |
| hic | — | N/A | Yes |
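As a rough illustration of why self-indexed formats need no separate index, a client can identify the format from the first few bytes of a Range response and then follow the offsets stored in the file's own header. The sketch below only covers the identification step. The function name is hypothetical, and the magic numbers are the bigWig/bigBed signatures as I understand them from the UCSC format definitions; verify them against the format specifications before relying on them.

```python
import struct

BIGWIG_MAGIC = 0x888FFC26  # assumed bigWig signature
BIGBED_MAGIC = 0x8789F2EB  # assumed bigBed signature


def sniff_format(head: bytes) -> str:
    """Guess the file format from the first 4 bytes of a file."""
    if len(head) < 4:
        raise ValueError("need at least 4 bytes")
    if head[:4] == b"HIC\x00":
        return "hic"
    if head[:2] == b"\x1f\x8b":
        # gzip/BGZF container (BAM, VCF.gz, BED.gz): needs an external index.
        return "gzip/bgzf"
    # bigWig/bigBed may be written in either byte order, so try both.
    for byte_order in ("<I", ">I"):
        (magic,) = struct.unpack(byte_order, head[:4])
        if magic == BIGWIG_MAGIC:
            return "bigWig"
        if magic == BIGBED_MAGIC:
            return "bigBed"
    return "unknown"
```

In practice `head` would be the body of a `Range: bytes=0-3` request against `/data/encode/{local_id}`.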

## Possible approaches

1. **Server-side index generation during sync** — Run `samtools index` on BAM files and `tabix` on VCF/BED files during ENCODE sync; store the generated index files (e.g., in object storage) along with their metadata and serve them via the `/index` endpoint. Adds compute time to sync and storage cost.

2. **On-demand index generation** — Generate the index on first `/index/encode/{local_id}` request, cache it, and serve subsequent requests from cache. Adds significant latency on first access (must download full BAM to index it). Users may need to request indexing explicitly via a new endpoint for files missing indexes.

3. **Proxy to cloud storage with convention-based URLs** — If ENCODE ever co-locates `.bai` files alongside BAMs in S3 (`s3://encode-public/.../ENCFF*.bam.bai`), we could try fetching from the predicted URL. Currently no evidence this exists.
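Approaches 1 and 2 share the same core step: pick the right indexer for the file type and skip the work if an index already exists. A minimal sketch of that step is below. The function names are hypothetical, the file is assumed to be available locally, and `tabix` additionally assumes the input is BGZF-compressed and coordinate-sorted, which has not been verified for ENCODE files.

```python
import subprocess
from pathlib import Path


def index_command(path: str) -> list:
    """Return the argv that builds an index next to the given file."""
    name = path.lower()
    if name.endswith(".bam"):
        return ["samtools", "index", path]    # writes <path>.bai
    if name.endswith(".vcf.gz"):
        return ["tabix", "-p", "vcf", path]   # writes <path>.tbi
    if name.endswith(".bed.gz"):
        return ["tabix", "-p", "bed", path]   # writes <path>.tbi
    raise ValueError(f"no indexer known for {path}")


def index_path(path: str) -> Path:
    """Default output location used by samtools index / tabix."""
    suffix = ".bai" if path.lower().endswith(".bam") else ".tbi"
    return Path(path + suffix)


def ensure_index(path: str, run=subprocess.run) -> Path:
    """Build the index only if it is missing, then return its path."""
    idx = index_path(path)
    if not idx.exists():
        run(index_command(path), check=True)
    return idx
```

During sync (approach 1) `ensure_index` would run once per downloaded file. For on-demand generation (approach 2) it would run on the first `/index/encode/{local_id}` request, with the existence check acting as the cache lookup.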


