ENCODE index file support for genomic visualization #1

@conradbzura

Context

The /index/{dcc}/{local_id} endpoint currently supports 4DN only. The 4DN API returns index files (.px2, .bai) as structured entries in an extra_files array. These entries are stored under extra.extra_files on materialized file documents and served via the index endpoint.

For genomic visualization (e.g., tiling BAM alignments in a genome browser), clients need to fetch the index file first to determine byte ranges, then stream specific tiles from /data/{dcc}/{local_id} using byte-range requests.
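The two-step flow above can be sketched roughly as follows. This is only an illustration, not the service's client: `BASE_URL` is a placeholder host, the helper names are hypothetical, and the byte offsets would in practice come from parsing the index file, which is not shown here.

```python
import urllib.request

BASE_URL = "https://example.org"  # placeholder host (assumption, not the real service)


def range_header(start: int, end: int) -> str:
    """Build an HTTP Range header value for the inclusive byte span [start, end]."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return f"bytes={start}-{end}"


def build_requests(dcc: str, local_id: str, start: int, end: int):
    """Prepare the index request and the byte-range data request (hypothetical helper).

    Nothing is sent here; the caller decides when to open the requests.
    """
    index_req = urllib.request.Request(f"{BASE_URL}/index/{dcc}/{local_id}")
    data_req = urllib.request.Request(
        f"{BASE_URL}/data/{dcc}/{local_id}",
        headers={"Range": range_header(start, end)},
    )
    return index_req, data_req
```

Each request can then be passed to `urllib.request.urlopen`. A server that honors the Range header is expected to answer the data request with `206 Partial Content` and only the requested bytes.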

Investigation

ENCODE was investigated as a candidate for extending the index endpoint. Findings:

ENCODE does not provide genomic index files

  • No .bai for BAM files — ENCODE releases BAM files without accompanying indexes. Users generate them locally with samtools index. See broadinstitute/gdr-ingest#8 for precedent.
  • No .tbi/.csi for VCF/BED — No tabix or CSI indexes are provided.
  • No extra_files equivalent — Unlike 4DN's API which returns an extra_files array with index file metadata, ENCODE file objects have no field for companion files.
  • index_of field is unrelated — ENCODE's index_of refers to FASTQ index reads (barcode sequences for demultiplexing), not .bai/.tbi genomic indexes. The ENCODE file schema defines it as linking output_type: "index reads" files to parent FASTQs.

Self-indexed formats already work

bigWig, bigBed, and hic files are self-indexed and support random access via HTTP Range requests on the existing /data/encode/{local_id} endpoint — no separate index endpoint needed.

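| Format | Index Type | Provided by ENCODE? | Self-Indexed? |
|--------|------------|---------------------|---------------|
| BAM    | .bai       | No                  | No            |
| VCF.gz | .tbi       | No                  | No            |
| BED.gz | .tbi       | No                  | No            |
| bigWig | N/A        | N/A                 | Yes           |
| bigBed | N/A        | N/A                 | Yes           |
| hic    | N/A        | N/A                 | Yes           |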
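As a rough illustration of how the table could drive routing, the sketch below maps a file name to the external index suffix it would need, or to `None` for self-indexed formats. The function name, the suffix sets (including the `.bw` and `.bb` short forms), and the sample file names are assumptions made for this example and are not taken from the ENCODE schema.

```python
# Formats that need a separate index file, keyed by file-name suffix.
EXTERNAL_INDEX = {
    ".bam": ".bai",
    ".vcf.gz": ".tbi",
    ".bed.gz": ".tbi",
}

# Self-indexed formats; the short suffixes are assumed spellings.
SELF_INDEXED = (".bigwig", ".bw", ".bigbed", ".bb", ".hic")


def index_requirement(filename: str):
    """Return the index suffix a file needs, or None if it is self-indexed.

    Raises ValueError for formats the table does not cover.
    """
    name = filename.lower()
    for suffix, index_suffix in EXTERNAL_INDEX.items():
        if name.endswith(suffix):
            return index_suffix
    if name.endswith(SELF_INDEXED):
        return None
    raise ValueError(f"unrecognized format: {filename!r}")
```

A `None` result means the client can go straight to /data/encode/{local_id} with Range requests. Any other result means an index would have to come from somewhere other than ENCODE.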

Possible approaches

  1. Server-side index generation during sync — Run samtools index on BAM files and tabix on VCF/BED files during ENCODE sync; store the generated index files (e.g., in object storage) along with their metadata and serve them via the /index endpoint. Adds compute time to sync and storage cost.

  2. On-demand index generation — Generate the index on first /index/encode/{local_id} request, cache it, and serve subsequent requests from cache. Adds significant latency on first access (must download full BAM to index it). Users may need to request indexing explicitly via a new endpoint for files missing indexes.

  3. Proxy to cloud storage with convention-based URLs — If ENCODE ever co-locates .bai files alongside BAMs in S3 (s3://encode-public/.../ENCFF*.bam.bai), we could try fetching from the predicted URL. Currently no evidence this exists.
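Approaches 1 and 2 share the same core step: run the indexing tool once and reuse the result. A minimal sketch of that step follows, assuming `samtools` and `tabix` are on the PATH and the file is already on local disk. The function names and the injectable `run` and `exists` parameters are hypothetical and exist only to make the example testable. Download, object storage, and metadata updates are left out.

```python
import os
import subprocess


def index_command(path: str) -> tuple[list[str], str]:
    """Return (argv, expected index path) for a file that needs an external index."""
    name = path.lower()
    if name.endswith(".bam"):
        # `samtools index x.bam` writes x.bam.bai next to the input by default.
        return ["samtools", "index", path], path + ".bai"
    if name.endswith(".vcf.gz"):
        # tabix expects bgzip-compressed, coordinate-sorted input.
        return ["tabix", "-p", "vcf", path], path + ".tbi"
    if name.endswith(".bed.gz"):
        return ["tabix", "-p", "bed", path], path + ".tbi"
    raise ValueError(f"no external index tool known for {path!r}")


def get_or_build_index(path: str, run=subprocess.run, exists=os.path.exists) -> str:
    """Build the index only if it is missing, then return its path."""
    argv, index_path = index_command(path)
    if not exists(index_path):
        run(argv, check=True)  # raises CalledProcessError if the tool fails
    return index_path
```

Calling this during sync corresponds to approach 1. Calling it inside the /index/encode/{local_id} handler on a cache miss corresponds to approach 2, where the first caller pays the download and indexing cost.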
