Streaming response delivery with deferred cache write (TeeBody) #248
Description
Problem
When a cacheable response is received from upstream on a cache miss, the current flow blocks the client until the entire response body is buffered, stored in cache, and then reconstructed:
```
upstream responds (Passthrough body)
  → response predicates (headers/status checked, body untouched)
  → into_cached(): body.collect().await  ← BLOCKS until full body received
  → backend.set()                        ← serializes + compresses + writes
  → from_cached()                        ← reconstructs response from cached bytes
  → client receives response             ← LAST in line
```
For large responses (file downloads, big API payloads), the client waits for the entire body to be buffered and cached before receiving the first byte. This defeats HTTP chunked transfer and adds unnecessary latency.
Proposed Solution
Introduce a TeeBody mechanism that streams the response to the client immediately while writing fixed-size indexed chunks to cache in the background.
New flow
```
upstream responds (Passthrough body)
  → response predicates (headers/status checked, body untouched)
  → wrap body in TeeBody
  → return response to client immediately  ← CLIENT FIRST
  → as client consumes frames, TeeBody copies each into fixed-size chunks
  → each completed chunk is written to cache with its index
  → on body completion, finalize the chunk sequence
```
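The copy-into-fixed-size-chunks step can be sketched independently of the HTTP body machinery. This is an illustrative sketch only: `Chunker`, `push_frame`, and `finalize` are hypothetical names, and a real `TeeBody` would perform this work inside its `poll_frame()` implementation.

```rust
/// Accumulates arbitrarily sized body frames into fixed-size chunks.
struct Chunker {
    chunk_size: usize,
    current: Vec<u8>,
    completed: Vec<Vec<u8>>, // chunk i would be stored under key:i
}

impl Chunker {
    fn new(chunk_size: usize) -> Self {
        Self { chunk_size, current: Vec::with_capacity(chunk_size), completed: Vec::new() }
    }

    /// Copy one data frame; every chunk that fills up is moved to `completed`.
    fn push_frame(&mut self, mut frame: &[u8]) {
        while !frame.is_empty() {
            let room = self.chunk_size - self.current.len();
            let take = room.min(frame.len());
            self.current.extend_from_slice(&frame[..take]);
            frame = &frame[take..];
            if self.current.len() == self.chunk_size {
                self.completed.push(std::mem::take(&mut self.current));
            }
        }
    }

    /// On body completion, flush the trailing partial chunk.
    fn finalize(mut self) -> Vec<Vec<u8>> {
        if !self.current.is_empty() {
            self.completed.push(self.current);
        }
        self.completed
    }
}

fn main() {
    let mut chunker = Chunker::new(4);
    // Frames arrive with arbitrary sizes, as from an upstream body.
    let frames: [&[u8]; 3] = [b"hel", b"lo wor", b"ld"];
    for frame in frames {
        chunker.push_frame(frame);
    }
    let chunks = chunker.finalize();
    // 11 bytes with chunk_size 4 -> chunks of 4, 4, 3 bytes.
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[0].as_slice(), b"hell");
    assert_eq!(chunks[2].as_slice(), b"rld");
}
```

Note that chunk boundaries are independent of frame boundaries, which is what makes the later byte-offset arithmetic deterministic.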
Chunked indexed storage
Instead of buffering the full body and writing it as a single blob, the body is split into fixed-size indexed chunks and stored separately from response metadata:
- Metadata (headers, status) is serialized and stored as a single cache entry
- Body chunks are stored as raw bytes with sequential indices (e.g. `key:meta`, `key:0`, `key:1`, …)
- On cache hit, metadata is deserialized and body chunks are read back in order and streamed to the client
This approach has several advantages:
- Streaming without backend streaming support — even if the backend (e.g. Moka, Redis) only supports simple get/set, indexed chunks simulate streaming writes and reads. No changes to the `Backend` trait are required.
- HTTP Range request support — since the body is stored as indexed fixed-size chunks, serving byte-range requests from cache becomes straightforward: the chunk index maps directly to byte offsets.
- Pre-allocated chunk pool — a pool of reusable fixed-size buffers can be used to reduce allocation overhead during streaming. Chunks are borrowed from the pool, filled, written to cache, and returned to the pool.
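To make the Range point concrete: mapping an inclusive byte range onto chunk indices is plain integer arithmetic. A sketch, where `chunk_span` is a hypothetical helper, not part of hitbox:

```rust
/// For an inclusive byte range, returns (first chunk index, last chunk index,
/// byte offset to skip within the first chunk). Assumes a fixed chunk size.
fn chunk_span(start: u64, end: u64, chunk_size: u64) -> (u64, u64, u64) {
    (start / chunk_size, end / chunk_size, start % chunk_size)
}

fn main() {
    const CHUNK: u64 = 64 * 1024; // 64 KiB
    // Range: bytes=100000-200000
    let (first, last, offset) = chunk_span(100_000, 200_000, CHUNK);
    assert_eq!((first, last), (1, 3)); // only chunks 1..=3 need to be fetched
    assert_eq!(offset, 100_000 - 65_536); // skip 34464 bytes of chunk 1
}
```

Only the chunks that overlap the requested range are fetched from the backend; the rest of the body is never read.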
Custom serialization: meta and body separated
Currently into_cached() serializes the entire response (headers + body) together. The new approach splits this:
- Meta serialization: only headers and status are serialized (e.g. via the existing `CachedValue` mechanism)
- Body storage: raw bytes, split into fixed-size chunks — no serialization needed. Chunks can optionally be compressed (gzip), though the compression strategy needs further consideration.
- Deserialization: read meta entry, then stream body chunks back in index order
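The split layout can be illustrated with a plain get/set map standing in for the backend. The `store`/`load` helpers and the key scheme here are hypothetical; real code would go through the `Backend` trait:

```rust
use std::collections::HashMap;

// In-memory stand-in for a get/set-only backend (e.g. Moka, Redis).
type Backend = HashMap<String, Vec<u8>>;

/// Writes the meta entry plus one entry per fixed-size body chunk.
fn store(backend: &mut Backend, base: &str, meta: &[u8], body: &[u8], chunk_size: usize) {
    backend.insert(format!("{base}:meta"), meta.to_vec());
    for (i, chunk) in body.chunks(chunk_size).enumerate() {
        backend.insert(format!("{base}:{i}"), chunk.to_vec());
    }
}

/// Reads the meta entry, then body chunks in index order until one is missing.
fn load(backend: &Backend, base: &str) -> Option<(Vec<u8>, Vec<u8>)> {
    let meta = backend.get(&format!("{base}:meta"))?.clone();
    let mut body = Vec::new();
    let mut i = 0;
    while let Some(chunk) = backend.get(&format!("{base}:{i}")) {
        body.extend_from_slice(chunk);
        i += 1;
    }
    Some((meta, body))
}

fn main() {
    let mut backend = Backend::new();
    store(&mut backend, "GET:/file", b"serialized meta", b"0123456789", 4);
    // Keys written: "GET:/file:meta", "GET:/file:0", "GET:/file:1", "GET:/file:2"
    assert!(backend.contains_key("GET:/file:2"));
    let (meta, body) = load(&backend, "GET:/file").unwrap();
    assert_eq!(meta, b"serialized meta");
    assert_eq!(body, b"0123456789");
}
```

One caveat this sketch glosses over: "read until a key is missing" cannot distinguish a complete body from a truncated one, so the finalized meta entry would need to record the chunk count (or total length).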
Touch points
1. New `TeeBody` type (`hitbox-http/src/body.rs`)
- Wraps a `BufferedBody<B>`, forwards `poll_frame()` to the client
- Copies each data frame into fixed-size chunks internally
- On each chunk completion, signals it for background cache write
- On body completion, signals the final chunk
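The "signal completed chunks for background write" part could be structured around a channel, with a writer on the other end standing in for the offloaded cache-write task. All names here (`ChunkMsg`, the thread-based writer) are hypothetical; the real implementation would use the async offload infrastructure rather than a thread:

```rust
use std::sync::mpsc;
use std::thread;

/// Messages from the TeeBody side to the background cache writer.
enum ChunkMsg {
    Chunk(usize, Vec<u8>), // (index, data) -> backend.set(key:index, data)
    Done(usize),           // total chunk count: finalize the cache entry
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Background writer: stands in for the offloaded cache-write task.
    let writer = thread::spawn(move || {
        let mut written = 0usize;
        for msg in rx {
            match msg {
                // A real writer would call backend.set() here.
                ChunkMsg::Chunk(_index, _data) => written += 1,
                ChunkMsg::Done(total) => {
                    assert_eq!(written, total); // all chunks arrived: finalize
                    break;
                }
            }
        }
        written
    });

    // TeeBody side: each completed chunk is signalled while the client
    // keeps receiving frames without waiting on cache writes.
    for (i, chunk) in [b"aaaa".to_vec(), b"bb".to_vec()].into_iter().enumerate() {
        tx.send(ChunkMsg::Chunk(i, chunk)).unwrap();
    }
    tx.send(ChunkMsg::Done(2)).unwrap();
    assert_eq!(writer.join().unwrap(), 2);
}
```

The key property is that sending on the channel never blocks the client-facing stream; backpressure policy (if the writer falls behind) is a design decision left open here.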
2. Modified `into_cached()` path (`hitbox-http/src/response.rs:410-454`)
- Currently calls `body.collect().await`, which blocks until the full body is received
- Instead: return early with a `TeeBody` that streams to the client, plus a handle for background chunk writes
3. Modified FSM transition (`hitbox/src/fsm/states.rs:456-504`)
- Currently: `backend.set()` → `from_cached()` → return response
- New: store metadata immediately, return the response with a `TeeBody`, and spawn a background task (via the existing offload infrastructure) that writes chunks as they complete
4. Modified from_cached() path
- Currently: deserializes entire response from a single cached blob
- New: deserializes metadata, then streams body chunks from cache in index order
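The cache-hit read can stream chunks lazily in index order instead of assembling the whole body first. A sketch over an in-memory map (`ChunkReader` is a hypothetical name; the real path would poll the backend asynchronously):

```rust
use std::collections::HashMap;

/// Yields body chunks in index order, stopping at the first missing key,
/// without ever buffering the full body.
struct ChunkReader<'a> {
    backend: &'a HashMap<String, Vec<u8>>,
    base: &'a str,
    next: usize,
}

impl<'a> Iterator for ChunkReader<'a> {
    type Item = &'a [u8];
    fn next(&mut self) -> Option<Self::Item> {
        let chunk = self.backend.get(&format!("{}:{}", self.base, self.next))?;
        self.next += 1;
        Some(chunk.as_slice())
    }
}

fn main() {
    let mut backend = HashMap::new();
    backend.insert("k:meta".to_string(), b"200 OK".to_vec());
    backend.insert("k:0".to_string(), b"hello ".to_vec());
    backend.insert("k:1".to_string(), b"world".to_vec());

    // Each yielded chunk would become one response body frame.
    let body: Vec<u8> = ChunkReader { backend: &backend, base: "k", next: 0 }
        .flat_map(|c| c.iter().copied())
        .collect();
    assert_eq!(body, b"hello world");
}
```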
Existing infrastructure to reuse
The offload mechanism used by stale-while-revalidate (`OffloadRevalidate` in `hitbox/src/fsm/future.rs:286-321`) already implements a "serve now, update cache later" pattern. The chunk writing can use the same `offload.register()` infrastructure.
Edge Cases
| Scenario | Behavior |
|---|---|
| Body error mid-stream | Client gets a truncated response (unavoidable). Partial chunks are discarded — cache entry is not finalized. |
| Backend write failure | Client already received the full response. Silent cache miss — log the error, next request re-fetches. |
| Concurrent duplicate misses | Both requests tee independently. Second write overwrites first (idempotent, acceptable). |
| Memory | With a pre-allocated chunk pool, memory usage is bounded by pool size. Without a pool, peak memory is comparable to today but spread across smaller allocations. |
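The bounded-memory claim in the table follows directly from a fixed pool: peak usage is `pool_size * chunk_size` regardless of response size. A minimal sketch (`ChunkPool` and its methods are hypothetical names):

```rust
/// Pre-allocated pool of fixed-capacity buffers. Buffers are borrowed,
/// filled, written to cache, and returned for reuse.
struct ChunkPool {
    free: Vec<Vec<u8>>,
    chunk_size: usize,
}

impl ChunkPool {
    fn new(pool_size: usize, chunk_size: usize) -> Self {
        Self {
            free: (0..pool_size).map(|_| Vec::with_capacity(chunk_size)).collect(),
            chunk_size,
        }
    }

    /// None when the pool is exhausted: the caller must either wait
    /// (backpressure) or give up on caching this response.
    fn borrow_buf(&mut self) -> Option<Vec<u8>> {
        self.free.pop()
    }

    fn give_back(&mut self, mut buf: Vec<u8>) {
        buf.clear(); // keeps capacity, so reuse never reallocates
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = ChunkPool::new(2, 4);
    assert_eq!(pool.chunk_size, 4);
    let a = pool.borrow_buf().unwrap();
    let _b = pool.borrow_buf().unwrap();
    assert!(pool.borrow_buf().is_none()); // memory bound reached
    pool.give_back(a);
    assert!(pool.borrow_buf().is_some()); // reuse after return
}
```

What to do on exhaustion (block, drop the cache write, or grow) ties into the "chunk pool sizing" open question below.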
Open Questions
- Chunk compression: should individual chunks be compressed, or should compression be left to the backend? Per-chunk gzip adds CPU overhead but reduces storage. Needs benchmarking.
- Chunk size: what's the optimal fixed chunk size? Likely configurable, with a sensible default (e.g. 64 KiB or 128 KiB).
- Chunk pool sizing: how many pre-allocated chunks to keep in the pool? Should be tunable based on expected concurrency.
Non-goals (follow-up work)
- Unbounded stream detection (SSE, WebSocket) — these should be handled by response predicates marking them as non-cacheable, so the body stays `Passthrough` and streams through without the tee. Separate concern.
Affected crates
- `hitbox-http` (`TeeBody` type, `into_cached()`/`from_cached()` modifications, chunk pool)
- `hitbox` (FSM state transition, offload integration)
- `hitbox-backend` (potentially, if the `Backend` trait needs helper methods for indexed chunk operations)