Support for KV-cache loading? #8

@priontific

Description

It's not clear to me from looking at the code if this library supports the following pattern:

```shell
mlx_lm.cache_prompt --prompt 'Here are 100 examples of how to produce a desired output: {examples}'
# ... cachedprompt.safetensors saved to cwd
```

```python
prompt_template = "Now produce an output from this sentence: {sentence}"
prompts_raw = [prompt_template.format(sentence=sentence) for sentence in sentences]
response = batch_generate(kv_cache_file="cachedprompt.safetensors", prompts=prompts_raw)
# ... batched generations created
```

Is this something the library can, or could, do? I'm interested in providing multi-shot examples without incurring huge prompt-processing times from re-encoding the same pre-prompt on every request.
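For reference, here is a minimal, self-contained sketch of the pattern I mean. The names (`PromptCache`, `cache_prompt`, `batch_generate`) and the toy tokenizer are hypothetical stand-ins, not this library's actual API: the point is only that the shared prefix is processed once, and each batched prompt pays only for its suffix.

```python
# Conceptual sketch (hypothetical API, not mlx_lm's): a "prompt cache"
# stores the processed state of a shared prefix so that only each
# per-request suffix needs processing.

class PromptCache:
    """Holds the processed state of a shared prompt prefix."""
    def __init__(self, prefix_tokens):
        self.prefix_tokens = list(prefix_tokens)  # stands in for saved KV state

def encode(text):
    # Toy tokenizer: one "token" per word. Tracks call count so we can
    # verify the prefix is only ever encoded once.
    encode.calls += 1
    return text.split()
encode.calls = 0

def cache_prompt(prefix):
    """Analogue of caching the pre-prompt once up front."""
    return PromptCache(encode(prefix))

def batch_generate(prompt_cache, prompts):
    """Hypothetical batched generation reusing the cached prefix."""
    outputs = []
    for p in prompts:
        suffix_tokens = encode(p)            # only the suffix is processed
        full = prompt_cache.prefix_tokens + suffix_tokens
        outputs.append(len(full))            # stand-in for a real generation
    return outputs

cache = cache_prompt("shared few-shot examples here")
results = batch_generate(cache, ["sentence one", "sentence two", "sentence three"])
# The 4-token prefix was encoded once; the 3 prompts encoded only their suffixes.
assert encode.calls == 4
```

With a real KV cache the saved state would be the transformer's key/value tensors (e.g. the `.safetensors` file above) rather than token lists, but the compute saving is the same shape: prefix work amortized across the whole batch.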
