v0.1.7
This release introduces the following changes:
- **Added Instruction Templates**: Instruction templates have been added to the model definition. You can now explicitly provide an `instruction_template` in `LlamaCppModel` or `ExllamaModel`, which helps generate more accurate prompts in the chat completion endpoint.
- **Streaming Response Timeout**: Streaming responses now time out automatically if the next chunk is not received within 30 seconds.
- **Semaphore Bug Fix**: Fixed a bug where a semaphore was not properly released after being acquired.
- **Auto Truncate in Model Definition**: If `auto_truncate` is set to `True` in the model definition, past prompts are automatically truncated to fit within the context window, preventing errors. The default is `True`.
- **Automatic RoPE Parameter Adjustment**: If explicit settings for `rope_freq_base` and `rope_freq_scale` (llama.cpp) or `alpha_value` and `compress_pos_emb` (exllama) are not provided, the RoPE frequency base and scaling factor are adjusted automatically. This default behavior assumes a Llama 2 model with a trained context length of 4096 tokens.
- **Dynamic Model Definition Parsing**: Model definitions are primarily configured in `model_definitions.py`. However, parsing is now also attempted for Python script files in the root directory containing the words 'model' and 'def', and for environment variables containing 'model' and 'def'. This also applies to `openai_replacement_models`. For example, you can set environment variables as shown below:
```bash
#!/bin/bash
export MODEL_DEFINITIONS='{
  "gptq": {
    "type": "exllama",
    "model_path": "TheBloke/MythoMax-L2-13B-GPTQ",
    "max_total_tokens": 4096
  },
  "ggml": {
    "type": "llama.cpp",
    "model_path": "TheBloke/Airoboros-L2-13B-2.1-GGUF",
    "max_total_tokens": 8192,
    "rope_freq_base": 26000,
    "rope_freq_scale": 0.5,
    "n_gpu_layers": 50
  }
}'
export OPENAI_REPLACEMENT_MODELS='{
  "gpt-3.5-turbo": "ggml",
  "gpt-3.5-turbo-0613": "ggml",
  "gpt-3.5-turbo-16k": "ggml",
  "gpt-3.5-turbo-16k-0613": "ggml",
  "gpt-3.5-turbo-0301": "ggml",
  "gpt-4": "gptq",
  "gpt-4-32k": "gptq",
  "gpt-4-0613": "gptq",
  "gpt-4-32k-0613": "gptq",
  "gpt-4-0301": "gptq"
}'
echo "MODEL_DEFINITIONS: $MODEL_DEFINITIONS"
echo "OPENAI_REPLACEMENT_MODELS: $OPENAI_REPLACEMENT_MODELS"
```
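The same definitions can also live in `model_definitions.py`. A minimal sketch mirroring the JSON above as plain dicts (the variable names here follow the release notes; the actual file may instead construct model classes such as `LlamaCppModel` or `ExllamaModel`):

```python
# model_definitions.py -- hypothetical sketch; mirrors the JSON
# environment variables above using plain Python dicts.
model_definitions = {
    "gptq": {
        "type": "exllama",
        "model_path": "TheBloke/MythoMax-L2-13B-GPTQ",
        "max_total_tokens": 4096,
    },
    "ggml": {
        "type": "llama.cpp",
        "model_path": "TheBloke/Airoboros-L2-13B-2.1-GGUF",
        "max_total_tokens": 8192,
        "rope_freq_base": 26000,
        "rope_freq_scale": 0.5,
        "n_gpu_layers": 50,
    },
}

# Map OpenAI model names onto the local definitions above.
openai_replacement_models = {
    "gpt-3.5-turbo": "ggml",
    "gpt-4": "gptq",
}
```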
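The automatic RoPE adjustment described above is not spelled out in these notes. A common NTK-aware heuristic (an assumption here, not necessarily this project's exact formula) grows the frequency base by the ratio of the requested context to the 4096-token trained length:

```python
def auto_rope_params(max_total_tokens: int,
                     trained_tokens: int = 4096,
                     base: float = 10000.0,
                     head_dim: int = 128) -> tuple[float, float]:
    """Hypothetical NTK-aware adjustment: enlarge rope_freq_base when
    the requested context exceeds the trained context, and leave
    rope_freq_scale at 1.0.  Defaults reflect Llama 2 (4096 tokens,
    base 10000, head dimension 128); the real implementation may differ."""
    if max_total_tokens <= trained_tokens:
        return base, 1.0  # within trained context: no adjustment needed
    alpha = max_total_tokens / trained_tokens  # e.g. 8192 / 4096 = 2
    # NTK-aware base scaling: base' = base * alpha ** (dim / (dim - 2))
    return base * alpha ** (head_dim / (head_dim - 2)), 1.0
```

For the 8192-token `ggml` entry above this heuristic yields a base in the low twenty-thousands, the same ballpark as the explicit `rope_freq_base: 26000` shown in the example.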