
Argument out of range exception when running any prompt through DeepSeek-R1-Distill-Llama-8B-Q8_0 #1053

Open
wased89 opened this issue Jan 21, 2025 · 4 comments

Comments

@wased89

wased89 commented Jan 21, 2025

Description

The model being used is https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF/blob/main/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf.
I tried several different prompts. The same code works flawlessly with regular Llama models, but with this model it immediately throws an exception that seems related to the chat template.

Stack trace:
   at System.ThrowHelper.ThrowArgumentOutOfRangeException()
   at System.MemoryExtensions.AsSpan[T](T[] array, Int32 start, Int32 length)
   at LLama.LLamaTemplate.Apply()
   at LLama.Transformers.PromptTemplateTransformer.ToModelPrompt(LLamaTemplate template)
   at LLama.Transformers.PromptTemplateTransformer.HistoryToText(ChatHistory history)
   at LLama.ChatSession.d43.MoveNext()
   at LLama.ChatSession.d43.System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult(Int16 token)
   at AISpeechChatApp.ChatModelServer.d__13.MoveNext() in ..... (the rest is my name and computer info)

Code where the breakpoint is triggered:

string output = "";

// breakpoint on the following line
await foreach (var text in session.ChatAsync(new ChatHistory.Message(AuthorRole.User, transcribedMessage), inferenceParameters))
{
    Console.Write(text);
    output += text;
}
Console.WriteLine();

AddToPrompt(false, output);
My model settings:

// initialize llm
var modelParameters = new ModelParams(modelPrePath + modelPath)
{
    ContextSize = 8096,
    GpuLayerCount = layercount // for 8b model
    //GpuLayerCount = 18 // for 70b model
};

model = null;
model = LLamaWeights.LoadFromFile(modelParameters);
context = model.CreateContext(modelParameters);
executor = new InteractiveExecutor(context);

if (Directory.Exists("Assets/chathistory"))
{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine("Loading session from disk.");
    Console.ForegroundColor = ConsoleColor.White;

    session = new ChatSession(executor);
}

// initialize the inference parameters
inferenceParameters = new InferenceParams
{
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.8f
    },

    MaxTokens = -1, // keep generating tokens until an anti-prompt is encountered
    AntiPrompts = new List<string> { model.Tokens.EndOfTurnToken!, "<|im_end|>" } // stop generation once anti-prompts appear
};

// set system prompt
chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "You are Alex, an AI assistant tasked with helping the user with their project coded in C#. Answer any question they have and follow them through their ramblings about the project at hand.");

// set up session
session = new ChatSession(executor, chatHistory);

session.WithHistoryTransform(new PromptTemplateTransformer(model, withAssistant: true));

session.WithOutputTransform(new LLamaTransforms.KeywordTextOutputStreamTransform(
    new string[] { model.Tokens.EndOfTurnToken!, "�" },
    redundancyLength: 5));
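
Since the stack trace points at LLamaTemplate.Apply() inside PromptTemplateTransformer, a possible workaround (a sketch only, assuming the exception really comes from applying the model's chat template rather than from the session itself) is to drop the template transformer and let the built-in default history transform build the prompt as plain text:

// Workaround sketch only: swap PromptTemplateTransformer for the built-in
// DefaultHistoryTransform, which formats turns as plain "Role: text" lines
// instead of applying the GGUF chat template (the call that throws above).
session = new ChatSession(executor, chatHistory);
session.WithHistoryTransform(new LLamaTransforms.DefaultHistoryTransform());
session.WithOutputTransform(new LLamaTransforms.KeywordTextOutputStreamTransform(
    new string[] { model.Tokens.EndOfTurnToken!, "�" },
    redundancyLength: 5));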

@phil-scott-78
Contributor

All the DeepSeek support was added to llama.cpp within the past week, and my understanding is that the llama.cpp build bundled here predates it. I tried the 0.20 release myself just to see what would happen, and I'm getting the output below, which jibes with the changes I've seen land in llama.cpp around this.

Please input your model.gguf path (or ENTER for default): (b:\models\phi-4-Q6_K.gguf): b:\models\DeepSeek-R1-Distill-Qwen-14B-Q6_K_L.gguf
[llama Info]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
[llama Info]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[llama Info]: ggml_cuda_init: found 1 CUDA devices:
[llama Info]:   Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes
[llama Info]: llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4080 SUPER) - 15035 MiB free
[llama Info]: llama_model_loader: loaded meta data with 30 key-value pairs and 579 tensors from b:\models\DeepSeek-R1-Distill-Qwen-14B-Q6_K_L.gguf (version GGUF V3 (latest))
[llama Info]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[llama Info]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
[llama Info]: llama_model_loader: - kv   1:                               general.type str              = model
[llama Info]: llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 14B
[llama Info]: llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
[llama Info]: llama_model_loader: - kv   4:                         general.size_label str              = 14B
[llama Info]: llama_model_loader: - kv   5:                          qwen2.block_count u32              = 48
[llama Info]: llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
[llama Info]: llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
[llama Info]: llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 13824
[llama Info]: llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
[llama Info]: llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
[llama Info]: llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
[llama Info]: llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
[llama Info]: llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
[llama Info]: llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
[llama Info]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
[llama Info]: llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[llama Info]: llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[llama Info]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
[llama Info]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
[llama Info]: llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
[llama Info]: llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
[llama Info]: llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
[llama Info]: llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de... 
[llama Info]: llama_model_loader: - kv  24:               general.quantization_version u32              = 2
[llama Info]: llama_model_loader: - kv  25:                          general.file_type u32              = 18
[llama Info]: llama_model_loader: - kv  26:                      quantize.imatrix.file str              = /models_out/DeepSeek-R1-Distill-Qwen-... 
[llama Info]: llama_model_loader: - kv  27:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt     
[llama Info]: llama_model_loader: - kv  28:             quantize.imatrix.entries_count i32              = 336
[llama Info]: llama_model_loader: - kv  29:              quantize.imatrix.chunks_count i32              = 128
[llama Info]: llama_model_loader: - type  f32:  241 tensors
[llama Info]: llama_model_loader: - type q8_0:    2 tensors
[llama Info]: llama_model_loader: - type q6_K:  336 tensors
[llama Error]: llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'deepseek-r1-qwen'
[llama Error]: llama_load_model_from_file: failed to load model
Unhandled exception. LLama.Exceptions.LoadWeightsFailedException: Failed to load model 'b:\models\DeepSeek-R1-Distill-Qwen-14B-Q6_K_L.gguf'
   at LLama.Native.SafeLlamaModelHandle.LoadFromFile(String modelPath, LLamaModelParams lparams) in B:\llama-src\LLamaSharp\LLama\Native\SafeLlamaModelHandle.cs:line 142
   at LLama.LLamaWeights.<>c__DisplayClass21_1.<LoadFromFileAsync>b__1() in B:\llama-src\LLamaSharp\LLama\LLamaWeights.cs:line 123
   at System.Threading.Tasks.Task`1.InnerInvoke()
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at LLama.LLamaWeights.LoadFromFileAsync(IModelParams params, CancellationToken token, IProgress`1 progressReporter) in B:\llama-src\LLamaSharp\LLama\LLamaWeights.cs:line 118
   at LLama.Examples.Examples.StatelessModeExecute.Run() in B:\llama-src\LLamaSharp\LLama.Examples\Examples\StatelessModeExecute.cs:line 17        
   at ExampleRunner.Run() in B:\llama-src\LLamaSharp\LLama.Examples\ExampleRunner.cs:line 58
   at Program.<Main>$(String[] args) in B:\llama-src\LLamaSharp\LLama.Examples\Program.cs:line 38
   at Program.<Main>(String[] args)

@AgentSmithers

I second this. I loaded up the same model and received the "unknown pre-tokenizer type" error. I assume we're just waiting for them to update it on their end.

@martindevans
Member

Yeah that looks like an issue with an outdated llama.cpp version. The 0.20 update required huge changes to the binary loading system, so by the time that was done and released we were already 3 weeks out of date! I'm already working on the next update :)

@vltmedia

vltmedia commented Jan 31, 2025

I got it working on my end with the current version of LLamaSharp and the cuda12 backend, following the base chat tutorial from the documentation; I just changed the model path and it worked.

Check my implementation here.

I tested the Q2 and Q8 versions from the Unsloth Hugging Face repo.
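
For reference, that base chat tutorial is roughly the following (a minimal sketch modelled on the LLamaSharp documentation example; the model path, GpuLayerCount, and prompt text are placeholders, not values confirmed in this thread):

// Minimal chat sketch following the LLamaSharp documentation example.
// Model path and GpuLayerCount below are placeholders.
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;
using LLama.Sampling;

var parameters = new ModelParams(@"models/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 32
};
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "You are a helpful assistant.");
var session = new ChatSession(executor, chatHistory);

var inferenceParams = new InferenceParams
{
    SamplingPipeline = new DefaultSamplingPipeline { Temperature = 0.8f },
    MaxTokens = 256,
    AntiPrompts = new List<string> { model.Tokens.EndOfTurnToken! }
};

// Single round-trip: read one user message and stream the reply to the console.
Console.Write("User> ");
var input = Console.ReadLine() ?? "";
await foreach (var text in session.ChatAsync(new ChatHistory.Message(AuthorRole.User, input), inferenceParams))
{
    Console.Write(text);
}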
