forked from ggerganov/llama.cpp
sync master #27
Merged
Conversation
* gguf-py : add T5ENCODER model architecture
* common : call llama_decode() during warmup only if the model has a decoder
* convert-hf : add T5EncoderModel
* llama : add llama_model_has_decoder() API function
* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()
* llama : add support for LLM_ARCH_T5ENCODER
* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE
* llama-embedding : add support for encoder-only models

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
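The new llama_model_has_decoder() function is what lets warmup skip llama_decode() for encoder-only models. A minimal sketch of that warmup logic, assuming the current llama.h names (including the existing llama_model_has_encoder(), llama_encode(), and llama_kv_cache_clear(), which may differ across versions):

```cpp
#include "llama.h"

// Minimal warmup sketch: run only the graphs the model actually has.
// Assumes `model`, `ctx`, and a prepared `batch` already exist.
static void warmup(llama_model * model, llama_context * ctx, llama_batch & batch) {
    if (llama_model_has_encoder(model)) {
        llama_encode(ctx, batch);   // encoder-only models (e.g. T5ENCODER) stop here
    }
    if (llama_model_has_decoder(model)) {
        llama_decode(ctx, batch);   // skipped entirely for encoder-only models
    }
    llama_kv_cache_clear(ctx);      // discard the warmup state
}
```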
* set default n_swa (sliding window attention size) for Phi-3
* fix
* double check swa
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (ggerganov#8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

  - Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
  - ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers seems to be sufficient, looking at the code, which either launches compute kernels or copies tensors.

* Fix small typo

Co-authored-by: 0cc4m <picard12@live.de>
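To make the GPU-side cost concrete, here is a hedged Vulkan sketch (an illustration, not the backend's actual code) contrasting a catch-all barrier with one scoped to compute-shader and transfer access, which is what the commit describes:

```cpp
VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;

// Before: a full pipeline sync - every stage waits on every stage.
barrier.srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_MEMORY_WRITE_BIT;
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    0, 1, &barrier, 0, nullptr, 0, nullptr);

// After: only shader reads/writes and transfers are ordered, so other
// pipeline stages are free to overlap with the synchronized work.
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_TRANSFER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT  | VK_ACCESS_TRANSFER_READ_BIT;
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT,
    0, 1, &barrier, 0, nullptr, 0, nullptr);
```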
… (ggerganov#8956)

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Neo Zhang <>
* gguf-py : Numpy dequantization for most types
* gguf-py : Numpy dequantization for grid-based i-quants
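For readers unfamiliar with the quant formats, a hedged C++ sketch of the simplest case (Q8_0: blocks of 32 int8 weights sharing one fp16 scale) shows what "dequantization" computes; the gguf-py change implements the same math for most types in vectorized Numpy rather than element-wise loops:

```cpp
#include <cstdint>
#include "ggml.h"   // for ggml_fp16_t / ggml_fp16_to_fp32

// Illustrative Q8_0 layout: 32 quantized weights plus one shared scale.
struct block_q8_0 {
    ggml_fp16_t d;      // per-block fp16 scale
    int8_t      qs[32]; // quantized weights
};

// Dequantize nb blocks: y[i] = scale * quantized_weight[i].
static void dequantize_q8_0(const block_q8_0 * x, float * y, int64_t nb) {
    for (int64_t ib = 0; ib < nb; ib++) {
        const float d = ggml_fp16_to_fp32(x[ib].d);
        for (int j = 0; j < 32; j++) {
            y[ib*32 + j] = d * x[ib].qs[j];
        }
    }
}
```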
* py : fix requirements check '==' -> '~=' (the PEP 440 compatible-release operator, which accepts patch updates instead of pinning one exact version)
* cont : fix the fix
* ci : run on all requirements.txt
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724

In order to access the above bug you need to log in using one of the emails listed in https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5

Signed-off-by: David Korczynski <david@adalogics.com>
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680

Signed-off-by: David Korczynski <david@adalogics.com>
* readme : introduce GPUStack

  GPUStack is an open-source GPU cluster manager for running large language models, which uses llama.cpp as the backend.

* readme : introduce gguf-parser

  GGUF Parser is a tool to review/check a GGUF file and estimate its memory usage without downloading the whole model.

Signed-off-by: thxCode <thxcode0824@gmail.com>
llama : model-based max number of graph nodes calculation (ggerganov#8970)

* llama : model-based max number of graph nodes calculation
* Update src/llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>
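A hedged sketch of the idea: grow the graph-node budget with the model's tensor count instead of using one fixed constant. The floor and multiplier below are illustrative assumptions, not the exact values from the PR:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Model-based node budget: larger models (more tensors) get more graph
// nodes, while small models keep a sane minimum.
static int32_t max_graph_nodes(size_t n_tensors) {
    return std::max<int32_t>(8192, (int32_t) (n_tensors * 5));
}
```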
Signed-off-by: Diogo Teles Sant'Anna <diogoteles@google.com>
* ggml : move rope type enum to ggml.h

  This commit moves the `llama_rope_type` enum from `llama.h` to `ggml.h` and renames it to `ggml_rope_type`. The motivation for this change is to address the TODO in `llama.h` and use the enum in ggml.

  Note: this commit does not change the `mode` parameter to be of type `enum ggml_rope_type`. The name `mode` and its usage suggest that it might be more generic and possibly used as a bit field for multiple flags. Further investigation/discussion may be needed to determine whether `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

  This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from ggml.h, and adds back the llama_rope_type enum. I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is safe to remove yet.

* squash! ggml : move rope type enum to ggml.h

  This commit removes the enum ggml_rope_type from ggml.h and replaces it with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to check whether the mode is set to GPT-NeoX. The enum llama_rope_type has also been updated to reflect this change.

* squash! ggml : move rope type enum to ggml.h

  This commit contains a suggestion to enable the GGML_ROPE_TYPE_NEOX macro/define to be passed to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

  This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

  Update the comment for the ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

  This reverts commit 6261222.

* squash! ggml : move rope type enum to ggml.h

  Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

Co-authored-by: slaren <slarengh@gmail.com>
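The end state is a bit-flag define rather than an exclusive enum. A short sketch of how that reads in practice (the value matches current ggml.h; treat the surrounding code as illustrative):

```cpp
// In ggml.h: the RoPE variant is a bit flag on the existing `mode` int.
#define GGML_ROPE_TYPE_NEOX 2

// In a rope kernel: the variant is tested with a bitwise AND, which is
// why `mode` stays a plain int bit field instead of an enum type.
const bool is_neox = mode & GGML_ROPE_TYPE_NEOX;
```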
* server : fix segfault on long system prompt
* server : fix parallel generation with very small batch sizes
* server : fix typo in comment
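A hedged sketch of the chunking idea behind fixes like the segfault one: a prompt longer than n_batch has to be decoded in n_batch-sized slices rather than as one oversized batch. Names and the two-argument llama_batch_get_one() helper are assumptions from recent llama.h, not the server's exact code:

```cpp
// Decode `tokens[0..n_prompt)` in slices of at most n_batch tokens.
for (int32_t i = 0; i < n_prompt; i += n_batch) {
    const int32_t n_eval = std::min(n_batch, n_prompt - i);
    llama_batch batch = llama_batch_get_one(tokens + i, n_eval);
    if (llama_decode(ctx, batch) != 0) {
        break;  // decoding failed; real code would report the error
    }
}
```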
* Optimize Vulkan REPEAT performance
* Use the Vulkan GLSL fused multiply-add instruction where possible
* Add GGML_VULKAN_PERF option to output performance data per operator
* Rework and fix Vulkan descriptor set and descriptor pool handling
* Fix float32 concat f16 shader validation error
* Add Vulkan GROUP_NORM eps parameter
* Fix validation error with transfer queue memory barrier flags
* Remove trailing whitespaces
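For context on the descriptor pool rework, here is the generic Vulkan pattern for creating a pool and recycling all of its sets in one call (an illustration with assumed sizes, not the backend's actual code):

```cpp
VkDescriptorPoolSize pool_size = { VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1024 };

VkDescriptorPoolCreateInfo pool_info = {};
pool_info.sType         = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
pool_info.maxSets       = 256;   // illustrative sizes
pool_info.poolSizeCount = 1;
pool_info.pPoolSizes    = &pool_size;

VkDescriptorPool pool;
vkCreateDescriptorPool(device, &pool_info, nullptr, &pool);

// ... allocate sets with vkAllocateDescriptorSets and record work ...

// Recycle every set at once instead of freeing them individually.
vkResetDescriptorPool(device, pool, 0);
```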
* retrieval
* Reuse the query batch to reduce frequent memory allocation
* delete unused whitespace
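A hedged sketch of the query-batch reuse: allocate one llama_batch up front, then clear and refill it per query instead of allocating a fresh batch each time. common_batch_clear()/common_batch_add() are the helpers in llama.cpp's common library (named llama_batch_clear()/llama_batch_add() in older trees):

```cpp
llama_batch batch = llama_batch_init(n_batch, 0, 1);  // allocate once, up front

for (const std::vector<llama_token> & query : queries) {
    common_batch_clear(batch);  // reset the token count, keep the storage
    for (size_t i = 0; i < query.size(); i++) {
        common_batch_add(batch, query[i], (llama_pos) i, { 0 }, true);
    }
    llama_decode(ctx, batch);
}

llama_batch_free(batch);  // free once at the end
```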
github-actions bot added the documentation (Improvements or additions to documentation) label on Aug 15, 2024
github-actions bot added the examples, SYCL, Nvidia GPU, Vulkan, devops, python, server, and ggml labels on Aug 15, 2024