Run 13B Alpaca Llama Failed #515
PeterFujiyu asked this question in Q&A (Unanswered)
Replies: 1 comment
-
The problem is obvious: what you are offloading to the GPU exceeds the maximum available VRAM. The log points this out explicitly.
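For reference, the log shows 13930.00 MiB allocated on Metal (a 10148.86 MiB weight buffer plus a 3200.00 MiB KV cache and the compute buffers) against a recommendedMaxWorkingSetSize of 10922.67 MiB, which is why every Metal command buffer fails with status 5. A minimal sketch of a workaround, assuming the stock llama.cpp CLI flags (the contents of chat.sh are not shown, so this calls ./main directly):
'''
# Sketch only: standard llama.cpp flags; adapt to whatever chat.sh actually forwards.
MODEL=models/chinese-alpaca-2-13b-16k-hf/ggml-model-q6_k.gguf

# Offload fewer layers (e.g. 28 of 41); the rest stay on the CPU, keeping the
# Metal weight + KV buffers under the ~10922 MiB recommended working-set limit.
./main -m "$MODEL" --n-gpu-layers 28 -c 4096 \
    -p '作为一个 AI 助手,你可以帮助人类做哪些事情?'

# If more headroom is needed, a smaller context (-c) also shrinks the 3200 MiB KV cache.
'''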
-
Here is the runtime log. After I send a prompt, it keeps printing ggml_metal_graph_compute: command buffer 0 failed with status 5.
The basic device information is in the log. I am using the latest llama.cpp and the latest model.
'''
$ ./chat.sh models/chinese-alpaca-2-13b-16k-hf/ggml-model-q6_k.gguf '作为一个 AI 助手,你可以帮助人类做哪些事情?'
Log start
main: build = 1 (d83fdc2)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.3.0
main: seed = 1706339166
llama_model_loader: loaded meta data with 22 key-value pairs and 363 tensors from models/chinese-alpaca-2-13b-16k-hf/ggml-model-q6_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = models
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.scaling.type str = linear
llama_model_loader: - kv 11: llama.rope.scaling.factor f32 = 4.000000
llama_model_loader: - kv 12: general.file_type u32 = 18
llama_model_loader: - kv 13: tokenizer.ggml.model str = llama
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,55296] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,55296] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,55296] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q6_K: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 889/55296 vs 259/55296 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 55296
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 13.25 B
llm_load_print_meta: model size = 10.13 GiB (6.56 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.28 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 8192.00 MiB, offs = 0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 2178.36 MiB, offs = 8357675008, (10370.42 / 10922.67)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Metal buffer size = 10148.86 MiB
llm_load_tensors: CPU buffer size = 221.48 MiB
.................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.25
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/peter/Documents/Peter_Code/GitHub/llama/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 3200.00 MiB, (13571.98 / 10922.67)
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_kv_cache_init: Metal KV buffer size = 3200.00 MiB
llama_new_context_with_model: KV self size = 3200.00 MiB, K (f16): 1600.00 MiB, V (f16): 1600.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, (13572.00 / 10922.67)
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 358.02 MiB, (13930.00 / 10922.67)
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model: graph splits (measure): 3
llama_new_context_with_model: Metal compute buffer size = 358.00 MiB
llama_new_context_with_model: CPU compute buffer size = 18.00 MiB
ggml_metal_graph_compute: command buffer 0 failed with status 5
system_info: n_threads = 8 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
Input prefix with BOS
Input prefix: ' [INST] '
Input suffix: ' [/INST]'
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.900, min_p = 0.050, typical_p = 1.000, temp = 0.500
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
[INST] <<SYS>>
You are a helpful assistant. 你是一个乐于助人的助手。请你提供专业、有逻辑、内容真实、有价值的详细回复。
<</SYS>>
作为一个 AI 助手,你可以帮助人类做哪些事情? [/INST]ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
用了ggml_metal_graph_compute: command buffer 0 failed with status 5
《ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
'''