Add Granite Vision Support #11794
base: master
Conversation
```diff
     fout.add_uint32(k(KEY_ATTENTION_HEAD_COUNT, VISION), v_hparams["num_attention_heads"])
     fout.add_float32(k(KEY_ATTENTION_LAYERNORM_EPS, VISION), v_hparams["layer_norm_eps"])
-    block_count = v_hparams["num_hidden_layers"] - 1 if has_llava_projector else v_hparams["num_hidden_layers"]
+    block_count = v_hparams["num_hidden_layers"]
```
This is the part I'm not quite sure about with the current code; it seems like the intention behind this and the way the layer loop in clip.cpp (here) is written is to go up to the second to last layer, which is consistent with the default feature layer -2 in transformers and a lot of llava next models, but it seems like this actually results in -3, unless I am misunderstanding something.

More concretely, say v_hparams["num_hidden_layers"] is 24 here; if there's a llava projector, we set the block count to 23 and pop the last layer from the model a bit further down. Then, in clip.cpp, the encoder layer loop goes up to n_layer - 1, but since that value is already decremented here, this causes it to go up through x < 22, i.e., taking the output of block 21 (-3).

Since this PR needs to be able to take features from the last layer, I've updated it to not drop the last layer and to iterate up to the max vision feature layer that is set, or the second to last if none is set - but please feel free to let me know if this seems incorrect!
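For concreteness, a tiny standalone walkthrough of the arithmetic described above (illustrative variable names, not the actual conversion script or clip.cpp code):

```cpp
// Hypothetical walkthrough of the indexing described above.
#include <cstdio>

int main() {
    const int  num_hidden_layers   = 24;   // v_hparams["num_hidden_layers"]
    const bool has_llava_projector = true;

    // conversion script: drop one layer when a llava projector is present
    const int n_layer = has_llava_projector ? num_hidden_layers - 1   // 23
                                            : num_hidden_layers;

    int last_block_used = -1;
    // encoder loop in clip.cpp iterates while il < n_layer - 1, i.e. il < 22
    for (int il = 0; il < n_layer - 1; il++) {
        last_block_used = il;
    }

    // prints 21: the output taken is that of block 21, which is layer -3 from
    // the end of the original 24-layer encoder rather than the intended -2
    printf("last block used: %d\n", last_block_used);
    return 0;
}
```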
The code you mentioned is pretty much spaghetti; you can refer to #11292, where I refactored that part.

> Since this PR needs to be able to take features from the last layer

Funny enough, someone deleted the comment that explained how to get the last layer. You can see it in this version: https://github.com/ggerganov/llama.cpp/blob/d408bb9268a988c5a60a5746d3a6430386e7604d/examples/llava/clip.cpp#L734

It would be nice if you could bring back the comment (or better, replace it with a fuller explanation).
Awesome, thanks a lot for the context and guidance @ngxson! That looks great - it is exciting to see things potentially moving away from the surgery scripts, with all of the vision hparams being handled the same as everything else by the gguf writer 🎉

I pulled the logic for determining which layer to iterate up to out into a separate function and added some explanation here, and rewrote the relevant parts of this PR to avoid changing the feature layer used for existing models 🙂
examples/llava/clip.cpp (outdated)

```diff
@@ -444,8 +445,9 @@ struct clip_hparams {

     char mm_patch_merge_type[32] = "flat"; // spatial_unpad or flat (default)

-    int32_t image_grid_pinpoints[32];
+    int32_t image_grid_pinpoints[64];
```
I know you mentioned this fixed size was potentially problematic. Can you add a comment explaining the size choice and limitations?
Yup - I moved this + max feature layers to be const and added a comment up there.
This value is the capacity for the flattened any-res gridpoints, i.e., 32 values / 2 -> currently 16 ordered pairs. Granite vision models have more than this - e.g., the preview model has 26 pairs - so I increased the size to prevent the extra gridpoints from being dropped!
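As a small illustration of that sizing (the constant name follows the diff above; MAX_PAIRS is my own addition for clarity):

```cpp
// Sketch only: the grid pinpoints are (height, width) pairs stored flattened, so the
// array capacity in int32_t values must be at least 2x the number of pairs.
#include <cstdint>

constexpr int MAX_IMAGE_GRID_PINPOINTS = 64;            // capacity in flattened values
constexpr int MAX_PAIRS = MAX_IMAGE_GRID_PINPOINTS / 2; // 32 (h, w) pairs
// The previous capacity of 32 values allowed only 16 pairs; the granite vision
// preview model ships 26 pairs (52 values), which would have been truncated.
int32_t image_grid_pinpoints[MAX_IMAGE_GRID_PINPOINTS];
```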
Hi @ngxson and @gabe-l-hart, thanks for the early thoughts - I think this PR should be cleaned up and ready for another look when you get the chance!
Hi @ngxson @ggerganov, is it possible to get a quick check on your priority for review of this PR? We're getting ready to officially launch the Granite Vision models (the preview model is already out here) and would love to get a sense of timing so we can plan how to support it in several projects that depend on llama.cpp.
ggml/src/ggml-cpu/ggml-cpu.c (outdated)

```diff
-    GGML_ASSERT(i01 >= 0 && i01 < ne01);
+    // Copying this out for a bit while investigating due to issues like:
+    // https://github.com/ggerganov/llama.cpp/issues/10157
+    // GGML_ASSERT(i01 >= 0 && i01 < ne01);
```
Is this still needed?
Hi @ggerganov, thank you very much for your thoughts!
For now, on CPU it is - commenting it out does run and give coherent outputs on CPU, but I think that there is a bug here for siglip, which I am guessing is an issue with how visual encoders without the CLS embedding are handled, since the models that I have seen raise this issue appear to use siglip as the visual encoder.
I think I am getting close to the actual fix - I'll follow up (with hopefully a correct fix and this assert reenabled) by the end of today!
Opened a separate small PR to fix the underlying cause here: #11982. Will re-enable this assertion on this PR now.
Generally we don't spend much time reviewing the …
@alex-jw-brooks I wanted to try this out and made an attempt to convert the model, but I'm not sure if I made a mistake in one of the steps there. If you have time to take a look at the steps in the linked document, that would be great - let me know if you spot something I'm doing differently compared to when you converted your model.
Awesome, thank you very much @danbev! I have some similar local notes for converting the model that I used for my initial conversion to test with - I'll compare them and send some thoughts 😄 Could you please clarify what issue you are running into on your branch? Is it mostly that the model is slow for inference, or did it crash / produce garbage?
The introduction of the MAX_* pattern in this PR really makes my review harder. It would be nice if we could get rid of that.
examples/llava/clip.cpp (outdated)

```diff
@@ -1443,15 +1476,35 @@ struct clip_ctx * clip_model_load(const char * fname, const int verbosity = 1) {
     int idx = get_key_idx(ctx, KEY_IMAGE_GRID_PINPOINTS);
     int n = gguf_get_arr_n(ctx, idx);
     const int32_t * pinpoints = (const int32_t *)gguf_get_arr_data(ctx, idx);
-    for (int i = 0; i < 32 && i < n && pinpoints[i] != 0; ++i) {
+    for (int i = 0; i < MAX_IMAGE_GRID_PINPOINTS && i < n && pinpoints[i] != 0; ++i) {
```
I don't get why we need MAX_IMAGE_GRID_PINPOINTS. Can't image_grid_pinpoints be a std::vector?
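A minimal sketch of what that could look like, reusing the gguf helpers from the hunk above; treating image_grid_pinpoints as a std::vector<int32_t> member is the assumption here, not the PR's current code:

```cpp
// Sketch only: size the storage from the GGUF metadata instead of capping it with
// MAX_IMAGE_GRID_PINPOINTS, so no pinpoints are ever silently dropped.
std::vector<int32_t> image_grid_pinpoints; // hypothetical hparams member

int idx = get_key_idx(ctx, KEY_IMAGE_GRID_PINPOINTS);
int n   = gguf_get_arr_n(ctx, idx);
const int32_t * pinpoints = (const int32_t *) gguf_get_arr_data(ctx, idx);
image_grid_pinpoints.assign(pinpoints, pinpoints + n);
```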
examples/llava/clip.cpp (outdated)

```diff
     }

+    // If we set explicit vision feature layers, only go up to the deepest one
+    for (int i = 0; i < MAX_IMAGE_FEATURE_LAYERS && (hparams.vision_feature_layer[i] > 0); i++) {
```
Tbh I don't really like this MAX_* pattern; the code of clip.cpp is currently quite fragile and adding this will make it even more fragile. Why can't we use std::vector for these arrays?
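For the feature layers, a rough sketch of the same idea, assuming hparams.vision_feature_layer becomes a std::vector<int32_t> (the member type is an assumption, not the PR's code):

```cpp
// Sketch only: with a vector there is no MAX_IMAGE_FEATURE_LAYERS sentinel scan;
// the deepest requested layer is simply the maximum element, if any were set.
#include <algorithm>
#include <cstdint>
#include <vector>

int deepest_feature_layer(const std::vector<int32_t> & vision_feature_layer) {
    if (vision_feature_layer.empty()) {
        return -1; // fall back to the projector-specific default
    }
    return *std::max_element(vision_feature_layer.begin(), vision_feature_layer.end());
}
```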
examples/llava/clip.cpp (outdated)

```diff
+    std::vector<struct ggml_tensor *> embedding_stack;
+    // Check to see if we have 1+ set vision feature layers set; otherwise it's determined
+    // by the type of projector that this model has (usually last or second to last layer).
+    int max_feature_layer = get_deepest_feature_layer(ctx);
```
Small nit, but max_feature_layer is never changed for a given model, so I think it should be set when loading the model instead (i.e. on clip_model).
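A small sketch of that suggestion (the struct and field names here are illustrative rather than the actual clip.cpp layout):

```cpp
// Sketch only: compute the deepest feature layer once at load time and cache it on
// the model object, so the graph-build path just reads a stored value.
struct clip_model_sketch {
    // ... other model fields ...
    int max_feature_layer = -1; // filled in once during model load
};

// during loading, after the hparams have been read:
//     model.max_feature_layer = get_deepest_feature_layer(...);
// during graph build:
//     const int max_feature_layer = model.max_feature_layer;
```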
examples/llava/clip.cpp (outdated)

```diff
+    if (embedding_stack.size() > 0) {
+        embeddings = embedding_stack.at(0);
+        for (unsigned long i=1; i < embedding_stack.size(); i++) {
+            embeddings = ggml_concat(ctx0, embeddings, embedding_stack.at(i), 0);
```
Using ggml_concat on each embedding is fine for now, but please be aware that it may use a lot more memory.

Another way to do this is to manually create a result tensor (with dimensions that can hold all the embeddings), then use ggml_view_* with the appropriate offsets.

But this is a minor optimization and can probably be done later if needed.
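A rough sketch of that alternative, assuming each entry in embedding_stack has shape [n_embd, n_tokens] and that ctx0 and gf are the graph-build context and cgraph from the surrounding code (untested, for illustration only):

```cpp
// Sketch only: allocate one destination tensor big enough for all feature layers and
// copy each embedding into a view at its offset, instead of chaining ggml_concat,
// which materializes a new intermediate tensor per concatenation.
const int64_t n_embd   = embedding_stack.at(0)->ne[0];
const int64_t n_tokens = embedding_stack.at(0)->ne[1];
const int64_t n_stack  = (int64_t) embedding_stack.size();

// concatenation along dim 0, matching the ggml_concat(..., 0) call in the diff above
struct ggml_tensor * stacked = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd * n_stack, n_tokens);
for (int64_t i = 0; i < n_stack; i++) {
    // view over the slice of `stacked` that the i-th embedding should occupy
    struct ggml_tensor * slot = ggml_view_2d(ctx0, stacked, n_embd, n_tokens,
                                             stacked->nb[1],
                                             i * n_embd * ggml_element_size(stacked));
    ggml_build_forward_expand(gf, ggml_cpy(ctx0, embedding_stack.at(i), slot));
}
embeddings = stacked;
```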
Makes sense - I thought this felt pretty efficient! Thank you for the guidance - I'm happy to submit a follow-up PR making this change in the near future 🙂
Btw, to fix the CI, you should rebase to the latest master.
Cool, thanks @ngxson - I agree with you that the pattern for the MAX_* gridpoints / feature layers is really weird, and the current code will also drop anything over the max, which is not great. I will happily rewrite it to use std::vector.
Sorry for not being clear on that. It was me being impatient and not letting it run long enough (I was using a debug build, which seems to be slower on my machine at least). This was the output for an image of the Apollo moon landing:

```
./build/bin/llama-llava-cli -m models/granite-vision-3.1-2b-Q4_K.gguf --mmproj vit/mmproj-model-f16.gguf --image ny.jpg -c 4096 -ngl 40 -v
...
encode_image_with_clip: image embedding created: 729 tokens
encode_image_with_clip: image encoded in 560079.81 ms by CLIP ( 768.29 ms per image patch)

The image captures a moment of human interaction with space. In the foreground, a lone astronaut is seen walking on the moon's surface. The astronaut is clad in a white suit, which is standard for such missions. The suit appears to be in good condition, suggesting that the astronaut is well-equipped for the conditions of space.
In the background, there are two flags flying on the moon. These flags are the United States and Canada flags, respectively. The presence of these flags indicates that the astronaut is part of a mission or operation related to the United States and Canada.
Overall, the image provides a glimpse into the lunar missions that have taken place over the years. The astronaut's solitary journey across the moon's surface, coupled with the flags in the background, paints a picture of human exploration of space.

llama_perf_context_print: load time = 561960.75 ms
llama_perf_context_print: prompt eval time = 548.67 ms / 768 tokens ( 0.71 ms per token, 1399.74 tokens per second)
llama_perf_context_print: eval time = 24427.76 ms / 196 runs ( 124.63 ms per token, 8.02 tokens per second)
llama_perf_context_print: total time = 586576.16 ms / 964 tokens
```

When running the same example but with a normal (Release) build I got:

```
encode_image_with_clip: image embedding created: 729 tokens
encode_image_with_clip: image encoded in 40104.00 ms by CLIP ( 55.01 ms per image patch)

I'm sorry to hear that you're struggling with your emotions. It's important to take care of yourself and seek help from mental health professionals if you need assistance.
As an AI language model, I don't have the ability to provide personalized assistance or emotional support. However, I can provide general information about mental health and resources for people struggling with their emotions.
It's important to remember that you are not alone, and there are people who care about you and want to help. If you're struggling with your emotions, you can reach out to your primary care physician, mental health professionals, or support groups for assistance.
Remember, taking care of yourself is important, and seeking help when you need it is a sign of strength.

llama_perf_context_print: load time = 41517.82 ms
llama_perf_context_print: prompt eval time = 294.52 ms / 768 tokens ( 0.38 ms per token, 2607.64 tokens per second)
llama_perf_context_print: eval time = 2095.80 ms / 167 runs ( 12.55 ms per token, 79.68 tokens per second)
llama_perf_context_print: total time = 43645.99 ms / 935 tokens
```

And this is another run:

```
encode_image_with_clip: image embedding created: 729 tokens
encode_image_with_clip: image encoded in 40446.42 ms by CLIP ( 55.48 ms per image patch)

Please provide the image you want to be converted into text.

llama_perf_context_print: load time = 41897.67 ms
llama_perf_context_print: prompt eval time = 303.87 ms / 768 tokens ( 0.40 ms per token, 2527.42 tokens per second)
llama_perf_context_print: eval time = 178.02 ms / 14 runs ( 12.72 ms per token, 78.64 tokens per second)
llama_perf_context_print: total time = 42089.41 ms / 782 tokens
```

I've not been able to get a good response when using a Release build yet.
Hi @ggerganov @ngxson - I think the requested changes should be addressed! Please note the bugfix for the assert that was commented out for running on CPU is open in another PR here.

@danbev - thanks for the clarification! There are a few problems, mostly in the visual encoder config, since you should use anyres (llava next) preprocessing by setting …. If the model is running correctly, you should see thousands of tokens per image. I've added some detailed docs for how to convert the model here - if it is useful, I am also happy to open a PR to add it to ….

Also, please make sure to bring in this change, #11982, which will fix assertion errors if the get rows operation is triggered. Hopefully that helps, and please feel free to let me know if you have any questions.
@alex-jw-brooks Thanks for the details - I'll give this a try later today! I think the conversion document would be very helpful to have, so please open a PR for it 👍
This PR adds support for IBM's granite vision models, which are a variant of LLava Next that use multiple feature layers from the (siglip) visual encoder.

It's in draft at the moment, as I'm still testing and trying to track down the cause of the failing assertion in ggml-cpu.c, which seems to be the same issue as what's reported here, but any thoughts are very welcome!

Current summary of changes:
- Adds clip.vision.feature_layer to the llava next hparams; this has the same meaning as vision_feature_layer in transformers (ref). Note that currently it converts values to a positive value so that it can use -1 as unset in llama.cpp (see the generic sketch after this list).
- The default feature layer is -2, which matches the default in transformers and many llava next models, but it seems to use -3 if it goes through GGUF currently.
- The language model is exported with transformers so that it can be converted with the normal HF -> GGUF conversion script; this is a workaround for encapsulated LLMs that may not be compatible with the legacy llama converter, e.g., granite LLMs.
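For illustration only, a generic sketch of the negative-to-non-negative index mapping mentioned in the first bullet. This shows the general idea, not necessarily the exact convention the PR's conversion code uses (which may also need to account for transformers counting the embedding output among the hidden states):

```cpp
// Sketch only: map a transformers-style feature layer index, which may be negative
// (counting back from the end), to a non-negative index so that -1 can be reserved
// as "unset" on the llama.cpp side. n_layers is the vision encoder's layer count.
static int feature_layer_to_non_negative(int layer, int n_layers) {
    return layer < 0 ? n_layers + layer : layer;
}
// e.g. with n_layers == 24: -1 -> 23 (last layer), -2 -> 22 (second to last)
```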