Add JinaBert model #407

joelpaulkoch · 2024-11-08T10:49:16Z

I want to share my work on the JinaBert model.
Not sure if you want to include it at all, since it's not officially part of transformers, you must specify trust_remote_code=True when running it with transformers, and there is still an open issue.

This PR would enable bumblebee users to run the jina embeddings v2 models.

The implementation of JinaBert is here.

Another issue with this being a custom implementation is that there is another variant that I started to work on: jinaai/jina-embeddings-v2-base-code.

Both, jinaai/jina-embeddings-v2-base-en and jinaai/jina-embeddings-v2-base-code, specify JinaBertForMaskedLM as architecture but point to different implementations.

 "_name_or_path": "jinaai/jina-bert-implementation",
  "architectures": [
    "JinaBertForMaskedLM"
  ],
  "auto_map": {
    "AutoConfig": "jinaai/jina-bert-implementation--configuration_bert.JinaBertConfig",
    "AutoModelForMaskedLM": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForMaskedLM",
    "AutoModel": "jinaai/jina-bert-implementation--modeling_bert.JinaBertModel",
    "AutoModelForSequenceClassification": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForSequenceClassification"
  },

vs.

  "_name_or_path": "jinaai/jina-bert-v2-qk-post-norm",
  "architectures": [
    "JinaBertForMaskedLM"
  ],
  "auto_map": {
    "AutoConfig": "jinaai/jina-bert-v2-qk-post-norm--configuration_bert.JinaBertConfig",
    "AutoModel": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertModel",
    "AutoModelForMaskedLM": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertForMaskedLM",
    "AutoModelForSequenceClassification": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertForSequenceClassification"
  },

Is there a mechanism in bumblebee to distinguish these?

There are still some issues in this PR, I will add comments and can work on them over the next days/weeks.

joelpaulkoch · 2024-11-08T10:53:40Z

lib/bumblebee.ex

@@ -150,6 +150,8 @@ defmodule Bumblebee do
    "GPTNeoXForCausalLM" => {Bumblebee.Text.GptNeoX, :for_causal_language_modeling},
    "GPTNeoXForSequenceClassification" => {Bumblebee.Text.GptNeoX, :for_sequence_classification},
    "GPTNeoXForTokenClassification" => {Bumblebee.Text.GptNeoX, :for_token_classification},
+    "JinaBertForMaskedLM" => {Bumblebee.Text.JinaBert, :for_masked_language_modeling},


The config says it's JinaBertForMaskedLM. However, with this mapping there are missing and unused parameters:

11:51:58.408 [debug] the following parameters were missing: * language_modeling_head.dense.kernel * language_modeling_head.dense.bias * language_modeling_head.output.kernel * language_modeling_head.bias.bias * language_modeling_head.norm.gamma * language_modeling_head.norm.beta 11:51:58.408 [debug] the following PyTorch parameters were unused: * pooler.dense.bias * pooler.dense.weight

Looks to me like this is not in line with the previous :for_masked_language_modeling implementation of BERT.
So, we could map here to the :base architecture instead?

There is JinaBertForMaskedLM implementation and it has the expected layers. I think the issue is that the model on the hub is actually JinaBertmodel and the config is wrong.

So the correct way to workaround this would be specifying architecture when loading:

Bumblebee.load_model({:hf, "..."}, architecture: :base)

It may be worth opening a PR on the HF repo, changing it to JinaBertmodel. Unfortunately, the same is the case for the other checkpoints of this model (small, etc).

joelpaulkoch · 2024-11-08T10:57:12Z

lib/bumblebee/text/jina_bert.ex

+  defp get_slopes_power_of_2(n) do
+    start = 2 ** -(2 ** -(:math.log2(n) - 3))
+    ratio = start
+    for i <- 0..(n - 1), do: start * ratio ** i
+  end
+
+  defp integer?(number) do
+    round(number) == number
+  end
+
+  defp get_alibi_head_slopes(n_heads) do
+    if integer?(:math.log2(n_heads)) do
+      get_slopes_power_of_2(n_heads)
+    else
+      closest_power_of_2 = 2 ** round(:math.floor(:math.log2(n_heads)))
+
+      get_slopes_power_of_2(closest_power_of_2) ++
+        (get_alibi_head_slopes(2 * closest_power_of_2)
+         |> Enum.take_every(2)
+         |> Enum.take(n_heads - closest_power_of_2))
+    end
+  end


I think I could rewrite all of this using Nx functions. I'm assuming that would theoretically speed things up. Not sure if it's worth the effort.

In terms of performance this is actually better than writing a defn, assuming n_heads is small (which is the case). The reason is that all of this code runs when building the defn expression (defn-compile time), and not as part of the inference. In other words, when compiling the model, we are building this small tensor and it gets embedded as a constant into the computation.

To make the distinction more clear, I would make alibi_matrix a defnp, and make alibi_head_slopes a deftransformp that returns a tensor.

joelpaulkoch · 2024-11-08T11:04:43Z

lib/bumblebee/text/jina_bert.ex

+    alibi_relative_bias_matrix =
+      Axon.nx(hidden_state, fn hidden_state ->
+        {_, seqlen, _} = Nx.shape(hidden_state)
+
+        matrix = alibi_matrix(spec.num_attention_heads, spec.max_positions)
+
+        matrix[[.., .., 0..(seqlen - 1), 0..(seqlen - 1)]]
+      end)


This way, we're recalculating the matrix on each run, right?

In the original implementation, they are storing the matrix and only recalculating when it's too small for the current seqlen, otherwise they cut it down to match the dimensions.
I guess we could do the same, maybe using Axon.StatefulOutput?

I'm also wondering, shouldn't we know seqlen anyways when we're compiling the model?

I'm also wondering, shouldn't we know seqlen anyways when we're compiling the model?

Yes! PyTorch models generally accept a dynamically sized inputs and adjust the computation accordingly. In our case, we always know the lengths at model compile time.

Consequently, instead of computing the whole matrix and slicing it, ideally we would compute it only for the known length, to avoid unnecessary work and memory consumption.

As for reusing the matrix, in cases like this we always create the tensor at runtime. We prefer the model to be stateless and also let the XLA compiler optimise across operations. There are certain cases where we need some level of statefulness, such as autoregressive text generation, and we do it with an explicit "cache" output/input (though a whole generation request is still stateless).

joelpaulkoch · 2024-11-08T11:11:47Z

lib/bumblebee/text/jina_bert.ex

+        "encoder.blocks.{n}.ffn.wo" => "encoder.layer.{n}.mlp.wo",
+        "encoder.blocks.{n}.ffn.layernorm" => "encoder.layer.{n}.mlp.layernorm",
+        "encoder.blocks.{n}.ffn.gated_layers" => "encoder.layer.{n}.mlp.gated_layers"


I took over the param names, so we might want to change them

joelpaulkoch · 2024-11-08T11:15:07Z

test/bumblebee/text/jina_bert_test.exs

+
+  @tag :skip
+  test ":base" do
+    repo = {:hf, "doesnotexist/tiny-random-JinaBert"}


I've tried to make a tiny-random-JinaBert with no success, I might try again..

jonatanklosko · 2024-11-12T08:40:54Z

Hey @joelpaulkoch, thanks for the PR and a great article!

To be honest, I am hesitant to support implementations from the Hub because (a) theoretically they are less stable, because they may be still subject to tweaks; (b) model proliferation is more likely, the jina-embeddings-v2-base-en vs jina-embeddings-v2-base-code is a good example.

We generally wait until models make it to hf/transformers, though from huggingface/transformers#27035 it's not clear if that's ever going to happen.

At the moment, I would defer the decision and see how the status quo evolves. People can still use the model by installing bumblebee as {:bumblebee, github: "joelpaulkoch/jina-embeddings-v2"}.

joelpaulkoch added 4 commits October 25, 2024 20:20

Add JinaBert with copy of Bert implementation

958e8a3

Implement JinaBert

b1b6ec3

Clean up

ae2baf7

Add jina_bert_test.exs

ae6024c

joelpaulkoch commented Nov 8, 2024

View reviewed changes

Some more clean up

cbe59c6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add JinaBert model #407

Add JinaBert model #407

joelpaulkoch commented Nov 8, 2024

joelpaulkoch Nov 8, 2024

jonatanklosko Nov 12, 2024

joelpaulkoch Nov 8, 2024

jonatanklosko Nov 12, 2024

joelpaulkoch Nov 8, 2024

jonatanklosko Nov 12, 2024

joelpaulkoch Nov 8, 2024

joelpaulkoch Nov 8, 2024

jonatanklosko commented Nov 12, 2024

Add JinaBert model #407

Are you sure you want to change the base?

Add JinaBert model #407

Conversation

joelpaulkoch commented Nov 8, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonatanklosko commented Nov 12, 2024