Update dependency @huggingface/transformers to v3.1.0 #59

renovate · 2024-11-26T23:27:17Z

This PR contains the following updates:

Package	Change	Age	Adoption	Passing	Confidence
@huggingface/transformers	`3.0.2` -> `3.1.0`

Release Notes

huggingface/transformers.js (@huggingface/transformers)

`v3.1.0`

Compare Source

🚀 Transformers.js v3.1 — any-to-any, text-to-image, image-to-text, pose estimation, time series forecasting, and more!

Table of contents:

🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.
🐛 Bug fixes
📝 Documentation improvements
🛠️ Other improvements
🤗 New contributors

🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.

Janus for any-to-any generation (e.g., image-to-text and text-to-image)

First of all, this release adds support for Janus, a novel autoregressive framework that unifies multimodal understanding and generation. The most popular model, deepseek-ai/Janus-1.3B, is tagged as an "any-to-any" model, and has specifically been trained for the following tasks:

Example: Image-Text-to-Text

import { AutoProcessor, MultiModalityCausalLM } from "@&#8203;huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "<image_placeholder>\nConvert the formula into latex code.",
    images: ["https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/quadratic_formula.png"],
  },
];
const inputs = await processor(conversation);

// Generate response
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 150,
  do_sample: false,
});

// Decode output
const new_tokens = outputs.slice(null, [inputs.input_ids.dims.at(-1), null]);
const decoded = processor.batch_decode(new_tokens, { skip_special_tokens: true });
console.log(decoded[0]);

Sample output:

Sure, here is the LaTeX code for the given formula:

```
x = \frac{-b \pm \sqrt{b^2 - 4a c}}{2a}
```

This code represents the mathematical expression for the variable \( x \).

Example: Text-to-Image

import { AutoProcessor, MultiModalityCausalLM } from "@&#8203;huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
  },
];
const inputs = await processor(conversation, { chat_template: "text_to_image" });

// Generate response
const num_image_tokens = processor.num_image_tokens;
const outputs = await model.generate_images({
  ...inputs,
  min_new_tokens: num_image_tokens,
  max_new_tokens: num_image_tokens,
  do_sample: true,
});

// Save the generated image
await outputs[0].save("test.png");

Sample outputs:

Qwen2-VL for Image-Text-to-Text

Example: Image-Text-to-Text

Next, we added support for Qwen2-VL, the multimodal large language model series developed by Qwen team, Alibaba Cloud. It introduces the Naive Dynamic Resolution mechanism, allowing the model to process images of varying resolutions and leading to more efficient and accurate visual representations.

import { AutoProcessor, Qwen2VLForConditionalGeneration, RawImage } from "@&#8203;huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id);

// Prepare inputs
const url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg";
const image = await (await RawImage.read(url)).resize(448, 448);
const conversation = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image." },
    ],
  },
];
const text = processor.apply_chat_template(conversation, { add_generation_prompt: true });
const inputs = await processor(text, image);

// Perform inference
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});

// Decode output
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);
// The image depicts a serene beach scene with a woman and a dog. The woman is sitting on the sand, wearing a plaid shirt, and appears to be engaged in a playful interaction with the dog. The dog, which is a large breed, is sitting on its hind legs and appears to be reaching out to the woman, possibly to give her a high-five or a paw. The background shows the ocean with gentle waves, and the sky is clear, suggesting it might be either sunrise or sunset. The overall atmosphere is calm and relaxed, capturing a moment of connection between the woman and the dog.

JinaCLIP for multimodal embeddings

Example: Compute text and/or image embeddings with jinaai/jina-clip-v2:

import { AutoModel, AutoProcessor, RawImage, matmul } from "@&#8203;huggingface/transformers";

// Load processor and model
const model_id = "jinaai/jina-clip-v2";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModel.from_pretrained(model_id, { dtype: "q4" /* e.g., "fp16", "q8", or "q4" */ });

// Prepare inputs
const urls = ["https://i.ibb.co/nQNGqL0/beach1.jpg", "https://i.ibb.co/r5w8hG8/beach2.jpg"];
const images = await Promise.all(urls.map(url => RawImage.read(url)));
const sentences = [
    "غروب جميل على الشاطئ", // Arabic
    "海滩上美丽的日落", // Chinese
    "Un beau coucher de soleil sur la plage", // French
    "Ein wunderschöner Sonnenuntergang am Strand", // German
    "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία", // Greek
    "समुद्र तट पर एक खूबसूरत सूर्यास्त", // Hindi
    "Un bellissimo tramonto sulla spiaggia", // Italian
    "浜辺に沈む美しい夕日", // Japanese
    "해변 위로 아름다운 일몰", // Korean
];

// Encode text and images
const inputs = await processor(sentences, images, { padding: true, truncation: true });
const { l2norm_text_embeddings, l2norm_image_embeddings } = await model(inputs);

// Encode query (text-only)
const query_prefix = "Represent the query for retrieving evidence documents: ";
const query_inputs = await processor(query_prefix + "beautiful sunset over the beach");
const { l2norm_text_embeddings: query_embeddings } = await model(query_inputs);

// Compute text-image similarity scores
const text_to_image_scores = await matmul(query_embeddings, l2norm_image_embeddings.transpose(1, 0));
console.log("text-image similarity scores", text_to_image_scores.tolist()[0]); // [0.29530206322669983, 0.3183615803718567]

// Compute image-image similarity scores
const image_to_image_score = await matmul(l2norm_image_embeddings[0], l2norm_image_embeddings[1]);
console.log("image-image similarity score", image_to_image_score.item()); // 0.9344457387924194

// Compute text-text similarity scores
const text_to_text_scores = await matmul(query_embeddings, l2norm_text_embeddings.transpose(1, 0));
console.log("text-text similarity scores", text_to_text_scores.tolist()[0]); // [0.5566609501838684, 0.7028406858444214, 0.582255482673645, 0.6648036241531372, 0.5462006330490112, 0.6791588068008423, 0.6192430257797241, 0.6258729100227356, 0.6453716158866882]

LLaVA-OneVision for Image-Text-to-Text

Example: Multi-round conversations w/ PKV caching

import { AutoProcessor, AutoTokenizer, LlavaOnevisionForConditionalGeneration, RawImage } from '@&#8203;huggingface/transformers';

// Load tokenizer, processor and model
const model_id = 'llava-hf/llava-onevision-qwen2-0.5b-ov-hf';

const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaOnevisionForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        embed_tokens: 'fp16', // or 'fp32' or 'q8'
        vision_encoder: 'fp16', // or 'fp32' or 'q8'
        decoder_model_merged: 'q4', // or 'q8'
    },
    // device: 'webgpu',
});

// Prepare text inputs
const prompt = 'What does the text say?';
const messages = [
    { role: 'system', content: 'Answer the question.' },
    { role: 'user', content: `<image>\n${prompt}` }
]
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text);

// Prepare vision inputs
const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Generate response
const { past_key_values, sequences } = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 64,
    return_dict_in_generate: true,
});

// Decode output
const answer = tokenizer.decode(
    sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(answer);
// The text says "small but mighty" in a playful font.

const new_messages = [
    ...messages,
    { role: 'assistant', content: answer },
    { role: 'user', content: 'How does the text correlate to the context of the image?' }
]
const new_text = tokenizer.apply_chat_template(new_messages, { tokenize: false, add_generation_prompt: true });
const new_text_inputs = tokenizer(new_text);

// Generate another response
const output = await model.generate({
    ...new_text_inputs,
    past_key_values,
    do_sample: false,
    max_new_tokens: 256,
});
const new_answer = tokenizer.decode(
    output.slice(0, [new_text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(new_answer);
// The text "small but mighty" is likely a playful or humorous reference to the image of the blue mouse with the orange dumbbell. It could be used as a motivational phrase or a playful way to express the idea that even small things can be impressive or powerful.

ViTPose for pose-estimation

Example: Pose estimation w/ onnx-community/vitpose-base-simple.

import { AutoModel, AutoImageProcessor, RawImage } from '@&#8203;huggingface/transformers';

// Load model and processor
const model_id = 'onnx-community/vitpose-base-simple';
const model = await AutoModel.from_pretrained(model_id);
const processor = await AutoImageProcessor.from_pretrained(model_id);

// Load image and prepare inputs
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/ryan-gosling.jpg';
const image = await RawImage.read(url);
const inputs = await processor(image);

// Predict heatmaps
const { heatmaps } = await model(inputs);

// Post-process heatmaps to get keypoints and scores
const boxes = [[[0, 0, image.width, image.height]]];
const results = processor.post_process_pose_estimation(heatmaps, boxes)[0][0];
console.log(results);

Optionally, visualize the outputs (Node.js usage shown here, using the `canvas` library):

import { createCanvas, createImageData } from 'canvas';

// Create canvas and draw image
const canvas = createCanvas(image.width, image.height);
const ctx = canvas.getContext('2d');
const imageData = createImageData(image.rgba().data, image.width, image.height);
ctx.putImageData(imageData, 0, 0);

// Draw edges between keypoints
const points = results.keypoints;
ctx.lineWidth = 4;
ctx.strokeStyle = 'blue';
for (const [i, j] of model.config.edges) {
    const [x1, y1] = points[i];
    const [x2, y2] = points[j];
    ctx.beginPath();
    ctx.moveTo(x1, y1);
    ctx.lineTo(x2, y2);
    ctx.stroke();
}

// Draw circle at each keypoint
ctx.fillStyle = 'red';
for (const [x, y] of points) {
    ctx.beginPath();
    ctx.arc(x, y, 8, 0, 2 * Math.PI);
    ctx.fill();
}

// Save image to file
import fs from 'fs';
const out = fs.createWriteStream('pose.png');
const stream = canvas.createPNGStream();
stream.pipe(out)
out.on('finish', () =>  console.log('The PNG file was created.'));

Input image	Output image

MGP-STR for Optical Character Recognition (OCR)

Example: Optical Character Recognition (OCR) w/ onnx-community/mgp-str-base

import { MgpstrForSceneTextRecognition, MgpstrProcessor, RawImage } from '@&#8203;huggingface/transformers';

const model_id = 'onnx-community/mgp-str-base';
const model = await MgpstrForSceneTextRecognition.from_pretrained(model_id);
const processor = await MgpstrProcessor.from_pretrained(model_id);

// Load image from the IIIT-5k dataset
const url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png";
const image = await RawImage.read(url);

// Preprocess the image
const result = await processor(image);

// Perform inference
const outputs = await model(result);

// Decode the model outputs
const generated_text = processor.batch_decode(outputs.logits).generated_text;
console.log(generated_text); // [ 'ticket' ]

PatchTST and PatchTSMixer for time series forecasting.

Example: Time series forecasting w/ onnx-community/granite-timeseries-patchtst

import { PatchTSTForPrediction, Tensor } from "@&#8203;huggingface/transformers";

const model_id = "onnx-community/granite-timeseries-patchtst";
const model = await PatchTSTForPrediction.from_pretrained(model_id, { dtype: "fp32" });

const dims = [64, 512, 7];
const prod = dims.reduce((a, b) => a * b, 1);
const past_values = new Tensor('float32',
    Float32Array.from({ length: prod }, (_, i) => i / prod),
    dims,
);
const { prediction_outputs } = await model({ past_values });
console.log(prediction_outputs);

Example: Time series forecasting w/ onnx-community/granite-timeseries-patchtsmixer

import { PatchTSMixerForPrediction, Tensor } from "@&#8203;huggingface/transformers";

const model_id = "onnx-community/granite-timeseries-patchtsmixer";
const model = await PatchTSMixerForPrediction.from_pretrained(model_id, { dtype: "fp32" });

const dims = [64, 512, 7];
const prod = dims.reduce((a, b) => a * b, 1);
const past_values = new Tensor('float32',
    Float32Array.from({ length: prod }, (_, i) => i / prod),
    dims,
);
const { prediction_outputs } = await model({ past_values });
console.log(prediction_outputs);

🐛 Bug fixes

When padding an image, the dimensions get stretched by @BritishWerewolf in https://github.com/huggingface/transformers.js/pull/1015
fix(scale): add missing scale element by @tosinamuda in https://github.com/huggingface/transformers.js/pull/1017

📝 Documentation improvements

Updated link to sentence similarity models. by @uzyn in https://github.com/huggingface/transformers.js/pull/893
fix(docs): fixed a broken link to quantization guide by @ThomasWT in https://github.com/huggingface/transformers.js/pull/1014
fix(docs): Fixed Typos in README and docs/snippets/6_supported-models.snippet by @hitchhiker3010 in https://github.com/huggingface/transformers.js/pull/1030

🛠️ Other improvements

Add option to maintain aspect ratio on resize by @BritishWerewolf in https://github.com/huggingface/transformers.js/pull/971
Add functionality to split RawImage into channels; Update slice documentation and tests by @BritishWerewolf in https://github.com/huggingface/transformers.js/pull/978
Avoid resizing images when they already have the desired size by @nemphys in https://github.com/huggingface/transformers.js/pull/1027
Add support for Split pretokenizer w/ behavior=removed & invert=false by @xenova in https://github.com/huggingface/transformers.js/pull/1033
Add type declaration for progress_callback by @ocavue in https://github.com/huggingface/transformers.js/pull/1034
Add support for op_block_list by @pdufour in https://github.com/huggingface/transformers.js/pull/1036

🤗 New contributors

@uzyn made their first contribution in https://github.com/huggingface/transformers.js/pull/893
@ThomasWT made their first contribution in https://github.com/huggingface/transformers.js/pull/1014
@tosinamuda made their first contribution in https://github.com/huggingface/transformers.js/pull/1017
@nemphys made their first contribution in https://github.com/huggingface/transformers.js/pull/1027
@hitchhiker3010 made their first contribution in https://github.com/huggingface/transformers.js/pull/1030
@pdufour made their first contribution in https://github.com/huggingface/transformers.js/pull/1036

Full Changelog: huggingface/transformers.js@3.0.2...3.1.0

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Enabled.

♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

Update dependency @huggingface/transformers to v3.1.0

48c72b5

renovate bot enabled auto-merge (squash) November 26, 2024 23:27

renovate bot merged commit 4c71eb8 into main Nov 26, 2024
1 check passed

renovate bot deleted the renovate/huggingface-transformers-3.x branch November 26, 2024 23:28

ainoya-bot bot mentioned this pull request Nov 26, 2024

Release for v0.0.17 #61

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update dependency @huggingface/transformers to v3.1.0 #59

Update dependency @huggingface/transformers to v3.1.0 #59

renovate bot commented Nov 26, 2024

Update dependency @huggingface/transformers to v3.1.0 #59

Update dependency @huggingface/transformers to v3.1.0 #59

Conversation

renovate bot commented Nov 26, 2024

Release Notes

v3.1.0

🚀 Transformers.js v3.1 — any-to-any, text-to-image, image-to-text, pose estimation, time series forecasting, and more!

🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.

Janus for any-to-any generation (e.g., image-to-text and text-to-image)

Qwen2-VL for Image-Text-to-Text

JinaCLIP for multimodal embeddings

LLaVA-OneVision for Image-Text-to-Text

ViTPose for pose-estimation

MGP-STR for Optical Character Recognition (OCR)

PatchTST and PatchTSMixer for time series forecasting.

🐛 Bug fixes

📝 Documentation improvements

🛠️ Other improvements

🤗 New contributors

Configuration

`v3.1.0`