Image feature extraction. #213

Open · wants to merge 3 commits into base: release/v2.1.2

Conversation

NOSCOPEdev

mmproj model loading and image feature extraction update. You will need to load a vision model and its mmproj file. The settings are in the "LLM.cs" script under the "Advanced Options". You will also need llamalib 1.17 or higher.

Model used: llava-v1.6-mistral-7b.Q4_K_M, mmproj-model-f16
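
For illustration, a minimal sketch of what this setup might look like if driven from code; the MMPROJmodel field name comes from this PR's LLM.cs changes, while the model field and file names are assumptions based on the models listed above:

// Sketch only: point the LLM component at a vision model and its mmproj file
// (normally set in the inspector under "Advanced Options").
public LLM llm;

void ConfigureVision()
{
    llm.model = "llava-v1.6-mistral-7b.Q4_K_M.gguf";  // vision-capable base model
    llm.MMPROJmodel = "mmproj-model-f16.gguf";        // multimodal projector file
}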
amakropoulos self-requested a review on August 20, 2024, 16:33
amakropoulos (Collaborator) left a comment:

Thanks a lot for this PR!!
It needs some work before it can be merged; I have left some comments.

@@ -0,0 +1,38 @@
using UnityEngine;
amakropoulos (Collaborator):

This file should be moved into a sample dir inside the Samples~ folder, e.g. Samples~/ImageReceiver/ImageReceiver.cs.
Also rename it to ImageReceiver.cs :)

amakropoulos (Collaborator):

The same applies to the AndroidLlava.unity scene above.
Also rename it to Scene.unity, similarly to the other samples.


// This field is used to relay the image to the AI; it can be either a URL or a path to a file on your system.

public TextMeshProUGUI AnyImageData;
amakropoulos (Collaborator):

Please use a Text element instead of a TextMeshProUGUI element.
TextMeshProUGUI requires the TMP assets, which vary between Unity versions.
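
For illustration, the change would look something like this (a sketch, assuming the field keeps its name):

using UnityEngine.UI;

// Plain UI Text ships with every Unity version; no TMP package needed.
public Text AnyImageData;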

public TextMeshProUGUI AnyImageData;

// Should work with any script that calls the Chat function on the LLMCharacter script.
public AndroidDemo AD;
amakropoulos (Collaborator):

Copy and paste the SimpleInteraction.cs code and modify it.
This ensures that samples are independent from each other and users can install whichever they want.
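
As an illustration of that independence, a self-contained sample could hold its own LLMCharacter reference and build the query itself instead of relaying through AndroidDemo; this is a sketch, not the PR's code, and the class layout mirrors SimpleInteraction:

using UnityEngine;
using UnityEngine.UI;
using LLMUnity;

public class ImageReceiver : MonoBehaviour
{
    public LLMCharacter llmCharacter;  // assigned in the inspector, as in SimpleInteraction
    public Text AnyImageData;          // holds the image URL

    public async void SendImageToAI()
    {
        // Build the same text + image_url content array the PR sends, then chat
        // through the sample's own LLMCharacter instead of another sample's script.
        string query = "[{\"type\": \"text\", \"text\": \"What's in this image?\"},"
                     + "{\"type\": \"image_url\", \"image_url\": {\"url\": \"" + AnyImageData.text + "\"}}]";
        string reply = await llmCharacter.Chat(query);
        Debug.Log(reply);
    }
}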


public void SendImageToAI()
{
    AD.onInputFieldSubmit("[\r\n {\"role\": \"system\", \"content\": \"You are an assistant who perfectly describes images.\"},\r\n {\r\n \"role\": \"user\",\r\n \"content\": [\r\n {\"type\": \"text\", \"text\": \"What's in this image?\"},\r\n {\"type\": \"image_url\", \"image_url\": {\"url\": \"" + AnyImageData.text + "\" } }\r\n ]\r\n }\r\n]");
}
amakropoulos (Collaborator):

I see what you're doing here; it's better to define a function in Runtime/LLMCharacter.cs that takes over this part and can be reused, e.g.:

public async Task<string> ChatWithImage(string query, Uri url, Callback<string> callback = null, EmptyCallback completionCallback = null, bool addToHistory = true)
{
   URLContent urlText = new URLContent(){ url = url.ToString() };
   ImageURLContent urlContent = new ImageURLContent(){ type = "image_url", image_url = urlText };
   TextContent message = new TextContent(){ type = "text", text = query };

   string queryWithImage = "[" + JsonUtility.ToJson(message) + "," + JsonUtility.ToJson(urlContent) + "]";
   return await Chat(queryWithImage, callback, completionCallback, addToHistory);
}

public async Task<string> ChatWithImage(string query, string path, Callback<string> callback = null, EmptyCallback completionCallback = null, bool addToHistory = true)
{
   string queryWithImage = ...
   return await Chat(queryWithImage, callback, completionCallback, addToHistory);
}

and inside the Runtime/LLMInterface.cs

    [Serializable]
    public struct TextContent
    {
        public string type;
        public string text;
    }

    [Serializable]
    public struct ImageURLContent
    {
        public string type;
        public URLContent image_url;
    }

    [Serializable]
    public struct URLContent
    {
        public string url;
    }
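
For illustration, with that overload in place the sample's SendImageToAI could shrink to something like this (a sketch; assumes using System; for Uri and that AnyImageData holds the image URL):

// Sketch only: delegate the JSON assembly to the proposed ChatWithImage overload.
public async void SendImageToAI()
{
    string reply = await llmCharacter.ChatWithImage("What's in this image?", new Uri(AnyImageData.text));
    Debug.Log(reply);
}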

amakropoulos (Collaborator):

Instead of manually defining the "What's in this image?" text, you can use the existing text box in the SimpleInteraction sample.

if (remote) arguments += $" --port {port} --host 0.0.0.0";
if (numThreadsToUse > 0) arguments += $" -t {numThreadsToUse}";
if (loraPath != "") arguments += $" --lora \"{loraPath}\"";
if (MMPROJmodel != "") arguments += $" --mmproj \"{MMPROJmodel}\"";
amakropoulos (Collaborator):

Instead of copying a new LLM.cs file, modify Runtime/LLM.cs to add the MMPROJmodel.
The MMPROJmodel needs to be treated similarly to e.g. the loras: rather than providing it as text, it needs some additional functionality to load it and make sure it is added inside the builds.
I will take over this part because it is quite involved.
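
For context, a rough sketch of one piece of that functionality, resolving the mmproj file against StreamingAssets so it is found both in the editor and inside builds; the helper name is an assumption, not the project's actual API (assumes System.IO and UnityEngine are imported):

// Hypothetical helper: resolve the mmproj file the way model/lora files are
// resolved, so the same relative path works in the editor and in builds.
string ResolveMMPROJPath()
{
    if (string.IsNullOrEmpty(MMPROJmodel)) return "";
    return Path.Combine(Application.streamingAssetsPath, MMPROJmodel);
}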
