ChatMessage content being `str`-only doesn't allow user to pass image #7848

tomarharsh · 2024-06-12T18:45:36Z

Is your feature request related to a problem? Please describe.
While talking to our bot, the user is allowed to send an image. This image is sent to a vision-enabled LLM bot. Haystack's ChatMessage class only allows content to be a string, but it needs to accept a list as well. Here's the OpenAI page that Haystack refers to for content; it allows an array, so an image_url can be sent that way.

Describe the solution you'd like
ChatMessage should be able to handle an inbound image.

Describe alternatives you've considered
Not using the generator component at all is the only other alternative I can explore.

Additional context
Haystack's ChatMessage content: Link
OpenAI's chat message parameter: Link
How ChatMessage content is getting populated from the generator: Link


CarlosFerLo · 2024-06-16T11:19:36Z

I will try to add this functionality :)

CarlosFerLo · 2024-06-16T11:57:52Z

I've reviewed the base code and propose that we enable the 'content' of a 'ChatMessage' to be set as a list containing 'str', 'Path', or any type used to encode an image. This will require us to rewrite the 'to_openai_format' method and incorporate image processing with 'base64' for calls involving images. We'll also need to address serialization issues, but we can handle those once #7849 is merged into the main branch to avoid merge conflicts.

The main challenge will be accurately distinguishing between images and text in the input list, especially when the input is a string. It would be helpful to know which data types you want to support for images. I'll begin working on this after the mentioned PR is merged.

lbux · 2024-06-17T03:55:12Z

> The main challenge will be accurately distinguishing between images and text in the input list, especially when the input is a string. It would be helpful to know which data types you want to support for images.

I don't think we should try to extract this info ourselves. We should make the user specify it. My idea is to make a ContentPart class with type, text, image_url, base_64, and detail. We can then have helper methods in this class that help with formatting.

Essentially, we would allow for something like this:

message = ChatMessage.from_user([
    ContentPart.from_text("What’s in this image?"),
    ContentPart.from_image_url("example.com/test.jpg"),
    ContentPart.from_base64_image(base64_image)
])
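A minimal sketch of what such a class might look like, purely as an illustration of the proposal. The field names come from the list above; the `mime_type` field, its `image/jpeg` default, and the `to_openai_format` helper are assumptions added here and are not part of Haystack.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ContentPart:
    """Hypothetical container for one part of a multimodal message."""

    type: str  # "text" or "image_url"
    text: Optional[str] = None
    image_url: Optional[str] = None
    base_64: Optional[str] = None
    detail: Optional[str] = None
    mime_type: str = "image/jpeg"  # assumption: not in the proposed field list

    @classmethod
    def from_text(cls, text: str) -> "ContentPart":
        return cls(type="text", text=text)

    @classmethod
    def from_image_url(cls, url: str, detail: Optional[str] = None) -> "ContentPart":
        return cls(type="image_url", image_url=url, detail=detail)

    @classmethod
    def from_base64_image(
        cls, data: str, mime_type: str = "image/jpeg", detail: Optional[str] = None
    ) -> "ContentPart":
        return cls(type="image_url", base_64=data, mime_type=mime_type, detail=detail)

    def to_openai_format(self) -> Dict[str, Any]:
        """Render this part in the shape of OpenAI's chat content-part objects."""
        if self.type == "text":
            return {"type": "text", "text": self.text}
        # Prefer an explicit URL; otherwise wrap the base64 payload in a data URL.
        url = self.image_url
        if url is None:
            url = f"data:{self.mime_type};base64,{self.base_64}"
        image: Dict[str, Any] = {"url": url}
        if self.detail is not None:
            image["detail"] = self.detail
        return {"type": "image_url", "image_url": image}
```

With this shape, the user states explicitly which parts are text and which are images, so nothing has to be inferred from a bare string.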

We should also look into deprecating Functions and supporting Tools within ChatMessage as that has also changed.

CarlosFerLo · 2024-06-18T13:51:56Z

I will implement this functionality. Regarding the deprecation of Functions, we could open an issue to handle it separately.

vblagoje · 2024-06-19T07:46:51Z

> > The main challenge will be accurately distinguishing between images and text in the input list, especially when the input is a string. It would be helpful to know which data types you want to support for images.
>
> I don't think we should try and extract this info ourselves. We should make the user specify. My idea is to make a ContentPart class with type, text, image_url, base_64, and detail. We can then have helper methods in this class that helps with formatting.
>
> Essentially, we would allow for something like this:
>
> message = ChatMessage.from_user([
>     ContentPart.from_text("What’s in this image?"),
>     ContentPart.from_image_url("example.com/test.jpg"),
>     ContentPart.from_base64_image(base64_image)
> ])
>
> We should also look into deprecating Functions and supporting Tools within ChatMessage as that has also changed.

I agree with this direction. We need to look at the multimodal message formats across all LLM providers and find the common denominators. From a brief look, I believe these multimodal/multipart messages are all JSON payloads of various formats (schemas). So let's come up with nice abstractions (like the ContentPart idea above) that hide the implementation details, and see how they map to the data structures of the various LLM providers.

silvanocerza · 2024-06-19T08:47:22Z

We can keep it much simpler.

As of now models can receive and generate the following:

text
image
audio
video
heterogeneous list of all the above

We have all the necessary abstractions to define the above.
str obviously for text.
haystack.dataclasses.ByteStream for image, audio and video.
The list is List[Union[str, ByteStream]] then.

Given that, we can say that the ChatMessage.content type should be Union[str, ByteStream, List[Union[str, ByteStream]]].

This abstracts, at a high level, all the supported types of data a model receives and generates. If model X needs its input or generates its output in a certain format, its Generator will handle the conversion, but that's an implementation detail.

Introducing new classes or new abstractions is not the way to go in my opinion.
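A rough sketch of how that type could be spelled out and normalized before a Generator converts it. The `ByteStream` class below is a reduced stand-in for haystack.dataclasses.ByteStream so the snippet runs on its own, and `Content`, `ContentItem`, and `as_items` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class ByteStream:
    """Stand-in for haystack.dataclasses.ByteStream with only the fields used here."""

    data: bytes
    mime_type: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)


# The content type proposed above, written as type aliases.
ContentItem = Union[str, ByteStream]
Content = Union[str, ByteStream, List[ContentItem]]


def as_items(content: Content) -> List[ContentItem]:
    """Normalize any accepted content shape into a flat list of items."""
    if isinstance(content, (str, ByteStream)):
        return [content]
    if isinstance(content, list):
        for item in content:
            if not isinstance(item, (str, ByteStream)):
                raise TypeError(f"Unsupported content item: {type(item).__name__}")
        return list(content)
    raise TypeError(f"Unsupported content type: {type(content).__name__}")
```

Normalizing first would let each Generator handle a single case (a list of items) instead of branching on three shapes.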

CarlosFerLo · 2024-06-19T12:44:35Z

@silvanocerza I like the simplicity of your solution, but I've just read the code for 'ByteStream', and we would have to expect the metadata to be populated with some flag indicating the content type; otherwise we won't be able to distinguish between them. That's why I believe the 'ContentPart' approach is easier to handle, and it allows us to support broader input types for the different formats.
I will proceed with this implementation as soon as #7849 is merged to main.
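One way such a flag could work, assuming each ByteStream carries a MIME type string, is to dispatch on its top-level type. The helper name `modality_from_mime_type` is hypothetical and only sketches the idea.

```python
from typing import Optional

SUPPORTED_MODALITIES = {"image", "audio", "video"}


def modality_from_mime_type(mime_type: Optional[str]) -> str:
    """Map a MIME type such as 'image/png' to 'image', 'audio', or 'video'."""
    if not mime_type:
        raise ValueError("A MIME type is required to tell binary content apart.")
    # The part before the slash is the top-level media type.
    top_level = mime_type.split("/", 1)[0].strip().lower()
    if top_level not in SUPPORTED_MODALITIES:
        raise ValueError(f"Unsupported MIME type: {mime_type}")
    return top_level
```

This only works when the MIME type is actually set, which is the concern raised above.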

silvanocerza · 2024-09-24T14:18:04Z

Still relevant, reopening.

vblagoje · 2024-09-25T07:49:33Z

> Still relevant, reopening.

I agree - schedule it soon as well.

lbux · 2024-10-11T00:56:55Z

> We can keep it much simpler.
>
> As of now models can receive and generate the following:
>
> text
> image
> audio
> video
> heterogeneous list of all the above
>
> We have all the necessary abstractions to define the above. str obviously for text. haystack.dataclasses.ByteStream for image, audio and video. The list is List[Union[str, ByteStream]] then.
>
> Given that we say that ChatMessage.content type should be Union[str, ByteStream, List[Union[str, ByteStream]]].
>
> This abstracts at an high level all the supported type of data a model receives and generates. If model X needs their input or generates their output in a certain format its Generator will handle the conversion, but that's an implementation detail.
>
> Introducing new classes or new abstractions is not the way to go in my opinion.

Trying to understand how this implementation would work. Let's say that the ChatMessage class is modified to change content to be Union[str, ByteStream, List[Union[str, ByteStream]]]. Given this, we should be able to pass in a ByteStream to ChatMessage. This, I understand. However, my understanding is that you want to move the actual implementation to each generator?

Then does that mean we would need to handle the conversion in, say _convert_message_to_openai_format, like so:

from base64 import b64encode
from typing import Any, Dict

from haystack.dataclasses import ByteStream, ChatMessage


def _convert_message_to_openai_format(message: ChatMessage) -> Dict[str, Any]:
    """
    Convert a message to the format expected by OpenAI's Chat API.

    See the [API reference](https://platform.openai.com/docs/api-reference/chat/create) for details.

    :returns: A dictionary with the following keys:
        - `role`
        - `content`
        - `name` (optional)
    """

    openai_msg: Dict[str, Any] = {"role": message.role.value}

    if isinstance(message.content, str):
        # Plain text keeps the existing string-only format.
        openai_msg["content"] = message.content
    elif isinstance(message.content, ByteStream):
        # A single image is sent as a base64 data URL.
        base64_data = b64encode(message.content.data).decode("utf-8")
        url = f"data:{message.content.mime_type};base64,{base64_data}"
        openai_msg["content"] = [{"type": "image_url", "image_url": {"url": url}}]
    elif isinstance(message.content, list):
        # A mixed list becomes a list of typed content parts.
        openai_msg["content"] = []
        for item in message.content:
            if isinstance(item, str):
                openai_msg["content"].append({"type": "text", "text": item})
            elif isinstance(item, ByteStream):
                base64_data = b64encode(item.data).decode("utf-8")
                url = f"data:{item.mime_type};base64,{base64_data}"
                openai_msg["content"].append({"type": "image_url", "image_url": {"url": url}})

    if message.name:
        openai_msg["name"] = message.name

    return openai_msg

This works provided that the user specifies a valid mime_type in ByteStream.from_file_path. Alternatively, we can try to infer it like so:

@classmethod
def from_file_path(
    cls, filepath: Path, mime_type: Optional[str] = None, meta: Optional[Dict[str, Any]] = None
) -> "ByteStream":
    """
    Create a ByteStream from the contents read from a file.

    :param filepath: A valid path to a file.
    :param mime_type: The mime type of the file.
    :param meta: Additional metadata to be stored with the ByteStream.
    """

    # Fall back to guessing from the file extension (needs `import mimetypes`).
    if mime_type is None:
        mime_type = mimetypes.guess_type(filepath)[0]
        if mime_type is None:
            raise ValueError("Mime type was not supplied and could not be guessed.")

    with open(filepath, "rb") as fd:
        return cls(data=fd.read(), mime_type=mime_type, meta=meta or {})
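As a quick illustration of the inference step on its own, the standard-library `mimetypes` module guesses from the file extension and does not read the file. `guess_mime_type` is a hypothetical helper name used only for this sketch.

```python
import mimetypes
from pathlib import Path


def guess_mime_type(filepath: Path) -> str:
    """Guess a MIME type from the file extension, or fail loudly."""
    # guess_type returns a (type, encoding) tuple; only the type is needed here.
    mime_type, _encoding = mimetypes.guess_type(str(filepath))
    if mime_type is None:
        raise ValueError("Mime type was not supplied and could not be guessed.")
    return mime_type


print(guess_mime_type(Path("nier.jpg")))  # image/jpeg
```

A path without a recognizable extension raises the ValueError, so the user would still have to pass mime_type explicitly in that case.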

With this implementation, the abstractions aren't modified as much and the conversions would (probably) occur in some helper functions for each generator. This would allow for usage like so:

message = [ChatMessage.from_user(content=["Write me a poem about this image", ByteStream.from_file_path("nier.jpg")])]
generator = OpenAIChatGenerator(api_key = Secret.from_env_var("OPENAI_API_KEY"), model = "gpt-4o-mini")
output = generator.run(messages=message)

Is this what you had in mind or is there some other insight you could provide to help with an implementation?
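To make the shape of the converted payload concrete, here is a small self-contained sketch of the content parts that the conversion described above would produce for a text-plus-image message. The helper names `to_data_url`, `text_part`, and `image_part` are hypothetical, and the three bytes stand in for real image data.

```python
from base64 import b64encode
from typing import Any, Dict, List


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URL."""
    encoded = b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def text_part(text: str) -> Dict[str, Any]:
    return {"type": "text", "text": text}


def image_part(data: bytes, mime_type: str) -> Dict[str, Any]:
    return {"type": "image_url", "image_url": {"url": to_data_url(data, mime_type)}}


# Placeholder bytes instead of the contents of a real image file.
parts: List[Dict[str, Any]] = [
    text_part("Write me a poem about this image"),
    image_part(b"ABC", "image/jpeg"),
]
print(parts[1]["image_url"]["url"])  # data:image/jpeg;base64,QUJD
```

Factoring the encoding into one helper would also remove the duplicated base64 logic in the two ByteStream branches of the conversion function.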

joshdawson · 2024-11-06T14:08:01Z

Just wondering if this is any further along or if there are any new workarounds to feed images in to a chat generator?

Jchang4 · 2024-11-06T15:09:57Z

Yes, please make this a higher priority! This is a huge piece of ChatGPT functionality.

lbux · 2024-11-07T02:56:06Z

> Trying to understand how this implementation would work. Let's say that the ChatMessage class is modified to change content to be Union[str, ByteStream, List[Union[str, ByteStream]]]. Given this, we should be able to pass in a ByteStream to ChatMessage. This, I understand. However, my understanding is that you want to move the actual implementation to each generator?
>
> With this implementation, the abstractions aren't modified as much and the conversions would (probably) occur in some helper functions for each generator. This would allow for usage like so:
>
> message = [ChatMessage.from_user(content=["Write me a poem about this image", ByteStream.from_file_path("nier.jpg")])]
> generator = OpenAIChatGenerator(api_key = Secret.from_env_var("OPENAI_API_KEY"), model = "gpt-4o-mini")
> output = generator.run(messages=message)
>
> Is this what you had in mind or is there some other insight you could provide to help with an implementation?

Ollama just released vision support. If we stick to the ByteStream implementation I suggested, we can add support for it in the Ollama integration by doing something like this:

def _message_to_dict(self, message: ChatMessage) -> Dict[str, Union[str, List[str]]]:
    result = {"role": message.role.value}

    # Handle content field
    if isinstance(message.content, str):
        result["content"] = message.content
    elif isinstance(message.content, list):
        # Concatenate text in list and handle images
        text_content = []
        images = []
        for item in message.content:
            if isinstance(item, str):
                text_content.append(item)
            elif isinstance(item, ByteStream):
                base64_data = b64encode(item.data).decode("utf-8")
                images.append(base64_data)
        result["content"] = " ".join(text_content)
        if images:
            result["images"] = images
    elif isinstance(message.content, ByteStream):
        base64_data = b64encode(message.content.data).decode("utf-8")
        result["content"] = ""
        result["images"] = [base64_data]
    else:
        result["content"] = ""

    return result

Which would once again look the same as the OpenAI implementation when the user calls it.

message = [ChatMessage.from_system(content="Talk like a pirate"), ChatMessage.from_user(content=["Write me a poem about this image.", ByteStream.from_file_path("nier.jpg")])]
generator = OllamaChatGenerator(model="llama3.2-vision")
output = generator.run(messages=message)

LastRemote · 2024-11-11T06:46:38Z

Hi @silvanocerza , do you mind sharing more about the broader vision for ChatMessage? Besides the multimodal support discussed here, I've noticed some recent updates related to tool calls in the experimental repo. These have been around for a while, and I'm curious about whether transitioning to the new architecture is recommended. Additionally, could you outline the roadmap or overall strategy for rolling out these features into production?

LastRemote · 2024-11-20T08:55:45Z

I am planning to add multimodal support in haystack-experimental (it already has some advanced tool support there). I am opening an issue for this (basically my to-do list, for better visibility, since I am aware that some people are asking for multimodality): deepset-ai/haystack-experimental#135

tomarharsh changed the title ~~ChatMessage content only allows content as str but to pass image we need it to be list similar to how OpenAI allows it~~ ChatMessage content only allows content as str-only doesn't allow user to pass image Jun 12, 2024

tomarharsh changed the title ~~ChatMessage content only allows content as str-only doesn't allow user to pass image~~ ChatMessage content being str-only doesn't allow user to pass image Jun 12, 2024

CarlosFerLo mentioned this issue Jun 22, 2024

feat: Add support for multimodal on ChatMessages with ContentPart #7913

Closed

shadeMe added 2.x Related to Haystack v2.0 community-triage labels Jun 25, 2024

CarlosFerLo mentioned this issue Jun 26, 2024

feat: Multimodal ChatMessage #7943

Closed

github-actions bot added the stale label Jul 26, 2024

github-actions bot closed this as not planned Aug 5, 2024

silvanocerza reopened this Sep 24, 2024

github-actions bot removed the stale label Sep 25, 2024

LastRemote mentioned this issue Nov 20, 2024

Add multimodal support for the new ChatMessage class deepset-ai/haystack-experimental#135

Open

LastRemote mentioned this issue Dec 4, 2024

[DRAFT] feat: add multimodal support for ChatMessage deepset-ai/haystack-experimental#145

Open

github-actions bot added the stale label Dec 21, 2024
