diff --git a/docs/blog/posts/consistent-stories.md b/docs/blog/posts/consistent-stories.md
new file mode 100644
index 000000000..9963833c6
--- /dev/null
+++ b/docs/blog/posts/consistent-stories.md
@@ -0,0 +1,265 @@
+---
+authors:
+  - ivanleomk
+categories:
+  - OpenAI
+comments: true
+date: 2024-12-10
+description: Generating complex DAGs with gpt-4o
+draft: false
+tags:
+  - OpenAI
+  - DAGs
+---
+
+# Consistent Stories with GPT-4o
+
+Language models struggle to generate consistent graphs that have a large number of nodes. Often, this is because the graph itself is too large for the model to handle, so it produces inconsistent graphs with invalid and disconnected nodes, among other issues.
+
+In this article, we'll look at how to get around this limitation with a two-phase approach to generating complex DAGs with gpt-4o, using a simple example: a Choose Your Own Adventure story.
+
+## Why do DAGs matter?
+
+DAGs are directed acyclic graphs. A graph is a DAG when every connection between nodes is directed (it goes in a single direction) and there are no cycles (it never loops back to a previous node).
+
+```mermaid
+graph TD
+    A --> B
+    A --> C
+    B --> D
+    C --> D
+```
+
+This isn't too far away from a Choose Your Own Adventure story, where users have a fixed set of choices at each step and can only move forward in the story. We can see this in action below:
+
+```mermaid
+graph TD
+    A[Story Root] --> B[Choice 1]
+    A --> C[Choice 2]
+    A --> D[Choice 3]
+    B --> E[Choice 1.1]
+    B --> F[Choice 1.2]
+    C --> G[Choice 2.1]
+    C --> H[Choice 2.2]
+    D --> I[Choice 3.1]
+    D --> J[Choice 3.2]
+```
+
+## The Challenge: Scaling Story Generation
+
+When we try to use a language model to generate a story in a single run, we hit several limitations quickly: with just 4 choices at each step, we're already at 20 choice nodes by the second level. If users can only make 2 choices before our story ends, that doesn't make for a very interesting story to play with.
+
+In other words, we'll overflow the model's context window quickly. To get around this, we can use a two-phase approach: first generate an initial story setting, then expand the choices in parallel.
+
+## Parallel Story Generation
+
+### Generating an Outline
+
+First, we generate an outline of the story using gpt-4o. This is important because it gives us a starting setting, a visual style and an image description (for the banner image). We can then use these down the line to keep the images we generate as consistent as possible.
+
+```python
+import instructor
+from pydantic import BaseModel
+from typing import List
+
+class GeneratedStory(BaseModel):
+    setting: str
+    plot_summary: str
+    choices: List[str]
+    visual_style: str
+    image_description: str
+
+async def generate_story(
+    client: instructor.AsyncInstructor,
+    story_input: RestateStoryInput  # Pydantic model holding the user's `setting` and `title`
+):
+    resp = await client.chat.completions.create(
+        messages=[{
+            "role": "user",
+            "content": """
+            Generate a story with:
+            - Setting: {{ story_input.setting }}
+            - Title: {{ story_input.title }}
+
+            Rules:
+            - Generate 2-4 initial choices that represent actions
+            - Choices must move story forward
+            - Include brief setting description
+            - Generate a visual description for the story
+
+            Required Elements:
+            1. Plot Summary: A vivid description of the setting and plot
+            2. Initial Choices: 2-4 distinct actions the user can take
+            3. Visual Style: Description of art style, color palette
+            4. Image Description: One-sentence scene description
+            """
+        }],
+        model="gpt-4o",
+        response_model=GeneratedStory,
+        context={"story_input": story_input},
+    )
+    return resp
+```
+
+This outputs a story with a setting, plot summary, choices, visual style and image description.
+
+```bash
+# Example generated output
+{
+    "setting": "A neon-lit cyberpunk metropolis in 2150",
+    "plot_summary": "In the sprawling city of Neo-Tokyo...",
+    "choices": [
+        "Investigate the mysterious signal in the abandoned district",
+        "Meet your contact at the underground hacker hub",
+        "Follow the corporate executive who seems suspicious"
+    ],
+    "visual_style": "Vibrant neon colors, detailed cyberpunk architecture",
+    "image_description": "A towering cyberpunk cityscape at night with neon signs"
+}
+```
+
+### Parallel Choice Expansion
+
+One of the biggest challenges in generating deep story trees is maintaining consistency as the story branches grow.
+
+Here's how we solve this with parallel generation and state tracking:
+
+```mermaid
+graph TD
+    %% Main nodes
+    A[Find Door] --> B[Open Door]
+    A --> C[Walk Away]
+
+    B --> D[Read Book]
+    B --> E[Leave Room]
+
+    C --> F[Go Home]
+    C --> G[Wait Outside]
+
+    %% Styling for visual hierarchy
+    classDef start fill:#ff9999,stroke:#333,stroke-width:2px
+    classDef decision fill:#99ccff,stroke:#333,stroke-width:2px
+    classDef outcome fill:#99ffff,stroke:#333,stroke-width:1px
+
+    %% Apply styles
+    class A start
+    class B,C decision
+    class D,E,F,G outcome
+
+    %% Add tooltips for context
+    click B "Door context" "Open Door Context"
+    click C "Away context" "Walk Away Context"
+    click D "Door and Book context" "Read Book Context"
+```
+
+The key insight is that each path through the story tree has its own unique state. We track this with a simple accumulator that keeps hold of the previous choices and the story context.
+
+It's also worth noting that the model has the flexibility to end the story at any point.
+
+Here's how we implement this:
+
+```python
+import asyncio
+import instructor
+
+# RewrittenChoice and FinalStoryChoice are Pydantic models with
+# choice_description, choice_consequences and choices fields
+async def rewrite_choice(
+    client: instructor.AsyncInstructor,
+    choice: str,
+    story: GeneratedStory,
+    prev_choices: list[dict],  # Accumulator for path state
+    max_depth: int,
+    sem: asyncio.Semaphore
+) -> FinalStoryChoice:
+    # Each choice knows its entire path history
+    async with sem:
+        rewritten_choice = await client.chat.completions.create(
+            model="gpt-4o",
+            response_model=RewrittenChoice,
+            messages=[{
+                "role": "user",
+                "content": """
+                Given this choice: {{ choice }}
+
+                Story context:
+                Setting: {{ story.setting }}
+                Plot: {{ story.plot_summary }}
+
+                Previous choices made in this path:
+                {% for prev in prev_choices %}
+                - {{ prev.choice_description }}
+                  Result: {{ prev.choice_consequences }}
+                {% endfor %}
+
+                Generate the next story beat and 2-4 new choices.
+                The story should end in {{ max_depth - prev_choices|length }} more turns.
+                """
+            }],
+            context={
+                "choice": choice,
+                "story": story,
+                "prev_choices": prev_choices,
+            }
+        )
+
+    # For terminal nodes (at max depth)
+    if len(prev_choices) == max_depth - 1:
+        return FinalStoryChoice(
+            choice_description=rewritten_choice.choice_description,
+            choice_consequences=rewritten_choice.choice_consequences,
+            choices=[]  # Terminal node
+        )
+
+    # Recursively expand child choices
+    child_choices = await asyncio.gather(*[
+        rewrite_choice(
+            client=client,
+            choice=new_choice,
+            story=story,
+            prev_choices=prev_choices + [{
+                "choice_description": rewritten_choice.choice_description,
+                "choice_consequences": rewritten_choice.choice_consequences
+            }],
+            max_depth=max_depth,
+            sem=sem
+        )
+        for new_choice in rewritten_choice.choices
+    ])
+
+    return FinalStoryChoice(
+        choice_description=rewritten_choice.choice_description,
+        choice_consequences=rewritten_choice.choice_consequences,
+        choices=child_choices
+    )
+```
+
+This approach gives us several key benefits:
+
+1. **Path-Specific Context**: Each node maintains the complete history of choices that led to it, ensuring consistency within each branch
+2. **Parallel Generation**: Different branches can be generated simultaneously since they each maintain their own state
+3. **Controlled Growth**: The `max_depth` parameter prevents exponential expansion
+4. **Rate Limiting**: The semaphore controls concurrent API calls while allowing maximum parallelization
+
+The semaphore isn't just for rate limiting: it keeps the number of concurrent requests manageable, while each branch's accumulator keeps its state consistent.
+
+Each path through the story tree becomes a self-contained narrative with access to its complete history, allowing us to generate coherent stories far faster, and in far more detail, than a single call could.
+
+Additionally, we can generate stories that are much broader and deeper than a single call could produce.
+
+## Beyond Story Generation
+
+The success of this approach comes down to three key principles:
+
+1. **State Isolation**: Each node maintains only the context it needs, preventing context window overflow
+2. **Parallel Processing**: Generation can happen simultaneously across branches, dramatically reducing total generation time
+3. **Structured Validation**: Using Pydantic models ensures each generated component meets your requirements
+
+For example, generating a 20-node story tree sequentially at roughly 3 seconds per node takes about 60 seconds; with parallel generation and 10 concurrent requests, the wall-clock time is bounded by the depth of the tree rather than the total number of nodes, since sibling branches are expanded concurrently.
+
+This pattern is particularly valuable when:
+
+- Your generation tasks naturally form a tree or graph structure
+- Individual nodes need some but not all context from their ancestors
+- You need to generate content that exceeds a single context window
+- Speed of generation is important
+
+By combining structured outputs with parallel generation, you can reliably generate complex, interconnected content at scale while maintaining consistency and control. A minimal sketch of how the two phases fit together end-to-end is included at the end of this post.
+
+`instructor` makes it easy to generate complex data structures with language models - whether they're open-source models with Ollama or proprietary models from providers such as OpenAI. Give us a try today!
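+As a recap of how the two phases fit together, here's a minimal sketch of a driver that generates the outline first and then fans the initial choices out in parallel. `generate_full_story`, the default depth and the semaphore limit of 10 are illustrative choices rather than part of the implementation above; `generate_story`, `rewrite_choice` and `RestateStoryInput` are the pieces shown earlier.
+
+```python
+import asyncio
+import instructor
+import openai
+
+async def generate_full_story(story_input: RestateStoryInput, max_depth: int = 3):
+    client = instructor.from_openai(openai.AsyncOpenAI())
+    sem = asyncio.Semaphore(10)  # cap the number of concurrent API calls
+
+    # Phase 1: generate the outline (setting, plot summary, initial choices)
+    story = await generate_story(client, story_input)
+
+    # Phase 2: expand every initial choice in parallel, each with an empty accumulator
+    expanded_choices = await asyncio.gather(*[
+        rewrite_choice(
+            client=client,
+            choice=choice,
+            story=story,
+            prev_choices=[],
+            max_depth=max_depth,
+            sem=sem,
+        )
+        for choice in story.choices
+    ])
+    return story, expanded_choices
+```
+
+Running `asyncio.run(generate_full_story(...))` then returns the outline together with a fully expanded choice tree in a single await.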
diff --git a/docs/blog/posts/extracting-model-metadata.md b/docs/blog/posts/extracting-model-metadata.md
new file mode 100644
index 000000000..552347e45
--- /dev/null
+++ b/docs/blog/posts/extracting-model-metadata.md
@@ -0,0 +1,239 @@
+---
+title: "Extracting Metadata from Images using Structured Extraction"
+date: 2024-12-11
+description: Structured extraction makes working with images easy. In this post we'll see how to use it to extract metadata from images
+categories:
+  - OpenAI
+  - Multimodal
+authors:
+  - ivanleomk
+---
+
+Multimodal language models like gpt-4o excel at processing multimodal inputs, enabling us to extract rich, structured metadata from images.
+
+This is particularly valuable in areas like fashion, where we can use these capabilities to understand user style preferences from images and even videos. In this post, we'll see how to use instructor to map images to a given product taxonomy so we can recommend similar products for users.
+
+## Why Image Metadata is useful
+
+Most online e-commerce stores have a taxonomy of products that they sell. This is a way of categorizing products so that users can easily find what they're looking for.
+
+A small example of a taxonomy is shown below. You can think of this as a way of mapping a product to a set of attributes, with some common attributes that are shared across all products.
+
+```yaml
+tops:
+  t-shirts:
+    - crew_neck
+    - v_neck
+    - graphic_tees
+  sweaters:
+    - crewneck
+    - cardigan
+    - pullover
+  jackets:
+    - bomber_jackets
+    - denim_jackets
+    - leather_jackets
+
+bottoms:
+  pants:
+    - chinos
+    - dress_pants
+    - cargo_pants
+  shorts:
+    - athletic_shorts
+    - cargo_shorts
+
+colors:
+  - black
+  - navy
+  - white
+  - beige
+  - brown
+```
+
+By using this taxonomy, we can ensure that our model extracts metadata that is consistent with the products we sell. In this example, we'll analyze style photos from a fitness influencer to understand their fashion preferences and possibly see what products we can recommend to them from our own catalog.
+
+We're using some photos from a fitness influencer called [Jpgeez](https://www.instagram.com/jpgeez/), which you can see below.
+
+![](./img/style_1.png){: style="height:200px"} +![](./img/style_2.png){: style="height:200px"} +![](./img/style_3.png){: style="height:200px"} +![](./img/style_4.png){: style="height:200px"} +![](./img/style_5.png){: style="height:200px"} +![](./img/style_6.png){: style="height:200px"} +
+
+While we're mapping these visual elements over to a taxonomy, this is really applicable to any other use case where you want to extract metadata from images.
+
+## Extracting metadata from images
+
+### Instructor's `Image` class
+
+With instructor, working with multimodal data is easy. We can use the `Image` class to load images from a URL or local file. We can see this in action below.
+
+```python
+import os
+import instructor
+
+image_files = os.listdir("./images")
+
+# Load images using instructor.Image.from_path
+images = []
+for image_file in image_files:
+    image_path = os.path.join("./images", image_file)
+    image = instructor.Image.from_path(image_path)
+    images.append(image)
+```
+
+We provide a variety of different methods for loading images, including from a URL, a local file, and even a base64 encoded string, which you [can read about here](../../concepts/multimodal.md).
+
+### Defining a response model
+
+Since our taxonomy is defined in a YAML file, we can't hard-code the allowed values as literals in the response model. Instead, we read in the configuration and use it in a `model_validator` step to make sure that the metadata we extract is consistent with the taxonomy.
+
+First, we read in the taxonomy from the YAML file and create a set of categories, subcategories, and product types.
+
+```python
+import yaml
+
+with open("taxonomy.yml", "r") as file:
+    taxonomy = yaml.safe_load(file)
+
+colors = taxonomy["colors"]
+categories = set(taxonomy.keys())
+categories.remove("colors")
+
+subcategories = set()
+product_types = set()
+for category in categories:
+    for subcategory in taxonomy[category].keys():
+        subcategories.add(subcategory)
+        for product_type in taxonomy[category][subcategory]:
+            product_types.add(product_type)
+```
+
+Then we can use these in our `response_model` to make sure that the metadata we extract is consistent with the taxonomy.
+
+```python
+from pydantic import BaseModel, ValidationInfo, model_validator
+
+class PersonalStyle(BaseModel):
+    """
+    Ideally you map this to a specific taxonomy
+    """
+
+    categories: list[str]
+    subcategories: list[str]
+    product_types: list[str]
+    colors: list[str]
+
+    @model_validator(mode="after")
+    def validate_options(self, info: ValidationInfo):
+        context = info.context
+        colors = context["colors"]
+        categories = context["categories"]
+        subcategories = context["subcategories"]
+        product_types = context["product_types"]
+
+        # Validate colors
+        for color in self.colors:
+            if color not in colors:
+                raise ValueError(
+                    f"Color {color} is not in the taxonomy. Valid colors are {colors}"
+                )
+        for category in self.categories:
+            if category not in categories:
+                raise ValueError(
+                    f"Category {category} is not in the taxonomy. Valid categories are {categories}"
+                )
+
+        for subcategory in self.subcategories:
+            if subcategory not in subcategories:
+                raise ValueError(
+                    f"Subcategory {subcategory} is not in the taxonomy. Valid subcategories are {subcategories}"
+                )
+
+        for product_type in self.product_types:
+            if product_type not in product_types:
+                raise ValueError(
+                    f"Product type {product_type} is not in the taxonomy. Valid product types are {product_types}"
+                )
+
+        return self
+```
+
+### Making the API call
+
+Lastly, we can combine all of these into a single API call to `gpt-4o`, passing in the images along with our `PersonalStyle` model as the `response_model`.
+
+With instructor's built-in support for Jinja formatting via the `context` keyword (which exposes the same data we re-use in our validation), this becomes an incredibly easy step to execute.
+
+```python
+import openai
+import instructor
+
+client = instructor.from_openai(openai.OpenAI())
+
+resp = client.chat.completions.create(
+    model="gpt-4o",
+    messages=[
+        {
+            "role": "system",
+            "content": """
+You are a helpful assistant. You are given a list of images and you need to map the personal style of the person in the images to a given taxonomy.
+
+Here is the taxonomy that you should use:
+
+Colors:
+{% for color in colors %}
+* {{ color }}
+{% endfor %}
+
+Categories:
+{% for category in categories %}
+* {{ category }}
+{% endfor %}
+
+Subcategories:
+{% for subcategory in subcategories %}
+* {{ subcategory }}
+{% endfor %}
+
+Product types:
+{% for product_type in product_types %}
+* {{ product_type }}
+{% endfor %}
+""",
+        },
+        {
+            "role": "user",
+            "content": [
+                "Here are the images of the person. Describe their personal style from a first-person perspective (e.g. You are ...)",
+                *images,
+            ],
+        },
+    ],
+    response_model=PersonalStyle,
+    context={
+        "colors": colors,
+        "categories": list(categories),
+        "subcategories": list(subcategories),
+        "product_types": list(product_types),
+    },
+)
+```
+
+This returns the following response:
+
+```python
+PersonalStyle(
+    categories=['tops', 'bottoms'],
+    subcategories=['sweaters', 'jackets', 'pants'],
+    product_types=['cardigan', 'crewneck', 'denim_jackets', 'chinos'],
+    colors=['brown', 'beige', 'black', 'white', 'navy']
+)
+```
+
+## Looking Ahead
+
+The ability to extract structured metadata from images opens up exciting possibilities for personalization in e-commerce. The key is maintaining the bridge between unstructured visual inspiration and structured product data through well-defined taxonomies and robust validation.
+
+`instructor` makes working with multimodal data easy, and we're excited to see what you build with it. Give us a try today with `pip install instructor` and see how easy it is to work with language models using structured extraction.
diff --git a/docs/blog/posts/img/style_1.png b/docs/blog/posts/img/style_1.png
new file mode 100644
index 000000000..6ea80cd26
Binary files /dev/null and b/docs/blog/posts/img/style_1.png differ
diff --git a/docs/blog/posts/img/style_2.png b/docs/blog/posts/img/style_2.png
new file mode 100644
index 000000000..14c2c8491
Binary files /dev/null and b/docs/blog/posts/img/style_2.png differ
diff --git a/docs/blog/posts/img/style_3.png b/docs/blog/posts/img/style_3.png
new file mode 100644
index 000000000..7f899de89
Binary files /dev/null and b/docs/blog/posts/img/style_3.png differ
diff --git a/docs/blog/posts/img/style_4.png b/docs/blog/posts/img/style_4.png
new file mode 100644
index 000000000..aec4003fe
Binary files /dev/null and b/docs/blog/posts/img/style_4.png differ
diff --git a/docs/blog/posts/img/style_5.png b/docs/blog/posts/img/style_5.png
new file mode 100644
index 000000000..24750384b
Binary files /dev/null and b/docs/blog/posts/img/style_5.png differ
diff --git a/docs/blog/posts/img/style_6.png b/docs/blog/posts/img/style_6.png
new file mode 100644
index 000000000..9549ac3c9
Binary files /dev/null and b/docs/blog/posts/img/style_6.png differ
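As a follow-up to the recommendation idea in the second post, here is a minimal sketch of how the extracted `PersonalStyle` could be matched against a product catalog. The `Product` model, the sample catalog and the overlap-based scoring rule are all hypothetical illustrations; only `PersonalStyle` and the `resp` object come from the post above.

```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    product_type: str
    color: str

def recommend(style: PersonalStyle, catalog: list[Product], top_k: int = 3) -> list[Product]:
    # Score each product by how many of its attributes overlap with the extracted style
    def score(product: Product) -> int:
        return int(product.product_type in style.product_types) + int(product.color in style.colors)

    return sorted(catalog, key=score, reverse=True)[:top_k]

catalog = [
    Product(name="Wool crewneck", product_type="crewneck", color="beige"),
    Product(name="Slim chinos", product_type="chinos", color="navy"),
    Product(name="Cargo shorts", product_type="cargo_shorts", color="black"),
]
print(recommend(style=resp, catalog=catalog))
```

In practice you would likely replace the simple overlap count with a weighted score or an embedding-based similarity, but the validated, taxonomy-aligned fields make even this naive matching reliable.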