diff --git a/docs/blog/posts/google-openai-client.md b/docs/blog/posts/google-openai-client.md
index e510b301a..9e56fde94 100644
--- a/docs/blog/posts/google-openai-client.md
+++ b/docs/blog/posts/google-openai-client.md
@@ -22,6 +22,8 @@ If you're unfamiliar with instructor, we provide a simple interface to get struc
This makes it easy to switch between providers, get reliable outputs from language models, and ultimately build production-grade LLM applications.
+
+
## The current state
The new integration provides easy access to Gemini models through the OpenAI client. This means that using function calling with Gemini models has become much easier: we no longer need a Gemini-specific library like `vertexai` or `google.generativeai` to define response models.
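+
+Here's a minimal sketch of what this looks like. The base URL and model name below are assumptions based on Google's OpenAI-compatible Gemini endpoint; check the current Gemini docs for exact values.
+
+```python
+import instructor
+from openai import OpenAI
+from pydantic import BaseModel
+
+
+class User(BaseModel):
+    name: str
+    age: int
+
+
+# Assumed endpoint for Gemini's OpenAI-compatible API
+client = instructor.from_openai(
+    OpenAI(
+        api_key="YOUR_GEMINI_API_KEY",
+        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
+    )
+)
+
+user = client.chat.completions.create(
+    model="gemini-1.5-flash",  # assumed model name
+    response_model=User,
+    messages=[{"role": "user", "content": "Ivan is 28 years old"}],
+)
+print(user)
+# > User(name='Ivan', age=28)
+```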
diff --git a/docs/blog/posts/img/untidy_table.png b/docs/blog/posts/img/untidy_table.png
new file mode 100644
index 000000000..3e6b947be
Binary files /dev/null and b/docs/blog/posts/img/untidy_table.png differ
diff --git a/docs/blog/posts/introducing-structured-outputs-with-cerebras-inference.md b/docs/blog/posts/introducing-structured-outputs-with-cerebras-inference.md
index 0d0d169d1..a3b84b0a9 100644
--- a/docs/blog/posts/introducing-structured-outputs-with-cerebras-inference.md
+++ b/docs/blog/posts/introducing-structured-outputs-with-cerebras-inference.md
@@ -1,23 +1,24 @@
---
authors:
-- ivanleomk
-- sarahchieng
+ - ivanleomk
+ - sarahchieng
categories:
-- API Development
-- Pydantic
-- Performance Optimization
+ - API Development
+ - Pydantic
+ - Performance Optimization
comments: true
date: 2024-10-15
-description: Learn how to use Cerebras Inference for structured outputs, faster model
+description:
+ Learn how to use Cerebras Inference for structured outputs, faster model
inference, and seamless integration with Pydantic models.
draft: false
slug: introducing-structured-outputs-with-cerebras-inference
tags:
-- Cerebras Inference
-- Pydantic
-- API Integration
-- Fast Inference
-- Structured Outputs
+ - Cerebras Inference
+ - Pydantic
+ - API Integration
+ - Fast Inference
+ - Structured Outputs
---
# Introducing structured outputs with Cerebras Inference
@@ -32,6 +33,8 @@ Sign up for a Cerebras Inference API key here at [cloud.cerebras.ai](http://clou
To get guaranteed structured outputs with Cerebras Inference, you:
+
+
1. Create a new Instructor client with the `from_cerebras` method
2. Define a Pydantic model to pass into the `response_model` parameter
3. Get back a validated response exactly as you would expect
@@ -125,4 +128,4 @@ for person in resp:
# > Person(name='Jessica', age=26)
```
-And that’s it! We're excited to see what you build with Instructor and Cerebras! If you have any questions about Cerebras or need to get off the API key waitlist, please reach out to sarah.chieng@cerebras.net.
\ No newline at end of file
+And that’s it! We're excited to see what you build with Instructor and Cerebras! If you have any questions about Cerebras or need to get off the API key waitlist, please reach out to sarah.chieng@cerebras.net.
diff --git a/docs/blog/posts/llm-as-reranker.md b/docs/blog/posts/llm-as-reranker.md
index 823f84034..32b6a3c5c 100644
--- a/docs/blog/posts/llm-as-reranker.md
+++ b/docs/blog/posts/llm-as-reranker.md
@@ -1,19 +1,19 @@
---
authors:
-- jxnl
+ - jxnl
categories:
-- LLM
-- Pydantic
+ - LLM
+ - Pydantic
comments: true
date: 2024-10-23
description: Learn how to use Instructor and Pydantic to create an LLM-based reranker for improving search results relevance.
draft: false
tags:
-- LLM
-- Pydantic
-- Instructor
-- Search Relevance
-- Reranking
+ - LLM
+ - Pydantic
+ - Instructor
+ - Search Relevance
+ - Reranking
---
# Building an LLM-based Reranker for your RAG pipeline
@@ -30,6 +30,8 @@ In this blog post, we'll show you how to create an LLM-based reranker using Inst
By the end of this tutorial, you'll be able to implement an LLM reranker to label your synthetic data for fine-tuning a traditional reranker, or to build out an evaluation pipeline for your RAG system. Let's dive in!
+
+
## Setting Up the Environment
First, let's set up our environment with the necessary imports:
@@ -167,7 +169,7 @@ If you want to extend this example, you could use the `rerank_results` function
Moreover, we could also add validators to the `Label.chunk_id` field to ensure that the chunk_id is present in the `chunks` list. This might be useful if labels are `uuids` or complex strings and we want to ensure that the chunk_id is a valid index for the chunks list.
-heres an example
+here's an example:
```python
class Label(BaseModel):
@@ -184,4 +186,4 @@ class Label(BaseModel):
return v
```
-This will automatically check that the `chunk_id` is present in the `chunks` list and raise a `ValueError` if it is not, where `context` is the context dictionary that we passed into the `rerank_results` function.
\ No newline at end of file
+This will automatically check that the `chunk_id` is present in the `chunks` list and raise a `ValueError` if it is not. Here, `context` refers to the context dictionary that we passed into the `rerank_results` function.
diff --git a/docs/blog/posts/multimodal-gemini.md b/docs/blog/posts/multimodal-gemini.md
index eb7e38da8..df1f4a2c9 100644
--- a/docs/blog/posts/multimodal-gemini.md
+++ b/docs/blog/posts/multimodal-gemini.md
@@ -1,19 +1,19 @@
---
authors:
-- ivanleomk
+ - ivanleomk
categories:
-- Gemini
-- Multimodal
+ - Gemini
+ - Multimodal
comments: true
date: 2024-10-23
description: Learn how to use Google's Gemini model for multimodal structured extraction of YouTube videos, extracting structured recommendations for tourist destinations.
draft: false
tags:
-- Gemini
-- Multimodal AI
-- Travel Recommendations
-- Pydantic
-- Python
+ - Gemini
+ - Multimodal AI
+ - Travel Recommendations
+ - Pydantic
+ - Python
---
# Structured Outputs with Multimodal Gemini
@@ -30,6 +30,8 @@ import instructor
import google.generativeai as genai
```
+
+
## Defining Our Data Models
We'll use Pydantic to define our data models for tourist destinations and recommendations:
@@ -86,27 +88,27 @@ print(resp)
```python
Recomendations(
- chain_of_thought='The video recommends visiting Takayama city, in the Hida Region, Gifu Prefecture. The
- video suggests visiting the Miyagawa Morning Market, to try the Sarubobo good luck charms, and to enjoy the
- cookie cup espresso, made by Koma Coffee. Then, the video suggests visiting a traditional Japanese Cafe,
- called Kissako Katsure, and try their matcha and sweets. Afterwards, the video suggests to visit the Sanmachi
+ chain_of_thought='The video recommends visiting Takayama city, in the Hida Region, Gifu Prefecture. The
+ video suggests visiting the Miyagawa Morning Market, to try the Sarubobo good luck charms, and to enjoy the
+ cookie cup espresso, made by Koma Coffee. Then, the video suggests visiting a traditional Japanese Cafe,
+ called Kissako Katsure, and try their matcha and sweets. Afterwards, the video suggests to visit the Sanmachi
Historic District, where you can find local crafts and delicious foods. The video recommends trying Hida Wagyu
- beef, at the Kin no Kotte Ushi shop, or to have a sit-down meal at the Kitchen Hida. Finally, the video
+ beef, at the Kin no Kotte Ushi shop, or to have a sit-down meal at the Kitchen Hida. Finally, the video
recommends visiting Shirakawa-go, a World Heritage Site in Gifu Prefecture.',
- description='This video recommends a number of places to visit in Takayama city, in the Hida Region, Gifu
- Prefecture. It shows some of the local street food and highlights some of the unique shops and restaurants in
+ description='This video recommends a number of places to visit in Takayama city, in the Hida Region, Gifu
+ Prefecture. It shows some of the local street food and highlights some of the unique shops and restaurants in
the area.',
destinations=[
TouristDestination(
name='Takayama',
- description='Takayama is a city at the base of the Japan Alps, located in the Hida Region of
+ description='Takayama is a city at the base of the Japan Alps, located in the Hida Region of
Gifu.',
location='Hida Region, Gifu Prefecture'
),
TouristDestination(
name='Miyagawa Morning Market',
- description="The Miyagawa Morning Market, or the Miyagawa Asai-chi in Japanese, is a market that
- has existed officially since the Edo Period, more than 100 years ago. It's open every single day, rain or
+ description="The Miyagawa Morning Market, or the Miyagawa Asai-chi in Japanese, is a market that
+ has existed officially since the Edo Period, more than 100 years ago. It's open every single day, rain or
shine, from 7am to noon.",
location='Hida Takayama'
),
@@ -117,19 +119,19 @@ print(resp)
),
TouristDestination(
name='Koma Coffee',
- description="Koma Coffee is a shop that has been in business for about 50 or 60 years, and they
+ description="Koma Coffee is a shop that has been in business for about 50 or 60 years, and they
serve coffee in a cookie cup. They've been serving coffee for about 10 years.",
location='Hida Takayama'
),
TouristDestination(
name='Kissako Katsure',
- description='Kissako Katsure is a traditional Japanese style cafe, called Kissako, and the name
+ description='Kissako Katsure is a traditional Japanese style cafe, called Kissako, and the name
means would you like to have some tea. They have a variety of teas and sweets.',
location='Hida Takayama'
),
TouristDestination(
name='Sanmachi Historic District',
- description='Sanmachi Dori is a Historic Merchant District in Takayama, all of the buildings here
+ description='Sanmachi Dori is a Historic Merchant District in Takayama, all of the buildings here
have been preserved to look as they did in the Edo Period.',
location='Hida Takayama'
),
@@ -146,7 +148,7 @@ print(resp)
),
TouristDestination(
name='Kin no Kotte Ushi',
- description='Kin no Kotte Ushi is a shop known for selling Beef Sushi, especially Hida Wagyu Beef
+ description='Kin no Kotte Ushi is a shop known for selling Beef Sushi, especially Hida Wagyu Beef
Sushi. Their sushi is medium rare.',
location='Hida Takayama'
),
@@ -202,6 +204,7 @@ To address these limitations and expand the capabilities of our video analysis s
2. **Speaker Diarization**: Implement speaker recognition to attribute recommendations to specific individuals. This could be particularly useful for videos featuring multiple hosts or interviewees.
3. **Segment-based Analysis**: Process longer videos in segments to maintain accuracy and capture all relevant information. This approach could involve:
+
- Splitting the video into smaller chunks
- Analyzing each chunk separately
- Aggregating and deduplicating results
diff --git a/docs/blog/posts/openai-multimodal.md b/docs/blog/posts/openai-multimodal.md
index 49f3706b3..faac8b0a8 100644
--- a/docs/blog/posts/openai-multimodal.md
+++ b/docs/blog/posts/openai-multimodal.md
@@ -1,24 +1,26 @@
---
authors:
-- jxnl
+ - jxnl
categories:
-- OpenAI
-- Audio
+ - OpenAI
+ - Audio
comments: true
date: 2024-10-17
description: Explore the new audio capabilities in OpenAI's Chat Completions API using the gpt-4o-audio-preview model.
draft: false
tags:
-- OpenAI
-- Audio Processing
-- API
-- Machine Learning
+ - OpenAI
+ - Audio Processing
+ - API
+ - Machine Learning
---
# Audio Support in OpenAI's Chat Completions API
OpenAI has recently introduced audio support in their Chat Completions API, opening up exciting new possibilities for developers working with audio and text interactions. This feature is powered by the new `gpt-4o-audio-preview` model, which brings advanced voice capabilities to the familiar Chat Completions API interface.
+
+
## Key Features
The new audio support in the Chat Completions API offers several compelling features:
diff --git a/docs/blog/posts/pairwise-llm-judge.md b/docs/blog/posts/pairwise-llm-judge.md
index 7be2dbf43..d5e084dcf 100644
--- a/docs/blog/posts/pairwise-llm-judge.md
+++ b/docs/blog/posts/pairwise-llm-judge.md
@@ -1,19 +1,19 @@
---
authors:
-- jxnl
+ - jxnl
categories:
-- LLM
-- Pydantic
+ - LLM
+ - Pydantic
comments: true
date: 2024-10-17
description: Explore how to use Instructor and Pydantic to create a pairwise LLM judge for evaluating text relevance.
draft: false
tags:
-- LLM
-- Pydantic
-- Instructor
-- Text Relevance
-- AI Evaluation
+ - LLM
+ - Pydantic
+ - Instructor
+ - Text Relevance
+ - AI Evaluation
---
# Building a Pairwise LLM Judge with Instructor and Pydantic
@@ -24,6 +24,8 @@ In this blog post, we'll explore how to create a pairwise LLM judge using Instru
Evaluating text relevance is a common task in natural language processing and information retrieval. By leveraging large language models (LLMs) and structured outputs, we can create a system that judges the similarity or relevance between a question and a given text.
+
+
## Setting Up the Environment
First, let's set up our environment with the necessary imports:
diff --git a/docs/blog/posts/parea.md b/docs/blog/posts/parea.md
index 3a231db8f..6cca9bf87 100644
--- a/docs/blog/posts/parea.md
+++ b/docs/blog/posts/parea.md
@@ -1,20 +1,21 @@
---
authors:
-- jxnl
-- joschkabraun
+ - jxnl
+ - joschkabraun
categories:
-- LLM Observability
+ - LLM Observability
comments: true
date: 2024-07-17
-description: Explore how Parea enhances the OpenAI instructor, enabling better monitoring,
+description:
+ Explore how Parea enhances the OpenAI instructor, enabling better monitoring,
collaboration, and error tracking for LLM applications.
draft: false
tags:
-- Parea
-- OpenAI
-- LLM
-- instructor
-- validation
+ - Parea
+ - OpenAI
+ - LLM
+ - instructor
+ - validation
---
# Parea for Observing, Testing & Fine-tuning of Instructor
@@ -29,10 +30,11 @@ tags:
Before starting this tutorial, make sure that you've registered for a [Parea](https://www.parea.ai) account. You'll also need to create an [API key](https://docs.parea.ai/api-reference/authentication).
-
## Example: Writing Emails with URLs from Instructor Docs
-We will demonstrate Parea by using `instructor` to write emails which only contain URLs from the `instructor` docs. We'll need to install our dependencies before proceeding so simply run the command below.
+We will demonstrate Parea by using `instructor` to write emails which only contain URLs from the `instructor` docs. We'll need to install our dependencies before proceeding, so simply run the command below.
+
+
```bash
pip install -U parea-ai instructor
@@ -133,13 +135,11 @@ To take a look at trace of this execution checkout the screenshot below. Noticea
![](./img/parea/trace.png)
-
Above we can see that while the email was successfully created, there was a validation error, which meant that additional cost & latency were introduced by the initially failed validation.
Below we can see a visualization of the average validation error count for our instructor usage over time.
![](./img/parea/validation-error-chart.png)
-
## Label Responses for Fine-Tuning
Sometimes you may want to let subject-matter experts (SMEs) label responses to use them for fine-tuning. Parea provides a way to do this via an annotation queue. Editing raw JSON objects to correct tool use & function calling responses can be error-prone, especially for non-developers. For that purpose, Parea has a so-called [Form Mode](https://docs.parea.ai/manual-review/overview#labeling-function-calling-tool-use-responses) which allows the user to safely fill out a form instead of editing the JSON object. The labeled data can then be exported and used for fine-tuning.
@@ -152,9 +152,9 @@ Sometimes you may want to let subject-matter experts (SMEs) label responses to u
```python hl_lines="5 6"
from parea import Parea
-
+
p = Parea(api_key=os.getenv("PAREA_API_KEY"))
-
+
dataset = p.get_collection(DATASET_ID) #(1)!
dataset.write_to_finetune_jsonl("finetune.jsonl") #(2)!
```
@@ -166,4 +166,4 @@ Sometimes you may want to let subject-matter experts (SMEs) label responses to u
```bash
instructor jobs create-from-file finetune.jsonl
- ```
\ No newline at end of file
+ ```
diff --git a/docs/blog/posts/rag-timelines.md b/docs/blog/posts/rag-timelines.md
index ba8eb0539..26311e7ea 100644
--- a/docs/blog/posts/rag-timelines.md
+++ b/docs/blog/posts/rag-timelines.md
@@ -1,19 +1,20 @@
---
authors:
-- jxnl
+ - jxnl
categories:
-- LLM Techniques
+ - LLM Techniques
comments: true
date: 2024-06-06
-description: Explore enhancing RAG systems with time filters using Instructor and
+description:
+ Explore enhancing RAG systems with time filters using Instructor and
Pydantic for accurate, relevant data retrieval.
draft: false
tags:
-- RAG
-- Time Filters
-- Pydantic
-- Instructor
-- LLM Techniques
+ - RAG
+ - Time Filters
+ - Pydantic
+ - Instructor
+ - LLM Techniques
---
# Enhancing RAG with Time Filters Using Instructor
@@ -22,6 +23,8 @@ Retrieval-augmented generation (RAG) systems often need to handle queries with t
Instructor is a Python library that simplifies integrating large language models (LLMs) with data sources and APIs. It allows defining structured output models using Pydantic, which can be used as prompts or to parse LLM outputs.
+
+
## Modeling Time Filters
To handle time filters, we can define a Pydantic model representing a time range:
diff --git a/docs/blog/posts/situate-context.md b/docs/blog/posts/situate-context.md
index 1f0e5d47b..a3723a349 100644
--- a/docs/blog/posts/situate-context.md
+++ b/docs/blog/posts/situate-context.md
@@ -1,28 +1,30 @@
---
authors:
-- jxnl
+ - jxnl
categories:
-- Anthropic
-- LLM Techniques
-- Python
+ - Anthropic
+ - LLM Techniques
+ - Python
comments: true
date: 2024-09-26
-description: Learn to implement Anthropic's Contextual Retrieval with async processing
+description:
+ Learn to implement Anthropic's Contextual Retrieval with async processing
to enhance RAG systems and preserve crucial context efficiently.
draft: false
tags:
-- Contextual Retrieval
-- Async Processing
-- RAG Systems
-- Performance Optimization
-- Document Chunking
+ - Contextual Retrieval
+ - Async Processing
+ - RAG Systems
+ - Performance Optimization
+ - Document Chunking
---
# Implementing Anthropic's Contextual Retrieval with Async Processing
-Anthropic's [Contextual Retrieval](https://www.anthropic.com/blog/contextual-retrieval-for-rag) technique enhances RAG systems by preserving crucial context.
+Anthropic's [Contextual Retrieval](https://www.anthropic.com/blog/contextual-retrieval-for-rag) technique enhances RAG systems by preserving crucial context.
+
+This post examines the method and demonstrates an efficient implementation using async processing. We'll explore how to optimize your RAG applications with this approach, building on concepts from our [async processing guide](./learn-async.md).
-This post examines the method and demonstrates an efficient implementation using async processing. We'll explore how to optimize your RAG applications with this approach, building on concepts from our [async processing guide](./learn-async.md).
## Background: The Context Problem in RAG
@@ -48,14 +50,14 @@ contextualized_chunk = "This chunk is from an SEC filing on ACME corp's performa
Anthropic uses Claude to generate context. They provide this prompt:
```
-
-{{WHOLE_DOCUMENT}}
-
-Here is the chunk we want to situate within the whole document
-
-{{CHUNK_CONTENT}}
-
-Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
+
+{{WHOLE_DOCUMENT}}
+
+Here is the chunk we want to situate within the whole document
+
+{{CHUNK_CONTENT}}
+
+Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
```
## Performance Improvements
@@ -181,4 +183,4 @@ Based on Anthropic's suggestions:
4. Experiment with different embedding models and prompts.
5. Implement a reranking step for further performance improvements.
-This implementation provides a starting point for leveraging Anthropic's Contextual Retrieval technique with the added efficiency of async processing.
\ No newline at end of file
+This implementation provides a starting point for leveraging Anthropic's Contextual Retrieval technique with the added efficiency of async processing.
diff --git a/docs/blog/posts/structured-output-anthropic.md b/docs/blog/posts/structured-output-anthropic.md
index 39b1254e4..0df11224e 100644
--- a/docs/blog/posts/structured-output-anthropic.md
+++ b/docs/blog/posts/structured-output-anthropic.md
@@ -1,19 +1,19 @@
---
authors:
-- jxnl
+ - jxnl
categories:
-- Anthropic
+ - Anthropic
comments: true
date: 2024-10-23
description: Learn how to leverage Anthropic's Claude with Instructor for structured outputs and prompt caching, enhancing AI application development.
draft: false
tags:
-- Anthropic
-- API Development
-- Pydantic
-- Python
-- LLM Techniques
-- Prompt Caching
+ - Anthropic
+ - API Development
+ - Pydantic
+ - Python
+ - LLM Techniques
+ - Prompt Caching
---
# Structured Outputs and Prompt Caching with Anthropic
@@ -24,6 +24,8 @@ Anthropic's ecosystem now offers two powerful features for AI developers: struct
Instructor now offers seamless integration with Anthropic's powerful language models, allowing developers to easily create structured outputs using Pydantic models. This integration simplifies the process of extracting specific information from AI-generated responses.
+
+
To get started, you'll need to install Instructor with Anthropic support:
```bash
@@ -134,4 +136,4 @@ In this example, the large context (the book content) is cached after the first
By combining Anthropic's Claude with Instructor's structured output capabilities and leveraging prompt caching, developers can create more efficient, cost-effective, and powerful AI applications. These features open up new possibilities for building sophisticated AI systems that can handle complex tasks with ease.
-As the AI landscape continues to evolve, staying up-to-date with the latest tools and techniques is crucial. We encourage you to explore these features and share your experiences with the community. Happy coding!
\ No newline at end of file
+As the AI landscape continues to evolve, staying up-to-date with the latest tools and techniques is crucial. We encourage you to explore these features and share your experiences with the community. Happy coding!
diff --git a/docs/blog/posts/tidy-data-from-messy-tables.md b/docs/blog/posts/tidy-data-from-messy-tables.md
new file mode 100644
index 000000000..78e0bdda1
--- /dev/null
+++ b/docs/blog/posts/tidy-data-from-messy-tables.md
@@ -0,0 +1,142 @@
+---
+title: Using Structured Outputs to convert messy tables into tidy data
+description: With instructor, converting messy tables into tidy data is easy and fast
+categories:
+ - Data Analysis
+ - Structured Outputs
+date: 2024-11-21
+draft: false
+---
+
+# Using Structured Outputs to convert messy tables into tidy data
+
+## Why is this a problem?
+
+Messy data exports are a common problem. Whether the issue is multiple header rows, implicit relationships that make analysis a pain, or merged cells, using `instructor` with structured outputs makes it easy to convert messy tables into tidy data, even if all you have is an image of the table, as we'll see below.
+
+Let's look at the following table as an example. It makes analysis unnecessarily difficult because it hides data relationships behind empty cells and implicit repetition, and cleaning it manually would be a huge nightmare.
+
+
+
+![](./img/untidy_table.png)
+
+For example, the subject ID (321) and GTT date only appear in the first row, with blank cells below implying these values apply to the following rows. This format breaks most pandas operations - you can't simply group by subject ID or merge with other datasets without complex preprocessing to fill in these missing values.
+
+On top of that, time series measurements are spread across multiple rows, the insulin column mixes data types (numbers and "lo off curve"), and repeated subject information is hidden in empty cells. Even simple operations like calculating mean glucose levels by time point or plotting glucose curves therefore require data reshaping and careful handling of missing and special values.
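+
+To make that concrete, here's the kind of preprocessing this table would otherwise force on us. This is a sketch; `untidy_table.csv` is a hypothetical CSV export of the table above.
+
+```python
+import pandas as pd
+
+raw = pd.read_csv("untidy_table.csv")  # hypothetical export of the table above
+
+# Blank cells implicitly mean "same as the row above" for the
+# subject-level columns, so forward-fill them before any groupby or merge.
+subject_cols = ["ID", "GTT date", "GTT weight"]
+raw[subject_cols] = raw[subject_cols].ffill()
+
+# The insulin column mixes numbers with notes like "lo off curve",
+# so coerce it to numeric and treat the notes as missing values.
+raw["insulin ng/ml"] = pd.to_numeric(raw["insulin ng/ml"], errors="coerce")
+```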
+
+## Using Structured Outputs
+
+### Defining a custom type
+
+Using tools like instructor to automatically convert untidy data into tidy format can save hours of preprocessing and reduce errors in your analysis pipeline.
+
+Let's start by defining a custom type that can parse a markdown table into a pandas dataframe.
+
+```python
+from io import StringIO
+from typing import Annotated, Any
+import pandas as pd
+from pydantic import BeforeValidator, PlainSerializer, InstanceOf, WithJsonSchema
+
+
+def md_to_df(data: Any) -> Any:
+    # Parse a markdown table string into a DataFrame; pass anything else through
+    if isinstance(data, str):
+        return (
+            pd.read_csv(
+                StringIO(data),
+                sep="|",
+                index_col=1,  # the first real column becomes the index
+            )
+            .dropna(axis=1, how="all")  # drop empty edge columns created by the outer pipes
+            .iloc[1:]  # drop the markdown separator row (| --- | --- |)
+            .applymap(lambda x: x.strip())  # strip whitespace padding from every cell
+        )
+    return data
+
+
+MarkdownDataFrame = Annotated[
+    InstanceOf[pd.DataFrame],
+    BeforeValidator(md_to_df),  # parse the markdown string before validation
+    PlainSerializer(lambda df: df.to_markdown()),  # serialize back to markdown
+    WithJsonSchema(
+        {
+            "type": "string",
+            "description": "The markdown representation of the table, each one should be tidy, do not try to join tables that should be separate",
+        }
+    ),
+]
+```
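+
+As a quick sanity check, we can exercise the parser directly on a small markdown snippet (a sketch; note that `md_to_df` keeps header padding and string dtypes as-is):
+
+```python
+sample_md = """| time | glucose mg/dl |
+| --- | --- |
+| 0 | 99.2 |
+| 5 | 349.3 |"""
+
+# Prints a small dataframe indexed by the time column
+print(md_to_df(sample_md))
+```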
+
+### Extracting the table
+
+Then, with this new custom type, we can simply pass the image to the LLM and get a tidy dataframe back in response.
+
+```python
+import instructor
+from pydantic import BaseModel
+from openai import OpenAI
+
+class Table(BaseModel):
+ caption: str
+ dataframe: MarkdownDataFrame # Custom type for handling tables
+
+class TidyTables(BaseModel):
+ tables: list[Table]
+
+# Patch the OpenAI client with instructor
+client = instructor.from_openai(OpenAI())
+
+def extract_table(image_path: str) -> TidyTables:
+ return client.chat.completions.create(
+ model="gpt-4o-mini",
+ messages=[{
+ "role": "user",
+ "content": [
+ "Convert this untidy table to tidy format",
+ instructor.Image.from_path(image_path)
+ ]
+ }],
+ response_model=TidyTables
+ )
+
+extracted_tables = extract_table("./untidy_table.png")
+```
+
+This returns the following output as a single pandas dataframe, which we can easily plot and analyze, as the sketch after the table shows.
+
+| ID | GTT date | GTT weight | time | glucose mg/dl | insulin ng/ml | Comment |
+| --- | -------- | ---------- | ---- | ------------- | ------------- | ------------ |
+| 321 | 2/9/15 | 24.5 | 0 | 99.2 | | lo off curve |
+| 321 | 2/9/15 | 24.5 | 5 | 349.3 | 0.205 | |
+| 321 | 2/9/15 | 24.5 | 15 | 286.1 | 0.129 | |
+| 321 | 2/9/15 | 24.5 | 30 | 312 | 0.175 | |
+| 321 | 2/9/15 | 24.5 | 60 | 99.9 | 0.122 | |
+| 321 | 2/9/15 | 24.5 | 120 | 217.9 | | lo off curve |
+| 322 | 2/9/15 | 18.9 | 0 | 185.8 | 0.251 | |
+| 322 | 2/9/15 | 18.9 | 5 | 297.4 | 2.228 | |
+| 322 | 2/9/15 | 18.9 | 15 | 439 | 2.078 | |
+| 322 | 2/9/15 | 18.9 | 30 | 362.3 | 0.775 | |
+| 322 | 2/9/15 | 18.9 | 60 | 232.7 | 0.5 | |
+| 322 | 2/9/15 | 18.9 | 120 | 260.7 | 0.523 | |
+| 323 | 2/9/15 | 24.7 | 0 | 198.5 | 0.151 | |
+| 323 | 2/9/15 | 24.7 | 5 | 530.6 | | off curve lo |
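+
+Simple analyses that were painful before are now one-liners. For example, mean glucose by time point (a sketch; it assumes the column names shown above, and coerces the string cells to numbers first):
+
+```python
+df = extracted_tables.tables[0].dataframe
+df["glucose mg/dl"] = pd.to_numeric(df["glucose mg/dl"])
+
+# Mean glucose at each GTT time point across subjects
+print(df.groupby("time")["glucose mg/dl"].mean())
+```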
+
+More importantly, we can also extract multiple tables from a single image, which helps segment and identify different sections of a messy report. With tidy data, we get the benefits of:
+
+1. Each variable being its own column
+2. Each observation being its own row
+3. Each value having its own cell
+4. Seamlessly working with pandas/numpy operations
+5. Visualization libraries "just working"
+
+## Conclusion
+
+We can actually go one step further and make this even tidier by melting measurement columns like weight, glucose, and insulin into a single `metric` column. That would let us add arbitrary metrics to the table without changing the schema or our plotting code, which is a huge productivity boost when doing complex data analysis.
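+
+Here's a sketch of that extra step with pandas, assuming a dataframe with the columns shown in the table above:
+
+```python
+df = extracted_tables.tables[0].dataframe
+
+# Melt the measurement columns into (metric, value) pairs so that new
+# metrics never require schema or plotting changes.
+long_df = df.melt(
+    id_vars=["ID", "GTT date", "GTT weight", "time", "Comment"],
+    value_vars=["glucose mg/dl", "insulin ng/ml"],
+    var_name="metric",
+    value_name="value",
+)
+```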
+
+No more wrestling with complex data cleaning pipelines. Let the model handle the heavy lifting while you focus on analysis. With instructor, getting to that step just became a whole lot easier.
+
+Give `instructor` a try today and see how easily you can build reliable applications. Just run `pip install instructor` or check out our [Getting Started Guide](../../index.md).
diff --git a/docs/blog/posts/using_json.md b/docs/blog/posts/using_json.md
index 2e974ac84..05bd4017c 100644
--- a/docs/blog/posts/using_json.md
+++ b/docs/blog/posts/using_json.md
@@ -1,20 +1,21 @@
---
authors:
-- jxnl
+ - jxnl
categories:
-- LLM Techniques
+ - LLM Techniques
comments: true
date: 2024-06-15
-description: Learn how to easily get structured JSON data from LLMs using the Instructor
+description:
+ Learn how to easily get structured JSON data from LLMs using the Instructor
library with Pydantic models in Python.
draft: false
slug: zero-cost-abstractions
tags:
-- Instructor
-- JSON
-- LLM
-- Pydantic
-- Python
+ - Instructor
+ - JSON
+ - LLM
+ - Pydantic
+ - Python
---
# Why Instructor is the best way to get JSON from LLMs
@@ -32,6 +33,7 @@ It stands out for its simplicity, transparency, and user-centric design, built o
- [Go](https://go.useinstructor.com)
- [Elixir](https://hex.pm/packages/instructor)
+
## The Simple Patch for JSON LLM Outputs