3 changes: 1 addition & 2 deletions README.md
@@ -205,8 +205,7 @@ Once you've got the basics working, there's more:
```python
extractor.optimize(
texts=your_examples,
-    expected_results=expected_outputs,
-    num_trials=50
+    expected_results=expected_outputs
)
```

2 changes: 1 addition & 1 deletion docs/src/content/docs/examples/legal-contracts.mdx
@@ -66,11 +66,11 @@ Create an extractor for legal document analysis:
extractor = LangStruct(
schema=LegalContractSchema,
model="gemini/gemini-2.5-flash-lite", # Fast and reliable for legal analysis
-    optimize=True,
use_sources=True, # Critical for legal document traceability
temperature=0.1, # Lower temperature for consistency
max_retries=3 # Ensure reliability
)
+# Later: extractor.optimize(training_texts, expected_results)

# Example contract text
contract_text = """
2 changes: 1 addition & 1 deletion docs/src/content/docs/examples/scientific-papers.mdx
@@ -76,11 +76,11 @@ Create an extractor for research paper analysis:
extractor = LangStruct(
schema=ScientificPaperSchema,
model="gemini/gemini-2.5-flash-lite", # Fast and reliable for academic content
-    optimize=True,
use_sources=True, # Track where information was found
temperature=0.2, # Slightly higher for nuanced interpretation
max_retries=3
)
+# Later: extractor.optimize(training_texts, expected_results)

# Example research paper text (excerpt)
paper_text = """
48 changes: 29 additions & 19 deletions docs/src/content/docs/optimization.mdx
@@ -9,28 +9,28 @@ Make your extraction more accurate with automatic optimization. LangStruct learn

## The Easy Way

-**Enable optimization (configure optimizer) and then optimize with your data:**
+**Create an extractor (optionally choose the optimizer) and call `optimize()` when you're ready:**

```python
from langstruct import LangStruct

# Create extractor with optimization enabled
extractor = LangStruct(
example={
"name": "Dr. Sarah Johnson",
"age": 34,
"occupation": "data scientist"
},
-    optimize=True  # sets up optimizer; run .optimize(...) to train
+    optimizer="miprov2",  # default optimizer
)

# Later, once you have training data:
# extractor.optimize(texts=training_texts, expected_results=good_results)
```

-**Default behavior (faster startup, good baseline accuracy):**
+**Quick experiments (skip optimization entirely):**

```python
-# No optimization - good for quick experiments
extractor = LangStruct(example={"name": "John", "age": 30})
+# optimize=False by default - enables faster startup
```

## When You Have Training Data
@@ -74,8 +74,19 @@ Optimization can significantly improve accuracy on real-world tasks:

## Persisting Results

-Saving/loading an optimized extractor is not yet implemented.
-For now, re-run `optimize()` when you start up, or persist your training data and configuration.
+Save and load optimized extractors to reuse them without re-running optimization:
+
+```python
+# Save after optimization
+extractor.save("./my_extractor")
+
+# Load later
+from langstruct import LangStruct
+loaded = LangStruct.load("./my_extractor")
+
+# Use immediately - optimization is preserved
+result = loaded.extract("new text")
+```

## Advanced (If You Need It)

@@ -86,7 +97,6 @@ Most users don't need this, but if you want more control:
extractor.optimize(
texts=training_texts,
expected_results=good_results,
-    num_trials=50,  # More trials = better results (takes longer)
validation_split=0.3 # Use 30% for testing improvements
)
```
@@ -110,26 +120,26 @@

## Common Questions

-**Q: Do I always need training data?**
-A: No! Optimization can work without training data, but providing examples improves results significantly.
+**Q: Do I always need training data?**
+A: You need example texts, but not necessarily expected outputs. If you don't provide `expected_results`, LangStruct uses the LLM's confidence ratings to optimize. Providing expected outputs significantly improves accuracy.

**Q: How long does optimization take?**
A: Usually 1-5 minutes for typical datasets (10-100 examples).

-**Q: Can I optimize an already optimized extractor?**
-A: Yes! You can keep optimizing with new data as you get it.
+**Q: Can I optimize an already optimized extractor?**
+A: Yes, you can continue optimizing with new data as you collect it.

-**Q: Will this make my extractions slower?**
-A: No - optimization happens once during training. Production extraction speed is the same.
+**Q: Will this make my extractions slower?**
+A: No - optimization happens once during training. Production extraction speed is unchanged.

-**Q: What happens when I switch models?**
-A: Just change the model and re-optimize! Same training data, same accuracy - zero prompt rewriting needed.
+**Q: What happens when I switch models?**
+A: Change the model and re-optimize with the same training data. No prompt rewriting needed.
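
A minimal sketch tying these answers together, assuming the `optimize()`, `save()`, and confidence-fallback behavior described on this page (the example texts, labels, and path below are hypothetical):

```python
from langstruct import LangStruct

# Unlabeled optimization: with no expected_results, LangStruct falls back
# to the LLM's confidence ratings as the optimization signal (see Q&A above).
extractor = LangStruct(example={"name": "John", "age": 30})
extractor.optimize(texts=["Dr. Sarah Johnson, 34, is a data scientist."])

# Continue optimizing the same extractor later with labeled data.
extractor.optimize(
    texts=["Mr. Alan Reed, 52, works as a civil engineer."],
    expected_results=[{"name": "Alan Reed", "age": 52}],
)

# Persist the optimized state for reuse (path is hypothetical).
extractor.save("./optimized_extractor")
```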

## Next Steps

<CardGrid>
<Card title="Try It Now" icon="laptop">
-    Create a LangStruct extractor and enable optimization when you need accuracy!
+    Create a LangStruct extractor and enable optimization when you need accuracy.
</Card>
<Card title="Source Grounding" icon="document">
[Track where information comes from](/source-grounding/)
8 changes: 3 additions & 5 deletions docs/src/content/docs/persistence.mdx
@@ -46,10 +46,9 @@ print(result.entities)
```python
from langstruct import LangStruct

-# Create extractor with optimization
+# Create extractor
extractor = LangStruct(
example={"name": "John", "age": 30, "role": "engineer"},
-    optimize=True
)

# Train the extractor
@@ -58,8 +57,7 @@ expected_results = [{"name": "Expected outputs..."}]

extractor.optimize(
texts=training_texts,
-    expected_results=expected_results,
-    num_trials=50
+    expected_results=expected_results
)

# Save optimized state
@@ -215,7 +213,7 @@ Common error scenarios:

```python
# Development: Train and save
-extractor = LangStruct(schema=MySchema, optimize=True)
+extractor = LangStruct(schema=MySchema)
extractor.optimize(training_data, expected_results)
extractor.save("./production_extractor")

49 changes: 24 additions & 25 deletions docs/src/content/docs/query-parsing.mdx
@@ -44,22 +44,22 @@ This single query contains **three distinct types of information**:
- Quarter: Q3 2024 (exact match)
- Revenue: > $100B (numeric comparison)
- Sector: Technology (category match)

These need **database-style filtering**, not semantic search
</Card>
<Card title="Semantic Content" icon="magnifier">
**Conceptual topics for similarity search:**
- "financial reports" (could be 10-K, earnings, statements)
- "AI investments" (could be ML, artificial intelligence, neural networks)

These need **embedding-based semantic search**
</Card>
<Card title="Implicit Context" icon="information">
**Assumed context from natural language:**
- "Show me" implies retrieval intent
- "companies" implies corporate entities
- Plural suggests multiple results expected

These provide **query understanding context**
</Card>
</CardGrid>
@@ -86,30 +86,30 @@ results = vector_db.similarity_search(query_embedding)
<Tabs>
<TabItem label="Semantic Terms">
**What they are:** Conceptual topics that benefit from semantic understanding

**Examples:**
- "artificial intelligence" ≈ "AI" ≈ "machine learning"
- "financial performance" ≈ "earnings" ≈ "fiscal results"
- "customer satisfaction" ≈ "user happiness" ≈ "client feedback"

**How they work:** Converted to embeddings for similarity matching

**Best for:**
- Finding conceptually related content
- Handling synonyms and variations
- Discovering relevant but not exact matches
</TabItem>
<TabItem label="Structured Filters">
**What they are:** Exact constraints that must be precisely matched

**Examples:**
- Date/Time: "Q3 2024", "after 2023", "last 30 days"
- Numbers: "revenue > $100M", "5-10 employees", "top 3"
- Categories: "tech sector", "approved status", "high priority"
- Entities: "Apple Inc.", "California", "John Smith"

**How they work:** Converted to database-style filter operations

**Best for:**
- Enforcing hard constraints
- Filtering by exact values
@@ -129,7 +129,7 @@ Let's see how different queries naturally decompose:
- **Structured filters:** `{"quarter": "Q3 2024", "sector": "Technology", "profitable": true}`
- **Why it matters:** You want companies that ARE profitable (filter), not just ones that DISCUSS profitability

#### Healthcare Query
> "Patient records over 65 years old with diabetes showing improvement"

- **Semantic terms:** `["showing improvement", "better outcomes"]`
@@ -216,7 +216,7 @@ print("📖 Explanation:", result.explanation)
'revenue': {'$gte': 100.0}
}
💯 Confidence: 91.5%
📖 Explanation:
Searching for: tech companies
With filters:
• quarter = Q3 2024
@@ -270,30 +270,30 @@ class EnhancedRAGSystem:
# Same schema for both extraction and parsing!
self.langstruct = LangStruct(example=schema_example)
self.vectorstore = Chroma(embedding_function=OpenAIEmbeddings())

def index_document(self, text: str):
"""Extract metadata and index document"""
# Extract structured metadata
extraction = self.langstruct.extract(text)

# Index with both text and metadata
self.vectorstore.add_texts(
texts=[text],
metadatas=[extraction.entities]
)

def natural_query(self, query: str, k: int = 5):
"""Query using natural language"""
# Parse query into components
parsed = self.langstruct.query(query)

# Perform hybrid search
results = self.vectorstore.similarity_search(
query=' '.join(parsed.semantic_terms),
k=k,
filter=parsed.structured_filters
)

return results, parsed.explanation

# Usage
@@ -407,13 +407,13 @@ ls = LangStruct(example=your_schema)
# Query with natural language
def smart_search(query: str):
parsed = ls.query(query)

results = collection.query(
query_texts=parsed.semantic_terms,
where=parsed.structured_filters,
n_results=10
)

return results
```
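
A possible call, under the same assumptions as the snippet above (`smart_search` is the helper just defined; the query string is made up):

```python
# Mixes structured filters (quarter, revenue) with semantic terms (tech companies)
results = smart_search("Q3 2024 tech companies with revenue over $100B")
```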

@@ -431,19 +431,19 @@ ls = LangStruct(example=your_schema)
# Natural language query
def pinecone_search(query: str):
parsed = ls.query(query)

# Convert to Pinecone filter format
pinecone_filter = {
f"metadata.{k}": v
f"metadata.{k}": v
for k, v in parsed.structured_filters.items()
}

results = index.query(
vector=embed(parsed.semantic_terms),
filter=pinecone_filter,
top_k=10
)

return results
```

@@ -497,9 +497,8 @@ domain_ls = LangStruct(
# Include synonyms in descriptions
"earnings": 10.5, # Also covers "profits", "income"
},
-    # Can optimize for better accuracy
-    optimize=True
)
+# Call domain_ls.optimize(...) with training examples when ready
```
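
A possible follow-up once training pairs exist, assuming the same `optimize()` signature used elsewhere on this page (the text and expected output below are hypothetical):

```python
domain_ls.optimize(
    texts=["Quarterly earnings came in at 10.5 billion dollars."],
    expected_results=[{"earnings": 10.5}],
)
```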

## Performance Considerations
@@ -512,7 +511,7 @@ from functools import lru_cache
class CachedLangStruct:
def __init__(self, schema):
self.ls = LangStruct(example=schema)

@lru_cache(maxsize=1000)
def query_cached(self, query: str):
"""Cache frequently used queries"""
4 changes: 1 addition & 3 deletions docs/src/content/docs/quickstart.mdx
@@ -87,10 +87,8 @@ extractor = LangStruct(example=schema)
# See optimization in action
extractor.optimize(
texts=["training texts..."],
-    expected=[{"expected outputs..."}],
-    num_trials=50  # More trials = better accuracy
+    expected_results=[{"expected outputs..."}]  # Optional - uses confidence if omitted
)
print(f"Optimized accuracy: {extractor.score:.1%}")
```

## Process Multiple Documents (with quotas)
9 changes: 4 additions & 5 deletions docs/src/content/docs/why-dspy.mdx
@@ -119,7 +119,7 @@ extractor = LangStruct(example={

# 2. Let MIPROv2 optimize prompts and examples automatically
extractor.optimize(
-    training_texts=["Apple reported $125B in Q3...", "Meta earned $40B..."],
+    texts=["Apple reported $125B in Q3...", "Meta earned $40B..."],
expected_results=[
{"company": "Apple", "revenue": 125.0, "quarter": "Q3"},
{"company": "Meta", "revenue": 40.0, "quarter": "Q3"}
@@ -147,17 +147,16 @@ result = extractor.extract("Microsoft announced $65B revenue for Q4")
extractor = LangStruct(
example={"company": "Apple", "revenue": 100.0},
model="gpt-5-mini",
-    optimize=True
)
-extractor.optimize(training_texts, expected_results)
+extractor.optimize(texts=training_texts, expected_results=expected_results)

# 6 months later, switch to Claude - just two lines!
extractor.model = "claude-3-7-sonnet-latest"
-extractor.optimize(training_texts, expected_results)  # Auto-reoptimizes prompts
+extractor.optimize(texts=training_texts, expected_results=expected_results)  # Auto-reoptimizes prompts

# Or use local models for privacy
extractor.model = "ollama/llama3.2"
-extractor.optimize(training_texts, expected_results)  # Works the same way
+extractor.optimize(texts=training_texts, expected_results=expected_results)  # Works the same way

# Same accuracy, zero prompt rewriting, zero vendor lock-in
```