This session focuses on practical model discovery and management with Foundry Local. You'll learn how to list available models, test different options, and understand basic performance characteristics. The approach emphasizes hands-on exploration with the foundry CLI to help you select the right models for your use cases.
- Master foundry CLI commands for model discovery and management
- Understand model cache and local storage patterns
- Learn to quickly test and compare different models
- Establish practical workflows for model selection and benchmarking
- Explore the growing ecosystem of models available through Foundry Local
- Completed Session 1: Getting Started with Foundry Local
- Foundry Local CLI installed and accessible
- Sufficient storage space for model downloads (models can range from 1GB to 20GB+)
- Basic understanding of model types and use cases

## Part 6: Hands-On Exercise
Create your own model evaluation script based on Sample 03:
```cmd
REM create_model_test.cmd
@echo off
echo Model Discovery and Testing Script
echo =====================================
echo.
echo Step 1: List available models
foundry model list
echo.
echo Step 2: Check what's cached
foundry cache list
echo.
echo Step 3: Start phi-4-mini for testing
foundry model run phi-4-mini --verbose
echo.
echo Step 4: Test with a simple prompt
curl -X POST http://localhost:8000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"phi-4-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello, please introduce yourself.\"}],\"max_tokens\":100}"
echo.
echo Model test complete!
```

- Run the Sample 03 script: `samples\03\list_and_bench.cmd`
- Try different models: Test at least 3 different models
- Compare performance: Note differences in speed and response quality
- Document findings: Create a simple comparison chart
```text
Model Comparison Results:
========================
phi-4-mini:  Fast (~2s), good for general chat
qwen2.5-7b:  Slower (~5s), better reasoning
deepseek-r1: Medium (~3s), excellent for code

Recommendation: Start with phi-4-mini for development,
switch to qwen2.5-7b for production reasoning tasks.
```
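Notes like the ones above are easy to keep consistent if you render them from data. Here is a minimal sketch (the model names and timings are the illustrative figures from the example above, not benchmark results) that formats observations as a markdown table:

```python
# Illustrative helper: render model comparison notes as a markdown table.
# The figures below are the example observations from this session, not benchmarks.
results = [
    {"model": "phi-4-mini", "latency_sec": 2.0, "notes": "good for general chat"},
    {"model": "qwen2.5-7b", "latency_sec": 5.0, "notes": "better reasoning"},
    {"model": "deepseek-r1", "latency_sec": 3.0, "notes": "excellent for code"},
]

def to_markdown_table(rows):
    """Render a list of result dicts as a markdown table string."""
    header = "| Model | Latency (s) | Notes |"
    sep = "| --- | --- | --- |"
    body = [f"| {r['model']} | {r['latency_sec']:.1f} | {r['notes']} |" for r in rows]
    return "\n".join([header, sep] + body)

print(to_markdown_table(results))
```

Paste the output straight into your notes or a team wiki.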
Model Won't Start:
```cmd
REM Check service status
foundry service status
REM Restart service if needed
foundry service stop
foundry service start
REM Try with verbose output
foundry model run phi-4-mini --verbose
```

Insufficient Memory:
- Start with smaller models (`phi-4-mini`)
- Close other applications
- Upgrade RAM if frequently hitting limits
Slow Performance:
- Ensure model is fully loaded (check verbose output)
- Close unnecessary background applications
- Consider faster storage (SSD)
- Start Small: Begin with `phi-4-mini` to validate setup
- One Model at a Time: Stop previous models before starting new ones
- Monitor Resources: Keep an eye on memory usage
- Test Consistently: Use the same prompts for fair comparisons
- Document Results: Keep notes on model performance for your use cases
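A lightweight way to keep those notes is a small JSON log. The sketch below is one possible approach (the file name and record fields are arbitrary choices, not part of Foundry Local):

```python
import json
from pathlib import Path

def log_result(path, entry):
    """Append one test observation to a JSON log file; returns the entry count."""
    p = Path(path)
    results = json.loads(p.read_text()) if p.exists() else []
    results.append(entry)
    p.write_text(json.dumps(results, indent=2))
    return len(results)

# Example: record one observation (values here are illustrative).
# log_result("model_notes.json", {"model": "phi-4-mini", "latency_sec": 2.1, "note": "good general chat"})
```

Over time the log gives you a per-model history you can compare across machine configurations.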
- Session 4 Focus: Optimization tools and techniques
- Prerequisites: Comfortable with model switching and basic performance testing
- Recommended: Have 2-3 favorite models identified from this session
- Foundry Local Documentation: Official documentation
- CLI Reference: Complete command reference
- Model Mondays: Weekly model spotlights
- Foundry Local GitHub: Community and issues
- Sample 03: Model Discovery: Hands-on example script
✅ Model Discovery: Use foundry model list to explore available models
✅ Quick Testing: The list_and_bench.cmd pattern for rapid evaluation
✅ Performance Monitoring: Basic resource usage and response time measurement
✅ Model Selection: Practical guidelines for choosing models by use case
✅ Cache Management: Understanding storage and cleanup procedures
You now have the practical skills to discover, test, and select appropriate models for your AI applications using Foundry Local's straightforward CLI approach.

This session also explores how to bring open-source models to Foundry Local: selecting community models, integrating Hugging Face content, and adopting “bring your own model” (BYOM) strategies. You’ll also discover the Model Mondays series for continuous learning and model discovery.
References:
- Foundry Local docs: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/
- Compile Hugging Face models: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/how-to/how-to-compile-hugging-face-models
- Model Mondays: https://aka.ms/model-mondays
- Foundry Local GitHub: https://github.com/microsoft/Foundry-Local
- Discover and evaluate open-source models for local inference
- Compile and run select Hugging Face models within Foundry Local
- Apply model selection strategies for accuracy, latency, and resource needs
- Manage models locally with cache and versioning
The foundry CLI provides straightforward commands for model discovery and management:
```cmd
REM List all available models in the catalog
foundry model list
REM List cached (downloaded) models
foundry cache list
REM Check cache directory location
foundry cache ls
```

Start with popular, well-tested models to understand performance characteristics:
```cmd
REM Run Phi-4-Mini (lightweight, fast)
foundry model run phi-4-mini --verbose
REM Run Qwen 2.5 7B (larger, more capable)
foundry model run qwen2.5-7b --verbose
REM Run DeepSeek (specialized for coding)
foundry model run deepseek-r1-7b --verbose
```

Note: The --verbose flag provides detailed startup information, including:
- Model download progress (on first run)
- Memory allocation details
- Service binding information
- Performance initialization metrics
Small Language Models (SLMs):
- `phi-4-mini`: Fast, efficient, great for general chat
- `phi-4`: More capable version with better reasoning
Medium Models:
- `qwen2.5-7b`: Excellent reasoning and longer context
- `deepseek-r1-7b`: Optimized for code generation
Larger Models:
- `llama-3.2`: Meta's latest open-source model
- `qwen2.5-14b`: Enterprise-grade reasoning
Based on our Sample 03 pattern, here's the minimal workflow:
```cmd
@echo off
REM Sample 03 - List and bench pattern
echo Listing available models...
foundry model list
echo.
echo Checking cached models...
foundry cache list
echo.
echo Starting phi-4-mini with verbose output...
foundry model run phi-4-mini --verbose
```

Once a model is running, test it with consistent prompts:
```cmd
REM Test via curl (Windows Command Prompt)
curl -X POST http://localhost:8000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"phi-4-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Explain edge AI in one sentence.\"}],\"max_tokens\":50}"
```

```powershell
# PowerShell approach for testing
$body = @{
    model = "phi-4-mini"
    messages = @(
        @{
            role = "user"
            content = "Explain edge AI in one sentence."
        }
    )
    max_tokens = 50
} | ConvertTo-Json -Depth 3

Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -Body $body -ContentType "application/json"
```

Foundry Local automatically manages model downloads and caching:
```cmd
REM Check cache directory and contents
foundry cache ls
REM View cache location
foundry cache cd
REM Clean up unused models (if needed)
foundry cache clean
```

Typical Model Sizes:
- `phi-4-mini`: ~2.5 GB
- `qwen2.5-7b`: ~4.1 GB
- `deepseek-r1-7b`: ~4.3 GB
- `llama-3.2`: ~4.9 GB
- `qwen2.5-14b`: ~8.2 GB
Storage Best Practices:
- Keep 2-3 models cached for quick switching
- Remove unused models to free space: `foundry cache clean`
- Monitor disk usage, especially on smaller SSDs
- Consider model size vs. capability trade-offs
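To see how much disk space your cached models actually use, a small directory-size helper is enough. This is a generic sketch: the cache location varies by installation, so point it at whatever directory `foundry cache cd` reports on your machine (the commented path is only a placeholder):

```python
from pathlib import Path

def dir_size_gb(path):
    """Sum the size of all regular files under a directory, in gigabytes."""
    total = sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())
    return total / (1024 ** 3)

# Point this at your Foundry Local cache directory
# (run `foundry cache cd` to see where it lives on your machine):
# print(f"Cache size: {dir_size_gb('/path/to/your/cache'):.2f} GB")
```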
While models are running, monitor system resources:
Windows Task Manager:
- Watch memory usage (models stay loaded in RAM)
- Monitor CPU utilization during inference
- Check disk I/O during initial model loading
Command Line Monitoring:
```powershell
# Check memory usage (PowerShell)
Get-Process | Where-Object {$_.ProcessName -like "*foundry*"} | Select-Object ProcessName, WorkingSet64

# Monitor running models
foundry service ps
```

For General Chat and Q&A:
- Start with: `phi-4-mini` (fast, efficient)
- Upgrade to: `phi-4` (better reasoning)
- Advanced: `qwen2.5-7b` (longer context)

For Code Generation:
- Recommended: `deepseek-r1-7b`
- Alternative: `qwen2.5-7b` (also good for code)

For Complex Reasoning:
- Best: `qwen2.5-7b` or `qwen2.5-14b`
- Budget option: `phi-4`
Minimum System Requirements:
- `phi-4-mini`: 8GB RAM, entry-level CPU
- `phi-4`: 12GB RAM, mid-range CPU
- `qwen2.5-7b`: 16GB RAM, mid-range CPU
- `deepseek-r1`: 16GB RAM, mid-range CPU
- `qwen2.5-14b`: 24GB RAM, high-end CPU
Recommended for Best Performance:
- 32GB+ RAM for comfortable multi-model switching
- SSD storage for faster model loading
- Modern CPU with good single-thread performance
- NPU support (Windows 11 Copilot+ PCs) for acceleration
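The requirements table above translates into a quick fit check. A minimal sketch (the RAM figures are the approximate minimums copied from the table, so treat the result as a starting point, not a guarantee):

```python
# Approximate minimum RAM per model, from the requirements table above (GB).
MIN_RAM_GB = {
    "phi-4-mini": 8,
    "phi-4": 12,
    "qwen2.5-7b": 16,
    "deepseek-r1": 16,
    "qwen2.5-14b": 24,
}

def models_that_fit(available_gb):
    """Return the models whose minimum RAM requirement fits the given budget."""
    return sorted(m for m, need in MIN_RAM_GB.items() if need <= available_gb)

print(models_that_fit(16))
```

On a 16GB machine this rules out `qwen2.5-14b`, matching the guidance above.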
```cmd
REM Stop current model (if needed)
foundry service stop
REM Start different model
foundry model run qwen2.5-7b
REM Verify model is running
foundry service status
```

Here's a straightforward approach to compare model performance:
```python
# simple_bench.py - Based on Sample 03 patterns
import time
import requests

def test_model_response(model_name, prompt="Explain edge AI in one sentence."):
    """Test a single model with a prompt and measure response time."""
    start_time = time.time()
    try:
        response = requests.post(
            "http://localhost:8000/v1/chat/completions",
            headers={"Content-Type": "application/json"},
            json={
                "model": model_name,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 64
            },
            timeout=30
        )
        elapsed = time.time() - start_time
        if response.status_code == 200:
            result = response.json()
            return {
                "model": model_name,
                "latency_sec": round(elapsed, 3),
                "response": result["choices"][0]["message"]["content"],
                "status": "success"
            }
        else:
            return {
                "model": model_name,
                "status": "error",
                "error": f"HTTP {response.status_code}"
            }
    except Exception as e:
        return {
            "model": model_name,
            "status": "error",
            "error": str(e)
        }

if __name__ == "__main__":
    # Test with different models (start each model first)
    test_models = ["phi-4-mini", "qwen2.5-7b", "deepseek-r1-7b"]
    print("Model Performance Test")
    print("=" * 50)
    for model in test_models:
        print(f"\nTesting {model}...")
        print(f"Note: Make sure this model is running first with 'foundry model run {model}'")
        result = test_model_response(model)
        if result["status"] == "success":
            print(f"✅ {model}: {result['latency_sec']}s")
            print(f"   Response: {result['response'][:100]}...")
        else:
            print(f"❌ {model}: {result['error']}")
```

For each model, test with consistent prompts and manually evaluate:
Test Prompts:
- "Explain quantum computing in simple terms."
- "Write a Python function to sort a list."
- "What are the pros and cons of remote work?"
- "Summarize the benefits of edge AI."
Evaluation Criteria:
- Accuracy: Is the information correct?
- Clarity: Is the explanation easy to understand?
- Completeness: Does it address the full question?
- Speed: How quickly does it respond?
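To keep comparisons fair, prepare the same prompt set once and reuse it for every model. This sketch (using the four test prompts listed above) only builds the request payloads, so you can send them with curl, PowerShell, or the simple_bench.py script:

```python
# The consistent test prompts from the evaluation checklist above.
TEST_PROMPTS = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a list.",
    "What are the pros and cons of remote work?",
    "Summarize the benefits of edge AI.",
]

def build_test_suite(models, prompts=TEST_PROMPTS, max_tokens=64):
    """Build one OpenAI-compatible chat payload per (model, prompt) pair."""
    return [
        {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }
        for model in models
        for prompt in prompts
    ]

suite = build_test_suite(["phi-4-mini", "qwen2.5-7b"])
print(f"{len(suite)} requests prepared")
```

Because every model sees identical prompts and `max_tokens`, latency and quality differences reflect the model rather than the test setup.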
```cmd
REM Monitor while testing different models
REM Start model
foundry model run phi-4-mini
REM In another terminal, monitor resources
foundry service status
foundry service ps
```

```powershell
# Check system resources (PowerShell)
Get-Process | Where-Object ProcessName -Like "*foundry*" | Format-Table ProcessName, WorkingSet64, CPU
```

- Subscribe to Model Mondays for new models and tips: https://aka.ms/model-mondays
- Contribute findings to your team’s `models.json`
- Prepare for Session 4: comparing LLMs vs SLMs, local vs cloud inference, and hands-on demos