# Setting up the skills.md and example scripts! #1
@@ -0,0 +1,106 @@
---
name: tinyfish
description: Use the TinyFish/Mino web agent to extract/scrape data from websites and automate browser actions using natural language. Use when you need to extract data from websites, handle bot-protected sites, or automate web tasks.
---

# TinyFish Web Agent

Requires: `MINO_API_KEY` environment variable

## Best Practices

1. **Specify JSON format**: Always describe the exact structure you want returned
2. **Parallel calls**: When extracting from multiple independent sites, make separate parallel calls instead of combining them into one prompt
## Basic Extract/Scrape

Extract data from a page. Specify the JSON structure you want:

```python
import requests
import json
import os

response = requests.post(
    "https://mino.ai/v1/automation/run-sse",
    headers={
        "X-API-Key": os.environ["MINO_API_KEY"],
        "Content-Type": "application/json",
    },
    json={
        "url": "https://example.com",
        "goal": "Extract product info as JSON: {\"name\": str, \"price\": str, \"in_stock\": bool}",
    },
    stream=True,
)

for line in response.iter_lines():
    if line:
        line_str = line.decode("utf-8")
        if line_str.startswith("data: "):
            event = json.loads(line_str[6:])
            if event.get("type") == "COMPLETE" and event.get("status") == "COMPLETED":
                print(json.dumps(event["resultJson"], indent=2))
```
## Multiple Items

Extract lists of data with explicit structure:

```python
json={
    "url": "https://example.com/products",
    "goal": "Extract all products as JSON array: [{\"name\": str, \"price\": str, \"url\": str}]",
}
```
## Stealth Mode

For bot-protected sites:

```python
json={
    "url": "https://protected-site.com",
    "goal": "Extract product data as JSON: {\"name\": str, \"price\": str, \"description\": str}",
    "browser_profile": "stealth",
}
```
## Proxy

Route through a specific country:

```python
json={
    "url": "https://geo-restricted-site.com",
    "goal": "Extract pricing data as JSON: {\"item\": str, \"price\": str, \"currency\": str}",
    "browser_profile": "stealth",
    "proxy_config": {
        "enabled": True,
        "country_code": "US",
    },
}
```
## Output

Results are in `event["resultJson"]` when `event["type"] == "COMPLETE"` and `event["status"] == "COMPLETED"`.
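A small helper, sketched from the stream handling in the basic example above (the `collect_result` name is illustrative, not part of the API):

```python
def collect_result(response):
    """Return resultJson from the final COMPLETE event, or None if it never arrives."""
    for line in response.iter_lines():
        if not line:
            continue
        text = line.decode("utf-8")
        if text.startswith("data: "):
            event = json.loads(text[len("data: "):])
            if event.get("type") == "COMPLETE" and event.get("status") == "COMPLETED":
                return event.get("resultJson")
    return None
```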
## Parallel Extraction

When extracting from multiple independent sources, make separate parallel API calls instead of combining them into one prompt.

**Good** - Parallel calls:

```python
# Compare pizza prices - run these simultaneously
call_1 = extract("https://pizzahut.com", "Extract pizza prices as JSON: [{\"name\": str, \"price\": str}]")
call_2 = extract("https://dominos.com", "Extract pizza prices as JSON: [{\"name\": str, \"price\": str}]")
```

**Bad** - Single combined call:

```python
# Don't do this - less reliable and slower
extract("https://pizzahut.com", "Extract prices from Pizza Hut and also go to Dominos...")
```

Each independent extraction task should be its own API call; this is faster (parallel execution) and more reliable. A concurrent version is sketched below.
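As a sketch of what running the calls truly in parallel can look like, assuming an `extract(url, goal)` helper such as the one defined in the script below:

```python
from concurrent.futures import ThreadPoolExecutor

goal = 'Extract pizza prices as JSON: [{"name": str, "price": str}]'
sites = ["https://pizzahut.com", "https://dominos.com"]

# Submit both extractions at once; each future completes independently
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(extract, url, goal) for url in sites]
    results = [f.result() for f in futures]
```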
`skills/tinyfish-web-agent/scripts/extract.py` @@ -0,0 +1,79 @@
```python
#!/usr/bin/env python3
"""
TinyFish web extract/scrape helper

Usage:
    extract.py <url> <goal> [--stealth] [--proxy US]

Best practice: Specify the JSON format you want in the goal for better results.

Examples:
    extract.py "https://example.com" "Extract product as JSON: {\"name\": str, \"price\": str}"
    extract.py "https://site.com" "Get all links as JSON: [{\"text\": str, \"url\": str}]" --stealth
    extract.py "https://site.com" "Extract items as JSON: [{\"title\": str, \"price\": str}]" --stealth --proxy US
"""

import os
import sys
import json
import urllib.request
import argparse


def extract(url, goal, stealth=False, proxy_country=None):
    """Extract/scrape data from a website using TinyFish"""
    api_key = os.environ.get("MINO_API_KEY")
    if not api_key:
        print("Error: MINO_API_KEY environment variable not set", file=sys.stderr)
        sys.exit(1)
```
> **Review comment on lines +25 to +28**
>
> Avoid `sys.exit()` inside the library function. `extract()` can be imported, so a missing API key should raise an exception rather than terminate the caller's process. Also, the CLI entry point (line 79) calls `extract()` directly and should catch that error and exit with a friendly message.
>
> 🛠️ Suggested fix:
>
> ```diff
>  def extract(url, goal, stealth=False, proxy_country=None):
>      """Extract/scrape data from a website using TinyFish"""
>      api_key = os.environ.get("MINO_API_KEY")
>      if not api_key:
> -        print("Error: MINO_API_KEY environment variable not set", file=sys.stderr)
> -        sys.exit(1)
> +        raise RuntimeError("MINO_API_KEY environment variable not set")
> ```
>
> ```diff
>  if __name__ == "__main__":
>      parser = argparse.ArgumentParser(description="TinyFish web extract/scrape tool")
>      parser.add_argument("url", help="URL to extract/scrape from")
>      parser.add_argument("goal", help="What to extract (natural language)")
>      parser.add_argument("--stealth", action="store_true", help="Use stealth mode")
>      parser.add_argument("--proxy", help="Proxy country code (e.g., US, UK, DE)")
>      args = parser.parse_args()
> -    extract(args.url, args.goal, args.stealth, args.proxy)
> +    try:
> +        extract(args.url, args.goal, args.stealth, args.proxy)
> +    except RuntimeError as exc:
> +        print(f"Error: {exc}", file=sys.stderr)
> +        sys.exit(1)
> ```
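With that change in place, a caller could handle the failure itself instead of having its process killed; a hypothetical sketch (the import path assumes `scripts/` is on `sys.path`):

```python
from extract import extract  # assumption: scripts/ is importable

try:
    data = extract("https://example.com", 'Extract the page title as JSON: {"title": str}')
except RuntimeError as exc:
    print(f"Extraction failed: {exc}")
```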
```python
    payload = {
        "url": url,
        "goal": goal,
    }

    if stealth:
        payload["browser_profile"] = "stealth"

    if proxy_country:
        payload["proxy_config"] = {
            "enabled": True,
            "country_code": proxy_country,
        }

    req = urllib.request.Request(
        "https://mino.ai/v1/automation/run-sse",
        data=json.dumps(payload).encode(),
        headers={
            "X-API-Key": api_key,
            "Content-Type": "application/json",
        }
    )

    print(f"Extracting from {url}...", file=sys.stderr)

    with urllib.request.urlopen(req) as response:
        for line in response:
            line_str = line.decode("utf-8").strip()
            if line_str.startswith("data: "):
                event = json.loads(line_str[6:])

                # Print status updates
                if event.get("type") == "STATUS_UPDATE":
                    print(f"[{event.get('status')}] {event.get('message', '')}", file=sys.stderr)

                # Print final result
                if event.get("type") == "COMPLETE" and event.get("status") == "COMPLETED":
                    print(json.dumps(event["resultJson"], indent=2))
                    return event["resultJson"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="TinyFish web extract/scrape tool")
    parser.add_argument("url", help="URL to extract/scrape from")
    parser.add_argument("goal", help="What to extract (natural language)")
    parser.add_argument("--stealth", action="store_true", help="Use stealth mode")
    parser.add_argument("--proxy", help="Proxy country code (e.g., US, UK, DE)")

    args = parser.parse_args()
    extract(args.url, args.goal, args.stealth, args.proxy)
```
> **Review comment**
>
> Add timeout + HTTP/URL error handling for the API call.
>
> The blocking network call at line 55 lacks both a timeout parameter and exception handling. Without a timeout, this call can hang indefinitely, and unhandled HTTP/URL errors will crash the script with raw tracebacks instead of graceful error messages.
>
> 🛠️ Suggested fix (timeout + friendly errors)
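The fix body is not shown above; a minimal sketch of the change the reviewer describes, with the timeout value as an assumption:

```python
import urllib.error

try:
    # 60s timeout is an assumed value, not part of the original script
    with urllib.request.urlopen(req, timeout=60) as response:
        for line in response:
            ...  # SSE handling as in the script above
except urllib.error.HTTPError as exc:
    raise RuntimeError(f"API returned HTTP {exc.code}: {exc.reason}") from exc
except urllib.error.URLError as exc:
    raise RuntimeError(f"Could not reach the API: {exc.reason}") from exc
```

Raising `RuntimeError` here keeps the error path consistent with the first review suggestion; `HTTPError` must be caught before `URLError`, since it is a subclass.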