**`skills/tinyfish-web-agent/SKILL.md`** (new file, 106 additions)
---
name: tinyfish
description: Use the TinyFish/Mino web agent to extract/scrape data from websites and automate browser actions using natural language. Use when you need to extract data from websites, handle bot-protected sites, or automate web tasks.
---

# TinyFish Web Agent

Requires: `MINO_API_KEY` environment variable

## Best Practices

1. **Specify JSON format**: Always describe the exact structure you want returned
2. **Parallel calls**: When extracting from multiple independent sites, make separate parallel calls instead of combining into one prompt

## Basic Extract/Scrape

Extract data from a page. Specify the JSON structure you want:

```python
import requests
import json
import os

response = requests.post(
    "https://mino.ai/v1/automation/run-sse",
    headers={
        "X-API-Key": os.environ["MINO_API_KEY"],
        "Content-Type": "application/json",
    },
    json={
        "url": "https://example.com",
        "goal": "Extract product info as JSON: {\"name\": str, \"price\": str, \"in_stock\": bool}",
    },
    stream=True,
)

for line in response.iter_lines():
    if line:
        line_str = line.decode("utf-8")
        if line_str.startswith("data: "):
            event = json.loads(line_str[6:])
            if event.get("type") == "COMPLETE" and event.get("status") == "COMPLETED":
                print(json.dumps(event["resultJson"], indent=2))
```

## Multiple Items

Extract lists of data with explicit structure:

```python
json={
    "url": "https://example.com/products",
    "goal": "Extract all products as JSON array: [{\"name\": str, \"price\": str, \"url\": str}]",
}
```

## Stealth Mode

For bot-protected sites:

```python
json={
    "url": "https://protected-site.com",
    "goal": "Extract product data as JSON: {\"name\": str, \"price\": str, \"description\": str}",
    "browser_profile": "stealth",
}
```

## Proxy

Route through specific country:

```python
json={
    "url": "https://geo-restricted-site.com",
    "goal": "Extract pricing data as JSON: {\"item\": str, \"price\": str, \"currency\": str}",
    "browser_profile": "stealth",
    "proxy_config": {
        "enabled": True,
        "country_code": "US",
    },
}
```

## Output

Results are in `event["resultJson"]` when `event["type"] == "COMPLETE"` and `event["status"] == "COMPLETED"`.
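
If you call the API from more than one place, a small helper that drains the stream keeps the parsing in one spot. This is a minimal sketch, assuming the `requests` streaming response from the basic example above; the `read_result` name and the choice to raise when the stream ends without a completed event are illustrative, not part of the Mino API:

```python
import json

def read_result(response):
    """Drain an SSE response and return resultJson from the terminal event."""
    for line in response.iter_lines():
        if not line:
            continue
        line_str = line.decode("utf-8")
        if line_str.startswith("data: "):
            event = json.loads(line_str[6:])
            # Same completion check as the basic example above
            if event.get("type") == "COMPLETE" and event.get("status") == "COMPLETED":
                return event["resultJson"]
    # Assumption: no COMPLETED event means the run failed or was cut short
    raise RuntimeError("stream ended without a COMPLETED event")
```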

## Parallel Extraction

When extracting from multiple independent sources, make separate parallel API calls instead of combining into one prompt:

**Good** - Parallel calls:
```python
# Compare pizza prices - run these simultaneously
call_1 = extract("https://pizzahut.com", "Extract pizza prices as JSON: [{\"name\": str, \"price\": str}]")
call_2 = extract("https://dominos.com", "Extract pizza prices as JSON: [{\"name\": str, \"price\": str}]")
```

**Bad** - Single combined call:
```python
# Don't do this - less reliable and slower
extract("https://pizzahut.com", "Extract prices from Pizza Hut and also go to Dominos...")
```

Each independent extraction task should be its own API call. This is faster (parallel execution) and more reliable.
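
To make the parallelism literal rather than implied, a thread pool works well, since each call spends most of its time waiting on the network. A minimal sketch, assuming an `extract(url, goal)` helper like the one in `scripts/extract.py` that returns the parsed `resultJson`:

```python
from concurrent.futures import ThreadPoolExecutor

GOAL = 'Extract pizza prices as JSON: [{"name": str, "price": str}]'
URLS = ["https://pizzahut.com", "https://dominos.com"]

# Each URL gets its own API call; the pool runs them concurrently.
with ThreadPoolExecutor(max_workers=len(URLS)) as pool:
    results = list(pool.map(lambda url: extract(url, GOAL), URLS))

# results preserves input order: results[0] is Pizza Hut, results[1] is Domino's.
```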
**`skills/tinyfish-web-agent/scripts/extract.py`** (new file, 79 additions)
```python
#!/usr/bin/env python3
"""
TinyFish web extract/scrape helper

Usage:
    extract.py <url> <goal> [--stealth] [--proxy US]

Best practice: Specify the JSON format you want in the goal for better results.

Examples:
    extract.py "https://example.com" "Extract product as JSON: {\"name\": str, \"price\": str}"
    extract.py "https://site.com" "Get all links as JSON: [{\"text\": str, \"url\": str}]" --stealth
    extract.py "https://site.com" "Extract items as JSON: [{\"title\": str, \"price\": str}]" --stealth --proxy US
"""

import os
import sys
import json
import urllib.request
import argparse
```

Comment on lines +16 to +21

⚠️ Potential issue | 🟠 Major


Add timeout + HTTP/URL error handling for the API call.

The blocking network call at line 55 lacks both a timeout parameter and exception handling. Without a timeout, this call can hang indefinitely, and unhandled HTTP/URL errors will crash the script with raw tracebacks instead of graceful error messages.

🛠️ Suggested fix (timeout + friendly errors)

```diff
-import urllib.request
+import urllib.request
+import urllib.error
@@
-    with urllib.request.urlopen(req) as response:
-        for line in response:
-            line_str = line.decode("utf-8").strip()
-            if line_str.startswith("data: "):
-                event = json.loads(line_str[6:])
+    try:
+        with urllib.request.urlopen(req, timeout=30) as response:
+            for line in response:
+                line_str = line.decode("utf-8").strip()
+                if line_str.startswith("data: "):
+                    event = json.loads(line_str[6:])
@@
-                if event.get("type") == "COMPLETE" and event.get("status") == "COMPLETED":
-                    print(json.dumps(event["resultJson"], indent=2))
-                    return event["resultJson"]
+                    if event.get("type") == "COMPLETE" and event.get("status") == "COMPLETED":
+                        print(json.dumps(event["resultJson"], indent=2))
+                        return event["resultJson"]
+    except urllib.error.HTTPError as e:
+        raise RuntimeError(f"MINO API HTTP {e.code}: {e.reason}") from e
+    except urllib.error.URLError as e:
+        raise RuntimeError(f"MINO API connection error: {e.reason}") from e
```
🤖 Prompt for AI Agents
In `@skills/tinyfish-web-agent/scripts/extract.py` around lines 16-21: the
network call in extract.py (the urllib.request.urlopen call) needs a timeout and
friendly error handling: add a configurable timeout (e.g., TIMEOUT constant or
argparse --timeout) and pass it to urllib.request.urlopen(..., timeout=timeout),
wrap the call in try/except catching urllib.error.HTTPError,
urllib.error.URLError, socket.timeout (and a fallback Exception), and on each
print a concise user-facing message and exit with a non-zero code instead of
letting raw tracebacks propagate; update the code around the existing
urllib.request.urlopen usage to implement this handling.


```python
def extract(url, goal, stealth=False, proxy_country=None):
    """Extract/scrape data from a website using TinyFish"""
    api_key = os.environ.get("MINO_API_KEY")
    if not api_key:
        print("Error: MINO_API_KEY environment variable not set", file=sys.stderr)
        sys.exit(1)
```

Comment on lines +25 to +28

⚠️ Potential issue | 🟠 Major



Avoid sys.exit inside the reusable extract() function.

extract() is a library function that should raise exceptions rather than exit the process. Calling sys.exit() here (line 28) means any code importing and calling this function will terminate unexpectedly when the environment variable is missing. Raise an exception instead and handle it in the CLI entry point.

Also, the CLI entry point (line 79) calls extract() without catching potential exceptions. Add error handling there to catch the exception and exit gracefully.

🛠️ Suggested fix

```diff
 def extract(url, goal, stealth=False, proxy_country=None):
     """Extract/scrape data from a website using TinyFish"""
     api_key = os.environ.get("MINO_API_KEY")
     if not api_key:
-        print("Error: MINO_API_KEY environment variable not set", file=sys.stderr)
-        sys.exit(1)
+        raise RuntimeError("MINO_API_KEY environment variable not set")
@@
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="TinyFish web extract/scrape tool")
     parser.add_argument("url", help="URL to extract/scrape from")
     parser.add_argument("goal", help="What to extract (natural language)")
     parser.add_argument("--stealth", action="store_true", help="Use stealth mode")
     parser.add_argument("--proxy", help="Proxy country code (e.g., US, UK, DE)")

     args = parser.parse_args()
-    extract(args.url, args.goal, args.stealth, args.proxy)
+    try:
+        extract(args.url, args.goal, args.stealth, args.proxy)
+    except RuntimeError as exc:
+        print(f"Error: {exc}", file=sys.stderr)
+        sys.exit(1)
```
🤖 Prompt for AI Agents
In `@skills/tinyfish-web-agent/scripts/extract.py` around lines 25-28: replace
the in-function process exit with an exception: inside extract() (where it
checks MINO_API_KEY) raise a descriptive exception (e.g., EnvironmentError or
ValueError) instead of calling sys.exit; then update the CLI entry point that
invokes extract() (the top-level main/if __name__ == '__main__' caller) to wrap
the extract() call in try/except, print the error to stderr and call sys.exit(1)
on failure so library callers get an exception while the CLI still exits
gracefully.


```python
    payload = {
        "url": url,
        "goal": goal,
    }

    if stealth:
        payload["browser_profile"] = "stealth"

    if proxy_country:
        payload["proxy_config"] = {
            "enabled": True,
            "country_code": proxy_country,
        }

    req = urllib.request.Request(
        "https://mino.ai/v1/automation/run-sse",
        data=json.dumps(payload).encode(),
        headers={
            "X-API-Key": api_key,
            "Content-Type": "application/json",
        }
    )

    print(f"Extracting from {url}...", file=sys.stderr)

    with urllib.request.urlopen(req) as response:
        for line in response:
            line_str = line.decode("utf-8").strip()
            if line_str.startswith("data: "):
                event = json.loads(line_str[6:])

                # Print status updates
                if event.get("type") == "STATUS_UPDATE":
                    print(f"[{event.get('status')}] {event.get('message', '')}", file=sys.stderr)

                # Print final result
                if event.get("type") == "COMPLETE" and event.get("status") == "COMPLETED":
                    print(json.dumps(event["resultJson"], indent=2))
                    return event["resultJson"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="TinyFish web extract/scrape tool")
    parser.add_argument("url", help="URL to extract/scrape from")
    parser.add_argument("goal", help="What to extract (natural language)")
    parser.add_argument("--stealth", action="store_true", help="Use stealth mode")
    parser.add_argument("--proxy", help="Proxy country code (e.g., US, UK, DE)")

    args = parser.parse_args()
    extract(args.url, args.goal, args.stealth, args.proxy)
```