Merged
139 changes: 90 additions & 49 deletions skills/tinyfish-web-agent/SKILL.md
@@ -7,6 +7,39 @@ description: Use TinyFish/Mino web agent to extract/scrape websites, extract dat

Requires: `MINO_API_KEY` environment variable

## Pre-flight Check (REQUIRED)

Before making any API call, **always** run this first to verify the key is available:

```bash
[ -n "$MINO_API_KEY" ] && echo "MINO_API_KEY is set" || echo "MINO_API_KEY is NOT set"
```

If the key is **not set**, you **MUST stop and ask the user** to add their API key. Do **NOT** fall back to other tools or approaches — the task requires TinyFish/Mino.

Tell the user:

> You need a Mino API key. Get one at: <https://mino.ai/api-keys>
>
> Then set it so the agent can use it:
>
> **Option 1 — Environment variable (works everywhere):**
> ```bash
> export MINO_API_KEY="your-key-here"
> ```
>
> **Option 2 — Claude Code settings (Claude Code only):**
> Add to `~/.claude/settings.local.json`:
> ```json
> {
>   "env": {
>     "MINO_API_KEY": "your-key-here"
>   }
> }
> ```

Do NOT proceed until the key is confirmed available.

## Best Practices

1. **Specify JSON format**: Always describe the exact structure you want returned
@@ -16,91 +49,99 @@ Requires: `MINO_API_KEY` environment variable

Extract data from a page. Specify the JSON structure you want:

```python
import requests
import json
import os

response = requests.post(
    "https://mino.ai/v1/automation/run-sse",
    headers={
        "X-API-Key": os.environ["MINO_API_KEY"],
        "Content-Type": "application/json",
    },
    json={
        "url": "https://example.com",
        "goal": "Extract product info as JSON: {\"name\": str, \"price\": str, \"in_stock\": bool}",
    },
    stream=True,
)

for line in response.iter_lines():
    if line:
        line_str = line.decode("utf-8")
        if line_str.startswith("data: "):
            event = json.loads(line_str[6:])
            if event.get("type") == "COMPLETE" and event.get("status") == "COMPLETED":
                print(json.dumps(event["resultJson"], indent=2))
```bash
curl -N -s -X POST "https://mino.ai/v1/automation/run-sse" \
-H "X-API-Key: $MINO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"goal": "Extract product info as JSON: {\"name\": str, \"price\": str, \"in_stock\": bool}"
}'
```

## Multiple Items

Extract lists of data with explicit structure:

```python
json={
```bash
curl -N -s -X POST "https://mino.ai/v1/automation/run-sse" \
-H "X-API-Key: $MINO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/products",
"goal": "Extract all products as JSON array: [{\"name\": str, \"price\": str, \"url\": str}]",
}
"goal": "Extract all products as JSON array: [{\"name\": str, \"price\": str, \"url\": str}]"
}'
```

## Stealth Mode

For bot-protected sites:
For bot-protected sites, add `"browser_profile": "stealth"` to the request body:

```python
json={
```bash
curl -N -s -X POST "https://mino.ai/v1/automation/run-sse" \
-H "X-API-Key: $MINO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://protected-site.com",
"goal": "Extract product data as JSON: {\"name\": str, \"price\": str, \"description\": str}",
"browser_profile": "stealth",
}
"browser_profile": "stealth"
}'
```

## Proxy

Route through specific country:
Route through a specific country by adding `"proxy_config"` to the body:

```python
json={
```bash
curl -N -s -X POST "https://mino.ai/v1/automation/run-sse" \
-H "X-API-Key: $MINO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://geo-restricted-site.com",
"goal": "Extract pricing data as JSON: {\"item\": str, \"price\": str, \"currency\": str}",
"browser_profile": "stealth",
"proxy_config": {
"enabled": True,
"country_code": "US",
},
}
"proxy_config": {"enabled": true, "country_code": "US"}
}'
```

## Output

Results are in `event["resultJson"]` when `event["type"] == "COMPLETE"`
The SSE stream returns `data: {...}` lines. The final result is the event where `type == "COMPLETE"` and `status == "COMPLETED"` — the extracted data is in the `resultJson` field. Claude reads the raw SSE output directly; no script-side parsing is needed.
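
If a wrapper script ever does need only the final payload (for logging or saving to a file, say), the stream could be filtered along the following lines. This is a minimal sketch, not part of the skill: `final_event` is a hypothetical helper name, it assumes each event arrives as a single `data: {...}` line, and it matches the `type` and `status` fields with a regular expression rather than a real JSON parser.

```bash
#!/usr/bin/env bash
# Sketch: print the JSON payload of the final COMPLETE/COMPLETED event.
# Assumption: every event is one `data: {...}` line. The grep patterns are a
# heuristic field match, not a JSON parser. final_event is a hypothetical name.
final_event() {
  local line payload
  while IFS= read -r line || [ -n "$line" ]; do
    line=${line%$'\r'}                  # tolerate CRLF line endings
    case "$line" in
      "data: "*) payload=${line#data: } ;;
      *) continue ;;                    # skip blank lines and non-data fields
    esac
    if printf '%s' "$payload" | grep -Eq '"type"[[:space:]]*:[[:space:]]*"COMPLETE"' &&
       printf '%s' "$payload" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"COMPLETED"'; then
      printf '%s\n' "$payload"
    fi
  done
}

# Usage (not executed here): pipe the curl output through the filter.
#   curl -N -s -X POST "https://mino.ai/v1/automation/run-sse" ... | final_event
```

The filter prints the whole event object; picking out `resultJson` itself would need a JSON-aware tool.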

## Parallel Extraction

When extracting from multiple independent sources, make separate parallel API calls instead of combining into one prompt:
When extracting from multiple independent sources, make separate parallel curl calls instead of combining into one prompt:

**Good** - Parallel calls:
```python
```bash
# Compare pizza prices - run these simultaneously
call_1 = extract("https://pizzahut.com", "Extract pizza prices as JSON: [{\"name\": str, \"price\": str}]")
call_2 = extract("https://dominos.com", "Extract pizza prices as JSON: [{\"name\": str, \"price\": str}]")
curl -N -s -X POST "https://mino.ai/v1/automation/run-sse" \
-H "X-API-Key: $MINO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://pizzahut.com",
"goal": "Extract pizza prices as JSON: [{\"name\": str, \"price\": str}]"
}'

curl -N -s -X POST "https://mino.ai/v1/automation/run-sse" \
-H "X-API-Key: $MINO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://dominos.com",
"goal": "Extract pizza prices as JSON: [{\"name\": str, \"price\": str}]"
}'
```

**Bad** - Single combined call:
```python
```bash
# Don't do this - less reliable and slower
extract("https://pizzahut.com", "Extract prices from Pizza Hut and also go to Dominos...")
curl -N -s -X POST "https://mino.ai/v1/automation/run-sse" \
-H "X-API-Key: $MINO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://pizzahut.com",
"goal": "Extract prices from Pizza Hut and also go to Dominos..."
}'
```

Each independent extraction task should be its own API call. This is faster (parallel execution) and more reliable.
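
When the calls are driven from a single shell rather than issued as separate tool calls, the same idea can be sketched with background jobs and `wait`. The helper below is illustrative only: `run_parallel` and the `EXTRACT_CMD` hook are hypothetical names, the default `./extract.sh` path is an assumption about where the helper script lives, and the output directory must already exist.

```bash
#!/usr/bin/env bash
# Sketch: run one extraction per URL in the background, then wait for all.
# run_parallel and EXTRACT_CMD are hypothetical; EXTRACT_CMD is any command
# that accepts <url> <goal> and writes the SSE stream to stdout.
run_parallel() {
  local goal="$1" outdir="$2"
  shift 2
  local url i=0 pids=()
  for url in "$@"; do
    i=$((i + 1))
    # Each call gets its own output file so the streams do not interleave.
    "${EXTRACT_CMD:-./extract.sh}" "$url" "$goal" > "${outdir}/result_${i}.sse" &
    pids+=("$!")
  done
  local pid status=0
  for pid in "${pids[@]}"; do
    wait "$pid" || status=1             # report failure if any call failed
  done
  return "$status"
}

# Usage (not executed here):
#   mkdir -p out
#   run_parallel 'Extract pizza prices as JSON: [{"name": str, "price": str}]' out \
#     "https://pizzahut.com" "https://dominos.com"
```

Writing each stream to its own file keeps the results separable; the function returns non-zero if any of the background calls failed.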
79 changes: 0 additions & 79 deletions skills/tinyfish-web-agent/scripts/extract.py

This file was deleted.

70 changes: 70 additions & 0 deletions skills/tinyfish-web-agent/scripts/extract.sh
@@ -0,0 +1,70 @@
#!/usr/bin/env bash
#
# TinyFish web extract/scrape helper
#
# Usage:
# extract.sh <url> <goal> [--stealth] [--proxy COUNTRY]
#
# Examples:
# extract.sh "https://example.com" 'Extract product as JSON: {"name": str, "price": str}'
# extract.sh "https://site.com" 'Get all links as JSON: [{"text": str, "url": str}]' --stealth
# extract.sh "https://site.com" 'Extract items' --stealth --proxy US

set -euo pipefail

if [ $# -lt 2 ]; then
  echo "Usage: extract.sh <url> <goal> [--stealth] [--proxy COUNTRY]" >&2
  exit 1
fi

if [ -z "${MINO_API_KEY:-}" ]; then
  echo "Error: MINO_API_KEY environment variable not set" >&2
  exit 1
fi

URL="$1"
GOAL="$2"
shift 2

STEALTH=false
PROXY_COUNTRY=""

while [ $# -gt 0 ]; do
  case "$1" in
    --stealth)
      STEALTH=true
      shift
      ;;
    --proxy)
      PROXY_COUNTRY="$2"
      shift 2
      ;;
Comment on lines +38 to +41
⚠️ Potential issue | 🟡 Minor

Validate --proxy has a country argument.
With set -euo pipefail, --proxy without a value can trigger an unbound parameter/shift error instead of a clear message.

🛠️ Suggested fix
     --proxy)
+      if [ $# -lt 2 ] || [[ "$2" == --* ]]; then
+        echo "Error: --proxy requires COUNTRY" >&2
+        exit 1
+      fi
       PROXY_COUNTRY="$2"
       shift 2
       ;;
🤖 Prompt for AI Agents
In `@skills/tinyfish-web-agent/scripts/extract.sh` around lines 38 - 41, The
--proxy case in the argument-parsing branch (where PROXY_COUNTRY is set) doesn't
validate that a following argument exists and is not another flag, which under
set -euo pipefail can produce an unbound parameter/shift error; update the
--proxy handling (the case that sets PROXY_COUNTRY) to first check that "$2" is
present and does not start with '-' and if it is missing or looks like a flag
print a clear error message and exit non-zero, otherwise assign
PROXY_COUNTRY="$2" and shift 2 as before.

    *)
      echo "Unknown option: $1" >&2
      exit 1
      ;;
  esac
done

# Build JSON payload — escape URL and goal for safe embedding
JSON_URL=$(printf '%s' "$URL" | sed 's/\\/\\\\/g; s/"/\\"/g')
JSON_GOAL=$(printf '%s' "$GOAL" | sed 's/\\/\\\\/g; s/"/\\"/g')

PAYLOAD="{\"url\":\"${JSON_URL}\",\"goal\":\"${JSON_GOAL}\""

if [ "$STEALTH" = true ]; then
  PAYLOAD="${PAYLOAD},\"browser_profile\":\"stealth\""
fi

if [ -n "$PROXY_COUNTRY" ]; then
  PAYLOAD="${PAYLOAD},\"proxy_config\":{\"enabled\":true,\"country_code\":\"${PROXY_COUNTRY}\"}"
fi
Comment on lines +49 to +61
⚠️ Potential issue | 🟠 Major

Harden JSON escaping to handle newlines/tabs/control chars.
Current escaping only handles \ and ". Multi-line goals (common) will produce invalid JSON.

🛠️ Suggested fix
+# Escape input for JSON string context
+json_escape() {
+  local s="$1"
+  s=${s//\\/\\\\}
+  s=${s//\"/\\\"}
+  s=${s//$'\n'/\\n}
+  s=${s//$'\r'/\\r}
+  s=${s//$'\t'/\\t}
+  s=${s//$'\f'/\\f}
+  s=${s//$'\b'/\\b}
+  printf '%s' "$s"
+}
+
 # Build JSON payload — escape URL and goal for safe embedding
-JSON_URL=$(printf '%s' "$URL" | sed 's/\\/\\\\/g; s/"/\\"/g')
-JSON_GOAL=$(printf '%s' "$GOAL" | sed 's/\\/\\\\/g; s/"/\\"/g')
+JSON_URL=$(json_escape "$URL")
+JSON_GOAL=$(json_escape "$GOAL")
@@
 if [ -n "$PROXY_COUNTRY" ]; then
-  PAYLOAD="${PAYLOAD},\"proxy_config\":{\"enabled\":true,\"country_code\":\"${PROXY_COUNTRY}\"}"
+  JSON_PROXY=$(json_escape "$PROXY_COUNTRY")
+  PAYLOAD="${PAYLOAD},\"proxy_config\":{\"enabled\":true,\"country_code\":\"${JSON_PROXY}\"}"
 fi
🤖 Prompt for AI Agents
In `@skills/tinyfish-web-agent/scripts/extract.sh` around lines 49 - 61, The
current JSON_URL/JSON_GOAL escaping only handles backslashes and quotes and
breaks on newlines/tabs/control chars; replace the sed-based escaping for
JSON_URL and JSON_GOAL with a proper JSON string encoding step (e.g., use jq -R
-s `@json` or a small Python json.dumps invocation) so newlines/tabs/control
characters are escaped correctly, then use those encoded values when
constructing PAYLOAD (adjust how you combine them so you don’t double-quote the
JSON-encoded output). Update the assignments to JSON_URL and JSON_GOAL and the
PAYLOAD construction to use the new, safely encoded strings.
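
The encoder-based alternative mentioned in the prompt above might look roughly like this. It is a sketch under stated assumptions rather than the change that was committed: it assumes `python3` is on `PATH`, and `json_string` and `build_payload` are hypothetical names. Because `json.dumps` emits the surrounding double quotes itself, the payload template must not add its own.

```bash
#!/usr/bin/env bash
# Sketch: JSON-encode arbitrary strings with Python's json.dumps.
# Assumptions: python3 is on PATH; json_string and build_payload are
# hypothetical helper names, not part of extract.sh.
json_string() {
  python3 -c 'import json, sys; print(json.dumps(sys.argv[1]))' "$1"
}

build_payload() {
  local url="$1" goal="$2"
  # json_string already returns a quoted JSON string, so the template
  # uses bare %s placeholders instead of "\"%s\"".
  printf '{"url":%s,"goal":%s}\n' "$(json_string "$url")" "$(json_string "$goal")"
}

# Usage (not executed here):
#   PAYLOAD=$(build_payload "$URL" "$GOAL")
```

Delegating the escaping to a JSON library covers newlines, tabs, and other control characters without maintaining a hand-written substitution list, at the cost of reintroducing a Python dependency.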


PAYLOAD="${PAYLOAD}}"

echo "Extracting from ${URL}..." >&2

exec curl -N -s -X POST "https://mino.ai/v1/automation/run-sse" \
  -H "X-API-Key: ${MINO_API_KEY}" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"