Self-hosting Firecrawl

Contributor?

Welcome to Firecrawl 🔥! Here are some instructions on how to get the project locally so you can run it on your own and contribute.

If you're contributing, note that the process is similar to other open-source repos, i.e., fork Firecrawl, make changes, run tests, PR.

If you have any questions or would like help getting on board, join our Discord community here for more information or submit an issue on Github here!

Why?

Self-hosting Firecrawl is particularly beneficial for organizations with stringent security policies that require data to remain within controlled environments. Here are some key reasons to consider self-hosting:

Enhanced Security and Compliance: By self-hosting, you ensure that all data handling and processing complies with internal and external regulations, keeping sensitive information within your secure infrastructure. Note that Firecrawl is a Mendable product and relies on SOC2 Type2 certification, which means that the platform adheres to high industry standards for managing data security.
Customizable Services: Self-hosting allows you to tailor the services, such as the Playwright service, to meet specific needs or handle particular use cases that may not be supported by the standard cloud offering.
Learning and Community Contribution: By setting up and maintaining your own instance, you gain a deeper understanding of how Firecrawl works, which can also lead to more meaningful contributions to the project.

Considerations

However, there are some limitations and additional responsibilities to be aware of:

Limited Access to Fire-engine: Currently, self-hosted instances of Firecrawl do not have access to Fire-engine, which includes advanced features for handling IP blocks, robot detection mechanisms, and more. This means that while you can manage basic scraping tasks, more complex scenarios might require additional configuration or might not be supported.
Manual Configuration Required: If you need to use scraping methods beyond the basic fetch and Playwright options, you will need to manually configure these in the .env file. This requires a deeper understanding of the technologies and might involve more setup time.

Self-hosting Firecrawl is ideal for those who need full control over their scraping and data processing environments but comes with the trade-off of additional maintenance and configuration efforts.

Steps

First, start by installing the dependencies

Docker instructions

Set environment variables

Create an .env in the root directory using the template below.

.env:

# ===== Required ENVS ======
PORT=3002
HOST=0.0.0.0

# To turn on DB authentication, you need to set up Supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

## === AI features (JSON format on scrape, /extract API) ===
# Provide your OpenAI API key here to enable AI features
# OPENAI_API_KEY=

# Experimental: Use Ollama
# OLLAMA_BASE_URL=http://localhost:11434/api
# MODEL_NAME=deepseek-r1:7b
# MODEL_EMBEDDING_NAME=nomic-embed-text

# Experimental: Use any OpenAI-compatible API
# OPENAI_BASE_URL=https://example.com/v1
# OPENAI_API_KEY=

## === Proxy ===
# PROXY_SERVER can be a full URL (e.g. http://0.1.2.3:1234) or just an IP and port combo (e.g. 0.1.2.3:1234)
# Do not uncomment PROXY_USERNAME and PROXY_PASSWORD if your proxy is unauthenticated
# PROXY_SERVER=
# PROXY_USERNAME=
# PROXY_PASSWORD=

## === /search API ===
# By default, the /search API will use Google search.

# You can specify a SearXNG server with the JSON format enabled, if you'd like to use that instead of direct Google.
# You can also customize the engines and categories parameters, but the defaults should also work just fine.
# SEARXNG_ENDPOINT=http://your.searxng.server
# SEARXNG_ENGINES=
# SEARXNG_CATEGORIES=

## === Other ===

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
# SUPABASE_ANON_TOKEN=
# SUPABASE_URL=
# SUPABASE_SERVICE_TOKEN=

# Use if you've set up authentication and want to test with a real API key
# TEST_API_KEY=

# You can add this to enable ScrapingBee as a fallback scraping engine.
# SCRAPING_BEE_API_KEY=

# This key lets you access the queue admin panel. Change this if your deployment is publicly accessible.
BULL_AUTH_KEY=CHANGEME

# This is now autoconfigured by the docker-compose.yaml. You shouldn't need to set it.
# PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape
# REDIS_URL=redis://redis:6379
# REDIS_RATE_LIMIT_URL=redis://redis:6379

# Set if you have a llamaparse key you'd like to use to parse pdfs
# LLAMAPARSE_API_KEY=

# Set if you'd like to send server health status messages to Slack
# SLACK_WEBHOOK_URL=

# Set if you'd like to send posthog events like job logs
# POSTHOG_API_KEY=
# POSTHOG_HOST=

Build and run the Docker containers:
```
docker compose build
docker compose up
```

This will run a local instance of Firecrawl which can be accessed at http://localhost:3002.

You should be able to see the Bull Queue Manager UI on http://localhost:3002/admin/CHANGEME/queues.

(Optional) Test the API

If you’d like to test the crawl endpoint, you can run this:

curl -X POST http://localhost:3002/v1/crawl \
    -H 'Content-Type: application/json' \
    -d '{
      "url": "https://firecrawl.dev"
    }'

Troubleshooting

This section provides solutions to common issues you might encounter while setting up or running your self-hosted instance of Firecrawl.

API Keys for SDK Usage

Note: When using Firecrawl SDKs with a self-hosted instance, API keys are optional. API keys are only required when connecting to the cloud service (api.firecrawl.dev).

Supabase client is not configured

Symptom:

[YYYY-MM-DDTHH:MM:SS.SSSz]ERROR - Attempted to access Supabase client when it's not configured.
[YYYY-MM-DDTHH:MM:SS.SSSz]ERROR - Error inserting scrape event: Error: Supabase client is not configured.

Explanation: This error occurs because the Supabase client setup is not completed. You should be able to scrape and crawl with no problems. Right now it's not possible to configure Supabase in self-hosted instances.

You're bypassing authentication

Symptom:

[YYYY-MM-DDTHH:MM:SS.SSSz]WARN - You're bypassing authentication

Explanation: This error occurs because the Supabase client setup is not completed. You should be able to scrape and crawl with no problems. Right now it's not possible to configure Supabase in self-hosted instances.

Docker containers fail to start

Symptom: Docker containers exit unexpectedly or fail to start.

Solution: Check the Docker logs for any error messages using the command:

docker logs [container_name]

Ensure all required environment variables are set correctly in the .env file.
Verify that all Docker services defined in docker-compose.yml are correctly configured and the necessary images are available.

Connection issues with Redis

Symptom: Errors related to connecting to Redis, such as timeouts or "Connection refused".

Solution:

Ensure that the Redis service is up and running in your Docker environment.
Verify that the REDIS_URL and REDIS_RATE_LIMIT_URL in your .env file point to the correct Redis instance, ensure that it points to the same URL in the docker-compose.yaml file (redis://redis:6379)
Check network settings and firewall rules that may block the connection to the Redis port.

API endpoint does not respond

Symptom: API requests to the Firecrawl instance timeout or return no response.

Solution:

Ensure that the Firecrawl service is running by checking the Docker container status.
Verify that the PORT and HOST settings in your .env file are correct and that no other service is using the same port.
Check the network configuration to ensure that the host is accessible from the client making the API request.

By addressing these common issues, you can ensure a smoother setup and operation of your self-hosted Firecrawl instance.

Install Firecrawl on a Kubernetes Cluster (Simple Version)

Read the examples/kubernetes/cluster-install/README.md for instructions on how to install Firecrawl on a Kubernetes Cluster.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SELF_HOST.md

SELF_HOST.md

Self-hosting Firecrawl

Contributor?

Why?

Considerations

Steps

Troubleshooting

API Keys for SDK Usage

Supabase client is not configured

You're bypassing authentication

Docker containers fail to start

Connection issues with Redis

API endpoint does not respond

Install Firecrawl on a Kubernetes Cluster (Simple Version)

Files

SELF_HOST.md

Latest commit

History

SELF_HOST.md

File metadata and controls

Self-hosting Firecrawl

Contributor?

Why?

Considerations

Steps

Troubleshooting

API Keys for SDK Usage

Supabase client is not configured

You're bypassing authentication

Docker containers fail to start

Connection issues with Redis

API endpoint does not respond

Install Firecrawl on a Kubernetes Cluster (Simple Version)