docs: add superscraper blog #2828

Open · wants to merge 4 commits into master
189 changes: 189 additions & 0 deletions website/blog/2025/01-29-superscraper/index.md
@@ -0,0 +1,189 @@
---
slug: superscraper-with-crawlee
title: 'Inside implementing Superscraper with Crawlee.'
description: 'This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.'

> **Contributor:** So far, the crawlee blog has only been about crawlee and we haven't mentioned Apify much. This is quite a change in direction. Are you sure we want to publish this here? Just to be sure.

> **Collaborator (Author):** I think it's good to have such pieces, too; after all, the best way to build Actors should be Crawlee. There will be more in the future.

> **Member:** In the title it's "Superscraper", in the description it's "SuperScraper".
>
> Also, it feels weird to say "this blog"; it's a blog post or article. "Blog" is the platform where we serve it, where all the articles are, no?

image: "./img/superscraper.webp"

> **Member:** nit: inconsistent quoting
>
> Suggested change: `image: "./img/superscraper.webp"` → `image: './img/superscraper.webp'`

authors: [SauravJ, RadoC]
---

[SuperScraper](https://github.com/apify/super-scraper) is an open-source Actor that combines features from various web scraping services, including [ScrapingBee](https://www.scrapingbee.com/), [ScrapingAnt](https://scrapingant.com/), and [ScraperAPI](https://www.scraperapi.com/).

> **Member:** Maybe add a link on the word "Actor" to https://docs.apify.com/platform/actors, so people are not confused by that. We don't use that term in the Crawlee docs much, only in the platform and deployment guides most likely.

A key capability is its standby mode, which runs the Actor as a persistent API server. This removes the usual start-up times - a common pain point in many systems - and lets users make direct API calls to interact with the system immediately.

This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.

### What is SuperScraper?

SuperScraper transforms a traditional scraper into an API server. Instead of running with static inputs and waiting for completion, it starts only once, stays active, and listens for incoming requests.
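
For a sense of how that looks from the outside, calling a standby Actor is just an ordinary HTTP request to its standby URL. The hostname, query parameter, and auth header below are placeholders for illustration, not SuperScraper's documented interface; the project README lists the parameters it actually supports:

```js
// Hypothetical call to a standby Actor; the host, `url` parameter, and token
// header are placeholders, so check the SuperScraper README for the real interface.
const standbyUrl = new URL('https://your-superscraper.apify.actor/');
standbyUrl.searchParams.set('url', 'https://example.com');

const response = await fetch(standbyUrl, {
    headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` },
});

console.log(await response.text()); // scraped content is returned directly in the response body
```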

### How to enable standby mode

To activate standby mode, enable it in the Actor's settings so that the Actor listens for incoming requests.

![Activating Actor standby mode](./img/actor-standby.webp)

### Server setup

The project uses the Node.js `http` module to create a server that listens on the desired port. After the server starts, a check makes sure users interact with it correctly, that is, by sending requests to it rather than running the Actor in the traditional way. This keeps SuperScraper operating as a persistent server.
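
As a rough sketch of that setup (not the actual SuperScraper source; the `ACTOR_STANDBY_PORT` variable and the exact readiness check are assumptions about how standby Actors are typically wired up):

```js
import { createServer } from 'node:http';
import { Actor } from 'apify';

await Actor.init();

// Port assigned by the platform in standby mode; 3000 is a local fallback (assumed variable name).
const port = Number(process.env.ACTOR_STANDBY_PORT ?? 3000);

const server = createServer(async (req, res) => {
    // A bare request without parameters usually means someone "ran" the Actor the traditional way
    // instead of calling it as an API, so reply with a hint rather than scraping anything.
    if (!req.url || req.url === '/') {
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end('SuperScraper is running in standby mode. Call it with a ?url=... parameter.');
        return;
    }
    // ...map the incoming request to a Crawlee request here (covered in the next sections).
});

server.listen(port, () => console.log(`SuperScraper is listening on port ${port}`));
```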

### Handling multiple crawlers

SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.

> **Member:** Suggested change: "Since each `PlaywrightCrawler` can only handle one proxy configuration" → "Since each `PlaywrightCrawler` instance can only handle one proxy configuration".
>
> I would also add a link to https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler

For example, if the user sends one request for “normal” proxies and one request with residential US proxies, a separate crawler needs to be created for each proxy configuration. To solve this, we store the crawlers in a key-value map, where the key is a stringified proxy configuration.

```js
const crawlers = new Map<string, PlaywrightCrawler>();
```

Here’s the part of the code that runs when a new user request arrives: if a crawler for this proxy configuration already exists in the map, it is reused; otherwise, a new crawler is created. The request is then added to the crawler’s queue so it can be processed.

```js
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);

await crawler.requestQueue!.addRequest(request);
```

> **Member:** I guess this is how they do it in the Actor, but I wouldn't tell our users to do it; they should use `crawler.addRequests()` instead of working with the RQ directly. It's rather an internal property that we might hide completely in the next major version.
>
> Suggested change: `await crawler.requestQueue!.addRequest(request);` → `await crawler.addRequests([request]);`
>
> Alternatively, there is a public getter which ensures the RQ exists.

The function below initializes new crawlers with predefined settings and behaviors. Each crawler utilizes its own in-memory queue created with the `MemoryStorage` client. This approach is used for two key reasons:

1. **Performance**: In-memory queues are faster, and there's no need to persist them when SuperScraper migrates.
2. **Isolation**: Using a separate queue prevents interference with the shared default queue of the SuperScraper Actor, avoiding potential bugs when multiple crawlers use it simultaneously.

```js
export const createAndStartCrawler = async (crawlerOptions: CrawlerOptions = DEFAULT_CRAWLER_OPTIONS) => {
    const client = new MemoryStorage();
    const queue = await RequestQueue.open(undefined, { storageClient: client });

    const proxyConfig = await Actor.createProxyConfiguration(crawlerOptions.proxyConfigurationOptions);

    const crawler = new PlaywrightCrawler({
        keepAlive: true,
        proxyConfiguration: proxyConfig,
        maxRequestRetries: 4,
        requestQueue: queue,
        // requestHandler and other options are omitted here for brevity.
    });

    // ...the function continues below: the crawler is started, registered, and returned.
};
```

> **Member:** I wonder if this actually isolates the queues, since they will use the same storage folder for the JSON files. I believe this needs a unique `localDataDirectory` param. Alternatively, disable the persistence via `persistStorage: false`.
>
> cc @metalwarrior665, this probably deserves a fix in the SuperScraper code if it's like that in there too.

At the end of the function, we start the crawler and log a message if it terminates for any reason. Next, we add the newly created crawler to the key-value map containing all crawlers, and finally, we return the crawler.

```js
crawler.run().then(
    () => log.warning(`Crawler ended`, crawlerOptions),
    () => {},
);

crawlers.set(JSON.stringify(crawlerOptions), crawler);

log.info('Crawler ready 🚀', crawlerOptions);

return crawler;
```

### Mapping standby HTTP requests to Crawlee requests

> **Contributor (@janbuchar, Jan 30, 2025):** These are just isolated functions and you need to browse the repo (or use a lot of imagination) to see how it all fits together. I'd appreciate a more top-down approach - maybe start with a code snippet with the big picture and then show how the individual functions are implemented?


The server is created with a request listener function that takes two arguments: the user’s request and a response object. The response object is used to send scraped data back to the user. These response objects are stored in a key-value map so they can be accessed later in the code. The key is a randomly generated string shared between the request and its corresponding response object; it is also used as the Crawlee request’s `uniqueKey`.

```js
const responses = new Map<string, ServerResponse>();
```
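
To see how these pieces fit together before diving into the individual helpers, here is a simplified, hypothetical version of the request listener. The helper names (`createAndStartCrawler`, `addResponse`) follow the snippets in this post, the option parsing is reduced to a bare minimum, and the request is added with the public `crawler.addRequests()` method:

```js
import { createServer } from 'node:http';
import { randomUUID } from 'node:crypto';

const server = createServer(async (req, res) => {
    const url = new URL(req.url ?? '/', 'http://localhost');
    const targetUrl = url.searchParams.get('url');
    if (!targetUrl) {
        res.writeHead(400, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ errorMessage: 'Missing the ?url parameter.' }));
        return;
    }

    // In the real code, the crawler options (proxy groups, country, etc.) are parsed from query params.
    const crawlerOptions = { proxyConfigurationOptions: {} };
    const key = JSON.stringify(crawlerOptions);
    const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);

    // Pair the HTTP response with the Crawlee request through a shared unique key.
    const uniqueKey = randomUUID();
    addResponse(uniqueKey, res);
    await crawler.addRequests([{ url: targetUrl, uniqueKey }]);
});
```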

**Saving response objects**

The following function stores a response object in the key-value map:

```js
export const addResponse = (responseId: string, response: ServerResponse) =>{
    responses.set(responseId, response);
};
```

> **Member:** The code is TS since you have type hints in there, not JS; I can see the highlighting wouldn't be happy about this. Most (if not all) examples here should use ts.

> **Member:** Suggested change: `export const addResponse = (responseId: string, response: ServerResponse) =>{` → `export const addResponse = (responseId: string, response: ServerResponse) => {`
>
> Also, I would prefer a regular function, not an arrow function expression.

> **Member:** Suggested change (whitespace only) on the `responses.set(responseId, response);` line.

**Updating crawler logic to store responses**

Here’s the updated logic for fetching/creating the corresponding crawler for a given proxy configuration, with a call to store the response object:

```js
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);

addResponse(request.uniqueKey!, res);

await crawler.requestQueue!.addRequest(request);
```

**Sending scraped data back**

Once a crawler finishes processing a request, it retrieves the corresponding response object using the key and sends the scraped data back to the user:

```js
export const sendSuccResponseById = (responseId: string, result: unknown, contentType: string) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }

    res.writeHead(200, { 'Content-Type': contentType });
    res.end(result);
    responses.delete(responseId);
};
```
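
For context, this function is the one a crawler's `requestHandler` calls once a page has been scraped. The handler below is a minimal, hypothetical version rather than SuperScraper's actual one, which supports many more extraction options:

```js
const crawler = new PlaywrightCrawler({
    keepAlive: true,
    requestQueue: queue,
    async requestHandler({ request, page }) {
        // Grab the rendered HTML and hand it back to the HTTP response stored under the same key.
        const html = await page.content();
        sendSuccResponseById(request.uniqueKey, html, 'text/html');
    },
});
```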

**Error handling**

There is similar logic to send a response back if an error occurs during scraping:

```js
export const sendErrorResponseById = (responseId: string, result: string, statusCode: number = 500) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }

    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(result);
    responses.delete(responseId);
};
```
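
The error path would typically be wired to the crawler's `failedRequestHandler`, which runs once all retries are exhausted; this is a sketch under the same assumptions as the handler above:

```js
const crawler = new PlaywrightCrawler({
    // ...same options and requestHandler as in the previous sketch...
    async failedRequestHandler({ request }, error) {
        // After the final retry fails, report the error back to the waiting HTTP client.
        sendErrorResponseById(request.uniqueKey, JSON.stringify({ errorMessage: error.message }));
    },
});
```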

**Adding timeouts during migrations**

During migration, SuperScraper adds timeouts to pending responses to handle termination cleanly.

```js
export const addTimeoutToAllResponses = (timeoutInSeconds: number = 60) => {
    const migrationErrorMessage = {
        errorMessage: 'Actor had to migrate to another server. Please, retry your request.',
    };

    // `responses` is a Map, so take its keys directly (Object.keys() would return an empty array here).
    const responseKeys = [...responses.keys()];

    for (const key of responseKeys) {
        setTimeout(() => {
            sendErrorResponseById(key, JSON.stringify(migrationErrorMessage));
        }, timeoutInSeconds * 1000);
    }
};
```

### Managing migrations

SuperScraper handles migrations by timing out active responses to prevent lingering requests during server transitions.

```js
Actor.on('migrating', () => {
    addTimeoutToAllResponses(60);
});
```

> **Member:** Suggested change (whitespace only) on the `addTimeoutToAllResponses(60);` line.

This way, users receive clear feedback during server migrations and the service keeps running predictably.

### Build your own

This guide showed how to build and manage a standby web scraper with Apify’s platform and Crawlee. The implementation keeps one `PlaywrightCrawler` instance per proxy configuration and maps each incoming HTTP request to a Crawlee request, then routes the scraped result back to the waiting HTTP response.

Standby mode transforms SuperScraper into a persistent API server, eliminating start-up delays. The migration handling system keeps operations stable during server transitions. You can build on this foundation to create web scraping tools tailored to your requirements.

To get started, explore the project on [GitHub](https://github.com/apify/super-scraper) or learn more about [Crawlee](https://crawlee.dev/) to build your own scalable web scraping tools.
8 changes: 8 additions & 0 deletions website/blog/authors.yml
@@ -64,3 +64,11 @@ VladaD:
  image_url: https://avatars.githubusercontent.com/u/25082181?v=4
  socials:
    github: vdusek

RadoC:
  name: Radoslav Chudovský
  title: Web Automation Engineer
  url: https://github.com/chudovskyr
  image_url: https://ca.slack-edge.com/T0KRMEKK6-U04MGU11VUK-7f59c4a9343b-512
  socials:
    github: chudovskyr

> **Member:** keep the line break at the end of the file
