docs: add superscraper blog #2828
base: master
Conversation
```
---
slug: superscraper-with-crawlee
title: 'Inside implementing Superscraper with Crawlee.'
description: 'This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.'
```
So far, the crawlee blog has only been about crawlee and we haven't mentioned Apify much. This is quite a change in direction. Are you sure we want to publish this here? Just to be sure.
I think it's good to have such pieces, too; after all, the best way to build Actors should be Crawlee. There will be more in the future.
In the title it's "Superscraper", in the description it's "SuperScraper".

Also, it feels weird to say "this blog"; it's a blog post or article. "Blog" is the platform where we serve it, where all the articles are, no?
```js
return crawler;
```

### Mapping standby HTTP requests to Crawlee requests
These are just isolated functions and you need to browse the repo (or use a lot of imagination) to see how it all fits together. I'd appreciate a more top-down approach - maybe start with a code snippet with the big picture and then show how the individual functions are implemented?
Co-authored-by: Jan Buchar <jan.buchar@apify.com>
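To the reviewer's point, a top-down snippet could look roughly like this. This is a hypothetical sketch in plain TypeScript, not the Actor's actual code: the crawler is stubbed out (in the real Actor it is a `PlaywrightCrawler`), and the glue function `handleStandbyRequest` is an invented name tying together the pieces quoted in this review.

```typescript
// Hypothetical big-picture sketch of how the pieces could fit together.
// The crawler is stubbed; in the real Actor it is a PlaywrightCrawler.

type CrawlerOptions = { proxyUrl?: string };

interface StubCrawler {
    addRequests(requests: { url: string; uniqueKey: string }[]): Promise<void>;
}

// One crawler per unique proxy configuration, keyed by its serialized options.
const crawlers = new Map<string, StubCrawler>();

// Pending HTTP responses, keyed by request id, so a crawler's handler can
// answer the original standby HTTP request once scraping finishes.
const responses = new Map<string, unknown>();

async function createAndStartCrawler(options: CrawlerOptions): Promise<StubCrawler> {
    const crawler: StubCrawler = {
        async addRequests() {
            // The real crawler would scrape each request and then send the
            // stored response back via its id.
        },
    };
    crawlers.set(JSON.stringify(options), crawler);
    return crawler;
}

// Entry point: map one standby HTTP request to one Crawlee request.
async function handleStandbyRequest(url: string, options: CrawlerOptions, res: unknown): Promise<string> {
    const key = JSON.stringify(options);
    const crawler = crawlers.get(key) ?? await createAndStartCrawler(options);
    const responseId = `${Date.now()}-${Math.random()}`;
    responses.set(responseId, res);
    await crawler.addRequests([{ url, uniqueKey: responseId }]);
    return responseId;
}
```

With this framing, the individual functions shown below (crawler creation, response bookkeeping, migration handling) each slot into one step of the flow.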
```
slug: superscraper-with-crawlee
title: 'Inside implementing Superscraper with Crawlee.'
description: 'This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.'
image: "./img/superscraper.webp"
```
nit: inconsistent quoting
```diff
- image: "./img/superscraper.webp"
+ image: './img/superscraper.webp'
```
```
authors: [SauravJ, RadoC]
---
```

[SuperScraper](https://github.com/apify/super-scraper) is an open-source Actor that combines features from various web scraping services, including [ScrapingBee](https://www.scrapingbee.com/), [ScrapingAnt](https://scrapingant.com/), and [ScraperAPI](https://www.scraperapi.com/).
Maybe add a link on the "Actor" word to https://docs.apify.com/platform/actors, so people are not confused by it. We don't use that term much in the Crawlee docs, only in the platform and deployment guides most likely.
### Handling multiple crawlers

SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
```diff
- SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
+ SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
```
I would also add a link to https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler
```ts
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);

await crawler.requestQueue!.addRequest(request);
```
I guess this is how they do it in the Actor, but I wouldn't tell our users to do it; they should use `crawler.addRequests()` instead of working with the RQ directly. It's rather an internal property that we might hide completely in the next major version.
```diff
- await crawler.requestQueue!.addRequest(request);
+ await crawler.addRequests([request]);
```
Alternatively, there is a public getter which ensures the RQ exists.
```js
export const addResponse = (responseId: string, response: ServerResponse) =>{
    responses.set(responseId, response);
```
The following function stores a response object in the key-value map:

```js
export const addResponse = (responseId: string, response: ServerResponse) =>{
```
```diff
- export const addResponse = (responseId: string, response: ServerResponse) =>{
+ export const addResponse = (responseId: string, response: ServerResponse) => {
```
Also, I would prefer a regular function, not an arrow function expression.
The following function stores a response object in the key-value map:

````
```js
````
The code is TS since you have type hints in there, not JS; I can see the highlighting wouldn't be happy about this. Most (if not all) examples here should use `ts`.
```yaml
url: https://github.com/chudovskyr
image_url: https://ca.slack-edge.com/T0KRMEKK6-U04MGU11VUK-7f59c4a9343b-512
socials:
    github: chudovskyr
```
keep the line break at the end of the file
```js
Actor.on('migrating', ()=>{
    addTimeoutToAllResponses(60);
```
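For context on the `migrating` handler quoted above, here is a hedged sketch of what `addTimeoutToAllResponses` might do. The real implementation lives in the super-scraper repo; the response type and the error payload here are assumptions made for illustration.

```typescript
// Hypothetical sketch: when the Actor starts migrating, give every pending
// response a deadline so clients receive an error instead of hanging forever.

interface PendingResponse {
    end(statusCode: number, body: string): void;
}

const responses = new Map<string, PendingResponse>();

function addTimeoutToAllResponses(timeoutSeconds: number): void {
    // Assumed error payload; the real message may differ.
    const errorBody = JSON.stringify({
        errorMessage: 'Actor is migrating, please retry the request.',
    });
    for (const [responseId, response] of responses) {
        setTimeout(() => {
            // Only answer responses that are still pending at the deadline.
            if (responses.delete(responseId)) {
                response.end(500, errorBody);
            }
        }, timeoutSeconds * 1000);
    }
}
```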
Blog regarding SuperScraper approved by Lukas, Rado, and marketing.