docs: add superscraper blog #2828
base: master
Conversation
```
---
slug: superscraper-with-crawlee
title: 'Inside implementing Superscraper with Crawlee.'
description: 'This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.'
```
So far, the crawlee blog has only been about crawlee and we haven't mentioned Apify much. This is quite a change in direction. Are you sure we want to publish this here? Just to be sure.
I think it's good to have such pieces, too; after all, the best way to build Actors should be Crawlee. There will be more in the future.
In the title it's "Superscraper", in the description it's "SuperScraper".

Also, it feels weird to say "this blog"; it's a blog post or article. "Blog" is the platform where we serve it, where all the articles are, no?
```js
return crawler;
```

### Mapping standby HTTP requests to Crawlee requests
These are just isolated functions and you need to browse the repo (or use a lot of imagination) to see how it all fits together. I'd appreciate a more top-down approach - maybe start with a code snippet with the big picture and then show how the individual functions are implemented?
Co-authored-by: Jan Buchar <jan.buchar@apify.com>
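To the reviewer's point, a top-down snippet could look roughly like this. This is a hypothetical sketch in plain TypeScript, not the Actor's actual code: the crawler is stubbed out (in the real Actor it is a `PlaywrightCrawler`), and the glue function `handleStandbyRequest` is an invented name tying together the pieces quoted in this review.

```typescript
// Hypothetical big-picture sketch of how the pieces could fit together.
// The crawler is stubbed; in the real Actor it is a PlaywrightCrawler.

type CrawlerOptions = { proxyUrl?: string };

interface StubCrawler {
    addRequests(requests: { url: string; uniqueKey: string }[]): Promise<void>;
}

// One crawler per unique proxy configuration, keyed by its serialized options.
const crawlers = new Map<string, StubCrawler>();

// Pending HTTP responses, keyed by request id, so a crawler's handler can
// answer the original standby HTTP request once scraping finishes.
const responses = new Map<string, unknown>();

async function createAndStartCrawler(options: CrawlerOptions): Promise<StubCrawler> {
    const crawler: StubCrawler = {
        async addRequests() {
            // The real crawler would scrape each request and then send the
            // stored response back via its id.
        },
    };
    crawlers.set(JSON.stringify(options), crawler);
    return crawler;
}

// Entry point: map one standby HTTP request to one Crawlee request.
async function handleStandbyRequest(url: string, options: CrawlerOptions, res: unknown): Promise<string> {
    const key = JSON.stringify(options);
    const crawler = crawlers.get(key) ?? await createAndStartCrawler(options);
    const responseId = `${Date.now()}-${Math.random()}`;
    responses.set(responseId, res);
    await crawler.addRequests([{ url, uniqueKey: responseId }]);
    return responseId;
}
```

With this framing, the individual functions shown below (crawler creation, response bookkeeping, migration handling) each slot into one step of the flow.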
```
slug: superscraper-with-crawlee
title: 'Inside implementing Superscraper with Crawlee.'
description: 'This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.'
image: "./img/superscraper.webp"
```
nit: inconsistent quoting
```diff
- image: "./img/superscraper.webp"
+ image: './img/superscraper.webp'
```
```
authors: [SauravJ, RadoC]
---
```

[SuperScraper](https://github.com/apify/super-scraper) is an open-source Actor that combines features from various web scraping services, including [ScrapingBee](https://www.scrapingbee.com/), [ScrapingAnt](https://scrapingant.com/), and [ScraperAPI](https://www.scraperapi.com/).
Maybe add a link on the "Actor" word to https://docs.apify.com/platform/actors, so people are not confused by it. We don't use that term much in the Crawlee docs, only in the platform and deployment guides most likely.
### Handling multiple crawlers

SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
```diff
- SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
+ SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
```
I would also add a link to https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler
```ts
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);

await crawler.requestQueue!.addRequest(request);
```
I guess this is how they do it in the Actor, but I wouldn't tell our users to do it; they should use `crawler.addRequests()` instead of working with the RQ directly. It's rather an internal property that we might hide completely in the next major version.
```diff
- await crawler.requestQueue!.addRequest(request);
+ await crawler.addRequests([request]);
```
Alternatively, there is a public getter which ensures the RQ exists.
```js
export const addResponse = (responseId: string, response: ServerResponse) =>{
    responses.set(responseId, response);
```
The following function stores a response object in the key-value map:

```js
export const addResponse = (responseId: string, response: ServerResponse) =>{
```
```diff
- export const addResponse = (responseId: string, response: ServerResponse) =>{
+ export const addResponse = (responseId: string, response: ServerResponse) => {
```
Also, I would prefer a regular function, not an arrow function expression.
The following function stores a response object in the key-value map:

````
```js
````
The code is TS since you have type hints in there, not JS; I can see the highlighting wouldn't be happy about this. Most (if not all) examples here should use `ts`.
```yaml
url: https://github.com/chudovskyr
image_url: https://ca.slack-edge.com/T0KRMEKK6-U04MGU11VUK-7f59c4a9343b-512
socials:
    github: chudovskyr
```
keep the line break at the end of the file
```js
Actor.on('migrating', ()=>{
    addTimeoutToAllResponses(60);
```
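For context on the `migrating` handler quoted above, here is a hedged sketch of what `addTimeoutToAllResponses` might do. The real implementation lives in the super-scraper repo; the response type and the error payload here are assumptions made for illustration.

```typescript
// Hypothetical sketch: when the Actor starts migrating, give every pending
// response a deadline so clients receive an error instead of hanging forever.

interface PendingResponse {
    end(statusCode: number, body: string): void;
}

const responses = new Map<string, PendingResponse>();

function addTimeoutToAllResponses(timeoutSeconds: number): void {
    // Assumed error payload; the real message may differ.
    const errorBody = JSON.stringify({
        errorMessage: 'Actor is migrating, please retry the request.',
    });
    for (const [responseId, response] of responses) {
        setTimeout(() => {
            // Only answer responses that are still pending at the deadline.
            if (responses.delete(responseId)) {
                response.end(500, errorBody);
            }
        }, timeoutSeconds * 1000);
    }
}
```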
Blog regarding SuperScraper approved by Lukas, Rado, and marketing.