
docs: add superscraper blog #2828

Open · souravjain540 wants to merge 4 commits into master
Conversation

souravjain540 (Collaborator):
Blog post about SuperScraper, approved by Lukas, Rado, and marketing.

souravjain540 requested a review from B4nan on January 29, 2025.
website/blog/authors.yml (outdated, resolved)
```yaml
---
slug: superscraper-with-crawlee
title: 'Inside implementing Superscraper with Crawlee.'
description: 'This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.'
```
janbuchar (Contributor):

So far, the crawlee blog has only been about crawlee and we haven't mentioned Apify much. This is quite a change in direction. Are you sure we want to publish this here? Just to be sure.

souravjain540 (Collaborator, author):

I think it's good to have such pieces, too; after all, the best way to build Actors should be with Crawlee. There will be more in the future.

B4nan (Member):

in the title it's Superscraper, in the description it's SuperScraper

also, it feels weird to say "this blog"; it's a blog post or article. "blog" is the platform where we serve it, where all the articles are, no?

```js
return crawler;
```

### Mapping standby HTTP requests to Crawlee requests
janbuchar (Contributor) · Jan 30, 2025:

These are just isolated functions and you need to browse the repo (or use a lot of imagination) to see how it all fits together. I'd appreciate a more top-down approach - maybe start with a code snippet with the big picture and then show how the individual functions are implemented?
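For illustration, a rough sketch of the kind of big-picture snippet being requested; this is an assumption about the overall flow rather than the Actor's actual code, and `parseCrawlerOptions`, `getOrCreateCrawler`, and `parseTargetUrl` are hypothetical helpers (`addResponse` appears later in the post):

```ts
import { randomUUID } from 'node:crypto';
import { createServer } from 'node:http';

// Standby flow: each incoming HTTP request becomes a Crawlee request.
// The ServerResponse is parked in a map and answered from the crawler
// once scraping finishes.
const server = createServer(async (req, res) => {
    const crawlerOptions = parseCrawlerOptions(req); // hypothetical helper
    const crawler = await getOrCreateCrawler(crawlerOptions); // hypothetical helper

    const responseId = randomUUID();
    addResponse(responseId, res); // park the response until the crawler answers it

    await crawler.addRequests([{
        url: parseTargetUrl(req), // hypothetical helper
        uniqueKey: responseId,
        userData: { responseId },
    }]);
});

server.listen(8080);
```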


```yaml
slug: superscraper-with-crawlee
title: 'Inside implementing Superscraper with Crawlee.'
description: 'This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.'
image: "./img/superscraper.webp"
```
B4nan (Member):

nit: inconsistent quoting

Suggested change:

```diff
-image: "./img/superscraper.webp"
+image: './img/superscraper.webp'
```

authors: [SauravJ, RadoC]
---

[SuperScraper](https://github.com/apify/super-scraper) is an open-source Actor that combines features from various web scraping services, including [ScrapingBee](https://www.scrapingbee.com/), [ScrapingAnt](https://scrapingant.com/), and [ScraperAPI](https://www.scraperapi.com/).
B4nan (Member):

Maybe add a link on the word Actor to https://docs.apify.com/platform/actors, so people are not confused by it. We don't use that term in the Crawlee docs much, only in the platform and deployment guides most likely.


### Handling multiple crawlers

SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
B4nan (Member):

Suggested change:

```diff
-SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
+SuperScraper processes user requests using multiple instances of Crawlee’s `PlaywrightCrawler`. Since each `PlaywrightCrawler` instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.
```

I would also add a link to https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler
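For orientation, a minimal sketch of how such a per-proxy crawler factory could look; the options shape and the `keepAlive` usage are assumptions here, and only `createAndStartCrawler` and the `crawlers` map appear in the post itself:

```ts
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Cache of crawlers, keyed by their serialized options (see snippet below).
const crawlers = new Map<string, PlaywrightCrawler>();

async function createAndStartCrawler(crawlerOptions: { proxyUrls?: string[] }) {
    const proxyConfiguration = crawlerOptions.proxyUrls?.length
        ? new ProxyConfiguration({ proxyUrls: crawlerOptions.proxyUrls })
        : undefined;

    const crawler = new PlaywrightCrawler({
        keepAlive: true, // keep the crawler alive even when its queue is empty
        proxyConfiguration,
        requestHandler: async ({ request, page }) => {
            // scrape the page and answer the parked HTTP response here
        },
    });

    // Intentionally not awaited: run() only resolves once the crawler stops.
    void crawler.run();
    crawlers.set(JSON.stringify(crawlerOptions), crawler);
    return crawler;
}
```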

```ts
const key = JSON.stringify(crawlerOptions);
const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);

await crawler.requestQueue!.addRequest(request);
```
B4nan (Member):

I guess this is how they do it in the Actor, but I wouldn't tell our users to do it; they should use `crawler.addRequests()` instead of working with the RQ directly. It's a rather internal property that we might hide completely in the next major version.

Suggested change:

```diff
-await crawler.requestQueue!.addRequest(request);
+await crawler.addRequests([request]);
```

Alternatively, there is a public getter which ensures the RQ exists.
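Concretely, the two approaches might look like this; the getter is presumably `getRequestQueue()`:

```ts
// Preferred: let the crawler manage its request queue internally.
await crawler.addRequests([request]);

// Alternative via the public getter, which ensures the queue exists.
const requestQueue = await crawler.getRequestQueue();
await requestQueue.addRequest(request);
```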


```js
export const addResponse = (responseId: string, response: ServerResponse) =>{
responses.set(responseId, response);
};
```
B4nan (Member):

Suggested change:

```diff
-responses.set(responseId, response);
+    responses.set(responseId, response);
```

The following function stores a response object in the key-value map:

```js
export const addResponse = (responseId: string, response: ServerResponse) =>{
```
B4nan (Member):

Suggested change:

```diff
-export const addResponse = (responseId: string, response: ServerResponse) =>{
+export const addResponse = (responseId: string, response: ServerResponse) => {
```

also I would prefer a regular function, not an arrow function expression
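For comparison, a sketch of the same helper as a regular function declaration; the `responses` map declaration is assumed, as it is not shown in this diff:

```ts
import type { ServerResponse } from 'node:http';

// Pending HTTP responses, keyed by the id of the Crawlee request that will answer them.
const responses = new Map<string, ServerResponse>();

// Stores a pending response object so the crawler can answer it later.
export function addResponse(responseId: string, response: ServerResponse) {
    responses.set(responseId, response);
}
```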


B4nan (Member):

the code is TS, not JS, since you have type hints in there. I can see the highlighting wouldn't be happy about this. Most (if not all) examples here should use ts.

```yaml
url: https://github.com/chudovskyr
image_url: https://ca.slack-edge.com/T0KRMEKK6-U04MGU11VUK-7f59c4a9343b-512
socials:
  github: chudovskyr
```
B4nan (Member):

keep the line break at the end of the file


```js
Actor.on('migrating', ()=>{
addTimeoutToAllResponses(60);
});
```
B4nan (Member):

Suggested change:

```diff
-addTimeoutToAllResponses(60);
+    addTimeoutToAllResponses(60);
```
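For context, `addTimeoutToAllResponses` is not shown in the diff; a minimal sketch of what it might do, assuming the `responses` map from the earlier snippet:

```ts
// Give every pending response a deadline (here: before the Actor migrates),
// so no client is left waiting on a connection that will never be answered.
export function addTimeoutToAllResponses(timeoutSeconds: number) {
    for (const [responseId, response] of responses) {
        setTimeout(() => {
            if (!response.writableEnded) {
                response.writeHead(504, { 'Content-Type': 'application/json' });
                response.end(JSON.stringify({ errorMessage: 'Scraping timed out' }));
            }
            responses.delete(responseId);
        }, timeoutSeconds * 1000);
    }
}
```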
