
feat: Crawlee + Cheerio templates code refactoring + Readme update #172

Merged · 18 commits · Aug 14, 2023
10 changes: 8 additions & 2 deletions templates/js-crawlee-cheerio/.actor/input_schema.json
@@ -10,9 +10,15 @@
"editor": "requestListSources",
"prefill": [
{
"url": "https://apify.com"
"url": "https://crawlee.dev"
}
]
}
},
"maxRequestsPerCrawl": {
"title": "Max Requests per Crawl",
"type": "integer",
"description": "Maximum number of requests that can be made by this crawler.",
"default": 100
}
}
}
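The updated `main.js` consumes this schema by destructuring `Actor.getInput()` with matching defaults. That pattern can be sketched without the SDK; `withDefaults` here is a hypothetical helper (the real template destructures inline), but the default values are the ones the schema declares:

```javascript
// Sketch of the input-defaulting pattern: missing or partial Actor input
// falls back to the same defaults the input schema declares.
function withDefaults(input) {
    const {
        startUrls = ['https://crawlee.dev'],
        maxRequestsPerCrawl = 100,
    } = input ?? {};
    return { startUrls, maxRequestsPerCrawl };
}

// No input at all (e.g. a local run without an input file) still yields usable values.
console.log(withDefaults(null));
// Partial input keeps the provided value and fills in the rest.
console.log(withDefaults({ maxRequestsPerCrawl: 5 }));
```

The `?? {}` guard matters because `Actor.getInput()` can resolve to `null` when no input was supplied; destructuring `null` directly would throw.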
22 changes: 14 additions & 8 deletions templates/js-crawlee-cheerio/README.md
@@ -1,12 +1,18 @@
# CheerioCrawler Actor template
# JavaScript Crawlee & CheerioCrawler Actor template

This template is a production ready boilerplate for developing with `CheerioCrawler`. Use this to bootstrap your projects using the most up-to-date code.
A template example built with [Crawlee](https://crawlee.dev) to scrape data from a website using [Cheerio](https://cheerio.js.org/) wrapped into [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler).

> We decided to split Apify SDK into two libraries, [Crawlee](https://crawlee.dev) and [Apify SDK v3](https://docs.apify.com/sdk/js). Crawlee will retain all the crawling and scraping-related tools and will always strive to be the best web scraping library for its community. At the same time, Apify SDK will continue to exist, but keep only the Apify-specific features related to building actors on the Apify platform. Read the [upgrading guide](https://docs.apify.com/sdk/js/docs/upgrading/upgrading-to-v3) to learn about the changes.
## Included features

If you're looking for examples or want to learn more visit:
- **[Apify SDK](https://docs.apify.com/sdk/js)** - toolkit for building Actors
- **[Crawlee](https://crawlee.dev)** - web scraping and browser automation library
- **[Input schema](https://docs.apify.com/platform/actors/development/input-schema)** - define and easily validate a schema for your Actor's input
- **[Dataset](https://docs.apify.com/platform/storage/dataset)** - store structured data where each object stored has the same attributes
- **[Cheerio](https://cheerio.js.org/)** - a fast, flexible & elegant library for parsing and manipulating HTML and XML

- [Crawlee + Apify Platform guide](https://crawlee.dev/docs/guides/apify-platform)
- [Cheerio Tutorial](https://crawlee.dev/docs/guides/cheerio-crawler-guide)
- [Documentation](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler)
- [Examples](https://crawlee.dev/docs/examples/cheerio-crawler)
## How it works

This JavaScript script uses the Crawlee `CheerioCrawler` to crawl a website and extract data from the crawled pages with Cheerio. It then stores the page titles in a dataset.

- The crawler starts with the URLs provided in the `startUrls` input field defined by the input schema. The number of pages crawled is capped by the `maxRequestsPerCrawl` input field.
- For each URL, the crawler's `requestHandler` extracts data from the page with the Cheerio library and saves the title and URL of the page to the dataset. It also logs each result as it is saved.
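Under the hood, `CheerioCrawler` manages a request queue with deduplication and stops once the cap is reached. A dependency-free sketch of that control flow (`fetchPage` is a hypothetical stand-in for the HTTP request plus Cheerio parsing; the real crawler is asynchronous and far more robust):

```javascript
// Simplified model of the crawl loop: breadth-first over a deduplicated
// queue, stopping after maxRequestsPerCrawl pages have been handled.
function crawl(startUrls, maxRequestsPerCrawl, fetchPage) {
    const queue = [...startUrls];
    const seen = new Set(queue);
    const dataset = [];
    while (queue.length > 0 && dataset.length < maxRequestsPerCrawl) {
        const url = queue.shift();
        const { title, links } = fetchPage(url); // stand-in for requestHandler's work
        dataset.push({ url, title });            // the shape Dataset.pushData receives
        for (const link of links) {
            if (!seen.has(link)) {               // enqueueLinks() dedupes similarly
                seen.add(link);
                queue.push(link);
            }
        }
    }
    return dataset;
}
```

With a two-page fake site whose pages link back to each other, the loop visits each page exactly once, and lowering `maxRequestsPerCrawl` to 1 stops it after the first page.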
19 changes: 15 additions & 4 deletions templates/js-crawlee-cheerio/src/main.js
@@ -7,19 +7,30 @@
// For more information, see https://docs.apify.com/sdk/js
import { Actor } from 'apify';
// For more information, see https://crawlee.dev
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';
import { CheerioCrawler, Dataset } from 'crawlee';

// Initialize the Apify SDK
await Actor.init();

const startUrls = ['https://apify.com'];
const {
startUrls = ['https://crawlee.dev'],
maxRequestsPerCrawl = 100,
} = await Actor.getInput() ?? {};

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
proxyConfiguration,
requestHandler: router,
maxRequestsPerCrawl,
async requestHandler({ enqueueLinks, request, $, log }) {
log.info('enqueueing new URLs');
await enqueueLinks();

const title = $('title').text();
log.info(`${title}`, { url: request.loadedUrl });

await Dataset.pushData({ url: request.loadedUrl, title });
},
});

await crawler.run(startUrls);
21 changes: 0 additions & 21 deletions templates/js-crawlee-cheerio/src/routes.js

This file was deleted.

6 changes: 3 additions & 3 deletions templates/manifest.json
@@ -222,7 +222,7 @@
"crawlee",
"cheerio"
],
"description": "A scraper example that uses HTTP requests and Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.",
"description": "A scraper example that uses Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.",
"archiveUrl": "https://github.com/apify/actor-templates/blob/master/dist/templates/js-crawlee-cheerio.zip?raw=true",
"defaultRunOptions": {
"build": "latest",
@@ -301,7 +301,7 @@
"cheerio"
],
"description": "Skeleton project that helps you quickly bootstrap `CheerioCrawler` in JavaScript. It's best for developers who already know Apify SDK and Crawlee.",
"archiveUrl": "https://github.com/apify/actor-templates/blob/master/dist/templates/ts-crawlee-cheerio.zip?raw=true",
"archiveUrl": "https://github.com/apify/actor-templates/blob/master/dist/templates/js-bootstrap-cheerio-crawler.zip?raw=true",
"defaultRunOptions": {
"build": "latest",
"memoryMbytes": 2048,
@@ -326,7 +326,7 @@
"crawlee",
"cheerio"
],
"description": "A scraper example that uses HTTP requests and Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.",
"description": "A scraper example that uses Cheerio to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.",
"archiveUrl": "https://github.com/apify/actor-templates/blob/master/dist/templates/ts-crawlee-cheerio.zip?raw=true",
"defaultRunOptions": {
"build": "latest",
10 changes: 8 additions & 2 deletions templates/ts-crawlee-cheerio/.actor/input_schema.json
@@ -10,9 +10,15 @@
"editor": "requestListSources",
"prefill": [
{
"url": "https://apify.com"
"url": "https://crawlee.dev"
}
]
}
},
"maxRequestsPerCrawl": {
"title": "Max Requests per Crawl",
"type": "integer",
"description": "Maximum number of requests that can be made by this crawler.",
"default": 100
}
}
}
22 changes: 14 additions & 8 deletions templates/ts-crawlee-cheerio/README.md
@@ -1,12 +1,18 @@
# CheerioCrawler Actor template
# TypeScript Crawlee & CheerioCrawler Actor template

This template is a production ready boilerplate for developing with `CheerioCrawler`. Use this to bootstrap your projects using the most up-to-date code.
A template example built with [Crawlee](https://crawlee.dev) to scrape data from a website using [Cheerio](https://cheerio.js.org/) wrapped into [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler).

> We decided to split Apify SDK into two libraries, [Crawlee](https://crawlee.dev) and [Apify SDK v3](https://docs.apify.com/sdk/js). Crawlee will retain all the crawling and scraping-related tools and will always strive to be the best web scraping library for its community. At the same time, Apify SDK will continue to exist, but keep only the Apify-specific features related to building actors on the Apify platform. Read the [upgrading guide](https://docs.apify.com/sdk/js/docs/upgrading/upgrading-to-v3) to learn about the changes.
## Included features

If you're looking for examples or want to learn more visit:
- **[Apify SDK](https://docs.apify.com/sdk/js)** - toolkit for building Actors
- **[Crawlee](https://crawlee.dev)** - web scraping and browser automation library
- **[Input schema](https://docs.apify.com/platform/actors/development/input-schema)** - define and easily validate a schema for your Actor's input
- **[Dataset](https://docs.apify.com/platform/storage/dataset)** - store structured data where each object stored has the same attributes
- **[Cheerio](https://cheerio.js.org/)** - a fast, flexible & elegant library for parsing and manipulating HTML and XML

- [Crawlee + Apify Platform guide](https://crawlee.dev/docs/guides/apify-platform)
- [Cheerio Tutorial](https://crawlee.dev/docs/guides/cheerio-crawler-guide)
- [Documentation](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler)
- [Examples](https://crawlee.dev/docs/examples/cheerio-crawler)
## How it works

This TypeScript script uses the [Crawlee CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) to crawl a website and extract data from the crawled URLs with Cheerio. It then stores the page titles in a dataset.

- The crawler starts with the URLs provided in the `startUrls` input field defined by the input schema. The number of pages crawled is capped by the `maxRequestsPerCrawl` input field.
- For each URL, the crawler's `requestHandler` extracts data from the page with the Cheerio library and saves the title and URL of the page to the dataset. It also logs each result as it is saved.
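The handler's core extraction step is `$('title').text()`. For illustration only, the same data flow can be mimicked without Cheerio using a naive regex (`extractTitle` and `toRecord` are hypothetical, not part of the template, and a regex is no substitute for a real HTML parser):

```javascript
// Naive stand-in for Cheerio's $('title').text(): pull the <title> text
// out of an HTML string. Real pages need a proper parser such as Cheerio.
function extractTitle(html) {
    const match = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
    return match ? match[1].trim() : '';
}

// Shape of the record the handler pushes to the Dataset.
const toRecord = (url, html) => ({ url, title: extractTitle(html) });
```

In the real template this shape is exactly what `Dataset.pushData({ url: request.loadedUrl, title })` stores for each crawled page.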
4 changes: 2 additions & 2 deletions templates/ts-crawlee-cheerio/package.json
@@ -7,8 +7,8 @@
"node": ">=16.0.0"
},
"dependencies": {
"apify": "^3.0.0",
"crawlee": "^3.0.0"
"apify": "^3.1.8",
"crawlee": "^3.5.0"
},
"devDependencies": {
"@apify/eslint-config-ts": "^0.2.3",
21 changes: 17 additions & 4 deletions templates/ts-crawlee-cheerio/src/main.ts
@@ -7,19 +7,32 @@
// For more information, see https://docs.apify.com/sdk/js
import { Actor } from 'apify';
// For more information, see https://crawlee.dev
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';
import { CheerioCrawler, Dataset } from 'crawlee';

interface Input {
startUrls: string[];
maxRequestsPerCrawl: number;
}

// Initialize the Apify SDK
await Actor.init();

const startUrls = ['https://apify.com'];
const { startUrls, maxRequestsPerCrawl } = await Actor.getInputOrThrow<Input>();

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
proxyConfiguration,
requestHandler: router,
maxRequestsPerCrawl,
requestHandler: async ({ enqueueLinks, request, $, log }) => {
log.info('enqueueing new URLs');
await enqueueLinks();

const title = $('title').text();
log.info(`${title}`, { url: request.loadedUrl });

await Dataset.pushData({ url: request.loadedUrl, title });
},
});

await crawler.run(startUrls);
21 changes: 0 additions & 21 deletions templates/ts-crawlee-cheerio/src/routes.ts

This file was deleted.
