Improve usage docs (#10) #17

Merged
merged 2 commits into from
Nov 13, 2024
76 changes: 76 additions & 0 deletions docs/advanced_usage.md
@@ -0,0 +1,76 @@
# Advanced usage

## Crawling

### Skipping specific requests

The `wacz_crawl_skip` flag is applied to requests that should be ignored by the crawler. When this flag is present, the middleware intercepts the request and prevents it from being processed further, skipping both download and parsing. This is useful in scenarios where the request should not be collected during a scraping session. Usage:

```python
yield Request(url, callback=cb_func, flags=["wacz_crawl_skip"])
```

When this happens, the `webarchive/crawl_skip` statistic is incremented.
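
For instance, a spider callback could flag requests for resources that you do not want collected. This is only a sketch; the site, URLs, and callbacks below are illustrative assumptions:

```python
from scrapy import Request, Spider


class ArchivingSpider(Spider):
    name = "archiving_example"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Follow tag links normally; these are collected as usual.
        for href in response.css("a.tag::attr(href)").getall():
            yield response.follow(href, callback=self.parse_tag)

        # Ask the middleware to skip this request entirely (no download, no parsing).
        yield Request(
            "https://quotes.toscrape.com/robots.txt",
            callback=self.parse_tag,
            flags=["wacz_crawl_skip"],
        )

    def parse_tag(self, response):
        yield {"url": response.url}
```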

### Disallowing archived URLs

If the spider has an `archive_disallow_regexp` attribute, all requests returned from the spider whose URL matches this regular expression are ignored. For example, if a product page was requested in `start_requests` but has since disappeared and now redirects to its category page, the category page can be disallowed. This avoids crawling the whole category, which would take much more time and could lead to unknown URLs (e.g. the spider's requested pagination size could differ from the website default).

When this happens, the `wacz/crawl_skip/disallowed` statistic is incremented.
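
A minimal sketch of the attribute; the pattern below is an assumption about what a redirected category URL might look like on your site:

```python
from scrapy import Spider


class ProductSpider(Spider):
    name = "products"
    # Drop any request from the spider whose URL matches a category listing,
    # e.g. a product page that now redirects to /category/<slug>/.
    archive_disallow_regexp = r"/category/[\w-]+/$"
```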

### Iterating a WACZ archive index

When using a WACZ file that was not generated by your own spiders, a spider for crawling it may not exist yet. To crawl this WACZ you need to tailor a spider to work with this specific WACZ file, which requires building the spider differently from how it would look when targeting the live resource.

Bypassing the spider's default behaviour, the `WaczCrawlMiddleware` spider middleware will, when enabled, replace the crawl with an iteration through all the entries in the WACZ archive index.

#### Configuration

To use this strategy, enable both the spider middleware and the downloader middleware in the spider settings like so:

```python
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
"scrapy_webarchive.spidermiddlewares.WaczCrawlMiddleware": 543,
}
```

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```

#### Controlling the crawl

Not all URLs will be interesting for the crawl, since your WACZ will most likely contain static files such as fonts, JavaScript (first- and third-party), stylesheets, etc. To improve the performance of the spider by not reading all the irrelevant request/response entries, you can configure the `archive_regex` attribute in your spider:

```python
class MyWaczSpider(Spider):
name = "myspider"
archive_regex = r"^/tag/[\w-]+/$"
```

If the spider has an `archive_regex` attribute, only response URLs matching this regular expression are presented in `start_requests`. To visualise this, the spider above will only crawl the CDXJ records marked with `>` below:

```
com,toscrape,quotes)/favicon.ico 20241007081411465 {...}
com,gstatic,fonts)/s/raleway/v34/1ptug8zys_skggpnyc0it4ttdfa.woff2 {...}
com,googleapis,fonts)/css?family=raleway%3A400%2C700 20241007081525229 {...}
com,toscrape,quotes)/static/bootstrap.min.css 20241007081525202 {...}
com,toscrape,quotes)/static/main.css 20241007081525074 {...}
> com,toscrape,quotes)/tag/books/ 20241007081513898 {...}
> com,toscrape,quotes)/tag/friends/ 20241007081520928 {...}
> com,toscrape,quotes)/tag/friendship/ 20241007081519648 {...}
> com,toscrape,quotes)/tag/humor/ 20241007081512594 {...}
> com,toscrape,quotes)/tag/inspirational/ 20241007081506990 {...}
> com,toscrape,quotes)/tag/life/ 20241007081510349 {...}
> com,toscrape,quotes)/tag/love/ 20241007081503814 {...}
> com,toscrape,quotes)/tag/reading/ 20241007081516781 {...}
> com,toscrape,quotes)/tag/simile/ 20241007081524944 {...}
> com,toscrape,quotes)/tag/truth/ 20241007081523804 {...}
```
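
Putting the pieces together, a minimal spider for this strategy might look like the sketch below; the settings mirror the configuration above, while the `parse` selectors and item fields are illustrative assumptions:

```python
from scrapy import Spider


class QuotesTagSpider(Spider):
    name = "quotes_tags"
    # Only archive index entries whose URL matches this pattern are fed into start_requests.
    archive_regex = r"^/tag/[\w-]+/$"

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_webarchive.spidermiddlewares.WaczCrawlMiddleware": 543,
        },
        "SW_WACZ_SOURCE_URI": "s3://scrapy-webarchive/archive.wacz",
        "SW_WACZ_CRAWL": True,
    }

    def parse(self, response):
        # Illustrative extraction; adapt the selectors to your own archived pages.
        for quote in response.css(".quote span.text::text").getall():
            yield {"tag_page": response.url, "quote": quote}
```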
38 changes: 9 additions & 29 deletions docs/usage.md
@@ -1,5 +1,10 @@
# Usage

The general use of this plugin is separated into two parts: exporting and crawling.

1. **Exporting**: Run your spider with the extension enabled to generate and export a WACZ file. This WACZ archive can be used in future crawls to retrieve historical data, or simply to decrease the load on the website when your spider has changed but needs to run on the same data.
2. **Crawling**: Re-run your spider on a WACZ archive that was generated previously. This time no new WACZ archive is generated; instead, each response is retrieved from the WACZ rather than requested from the live resource (website). The WACZ contains complete response data that is reconstructed into actual `Response` objects.

## Exporting

### Exporting a WACZ archive
@@ -12,21 +17,19 @@ EXTENSIONS = {
}
```

This extension also requires you to set the export location using the `SW_EXPORT_URI` settings.
This extension also requires you to set the export location using the `SW_EXPORT_URI` setting (check the settings page for different options for exporting).

```python
SW_EXPORT_URI = "s3://scrapy-webarchive/"
```

Running a crawl job using these settings will result in a newly created WACZ file.
Running a crawl job using these settings will result in a newly created WACZ file at the specified output location.

## Crawling

There are two ways to crawl against a WACZ archive. Choose the strategy that you want to use for your crawl job and follow the instructions described below.

### Lookup in a WACZ archive
### Using the download middleware

One of the ways to crawl against a WACZ archive is to use the `WaczMiddleware` downloader middleware. Instead of fetching the live resource the middleware will instead retrieve it from the archive and recreate a response using the data from the archive.
To crawl against a WACZ archive you need to use the `WaczMiddleware` downloader middleware. Instead of fetching the live resource, the middleware retrieves it from the archive and recreates a `Response` using the data from the archive.

To use the downloader middleware, enable it in the settings like so:

@@ -42,26 +45,3 @@ Then define the location of the WACZ archive with `SW_WACZ_SOURCE_URI` setting:
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```

### Iterating a WACZ archive

Going around the default behaviour of the spider, the `WaczCrawlMiddleware` spider middleware will, when enabled, replace the crawl by an iteration through all the entries in the WACZ archive index. Then, similar to the previous strategy, it will recreate a response using the data from the archive.

To use this strategy, enable both middlewares in the spider settings like so:

```python
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
"scrapy_webarchive.spidermiddlewares.WaczCrawlMiddleware": 543,
}
```

Then define the location of the WACZ archive with `SW_WACZ_SOURCE_URI` setting:

```python
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -9,4 +9,5 @@ nav:
- Introduction: index.md
- installation.md
- usage.md
- Advanced Usage: advanced_usage.md
- settings.md