From 8dc33687589ee0c7501797f55be808521f22aca3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jind=C5=99ich=20B=C3=A4r?=
Date: Fri, 17 Jan 2025 16:35:53 +0100
Subject: [PATCH] docs: more details about different `ProxyConfiguration`
 options (#2793)

---
 docs/guides/proxy_management.mdx              | 69 ++++++++++++++++++-
 .../version-3.12/guides/proxy_management.mdx  | 69 ++++++++++++++++++-
 2 files changed, 134 insertions(+), 4 deletions(-)

diff --git a/docs/guides/proxy_management.mdx b/docs/guides/proxy_management.mdx
index 50ef9b949713..8bf385f1c5b5 100644
--- a/docs/guides/proxy_management.mdx
+++ b/docs/guides/proxy_management.mdx
@@ -61,7 +61,72 @@ Examples of how to use our proxy URLs with crawlers are shown below in [Crawler
 
 All our proxy needs are managed by the `ProxyConfiguration` class. We create an instance using the `ProxyConfiguration` `constructor` function based on the provided options. See the `ProxyConfigurationOptions` for all the possible constructor options.
 
-### Crawler integration
+### Static proxy list
+
+You can provide a static list of proxy URLs to the `proxyUrls` option. The `ProxyConfiguration` will then rotate through the provided proxies.
+
+```javascript
+const proxyConfiguration = new ProxyConfiguration({
+    proxyUrls: [
+        'http://proxy-1.com',
+        'http://proxy-2.com',
+        null // null means no proxy is used
+    ]
+});
+```
+
+This is the simplest way to use a list of proxies. Crawlee rotates through them in a round-robin fashion.
+
+### Custom proxy function
+
+The `ProxyConfiguration` class allows you to provide a custom function to pick a proxy URL. This is useful when you want to implement your own logic for selecting a proxy.
+
+```javascript
+const proxyConfiguration = new ProxyConfiguration({
+    newUrlFunction: (sessionId, { request }) => {
+        if (request?.url.includes('crawlee.dev')) {
+            return null; // for crawlee.dev, we don't use a proxy
+        }
+
+        return 'http://proxy-1.com'; // for all other URLs, we use this proxy
+    }
+});
+```
+
+The `newUrlFunction` receives two parameters - `sessionId` and `options` - and returns a string containing the proxy URL, or `null` if no proxy should be used.
+
+The `sessionId` parameter is always provided and allows us to differentiate between different sessions - e.g. when Crawlee recognizes your crawlers are being blocked, it will automatically create a new session with a different ID.
+
+The `options` parameter is an object containing the `Request` that will be made. Note that this object is not always available, for example when we are using the `newUrl` function directly. Your custom function should therefore not rely on the `request` object being present, and should provide a default behavior when it is not.
+
+### Tiered proxies
+
+You can also provide a list of proxy tiers to the `ProxyConfiguration` class. This is useful when you want to switch between different proxies automatically based on the blocking behavior of the website.
+
+:::warning
+
+Note that the `tieredProxyUrls` option requires `ProxyConfiguration` to be used from a crawler instance ([see below](#crawler-integration)).
+
+Using this configuration through direct `newUrl()` calls will not yield the expected results.
+
+:::
+
+```javascript
+const proxyConfiguration = new ProxyConfiguration({
+    tieredProxyUrls: [
+        [null], // At first, we try to connect without a proxy
+        ['http://okay-proxy.com'],
+        ['http://slightly-better-proxy.com', 'http://slightly-better-proxy-2.com'],
+        ['http://very-good-and-expensive-proxy.com'],
+    ]
+});
+```
+
+This configuration will start with no proxy, then switch to `http://okay-proxy.com` if Crawlee recognizes we're getting blocked by the target website. If that proxy is also blocked, we will switch to one of the `slightly-better-proxy` URLs. If those are blocked too, we will switch to the `very-good-and-expensive-proxy.com` URL.
+
+Crawlee also periodically probes lower-tier proxies to see if they are unblocked and, if they are, switches back to them.
+
+## Crawler integration
 
 `ProxyConfiguration` integrates seamlessly into `HttpCrawler`, `CheerioCrawler`, `JSDOMCrawler`, `PlaywrightCrawler` and `PuppeteerCrawler`.
 
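+For example, a minimal `CheerioCrawler` setup using the static proxy list from above might look like this (the proxy and target URLs are placeholders):
+
+```javascript
+import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
+
+const proxyConfiguration = new ProxyConfiguration({
+    proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
+});
+
+const crawler = new CheerioCrawler({
+    proxyConfiguration,
+    // Every request made by this crawler is routed through one of the proxies above.
+    requestHandler: async ({ request, proxyInfo }) => {
+        console.log(`Crawled ${request.url} via ${proxyInfo?.url}`);
+    },
+});
+
+await crawler.run(['https://crawlee.dev']);
+```
+
+The same `proxyConfiguration` instance works with the other crawler classes listed above.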
@@ -95,7 +160,7 @@ All our proxy needs are managed by the
 
 `proxyConfiguration.newUrl()` allows us to pass a `sessionId` parameter. It will then be used to create a `sessionId`-`proxyUrl` pair, and subsequent `newUrl()` calls with the same `sessionId` will always return the same `proxyUrl`. This is extremely useful in scraping, because we want to create the impression of a real user. See the [session management guide](../guides/session-management) and `SessionPool` class for more information on how keeping a real session helps us avoid blocking.
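+
+As a rough sketch of this behavior - assuming a `ProxyConfiguration` with the static `proxyUrls` list from the examples above, and arbitrary session names - session pinning works like this:
+
+```javascript
+import { ProxyConfiguration } from 'crawlee';
+
+const proxyConfiguration = new ProxyConfiguration({
+    proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
+});
+
+// Calls sharing the session 'user-1' always resolve to the same proxy URL.
+const urlA = await proxyConfiguration.newUrl('user-1');
+const urlB = await proxyConfiguration.newUrl('user-1');
+console.log(urlA === urlB); // true
+
+// A different session may be assigned a different proxy from the rotation.
+const urlC = await proxyConfiguration.newUrl('user-2');
+console.log(urlC);
+```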
diff --git a/website/versioned_docs/version-3.12/guides/proxy_management.mdx b/website/versioned_docs/version-3.12/guides/proxy_management.mdx
index 50ef9b949713..8bf385f1c5b5 100644
--- a/website/versioned_docs/version-3.12/guides/proxy_management.mdx
+++ b/website/versioned_docs/version-3.12/guides/proxy_management.mdx
@@ -61,7 +61,72 @@ Examples of how to use our proxy URLs with crawlers are shown below in [Crawler
 
 All our proxy needs are managed by the `ProxyConfiguration` class. We create an instance using the `ProxyConfiguration` `constructor` function based on the provided options. See the `ProxyConfigurationOptions` for all the possible constructor options.
 
-### Crawler integration
+### Static proxy list
+
+You can provide a static list of proxy URLs to the `proxyUrls` option. The `ProxyConfiguration` will then rotate through the provided proxies.
+
+```javascript
+const proxyConfiguration = new ProxyConfiguration({
+    proxyUrls: [
+        'http://proxy-1.com',
+        'http://proxy-2.com',
+        null // null means no proxy is used
+    ]
+});
+```
+
+This is the simplest way to use a list of proxies. Crawlee rotates through them in a round-robin fashion.
+
+### Custom proxy function
+
+The `ProxyConfiguration` class allows you to provide a custom function to pick a proxy URL. This is useful when you want to implement your own logic for selecting a proxy.
+
+```javascript
+const proxyConfiguration = new ProxyConfiguration({
+    newUrlFunction: (sessionId, { request }) => {
+        if (request?.url.includes('crawlee.dev')) {
+            return null; // for crawlee.dev, we don't use a proxy
+        }
+
+        return 'http://proxy-1.com'; // for all other URLs, we use this proxy
+    }
+});
+```
+
+The `newUrlFunction` receives two parameters - `sessionId` and `options` - and returns a string containing the proxy URL, or `null` if no proxy should be used.
+
+The `sessionId` parameter is always provided and allows us to differentiate between different sessions - e.g. when Crawlee recognizes your crawlers are being blocked, it will automatically create a new session with a different ID.
+
+The `options` parameter is an object containing the `Request` that will be made. Note that this object is not always available, for example when we are using the `newUrl` function directly. Your custom function should therefore not rely on the `request` object being present, and should provide a default behavior when it is not.
+
+### Tiered proxies
+
+You can also provide a list of proxy tiers to the `ProxyConfiguration` class. This is useful when you want to switch between different proxies automatically based on the blocking behavior of the website.
+
+:::warning
+
+Note that the `tieredProxyUrls` option requires `ProxyConfiguration` to be used from a crawler instance ([see below](#crawler-integration)).
+
+Using this configuration through direct `newUrl()` calls will not yield the expected results.
+
+:::
+
+```javascript
+const proxyConfiguration = new ProxyConfiguration({
+    tieredProxyUrls: [
+        [null], // At first, we try to connect without a proxy
+        ['http://okay-proxy.com'],
+        ['http://slightly-better-proxy.com', 'http://slightly-better-proxy-2.com'],
+        ['http://very-good-and-expensive-proxy.com'],
+    ]
+});
+```
+
+This configuration will start with no proxy, then switch to `http://okay-proxy.com` if Crawlee recognizes we're getting blocked by the target website. If that proxy is also blocked, we will switch to one of the `slightly-better-proxy` URLs. If those are blocked too, we will switch to the `very-good-and-expensive-proxy.com` URL.
+
+Crawlee also periodically probes lower-tier proxies to see if they are unblocked and, if they are, switches back to them.
+
+## Crawler integration
 
 `ProxyConfiguration` integrates seamlessly into `HttpCrawler`, `CheerioCrawler`, `JSDOMCrawler`, `PlaywrightCrawler` and `PuppeteerCrawler`.
 
@@ -95,7 +160,7 @@ All our proxy needs are managed by the
 
 `proxyConfiguration.newUrl()` allows us to pass a `sessionId` parameter. It will then be used to create a `sessionId`-`proxyUrl` pair, and subsequent `newUrl()` calls with the same `sessionId` will always return the same `proxyUrl`. This is extremely useful in scraping, because we want to create the impression of a real user. See the [session management guide](../guides/session-management) and `SessionPool` class for more information on how keeping a real session helps us avoid blocking.
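+
+As a rough sketch of this behavior - assuming a `ProxyConfiguration` with the static `proxyUrls` list from the examples above, and arbitrary session names - session pinning works like this:
+
+```javascript
+import { ProxyConfiguration } from 'crawlee';
+
+const proxyConfiguration = new ProxyConfiguration({
+    proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'],
+});
+
+// Calls sharing the session 'user-1' always resolve to the same proxy URL.
+const urlA = await proxyConfiguration.newUrl('user-1');
+const urlB = await proxyConfiguration.newUrl('user-1');
+console.log(urlA === urlB); // true
+
+// A different session may be assigned a different proxy from the rotation.
+const urlC = await proxyConfiguration.newUrl('user-2');
+console.log(urlC);
+```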