
Conversation


@Mantisus Mantisus commented Sep 13, 2025

Description

This PR implements a new storage client, RedisStorageClient, based on Redis v8+. Redis 8 is the minimum supported version because all of the data structures used are available in Redis Open Source 8 and later without any additional modules or extensions.

Testing

  • Added new unit tests
  • fakeredis is used so the tests run without an actual Redis instance (see the sketch below)
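
For context, a minimal sketch of the fakeredis approach (illustrative only, not the actual test fixtures from this PR): fakeredis.FakeAsyncRedis mimics redis.asyncio.Redis entirely in memory, so Redis commands can be exercised without a running server.

import asyncio

import fakeredis


async def main() -> None:
    # In-memory stand-in for redis.asyncio.Redis, no server required.
    redis = fakeredis.FakeAsyncRedis()
    await redis.set('greeting', 'hello')
    print(await redis.get('greeting'))  # b'hello'


if __name__ == '__main__':
    asyncio.run(main())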

@Mantisus Mantisus self-assigned this Sep 13, 2025
@Mantisus Mantisus changed the title feat: Add RedisStorageClient feat: Add RedisStorageClient based on Redis v8.0+ Sep 14, 2025
@Mantisus
Collaborator Author

Performance test.

1 client

Code to run

import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import RedisStorageClient

CONNECTION = 'redis://localhost:6379'


async def main() -> None:
    storage_client = RedisStorageClient(connection_string=CONNECTION)
    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        http_client=http_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 2363       │
│ requests_failed               │ 0          │
│ retry_histogram               │ [2363]     │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 358.4ms    │
│ requests_finished_per_minute  │ 3545       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 14min 6.8s │
│ requests_total                │ 2363       │
│ crawler_runtime               │ 39.99s     │
└───────────────────────────────┴────────────┘

3 clients

Code to run

import asyncio
from concurrent.futures import ProcessPoolExecutor

from crawlee import ConcurrencySettings, service_locator
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import RedisStorageClient
from crawlee.storages import RequestQueue

CONNECTION = 'redis://localhost:6379'

async def run(queue_name: str) -> None:
    storage_client = RedisStorageClient(connection_string=CONNECTION)

    service_locator.set_storage_client(storage_client)
    queue = await RequestQueue.open(name=queue_name)

    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        http_client=http_client,
        request_manager=queue,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])

def process_run(queue_name: str) -> None:
    asyncio.run(run(queue_name))

def multi_run(queue_name: str = 'multi') -> None:
    workers = 3
    with ProcessPoolExecutor(max_workers=workers) as executor:
        executor.map(process_run, [queue_name for _ in range(workers)])

if __name__ == '__main__':
    multi_run()

[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 779        │
│ requests_failed               │ 0          │
│ retry_histogram               │ [779]      │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 356.9ms    │
│ requests_finished_per_minute  │ 2996       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 4min 38.0s │
│ requests_total                │ 779        │
│ crawler_runtime               │ 15.60s     │
└───────────────────────────────┴────────────┘
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 762        │
│ requests_failed               │ 0          │
│ retry_histogram               │ [762]      │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 360.0ms    │
│ requests_finished_per_minute  │ 2931       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 4min 34.3s │
│ requests_total                │ 762        │
│ crawler_runtime               │ 15.60s     │
└───────────────────────────────┴────────────┘
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 822        │
│ requests_failed               │ 0          │
│ retry_histogram               │ [822]      │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 342.2ms    │
│ requests_finished_per_minute  │ 3161       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 4min 41.3s │
│ requests_total                │ 822        │
│ crawler_runtime               │ 15.60s     │
└───────────────────────────────┴────────────┘

@Mantisus Mantisus marked this pull request as ready for review September 15, 2025 18:21
@Mantisus
Collaborator Author

In RedisRequestQueueClient, I used a Redis Bloom filter for deduplication and for tracking handled requests. This differs from the approach in our other clients, which use sets. My main motivation is that Redis is an in-memory database, so memory consumption can be critical in some cases.

Since a Bloom filter is a probabilistic data structure, its size depends on the chosen error probability; I used 1e-7. This means that with probability 1e-7 a membership check returns a false positive. In our case this translates into a probability of skipping a request: with probability 1e-7, a request that was never added to the queue is treated as already added, and likewise a request that hasn't been handled yet is treated as already processed.
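
For illustration, the deduplication check with a Bloom filter boils down to a single BF.ADD per request key; the sketch below shows the idea with plain redis-py commands (key names and sizing are made up for the example, this is not the PR's actual implementation):

import asyncio

from redis.asyncio import Redis


async def main() -> None:
    redis = Redis.from_url('redis://localhost:6379')

    # Pre-size the filter: ~1M expected items at a 1e-7 false-positive rate.
    # BF.RESERVE errors if the key already exists, so guard with EXISTS.
    if not await redis.exists('queue:seen'):
        await redis.execute_command('BF.RESERVE', 'queue:seen', '0.0000001', 1_000_000)

    # BF.ADD returns 1 if the item was new and 0 if it was (probably) seen before,
    # so adding and deduplicating a request key is a single round trip.
    for url in ('https://crawlee.dev', 'https://crawlee.dev', 'https://crawlee.dev/docs'):
        is_new = await redis.execute_command('BF.ADD', 'queue:seen', url)
        print(url, '-> added' if is_new else '-> duplicate, skipped')

    await redis.aclose()


if __name__ == '__main__':
    asyncio.run(main())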

Memory consumption for records of the form 'https://crawlee.dev/{i}' (record size doesn't affect the Bloom filter's size); a measurement sketch follows the numbers:

Redis Bloom filter:

  • 100,000 - 427 KB
  • 1,000,000 - 4 MB
  • 10,000,000 - 42 MB

Redis set:

  • 100,000 - 6 MB
  • 1,000,000 - 61 MB
  • 10,000,000 - 662 MB
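
Numbers like these can be reproduced with the MEMORY USAGE command; a rough sketch (key names are illustrative, and this is not the exact benchmark script used here):

import redis

r = redis.Redis.from_url('redis://localhost:6379')

n = 100_000
urls = [f'https://crawlee.dev/{i}' for i in range(n)]

# Bloom filter sized for n items at a 1e-7 false-positive rate (built into Redis 8).
r.execute_command('BF.RESERVE', 'bench:bloom', '0.0000001', n)
for i in range(0, n, 10_000):
    r.execute_command('BF.MADD', 'bench:bloom', *urls[i:i + 10_000])

# Plain set holding the same members, for comparison.
for i in range(0, n, 10_000):
    r.sadd('bench:set', *urls[i:i + 10_000])

# MEMORY USAGE reports the per-key allocation in bytes (sampled for large sets).
print('bloom filter:', r.memory_usage('bench:bloom'), 'bytes')
print('set:         ', r.memory_usage('bench:set'), 'bytes')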

Discussion about whether it's worth pursuing this approach is welcome!

@janbuchar
Collaborator

I haven't read the PR yet, but I did look into bloom filters for request deduplication in the past and what you wrote piqued my interest 🙂 I am a little worried about the chance of dropping a URL completely, even with a super small probability.

Perhaps we should default to a solution that tolerates some percentage of the "opposite" error and allows a URL to get processed multiple times in rare cases. A fixed-size hash table is an example of such a data structure (see the sketch below). I don't know if anything more sophisticated exists.
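
For illustration, a fixed-size table that stores full URLs and simply overwrites on collision never produces false positives; it only occasionally "forgets" a URL, which may then get re-processed. A minimal sketch of the idea (not a concrete proposal for this PR):

class FixedSizeSeenSet:
    """Fixed-memory deduplication: a collision evicts the older entry, so a URL may
    be re-processed later, but a URL is never wrongly reported as already seen."""

    def __init__(self, slots: int = 1_000_000) -> None:
        self._slots: list[str | None] = [None] * slots

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL was not seen (and record it), False if it was."""
        # hash() is per-process; a persistent variant would use a stable hash.
        index = hash(url) % len(self._slots)
        if self._slots[index] == url:
            return False  # exact match on the stored URL: definitely seen before
        self._slots[index] = url  # new URL, or eviction of a colliding older one
        return True


seen = FixedSizeSeenSet(slots=8)
print(seen.check_and_add('https://crawlee.dev'))  # True: first time
print(seen.check_and_add('https://crawlee.dev'))  # False: deduplicated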

But maybe I have an irrational fear of probabilistic stuff 🙂

@Mantisus
Collaborator Author

> I am a little worried about the chance of dropping a URL completely, even with a super small probability.

Yes, I agree that this may be a bit unsettling. If we go down this route, it will need to be highlighted separately for the user.

But perhaps I'm just not afraid enough of probabilistic structures, since I've used them before. 🙂
