
Crawler doesn't respect configuration argument #539

Open
tlinhart opened this issue Sep 23, 2024 · 1 comment · May be fixed by #559
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@tlinhart

Consider this sample program:

import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": category.xpath("normalize-space()").get()})


async def main() -> None:
    config = Configuration(persist_storage=False, write_metadata=False)
    crawler = ParselCrawler(request_handler=default_handler, configuration=config)
    await crawler.run(["https://books.toscrape.com"])
    data = await crawler.get_data()
    print(data.items)


if __name__ == "__main__":
    asyncio.run(main())

The configuration argument given to ParselCrawler is not respected; during the run it creates the ./storage directory and persists all the (meta)data anyway. I have to work around it by overriding the global configuration like this:

import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": category.xpath("normalize-space()").get()})


async def main() -> None:
    config = Configuration.get_global_configuration()
    config.persist_storage = False
    config.write_metadata = False
    crawler = ParselCrawler(request_handler=default_handler)
    await crawler.run(["https://books.toscrape.com"])
    data = await crawler.get_data()
    print(data.items)


if __name__ == "__main__":
    asyncio.run(main())
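As a side note, another possible workaround (unverified, and an assumption on my part) would be to set the options through environment variables before the process starts, since Configuration appears to be a pydantic-settings model. The variable names CRAWLEE_PERSIST_STORAGE and CRAWLEE_WRITE_METADATA are inferred from the field names and should be double-checked against the Configuration field aliases:

```shell
# Hypothetical workaround: override the global configuration via environment
# variables instead of mutating it in code. Assumes the CRAWLEE_-prefixed
# variables map onto persist_storage/write_metadata; verify before relying on it.
CRAWLEE_PERSIST_STORAGE=false CRAWLEE_WRITE_METADATA=false python main.py
```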
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 23, 2024
@janbuchar janbuchar added the bug Something isn't working. label Sep 23, 2024
@janbuchar
Collaborator

janbuchar commented Sep 23, 2024

Hello, and thanks for the reproduction! It seems that the problem is here:

https://github.com/apify/crawlee-python/blob/master/src/crawlee/storages/_creation_management.py#L122-L132

It looks like service_container.get_storage_client does not consider the adjusted configuration.

Also, we have a test for this - https://github.com/apify/crawlee-python/blob/master/tests/unit/basic_crawler/test_basic_crawler.py#L630-L639 - which probably fails because we're looking inside a different storage directory than the global one.
