
Crawler doesn't respect configuration argument #539

Open
tlinhart opened this issue Sep 23, 2024 · 1 comment · May be fixed by #559
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@tlinhart

Consider this sample program:

import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": category.xpath("normalize-space()").get()})


async def main() -> None:
    config = Configuration(persist_storage=False, write_metadata=False)
    crawler = ParselCrawler(request_handler=default_handler, configuration=config)
    await crawler.run(["https://books.toscrape.com"])
    data = await crawler.get_data()
    print(data.items)


if __name__ == "__main__":
    asyncio.run(main())

The configuration argument given to ParselCrawler is not respected; during the run it creates the ./storage directory and persists all the (meta)data anyway. I have to work around it by overriding the global configuration like this:

import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": category.xpath("normalize-space()").get()})


async def main() -> None:
    config = Configuration.get_global_configuration()
    config.persist_storage = False
    config.write_metadata = False
    crawler = ParselCrawler(request_handler=default_handler)
    await crawler.run(["https://books.toscrape.com"])
    data = await crawler.get_data()
    print(data.items)


if __name__ == "__main__":
    asyncio.run(main())
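As a side note, another possible workaround (unverified, and an assumption on my part) would be to set the options through environment variables before the process starts, since Configuration appears to be a pydantic-settings model. The variable names CRAWLEE_PERSIST_STORAGE and CRAWLEE_WRITE_METADATA are inferred from the field names and should be double-checked against the Configuration field aliases:

```shell
# Hypothetical workaround: override the global configuration via environment
# variables instead of mutating it in code. Assumes the CRAWLEE_-prefixed
# variables map onto persist_storage/write_metadata; verify before relying on it.
CRAWLEE_PERSIST_STORAGE=false CRAWLEE_WRITE_METADATA=false python main.py
```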
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 23, 2024
@janbuchar janbuchar added the bug Something isn't working. label Sep 23, 2024
@janbuchar
Collaborator

janbuchar commented Sep 23, 2024

Hello, and thanks for the reproduction! It seems that the problem is here:

https://github.com/apify/crawlee-python/blob/master/src/crawlee/storages/_creation_management.py#L122-L132

It looks like service_container.get_storage_client does not consider the adjusted configuration.

Also, we have a test for this - https://github.com/apify/crawlee-python/blob/master/tests/unit/basic_crawler/test_basic_crawler.py#L630-L639 - which probably fails because we're looking inside a different storage directory than the global one.
