Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Split and add extra configuration to export_data method #580

Merged
merged 25 commits into from
Oct 31, 2024

Conversation

deshansh
Copy link
Contributor

@deshansh deshansh commented Oct 7, 2024

Description

  • Split the export_data function into export_data_csv and export_data_json, and added additional configuration options using kwargs

Issues

Testing

  • Added test to check the export functionally as well for extra configurations

Checklist

  • CI passed

@deshansh deshansh changed the title Split and added extra configuration to export_data function feat: Split and added extra configuration to export_data function Oct 7, 2024
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I have a few comments...

Regarding the removal of export_data, I'm not sure we should do that. Removing it would break parity with the JS Crawlee and introduce a breaking change (in which case, you'd need to use the feat! commit prefix).

Maybe we can keep the export_data and call the appropriate methods within it, like this:

async def export_data(
    self,
    path: str | Path,
    *,
    content_type: Literal['csv', 'json'] = 'csv',
    dataset_id: str | None = None,
    dataset_name: str | None = None,
    **kwargs: Any,
) -> None:
    if content_type == 'csv':
        self.export_data_csv(path, dataset_id=dataset_id, dataset_name=dataset_name, **kwargs)
    elif content_type == 'json':
        self.export_data_json(path, dataset_id=dataset_id, dataset_name=dataset_name, **kwargs)
    else:
        raise ValueError(f'Unsupported content type: {content_type}.')

src/crawlee/storages/_dataset.py Outdated Show resolved Hide resolved
src/crawlee/basic_crawler/_basic_crawler.py Show resolved Hide resolved
src/crawlee/basic_crawler/_basic_crawler.py Show resolved Hide resolved
src/crawlee/basic_crawler/_basic_crawler.py Outdated Show resolved Hide resolved
src/crawlee/basic_crawler/_basic_crawler.py Outdated Show resolved Hide resolved
src/crawlee/storages/_dataset.py Outdated Show resolved Hide resolved
src/crawlee/storages/_dataset.py Outdated Show resolved Hide resolved
src/crawlee/storages/_dataset.py Outdated Show resolved Hide resolved
src/crawlee/basic_crawler/_basic_crawler.py Show resolved Hide resolved
src/crawlee/basic_crawler/_basic_crawler.py Show resolved Hide resolved
renovate bot and others added 16 commits October 19, 2024 11:53
- closes apify#563

@tlinhart any chance you could test this and confirm it fixes the issue?
- This adds a `get_key_value_store(id, name)` context helper to
`BasicCrawlingContext`
- Also, push_data calls are held until the request handler terminates
successfully (same as in JS version)
    - This is necessary for adaptive crawling (apify#249), among other things
- rm unused pipeline
- use env var for node versions (same as we do for python)
@janbuchar janbuchar requested a review from vdusek October 31, 2024 15:24
src/crawlee/storages/_dataset.py Outdated Show resolved Hide resolved
src/crawlee/storages/_dataset.py Show resolved Hide resolved
@vdusek vdusek changed the title feat: Split and added extra configuration to export_data function feat: Split and add extra configuration to export_data function Oct 31, 2024
@vdusek vdusek changed the title feat: Split and add extra configuration to export_data function feat: Split and add extra configuration to export_data method Oct 31, 2024
@janbuchar janbuchar requested a review from vdusek October 31, 2024 16:22
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vdusek vdusek merged commit 6751635 into apify:master Oct 31, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement/document a way how to pass extra configuration to json.dump()
3 participants