@straux commented Sep 26, 2025

This PR adds partition support for input folders.

Before this PR, each activity ran on every file in the input folder, without taking the partition into account, which produced duplicates in the output.

The fix ensures that we only run each activity on the target input partition.
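
To make the before/after behaviour concrete, here is a minimal sketch. It is not the plugin's code: `FakeFolder` is a hypothetical in-memory stand-in for the input folder, the partition names and file paths are made up, and the sketch assumes that an empty partition identifier means "list every partition". Only the `list_paths_in_partition` method name comes from this PR.

```python
from collections import Counter


class FakeFolder:
    """Hypothetical in-memory stand-in for a partitioned input folder."""

    def __init__(self, paths_by_partition):
        self._paths_by_partition = paths_by_partition

    def list_paths_in_partition(self, partition=""):
        # Assumption for this sketch: an empty partition means "all partitions".
        if partition == "":
            return [p for paths in self._paths_by_partition.values() for p in paths]
        return list(self._paths_by_partition.get(partition, []))


def list_input_paths(input_folder, partitions=None):
    # Same shape as the function in this PR, except that the partitions are
    # passed in explicitly instead of being read from the flow definition.
    partitions = partitions or [""]
    return [
        path
        for partition in partitions
        for path in input_folder.list_paths_in_partition(partition)
    ]


folder = FakeFolder({
    "2025-09-25": ["/2025-09-25/a.pdf"],
    "2025-09-26": ["/2025-09-26/b.pdf", "/2025-09-26/c.pdf"],
})
activity_partitions = ["2025-09-25", "2025-09-26"]

# Before: each activity (one per partition) processes the whole folder.
before = Counter()
for _ in activity_partitions:
    before.update(list_input_paths(folder))

# After: each activity processes only its target partition.
after = Counter()
for activity_partition in activity_partitions:
    after.update(list_input_paths(folder, [activity_partition]))

# Paths that were processed more than once across all activities.
duplicated_before = sorted(p for p, n in before.items() if n > 1)
duplicated_after = sorted(p for p, n in after.items() if n > 1)
```

With two activities, every path is processed twice in the "before" case and exactly once in the "after" case, which is the duplication this PR removes.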

@straux self-assigned this Sep 26, 2025
@alineishabouri self-requested a review September 26, 2025 13:13
Comment on lines +24 to +30
def list_input_paths(input_folder):
    partitions = flow.FLOW['in'][0].get("partitions", [""])
    return [
        path
        for partition in partitions
        for path in input_folder.list_paths_in_partition(partition)
    ]
@clairebehue Sep 29, 2025

I'm pretty sure we have the same issue in the native* recipe 😢: it always lists all files. Could you quickly double-check, if you already have your partitioned folder in place? (I'll take care of opening the card if needed.)

@straux (Author)

I tested it and found no issue with the embed doc recipe. The KB does not support partitions, so we read all partitions by default, with the ability to customize the selection in the I/O settings.

I was missing the I/O selection settings indeed! Thanks for checking this 🙏

@alineishabouri (Collaborator) left a comment

Tested with multi-dimensional partitions and it works.

Comment on lines 1 to 2
scipy==1.10.1; python_version < "3.12"
scipy==1.13.1; python_version >= "3.12"

1.13.1 is compatible with Python 3.9 and later, so why split the requirement?

matplotlib==3.7.1
packaging==24.0
scikit-image==0.19.3; python_version < "3.12"
scikit-image>=0.21,<0.22; python_version >= "3.12"

https://pypi.org/project/scikit-image/0.21.0/#files: this is the only existing scikit-image version in this version range, and it has no wheels for Python 3.12, so I'm very puzzled as to how this version was chosen. Also, why was it split for 3.12 specifically?
