-
Notifications
You must be signed in to change notification settings - Fork 0
[sc-266853] Text extraction plugin: add partition support #84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
def list_input_paths(input_folder): | ||
partitions = flow.FLOW['in'][0].get("partitions", [""]) | ||
return [ | ||
path | ||
for partition in partitions | ||
for path in input_folder.list_paths_in_partition(partition) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm pretty sure we got the same issue in the native* recipe 😢 : always listing all files. could you quickly double check if you already have your partitioned folder in place please? (i'll take care of opening the card if needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it and I found no issue with the embed doc recipe. The KB do not support partitions, so we read all partitions by default, with the ability to customize the selection in the I/O settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was missing the I/O selection settings indeed ! thanks for checking this 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with multi-dimensional partitions and it works.
Bumped torch, opencv and opencv-python
scipy==1.10.1; python_version < "3.12" | ||
scipy==1.13.1; python_version >= "3.12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.13.1 is compatible with python3.9 and more, why split the requirement ?
matplotlib==3.7.1 | ||
packaging==24.0 | ||
scikit-image==0.19.3; python_version < "3.12" | ||
scikit-image>=0.21,<0.22; python_version >= "3.12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://pypi.org/project/scikit-image/0.21.0/#files this is the only existing scikit-image version for this version range, and it has no wheels for python3.12, so i'm very puzzled how this version was chosen ? And also, why it was split for 3.12 specifically ?
This PR add partitions support for input folders.
Before this PR, each activity was running on every file from the input folder, without taking in account the partition, and thus yielded duplicates in the output.
The fix ensures that we only run each activity on the target input partition.