Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fix StructuredDataset empty-str file_format in dc attr access #3027

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

JiangJiaWei1103
Copy link
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Jan 1, 2025

Tracking issue

Closes flyteorg/flyte#6096.

Why are the changes needed?

When users create a StructuredDataset with a specified file_format (e.g., parquet), the file_format information will be accidentally discarded in this case during async_to_literal call. To be concrete, StructuredDatasetType's format attribute is set to GENERIC_FORMAT, which is an empty string "".

What changes were proposed in this pull request?

Override StructuredDatasetType's format attribute when users explicitly set file_format of python native StructuredDataset.

How was this patch tested?

This patch is tested through the newly added integration test and double checked by observing the flyte console I/O and the task pod stdout.

Setup process

For local run, the setup process is summarized as follows:

git clone https://github.com/flyteorg/flytekit.git
gh pr checkout 3027
make setup && pip install -e .

After installation, run the following command:

pytest -svvv tests/flytekit/integration/remote/test_remote.py::test_sd_attr

Screenshots

The following results are expected:

  • Flyte console input
    Screenshot 2025-01-04 at 2 44 54 PM

  • Flyte console output
    Screenshot 2025-01-04 at 2 45 03 PM

  • Task pod stdout
    Screenshot 2025-01-04 at 2 44 43 PM

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This PR implements comprehensive improvements to Flytekit, including: 1) Enhanced execution engine with improved error handling and worker queue implementation, 2) Fixed StructuredDataset bug regarding file_format preservation, 3) Introduced new Optuna plugin for hyperparameter optimization with async capabilities, 4) Added Environment class and improved thread safety mechanisms.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 5

Signed-off-by: JiaWei Jiang <waynechuang97@gmail.com>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at eduardo@union.ai.

Copy link

codecov bot commented Jan 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.82%. Comparing base (3260ddf) to head (416bed8).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3027      +/-   ##
==========================================
+ Coverage   79.61%   79.82%   +0.21%     
==========================================
  Files         202      303     +101     
  Lines       21430    25811    +4381     
  Branches     2760     2760              
==========================================
+ Hits        17061    20604    +3543     
- Misses       3601     4408     +807     
- Partials      768      799      +31     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at eduardo@union.ai.

Signed-off-by: JiaWei Jiang <waynechuang97@gmail.com>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at eduardo@union.ai.

Signed-off-by: JiaWei Jiang <waynechuang97@gmail.com>
@JiangJiaWei1103 JiangJiaWei1103 changed the title [WIP] Fix StructuredDataset empty-str file_format in dc attr access [BUG] Fix StructuredDataset empty-str file_format in dc attr access Jan 4, 2025
@JiangJiaWei1103
Copy link
Contributor Author

Follow-up

During fixing this bug, we observe another two phenomenon, which are briefly described as follows:

  1. pyflyte run can't handle the dataclass input as a json string, e.g.,
pyflyte run --remote test.py wf --dc '{"dc": {"sd": {"uri": "s3://my-s3-bucket/df.parquet", "file_format": "parquet"}}}'

There are two sub-cases:

  • DC's sd is defined without default_factory
    • Get KeyError: 'sd'
  • DC's sd is defined with default_factory
    • wf always takes the default_factory setup as input no matter what's specified in the json string
  1. uri information loss, which is just similar to what we've solved in this patch
    Screenshot 2025-01-04 at 2 44 43 PM
    As can be observed, StructuredDataset uri becomes None.

We plan to solve these issues in the future which can make attribute access result more accurate. Thanks!

@flyte-bot
Copy link
Contributor

flyte-bot commented Jan 4, 2025

Code Review Agent Run #2ae7a3

Actionable Suggestions - 0
Review Details
  • Files reviewed - 3 · Commit Range: cdb56c0..46b7b62
    • flytekit/types/structured/structured_dataset.py
    • tests/flytekit/integration/remote/test_remote.py
    • tests/flytekit/integration/remote/workflows/basic/sd_attr.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito Logo

@flyte-bot
Copy link
Contributor

flyte-bot commented Jan 4, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change Files Impacted
Bug Fix - Fix StructuredDataset file_format Retention

structured_dataset.py - Added logic to retain user-specified file_format in StructuredDataset during transformation

test_remote.py - Added integration test to verify StructuredDataset file_format retention

sd_attr.py - Created new test workflow to validate StructuredDataset attributes

@flyte-bot
Copy link
Contributor

flyte-bot commented Jan 11, 2025

Code Review Agent Run #1863ff

Actionable Suggestions - 0
Additional Suggestions - 10
  • tests/flytekit/integration/remote/utils.py - 1
    • Consider making parquet path configurable · Line 89-89
  • plugins/flytekit-optuna/flytekitplugins/optuna/optimizer.py - 1
  • plugins/flytekit-inference/tests/test_vllm.py - 1
    • Consider breaking down long assertion statement · Line 41-41
  • tests/flytekit/unit/core/test_flyte_file.py - 1
  • flytekit/image_spec/default_builder.py - 1
  • tests/flytekit/unit/core/test_flyte_directory.py - 1
  • tests/flytekit/integration/remote/test_remote.py - 1
    • Consider extracting duplicated workflow execution logic · Line 846-847
  • plugins/flytekit-optuna/setup.py - 1
    • Consider adding Python 3.11 support · Line 21-21
  • plugins/flytekit-spark/tests/test_environment.py - 1
    • Consider validating Spark config in test · Line 17-20
  • tests/flytekit/unit/core/image_spec/test_default_builder.py - 1
    • Consider using complete error message pattern · Line 271-271
Review Details
  • Files reviewed - 27 · Commit Range: 46b7b62..cb73c0e
    • flytekit/__init__.py
    • flytekit/core/array_node_map_task.py
    • flytekit/core/environment.py
    • flytekit/core/workflow.py
    • flytekit/image_spec/default_builder.py
    • flytekit/remote/remote.py
    • flytekit/types/directory/types.py
    • flytekit/types/file/file.py
    • plugins/flytekit-inference/flytekitplugins/inference/__init__.py
    • plugins/flytekit-inference/flytekitplugins/inference/vllm/serve.py
    • plugins/flytekit-inference/setup.py
    • plugins/flytekit-inference/tests/test_vllm.py
    • plugins/flytekit-onnx-pytorch/dev-requirements.txt
    • plugins/flytekit-optuna/flytekitplugins/optuna/__init__.py
    • plugins/flytekit-optuna/flytekitplugins/optuna/optimizer.py
    • plugins/flytekit-optuna/setup.py
    • plugins/flytekit-optuna/tests/test_optimizer.py
    • plugins/flytekit-spark/tests/test_environment.py
    • pyproject.toml
    • tests/flytekit/integration/remote/test_remote.py
    • tests/flytekit/integration/remote/utils.py
    • tests/flytekit/integration/remote/workflows/basic/attr_access_sd.py
    • tests/flytekit/unit/core/image_spec/test_default_builder.py
    • tests/flytekit/unit/core/test_environment.py
    • tests/flytekit/unit/core/test_flyte_directory.py
    • tests/flytekit/unit/core/test_flyte_file.py
    • tests/flytekit/unit/core/test_workflows.py
  • Files skipped - 3
    • .github/workflows/pythonbuild.yml - Reason: Filter setting
    • plugins/flytekit-inference/README.md - Reason: Filter setting
    • plugins/flytekit-optuna/README.md - Reason: Filter setting
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito Logo

@Future-Outlier Future-Outlier self-assigned this Jan 13, 2025
@JiangJiaWei1103
Copy link
Contributor Author

JiangJiaWei1103 commented Jan 17, 2025

For the second follow-up issue, I thought I found why uri information is missing. I'll open another PR and link to this one after this one is merged. Thanks!

Screenshot 2025-01-17 at 7 57 44 PM

Comment on lines 751 to 752
if file_format != GENERIC_FORMAT:
sdt.format = file_format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the file format always be copied over?

Suggested change
if file_format != GENERIC_FORMAT:
sdt.format = file_format
sdt.format = file_format

Copy link
Contributor Author

@JiangJiaWei1103 JiangJiaWei1103 Jan 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @thomasjpfan,

Thanks for your suggestion! I think we can't always copy it over. I've come up with the following use case:

@task
def modify_format(sd: Annotated[StructuredDataset, {}, "task-format"]) -> StructuredDataset: 
    return sd

sd = StructuredDataset(uri="s3://my-s3-bucket/df.parquet", file_format="user-format")
sd2 = modify_format(sd=sd)

In this case, we expect sd2.file_format to be task-format (as shown in Annotated), not user-format. If we always use the user-specified file_format, the information set in Annotated will be missing.

Considering sdt.format can be set here, would it be good to do provide a stricter condition as follows?

if sdt.format == GENERIC_FORMAT and file_format != GENERIC_FORMAT:
    sdt.format = file_format

@flyte-bot
Copy link
Contributor

flyte-bot commented Jan 19, 2025

Code Review Agent Run #2259be

Actionable Suggestions - 0
Additional Suggestions - 10
  • flytekit/core/worker_queue.py - 3
  • flytekit/core/array_node_map_task.py - 1
    • Consider using proper constructor for metadata · Line 131-132
  • flytekit/exceptions/user.py - 1
    • Consider adding timestamp parameter validation · Line 18-18
  • flytekit/remote/remote.py - 1
  • plugins/flytekit-optuna/flytekitplugins/optuna/optimizer.py - 2
    • Consider adding inputs parameter validation · Line 192-192
    • Consider simplifying validation logic expressions · Line 84-87
  • tests/flytekit/unit/exceptions/test_user.py - 1
    • Consider splitting error message assertion · Line 19-19
  • flytekit/core/type_engine.py - 1
Review Details
  • Files reviewed - 40 · Commit Range: cb73c0e..416bed8
    • flytekit/bin/entrypoint.py
    • flytekit/clis/sdk_in_container/run.py
    • flytekit/core/array_node_map_task.py
    • flytekit/core/context_manager.py
    • flytekit/core/node.py
    • flytekit/core/promise.py
    • flytekit/core/type_engine.py
    • flytekit/core/worker_queue.py
    • flytekit/exceptions/user.py
    • flytekit/image_spec/default_builder.py
    • flytekit/remote/remote.py
    • flytekit/tools/translator.py
    • flytekit/types/structured/structured_dataset.py
    • plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py
    • plugins/flytekit-kf-pytorch/tests/test_elastic_task.py
    • plugins/flytekit-optuna/flytekitplugins/optuna/__init__.py
    • plugins/flytekit-optuna/flytekitplugins/optuna/optimizer.py
    • plugins/flytekit-optuna/setup.py
    • plugins/flytekit-optuna/tests/test_callback.py
    • plugins/flytekit-optuna/tests/test_decorator.py
    • plugins/flytekit-optuna/tests/test_imperative.py
    • plugins/flytekit-optuna/tests/test_optimizer.py
    • plugins/flytekit-optuna/tests/test_validation.py
    • plugins/flytekit-pydantic/flytekitplugins/pydantic/__init__.py
    • plugins/flytekit-pydantic/flytekitplugins/pydantic/basemodel_transformer.py
    • plugins/flytekit-pydantic/flytekitplugins/pydantic/commons.py
    • plugins/flytekit-pydantic/flytekitplugins/pydantic/deserialization.py
    • plugins/flytekit-pydantic/flytekitplugins/pydantic/serialization.py
    • plugins/flytekit-pydantic/setup.py
    • plugins/flytekit-pydantic/tests/folder/test_file1.txt
    • plugins/flytekit-pydantic/tests/folder/test_file2.txt
    • plugins/flytekit-pydantic/tests/test_type_transformer.py
    • tests/flytekit/integration/remote/test_remote.py
    • tests/flytekit/integration/remote/workflows/basic/signal_test.py
    • tests/flytekit/unit/bin/test_python_entrypoint.py
    • tests/flytekit/unit/core/test_array_node_map_task.py
    • tests/flytekit/unit/core/test_type_hints.py
    • tests/flytekit/unit/core/test_worker_queue.py
    • tests/flytekit/unit/exceptions/test_user.py
    • tests/flytekit/unit/remote/test_remote.py
  • Files skipped - 2
    • plugins/flytekit-optuna/README.md - Reason: Filter setting
    • plugins/flytekit-pydantic/README.md - Reason: Filter setting
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

[BUG] StructuredDataset file_format becomes an empty str through dataclass attribute access
4 participants