
DataFlow Job in TFX pipeline fails after running for an hour #6565

Closed
sumansaurav-talentica opened this issue Jan 8, 2024 · 6 comments

@sumansaurav-talentica

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed (e.g., Local (Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Vertex AI
  • TensorFlow version: 2.13.1
  • TFX version: 1.14.0
  • Python version: 3.10.12
  • Python dependencies (from pip freeze output):

Describe the current behavior
The pipeline fails at the first step, where it imports data from BigQuery using a Dataflow job.
Describe the expected behavior
It should import the data successfully, as it did before.
Standalone code to reproduce the issue
BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--region=' + GOOGLE_CLOUD_REGION,

    # Temporary overrides of defaults.
    '--disk_size_gb=200',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
]
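
For context, the failing first step is the BigQuery import. A minimal sketch of what that component typically looks like in TFX 1.14 (an assumption; the actual pipeline definition and QUERY are not shown in this issue):

# Assumed sketch, not from this issue: the BigQuery import step that launches the
# Dataflow job when the beam args above select DataflowRunner.
from tfx import v1 as tfx

QUERY = "SELECT * FROM `project.dataset.table`"  # placeholder query

example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(query=QUERY)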
Other info / logs
Logs attached
downloaded-logs-20240108-182510.csv

@singhniraj08 (Contributor)

@sumansaurav-talentica,

This is a known issue (#6386). The current workaround is to open a shell in your container (e.g. docker run --rm -it --entrypoint=/bin/bash YOUR_CONTAINER_IMAGE) and check whether the python3-venv package is installed, or to add ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1 to the TFX Docker image before building the container and use that container for your Dataflow jobs.
Thank you!
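
For reference, a minimal non-interactive form of that check (a sketch; YOUR_CONTAINER_IMAGE is a placeholder), using the same notebook-style shell commands that appear later in this thread:

# Sketch: run the image's bash with a one-off command to see whether python3-venv is installed.
!docker run --rm --entrypoint=/bin/bash YOUR_CONTAINER_IMAGE -c "dpkg -s python3-venv"
# dpkg -s prints "Status: install ok installed" when the package is present and exits non-zero when it is missing.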

@sumansaurav-talentica (Author)

Can you please suggest code and steps for how I can add "ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1" to the TFX Docker image before building the container?

This is my code where I am creating the runner:

BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--region=' + GOOGLE_CLOUD_REGION,
    '--disk_size_gb=200',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2'
]

PIPELINE_DEFINITION_FILE = 'test_pipeline.json'

runner = tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(),
    output_filename=PIPELINE_DEFINITION_FILE)
_ = runner.run(
    _create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        query=QUERY,
        module_file=os.path.join(MODULE_ROOT, _trainer_module_file),
        endpoint_name=ENDPOINT_NAME,
        project_id=GOOGLE_CLOUD_PROJECT,
        region=GOOGLE_CLOUD_REGION,
        beam_pipeline_args=BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS))

@sumansaurav-talentica (Author) commented Jan 10, 2024

Thanks for the solution, @singhniraj08; it worked, and I am posting it here.
Since I was creating the TFX pipeline on Colab and running it on Vertex AI, below is the code I ran.

!gcloud artifacts repositories create REPO-NAME \
    --repository-format=docker \
    --location=REGION \
    --async

!gcloud auth configure-docker REGION-docker.pkg.dev

dockerfile_content = """
FROM tensorflow/tfx:1.14.0

ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
"""

with open("Dockerfile", "w") as dockerfile:
    dockerfile.write(dockerfile_content)

!gcloud builds submit --tag REGION-docker.pkg.dev/PROJECT-ID/REPO-NAME/dataflow/DOCKERNAME:TAG

and finally I passed this new custom Docker image to Dataflow via beam_pipeline_args:

BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--region=' + GOOGLE_CLOUD_REGION,
    '--disk_size_gb=200',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--sdk_container_image=us-central1-docker.pkg.dev/calm-snowfall-385011/chicago-taxi/dataflow/tfx114:1.0'
]
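
For completeness, a minimal sketch of how a pipeline compiled this way is typically submitted to Vertex AI Pipelines from the notebook (an assumption; the submission code is not shown in this thread, and the names come from the earlier snippet):

# Assumed sketch, not from this thread: submit the compiled test_pipeline.json to Vertex AI Pipelines.
from google.cloud import aiplatform

aiplatform.init(project=GOOGLE_CLOUD_PROJECT, location=GOOGLE_CLOUD_REGION)

job = aiplatform.PipelineJob(
    display_name=PIPELINE_NAME,
    template_path=PIPELINE_DEFINITION_FILE,  # 'test_pipeline.json' produced by KubeflowV2DagRunner
)
job.submit()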

@singhniraj08 (Contributor)

@sumansaurav-talentica,

We have a similar issue tracking this, and the long-term solution is to add the environment variable to the TFX base image to avoid these issues in the future. This is blocked by issue #6468. Once that issue is fixed, we will add the environment variable to the TFX base image. I would request you to close this issue and follow the similar issue for updates.
Thank you!

@sumansaurav-talentica (Author)

Thanks for the support.
