SNOW-766772: Unable to GET multiple parquet files from internal stage to local directory #1485

Closed
ndamclean opened this issue Mar 22, 2023 · 8 comments


  1. What version of Python are you using?
Python 3.8.13
  2. What operating system and processor architecture are you using?
Linux-5.19.0-35-generic-x86_64-with-glibc2.35
  3. What are the component versions in the environment (pip freeze)?
asn1crypto==1.5.1
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==2.1.1
cryptography==39.0.2
filelock==3.10.1
idna==3.4
oscrypto==1.3.0
packaging==23.0
pycparser==2.21
pycryptodomex==3.17
PyJWT==2.6.0
pyOpenSSL==23.0.0
pytz==2022.7.1
requests==2.28.2
snowflake-connector-python==3.0.1
typing_extensions==4.5.0
urllib3==1.26.15
  4. What did you do?

I attempted to download multiple Parquet files from an internal Snowflake stage to my local machine, as described in the GET documentation: https://docs.snowflake.com/en/sql-reference/sql/get

This replicates a use case in which I wrote data with snowpark.DataFrame.write.copy_into_location, with partition_by set, and then tried to retrieve the partitioned files.

My Snowpark code looks something like this:

dataframe.write.copy_into_location(
    remote_location,
    partition_by=F.col("event_id").cast("string"),
    file_format_type="parquet",
)
snowpark_session.file.get(remote_location, local_path, pattern=".*parquet", parallel=16)

This call raised an error. The underlying cause appears to be in the snowflake-connector-python library itself, so I reproduced the problem without Snowpark:

import logging
import os
from pathlib import Path

from snowflake.connector import SnowflakeConnection, connect


# Send DEBUG-level connector logs to stderr.
for logger_name in ('snowflake.connector',):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)


def snowflake_connect(**kwargs) -> SnowflakeConnection:
    """Create a Snowflake connection based on environment variables.

    Extra keyword arguments are forwarded to connect().
    """
    return connect(
        account=os.environ.get("SNOWSQL_ACCOUNT"),
        user=os.environ.get("SNOWSQL_USER"),
        password=os.environ.get("SNOWSQL_PWD"),
        database=os.environ.get("SNOWSQL_DATABASE"),
        schema=os.environ.get("SNOWSQL_SCHEMA"),
        role=os.environ.get("SNOWSQL_ROLE"),
        warehouse=os.environ.get("SNOWSQL_WAREHOUSE"),
        **kwargs,
    )


stage = "TEMP"
parquet_file_path = str(Path("~/tmp.parquet").expanduser())

with snowflake_connect() as conn:
    cursor = conn.cursor()
    ret = cursor.execute(
        f"""
        CREATE OR REPLACE TEMPORARY STAGE {stage}
        """
    )
    print(ret.fetchall())
    ret = cursor.execute(
        f"""
        PUT file://{parquet_file_path} @{stage}/data/1/
        """
    )
    print(ret.fetchall())
    ret = cursor.execute(
        f"""
        PUT file://{parquet_file_path} @{stage}/data/2/
        """
    )
    print(ret.fetchall())
    ret = cursor.execute(
        f"""
        LIST @{stage}
        """
    )
    print(ret.fetchall())
    ret = cursor.execute(
        f"""
        GET @{stage} file:///tmp PATTERN='.*parquet'
        """
    )
    print(ret.fetchall())
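
The script stages the same file name, tmp.parquet, under two different prefixes (data/1/ and data/2/), and the final GET writes into a single local directory. Judging from the log later in this report, both downloads resolve to the same local path, /tmp/tmp.parquet. A minimal sketch of a pre-check for that situation follows; the helper name find_basename_collisions and the literal path strings are illustrative assumptions, not part of the connector.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def find_basename_collisions(stage_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group stage paths by file name and keep names that occur more than once.

    Hypothetical helper: it only inspects path strings and never talks to Snowflake.
    """
    by_name: Dict[str, List[str]] = defaultdict(list)
    for path in stage_paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


# Paths mirroring the two PUT targets in the script above.
staged = ["data/1/tmp.parquet", "data/2/tmp.parquet"]
collisions = find_basename_collisions(staged)
print(collisions)
```

A non-empty result means a GET into one flat local directory would ask two stage files to share a single destination name.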
  5. What did you expect to see?
    The Parquet files that I successfully uploaded should be downloaded to the /tmp directory on my local machine.

  6. Can you set logging to DEBUG and collect the logs?
    (Truncated because of GitHub issue character limits.)

[('tmp.parquet', 'tmp.parquet', 144664, 144672, 'PARQUET', 'PARQUET', 'UPLOADED', '')]
2023-03-22 13:46:30,527 - MainThread cursor.py:673 - execute() - DEBUG - executing SQL/command
2023-03-22 13:46:30,527 - MainThread cursor.py:593 - _preprocess_pyformat_query() - DEBUG - binding: [LIST @TEMP] with input=[None], processed=[{}]
2023-03-22 13:46:30,528 - MainThread cursor.py:738 - execute() - INFO - query: [LIST @TEMP]
2023-03-22 13:46:30,528 - MainThread connection.py:1363 - _next_sequence_counter() - DEBUG - sequence counter: 6
2023-03-22 13:46:30,528 - MainThread cursor.py:468 - _execute_helper() - DEBUG - Request id: 95041de0-8b17-4f15-916d-f3167ce38c29
2023-03-22 13:46:30,528 - MainThread cursor.py:470 - _execute_helper() - DEBUG - running query [LIST @TEMP]
2023-03-22 13:46:30,528 - MainThread cursor.py:477 - _execute_helper() - DEBUG - is_file_transfer: True
2023-03-22 13:46:30,528 - MainThread connection.py:1035 - cmd_query() - DEBUG - _cmd_query
2023-03-22 13:46:30,528 - MainThread connection.py:1058 - cmd_query() - DEBUG - sql=[LIST @TEMP], sequence_id=[6], is_file_transfer=[False]
2023-03-22 13:46:30,528 - MainThread network.py:1166 - _use_requests_session() - DEBUG - Session status for SessionPool 'jbulliw-main.snowflakecomputing.com', SessionPool 1/1 active sessions
2023-03-22 13:46:30,529 - MainThread network.py:846 - _request_exec_wrapper() - DEBUG - remaining request timeout: None, retry cnt: 1
2023-03-22 13:46:30,529 - MainThread network.py:827 - add_request_guid() - DEBUG - Request guid: 9b8b52e2-2494-44ed-bc10-f8792d33cb50
2023-03-22 13:46:30,529 - MainThread network.py:1025 - _request_exec() - DEBUG - socket timeout: 60
2023-03-22 13:46:30,699 - MainThread connectionpool.py:456 - _make_request() - DEBUG - https://jbulliw-main.snowflakecomputing.com:443 "POST /queries/v1/query-request?requestId=95041de0-8b17-4f15-916d-f3167ce38c29&request_guid=9b8b52e2-2494-44ed-bc10-f8792d33cb50 HTTP/1.1" 200 None
2023-03-22 13:46:30,700 - MainThread network.py:1051 - _request_exec() - DEBUG - SUCCESS
2023-03-22 13:46:30,701 - MainThread network.py:1171 - _use_requests_session() - DEBUG - Session status for SessionPool 'jbulliw-main.snowflakecomputing.com', SessionPool 0/1 active sessions
2023-03-22 13:46:30,701 - MainThread network.py:726 - _post_request() - DEBUG - ret[code] = None, after post request
2023-03-22 13:46:30,701 - MainThread network.py:750 - _post_request() - DEBUG - Query id: 01ab1f9e-0001-0477-0001-777a05ab0cde
2023-03-22 13:46:30,701 - MainThread cursor.py:745 - execute() - DEBUG - sfqid: 01ab1f9e-0001-0477-0001-777a05ab0cde
2023-03-22 13:46:30,701 - MainThread cursor.py:751 - execute() - INFO - query execution done
2023-03-22 13:46:30,701 - MainThread cursor.py:765 - execute() - DEBUG - SUCCESS
2023-03-22 13:46:30,701 - MainThread cursor.py:784 - execute() - DEBUG - PUT OR GET: False
2023-03-22 13:46:30,701 - MainThread cursor.py:876 - _init_result_and_meta() - DEBUG - Query result format: json
2023-03-22 13:46:30,702 - MainThread result_batch.py:432 - _parse() - DEBUG - parsing for result batch id: 2
2023-03-22 13:46:30,702 - MainThread cursor.py:890 - _init_result_and_meta() - INFO - Number of results in first chunk: 2
2023-03-22 13:46:30,702 - MainThread result_set.py:57 - result_set_iterator() - DEBUG - beginning to schedule result batch downloads
[('temp/data/1/tmp.parquet', 144672, '324301a7a7000744d7c923439d596970', 'Wed, 22 Mar 2023 20:46:29 GMT'), ('temp/data/2/tmp.parquet', 144672, '78acf70324146b817f3597e6e91c156b', 'Wed, 22 Mar 2023 20:46:30 GMT')]
2023-03-22 13:46:30,702 - MainThread cursor.py:673 - execute() - DEBUG - executing SQL/command
2023-03-22 13:46:30,702 - MainThread cursor.py:593 - _preprocess_pyformat_query() - DEBUG - binding: [GET @TEMP file:///tmp PATTERN='.*parquet'] with input=[None], processed=[{}]
2023-03-22 13:46:30,702 - MainThread cursor.py:738 - execute() - INFO - query: [GET @TEMP file:///tmp PATTERN='.*parquet']
2023-03-22 13:46:30,702 - MainThread connection.py:1363 - _next_sequence_counter() - DEBUG - sequence counter: 7
2023-03-22 13:46:30,702 - MainThread cursor.py:468 - _execute_helper() - DEBUG - Request id: 2bffc446-b12c-4e21-a578-0edf7161fad2
2023-03-22 13:46:30,702 - MainThread cursor.py:470 - _execute_helper() - DEBUG - running query [GET @TEMP file:///tmp PATTERN='.*parquet']
2023-03-22 13:46:30,702 - MainThread cursor.py:477 - _execute_helper() - DEBUG - is_file_transfer: True
2023-03-22 13:46:30,703 - MainThread connection.py:1035 - cmd_query() - DEBUG - _cmd_query
2023-03-22 13:46:30,703 - MainThread connection.py:1058 - cmd_query() - DEBUG - sql=[GET @TEMP file:///tmp PATTERN='.*parquet'], sequence_id=[7], is_file_transfer=[True]
2023-03-22 13:46:30,703 - MainThread network.py:1166 - _use_requests_session() - DEBUG - Session status for SessionPool 'jbulliw-main.snowflakecomputing.com', SessionPool 1/1 active sessions
2023-03-22 13:46:30,703 - MainThread network.py:846 - _request_exec_wrapper() - DEBUG - remaining request timeout: None, retry cnt: 1
2023-03-22 13:46:30,703 - MainThread network.py:827 - add_request_guid() - DEBUG - Request guid: 43a31885-3fda-40ca-9005-25d506bfc3ff
2023-03-22 13:46:30,703 - MainThread network.py:1025 - _request_exec() - DEBUG - socket timeout: 60
2023-03-22 13:46:30,967 - MainThread connectionpool.py:456 - _make_request() - DEBUG - https://jbulliw-main.snowflakecomputing.com:443 "POST /queries/v1/query-request?requestId=2bffc446-b12c-4e21-a578-0edf7161fad2&request_guid=43a31885-3fda-40ca-9005-25d506bfc3ff HTTP/1.1" 200 None
2023-03-22 13:46:30,970 - MainThread network.py:1051 - _request_exec() - DEBUG - SUCCESS
2023-03-22 13:46:30,970 - MainThread network.py:1171 - _use_requests_session() - DEBUG - Session status for SessionPool 'jbulliw-main.snowflakecomputing.com', SessionPool 0/1 active sessions
2023-03-22 13:46:30,970 - MainThread network.py:726 - _post_request() - DEBUG - ret[code] = None, after post request
2023-03-22 13:46:30,970 - MainThread network.py:750 - _post_request() - DEBUG - Query id: 01ab1f9e-0001-04d6-0001-777a05ab3266
2023-03-22 13:46:30,970 - MainThread cursor.py:745 - execute() - DEBUG - sfqid: 01ab1f9e-0001-04d6-0001-777a05ab3266
2023-03-22 13:46:30,970 - MainThread cursor.py:751 - execute() - INFO - query execution done
2023-03-22 13:46:30,970 - MainThread cursor.py:765 - execute() - DEBUG - SUCCESS
2023-03-22 13:46:30,970 - MainThread cursor.py:784 - execute() - DEBUG - PUT OR GET: True
2023-03-22 13:46:30,971 - MainThread file_transfer_agent.py:822 - _init_encryption_material() - DEBUG - DOWNLOAD
2023-03-22 13:46:30,971 - MainThread file_transfer_agent.py:915 - _parse_command() - DEBUG - data/1/tmp.parquet
2023-03-22 13:46:30,971 - MainThread file_transfer_agent.py:915 - _parse_command() - DEBUG - data/2/tmp.parquet
2023-03-22 13:46:30,971 - MainThread file_transfer_agent.py:949 - _init_file_metadata() - DEBUG - command type: DOWNLOAD
2023-03-22 13:46:30,971 - MainThread file_transfer_agent.py:392 - execute() - DEBUG - parallel=[10]
2023-03-22 13:46:30,971 - MainThread file_transfer_agent.py:415 - transfer() - DEBUG - Chunk ThreadPoolExecutor size: 10
2023-03-22 13:46:30,971 - MainThread connection.py:653 - cursor() - DEBUG - cursor
2023-03-22 13:46:30,972 - MainThread gcs_storage_client.py:80 - __init__() - DEBUG - No access token received from GS, requesting presigned url
2023-03-22 13:46:30,972 - MainThread gcs_storage_client.py:227 - _update_presigned_url() - DEBUG - Updating presigned url
2023-03-22 13:46:30,972 - MainThread connection.py:653 - cursor() - DEBUG - cursor
2023-03-22 13:46:30,972 - MainThread gcs_storage_client.py:80 - __init__() - DEBUG - No access token received from GS, requesting presigned url
2023-03-22 13:46:30,972 - MainThread gcs_storage_client.py:227 - _update_presigned_url() - DEBUG - Updating presigned url
2023-03-22 13:46:30,973 - ThreadPoolExecutor-11_1 file_transfer_agent.py:450 - preprocess_done_cb() - DEBUG - Finished preparing file tmp.parquet
2023-03-22 13:46:30,974 - ThreadPoolExecutor-11_0 file_transfer_agent.py:450 - preprocess_done_cb() - DEBUG - Finished preparing file tmp.parquet
2023-03-22 13:46:30,974 - ThreadPoolExecutor-10_0 network.py:1166 - _use_requests_session() - DEBUG - Session status for SessionPool 'gcpuscentral1-qfr7mrx-stage.storage.googleapis.com', SessionPool 1/1 active sessions
2023-03-22 13:46:30,974 - ThreadPoolExecutor-10_0 storage_client.py:275 - _send_request_with_retry() - DEBUG - storage client request with session <snowflake.connector.vendored.requests.sessions.Session object at 0x7fa63f6e0be0>
2023-03-22 13:46:30,976 - ThreadPoolExecutor-10_1 retry.py:351 - from_int() - DEBUG - Converted retries value: 1 -> Retry(total=1, connect=None, read=None, redirect=None, status=None)
2023-03-22 13:46:30,976 - ThreadPoolExecutor-10_1 retry.py:351 - from_int() - DEBUG - Converted retries value: 1 -> Retry(total=1, connect=None, read=None, redirect=None, status=None)
2023-03-22 13:46:30,976 - ThreadPoolExecutor-10_1 network.py:1166 - _use_requests_session() - DEBUG - Session status for SessionPool 'gcpuscentral1-qfr7mrx-stage.storage.googleapis.com', SessionPool 2/2 active sessions
2023-03-22 13:46:30,977 - ThreadPoolExecutor-10_1 storage_client.py:275 - _send_request_with_retry() - DEBUG - storage client request with session <snowflake.connector.vendored.requests.sessions.Session object at 0x7fa63e5e36d0>
2023-03-22 13:46:30,978 - ThreadPoolExecutor-10_1 connectionpool.py:1003 - _new_conn() - DEBUG - Starting new HTTPS connection (1): gcpuscentral1-qfr7mrx-stage.storage.googleapis.com:443
2023-03-22 13:46:31,081 - ThreadPoolExecutor-10_1 ssl_wrap_socket.py:81 - ssl_wrap_socket_with_ocsp() - DEBUG - OCSP Mode: FAIL_OPEN, OCSP response cache file name: None
2023-03-22 13:46:31,081 - ThreadPoolExecutor-10_1 ocsp_snowflake.py:523 - reset_ocsp_response_cache_uri() - DEBUG - ocsp_response_cache_uri: file:///home/nmclean/.cache/snowflake/ocsp_response_cache.json
2023-03-22 13:46:31,081 - ThreadPoolExecutor-10_1 ocsp_snowflake.py:526 - reset_ocsp_response_cache_uri() - DEBUG - OCSP_VALIDATION_CACHE size: 213
2023-03-22 13:46:31,081 - ThreadPoolExecutor-10_1 ocsp_snowflake.py:333 - reset_ocsp_dynamic_cache_server_url() - DEBUG - OCSP response cache server is enabled: http://ocsp.snowflakecomputing.com/ocsp_response_cache.json
2023-03-22 13:46:31,081 - ThreadPoolExecutor-10_1 ocsp_snowflake.py:346 - reset_ocsp_dynamic_cache_server_url() - DEBUG - OCSP dynamic cache server RETRY URL: None
2023-03-22 13:46:31,081 - ThreadPoolExecutor-10_1 ocsp_snowflake.py:956 - validate() - DEBUG - validating certificate: gcpuscentral1-qfr7mrx-stage.storage.googleapis.com
2023-03-22 13:46:31,081 - ThreadPoolExecutor-10_1 ocsp_asn1crypto.py:435 - extract_certificate_chain() - DEBUG - # of certificates: 3
2023-03-22 13:46:31,082 - ThreadPoolExecutor-10_1 ocsp_asn1crypto.py:440 - extract_certificate_chain() - DEBUG - subject: OrderedDict([('common_name', '*.storage.googleapis.com')]), issuer: OrderedDict([('country_name', 'US'), ('organization_name', 'Google Trust Services LLC'), ('common_name', 'GTS CA 1C3')])
2023-03-22 13:46:31,082 - ThreadPoolExecutor-10_1 ocsp_asn1crypto.py:440 - extract_certificate_chain() - DEBUG - subject: OrderedDict([('country_name', 'US'), ('organization_name', 'Google Trust Services LLC'), ('common_name', 'GTS CA 1C3')]), issuer: OrderedDict([('country_name', 'US'), ('organization_name', 'Google Trust Services LLC'), ('common_name', 'GTS Root R1')])
2023-03-22 13:46:31,082 - ThreadPoolExecutor-10_1 ocsp_asn1crypto.py:440 - extract_certificate_chain() - DEBUG - subject: OrderedDict([('country_name', 'US'), ('organization_name', 'Google Trust Services LLC'), ('common_name', 'GTS Root R1')]), issuer: OrderedDict([('country_name', 'BE'), ('organization_name', 'GlobalSign nv-sa'), ('organizational_unit_name', 'Root CA'), ('common_name', 'GlobalSign Root CA')])
2023-03-22 13:46:31,083 - ThreadPoolExecutor-10_1 ocsp_asn1crypto.py:463 - create_pair_issuer_subject() - DEBUG - not found issuer_der: OrderedDict([('country_name', 'BE'), ('organization_name', 'GlobalSign nv-sa'), ('organizational_unit_name', 'Root CA'), ('common_name', 'GlobalSign Root CA')])
2023-03-22 13:46:31,084 - ThreadPoolExecutor-10_1 ocsp_snowflake.py:722 - find_cache() - DEBUG - hit cache for subject: OrderedDict([('common_name', '*.storage.googleapis.com')])
2023-03-22 13:46:31,084 - ThreadPoolExecutor-10_1 ocsp_snowflake.py:722 - find_cache() - DEBUG - hit cache for subject: OrderedDict([('country_name', 'US'), ('organization_name', 'Google Trust Services LLC'), ('common_name', 'GTS CA 1C3')])
2023-03-22 13:46:31,086 - ThreadPoolExecutor-10_1 ocsp_asn1crypto.py:233 - is_valid_time() - DEBUG - Verifying the attached certificate is signed by the issuer. Valid Not After: 2023-04-15 00:00:00+00:00
2023-03-22 13:46:31,086 - ThreadPoolExecutor-10_1 ocsp_snowflake.py:722 - find_cache() - DEBUG - hit cache for subject: OrderedDict([('country_name', 'US'), ('organization_name', 'Google Trust Services LLC'), ('common_name', 'GTS Root R1')])
2023-03-22 13:46:31,087 - ThreadPoolExecutor-10_1 ocsp_snowflake.py:1013 - _validate() - DEBUG - ok
2023-03-22 13:46:31,106 - ThreadPoolExecutor-10_0 connectionpool.py:456 - _make_request() - DEBUG - https://gcpuscentral1-qfr7mrx-stage.storage.googleapis.com:443 "GET /stages/a24fba3f-14a6-44bf-98f1-6cef989b4c13/data/2/tmp.parquet?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcpuscentral1-qfr7mrx-stage%40us-central1-stage1-6e1d.iam.gserviceaccount.com%2F20230322%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230322T204630Z&X-Goog-Expires=3600&X-Goog-SignedHeaders=host&X-Goog-Signature=1b552e2b012d58caf64c925cf808eb173e6f94a703a2f2eeb55637667e294363285271450d3263195b8103adffb319d65d27c77fe19d680d73316555de4fe09392ae45ef79505767c972547db226cc343724f8f5aee357c253fb8603a0603237bea9f3a1daff3ded5e003d8d07075fce3e91fd7be6c4704b9dcd316418cff1b3994792659a4924a860ca6d1a0d6a4a70c7610993be4d53863910356bd43480d6705c2843fe6fcdb6f7d922976a35b77d9e161ea6765bbfb2719f4214759f09c6fa5be914bbfe569bb25bdc6c87f6cb6fbf33a56868853b2a4458d83e8665187d70f3e3f371563869c43ce9b9c330fc061e5e4016e8cd698530778318abb73966 HTTP/1.1" 200 144672
2023-03-22 13:46:31,107 - ThreadPoolExecutor-10_0 network.py:1171 - _use_requests_session() - DEBUG - Session status for SessionPool 'gcpuscentral1-qfr7mrx-stage.storage.googleapis.com', SessionPool 1/2 active sessions
2023-03-22 13:46:31,228 - ThreadPoolExecutor-10_0 file_transfer_agent.py:496 - transfer_done_cb() - DEBUG - Chunk 0/1 of file tmp.parquet reached callback
2023-03-22 13:46:31,229 - ThreadPoolExecutor-10_0 file_transfer_agent.py:512 - transfer_done_cb() - DEBUG - Chunk progress: tmp.parquet: completed: 1 failed: 0 total: 1
2023-03-22 13:46:31,229 - ThreadPoolExecutor-12_0 storage_client.py:352 - finish_download() - DEBUG - encrypted data file=/tmp/tmp.parquet
2023-03-22 13:46:31,229 - ThreadPoolExecutor-10_0 file_transfer_agent.py:532 - transfer_done_cb() - DEBUG - submitting tmp.parquet to done_postprocess
2023-03-22 13:46:31,229 - ThreadPoolExecutor-12_0 encryption_util.py:246 - decrypt_file() - DEBUG - encrypted file: /tmp/tmp.parquet.part, tmp file: /tmp/tmp1jt09xar/tmp.parquet.part#qnbiqtmipg
2023-03-22 13:46:31,231 - ThreadPoolExecutor-12_0 file_transfer_agent.py:574 - function_and_callback_wrapper() - ERROR - An exception was raised in <bound method SnowflakeGCSRestClient.finish_download of <snowflake.connector.gcs_storage_client.SnowflakeGCSRestClient object at 0x7fa63f6d95e0>>
Traceback (most recent call last):
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/file_transfer_agent.py", line 571, in function_and_callback_wrapper
    work(*args, **kwargs),
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/gcs_storage_client.py", line 220, in finish_download
    self.meta.src_file_size = os.path.getsize(self.intermediate_dst_path)
  File "/usr/local/lib/python3.8/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp.parquet.part'
2023-03-22 13:46:31,232 - ThreadPoolExecutor-12_0 file_transfer_agent.py:542 - postprocess_done_cb() - DEBUG - File tmp.parquet reached postprocess callback
2023-03-22 13:46:31,232 - ThreadPoolExecutor-12_0 file_transfer_agent.py:547 - postprocess_done_cb() - DEBUG - File tmp.parquet failed to transfer for unexpected exception [Errno 2] No such file or directory: '/tmp/tmp.parquet.part'
2023-03-22 13:46:31,233 - ThreadPoolExecutor-10_1 connectionpool.py:456 - _make_request() - DEBUG - https://gcpuscentral1-qfr7mrx-stage.storage.googleapis.com:443 "GET /stages/a24fba3f-14a6-44bf-98f1-6cef989b4c13/data/1/tmp.parquet?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcpuscentral1-qfr7mrx-stage%40us-central1-stage1-6e1d.iam.gserviceaccount.com%2F20230322%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230322T204630Z&X-Goog-Expires=3600&X-Goog-SignedHeaders=host&X-Goog-Signature=3027b086b04da44ba4ef4350d12d6ba54741a5635b74d04933f6063ee05057ab77bd77dd8e586b1c98fef480e5023626d34ed3a025b8ac4bdbd81171baea7e35f50a9262418d68f98ae0a6fda15a80b88edd589a080f6a388c5e9b3fed71aaacb905e795ca66c60bacc0c9f95fabce4f8d13ca879d15774dd58e7b3826dc025c68fea61068a7fadc381bf1d5a88633e17a662a8db06c2c825c1c64fab0cb61396ce143f8bcbe25637b3ca6eb02fb9c086925ce24986f0e2d7aab07a85e757eb317b26c3760d1c0a089d2e723cd4baa955655b1dfacb0cdd0ffe0e6256ac81c70c3194a780f0bd2c8584e59df1d2fe4bf7cb82a1ea9d89a731bc71f9cfbf7e82a HTTP/1.1" 200 144672
2023-03-22 13:46:31,234 - ThreadPoolExecutor-10_1 network.py:1171 - _use_requests_session() - DEBUG - Session status for SessionPool 'gcpuscentral1-qfr7mrx-stage.storage.googleapis.com', SessionPool 0/2 active sessions
2023-03-22 13:46:31,440 - ThreadPoolExecutor-10_1 file_transfer_agent.py:574 - function_and_callback_wrapper() - ERROR - An exception was raised in <bound method SnowflakeGCSRestClient.download_chunk of <snowflake.connector.gcs_storage_client.SnowflakeGCSRestClient object at 0x7fa63f6d93d0>>
Traceback (most recent call last):
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/file_transfer_agent.py", line 571, in function_and_callback_wrapper
    work(*args, **kwargs),
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/gcs_storage_client.py", line 193, in download_chunk
    self.write_downloaded_chunk(chunk_id, response.content)
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/storage_client.py", line 343, in write_downloaded_chunk
    with self.intermediate_dst_path.open("rb+") as fd:
  File "/usr/local/lib/python3.8/pathlib.py", line 1222, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/usr/local/lib/python3.8/pathlib.py", line 1078, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp.parquet.part'
2023-03-22 13:46:31,441 - ThreadPoolExecutor-10_1 file_transfer_agent.py:496 - transfer_done_cb() - DEBUG - Chunk 0/1 of file tmp.parquet reached callback
2023-03-22 13:46:31,441 - ThreadPoolExecutor-10_1 file_transfer_agent.py:507 - transfer_done_cb() - DEBUG - Chunk 0 of file tmp.parquet failed to transfer for unexpected exception [Errno 2] No such file or directory: '/tmp/tmp.parquet.part'
2023-03-22 13:46:31,441 - ThreadPoolExecutor-10_1 file_transfer_agent.py:512 - transfer_done_cb() - DEBUG - Chunk progress: tmp.parquet: completed: 0 failed: 1 total: 1
2023-03-22 13:46:31,441 - ThreadPoolExecutor-10_1 file_transfer_agent.py:532 - transfer_done_cb() - DEBUG - submitting tmp.parquet to done_postprocess
2023-03-22 13:46:31,441 - ThreadPoolExecutor-12_0 storage_client.py:382 - finish_download() - ERROR - Failed to download a file: /tmp/tmp.parquet
NoneType: None
2023-03-22 13:46:31,441 - ThreadPoolExecutor-12_0 file_transfer_agent.py:574 - function_and_callback_wrapper() - ERROR - An exception was raised in <bound method SnowflakeGCSRestClient.finish_download of <snowflake.connector.gcs_storage_client.SnowflakeGCSRestClient object at 0x7fa63f6d93d0>>
Traceback (most recent call last):
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/file_transfer_agent.py", line 571, in function_and_callback_wrapper
    work(*args, **kwargs),
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/gcs_storage_client.py", line 220, in finish_download
    self.meta.src_file_size = os.path.getsize(self.intermediate_dst_path)
  File "/usr/local/lib/python3.8/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp.parquet.part'
2023-03-22 13:46:31,442 - ThreadPoolExecutor-12_0 file_transfer_agent.py:542 - postprocess_done_cb() - DEBUG - File tmp.parquet reached postprocess callback
2023-03-22 13:46:31,442 - ThreadPoolExecutor-12_0 file_transfer_agent.py:547 - postprocess_done_cb() - DEBUG - File tmp.parquet failed to transfer for unexpected exception [Errno 2] No such file or directory: '/tmp/tmp.parquet.part'
2023-03-22 13:46:31,444 - MainThread connection.py:586 - close() - INFO - closed
2023-03-22 13:46:31,444 - MainThread telemetry.py:211 - close() - DEBUG - Closing telemetry client.
2023-03-22 13:46:31,446 - MainThread telemetry.py:176 - send_batch() - DEBUG - Sending 1 logs to telemetry. Data is {'logs': [{'message': {'driver_type': 'PythonConnector', 'driver_version': '3.0.1', 'source': 'PythonConnector', 'type': 'client_imported_packages', 'value': "{'types', 'subprocess', 'signal', 'collections', 'pickle', 'warnings', 'errno', 'struct', 'datetime', 'tokenize', 'string', 'ntpath', 'pycparser', 'ast', 'urllib', 'xml', 'select', 'pwd', 'gzip', 'oscrypto', 'reprlib', 'Cryptodome', 'textwrap', 'fnmatch', 'hmac', 'cffi', 'grp', 'OpenSSL', 'snowflake', 'locale', 'zipfile', 'bisect', 'glob', 'difflib', 'hashlib', 'importlib', 'typing_extensions', 'ipaddress', 'bz2', 'fcntl', 're', 'html', 'keyword', 'contextlib', 'dataclasses', 'sre_constants', 'json', 'ssl', 'certifi', 'copyreg', 'numbers', 'binascii', 'marshal', 'pytz', 'base64', 'weakref', 'traceback', 'fractions', 'threading', 'copy', 'cython_runtime', 'uu', 'jwt', 'time', 'inspect', 'sre_compile', 'shlex', 'os', 'tempfile', 'posix', 'unicodedata', 'quopri', 'lzma', 'atexit', 'decimal', 'zipimport', 'math', 'sys', 'dis', 'mimetypes', 'ctypes', 'stat', 'charset_normalizer', 'io', 'linecache', 'asn1crypto', 'heapq', 'operator', 'configparser', 'uuid', 'codecs', 'selectors', 'requests', 'filelock', 'genericpath', 'platform', 'idna', 'stringprep', 'builtins', 'cryptography', 'queue', 'csv', 'http', 'itertools', 'sre_parse', 'sysconfig', 'posixpath', 'urllib3', 'encodings', 'site', 'concurrent', 'random', 'socket', 'abc', 'zlib', 'typing', 'opcode', 'token', 'enum', 'functools', 'email', 'packaging', 'logging', 'pathlib', 'pyexpat', 'webbrowser', 'shutil', 'calendar'}"}, 'timestamp': '1679517988701'}, {'message': {'driver_type': 'PythonConnector', 'driver_version': '3.0.1', 'source': 'PythonConnector', 'type': 'client_time_consume_first_result', 'query_id': '01ab1f9e-0001-0481-0001-777a05ab414e', 'value': 52}, 'timestamp': '1679517988932'}, {'message': {'driver_type': 'PythonConnector', 
'driver_version': '3.0.1', 'source': 'PythonConnector', 'type': 'client_time_consume_last_result', 'query_id': '01ab1f9e-0001-0481-0001-777a05ab414e', 'value': 1}, 'timestamp': '1679517988933'}, {'message': {'driver_type': 'PythonConnector', 'driver_version': '3.0.1', 'source': 'PythonConnector', 'type': 'client_time_consume_last_result', 'query_id': '01ab1f9e-0001-0494-0001-777a05ab182e', 'value': 895}, 'timestamp': '1679517989992'}, {'message': {'driver_type': 'PythonConnector', 'driver_version': '3.0.1', 'source': 'PythonConnector', 'type': 'client_time_consume_last_result', 'query_id': '01ab1f9e-0001-0477-0001-777a05ab0cda', 'value': 413}, 'timestamp': '1679517990527'}, {'message': {'driver_type': 'PythonConnector', 'driver_version': '3.0.1', 'source': 'PythonConnector', 'type': 'client_time_consume_first_result', 'query_id': '01ab1f9e-0001-0477-0001-777a05ab0cde', 'value': 49}, 'timestamp': '1679517990701'}, {'message': {'driver_type': 'PythonConnector', 'driver_version': '3.0.1', 'source': 'PythonConnector', 'type': 'client_time_consume_last_result', 'query_id': '01ab1f9e-0001-0477-0001-777a05ab0cde', 'value': 1}, 'timestamp': '1679517990702'}, {'message': {'driver_type': 'PythonConnector', 'driver_version': '3.0.1', 'source': 'PythonConnector', 'Stacktrace': (False, '  File "snowflake/connector/cursor.py", line 806, in execute\n  File "snowflake/connector/file_transfer_agent.py", line 764, in result\n  File "snowflake/connector/errors.py", line 290, in errorhandler_wrapper\n  File "snowflake/connector/errors.py", line 345, in hand_to_other_handler\n  File "snowflake/connector/errors.py", line 221, in default_errorhandler\n', None), 'sql_state': 'n/a', 'reason': '253002', 'ErrorNumber': '253002', 'type': 'client_sql_exception', 'exception': 'OperationalError'}, 'timestamp': '1679517991443'}]}.
2023-03-22 13:46:31,446 - MainThread network.py:1166 - _use_requests_session() - DEBUG - Session status for SessionPool 'jbulliw-main.snowflakecomputing.com', SessionPool 1/1 active sessions
2023-03-22 13:46:31,446 - MainThread network.py:846 - _request_exec_wrapper() - DEBUG - remaining request timeout: 5, retry cnt: 1
2023-03-22 13:46:31,446 - MainThread network.py:827 - add_request_guid() - DEBUG - Request guid: 7dd4639b-4bfd-44e1-b56e-37615bf708ee
2023-03-22 13:46:31,446 - MainThread network.py:1025 - _request_exec() - DEBUG - socket timeout: 60
2023-03-22 13:46:31,551 - MainThread connectionpool.py:456 - _make_request() - DEBUG - https://jbulliw-main.snowflakecomputing.com:443 "POST /telemetry/send?request_guid=7dd4639b-4bfd-44e1-b56e-37615bf708ee HTTP/1.1" 200 86
2023-03-22 13:46:31,551 - MainThread network.py:1051 - _request_exec() - DEBUG - SUCCESS
2023-03-22 13:46:31,552 - MainThread network.py:1171 - _use_requests_session() - DEBUG - Session status for SessionPool 'jbulliw-main.snowflakecomputing.com', SessionPool 0/1 active sessions
2023-03-22 13:46:31,552 - MainThread network.py:726 - _post_request() - DEBUG - ret[code] = None, after post request
2023-03-22 13:46:31,552 - MainThread telemetry.py:200 - send_batch() - DEBUG - Successfully uploading metrics to telemetry.
2023-03-22 13:46:31,552 - MainThread connection.py:589 - close() - INFO - No async queries seem to be running, deleting session
2023-03-22 13:46:31,552 - MainThread network.py:1166 - _use_requests_session() - DEBUG - Session status for SessionPool 'jbulliw-main.snowflakecomputing.com', SessionPool 1/1 active sessions
2023-03-22 13:46:31,552 - MainThread network.py:846 - _request_exec_wrapper() - DEBUG - remaining request timeout: 5, retry cnt: 1
2023-03-22 13:46:31,552 - MainThread network.py:827 - add_request_guid() - DEBUG - Request guid: 5d6472d2-e22b-4039-9cff-80e90a19db67
2023-03-22 13:46:31,552 - MainThread network.py:1025 - _request_exec() - DEBUG - socket timeout: 60
2023-03-22 13:46:31,682 - MainThread connectionpool.py:456 - _make_request() - DEBUG - https://jbulliw-main.snowflakecomputing.com:443 "POST /session?delete=true&request_guid=5d6472d2-e22b-4039-9cff-80e90a19db67 HTTP/1.1" 200 76
2023-03-22 13:46:31,682 - MainThread network.py:1051 - _request_exec() - DEBUG - SUCCESS
2023-03-22 13:46:31,682 - MainThread network.py:1171 - _use_requests_session() - DEBUG - Session status for SessionPool 'jbulliw-main.snowflakecomputing.com', SessionPool 0/1 active sessions
2023-03-22 13:46:31,682 - MainThread network.py:726 - _post_request() - DEBUG - ret[code] = None, after post request
2023-03-22 13:46:31,685 - MainThread connection.py:600 - close() - DEBUG - Session is closed
Traceback (most recent call last):
  File "snowflake_test.py", line 59, in <module>
    ret = cursor.execute(
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/cursor.py", line 806, in execute
    data = sf_file_transfer_agent.result()
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/file_transfer_agent.py", line 764, in result
    Error.errorhandler_wrapper(
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/errors.py", line 290, in errorhandler_wrapper
    handed_over = Error.hand_to_other_handler(
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/errors.py", line 345, in hand_to_other_handler
    cursor.errorhandler(connection, cursor, error_class, error_value)
  File "/home/nmclean/.cache/pypoetry/virtualenvs/snowflake-test-wdXDAp35-py3.8/lib/python3.8/site-packages/snowflake/connector/errors.py", line 221, in default_errorhandler
    raise error_class(
snowflake.connector.errors.OperationalError: 253002: 253002: While getting file(s) there was an error: 'FileNotFoundError(2, 'No such file or directory')', this might be caused by your access to the blob storage provider, or by Snowflake.
2023-03-22 13:46:31,711 - MainThread connection.py:577 - close() - DEBUG - Rest object has been destroyed, cannot close session
2023-03-22 13:46:31,711 - MainThread cache.py:511 - _load() - DEBUG - Fail to read cache from disk due to error: name 'open' is not defined
2023-03-22 13:46:31,711 - MainThread cache.py:549 - _save() - DEBUG - Fail to write cache to disk due to error: name 'open' is not defined
2023-03-22 13:46:31,712 - MainThread storage_client.py:438 - delete_client_data() - DEBUG - cleaning up tmp dir: /tmp/tmprh8fv6y7
2023-03-22 13:46:31,712 - MainThread storage_client.py:438 - delete_client_data() - DEBUG - cleaning up tmp dir: /tmp/tmp1jt09xar
@github-actions github-actions bot changed the title Unable to GET multiple parquet files from internal stage to local directory SNOW-766772: Unable to GET multiple parquet files from internal stage to local directory Mar 22, 2023
@ndamclean
Author

I made a SF support request related to this.

A SF engineer explained that this issue is only for GCP-backed Snowflake deployments and does not occur when AWS S3 is used for internal stage storage.

@sfc-gh-achandrasekaran
Contributor

@ndamclean can you link the support case here? Is there any more help you need from eng here?

@ndamclean
Author

@sfc-gh-achandrasekaran
My support case ID is 00499944

Here's a link to the support case. I'm not sure if this is accessible to anyone with a snowflake community account or if it's only valid for my account (since it's a case I submitted).
https://community.snowflake.com/s/case/500Do000004GMHAIA4/unable-to-copy-partitioned-data-to-local-file-system-in-snowpark

I have been talking to a Snowflake customer service representative and they are working with their engineering team to resolve the issue. I'm not sure who's looking at it.

@sfc-gh-achandrasekaran
Contributor

Thanks!

@sfc-gh-achandrasekaran
Contributor

@ndamclean looks like a convo is ongoing with support. I will close this ticket so we have one issue to track. Please feel free to reopen if you need the snowpark eng team to dig in further.

@patrickhowerter

@ndamclean @sfc-gh-achandrasekaran are either one of you able to look up what the fix was for this? We have a client that is getting this same error but is using their Azure Snowflake account from an AWS EC2 instance. I would like to know if this is some setting they can change or if it was a bug that snowflake needed to fix.

@ndamclean
Author

@patrickhowerter my opinion was that it was a bug that Snowflake should fix. The support staff from Snowflake considered it a known limitation that may be addressed in the future.

Here's my explanation of the problem to the support team:

Snowflake allows staging paths that contain "/" characters, which I was using in the paths in my internal stage to separate files generated by different dataframes.

When I try to get these files with snowpark_session.file.get the code crashes with an error.

I found this error also occurs with the GET command.
The GET command supports retrieving multiple files that match a (regular expression) PATTERN argument.
https://docs.snowflake.com/en/sql-reference/sql/get

It does not state anywhere in the Snowflake documentation that GET does not support multiple files if the file paths include "/" characters.

When I try to GET multiple files from a stage and the paths include "/" characters, the Snowflake python connector breaks.

If this is a known limitation, it should be stated clearly in the documentation.

Also, the code should raise an error message that clearly indicates the problem to the user. As-is, the code crashes inelegantly with a cryptic FileNotFoundError.
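
For context, the same errno 2 message can be reproduced in plain Python by opening a file for writing under a directory that does not exist. This is only a sketch of a plausible cause: it assumes, without confirmation from this thread, that the connector tries to write the staged file's relative path under the target directory without creating the intermediate directories. The file name `event_id=1/data_0_0_0.snappy.parquet` is a made-up example.

```python
import os
import tempfile


def try_write(target_dir: str, relative_name: str) -> str:
    """Try to write a file below target_dir; return 'ok' or a short error description."""
    destination = os.path.join(target_dir, relative_name)
    try:
        with open(destination, "wb") as handle:
            handle.write(b"placeholder")
    except FileNotFoundError as exc:
        # errno 2 (ENOENT) is raised because the parent directory is missing.
        return f"{type(exc).__name__}({exc.errno}, {exc.strerror!r})"
    return "ok"


with tempfile.TemporaryDirectory() as local_dir:
    # A flat file name works: its parent directory already exists.
    flat_result = try_write(local_dir, "data_0_0_0.snappy.parquet")
    # A name containing "/" fails: the "event_id=1" directory was never created.
    nested_result = try_write(local_dir, "event_id=1/data_0_0_0.snappy.parquet")

print(flat_result)
print(nested_result)
```

The second call fails purely because of the missing `event_id=1` directory, which matches the symptom in the traceback but does not prove that this is what happens inside the connector.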

I personally feel that your team should consider this a bug. It seems like unexpected behaviour to be able to write multiple files to stage with paths containing "/" characters and not having the ability to retrieve those files with a GET command.

It would probably be a simple fix to allow the Snowflake python connector to create directories on my local machine at the target location in order to support downloading staged files with "/" characters to my local file system in a directory structure.

In the end, I had to write my own code to download the multiple files, which (as far as I can tell according to the current documentation) should be supported behaviour.
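
A minimal sketch of that kind of workaround (not the code actually used here): create the matching local directory for each staged file first, then issue one GET per file. The function name `download_partitioned` and its parameters are hypothetical. It assumes the caller already has the file paths relative to the stage root (for example, derived from a `LIST` on the stage), that `cursor` is an open connector cursor, and that local paths are POSIX-style. It has not been verified against a GCP-backed account.

```python
import os
import posixpath
from typing import Iterable, List


def download_partitioned(
    cursor, stage: str, relative_paths: Iterable[str], local_root: str
) -> List[str]:
    """Download each staged file into a local directory mirroring its stage path.

    `relative_paths` are assumed to be paths relative to the stage root,
    using "/" as the separator. Returns the local directory used for each file.
    """
    local_root = os.path.abspath(local_root)
    local_dirs = []
    for rel_path in relative_paths:
        # Stage paths always use "/", so split them with posixpath.
        sub_dir = posixpath.dirname(rel_path)
        local_dir = (
            os.path.join(local_root, *sub_dir.split("/")) if sub_dir else local_root
        )
        # Create the directory up front so the download has somewhere to land.
        os.makedirs(local_dir, exist_ok=True)
        # One GET per file; both arguments are quoted because the paths may
        # contain characters such as "=".
        cursor.execute(f"GET '@{stage}/{rel_path}' 'file://{local_dir}/'")
        local_dirs.append(local_dir)
    return local_dirs
```

With a real connection this would be called along the lines of `download_partitioned(connection.cursor(), "my_stage", paths, "/tmp/out")`, where `my_stage` and `paths` are placeholders. Issuing one GET per file gives up the `PATTERN` and `PARALLEL` conveniences, so it will likely be slower for many small files.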

I hope this helps!

@patrickhowerter

Thanks for the quick reply @ndamclean! It appears that our issue may be different from the one you experienced.
