hf-transfer progress bar (#1792)
* hf-transfer refactoring

* Still display download infos at the end

* Make style

* Less failures + mypy more happy ?

* Like that ?

* Why not from __future__ import annotations @Wauplin ?

* Let's just type: ignore

* Ruff --fix

* Post-rebase fix

* HF_TRANSFER threshold

* Try ascii=True

* Revert "Try ascii=True"

This reverts commit 166d3ab.

* Update src/huggingface_hub/file_download.py

Co-authored-by: Lucain <lucainp@gmail.com>

* Outdated hf-transfer warning

* hf-transfer consistency check

* [docs] Update download.md

* [docs] Update environment_variables.md

* Update src/huggingface_hub/file_download.py

Co-authored-by: Lucain <lucainp@gmail.com>

* [docs] Update download.md

---------

Co-authored-by: Lucain <lucainp@gmail.com>
cbensimon and Wauplin authored Nov 6, 2023
1 parent 4c93d89 · commit d306c90
Showing 4 changed files with 96 additions and 56 deletions.
8 changes: 7 additions & 1 deletion docs/source/en/guides/download.md
@@ -189,8 +189,14 @@ For more details about the CLI download command, please refer to the [CLI guide]

 If you are running on a machine with high bandwidth, you can increase your download speed with [`hf_transfer`](https://github.com/huggingface/hf_transfer), a Rust-based library developed to speed up file transfers with the Hub. To enable it, install the package (`pip install hf_transfer`) and set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.
 
+<Tip>
+
+Progress bars are supported when downloading with `hf_transfer` starting from version `0.1.4`. Consider upgrading (`pip install -U hf-transfer`) if you plan to enable faster downloads.
+
+</Tip>
+
 <Tip warning={true}>
 
-`hf_transfer` is a power user tool! It is tested and production-ready, but it lacks user-friendly features like progress bars or advanced error handling. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).
+`hf_transfer` is a power user tool! It is tested and production-ready, but it lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).
 
 </Tip>
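
To make the guide's instructions concrete, here is a minimal sketch of enabling `hf_transfer` from Python, assuming the package is installed; the repo and filename are placeholders. The variable must be set before `huggingface_hub` is imported, since it is read at import time:

```python
import os

# Must be set before importing huggingface_hub: the flag is read at import time.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

# Placeholder repo/file: large files benefit most from the fast path.
path = hf_hub_download(repo_id="gpt2", filename="model.safetensors")
print(path)
```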
2 changes: 1 addition & 1 deletion docs/source/en/package_reference/environment_variables.md
@@ -141,7 +141,7 @@ Set to `True` for faster uploads and downloads from the Hub using `hf_transfer`.

 By default, `huggingface_hub` uses the Python-based `requests.get` and `requests.post` functions. Although these are reliable and versatile, they may not be the most efficient choice for machines with high bandwidth. [`hf_transfer`](https://github.com/huggingface/hf_transfer) is a Rust-based package developed to maximize the bandwidth used by dividing large files into smaller parts and transferring them simultaneously using multiple threads. This approach can potentially double the transfer speed. To use `hf_transfer`, you need to install it separately [from PyPI](https://pypi.org/project/hf-transfer/) and set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.
 
-Please note that using `hf_transfer` comes with certain limitations. Since it is not purely Python-based, debugging errors may be challenging. Additionally, `hf_transfer` lacks several user-friendly features such as progress bars, resumable downloads and proxies. These omissions are intentional to maintain the simplicity and speed of the Rust logic. Consequently, `hf_transfer` is not enabled by default in `huggingface_hub`.
+Please note that using `hf_transfer` comes with certain limitations. Since it is not purely Python-based, debugging errors may be challenging. Additionally, `hf_transfer` lacks several user-friendly features such as resumable downloads and proxies. These omissions are intentional to maintain the simplicity and speed of the Rust logic. Consequently, `hf_transfer` is not enabled by default in `huggingface_hub`.
 
 ## Deprecated environment variables

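To make the "smaller parts, multiple threads" description above concrete, here is a rough pure-Python sketch of the strategy `hf_transfer` implements in Rust. It is illustrative only (the function names and worker count are invented for this example) and assumes the server honors HTTP `Range` requests:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB per part


def fetch_range(url: str, start: int, end: int) -> bytes:
    # Fetch one byte range; `end` is inclusive, per the Range header spec.
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    r.raise_for_status()
    return r.content


def parallel_download(url: str, size: int, max_workers: int = 8) -> bytes:
    # Split [0, size) into fixed-size parts and fetch them concurrently.
    ranges = [(start, min(start + CHUNK_SIZE, size) - 1) for start in range(0, size, CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = pool.map(lambda rng: fetch_range(url, *rng), ranges)
    return b"".join(parts)  # map() preserves order, so the parts concatenate correctly
```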
2 changes: 2 additions & 0 deletions src/huggingface_hub/constants.py
@@ -33,6 +33,8 @@ def _as_int(value: Optional[str]) -> Optional[int]:
 DEFAULT_ETAG_TIMEOUT = 10
 DEFAULT_DOWNLOAD_TIMEOUT = 10
 DEFAULT_REQUEST_TIMEOUT = 10
+DOWNLOAD_CHUNK_SIZE = 10 * 1024 * 1024
+HF_TRANSFER_CONCURRENCY = 100
 
 # Git-related constants

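A quick back-of-the-envelope with the two new constants shows what they imply for a large download (the 5 GiB size is just an example):

```python
import math

DOWNLOAD_CHUNK_SIZE = 10 * 1024 * 1024  # 10 MiB
HF_TRANSFER_CONCURRENCY = 100

file_size = 5 * 1024**3  # example: a 5 GiB model shard
n_chunks = math.ceil(file_size / DOWNLOAD_CHUNK_SIZE)
print(n_chunks)                                # 512 chunks
print(min(n_chunks, HF_TRANSFER_CONCURRENCY))  # at most 100 transfers in flight
```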
140 changes: 86 additions & 54 deletions src/huggingface_hub/file_download.py
@@ -1,5 +1,6 @@
 import copy
 import fnmatch
+import inspect
 import io
 import json
 import os
@@ -27,12 +28,14 @@
     DEFAULT_ETAG_TIMEOUT,
     DEFAULT_REQUEST_TIMEOUT,
     DEFAULT_REVISION,
+    DOWNLOAD_CHUNK_SIZE,
     ENDPOINT,
     HF_HUB_CACHE,
     HF_HUB_DISABLE_SYMLINKS_WARNING,
     HF_HUB_DOWNLOAD_TIMEOUT,
     HF_HUB_ENABLE_HF_TRANSFER,
     HF_HUB_ETAG_TIMEOUT,
+    HF_TRANSFER_CONCURRENCY,
     HUGGINGFACE_CO_URL_TEMPLATE,
     HUGGINGFACE_HEADER_X_LINKED_ETAG,
     HUGGINGFACE_HEADER_X_LINKED_SIZE,
@@ -441,29 +444,21 @@ def http_get(
     transient error (network outage?). We log a warning message and try to resume the download a few times before
     giving up. The method gives up after 5 attempts if no new data has been received from the server.
     """
-    if not resume_size:
-        if HF_HUB_ENABLE_HF_TRANSFER:
+    hf_transfer = None
+    if HF_HUB_ENABLE_HF_TRANSFER:
+        if resume_size != 0:
+            warnings.warn("'hf_transfer' does not support `resume_size`: falling back to regular download method")
+        elif proxies is not None:
+            warnings.warn("'hf_transfer' does not support `proxies`: falling back to regular download method")
+        else:
             try:
-                # Download file using an external Rust-based package. Download is faster
-                # (~2x speed-up) but supports fewer features (no progress bars).
-                from hf_transfer import download
-
-                logger.debug(f"Download {url} using HF_TRANSFER.")
-                max_files = 100
-                chunk_size = 10 * 1024 * 1024  # 10 MB
-                download(url, temp_file.name, max_files, chunk_size, headers=headers)
-                return
+                import hf_transfer  # type: ignore[no-redef]
             except ImportError:
                 raise ValueError(
                     "Fast download using 'hf_transfer' is enabled"
                     " (HF_HUB_ENABLE_HF_TRANSFER=1) but 'hf_transfer' package is not"
                     " available in your environment. Try `pip install hf_transfer`."
                 )
-            except Exception as e:
-                raise RuntimeError(
-                    "An error occurred while downloading using `hf_transfer`. Consider"
-                    " disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling."
-                ) from e
 
     initial_headers = headers
     headers = copy.deepcopy(headers) or {}
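
The hunk above replaces the old early-return with upfront eligibility checks. A reduced sketch of that selection logic, using a hypothetical helper name (the real `http_get` inlines these checks), could look like:

```python
import warnings
from typing import Optional


def should_use_hf_transfer(enabled: bool, resume_size: int, proxies: Optional[dict]) -> bool:
    # Hypothetical helper mirroring the checks above: the fast path is taken
    # only when the flag is set, no unsupported option is requested, and the
    # package imports cleanly.
    if not enabled:
        return False
    if resume_size != 0:
        warnings.warn("'hf_transfer' does not support `resume_size`: falling back to regular download")
        return False
    if proxies is not None:
        warnings.warn("'hf_transfer' does not support `proxies`: falling back to regular download")
        return False
    try:
        import hf_transfer  # noqa: F401
    except ImportError:
        raise ValueError("HF_HUB_ENABLE_HF_TRANSFER=1 is set but 'hf_transfer' is not installed.")
    return True
```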
@@ -492,53 +487,90 @@ def http_get(
     if len(displayed_name) > 40:
         displayed_name = f"(…){displayed_name[-40:]}"
 
+    consistency_error_message = (
+        f"Consistency check failed: file should be of size {expected_size} but has size"
+        f" {{actual_size}} ({displayed_name}).\nWe are sorry for the inconvenience. Please retry download and"
+        " pass `force_download=True, resume_download=False` as argument.\nIf the issue persists, please let us"
+        " know by opening an issue on https://github.com/huggingface/huggingface_hub."
+    )
+
     # Stream file to buffer
-    progress = tqdm(
+    with tqdm(
         unit="B",
         unit_scale=True,
         total=total,
         initial=resume_size,
         desc=displayed_name,
         disable=bool(logger.getEffectiveLevel() == logging.NOTSET),
-    )
-    try:
+    ) as progress:
+        if hf_transfer and total is not None and total > 5 * DOWNLOAD_CHUNK_SIZE:
+            supports_callback = "callback" in inspect.signature(hf_transfer.download).parameters
+            if not supports_callback:
+                warnings.warn(
+                    "You are using an outdated version of `hf_transfer`. "
+                    "Consider upgrading to latest version to enable progress bars "
+                    "using `pip install -U hf_transfer`."
+                )
+            try:
+                hf_transfer.download(
+                    url=url,
+                    filename=temp_file.name,
+                    max_files=HF_TRANSFER_CONCURRENCY,
+                    chunk_size=DOWNLOAD_CHUNK_SIZE,
+                    headers=headers,
+                    parallel_failures=3,
+                    max_retries=5,
+                    **({"callback": progress.update} if supports_callback else {}),
+                )
+            except Exception as e:
+                raise RuntimeError(
+                    "An error occurred while downloading using `hf_transfer`. Consider"
+                    " disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling."
+                ) from e
+            if not supports_callback:
+                progress.update(total)
+            if expected_size is not None and expected_size != os.path.getsize(temp_file.name):
+                raise EnvironmentError(
+                    consistency_error_message.format(
+                        actual_size=os.path.getsize(temp_file.name),
+                    )
+                )
+            return
         new_resume_size = resume_size
-        for chunk in r.iter_content(chunk_size=10 * 1024 * 1024):
-            if chunk:  # filter out keep-alive new chunks
-                progress.update(len(chunk))
-                temp_file.write(chunk)
-                new_resume_size += len(chunk)
-                # Some data has been downloaded from the server so we reset the number of retries.
-                _nb_retries = 5
-    except (requests.ConnectionError, requests.ReadTimeout) as e:
-        # If ConnectionError (SSLError) or ReadTimeout happen while streaming data from the server, it is most likely
-        # a transient error (network outage?). We log a warning message and try to resume the download a few times
-        # before giving up. The retry mechanism is basic but should be enough in most cases.
-        if _nb_retries <= 0:
-            logger.warning("Error while downloading from %s: %s\nMax retries exceeded.", url, str(e))
-            raise
-        logger.warning("Error while downloading from %s: %s\nTrying to resume download...", url, str(e))
-        time.sleep(1)
-        reset_sessions()  # In case of SSLError it's best to reset the shared requests.Session objects
-        return http_get(
-            url=url,
-            temp_file=temp_file,
-            proxies=proxies,
-            resume_size=new_resume_size,
-            headers=initial_headers,
-            expected_size=expected_size,
-            _nb_retries=_nb_retries - 1,
-        )
-
-    if expected_size is not None and expected_size != temp_file.tell():
-        raise EnvironmentError(
-            f"Consistency check failed: file should be of size {expected_size} but has size"
-            f" {temp_file.tell()} ({displayed_name}).\nWe are sorry for the inconvenience. Please retry download and"
-            " pass `force_download=True, resume_download=False` as argument.\nIf the issue persists, please let us"
-            " know by opening an issue on https://github.com/huggingface/huggingface_hub."
-        )
+        try:
+            for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
+                if chunk:  # filter out keep-alive new chunks
+                    progress.update(len(chunk))
+                    temp_file.write(chunk)
+                    new_resume_size += len(chunk)
+                    # Some data has been downloaded from the server so we reset the number of retries.
+                    _nb_retries = 5
+        except (requests.ConnectionError, requests.ReadTimeout) as e:
+            # If ConnectionError (SSLError) or ReadTimeout happen while streaming data from the server, it is most likely
+            # a transient error (network outage?). We log a warning message and try to resume the download a few times
+            # before giving up. The retry mechanism is basic but should be enough in most cases.
+            if _nb_retries <= 0:
+                logger.warning("Error while downloading from %s: %s\nMax retries exceeded.", url, str(e))
+                raise
+            logger.warning("Error while downloading from %s: %s\nTrying to resume download...", url, str(e))
+            time.sleep(1)
+            reset_sessions()  # In case of SSLError it's best to reset the shared requests.Session objects
+            return http_get(
+                url=url,
+                temp_file=temp_file,
+                proxies=proxies,
+                resume_size=new_resume_size,
+                headers=initial_headers,
+                expected_size=expected_size,
+                _nb_retries=_nb_retries - 1,
+            )
 
-    progress.close()
+    if expected_size is not None and expected_size != temp_file.tell():
+        raise EnvironmentError(
+            consistency_error_message.format(
+                actual_size=temp_file.tell(),
+            )
+        )
 
 
 @validate_hf_hub_args
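The progress-bar support added here hinges on runtime feature detection rather than a version pin: `inspect.signature` is used to check whether the installed `hf_transfer.download` accepts a `callback` argument. The pattern can be reproduced in isolation; the two functions below are stand-ins, not part of any library:

```python
import inspect


def old_api(x):
    # Stand-in for an hf_transfer release without progress support.
    return x


def new_api(x, callback=None):
    # Stand-in for a release that reports progress through `callback`.
    if callback is not None:
        callback(x)
    return x


for fn in (old_api, new_api):
    supports_callback = "callback" in inspect.signature(fn).parameters
    print(fn.__name__, supports_callback)  # old_api False, new_api True
```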

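Finally, the retry path that the second hunk keeps for the regular backend can be summarized in a self-contained sketch: on a transient error, recurse with the byte count already written so an HTTP `Range` request resumes where the stream stopped. All names here are illustrative, not the library's API:

```python
import time

import requests

MAX_RETRIES = 5


def download_with_resume(url: str, out_path: str, resume_size: int = 0, retries: int = MAX_RETRIES) -> None:
    # Illustrative sketch only; assumes the server supports Range requests.
    headers = {"Range": f"bytes={resume_size}-"} if resume_size else {}
    mode = "ab" if resume_size else "wb"
    try:
        with requests.get(url, headers=headers, stream=True, timeout=10) as r:
            r.raise_for_status()
            with open(out_path, mode) as f:
                for chunk in r.iter_content(chunk_size=10 * 1024 * 1024):
                    if chunk:
                        f.write(chunk)
                        resume_size += len(chunk)
                        retries = MAX_RETRIES  # progress was made: reset the retry budget
    except (requests.ConnectionError, requests.ReadTimeout):
        if retries <= 0:
            raise
        time.sleep(1)
        download_with_resume(url, out_path, resume_size=resume_size, retries=retries - 1)
```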