Run out of memory when forecasting much longer lead times #14
Comments
Yes, we are aware of that problem and are looking into it. |
Try using these commands before running graphcast; the explanation is at https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html (a sketch of the documented options follows below).
|
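For reference, the options documented on that JAX page are environment variables that must be set before JAX initialises its GPU backend; exporting them in the shell before invoking ai-models is equivalent. A minimal sketch, with an example fraction value:

import os

# Option 1: disable JAX/XLA's default preallocation of ~75% of GPU memory,
# so memory is allocated as needed instead.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

# Option 2 (alternative): keep preallocation but cap it at a fraction of
# total GPU memory; 0.50 here is only an example value.
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.50"

# Option 3 (alternative): allocate exactly what is needed on demand and free
# it when no longer used (slower, but minimises out-of-memory failures).
# os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

import jax  # import (and thus GPU backend initialisation) only after setting the variables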
I have tried your solution, but it failed with the following message:
%pad.151 = bf16[3114720,8]{1,0} pad(bf16[3114720,4]{1,0} %constant.369, bf16[] %constant.771), padding=0_0x0_4, metadata={op_name="jit()/jit(main)/while/body/remat/mesh2grid_gnn/_embed/mesh2grid_gnn/sequential/encoder_edges_mesh2grid_mlp/linear_0/dot_general[dimension_numbers=(((2,), (0,)), ((), ())) precision=None preferred_element_type=bfloat16]" source_file="/home/dyf/anaconda3/envs/graphcast/bin/ai-models" source_line=8}
This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.
If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2023-10-26 10:50:47,794 INFO Doing full rollout prediction in JAX: 35 seconds.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
I0000 00:00:1698288647.995927 7562 tfrt_cpu_pjrt_client.cc:352] TfrtCpuClient destroyed. |
I'm using a Tesla V100-SXM2-32GB. |
|
Good day all, |
Initially I used a 3090 with 24 GB, but the run failed immediately. It was not until I rented an A100 with 80 GB that everything ran normally. Following the example command ai-models --input cds --date 20230110 --time 0000 graphcast, I successfully obtained a file named graphcast.grib. Additionally, during the run I observed that both system memory and GPU memory usage were around 60 GB (the JAX library by itself uses about 3/4 of the GPU memory).
(ai) root@747c17acba6c:/# /opt/conda/envs/ai/bin/ai-models --input cds --date 20230110 --time 0000 graphcast
2023-11-28 06:50:35,205 INFO Writing results to graphcast.grib.
/opt/conda/envs/ai/lib/python3.10/site-packages/ecmwflibs/__init__.py:81: UserWarning: /lib/x86_64-linux-gnu/libgobject-2.0.so.0: undefined symbol: ffi_type_uint32, version LIBFFI_BASE_7.0
warnings.warn(str(e))
2023-11-28 06:50:35,531 INFO Model description:
GraphCast model at 0.25deg resolution, with 13 pressure levels. This model is
trained on ERA5 data from 1979 to 2017, and fine-tuned on HRES-fc0 data from
2016 to 2021 and can be causally evaluated on 2022 and later years. This model
does not take `total_precipitation_6hr` as inputs and can make predictions in an
operational setting (i.e., initialised from HRES-fc0).
2023-11-28 06:50:35,531 INFO Model license:
The model weights are licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). You
may obtain a copy of the License at:
https://creativecommons.org/licenses/by-nc-sa/4.0/.
The weights were trained on ERA5 data, see README for attribution statement.
2023-11-28 06:50:35,531 INFO Loading params/GraphCast_operational - ERA5-HRES 1979-2021 - resolution 0.25 - pressure levels 13 - mesh 2to6 - precipitation output only.npz: 0.3 second.
2023-11-28 06:50:35,531 INFO Building model: 0.3 second.
2023-11-28 06:50:35,531 INFO Loading surface fields from CDS
2023-11-28 06:50:35,656 INFO Loading pressure fields from CDS
2023-11-28 06:50:48,418 INFO Creating forcing variables: 12 seconds.
2023-11-28 06:50:53,993 INFO Converting GRIB to xarray: 5 seconds.
2023-11-28 06:50:57,666 INFO Reindexing: 3 seconds.
2023-11-28 06:50:57,706 INFO Creating training data: 22 seconds.
2023-11-28 06:51:04,715 INFO Extracting input targets: 6 seconds.
2023-11-28 06:51:04,715 INFO Creating input data (total): 29 seconds.
2023-11-28 06:51:05,098 INFO Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
2023-11-28 06:51:05,102 INFO Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2023-11-28 06:52:10.480192: E external/xla/xla/service/slow_operation_alarm.cc:65] Constant folding an instruction is taking > 1s:
%pad.149 = bf16[3114720,8]{1,0} pad(bf16[3114720,4]{1,0} %constant.365, bf16[] %constant.768), padding=0_0x0_4, metadata={op_name="jit(<unnamed wrapped function>)/jit(main)/while/body/remat/mesh2grid_gnn/_embed/mesh2grid_gnn/sequential/encoder_edges_mesh2grid_mlp/linear_0/dot_general[dimension_numbers=(((2,), (0,)), ((), ())) precision=None preferred_element_type=bfloat16]" source_file="/opt/conda/envs/ai/bin/ai-models" source_line=8}
This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.
If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2023-11-28 06:52:18.177603: E external/xla/xla/service/slow_operation_alarm.cc:133] The operation took 8.697529096s
Constant folding an instruction is taking > 1s:
%pad.149 = bf16[3114720,8]{1,0} pad(bf16[3114720,4]{1,0} %constant.365, bf16[] %constant.768), padding=0_0x0_4, metadata={op_name="jit(<unnamed wrapped function>)/jit(main)/while/body/remat/mesh2grid_gnn/_embed/mesh2grid_gnn/sequential/encoder_edges_mesh2grid_mlp/linear_0/dot_general[dimension_numbers=(((2,), (0,)), ((), ())) precision=None preferred_element_type=bfloat16]" source_file="/opt/conda/envs/ai/bin/ai-models" source_line=8}
This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.
If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2023-11-28 06:52:20.556910: E external/xla/xla/service/slow_operation_alarm.cc:65] Constant folding an instruction is taking > 2s:
%pad.1 = bf16[1618752,8]{1,0} pad(bf16[1618745,4]{1,0} %constant.374, bf16[] %constant.687), padding=0_7x0_4, metadata={op_name="jit(<unnamed wrapped function>)/jit(main)/while/body/remat/grid2mesh_gnn/_embed/grid2mesh_gnn/sequential/encoder_edges_grid2mesh_mlp/linear_0/dot_general[dimension_numbers=(((2,), (0,)), ((), ())) precision=None preferred_element_type=bfloat16]" source_file="/opt/conda/envs/ai/bin/ai-models" source_line=8}
This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.
If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2023-11-28 06:52:22.395348: E external/xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.838535293s
Constant folding an instruction is taking > 2s:
%pad.1 = bf16[1618752,8]{1,0} pad(bf16[1618745,4]{1,0} %constant.374, bf16[] %constant.687), padding=0_7x0_4, metadata={op_name="jit(<unnamed wrapped function>)/jit(main)/while/body/remat/grid2mesh_gnn/_embed/grid2mesh_gnn/sequential/encoder_edges_grid2mesh_mlp/linear_0/dot_general[dimension_numbers=(((2,), (0,)), ((), ())) precision=None preferred_element_type=bfloat16]" source_file="/opt/conda/envs/ai/bin/ai-models" source_line=8}
This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.
If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2023-11-28 06:52:37,380 INFO Doing full rollout prediction in JAX: 1 minute 32 seconds.
2023-11-28 06:52:37,380 INFO Converting output xarray to GRIB and saving
2023-11-28 06:54:53,203 INFO Saving output data: 2 minutes 15 seconds.
2023-11-28 06:54:53,276 INFO Total time: 4 minutes 19 seconds.
(ai) root@747c17acba6c:/# ls
graphcast.grib params sspaas-fs sspaas-tmp stats test.py tf-logs
(ai) root@747c17acba6c:/# du -lh graphcast.grib
6.5G graphcast.grib
|
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running FusedMatMul node. Name:'MatMul_With_Transpose_FusedMatMulAndScale' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 1851310080
This also happens when running Pangu, but after a few repeated attempts it works fine. |
I have the same issue as above, and I am using a Tesla M40 24GB GPU. Is there any solution? |
I have the same issue as above, and I am using an NVIDIA 2080 Ti 11GB GPU. Is there any solution? Is it possible to solve this issue by |
Hello all, |
May I ask what the minimum graphics card configuration and minimum memory requirements are for running the model? |
From my experience, the minimum GPU memory requirement for running a model like Pangu-Weather is 12 to 16 GiB, and CPU-only inference needs at least 16 GiB of RAM. |
How about GraphCast? |
I'm still testing GraphCast with GFS analysis data as the input. It used at least 18 GiB of my GPU memory but ran out of memory in the end. |
May I ask what the minimum graphics card configuration and minimum memory requirements are for running the model? I am still bothered by this issue. Is an A100 really necessary? |
I am trying to run the models with fewer steps and then rerun them using the output GRIB file as the new input file, but I got errors; it seems the output GRIB file cannot be used as an input file for these models. Any idea why this is? |
Maybe your output file only contains the NWP data on 13 pressure levels rather than 37 pressure levels, which depends on the params files of your models.
|
I have written code to solve this issue myself; you can check my forked repository. |
I found your forked repository of graphcast, but after I installed it and ran ai-models, the inference process got killed at the first step:
(ai-models) [@localhost ~]$ ai-models --date 20240425 --time 1200 --lead-time 384 --path graphcast/2024-04-25z12:00:00.grib graphcast
2024-04-26 17:32:59,768 INFO Model license:
2024-04-26 17:32:59,768 INFO Loading params/GraphCast_operational - ERA5-HRES 1979-2021 - resolution 0.25 - pressure levels 13 - mesh 2to6 - precipitation output only.npz: 0.3 second. |
If you have GFS analysis data, you can run |
Changing the code to run the GraphCast_small model solved this problem for me:
Those changes reduce the resolution to 1.0 degree and thus require much less memory.
Can you provide the specific steps? For example, how do you download the data?
For the Pangu model, manually creating the InferenceSession and destroying it after each use inside the stepper solved my problem.
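As an illustrative sketch only (run_one_step and its arguments are hypothetical, not the actual ai-models-panguweather stepper code), the idea is to build a fresh onnxruntime session for each step and drop it afterwards so its GPU memory arena is released:

import onnxruntime as ort

def run_one_step(model_path, input_feed):
    # Create a fresh session for this step only; CUDA is assumed available,
    # with CPU as a fallback execution provider.
    session = ort.InferenceSession(
        model_path,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    outputs = session.run(None, input_feed)
    # Dropping the session lets onnxruntime free its GPU arena before the
    # next step runs, at the cost of re-loading the model each time.
    del session
    return outputs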
In model.py you can change code lines 57 and 58 to:
... you can manually download the file from the download_url above. |
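As a rough, hypothetical sketch of the kind of change meant above (the names below are placeholders, not the real identifiers in ai-models-graphcast's model.py, and the actual download URL is deliberately left out), the idea is to point the model at the 1.0-degree GraphCast_small checkpoint instead of the 0.25-degree operational one:

# Hypothetical placeholder names; check model.py for the real attributes that
# select the checkpoint and its matching statistics files.
checkpoint_file = (
    "params/GraphCast_small - ERA5 1979-2015 - resolution 1.0 - "
    "pressure levels 13 - mesh 2to5 - precipitation input and output.npz"
)
grid_resolution = 1.0  # inputs must be built on the matching 1.0-degree grid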
Once you make those changes, how do you actually run ai-models? |
Getting a segmentation fault:
2024-08-14 21:24:09,142 INFO Loading params/GraphCast_small - ERA5 1979-2015 - resolution 1.0 - pressure levels 13 - mesh 2to5 - precipitation input and output.npz: 0.4 second. |
Yes, you need MARS access.
Best,
Han
On Wed, Aug 14, 2024 at 3:46 PM, Sean Wang wrote:
So I've tried ai-models --assets ./graphcast_assets graphcast, where ./graphcast_assets holds params/GraphCast_small - ERA5 1979-2015 - resolution 1.0 - pressure levels 13 - mesh 2to5 - precipitation input and output.npz, but I'm running into the following: ecmwfapi.api.APIException: "ecmwf.API error 1: User ' has no access to services/mars"
Do you need access to MARS?
|
For ai-models-graphcast, it works fine when I predict only a few time steps. However, it fails with an "out of memory" error when I try to predict over a longer lead time, such as 10 days. I have 188 GB of CPU memory or 24 GB of GPU memory. Is there any solution to avoid this issue? It appears that the memory used by the model is not released after completing each step. Thanks in advance for your reply!
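For background on that last observation, here is a general JAX pattern, not the actual ai-models-graphcast rollout (step_fn and state are hypothetical): keep only the current state on the device and copy each step's output to host memory, so device buffers can be reused instead of accumulating across steps.

import jax

def bounded_rollout(step_fn, state, num_steps):
    # Hypothetical sketch: step_fn advances the model by one autoregressive step.
    outputs = []
    for _ in range(num_steps):
        state = step_fn(state)
        # jax.device_get copies the result to host memory as NumPy arrays, so
        # storing it in `outputs` does not pin GPU buffers across steps.
        outputs.append(jax.device_get(state))
    return outputs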