Download CDF #115

Beforerr · 2024-02-04T22:22:09Z

Beforerr
Feb 4, 2024

Is it possible to just download CDF files from CDAWEB without loading them?

jeandet · 2024-02-05T14:49:27Z

jeandet
Feb 5, 2024
Maintainer

@Beforerr not easily. The main issue is that the Speasy/CDAWEB module doesn't directly access CDF files from the archive; instead it uses the CDAWEB web service to generate CDF files on the fly, containing only the requested variable over the requested time range. Those CDF files are also never saved to disk: they are read and converted to Speasy variables immediately.

Depending on what you want to achieve, there could be other solutions.

If you only want to download a bunch of files from CDAWEB, you could use the list_files function to list all available files at a given URL and then download them with the requests module.

Something like this:

from speasy.core.any_files import list_files
import requests
import os
import tqdm

remote_dir = "https://cdaweb.gsfc.nasa.gov/pub/data/mms/mms1/fgm/srvy/l2/2016/06/"
# Raw string so that "\." reaches the regex engine unchanged
remote_files = list_files(remote_dir, file_regex=r".*\.cdf")
destdir = "data"
os.makedirs(destdir, exist_ok=True)
for file_name in tqdm.tqdm(remote_files):
    response = requests.get(f"{remote_dir}{file_name}")
    response.raise_for_status()  # do not write an HTTP error page to disk
    with open(os.path.join(destdir, file_name), "wb") as f:
        f.write(response.content)
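For many or large files, a variant of the loop above can stream each file to disk and skip files that already exist, so an interrupted run can be restarted. This is a minimal stdlib-only sketch, assuming plain HTTP(S) access to the archive; download_if_missing and its opener parameter are hypothetical names, not part of Speasy:

```python
import os
import shutil
import urllib.request


def download_if_missing(url, dest_path, opener=urllib.request.urlopen):
    """Stream `url` to `dest_path` unless that file already exists.

    Returns True if a download happened, False if the file was skipped.
    `opener` is a hypothetical, injectable parameter so the function can
    be exercised without network access.
    """
    if os.path.exists(dest_path):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    tmp_path = dest_path + ".part"
    with opener(url) as response, open(tmp_path, "wb") as f:
        shutil.copyfileobj(response, f)  # copies in chunks, not all in memory
    os.replace(tmp_path, dest_path)  # only complete files get the final name
    return True
```

Inside the loop it would be called as download_if_missing(f"{remote_dir}{file_name}", os.path.join(destdir, file_name)).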

If you just want to save SpeasyVariables on your disk then you can simply use the pickle module like this:

import speasy as spz
import pickle
mms1_fgm_b_bcs_srvy_l2 = spz.get_data(spz.inventories.data_tree.cda.MMS.MMS1.FGM.MMS1_FGM_SRVY_L2.mms1_fgm_b_bcs_srvy_l2, "2018-01-01", "2018-01-02")
fname = f"{mms1_fgm_b_bcs_srvy_l2.name}-{mms1_fgm_b_bcs_srvy_l2.time[0]}-{mms1_fgm_b_bcs_srvy_l2.time[-1]}.pkl"
with open(fname, "wb") as f:
    pickle.dump(mms1_fgm_b_bcs_srvy_l2, f)

# Reload the variable and check that it round-trips unchanged
with open(fname, "rb") as f:
    mms1_fgm_b_bcs_srvy_l2_loaded = pickle.load(f)

mms1_fgm_b_bcs_srvy_l2 == mms1_fgm_b_bcs_srvy_l2_loaded
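The same idea can be wrapped in a small helper that only fetches the data when no pickle exists yet. This is a rough sketch: cached is a hypothetical name, and fetch stands for any zero-argument callable, for instance a lambda wrapping the spz.get_data call above.

```python
import os
import pickle


def cached(fname, fetch):
    """Return the object pickled in `fname`, calling `fetch()` only if needed."""
    if os.path.exists(fname):
        with open(fname, "rb") as f:
            return pickle.load(f)
    value = fetch()  # e.g. lambda: spz.get_data(product, start, stop)
    with open(fname, "wb") as f:
        pickle.dump(value, f)
    return value
```

Only unpickle files you created yourself, since pickle can run arbitrary code on load.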

Do not hesitate to ask if you have more questions or if your use case is different.

10 replies

Beforerr Feb 5, 2024
Author

Hi @jeandet, thanks for your quick response and detailed explanation regarding Speasy/CDAWEB.

My specific use case involves conducting extensive statistical analysis across multiple years of data. I attempted to utilize get_data, but it appears to be ineffective (which is not unexpected given that the cdf files would amount to 100 GB). At present, my strategy involves downloading cdfs using pyspedas and then loading them lazily using pycdfpp and polars.

I didn't know about the list_files function before, but it seems very useful!

jeandet Feb 5, 2024
Maintainer

@Beforerr, another thing you might consider is the archive module, which might be faster than your current pipeline. The main downside is that you have to write some YAML files describing the archive you want to access; an example is shipped with Speasy. It caches both the CDF files and the SpeasyVariables. We use it for MMS data and it is way faster than the regular access methods; it can even be faster than using PyCDFPP if your CDF files are compressed (to be verified).
Do not hesitate to ask if you need help writing one of those YAML files.

Answer selected by Beforerr

Beforerr Feb 8, 2024
Author

@jeandet, this may indeed be the feature I need! Could you please provide a working example? I tried it with local files but it does not seem to work.

Configuration

WI_H4_RTN_MFI:
  inventory_path: local/WI_H4_RTN_MFI
  master_cdf: https://cdaweb.gsfc.nasa.gov/pub/software/cdawlib/0MASTERS/wi_h4-rtn_mfi_00000000_v01.cdf
  split_frequency: daily
  split_rule: regular
  url_pattern: /Users/beforerr/data/wind/mfi/mfi_h4-rtn/{Y}/wi_h4-rtn_mfi_{Y}{M:02d}{D:02d}_v\d+.cdf
  use_file_list: true

Then I tried

spz.get_data(
    spz.inventories.data_tree.archive.local.WI_H4_RTN_MFI.BF1,
    time_range=['2016-01-01', '2016-01-03'],
)

But received

AttributeError: 'SpeasyIndex' object has no attribute 'BF1'

jeandet Feb 8, 2024
Maintainer

@Beforerr I couldn't find how to get wi_h4-rtn_mfi with PySPEDAS, so I tried with wi_h0_mfi and the following code. While doing so I noticed that you had WI_H4_RTN_MFI both in the inventory path and as the dataset name, so the Speasy inventory path should be spz.inventories.data_tree.archive.local.WI_H4_RTN_MFI.WI_H4_RTN_MFI.BF1 instead. I also noticed that you passed time_range=['2016-01-01', '2016-01-03'] to spz.get_data; I guess this is a copy-paste mistake because it should not work 😅.

%matplotlib widget
import speasy as spz
import yaml
from speasy.webservices.generic_archive import user_inventory_dir
import pyspedas
from datetime import datetime

pyspedas.wind.mfi(trange=[datetime(2017,12,1),datetime(2018,1,1)], downloadonly=True, notplot=True, varnames=['BF1'])

inventory = {
    "wi_h0_mfi": {
        "inventory_path": "local/wi_h0_mfi",
        "master_cdf": "https://cdaweb.gsfc.nasa.gov/pub/software/cdawlib/0MASTERS/wi_h0_mfi_00000000_v01.cdf",
        "split_frequency": "daily",
        "split_rule": "regular",
        "url_pattern": r"/home/jeandet/data/wind_data/mfi/mfi_h0/{Y}/wi_h0_mfi_{Y}{M:02d}{D:02d}_v\d+.cdf",
        "use_file_list": True,
    }
}

with open(f"{user_inventory_dir()}/wi_h0_mfi.yaml", "w") as inv_f:
    yaml.dump(inventory, inv_f)


spz.update_inventories()
spz.get_data(spz.inventories.data_tree.archive.local.wi_h0_mfi.wi_h0_mfi.B1F1, "2017-12-2", "2017-12-5").plot()
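To make the naming explicit: with an entry like the one above, the product appears under archive, then each component of inventory_path, then the dataset key, then the variable name. The url_pattern reads like Python format fields for the date plus a regular expression for the version number. The sketch below only illustrates that reading; it is not Speasy's own code, tree_path and file_regex are hypothetical helpers, and the directory in pattern is made up:

```python
import re
from datetime import date


def tree_path(inventory_path, dataset, variable):
    """Dotted attribute path, relative to spz.inventories.data_tree."""
    return ".".join(["archive", *inventory_path.split("/"), dataset, variable])


def file_regex(url_pattern, day):
    """Fill the {Y}/{M}/{D} fields for one day; the rest stays a regex."""
    return re.compile(url_pattern.format(Y=day.year, M=day.month, D=day.day))


# Made-up directory; only the file-name part mirrors the entry above
pattern = r"/data/mfi_h0/{Y}/wi_h0_mfi_{Y}{M:02d}{D:02d}_v\d+.cdf"

print(tree_path("local/wi_h0_mfi", "wi_h0_mfi", "B1F1"))
print(bool(file_regex(pattern, date(2017, 12, 2)).fullmatch(
    "/data/mfi_h0/2017/wi_h0_mfi_20171202_v05.cdf")))
```

The first line printed is archive.local.wi_h0_mfi.wi_h0_mfi.B1F1, which matches the path used in the get_data call, and the second is True because the file name for 2017-12-02 fits the expanded pattern.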

Beforerr Feb 8, 2024
Author

Thank you. It worked fantastically !!!

Beforerr Feb 15, 2024
Author

Why does loading from the archive sometimes return no data when loading from the web service works? For example, I cloned the THB_L2_FGM dataset from SPDF to my local computer and added an entry to the configuration, but loading the thb_fgl_gse variable from the local version does not work:

Non compliant ISTP file: No data variable found, this is suspicious
