DataAssimBench

This work follows the effort initiated by Rasp et al. in WeatherBench (https://github.com/pangeo-data/WeatherBench). Here, we create the training sets and processes required to develop data assimilation methods and transition them from conception to full-scale Earth system models.

The field of data assimilation (DA) studies the integration of theory with observations. Models alone cannot make reliable predictions; they must be initialized from observations. Data assimilation originated from the need for operational weather forecast models to ingest observations in real time so that computer models of the atmosphere could be initialized from a "best guess" state of the current conditions.

Today, applied DA has matured in operational weather forecasting to include the entire online cycled process of continually ingesting numerous disparate observational data sources and integrating them with numerical prediction models to make regular forecasts, while estimating errors and uncertainties in this process and accounting for them along the way. The process can also include correcting inaccuracies in the model formulations or applying post-processing to forecasts to improve agreement with observations.
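To make the cycled forecast-and-update process concrete, here is a minimal scalar sketch in pure Python. It is an illustration only and does not use dabench: the toy linear model, the function names `forecast` and `analysis_update`, and all numeric values are assumptions chosen for this example. The update step is the standard variance-weighted combination of a model background and an observation.

```python
def analysis_update(background, background_var, obs, obs_var):
    """Combine a model background with an observation, weighted by error variance."""
    gain = background_var / (background_var + obs_var)
    analysis = background + gain * (obs - background)
    analysis_var = (1.0 - gain) * background_var
    return analysis, analysis_var


def forecast(state, variance, growth=1.1, model_error_var=0.05):
    """Advance the state and its error variance with a toy model x -> growth * x."""
    return growth * state, growth ** 2 * variance + model_error_var


# Cycle: forecast to the next observation time, then update with the observation.
state, variance = 1.0, 1.0
for obs in [1.2, 1.25, 1.5]:  # hypothetical observations
    state, variance = forecast(state, variance)
    state, variance = analysis_update(state, variance, obs, obs_var=0.25)
    print(round(state, 3), round(variance, 3))
```

After each update the analysis variance is smaller than both the forecast variance and the observation variance, which is the sense in which the analysis is a "best guess".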

Most of the software here was adapted from code developed for the following publications. Please cite these works when using this software package:

Solvik, K., Penny, S. G., & Hoyer, S. (2025). 4D-Var using Hessian approximation and backpropagation applied to automatically differentiable numerical and machine learning models. Journal of Advances in Modeling Earth Systems, 17, e2024MS004608. https://doi.org/10.1029/2024MS004608
Penny, S. G., Smith, T. A., Chen, T.-C., Platt, J. A., Lin, H.-Y., Goodliff, M., & Abarbanel, H. D. I. (2022). Integrating recurrent neural networks with data assimilation for scalable data-driven state estimation. Journal of Advances in Modeling Earth Systems, 14, e2021MS002843. https://doi.org/10.1029/2021MS002843
Smith, T. A., Penny, S. G., Platt, J. A., & Chen, T.-C. (2023). Temporal subsampling diminishes small spatial scales in recurrent neural network emulators of geophysical turbulence. Journal of Advances in Modeling Earth Systems, 15, e2023MS003792. https://doi.org/10.1029/2023MS003792
Platt, J. A., Penny, S. G., Smith, T. A., Chen, T.-C., & Abarbanel, H. D. I. (2022). A systematic exploration of reservoir computing for forecasting complex spatiotemporal dynamics. Neural Networks, 153, 530-552. https://doi.org/10.1016/j.neunet.2022.06.025

Installation

We recommend setting up a virtual environment using either conda or virtualenv.

Clone Repo:

git clone git@github.com:StevePny/DataAssimBench.git
cd ./DataAssimBench

Setup virtual environment

In your cloned project directory, create a virtual environment
```
python3 -m venv ./.venv
```
Activate the virtual environment
```
source ./.venv/bin/activate
```

Install dabench

pip install -e ".[full]"

This will create a full installation including the ability to access cloud data or interface with other packages such as qgs. Alternatively, for a minimal installation, run:

pip install -e .

Quick Start

For more detailed examples, go to the DataAssimBench-Examples repo.

Importing data generators

import dabench as dab
help(dab.data) # View data classes, e.g. data.Lorenz96
help(dab.data.Lorenz96) # Get more info about Lorenz96 class

Generating data

All of the data objects are set up with reasonable defaults. Generating data is as easy as:

l96_obj = dab.data.Lorenz96() # Create data generator object
ds_l96 = l96_obj.generate(n_steps=1000) # Generate Lorenz96 simulation data as Xarray Dataset
ds_l96.dab.flatten().values # View output values flattened along time dimension

This example is for a Lorenz96 model, but all of the data objects work in a similar way.
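For readers unfamiliar with the system being simulated, Lorenz96 evolves each variable according to dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F, with periodic indices. The following pure-Python sketch integrates that equation with a fourth-order Runge-Kutta scheme. It is an illustration only, not dabench's implementation: the function names `l96_rhs`, `rk4_step`, and `simulate_l96`, the default values, and the initial perturbation are all assumptions made for this example.

```python
def l96_rhs(x, forcing=8.0):
    """Lorenz96 tendency: dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F."""
    n = len(x)
    # Negative indices wrap around for i-1 and i-2; the modulo wraps i+1.
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
            for i in range(n)]


def rk4_step(x, dt, forcing=8.0):
    """Advance the state by one fourth-order Runge-Kutta step."""
    k1 = l96_rhs(x, forcing)
    k2 = l96_rhs([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)], forcing)
    k3 = l96_rhs([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)], forcing)
    k4 = l96_rhs([xi + dt * ki for xi, ki in zip(x, k3)], forcing)
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def simulate_l96(n_steps, system_dim=36, forcing=8.0, dt=0.01):
    """Return a list of n_steps states, starting near the fixed point x_i = F."""
    x = [forcing] * system_dim
    x[0] += 0.01  # small perturbation so the trajectory leaves the fixed point
    trajectory = [x]
    for _ in range(n_steps - 1):
        x = rk4_step(x, dt, forcing)
        trajectory.append(x)
    return trajectory


traj = simulate_l96(n_steps=1000)
print(len(traj), len(traj[0]))  # number of time steps, number of variables
```

The result is a plain list of states indexed by time and then by variable, whereas dabench returns an Xarray Dataset as shown above.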

Sampling observations

Now that we have a generated dataset, we can easily generate noisy observations from it like so:

obs = dab.observer.Observer(
        ds_l96, # Our generated Dataset object
        random_time_density=0.4, # Randomly sample ~40% of times
        random_location_density=0.3, # Randomly sample ~30% of variables
        # random_location_count = 10, # Alternatively, can specify number of locations to sample
        error_sd=1.2 # Add Gaussian Noise with SD = 1.2
)
obs_vec = obs.observe() # Run observe() method to generate observations
obs_vec

The Observer class is very flexible, allowing users to provide specific times and locations or randomly generate them. You can also choose to use "stationary" or "nonstationary" observers, indicating whether to sample the same locations at each observation time step or to sample different ones (default is "stationary").
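The difference between the two observer modes can be sketched in a few lines of pure Python. This is a conceptual illustration under stated assumptions, not the Observer class's actual logic: the helper names `sample_obs_locations` and `add_noise`, their parameters, and the use of the standard `random` module are all hypothetical.

```python
import random


def sample_obs_locations(n_times, n_locations, location_density,
                         stationary=True, seed=0):
    """Pick which location indices are observed at each time step."""
    rng = random.Random(seed)
    n_obs = max(1, round(location_density * n_locations))
    if stationary:
        # Draw one set of locations and reuse it at every time step.
        fixed = sorted(rng.sample(range(n_locations), n_obs))
        return [fixed for _ in range(n_times)]
    # Nonstationary: draw a fresh set of locations for each time step.
    return [sorted(rng.sample(range(n_locations), n_obs))
            for _ in range(n_times)]


def add_noise(values, error_sd, seed=0):
    """Add zero-mean Gaussian noise with standard deviation error_sd."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, error_sd) for v in values]


stationary_locs = sample_obs_locations(4, 10, 0.3, stationary=True)
moving_locs = sample_obs_locations(4, 10, 0.3, stationary=False)
print(stationary_locs)  # the same three indices at every time step
print(moving_locs)      # three indices per time step, redrawn each time
```

A stationary observer resembles a fixed network of instruments, while a nonstationary one resembles platforms whose sampling positions change between observation times.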

Customizing generation options

All data objects are customizable.

For data-generators (e.g. numerical models such as Lorenz63, Lorenz96, SQGTurb), this means you can change initial conditions, model parameters, timestep size, number of timesteps, etc.

For data-downloaders (e.g. ENSOIDX, GCP), this means changing which variables you download, the lat/lon bounding box, the time period, etc.

The recommended way of specifying options is to pass a dictionary of keyword arguments (kwargs). The exact options vary between the different types of data objects, so be sure to check the specific documentation for your chosen generator/downloader for more info.

For example, for the Lorenz96 data-generator we can change the forcing term, system_dim, and integration timestep delta_t like this:

l96_options = {'forcing_term': 7.5,
               'system_dim': 5,
               'delta_t': 0.05}
l96_obj = dab.data.Lorenz96(**l96_options) # Create data generator object
ds_l96 = l96_obj.generate(n_steps=1000) # Generate Lorenz96 simulation data
ds_l96 # View the output values

Similarly, for the Google Cloud (GCP) ERA5 data-downloader, we can select our variables and time period like this:

gcp_options = {'variables': ['2m_temperature', 'sea_surface_temperature'],
               'date_start': '2020-06-01',
               'date_end': '2020-06-07'}
gcp_obj = dab.data.GCP(**gcp_options) # Create data downloader object
ds_gcp = gcp_obj.load() # Loads data. Can also use gcp_obj.generate()
ds_gcp # View the output values

Reference docs

For more detail, see our documentation
