
Asdf read speed #514


Open

wants to merge 90 commits from SolarDrew:asdf-read-speed into main

Conversation

SolarDrew (Contributor)

Fixes #500

Cadair (Member) commented Feb 4, 2025

I just did a quick experiment locally: if we convert the Table to a numpy structured array before we save it (Table.as_array()), then asdf will automatically write only one binary block containing the full data, and will save the slices in the tree as references into that block.

Full code
import asdf
import dkist
from dkist.data.sample import VBI_AJQWW

# Load the sample dataset and grab its combined header table
tds = dkist.load_dataset(VBI_AJQWW)
whole_table = tds.combined_headers

small1 = whole_table[0:10]
small2 = whole_table[10:20]

# Writing Table slices directly...
new_tree = {"whole": whole_table, "small1": small1, "small2": small2}
with asdf.AsdfFile(tree=new_tree) as af:
    af.write_to("test.asdf")
# <duplicates the data>

# Writing slices of the structured array instead...
whole_array = whole_table.as_array()
array_tree = {"whole": whole_array, "small1": whole_array[0:10], "small2": whole_array[10:20]}
with asdf.AsdfFile(tree=array_tree) as af:
    af.write_to("array.asdf")
# <does not duplicate the data>

For a small example:

whole_table2 = whole_table[["INSTRUME", "DATE-AVG"]]
whole_array2 = whole_table2.as_array()
array_tree2 = {"whole": whole_array2, "small1": whole_array2[0:10], "small2": whole_array2[10:20]}
with asdf.AsdfFile(tree=array_tree2) as af:
    af.write_to("array2.asdf")

yields this asdf:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.5.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    manifest_software: !core/software-1.0.0 {name: asdf_standard, version: 1.1.1}
    software: !core/software-1.0.0 {name: asdf, version: 3.5.0}
small1: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
small2: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
  offset: 1160
whole: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [27]
...
[BINARY BLOCK]
%YAML 1.1
---
- 1231
...

Notice the source: 0 for all three arrays, and the offset: 1160 for small2.
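As a sanity check on those numbers, the offset can be reproduced from the datatypes in the tree: small2 starts 10 rows into the shared binary block. A quick sketch, taking the field widths from the YAML above:

```python
import numpy as np

# Reconstruct the row dtype from the YAML tree: two little-endian
# UCS-4 string fields, 3 and 26 characters wide respectively.
row_dtype = np.dtype([("INSTRUME", "<U3"), ("DATE-AVG", "<U26")])

# Each row occupies (3 + 26) characters * 4 bytes per UCS-4 character.
bytes_per_row = row_dtype.itemsize
print(bytes_per_row)       # 116
print(10 * bytes_per_row)  # 1160, i.e. small2 begins 10 rows into the block
```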


I think this might be a good idea.

My main worry is that it severely limits how rich we can make the metadata table, i.e. #265 becomes something custom we have to glue on the side rather than being able to use built-in features of astropy Table.

However I think this approach has many advantages:

  • Obvious performance improvements for single tables (probably), but especially for what you are doing on this PR.
  • We can still convert back up to a Table, either in the converter or in Dataset itself, without copying the memory.
  • This is almost certainly more portable to other languages, as the ndarray tag and schema are in the core spec. It would be worth testing to see what happens in IDL.
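The view semantics behind the second bullet can be sketched with plain numpy (the array below is a stand-in for what Table.as_array() would return; the shape and field names are illustrative):

```python
import numpy as np

# Stand-in for the structured array that Table.as_array() would return
arr = np.zeros(27, dtype=[("INSTRUME", "<U3"), ("DATE-AVG", "<U26")])

# Slicing a structured array yields a view, not a copy, which is why
# asdf can serialise each slice as an offset into one shared block.
small1 = arr[0:10]
print(np.shares_memory(small1, arr))  # True

# Going back up to a Table without a copy would then be
# astropy.table.Table(arr, copy=False), whose columns view this same buffer.
```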

codspeed-hq (bot) commented Apr 8, 2025

CodSpeed Performance Report

Merging #514 will improve performance by 13.82%

Comparing SolarDrew:asdf-read-speed (a5282c9) with main (731e7f0)

Summary

⚡ 2 improvements
✅ 10 untouched benchmarks

Benchmarks breakdown

Benchmark                               BASE     HEAD     Change
test_tileddataset_repr[simple-masked]   1.9 ms   1.7 ms   +12.14%
test_tileddataset_repr[simple-nomask]   2 ms     1.7 ms   +13.82%

Cadair (Member) left a comment

Closer than we've ever been.

SolarDrew (Contributor, Author)

pre-commit.ci autofix

SolarDrew (Contributor, Author)

If I remember rightly, the current notebooks failure is a known sunpy issue, not a problem with this PR. That means this is finally ready to go, pending final review.

Labels: Run downstream CI (runs the downstream CI workflow on a PR)
Successfully merging this pull request may close these issues:

Reading a DL-NIRSP ASDF is very slow