
Asdf read speed #514


Open

wants to merge 90 commits from SolarDrew:asdf-read-speed into main

Conversation

SolarDrew (Contributor)

Fixes #500

Cadair (Member) commented Feb 4, 2025

I just did a quick experiment locally: if we convert the Table to a numpy structured array before we save it (Table.as_array()), then asdf will automatically write only one binary block containing the full data, and will save the slices in the tree as references into that block.

Full code
import asdf
import dkist
from dkist.data.sample import VBI_AJQWW

# Load the sample dataset and grab its combined header table
tds = dkist.load_dataset(VBI_AJQWW)
whole_table = tds.combined_headers

small1 = whole_table[0:10]
small2 = whole_table[10:20]

# Writing Table slices directly...
new_tree = {"whole": whole_table, "small1": small1, "small2": small2}
with asdf.AsdfFile(tree=new_tree) as af:
    af.write_to("test.asdf")
# <duplicates the data>

# Writing slices of the structured array instead...
whole_array = whole_table.as_array()
array_tree = {"whole": whole_array, "small1": whole_array[0:10], "small2": whole_array[10:20]}
with asdf.AsdfFile(tree=array_tree) as af:
    af.write_to("array.asdf")
# <does not duplicate the data>

For a small example:

whole_table2 = whole_table[["INSTRUME", "DATE-AVG"]]
whole_array2 = whole_table2.as_array()
array_tree2 = {"whole": whole_array2, "small1": whole_array2[0:10], "small2": whole_array2[10:20]}
with asdf.AsdfFile(tree=array_tree2) as af:
    af.write_to("array2.asdf")

yields this asdf:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.5.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    manifest_software: !core/software-1.0.0 {name: asdf_standard, version: 1.1.1}
    software: !core/software-1.0.0 {name: asdf, version: 3.5.0}
small1: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
small2: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [10]
  offset: 1160
whole: !core/ndarray-1.0.0
  source: 0
  datatype:
  - byteorder: little
    datatype: [ucs4, 3]
    name: INSTRUME
  - byteorder: little
    datatype: [ucs4, 26]
    name: DATE-AVG
  byteorder: big
  shape: [27]
...
[BINARY BLOCK]
%YAML 1.1
---
- 1231
...

Notice the source: 0 for all three arrays, and the offset: 1160 for small2.
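As a sanity check on those numbers, the offset can be reproduced from the datatypes in the tree: small2 starts 10 rows into the shared binary block. A quick sketch, taking the field widths from the YAML above:

```python
import numpy as np

# Reconstruct the row dtype from the YAML tree: two little-endian
# UCS-4 string fields, 3 and 26 characters wide respectively.
row_dtype = np.dtype([("INSTRUME", "<U3"), ("DATE-AVG", "<U26")])

# Each row occupies (3 + 26) characters * 4 bytes per UCS-4 character.
bytes_per_row = row_dtype.itemsize
print(bytes_per_row)       # 116
print(10 * bytes_per_row)  # 1160, i.e. small2 begins 10 rows into the block
```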


I think this might be a good idea.

My main worry is that it severely limits how rich we can make the metadata table, i.e. #265 becomes something custom we have to glue on the side rather than being able to use built-in features of astropy Table.

However I think this approach has many advantages:

  • Obvious performance improvements for single tables (probably), but especially for what you are doing on this PR.
  • We can still convert back up to a Table, either in the converter or in Dataset itself, without copying the memory.
  • This is almost certainly more portable to other languages, as the ndarray tag and schema are in the core spec. It would be worth testing to see what happens in IDL.
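The view semantics behind the second bullet can be sketched with plain numpy (the array below is a stand-in for what Table.as_array() would return; the shape and field names are illustrative):

```python
import numpy as np

# Stand-in for the structured array that Table.as_array() would return
arr = np.zeros(27, dtype=[("INSTRUME", "<U3"), ("DATE-AVG", "<U26")])

# Slicing a structured array yields a view, not a copy, which is why
# asdf can serialise each slice as an offset into one shared block.
small1 = arr[0:10]
print(np.shares_memory(small1, arr))  # True

# Going back up to a Table without a copy would then be
# astropy.table.Table(arr, copy=False), whose columns view this same buffer.
```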

codspeed-hq (bot) commented Apr 8, 2025

CodSpeed Performance Report

Merging #514 will improve performance by 13.82%

Comparing SolarDrew:asdf-read-speed (a5282c9) with main (731e7f0)

Summary

⚡ 2 improvements
✅ 10 untouched benchmarks

Benchmarks breakdown

Benchmark                               BASE     HEAD     Change
test_tileddataset_repr[simple-masked]   1.9 ms   1.7 ms   +12.14%
test_tileddataset_repr[simple-nomask]   2 ms     1.7 ms   +13.82%

Cadair (Member) left a comment

Closer than we've ever been.

SolarDrew (Contributor, Author)

pre-commit.ci autofix

SolarDrew (Contributor, Author)

If I remember rightly, the current notebooks failure is a known sunpy issue, not a problem with this PR. That means this is finally ready to go, pending final review.

Labels: Run downstream CI (runs the downstream CI workflow on a PR)
Successfully merging this pull request may close these issues:

Reading a DL-NIRSP ASDF is very slow