@normanrz
Member

normanrz commented Feb 24, 2025

  • Blosc
  • LZ4
  • Zstd
  • Zlib
  • GZip
  • BZ2
  • LZMA
  • Shuffle
  • CRC32
  • CRC32C
  • Adler32
  • Fletcher32
  • JenkinsLookup3
  • PCodec
  • ZFPY

@normanrz
Member Author

normanrz commented Mar 1, 2025

I validated the `schema.json` files against the numcodecs fixtures:

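# /// script
# dependencies = [ "jsonschema" ]
# ///

import json
from pathlib import Path

from jsonschema import validate

# Assumes a numcodecs checkout in the home directory, with one fixture folder per codec.
numcodecs_fixture_path = Path.home() / "numcodecs" / "fixture"

for path in Path("codecs").glob("numcodecs.*/schema.json"):
    # Directory names look like "numcodecs.<name>"; keep the part after the prefix.
    _, name = path.parent.name.split(".", 1)
    print(name)
    # Load each schema once rather than once per fixture.
    schema = json.loads(path.read_text())
    for fixture_path in (numcodecs_fixture_path / name).glob("**/config.json"):
        print("  ", fixture_path)
        config_json = json.loads(fixture_path.read_text())
        # Drop the v2-style "id" key and rewrap the rest as v3 codec metadata.
        config_json.pop("id", None)
        config_json = {"name": f"numcodecs.{name}", "configuration": config_json}

        # Raises jsonschema.ValidationError if the fixture does not match the schema.
        validate(instance=config_json, schema=schema)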

@jbms
Contributor

jbms commented Mar 5, 2025

Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?

Also, many of these leave important details of the encoded format unspecified, meaning the actual specification is the numcodecs source code.

I'm not sure if it is intended that names can be registered without a proper specification other than a reference to the source code. But even if it is allowed, surely it should be discouraged and these initial ones should include a proper specification.

@normanrz
Member Author

normanrz commented Mar 5, 2025

> Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?

Well, right now numcodecs uses the numcodecs. prefix in the codec names. Also, I am not sure the metadata is 100% equal to the ones listed in zarr-specs. That is why they are duplicated.

> Also, many of these leave important details of the encoded format unspecified, meaning the actual specification is the numcodecs source code.
>
> I'm not sure if it is intended that names can be registered without a proper specification other than a reference to the source code. But even if it is allowed, surely it should be discouraged and these initial ones should include a proper specification.

I agree and would welcome contributions. Unfortunately, the numcodecs documentation is also pretty sparse on encoding details. So, for every codec we need to go through the code and write a spec.
Writing a specification is strongly encouraged but not required. In the interest of time, I wanted to get these specification scaffolds in to reserve the names and leave the spec details for later.

@jbms
Contributor

jbms commented Mar 5, 2025

> Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?
>
> Well, right now numcodecs uses the numcodecs. prefix in the codec names. Also, I am not sure the metadata is 100% equal to the ones listed in zarr-specs. That is why they are duplicated.

I see --- I did not realize that zarr-python had added all of the numcodecs codecs for zarr v3 as numcodecs.xxx.

I imagine it was done to make it very easy for someone using zarr-python to migrate to using zarr v3 -- which is understandable.

However, from an interoperability perspective this is kind of unfortunate: someone using zarr-python with zarr v3 and a numcodecs.XXX codec may not realize that they are producing a zarr array that is not interoperable with any other zarr implementation, because the codec gets recorded as numcodecs.xxx. That is particularly unfortunate for codecs like gzip, blosc, or zstd, which other implementations do support with both zarr v2 and zarr v3. Had the zarr-python user specified the codec in exactly the same way but with zarr v2, they would have produced an interoperable array; by choosing zarr v3 they produce a non-interoperable one.

@rabernat

Can we have aliases? Like the same codec has two different names?

Arrow and other projects do that (e.g. arrow utf8 is an alias for string).

@normanrz
Member Author

> Can we have aliases? Like the same codec has two different names?
>
> Arrow and other projects do that (e.g. arrow utf8 is an alias for string).

I think that would just be another extension that needs to be registered, with some co-references in the readme.

@rabernat

To be clear about my position, I think we should have such aliases or double entries (numcodecs.XXX and XXX) in every case where the codec is a general purpose interoperable codec.
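
As a rough illustration of the alias idea, a reader could map an aliased codec name to a canonical one when it parses codec metadata. This is only a sketch: the `CODEC_ALIASES` table and the `resolve_codec_name` helper are hypothetical and not part of zarr-python, numcodecs, or the extension registry, and the single `numcodecs.gzip` entry assumes that both names share one configuration.

```python
# Hypothetical alias table: alternative codec name -> canonical codec name.
CODEC_ALIASES = {
    "numcodecs.gzip": "gzip",
}


def resolve_codec_name(codec: dict) -> dict:
    """Return a copy of the codec metadata with an aliased name made canonical."""
    name = codec["name"]
    # Names without an alias entry pass through unchanged.
    return {**codec, "name": CODEC_ALIASES.get(name, name)}


aliased = {"name": "numcodecs.gzip", "configuration": {"level": 5}}
print(resolve_codec_name(aliased))
```

A lookup like this only works when the two names really describe the same configuration, which is why each alias would still need its own review.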

@normanrz
Member Author

That makes sense. I think that can be followup, though, once we have better specifications for the individual codecs.

@normanrz
Member Author

I think this PR is blocked by zarr-developers/numcodecs#742 (comment). Instead of registering all codecs as numcodecs.* it would be better to register them individually. However, that would require harmonization in both zarr-python and numcodecs, for example with respect to numcodecs.blosc and blosc.

@LDeakin
Member

LDeakin commented Jul 20, 2025

zarr-developers/numcodecs#767 raised that shuffle is listed in this PR as an array-to-array codec. It should be a bytes-to-bytes codec, as it is currently implemented in zarr-python + zarrs.

cc @d-v-b
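
To see why shuffle fits the bytes-to-bytes category, here is a minimal pure-Python sketch of a byte shuffle. It is not the numcodecs implementation; the function names are illustrative, and rejecting buffers whose length is not a multiple of `elementsize` is an assumption of this sketch. The point is that both input and output are plain byte strings, and the element size is the only thing the transform needs to know.

```python
def shuffle(buf: bytes, elementsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1 of every element, and so on."""
    if len(buf) % elementsize != 0:
        raise ValueError("buffer length must be a multiple of elementsize")
    # buf[k::elementsize] selects byte k of each element.
    return b"".join(buf[k::elementsize] for k in range(elementsize))


def unshuffle(buf: bytes, elementsize: int) -> bytes:
    """Invert shuffle() by interleaving the byte planes again."""
    if len(buf) % elementsize != 0:
        raise ValueError("buffer length must be a multiple of elementsize")
    count = len(buf) // elementsize
    # buf[i::count] selects the bytes that belong to element i.
    return b"".join(buf[i::count] for i in range(count))


data = bytes(range(12))
assert unshuffle(shuffle(data, 4), 4) == data
```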

Comment on lines +13 to +14
- `encode_dtype`
- `decode_dtype` (optional)
Contributor

we need to be clear about which data type identifiers are permitted here.

Contributor

v2 style codecs will use numpy-style dtype identifiers, which are not valid in v3

Comment on lines +15 to +16
- `dtype`
- `astype` (optional)
Contributor

The permitted values for the data types need to be made clear, e.g. that they must be the JSON form of zarr v3 data types.

@d-v-b
Contributor

d-v-b commented Nov 3, 2025

over in zarr python I have a PR that touches on the implementation of the codecs discussed here. I think the most important thing is to decide how much of the current JSON structure of the zarr.codecs.numcodecs codecs should be treated as read-only aliases for v3-native JSON metadata.

This is most acute for the codecs that have data type identifiers in their configuration. Currently, zarr-python codecs like zarr.codecs.numcodecs.astype use v2 data type identifiers like '|i1' when saving Zarr V3 metadata. My PR changes this behavior and ensures that astype and related codecs use zarr v3 data type identifiers (like 'int16') when saving Zarr V3 metadata, while treating the old JSON form as a read-only alias.

I think the spec for astype should require that JSON metadata for the codec use zarr v3 data type identifiers, but recommend treating codec metadata with zarr v2 data type identifiers as read-only aliases for the equivalent zarr v3 data type metadata.
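
A read-only alias of this kind could be sketched as a normalization step applied when metadata is loaded, with only the v3 form ever written back. The helper below is hypothetical (it is not the code in the zarr-python PR) and covers only the fixed-size boolean, integer, float, and complex types; it drops the byte-order character because zarr v3 data type names do not carry endianness.

```python
import re

# NumPy-style identifiers such as "|i1", "<i2", ">f8": optional byte order, kind, byte size.
_V2_DTYPE = re.compile(r"^[<>|=]?([biufc])(\d+)$")
_KIND_PREFIX = {"i": "int", "u": "uint", "f": "float", "c": "complex"}
_V3_NAMES = {
    "bool", "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64", "complex64", "complex128",
}


def to_v3_dtype(identifier: str) -> str:
    """Accept a v2 or v3 data type identifier and return the v3 name."""
    if identifier in _V3_NAMES:
        return identifier
    match = _V2_DTYPE.match(identifier)
    if match is None:
        raise ValueError(f"unsupported data type identifier: {identifier!r}")
    kind, size = match.group(1), int(match.group(2))
    if (kind, size) == ("b", 1):
        name = "bool"
    else:
        # The v2 size is in bytes; v3 names use bits.
        name = f"{_KIND_PREFIX.get(kind, kind)}{size * 8}"
    if name not in _V3_NAMES:
        raise ValueError(f"unsupported data type identifier: {identifier!r}")
    return name


print(to_v3_dtype("|i1"), to_v3_dtype("<i2"), to_v3_dtype("int16"))
```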

There are also codecs in zarr.codecs.numcodecs for which we already have a zarr v3 spec:

  • gzip
  • zstd
  • blosc
  • crc32c

In some cases (gzip) the zarr.codecs.numcodecs version only differs by the name used in JSON ('gzip' vs 'numcodecs.gzip'). But in other cases, like blosc and crc32c, the codec configuration is different as well. The zarr.codecs.numcodecs version of blosc uses the v2-style configuration, meaning no typesize parameter and ints for the shuffle parameter, which is incompatible with the Zarr v3 spec version. Similarly, the zarr.codecs.numcodecs version of crc32c takes a location parameter, but the zarr v3 spec version does not.

The simplest way to handle these is probably separate specs, with a disclaimer that these are not recommended codecs. We could also consider amending the zarr v3 codec specs to describe read-only aliases.
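
To make the blosc difference concrete, the following sketch converts a v2-style blosc configuration into the v3 shape. The `blosc_v2_to_v3` helper is hypothetical and not part of zarr-python or numcodecs. It assumes the caller supplies `typesize` from the array's data type, since the v2-style configuration does not record it, and it deliberately rejects the numcodecs autoshuffle value (`-1`) rather than guess how to translate it.

```python
# numcodecs integer shuffle values mapped to the strings used by the Zarr v3 blosc codec.
_SHUFFLE_NAMES = {0: "noshuffle", 1: "shuffle", 2: "bitshuffle"}


def blosc_v2_to_v3(config: dict, typesize: int) -> dict:
    """Build v3 blosc codec metadata from a v2-style numcodecs configuration."""
    shuffle = config["shuffle"]
    if shuffle not in _SHUFFLE_NAMES:
        # Covers autoshuffle (-1), which this sketch does not translate.
        raise ValueError(f"unsupported shuffle value: {shuffle!r}")
    return {
        "name": "blosc",
        "configuration": {
            "cname": config["cname"],
            "clevel": config["clevel"],
            "shuffle": _SHUFFLE_NAMES[shuffle],
            "typesize": typesize,  # supplied by the caller, not present in v2 metadata
            "blocksize": config["blocksize"],
        },
    }


v2_config = {"id": "blosc", "cname": "zstd", "clevel": 5, "shuffle": 1, "blocksize": 0}
print(blosc_v2_to_v3(v2_config, typesize=4))
```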


## Current maintainers

* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)

This comment was marked as resolved.

Member Author

That sounds good to me. I think that would be a decision for the zarr-python team to make.

Contributor

While we are trying to decouple zarr python from numcodecs, I'm not sure we will end up having an entirely separate numcodecs team.

Member

ok, I'll just mark this as resolved with no need for a separate team

Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
7 participants