adds codecs that numcodecs defines #2

normanrz · 2025-02-24T15:34:22Z

normanrz · 2025-03-01T20:53:54Z

I validated the schema.json files against the numcodecs fixtures:

# /// script
# dependencies = [ "jsonschema" ]
# ///

import json
from pathlib import Path

from jsonschema import validate

# Fixtures from a local numcodecs checkout.
numcodecs_fixture_path = Path.home() / "numcodecs" / "fixture"

for path in Path("codecs").glob("numcodecs.*/schema.json"):
    # Directory names look like "numcodecs.<name>".
    _, name = path.parent.name.split(".", 1)
    print(name)
    schema = json.loads(path.read_text())
    for fixture_path in (numcodecs_fixture_path / name).glob("**/config.json"):
        print("  ", fixture_path)
        config_json = json.loads(fixture_path.read_text())
        # Drop the "id" key and wrap the remaining keys in the v3 name/configuration form.
        config_json.pop("id", None)
        config_json = {"name": f"numcodecs.{name}", "configuration": config_json}

        validate(instance=config_json, schema=schema)

jbms · 2025-03-05T01:26:00Z

Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?

Also, many of these leave important details of the encoded format unspecified, meaning the actual specification is the numcodecs source code.

I'm not sure if it is intended that names can be registered without a proper specification other than a reference to the source code. But even if it is allowed, surely it should be discouraged and these initial ones should include a proper specification.

normanrz · 2025-03-05T15:04:43Z

> Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?

Well, right now numcodecs uses the `numcodecs.` prefix in the codec names. Also, I am not sure the metadata is 100% identical to what is listed in zarr-specs. That is why they are duplicated.

> Also, many of these leave important details of the encoded format unspecified, meaning the actual specification is the numcodecs source code.
>
> I'm not sure if it is intended that names can be registered without a proper specification other than a reference to the source code. But even if it is allowed, surely it should be discouraged and these initial ones should include a proper specification.

I agree and would welcome contributions. Unfortunately, the numcodecs documentation is also pretty sparse on encoding details. So, for every codec we need to go through the code and write a spec.
Writing a specification is strongly encouraged, but not required. In the interest of time, I wanted to get these specification scaffolds in to reserve the names and leave the spec details for later.

jbms · 2025-03-05T17:13:51Z

> > Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?
>
> Well, right now numcodecs uses the `numcodecs.` prefix in the codec names. Also, I am not sure the metadata is 100% identical to what is listed in zarr-specs. That is why they are duplicated.

I see --- I did not realize that zarr-python had added all of the numcodecs codecs for zarr v3 as numcodecs.xxx.

I imagine it was done to make it very easy for someone using zarr-python to migrate to using zarr v3 -- which is understandable.

However, from an interoperability perspective this is unfortunate: someone using zarr-python with zarr v3 and a numcodecs.XXX codec may not realize that they are producing a zarr array that no other zarr implementation can read, because the codec gets recorded as numcodecs.xxx. That is particularly unfortunate for gzip, blosc, and zstd, which other implementations do support under both zarr v2 and zarr v3. Had the zarr-python user specified the codec in exactly the same way but chosen zarr v2, they would have produced an interoperable array; by choosing zarr v3 they produce a non-interoperable one.

rabernat · 2025-04-10T13:58:48Z

Can we have aliases? Like the same codec has two different names?

Arrow and other projects do that (e.g. arrow utf8 is an alias for string).

normanrz · 2025-04-10T14:00:43Z

> Can we have aliases? Like the same codec has two different names?
>
> Arrow and other projects do that (e.g. arrow utf8 is an alias for string).

I think that would just be another extension that needs to be registered, with some co-references in the readme.

rabernat · 2025-04-11T14:20:41Z

To be clear about my position, I think we should have such aliases or double entries (numcodecs.XXX and XXX) in every case where the codec is a general-purpose, interoperable codec.
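A minimal sketch of how a reader could resolve such double entries. The alias table and the `canonical_name` helper are hypothetical and not part of zarr-python or this registry; only gzip is listed, since it is the case described in this thread where the two forms differ by name alone.

```python
# Hypothetical alias table: prefixed name -> unprefixed registered name.
# Only codecs whose configuration is identical under both names belong here.
CODEC_ALIASES = {
    "numcodecs.gzip": "gzip",
}


def canonical_name(codec_metadata: dict) -> dict:
    """Return a copy of the codec metadata with an aliased name resolved."""
    name = codec_metadata["name"]
    resolved = dict(codec_metadata)  # shallow copy; the input is left untouched
    resolved["name"] = CODEC_ALIASES.get(name, name)
    return resolved


aliased = {"name": "numcodecs.gzip", "configuration": {"level": 5}}
print(canonical_name(aliased))
```

Names that are not in the table pass through unchanged, so codecs without an interoperable counterpart keep their `numcodecs.` prefix.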

normanrz · 2025-04-11T14:30:05Z

That makes sense. I think that can be a follow-up, though, once we have better specifications for the individual codecs.

normanrz · 2025-04-24T15:07:47Z

I think this PR is blocked by zarr-developers/numcodecs#742 (comment). Instead of registering all codecs as numcodecs.*, it would be better to register them individually. However, that would require harmonization in both zarr-python and numcodecs, for example w.r.t. numcodecs.blosc and blosc.

LDeakin · 2025-07-20T11:06:53Z

zarr-developers/numcodecs#767 raised that shuffle is listed in this PR as an array-to-array codec. It should be a bytes-to-bytes codec, as it is currently implemented in zarr-python + zarrs.

cc @d-v-b
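For context, a minimal sketch of what a byte shuffle does, written to show that it needs only a byte string and an element size, which is why it fits the bytes-to-bytes category. This is an illustrative pure-Python version, not the numcodecs implementation; the function names are hypothetical, and it assumes the input length is a multiple of the element size.

```python
def byte_shuffle(data: bytes, elementsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1 of every element, and so on."""
    if len(data) % elementsize != 0:
        raise ValueError("data length must be a multiple of elementsize")
    return b"".join(data[i::elementsize] for i in range(elementsize))


def byte_unshuffle(data: bytes, elementsize: int) -> bytes:
    """Invert byte_shuffle by regrouping the bytes element by element."""
    if len(data) % elementsize != 0:
        raise ValueError("data length must be a multiple of elementsize")
    count = len(data) // elementsize  # number of elements
    return b"".join(data[j::count] for j in range(count))


raw = bytes([2, 1, 4, 3])  # two 2-byte elements
shuffled = byte_shuffle(raw, 2)
assert byte_unshuffle(shuffled, 2) == raw
```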

d-v-b · 2025-07-21T10:40:12Z

codecs/numcodecs.astype/README.md

+- `encode_dtype`
+- `decode_dtype` (optional)


we need to be clear about which data type identifiers are permitted here.

v2 style codecs will use numpy-style dtype identifiers, which are not valid in v3

d-v-b · 2025-07-21T10:41:40Z

codecs/numcodecs.fixedscaleoffset/README.md

+- `dtype`
+- `astype` (optional)


the permitted values for the data types need to be made clear, e.g. that they must be the JSON form of zarr v3 data types

d-v-b · 2025-11-03T16:34:02Z

over in zarr python I have a PR that touches on the implementation of the codecs discussed here. I think the most important thing is to decide how much of the current JSON structure of the zarr.codecs.numcodecs codecs should be treated as read-only aliases for v3-native JSON metadata.

This is most acute for the codecs that have data type identifiers in their configuration. Currently in zarr python codecs like zarr.numcodecs.astype use v2 data type identifiers like '|i1' when saving Zarr V3 metadata. My PR changes this behavior and ensures that astype and related codecs use zarr v3 data type identifiers (like 'int16') when saving Zarr V3 metadata, while treating the old JSON form as a read-only alias.

I think the spec for astype should require that JSON metadata for the codec use zarr v3 data type identifiers, but recommend treating codec metadata with zarr v2 data type identifiers as read-only aliases for the equivalent zarr v3 data type metadata.
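To illustrate the read-only alias idea, here is a minimal sketch that accepts either form on read and always returns the v3 identifier. The function name `v2_to_v3_dtype` and the lookup table are hypothetical and are not taken from zarr-python; only fixed-size boolean, integer, float, and complex types are covered, and the byte-order character is simply dropped.

```python
# Hypothetical lookup: (NumPy kind character, item size in bytes) -> Zarr v3 name.
_V3_NAMES = {
    ("b", 1): "bool",
    ("i", 1): "int8", ("i", 2): "int16", ("i", 4): "int32", ("i", 8): "int64",
    ("u", 1): "uint8", ("u", 2): "uint16", ("u", 4): "uint32", ("u", 8): "uint64",
    ("f", 2): "float16", ("f", 4): "float32", ("f", 8): "float64",
    ("c", 8): "complex64", ("c", 16): "complex128",
}


def v2_to_v3_dtype(identifier: str) -> str:
    """Map a v2 identifier such as '|i1' to a v3 name; pass v3 names through."""
    if identifier in _V3_NAMES.values():
        return identifier  # already in v3 form
    body = identifier.lstrip("<>|=")  # drop the byte-order character
    try:
        return _V3_NAMES[(body[0], int(body[1:]))]
    except (IndexError, ValueError, KeyError):
        raise ValueError(f"no Zarr v3 equivalent known for {identifier!r}") from None


print(v2_to_v3_dtype("|i1"))  # int8
print(v2_to_v3_dtype("<i2"))  # int16
```

A writer following the recommendation above would only ever emit the returned v3 name, never the v2 form it was given.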

There are also codecs in zarr.codecs.numcodecs for which we already have a zarr v3 spec:

- gzip
- zstd
- blosc
- crc32c

In some cases (gzip) the zarr.codecs.numcodecs version differs only in the name used in JSON ('gzip' vs 'numcodecs.gzip'). But in other cases, like blosc and crc32c, the codec configuration is different as well. The zarr.codecs.numcodecs version of blosc uses the v2-style configuration, meaning no typesize parameter and integer values for the shuffle parameter, which is incompatible with the Zarr v3 spec version. Similarly, the zarr.codecs.numcodecs version of crc32c takes a location parameter, but the zarr v3 spec version does not.

The simplest way to handle these is probably separate specs, with a disclaimer that these are not recommended codecs. We could also consider amending the zarr v3 codec specs to describe read-only aliases.
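A rough sketch of what reading the v2-style blosc configuration as an alias could involve. The helper `blosc_v2_to_v3` is hypothetical. The integer-to-string shuffle mapping and the `cname`, `clevel`, and `blocksize` keys reflect my understanding of numcodecs and the v3 blosc spec and should be checked against both. `typesize` has to come from the caller because the v2-style configuration does not carry it.

```python
# Assumed mapping from numcodecs integer shuffle modes to v3 string values.
_SHUFFLE_NAMES = {0: "noshuffle", 1: "shuffle", 2: "bitshuffle"}


def blosc_v2_to_v3(config: dict, typesize: int) -> dict:
    """Build v3-style blosc metadata from a v2-style configuration dict."""
    shuffle = config["shuffle"]
    if shuffle not in _SHUFFLE_NAMES:
        # e.g. -1 (automatic selection) has no direct v3 string equivalent
        raise ValueError(f"cannot translate shuffle value {shuffle!r}")
    return {
        "name": "blosc",
        "configuration": {
            "cname": config["cname"],
            "clevel": config["clevel"],
            "shuffle": _SHUFFLE_NAMES[shuffle],
            "typesize": typesize,  # not present in the v2-style config
            "blocksize": config.get("blocksize", 0),
        },
    }


v2_config = {"cname": "lz4", "clevel": 5, "shuffle": 1, "blocksize": 0}
print(blosc_v2_to_v3(v2_config, typesize=4))
```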

normanrz · 2025-11-04T16:14:18Z

codecs/numcodecs.shuffle/README.md

+
+## Current maintainers
+
+* [zarr-python core development team](https://github.com/orgs/zarr-developers/teams/python-core-devs)


That sounds good to me. I think that would be a decision for the zarr-python team to make.

While we are trying to decouple zarr python from numcodecs, I'm not sure we will end up having an entirely separate numcodecs team.

ok, I'll just mark this as resolved with no need for a separate team

Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>

normanrz added 5 commits February 24, 2025 16:15

add bz2

0a0f41c

gzip

34ac80c

adds numcodecs codecs

1befbd9

schema.json

6cf3d67

fixes schema jsons

73d9610

normanrz mentioned this pull request Mar 18, 2025

refactor v3 data types zarr-developers/zarr-python#2874

Merged

update schema

083d621

normanrz mentioned this pull request Apr 17, 2025

Reserve characters in names #7

Open

LDeakin mentioned this pull request Apr 19, 2025

chore: update codec / data type statuses zarrs/zarrs#179

Merged

normanrz marked this pull request as draft May 6, 2025 17:12

normanrz mentioned this pull request Jun 2, 2025

Define codec for LZW compression #16

Draft

This was referenced Jul 13, 2025

Unexpected zarr numcodecs userwarning when using to_icechunk zarr-developers/VirtualiZarr#447

Open

Discrepancy in numcodecs.shuffle classification as a BytesBytesCodec vs ArrayArrayCodec zarr-developers/numcodecs#767

Closed

d-v-b reviewed Jul 21, 2025

d-v-b mentioned this pull request Jul 21, 2025

codec specification in v3 zarr-developers/zarr-specs#293

Open

maxrjones mentioned this pull request Nov 3, 2025

Add v2 and v3 metadata support to codecs zarr-developers/zarr-python#3332

Open

maxrjones reviewed Nov 3, 2025

Update codecs/numcodecs.shuffle/README.md

bc6a858
