@ikrommyd ikrommyd commented Oct 31, 2025

I noticed that the vector constructor methods do not raise errors consistently with each other when given invalid combinations of arguments, or duplicate arguments that map to the same coordinate.
For example,

vector.obj(pt=1.1, eta=2.2, phi=3.3, mass=4.4, energy=1001)

errors but the following does not.

vector.array(
    {
        "pt": np.random.exponential(5, 10000),
        "phi": np.random.uniform(-np.pi, np.pi, 10000),
        "eta": np.arccos(np.random.uniform(-1, 1, 10000)),
        "mass": np.full(10000, 0.000511),
        "energy": np.full(10000, 0.000511),
    }
)

The same case for the Awkward constructor does raise an error, though:

vector.Array(
    ak.Array(
        {
            "pt": np.random.exponential(5, 10000),
            "phi": np.random.uniform(-np.pi, np.pi, 10000),
            "eta": np.arccos(np.random.uniform(-1, 1, 10000)),
            "mass": np.full(10000, 0.000511),
            "energy": np.full(10000, 0.000511),
        }
    )
)

What I tried to do here is add checks for the NumPy backend in the same way they are done for the Awkward backend.
I also think I found a couple of cases (even for vector.obj) where an error should be raised but isn't.

I also noticed that invalid constructors are not tested, so I added a big test file that tests such cases.

This is my first time even looking at the vector codebase, so help is appreciated.

@ikrommyd ikrommyd changed the title from "Patch" to "feat: improve errors for invalid combinations of arguments in vector constructor methods" Oct 31, 2025
Member

@Saransh-cpp Saransh-cpp left a comment

Good catch, @ikrommyd! Thanks for working on this!!

Please see my comments below:

if "pz" in coordinates:
    is_momentum = True
    generic_coordinates["z"] = coordinates.pop("pz")
if "E" in coordinates:
Member

Should this also have the and "t" not in generic_coordinates condition?

if "energy" in coordinates and "t" not in generic_coordinates:
    is_momentum = True
    generic_coordinates["t"] = coordinates.pop("energy")
if "M" in coordinates:
Member

Ditto, for tau
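
For readers following the review, here is a minimal sketch of why the extra guard matters. The function name `to_generic` and the final leftover check are illustrative assumptions rather than the vector source, and only the `"E"`/`"energy"` aliases of `t` are modelled.

```python
def to_generic(**coordinates):
    """Map the energy aliases onto the generic "t" slot (illustrative only)."""
    generic = {}
    for alias in ("E", "energy"):
        # The guard: an alias may claim "t" only if nothing has claimed it yet.
        if alias in coordinates and "t" not in generic:
            generic["t"] = coordinates.pop(alias)
    # Anything still left was not consumed: an unknown name or a second alias.
    if coordinates:
        raise TypeError(
            "unrecognized or duplicate coordinates: "
            + ", ".join(sorted(coordinates))
        )
    return generic


single = to_generic(E=1.0)
```

With the guard, `to_generic(E=1.0, energy=2.0)` leaves `"energy"` unconsumed and raises `TypeError`. Without it, the second alias would overwrite the first value of `t` and no error would be reported.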

Comment on lines +2303 to +2304
# Validate coordinates using dimension-guard pattern (same as awkward _check_names)
_validate_numpy_coordinates(names)
Member

vector.array is just a wrapper around the individual Vector/MomentumNumpy*D classes, which can themselves be used directly to construct vectors (unlike in the Awkward backend). Hence, it would be better to move this check into the __array_finalize__ method of each class:

def __array_finalize__(self, obj: typing.Any) -> None:
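
A rough sketch of what a check inside `__array_finalize__` could look like, assuming a simplified 2D alias table. The class name `CheckedVector2D` and the `GENERIC` mapping are hypothetical and are not the classes in vector. The sketch also does not check that the remaining names form a valid coordinate system.

```python
import numpy as np

# Assumed alias table for a 2D vector: field name -> generic coordinate.
GENERIC = {
    "x": "x", "px": "x",
    "y": "y", "py": "y",
    "rho": "rho", "pt": "rho",
    "phi": "phi",
}


class CheckedVector2D(np.ndarray):
    """Hypothetical ndarray subclass that validates its field names."""

    def __array_finalize__(self, obj):
        names = self.dtype.names
        if names is None:
            # Reached e.g. when a single field is pulled out with arr["pt"]:
            # the result is unstructured, so there is nothing to validate.
            return
        seen = {}
        for name in names:
            generic = GENERIC.get(name)
            if generic is None:
                raise TypeError(f"unrecognized coordinate: {name}")
            if generic in seen:
                raise TypeError(
                    "duplicate coordinates (through momentum-aliases): "
                    f"{seen[generic]}, {name}"
                )
            seen[generic] = name


# View casting triggers __array_finalize__, so the check runs here.
good = np.zeros(3, dtype=[("pt", float), ("phi", float)]).view(CheckedVector2D)
```

Viewing a structured array that has both `pt` and `rho` fields as `CheckedVector2D` raises `TypeError`. Note that NumPy also calls the hook for slices and field access, which is why the sketch returns early when the dtype has no field names.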

Author

I vaguely remember having some issue with when __array_finalize__ gets run while I was making these edits, but I will take another look.

Member

Given that these tests are related to constructing vectors, we should move them to the appropriate test backend files. The object ones can go in:

def test_constructors_2D():

The NumPy ones (we don't have many constructor tests for NumPy at the moment, so this is perfect 😅):

def test_xy():

The Awkward ones (same as NumPy):

def test_basic(backend):

Author

Sure. I just thought that maybe it should be a new test because it's like 1K lines but yeah I can move them under their respective backend tests.

Member

Ah, I see, it might make sense to create a new function under test_issues.py for this then 🤔

@ikrommyd
Author

ikrommyd commented Nov 1, 2025

Thanks @Saransh-cpp for the feedback. I'll tackle it as soon as I find a time slot.

There is one more case that I'd like to solve, which was the original inspiration for this: when people do ak.zip and use with_name. I don't see at all how we can solve this easily, though. For example:

In [12]: x = ak.zip({"pt": events.Electron.pt, "eta": events.Electron.eta, "phi": events.Electron.phi, "mass": events.Electron.mass, "energy": -999, "rho": 1001}, with_name="Momentum4D")

In [13]: x.pt
Out[13]: <Array [[1001], [], [], ..., [1001, 1001], [1001]] type='55342 * var * int64'>

In [14]: x.mass
Out[14]: <Array [[-3.98e+03], [], ..., [...], [-1.53e+03]] type='55342 * var * float64'>

People can assign nonsense fields together, like rho and energy here, without knowing that rho is an alias for pt, for example, and that entirely messes up the 4-vector math (silently).
Or people can do jets["rho"] = something. But I really don't know how to prevent this. Awkward just blindly assigns a name without caring at all what that implies and without knowing at all what behaviors it carries over.
Even worse things happen when the data comes straight from the ROOT file. In CMS, we have a flavor of NanoAOD that has pt, eta, phi, mass and energy. But "energy" is not the "correct" energy you'd get from the pt-eta-phi-mass 4-vector. It's a different energy measurement. And when you define a 4-vector with those 5 using with_name, that's bad.
But I really don't know if it's even possible to solve this, because it's Awkward that assigns the names and behaviors without any knowledge of them.
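
To make the over-constraint concrete, here is a small sketch using plain kinematics (it is not vector's code, and the function names are illustrative). The energy implied by pt, eta and mass is E = sqrt(m^2 + (pt cosh(eta))^2), so a separately stored "energy" field is only harmless if it agrees with that value.

```python
import math


def energy_from_pt_eta_mass(pt, eta, mass):
    """Energy implied by (pt, eta, mass): |p| = pt * cosh(eta)."""
    return math.sqrt(mass**2 + (pt * math.cosh(eta)) ** 2)


def is_consistent(pt, eta, mass, energy, rel_tol=1e-6):
    """True if a stored energy matches the one implied by pt, eta and mass."""
    implied = energy_from_pt_eta_mass(pt, eta, mass)
    return math.isclose(implied, energy, rel_tol=rel_tol)


implied = energy_from_pt_eta_mass(3.0, 0.0, 4.0)
```

A record carrying pt, eta, phi, mass and an independent energy measurement would fail `is_consistent` in general, which is exactly the situation where the 4-vector math changes depending on which fields a behavior happens to read.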

@Saransh-cpp
Member

I think @pfackeldey might be able to help with this - can we add constructor checks to our custom Awkward behaviors?

@pfackeldey
Collaborator

> I think @pfackeldey might be able to help with this - can we add constructor checks to our custom Awkward behaviors?

I'm not sure about the following:
I think you can't really change the way __init__ works through behaviors, because awkward updates its base class after the constructor has been run.

If my understanding here is wrong, you can do the following:

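import awkward as ak

class MyBehavior(ak.Array):  # base class assumed; needed for super().__init__ to accept arguments
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # check that the record has exactly the fields this behavior expects
        if set(self._layout.fields) != {"pt", "eta", "phi", "mass"}:
            raise ValueError("got different fields than this behavior expected")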

However, @ikrommyd and I talked about this: these constructor checks are indeed helpful, but they do not prevent the behaviors from being broken. What would prevent this thoroughly instead is:

  1. Awkward Arrays are fully immutable (you can't even set/update/delete fields)
  2. we disallow getattr to retrieve fields so it's clear to the user if they invoke a property or access a field
  3. vector constructors are only allowed to have access to certain fixed fields

(1.) and (3.) solve this issue entirely, I think; (2.) is an addition to make people aware of when they invoke the mass property and when they access the mass field.

This is of course a major break in the API, so I'm not sure if we can/want to do this. I'm adding it here to write down the fix that solves this issue thoroughly.
There could be ways to enforce this immutability with respect to fields per array instance. Then we could preserve the current functionality in general awkward, while vector could enforce true immutability, which does not allow updating/adding/deleting fields on arrays with vector behaviors.
About (2.), I don't think there's really a way to solve this except for an API breakage in awkward. But this is not the issue; it's just hard to understand when one accesses a property and when a field through dot-access.
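
As a toy illustration of the three points above (fixed fields, no mutation, and fields reachable only through item access), here is a pure-Python sketch. The class `FrozenMomentum` is hypothetical and has nothing to do with how Awkward Arrays are implemented; it only shows the intended user-facing behavior.

```python
import math
from types import MappingProxyType


class FrozenMomentum:
    """Toy record: fixed fields, no mutation, fields only via item access."""

    _allowed = frozenset({"pt", "eta", "phi", "mass"})

    def __init__(self, **fields):
        # (3.) only a fixed set of fields is accepted at construction time.
        if set(fields) != self._allowed:
            raise TypeError(
                f"expected exactly {sorted(self._allowed)}, got {sorted(fields)}"
            )
        # Bypass our own __setattr__ once to store a read-only mapping.
        object.__setattr__(self, "_fields", MappingProxyType(dict(fields)))

    def __getitem__(self, name):
        # (2.) fields are reached with v["pt"], never with v.pt.
        return self._fields[name]

    def __setitem__(self, name, value):
        # (1.) no field can be added or updated after construction.
        raise TypeError("fields are immutable")

    def __setattr__(self, name, value):
        raise AttributeError("fields are immutable")

    @property
    def energy(self):
        # A derived property, clearly distinct from any stored field.
        return math.sqrt(self["mass"] ** 2 + (self["pt"] * math.cosh(self["eta"])) ** 2)


v = FrozenMomentum(pt=3.0, eta=0.0, phi=0.0, mass=4.0)
```

In this sketch `v["rho"] = 1001` raises `TypeError`, constructing with an extra `rho` or `energy` keyword raises `TypeError`, and `v.energy` is always the derived value because there is no stored field it could be confused with.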

I could add a way in awkward-array to make arrays truly immutable, which you could then opt into through behaviors in vector. And in theory I could also think of a way to make post-init constructor checks in awkward arrays possible for behaviors.

If you want, @ikrommyd & @Saransh-cpp, I can prepare a PR and you can have a look? Or what do you think?

@Saransh-cpp
Member

Sounds good to me, @pfackeldey! I'll be happy to review the PR in awkward and subsequently propagate the new immutability mechanism to vector 🙂

@eduardo-rodrigues
Member

Hi all. The subject sounded interesting/perplexing hence I had a read.

Indeed good catches, @ikrommyd 👍.

Now, there are 2 different matters being mixed together, as it were, it seems to me:

The real problem of one being able to create a vector with some nonsensical combination is the main thing (to be) addressed in this PR, which is important to get fixed while providing relevant error messages to users so that they can see the issues in the way they wrote things. All good here, and I won't comment on the details as @Saransh-cpp knows them infinitely better than me.

The other issue mentioned relates to constructions such as x = ak.zip({"pt": events.Electron.pt, "eta": events.Electron.eta, "phi": events.Electron.phi, "mass": events.Electron.mass, "energy": -999, "rho": 1001}, with_name="Momentum4D"), which relates to vector only by the time the actual 4-vector is put together behind the scenes. At construction level a user could in principle assign events.Electron.pt to px, events.Electron.eta to phi, etc., if they had some crazy idea for renaming variables. It is only the with_name that will impose constraints on what variables to assign to what. Admittedly, in the case you provide, @ikrommyd, it seems to me that the issue should be trivial to catch, because 5 variables to make a 4-vector over-constrains the system and should not be allowed (this comment also holds for the "real problem" just above). Or am I reading too quickly and misunderstanding things? In any case, I don't think you can in general leverage 4-vector knowledge from the Vector package at the level of ak.zip, since the latter is generic, unless you explicitly make the awkward-vector connection with a with_name parameter that is a Vector class. Otherwise you implicitly make an awkward coupling with subtle behaviours, I feel.

@ikrommyd
Author

ikrommyd commented Nov 3, 2025

Hi @eduardo-rodrigues, it's not trivial to catch when you do ak.zip, though, and don't use any of the vector constructor methods, because ak.zip does not know what Momentum4D means. It does not know or care what behaviors it carries over. More than that, we could hard-code the vector behavior checks explicitly in awkward-array, sure, but that's not the full story. You can subclass and define your own behaviors (coffea does that a lot). So then you can do with_name="PtEtaPhiMCandidate", which is something that coffea defines. So what behaviors do we check? We can't hard-code all the user-defined stuff. And all with_name does is set the __record__ parameter of the array's layout to a string. It is completely blind as to what this implies.

@eduardo-rodrigues
Member

You are emphasising my comments in some way, meaning that it is hard and/or dangerous to couple the 2 packages, but if you have no knowledge, then you can easily get into trouble. That's why I mentioned the possibility of having an explicit coupling using a class rather than a string, so with_name=vector.MomentumObject4D rather than with_name="Momentum4D". At least then somebody subclassing would be on their own, taking responsibility, and for other standard use cases, there would be some checks implemented. But I'm still likely being naive here and am showing how little I know about the internals of awkward, sorry if this is not helpful.
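
The difference between a string name and a class can be sketched in plain Python. Everything here is a toy model: `zip_with_name`, `Momentum4DSpec` and its `validate_fields` hook are hypothetical names and do not exist in awkward or vector.

```python
def zip_with_name(fields, with_name):
    """Toy stand-in for attaching a record name (this is not ak.zip)."""
    if isinstance(with_name, str):
        # A bare string carries no knowledge about which fields make sense.
        return {"fields": dict(fields), "name": with_name}
    # A class can bring its own validation hook along.
    with_name.validate_fields(fields)
    return {"fields": dict(fields), "name": with_name.__name__}


class Momentum4DSpec:
    """Hypothetical spec: at most one field per coordinate slot."""

    slots = {
        "azimuthal radius": {"pt", "rho"},
        "temporal": {"mass", "energy"},
    }

    @classmethod
    def validate_fields(cls, fields):
        for slot, names in cls.slots.items():
            present = sorted(names & set(fields))
            if len(present) > 1:
                raise TypeError(f"{slot} given more than once: {', '.join(present)}")


nonsense = {"pt": 1.0, "eta": 0.0, "phi": 0.0, "mass": 0.1, "energy": -999, "rho": 1001}
unchecked = zip_with_name(nonsense, "Momentum4D")
```

With the string, the nonsensical record is accepted silently. Calling `zip_with_name(nonsense, Momentum4DSpec)` raises `TypeError` instead, and a user-defined subclass would be free to override `validate_fields` and take responsibility for its own rules.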

@ikrommyd
Author

ikrommyd commented Nov 3, 2025

Did a full write-up here of the issue: #660
This PR solves part of it. For further discussion, I suggest we move there to not make the PR discussion messy.

Raises TypeError if duplicate or conflicting coordinates are detected.
"""
complaint1 = "duplicate coordinates (through momentum-aliases): " + ", ".join(
Member

It's a picky comment, but it's a bit sub-optimal that we duplicate these "complaint strings" in several submodules. For better maintainability it's probably best to move them to a trivial submodule and import the strings in the several places, such as here but also in awkward_constructions.py, for example.
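
One possible shape for such a shared module, as a sketch only. The module name `_complaints.py`, the constant names and the helper are assumptions. The first message is taken from the diff above; the second is a made-up placeholder.

```python
# Contents of a hypothetical shared module, e.g. _complaints.py

# Message prefix taken from the diff under review.
DUPLICATE_ALIASES = "duplicate coordinates (through momentum-aliases): "

# Placeholder for a second shared message; the wording is an assumption.
UNRECOGNIZED = "unrecognized combination of coordinates: "


def duplicate_aliases(names):
    """Build the full duplicate-alias complaint for the given field names."""
    return DUPLICATE_ALIASES + ", ".join(names)


message = duplicate_aliases(["pt", "rho"])
```

Each backend would then import `duplicate_aliases` (or the constants) from the one module instead of repeating the literal strings, so a wording change only has to be made once and the tests can match against the same constants.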

@pfackeldey
Collaborator

Linking my proposed solution here as well: #660 (comment)
