Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: validate atoms modulo leading and trailing whitespace #6012

Merged

Conversation

david-christiansen
Copy link
Contributor

@david-christiansen david-christiansen commented Nov 8, 2024

This PR improves the validation of new syntactic tokens. Previously, the validation code had inconsistencies: some atoms would be accepted only if they had a leading space as a pretty printer hint. Additionally, atoms with internal whitespace are no longer allowed.

Closes #6011

@david-christiansen david-christiansen added the changelog-language Language features, tactics, and metaprograms label Nov 8, 2024
@github-actions github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Nov 8, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Nov 8, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Nov 8, 2024
@leanprover-community-bot leanprover-community-bot added the breaks-mathlib This is not necessarily a blocker for merging: but there needs to be a plan label Nov 8, 2024
@david-christiansen
Copy link
Contributor Author

The failing test defines this operator:

/-- `f '' s` denotes the image of `s : Set α` under the function `f : α → β`. -/
infixl:80 " '' " => image

Should this be allowed? I would think not, because deleting the leading space makes current versions of Lean reject it, and I would expect that pretty printing instructions were orthogonal to token rules. On that assumption, I'll update the test.

@david-christiansen
Copy link
Contributor Author

This notation is also used in Mathlib.

leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Nov 8, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Nov 8, 2024
@digama0
Copy link
Collaborator

digama0 commented Nov 8, 2024

Should this be allowed? I would think not, because deleting the leading space makes current versions of Lean reject it, and I would expect that pretty printing instructions were orthogonal to token rules. On that assumption, I'll update the test.

Why wouldn't this be allowed? Whitespace aside, it seems like a perfectly reasonable infix declaration. (Well, I do recall there was a hack added specifically to allow '' to be a token rather than an empty character literal, but besides that it's a regular token.)

@david-christiansen
Copy link
Contributor Author

If it should be allowed, then shouldn't it be allowed whether or not the pretty-printer is instructed to insert whitespace before it?

Today, "''" is rejected as an atom, but " '' " is accepted. I think that either both or neither should be accepted here. Does that make sense?

@digama0
Copy link
Collaborator

digama0 commented Nov 10, 2024

I agree. They should both be accepted.

@digama0
Copy link
Collaborator

digama0 commented Nov 10, 2024

The exception for empty character literals was added in #1931 . I think we need a similar exception at Lean.Elab.Term.toParserDescr.isValidAtom.

@Kha
Copy link
Member

Kha commented Nov 13, 2024

The exception for empty character literals was added in #1931 . I think we need a similar exception at Lean.Elab.Term.toParserDescr.isValidAtom.

agreed

@david-christiansen
Copy link
Contributor Author

Great, '' is now allowed with or without whitespace as of b2a0735.

While I'm at it, we presently allow atoms with internal spaces. Should we?

This works, and I'm not sure it should:

infix:70 " <I like apples, they are delicious> " => Nat.add
#eval 3 <I like apples, they are delicious> 4

@david-christiansen
Copy link
Contributor Author

Just to see if it breaks Mathlib, I gave it a go

leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Nov 13, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Nov 13, 2024
@david-christiansen
Copy link
Contributor Author

It seems that open private is a single atom.

@david-christiansen
Copy link
Contributor Author

leanprover-community/batteries#1050 makes batteries build for me with this. I think we're OK.

@david-christiansen david-christiansen added the will-merge-soon …unless someone speaks up label Nov 14, 2024
@digama0
Copy link
Collaborator

digama0 commented Nov 14, 2024

While I think batteries#1050 is reasonable, I don't see a particular reason to ban tokens including whitespace? That is a thing you might want to do, it gives additional flexibility to DSL writers.

@david-christiansen david-christiansen added this pull request to the merge queue Nov 14, 2024
Merged via the queue into leanprover:master with commit 8e1ddbc Nov 14, 2024
20 checks passed
@david-christiansen david-christiansen deleted the atom-validation-ws branch November 14, 2024 16:16
@david-christiansen
Copy link
Contributor Author

This was a balance of footgun vs flexibility. A consequence of the way open private was specified in Batteries was that open private or open /- Extracting the secret treasures! -/ private was banned, which could be surprising. This is also an easy mistake to make - it seems to work well in most cases, and then suddenly there's weird error messages for a few downstream users.

If someone really needs precisely that level of flexiblity for their DSL, it's still possible to drop down and write the corresponding ParserFn and parser info. But if your DSL has token rules that are very different from Lean's, then you're probably better off working mostly at that level anyway (like I do in Verso).

@digama0
Copy link
Collaborator

digama0 commented Nov 14, 2024

This was a balance of footgun vs flexibility. A consequence of the way open private was specified in Batteries was that open private or open /- Extracting the secret treasures! -/ private was banned, which could be surprising. This is also an easy mistake to make - it seems to work well in most cases, and then suddenly there's weird error messages for a few downstream users.

I agree on this, it's not the first time this mistake has been made and I don't think there are any commands in batteries or mathlib that would want to do this. But that's justification for a warning at best.

@kim-em
Copy link
Collaborator

kim-em commented Nov 17, 2024

@david-christiansen, I'm seeing

/-- `f ''ᵁ U` is notation for the image (as an open set) of `U` under an open immersion `f`. -/
scoped[AlgebraicGeometry] notation3:90 f:91 " ''ᵁ " U:90 => (Scheme.Hom.opensFunctor f).obj U

erroring with "invalid atom" now. Can we add some further flexibility to allow this?

@david-christiansen
Copy link
Contributor Author

I agree on this, it's not the first time this mistake has been made and I don't think there are any commands in batteries or mathlib that would want to do this. But that's justification for a warning at best.

If a real-world use case arises, then we can revisit this. Right now, it doesn't seem like a good use of development resources - "no internal whitespace" is easy to implement and solves the problem, while making it a warning would require more time and increase the complexity of the code overall.

github-merge-queue bot pushed a commit that referenced this pull request Nov 18, 2024
This PR liberalizes atom rules by allowing `''` to be a prefix of an
atom, after #6012 only added an exception for `''` alone, and also adds
some unit tests for atom validation.
@digama0
Copy link
Collaborator

digama0 commented Nov 23, 2024

I had a use case for this come up:

axiom SProp : Type
axiom SProp.ip : UInt64 → SProp
prefix:80 "RIP ↦ " => SProp.ip

from some separation logic project I'm working on. As I understand it this will not work in the next version of lean, and the prefix command does not allow adding multiple string literals in that position. Could we make the prefix command either split the input on whitespace (preserving formatting), or allow multiple string tokens in that position?

The other unfortunate side effect of using two tokens here is that it makes RIP into a keyword, which is okay but suboptimal in this situation (I would prefer it is only special when immediately followed by in this notation).

@david-christiansen
Copy link
Contributor Author

I'm not totally convinced by this use case for spaces in the tokens. It's not that it doesn't seem like what's being asked for is useful, it's just that it seems to me that the usefulness really comes from new features rather than from undoing this new validation step.

Keeping the old behavior would have resulted in a keyword with a single space in it. RIP ↦ would be valid syntax, while RIP ↦ or RIP /- This does the thing! -/ ↦ would not. This is precisely the footgun to be avoided, and I don't think that it's a compelling case for whitespace characters in tokens.

It doesn't seem bonkers to allow multi-token operators, though I'd like to see why a notation doesn't do what you want here. I think it's OK if operators make easy things easy, but don't cover a wide range of syntax, and from what I can see, you'd get all the benefits of a two-token prefix operator from a notation. RIP would indeed be a keyword, of course, but this is another one of those balancing acts between flexibility and consistency that doesn't strike me as a huge problem, though I of course don't have a ton of insight into the specifics of your project. What am I missing?

@digama0
Copy link
Collaborator

digama0 commented Nov 25, 2024

It's not that it doesn't seem like what's being asked for is useful, it's just that it seems to me that the usefulness really comes from new features rather than from undoing this new validation step.

No disagreement from me on this! I just wanted to give you a reasonable use case to center your thinking on this one.

RIP would indeed be a keyword, of course, but this is another one of those balancing acts between flexibility and consistency that doesn't strike me as a huge problem, though I of course don't have a ton of insight into the specifics of your project. What am I missing?

The idea here is that RIP would be a normal identifier (in fact it may even be a definition), but when followed by (I don't care if comments are allowed in between the token or not) it would be treated as the above composite syntax. There are a few tricks already in use in lean core to avoid e.g. tactic names from being treated as tokens and banning their use in other positions, and this is a similar issue. I'm not fussed on the particular implementation choices needed to get there, but I don't think notation &"RIP" " ↦ " x => SProp.ip x works currently.

Context on the project isn't too relevant to this feature request, but to explain a bit more: RIP ↦ x is a separating proposition which says that the instruction pointer points to x, but RIP isn't a memory location, and it's not even a regular register name, which is why it has special syntax here. RIP is however the name of a projection out of the machine state (in a relatively separate part of the project), and these two things clash if you require RIP to be a token.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaks-mathlib This is not necessarily a blocker for merging: but there needs to be a plan changelog-language Language features, tactics, and metaprograms toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN will-merge-soon …unless someone speaks up
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validation of atoms should ignore whitespace
5 participants