Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Signed Fixed-width Integer Support #3162

Closed
pnwamk opened this issue Jan 10, 2024 · 10 comments
Closed

RFC: Signed Fixed-width Integer Support #3162

pnwamk opened this issue Jan 10, 2024 · 10 comments
Labels
P-high We will work on this issue RFC Request for comments

Comments

@pnwamk
Copy link
Contributor

pnwamk commented Jan 10, 2024

Proposal

Lean currently supports unsigned fixed-width integers (UInt8, UInt16, UInt23, UInt64).

This proposal is regarding similar support for signed fixed-width integers that provide similar efficiency benefits to the UIntN types (e.g., native C representations).

Design Points for discussion

  1. Semantics for overflow and avoiding UB.

    • Division by zero returning zero seems an obvious choice here like for existing UIntN types.
    • Signed arithmetic overflow in C/C++ is undefined behavior. To avoid this, an implementation could
      i. simply behave more like Fin and BitVec in Lean4 today and provide predictable wraparound behavior on overflow (there is a slight bias for these semantics in this RFC at the moment), or
      ii. choose to detect and panic on such occurrences (perhaps similar to Rust in debug mode?)
  2. C representation

    • Using the expected intn_t types could make for the least surprising set of C definitions/FFI APIs
    • A case for reusing uintn_t types could also be made as they are already used by the compiler for some types
  3. Where would IntN definitions ideally live in the short and long term?

Approaches

Approach 1: UIntN Subtypes

A UIntN is used to represent each IntN type, e.g. here is Int8:

structure Int8 where
  ofUInt8 ::
  toUInt8 : UInt8
  deriving DecidableEq, Inhabited, Hashable

Efficient Representation

This approach already provides an efficient representation in compiled code with no compiler extensions by being defined directly in terms of UInt8. I.e., all compiled definitions operating on these Int8s will use uint8_t directly. Note that this does mean that FFI code will need to manually track which generated C definitions are using uint8_t for Int8 and which are using uint8_t for some other type (e.g., UInt8, Bool, etc).

Operators

Arithmetic operations which are "sign-insensitive" can be defined directly in terms of their corresponding UInt8 operation, e.g. Int8.add:

def Int8.add (a b : Int8) : Int8 := ⟨UInt8.add a.toUInt8 b.toUInt8⟩

The remaining few sign-sensitive operations can be given the appropriate implementation in Lean (and backed with a more efficient C representation where needed), e.g., Int8.div:

@[extern "lean_int8_div"]
def Int8.div (a b : Int8) : Int8 :=
  match a.isNeg, b.isNeg with
  | true,  true => ⟨(-a).toUInt8 / (-b).toUInt8⟩
  | true,  false => -⟨(-a).toUInt8 / b.toUInt8⟩
  | false, true => -⟨a.toUInt8 / (-b).toUInt8⟩
  | false, false => ⟨a.toUInt8 / b.toUInt8⟩
instance : Div Int8 := ⟨Int8.div⟩

and the corresponding C definition lean_int8_div:

extern "C" LEAN_EXPORT uint8_t lean_int8_div(uint8_t a, uint8_t b) {
  if ((a == 0) || ((a == INT8_MIN) && (b == ((uint8_t)((int8_t)-1))))) {
    return a;
  } else {
    return ((int8_t) a) / ((int8_t) b);
  }
}

Here's an initial draft for Int8 for feedback here if we want to use this approach:

Potential pros

  • requires no compiler changes required for efficient representation
  • requires minimal new Lean code since many Int8 ops simply use the same UInt8 op
  • requires minimal new C code since UInt8 already provides many appropriate C definitions that are re-used

Potential cons

  • without some compiler additions, generated C functions operating on IntN types will all be in terms of uintn_t types and not intn_t types. The casts between a uintn_t and intn_t type are free and semantically correct. The book keeping to track which uintn_t means what in the FFI/C layer, however, could be painful and a common Lean4/FFI footgun.

Approach 2: Fin Subtype

A Fin is used to represent each IntN type, e.g. here is Int8:

/-- The size of type `Int8`, that is, `2^8 = 256`. -/
abbrev Int8.size : Nat := 256

/--
The type of signed 8-bit two's compliment integers.
-/
structure Int8 where
  val : Fin Int8.size

Efficient Representation

This approach would require a compiler extension like what exists for UIntN types to use the expected intn_t C type. This would obviously mean more up front work, but the resulting compiled C/FFI code would be in terms of the expected signed integer type, which seems likely to be the least surprising choice for a user to encounter.

Operators

Arithmetic operations which are "sign-insensitive" can be defined directly in terms of their corresponding Fin operation in Lean and given the expected efficient representations in C, e.g. Int8.add:

def Int8.add (a b : Int8) : Int8 := ⟨a.val + b.val⟩

The remaining few sign-sensitive operations can be given a more complex definition when appropriate in Lean (along with a more efficient C implementation where needed), e.g., Int8.div:

@[extern "lean_int8_div"]
def Int8.div (a b : Int8) : Int8 :=
  match a.isNeg, b.isNeg with
  | true,  true => ⟨(-a).val / (-b).val⟩
  | true,  false => -⟨(-a).val / b.val⟩
  | false, true => -⟨a.val / (-b).val⟩
  | false, false => ⟨a.val / b.val⟩
instance : Div Int8 := ⟨Int8.div⟩

and the corresponding C definition lean_int8_div:

extern "C" LEAN_EXPORT int8_t lean_int8_div(int8_t a, int8_t b) {
  if ((a == 0) || ((a == INT8_MIN) && (b == -1))) {
    return a;
  } else {
    return a / b;
  }
}

Here's an initial draft for Int8 using Fin for feedback here if we want to use this approach:

Potential pros

  • has the least surprising C definitions/FFI
  • defining signed fixed width integers in terms of Fin may be "more direct" or "more natural" than defining them in terms of unsigned fixed-width integers (debatable of course!)

Potential cons

  • requires some compiler additions to get the expected efficient C representation, likely similar in scope to what exists for UIntN

Approach 3: Int Subtypes

Each IntN type could also be defined as a subtype of Int similar to the following:

/--
`IntN n` is an integer `i` with the constraint that `-2^(n-1) ≤ i < 2^(n-1) - 1`.
-/
structure IntN (n : Nat)where
  /-- If `i : IntN n`, then `i.val : ℤ` is the described number. It can also be
  written as `i.1` or just `i` when the target type is known. -/
  val  : Int
  /-- If `i : IntN n`, then `i.2` is a proof that `-2^(n-1) ≥ i.1`. -/
  isGe : LE.le (2^(n-1)) val
  /-- If `i : Fin2C n`, then `i.3` is a proof that `i.1 < (n-1)`. -/
  isLt : LT.lt val (2^(n-1)-1)

Zulip discussion steered me away from this and towards the aforementioned UIntN approach, so less experimenting has been done with this approach.

Potential pros

  • generated C interfaces are in terms of the signed integer types, which could be more intuitive/less error prone when writing FFI-involved code.

Potential cons

  • Requires more Lean and C definitions per type than UIntN approach.
  • Requires compiler extension(s) to achieve a native representation.

Community Feedback

The lack of signed fixed-width integer types has come up in discussion with Lean FRO members and has been discussed some in the lean4 dev community. It was also noted as a pain point in the community when working with external code during Lean Together 2024.

Thank you to those who have helped in discussing this topic so far.

Impact

Add 👍 to issues you consider important. If others benefit from the changes in this proposal being added, please ask them to add 👍 to it.

Related Work

The following also involve supporting signed fixed-width integer types:

  • Std.Data.BitVec: supports similar functionality but for arbitrary bitwidths and wraps a Nat
  • SciLean.Data.Int64: Int64 impl using the same proposed design of wrapping an unsigned int (USize instead of than UIntN). The proposed Int8 impl is similar with a few slight differences (e.g., fewer bitwise operations, leverages a few C definitions for efficiency, etc)
  • Galois.Data.Bitvec: somewhat outdated but still a helpful reference, similar to Std.Data.BitVec
@pnwamk pnwamk added the RFC Request for comments label Jan 10, 2024
@fgdorais
Copy link
Contributor

Are there considerations to support overflow detection?

@pnwamk
Copy link
Contributor Author

pnwamk commented Jan 10, 2024

Are there considerations to support overflow detection?

I'm not aware of specific considerations here as of yet. The tinkering so far has focused on efficiency, avoiding signed integer-related UB we may inherit from C, and just general correctness.

Does anyone know: is there any prior art in this space for the UIntN types? (If no, would such considerations equally apply to both UIntN and IntN types? And, do we want such considerations to be a part of an initial IntN addition/PR, or a part of a follow-up feature/enhancement?)

@fgdorais
Copy link
Contributor

The usual semantics for unsigned types are to wrap around, without overflow. (However, add with carry/subtract with borrow/extended multiplication are often supported.) The overflow semantics for signed types are quite different; they are well supported in LLVM and as builtins in major C compilers. So support is available but might require compiler support.

@pnwamk
Copy link
Contributor Author

pnwamk commented Jan 10, 2024

The usual semantics for unsigned types are to wrap around, without overflow. (However, add with carry/subtract with borrow/extended multiplication are often supported.) The overflow semantics for signed types are quite different; they are well supported in LLVM and as builtins in major C compilers. So support is available but might require compiler support.

The proposed Int8 implementation linked in the RFC uses the corresponding unsigned operations where possible. This gives them predictable wrap around semantics and avoids any UB from signed integer overflow, e.g. (127 + 1) : Int8 = -128.

For the sign-sensitive operations, it specifically checks for and avoids the problematic cases which would trigger signed integer overflow (e.g., div and mod explicitly check the few cases where that can happen, so you get predictable albeit perhaps initially odd results like(-128 / -1) : Int8 = -128 instead of UB).

Perhaps I misunderstood what you meant by "support overflow detection". The proposed design explicitly avoids signed integer overflow in the C sense and values will wrap around, but it doesn't attempt to dynamically detect specific cases at runtime in order to warn/error for the user.

@fgdorais
Copy link
Contributor

The runtime detection (with catch support?) is what I'm asking about.

Just to be clear: I'm just asking whether this is currently considered, I don't think this is important for basic signed integer support.

@alexkeizer
Copy link
Contributor

alexkeizer commented Jan 11, 2024

This seems closely related to the Std.BitVec type, which implements a signless approach to bitvectors.
For example, it has Std.BitVec.udiv for unsigned division and Std.BitVec.sdiv for signed division.
Ultimately, though, the BitVec is just a wrapper around a Nat, just like the proposed IntN being a wrapper around UIntN.

Things like overflow detection and such might make more sense to implement on BitVec, and are in fact on my roadmap for them already. The proposed IntN types seem more suited for when you want to squeeze out every last bit of performance, and thus probably don't want to pay the cost of overflow detection.

@pnwamk
Copy link
Contributor Author

pnwamk commented Jan 11, 2024

This seems closely related to the Std.BitVec type

Yes, that code was helpful to reference =) I just added a Related Work section to the RFC I should have included initially acknowledging the related implementations I'm aware of.

Things like overflow detection and such might make more sense to implement on BitVec ... the proposed IntN types seem more suited for when you want to squeeze out every last bit of performance

I am increasingly curious if Lean could reasonably help a user avoid/detect overflow while using these sorts of "primitive" UIntN and IntN types... but I think that may be a topic to dive into separately (at least based on the fact that the UIntN types today also don't feature this). Happy to hear disagreement here though =)

@fgdorais
Copy link
Contributor

C/C++ make signed integer overflow undefined but efficient tools are generally available.

LLVM support: https://llvm.org/docs/LangRef.html#llvm-sadd-with-overflow-intrinsics
Clang support: https://clang.llvm.org/docs/LanguageExtensions.html#checked-arithmetic-builtins
GCC support: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html

A generic operation would have to trap overflow by checking the inputs.

@leanprover-bot leanprover-bot added the P-high We will work on this issue label Sep 20, 2024
@hargoniX
Copy link
Contributor

hargoniX commented Nov 5, 2024

This is now implemented through #5790, #5885 and #5961. Additionally UIntX was refactored to use BitVec instead in #5323 so IntX can now make use of the BitVec primitives that we already put in a lot of work over the course of this year.

To the best of our knowledge we have addressed all UB concerns by both mimicking semantics of BitVec in C and enabling the -fwrapv option in the compiler such that overflow etc. is now defined behavior.

Additional things such as proper proof APIs for both UIntX and IntX as well as things like catching overflows are future work items that are not part of this RFC afaict so I'm closing this as completed.

@hargoniX hargoniX closed this as completed Nov 5, 2024
@omentic
Copy link

omentic commented Nov 5, 2024

Awesome!! This is very useful, thank you 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P-high We will work on this issue RFC Request for comments
Projects
None yet
Development

No branches or pull requests

6 participants