-
-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move Hashable to builtin package and make it easier to make objects Hashable #157
Open
mfelsche
wants to merge
2
commits into
main
Choose a base branch
from
move-hashable-builtin
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,177 @@ | ||
- Feature Name: move_hashable_to_builtin | ||
- Start Date: 2019-12-19 | ||
- RFC PR: (leave this empty) | ||
- Pony Issue: (leave this empty) | ||
|
||
# Summary | ||
|
||
Move `Hashable` to the builtin package and make it easier to make objects `Hashable` | ||
by defining convenience functions that use a common good and fast way of | ||
combining hashes of object fields into a single hash value. Also extend documentation | ||
and tutorial with information on how to make objects hashable. | ||
|
||
# Motivation | ||
|
||
It is possible to make any object implement the interface `Hashable` or `Hashable64` but | ||
if we want to state that by explicitly extending it we need to add a `use "collections"`, | ||
although we do not intend to use any pony collection: | ||
|
||
```pony | ||
use "collections" | ||
|
||
class MyClass is Hashable | ||
fun hash(): USize => 42 | ||
``` | ||
|
||
The main point here is that `Hashable`s main use case is its usage in `HashMap` etc. but | ||
it is not limited to that. The notion of Hashability is generic enough to earn a rank | ||
among other interfaces like `Equatable` which already reside in the `builtin` package. | ||
|
||
In fact, there exists a fundamental relationship between `Equatable` and `Hashable`: If | ||
a class `C` is both `Equatable` and `Hashable`, if two instances `c1` and `c2` are considered | ||
equal (as far as structural equality is concerned), their hash should also be the same. | ||
|
||
This RFC also suggests to add a primitive called `Hashing` which provides convenience methods for | ||
defining fast and good (enough) `fun hash(): USize` methods for arbitrary objects, combining hashes of the objects fields | ||
with a good hash function ([Siphash-2-4](https://131002.net/siphash/)). | ||
|
||
Consider the following example: | ||
|
||
```pony | ||
|
||
class val MyClass | ||
let _field1: String | ||
let _field2: U32 | ||
|
||
new val create(f1: String, f2: String) => | ||
_field1 = f1 | ||
_field2 = f2 | ||
``` | ||
|
||
To make this class implement `Hashable` or `Hashable64`, we need to find a way to | ||
combine the hashes of both fields if these should make up the identity for this class. | ||
In order to make it more convenient for people to write useful classes | ||
it is favorable to advertise a common way to define hash functions for objects and offer a convenient way | ||
to implement them without too much boilerplate. | ||
|
||
This is a cranky, crufty and ugly, hand-crafted hash-function for the class above | ||
(adapted from how most IDEs in java auto-create hashCode methods): | ||
|
||
```pony | ||
fun hash(): USize => | ||
var result = USize(17) | ||
result = (result * 37) + _field1.hash() | ||
result = (result * 37) + _field2.hash() | ||
result | ||
``` | ||
|
||
A hash function implemented using a convenience function suggested by this RFC | ||
would look as nice and clean as this: | ||
|
||
```pony | ||
fun hash(): USize => | ||
Hashing.hash_2(_field1, _field2) | ||
``` | ||
|
||
# Detailed design | ||
|
||
## Move Hashable to Builtin | ||
|
||
We move the `Hashable` and `Hashable64` interfaces to the `builtin` package. | ||
|
||
## Make hashing objects convenient | ||
|
||
A primitive called `Hashing` will be introduced into the `builtin` package, alongside with a pure pony `Siphash` implementation. | ||
Currently libponyrt contains code for hashing byte-sequences with a C-implementation of Siphash | ||
in https://github.com/ponylang/ponyc/blob/master/src/libponyrt/ds/fun.c#L92 . As the implementation is already iterating over | ||
the array in 8-byte (4-byte for 32-bit platforms) increments, we can also use it in a streaming fashion, feeding it consecutive U64. | ||
This way we can feed it hashes of object fields and compute the sip-hash-2-4 of all the hashes of the object fields. | ||
|
||
When hashing objects based on their fields, we must take all those fields into account. They might be primitive types, objects with fields themselves, | ||
their hash might be using the fields identiy using the `digestof` operator etc. etc. So generating a hash boils down to a clever way to combine the hashes of all fields | ||
that should be considered. The hashes of the fields are chosen as a common denominator. | ||
And, ideally, those hashes should reflect all of the fields contents. | ||
So if we build a hash of those hashes using a good enough hash function, we should, by transitivity, also get a good enough hash of the whole object in return. | ||
|
||
As far as I know, [SipHash-2-4](https://131002.net/siphash/) can be considered a good enough hash-function. | ||
It is also already in use for byte sequences and strings in Pony. | ||
|
||
This `Hashing` primitive will contain methods for hashing `ByteSeqs`, as e.g. `Array[U8]` does not implement `Hashable` but can be easily hashed, | ||
in fact all Arrays of primitive numeric types can and should be very easily hashed. | ||
It will contain methods for hashing `ReadSeq[Hashable #read]` by combining the element hashes into one hash value | ||
using the [SipHash-2-4](https://131002.net/siphash/) implemention described above. | ||
It will also contain methods for conveniently combining hashes from different objects, like: | ||
|
||
```pony | ||
|
||
primitive Hashing | ||
fun hash_3(a: Hashable box, b: Hashable box, c: Hashable box): USize => | ||
""" | ||
convenience function for combining the hashes of three object implementing `Hashable` | ||
into one hash value. | ||
""" | ||
ifdef ilp32 then | ||
let sip = HalfSipHash24Streaming.create() | ||
sip.update(a.hash().u32()) | ||
sip.update(b.hash().u32()) | ||
sip.update(c.hash().u32()) | ||
sip.finish() | ||
else | ||
let sip = SipHash24Streaming.create() | ||
sip.update(a.hash().u64()) | ||
sip.update(b.hash().u64()) | ||
sip.update(c.hash().u64()) | ||
sip.finish() | ||
end | ||
``` | ||
|
||
It will contain such functions for 2 arguments up to 12 arguments, maybe more. | ||
|
||
# How We Teach This | ||
|
||
The Tutorial or the pony patterns (or both) will get a chapter on how to create `value classes`, classes fully represented by their fields. | ||
In other category theory contexts these are also known as Product types. | ||
|
||
`value classes` (The actual name we use for these classes is very much up for debate) should implement Hashable using the `Hashing` primitive if possible and Equatable. | ||
Equality should be based upon the same fields used for generating the hash. We should also mention implementing `Comparable`, if a partial order can/should be defined for | ||
the value class in question. | ||
|
||
Also the docstring of `Hashing` and its methods should give usage examples. | ||
|
||
# How We Test This | ||
|
||
Tests for the pony siphash-2-4 implementation should be written, at least ensuring it is equivalent to the C implementation in libponyrt. | ||
|
||
If possible, it should be tested that the hash of a pony object using `Hashing` is a good hash function itself, possibly using some | ||
test suite like SMHasher ( https://github.com/aappleby/smhasher ) or something similar. | ||
|
||
# Drawbacks | ||
|
||
* By introducing a pure pony implementation of SipHash-2-4, we have more pony code to test and maintain. Maybe it does not behave 100% like the C code | ||
would. Maybe differences will only appear on different platforms etc. | ||
* We actually cannot remove the siphash implementation from libponyrt as the stringtab in libponyc still uses it. | ||
|
||
# Alternatives | ||
|
||
* A simple-stupid hash function (something like the bernstein hash function (http://www.cse.yorku.ca/~oz/hash.html)) could be used: | ||
|
||
```pony | ||
fun hash(): USize => | ||
var result = USiz(17) | ||
result = (result * 37) + _field1.hash() | ||
result = (result * 37) + _field2.hash() | ||
result | ||
``` | ||
|
||
It might be faster than siphash-2-4 but might lack some good properties expected from hash functions. | ||
|
||
* Dot not reimplement SipHash in pure pony, but add the possibility of streaming uint64_t/uint32_t through it in libponyrt in C. | ||
After all, we can't really get rid of the siphash code in libponyrt, as we need the in the compiler for the `stringtab`s. | ||
|
||
# Unresolved questions | ||
|
||
* Naming: | ||
- What should be the name for classes whose identity is fully determined by its fields. In java-land, these are often called *value-classes*. | ||
In scala, inspired by category theory, these are called `case classes` or `Product types`. | ||
- Should the primitive containing convenience functions for hashing such objects be called `Hashing` or something else? | ||
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One alternative to this approach that I want to mention is something like what Crystal does.
Every object in Crystal has a
hash
function that takes one argument of typeHasher
. By default, that argument is a new instance ofHasher
, which is a class similar in concept to what you've proposed here with theHashing
primitive. The difference is that it is implemented not as a primitive, but as an immutable/persistent data structure (similar to thosecollections/persistent
in Pony).If we took a similar approach, someone could implement hashing with an example looking something like this:
The advantage to this kind of approach is that it won't require us to define N different
hash_1
,hash_2
,hash_3
, [...] functions as proposed here.This approach also adds the ability to hash a dynamic number of inner fields to hash, rather than being limited to a static function signature with N arguments. For example, an implementation of a
List
style of data structure might want to hash everything present in the list, and in this approach it could be as simple as this:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the beauty in this approach, but it also feels like a weird indirection for the user.
I would not see a requirement for keeping the Hasher immutable/persistent, but would rather make it mutable, as i did with
Siphash24Streaming
here: https://github.com/ponylang/ponyc/compare/hashing-builtin?expand=1#diff-3fcbdccf4a6378f88bdb211e3838307bR9Whatever is faster here.
So, this approach would change the interface
Hashable
to also include thehash_with
method and would require to add aHasher
interface and implementation.Also notice, that my
Hashing
implementation could conveniently solve the case of dynamic number of fields by adding ahash_iter(iter: Iterator[Hashable box])
method.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general i think both approaches are interesting and would be willing to rewrite the RFC to incorporate @jemc's suggestion as the RFCs main suggestion, if people tend towards it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One note on performance - I'm not sure how the other differences would affect perforamance, but one difference that is worth noting is that the
Hashing.hash_2(x, y)
approach would need virtual calls, due to the calls being done on theHashable
interface arguments without a concrete type. The workaround for this would be to use type parameters to specialize for different types, but that leads to a much more verbose call site (i.e.Hashing.hash_2[TypeX, TypeY](x, y)
).I believe that the
Hasher
class approach wouldn't need any virtual calls, but maybe I'm missing something.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding mutable vs immutable
Hasher
class, I am not strongly opinionated, and would in general lean toward whichever approach is more performant. Probably the mutable approach is more performant in current Pony, since immutable objects need to allocated many more times, and in the absence of Heap-to-stack optimization, that's kind of a big deal.Crystal addresses this by using a
struct
instead of aclass
for this type. In Crystal, astruct
is passed "by value" instead of "by reference", so it's something like a Pony tuple, but with callable methods for it. This also explains why it is implemented as an immutable data structure, since that's the only sane way to deal with a "by value" object without major headaches.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I mentioned If we wanted to do absolutely zero-allocation (and still zero-virtual-call), we could use a tuple ourselves. Something like:
This would probably be the most performant solution (if I had to guess), while also being generally nice in terms of developer experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like where this is going, @jemc !
Using SipHash-2-4 needs a state of
(U64, U64, U64. U64, USize)
in 64 bit systems, and(U32, U32, U32, U32, USize)
on 32 bit systems. So we need some dispatching, or we work with USize all the way down. Anyway, it seems very much doable.