-
-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move Hashable to builtin package and make it easier to make objects Hashable #157
base: main
Are you sure you want to change the base?
Conversation
Is there anything in the standard library that is outside of the collections package that implements these interfaces? |
Only builtin classes like the numeric types, String and Pointer implement a |
end | ||
``` | ||
|
||
It will contain such functions for 2 arguments up to 12 arguments, maybe more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One alternative to this approach that I want to mention is something like what Crystal does.
Every object in Crystal has a hash
function that takes one argument of type Hasher
. By default, that argument is a new instance of Hasher
, which is a class similar in concept to what you've proposed here with the Hashing
primitive. The difference is that it is implemented not as a primitive, but as an immutable/persistent data structure (similar to those collections/persistent
in Pony).
If we took a similar approach, someone could implement hashing with an example looking something like this:
class MyClass
let foo: Foo
let bar: Bar
let baz: Baz
fun hash(): U32 =>
hash_with(Hasher.create()).result()
fun hash_with(hasher': Hasher): Hasher =>
var hasher = hasher'
hasher = this.foo.hash_with(hasher)
hasher = this.bar.hash_with(hasher)
hasher = this.baz.hash_with(hasher)
hasher
The advantage to this kind of approach is that it won't require us to define N different hash_1
, hash_2
, hash_3
, [...] functions as proposed here.
This approach also adds the ability to hash a dynamic number of inner fields to hash, rather than being limited to a static function signature with N arguments. For example, an implementation of a List
style of data structure might want to hash everything present in the list, and in this approach it could be as simple as this:
fun hash_with(hasher': Hasher): Hasher =>
var hasher = hasher'
for item in values() do
hasher = this.foo.hash_with(hasher)
end
hasher
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the beauty in this approach, but it also feels like a weird indirection for the user.
I would not see a requirement for keeping the Hasher immutable/persistent, but would rather make it mutable, as i did with Siphash24Streaming
here: https://github.com/ponylang/ponyc/compare/hashing-builtin?expand=1#diff-3fcbdccf4a6378f88bdb211e3838307bR9
fun hash_with(hasher: Hasher ref): Hasher ref =>
this.foo.hash_with(hasher)
this.bar.hash_with(hasher)
this.baz.hash_with(hasher)
hasher
Whatever is faster here.
So, this approach would change the interface Hashable
to also include the hash_with
method and would require to add a Hasher
interface and implementation.
Also notice, that my Hashing
implementation could conveniently solve the case of dynamic number of fields by adding a hash_iter(iter: Iterator[Hashable box])
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general i think both approaches are interesting and would be willing to rewrite the RFC to incorporate @jemc's suggestion as the RFCs main suggestion, if people tend towards it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One note on performance - I'm not sure how the other differences would affect perforamance, but one difference that is worth noting is that the Hashing.hash_2(x, y)
approach would need virtual calls, due to the calls being done on the Hashable
interface arguments without a concrete type. The workaround for this would be to use type parameters to specialize for different types, but that leads to a much more verbose call site (i.e. Hashing.hash_2[TypeX, TypeY](x, y)
).
I believe that the Hasher
class approach wouldn't need any virtual calls, but maybe I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding mutable vs immutable Hasher
class, I am not strongly opinionated, and would in general lean toward whichever approach is more performant. Probably the mutable approach is more performant in current Pony, since immutable objects need to allocated many more times, and in the absence of Heap-to-stack optimization, that's kind of a big deal.
Crystal addresses this by using a struct
instead of a class
for this type. In Crystal, a struct
is passed "by value" instead of "by reference", so it's something like a Pony tuple, but with callable methods for it. This also explains why it is implemented as an immutable data structure, since that's the only sane way to deal with a "by value" object without major headaches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I mentioned If we wanted to do absolutely zero-allocation (and still zero-virtual-call), we could use a tuple ourselves. Something like:
type HashState is (U32, U32) // or whatever other kind of state we need...
primitive Hashing
[...]
class MyClass
let foo: Foo
let bar: Bar
let baz: Baz
fun hash(): U32 =>
Hashing.finish(hash_with(Hashing.begin()))
fun hash_with(state': HashState): HashState =>
var state = state'
state = this.foo.hash_with(state)
state = this.bar.hash_with(state)
state = this.baz.hash_with(state)
state
This would probably be the most performant solution (if I had to guess), while also being generally nice in terms of developer experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like where this is going, @jemc !
Using SipHash-2-4 needs a state of (U64, U64, U64. U64, USize)
in 64 bit systems, and (U32, U32, U32, U32, USize)
on 32 bit systems. So we need some dispatching, or we work with USize all the way down. Anyway, it seems very much doable.
During sync it was mentioned, that keeping hashing free from allocation is very important as hashing often happens in hot-paths. It was discussed that moving Will update the RFC accordingly. |
I did a POC of this RFC in on the hashing_builtin branch of the pony repo and created a small benchmark to compare the different approaches to hashing (main difference is in hashing numbers and using a pure pony siphash implementation everywhere): https://gist.github.com/mfelsche/da13a70d6e4048bd70800a46f23b7ef4 TL;DR: hashing numbers is around 2 times slower, but that actually doesnt affect e.g. If those benchmarks seem odd to you, I can happily redo them. |
Do we have an understanding of what causes the perf difference for hashing numbers? Is it more about the overhead of keeping the state, or is it more about per-operation overhead (e.g. safe arithmetic vs unsafe arithmetic)? |
@jemc the simple reason is that it now uses siphash as does string and array hashing (it also did before but was using a C implementation in libponyrt). Previously numbers where hashed by this small function: var x = u64()
x = (not x) + (x << 21)
x = x xor (x >> 24)
x = (x + (x << 3)) + (x << 8)
x = x xor (x >> 14)
x = (x + (x << 2)) + (x << 4)
x = x xor (x >> 28)
x = x + (x << 31)
x I am unable to comment on the hashing quality of this. As all hashing should operate on the same HashState, i didnt dare to mess with the given tuples and basically do the same as above on, say the first element of the tuple. I am not experienced enough to judge on what i am doing to the hashing quality of the used siphash in that case. |
Let me argue in favor of this RFC in case it comes to voting and i am absent: So, while we hit a 2x performance penalty on hashing numbers, we gain a little on strings (i think because of llvm optimizations which might not be there in the compiled runtime), but most importantly we gain a means to do composite hashing of classes that has good quality, is not ad-hoc and error-prone, that is easy to use and teach, and, most importantly to me, it enables us to implement something like |
Given that However, if anyone on hand feels qualified to comment, I'd love to figure out if we could somehow get the best of both worlds, and do something like this thing that you said you didn't dare to do, due to lack of knowing how hashing quality would be affected:
I personally would not be qualified to make a decision like this either, without more time spent researching and understanding the topic. |
Based on discussion during sync, I think that the "how to teach it" needs to be updated to include documenting how to change the hash function for a given object so that if the number hashing is an issue (or hashing in general is an issue). I'm not sure what that should be. It might be part of stdlib documentation, it might be part of the tutorial. But I think it is important as part of the change given possible performance or quality impacts. I'd like to see this change made to the RFC and fleshed out before moving this to final comment period, |
Something else to help mitigate the performance issue that we discussed in the sync, which can also be used to help with teaching efforts: We could add a new standard library package that includes something like Note that the |
I had a look at how rust does it. There the I would honestly just do roughly the same, and have the I am sorry for not being able to participate in a tighter and more regular frequency. |
Rendered
Suggested changes: https://github.com/ponylang/ponyc/compare/hashing-builtin?expand=1#diff-3fcbdccf4a6378f88bdb211e3838307bR9