Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move Hashable to builtin package and make it easier to make objects Hashable #157

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mfelsche
Copy link
Contributor

@mfelsche mfelsche commented Jan 6, 2020

@mfelsche mfelsche changed the title Add RFC for moving Hashable to builtin package Move Hashable to builtin package and make it easier to make objects Hashable Jan 6, 2020
@SeanTAllen
Copy link
Member

Is there anything in the standard library that is outside of the collections package that implements these interfaces?

@mfelsche
Copy link
Contributor Author

mfelsche commented Jan 6, 2020

Only builtin classes like the numeric types, String and Pointer implement a hash or hash64() function. Nothing else inside of stdlib.

end
```

It will contain such functions for 2 arguments up to 12 arguments, maybe more.
Copy link
Member

@jemc jemc Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative to this approach that I want to mention is something like what Crystal does.

Every object in Crystal has a hash function that takes one argument of type Hasher. By default, that argument is a new instance of Hasher, which is a class similar in concept to what you've proposed here with the Hashing primitive. The difference is that it is implemented not as a primitive, but as an immutable/persistent data structure (similar to those collections/persistent in Pony).

If we took a similar approach, someone could implement hashing with an example looking something like this:

class MyClass
  let foo: Foo
  let bar: Bar
  let baz: Baz

  fun hash(): U32 =>
    hash_with(Hasher.create()).result()

  fun hash_with(hasher': Hasher): Hasher =>
    var hasher = hasher'
    hasher = this.foo.hash_with(hasher)
    hasher = this.bar.hash_with(hasher)
    hasher = this.baz.hash_with(hasher)
    hasher

The advantage to this kind of approach is that it won't require us to define N different hash_1, hash_2, hash_3, [...] functions as proposed here.

This approach also adds the ability to hash a dynamic number of inner fields to hash, rather than being limited to a static function signature with N arguments. For example, an implementation of a List style of data structure might want to hash everything present in the list, and in this approach it could be as simple as this:

  fun hash_with(hasher': Hasher): Hasher =>
    var hasher = hasher'
    for item in values() do
      hasher = this.foo.hash_with(hasher)
    end
    hasher

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the beauty in this approach, but it also feels like a weird indirection for the user.

I would not see a requirement for keeping the Hasher immutable/persistent, but would rather make it mutable, as i did with Siphash24Streaming here: https://github.com/ponylang/ponyc/compare/hashing-builtin?expand=1#diff-3fcbdccf4a6378f88bdb211e3838307bR9

fun hash_with(hasher: Hasher ref): Hasher ref =>
  this.foo.hash_with(hasher)
  this.bar.hash_with(hasher)
  this.baz.hash_with(hasher)
  hasher

Whatever is faster here.

So, this approach would change the interface Hashable to also include the hash_with method and would require to add a Hasher interface and implementation.

Also notice, that my Hashing implementation could conveniently solve the case of dynamic number of fields by adding a hash_iter(iter: Iterator[Hashable box]) method.

Copy link
Contributor Author

@mfelsche mfelsche Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general i think both approaches are interesting and would be willing to rewrite the RFC to incorporate @jemc's suggestion as the RFCs main suggestion, if people tend towards it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One note on performance - I'm not sure how the other differences would affect perforamance, but one difference that is worth noting is that the Hashing.hash_2(x, y) approach would need virtual calls, due to the calls being done on the Hashable interface arguments without a concrete type. The workaround for this would be to use type parameters to specialize for different types, but that leads to a much more verbose call site (i.e. Hashing.hash_2[TypeX, TypeY](x, y)).

I believe that the Hasher class approach wouldn't need any virtual calls, but maybe I'm missing something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding mutable vs immutable Hasher class, I am not strongly opinionated, and would in general lean toward whichever approach is more performant. Probably the mutable approach is more performant in current Pony, since immutable objects need to allocated many more times, and in the absence of Heap-to-stack optimization, that's kind of a big deal.

Crystal addresses this by using a struct instead of a class for this type. In Crystal, a struct is passed "by value" instead of "by reference", so it's something like a Pony tuple, but with callable methods for it. This also explains why it is implemented as an immutable data structure, since that's the only sane way to deal with a "by value" object without major headaches.

Copy link
Member

@jemc jemc Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I mentioned If we wanted to do absolutely zero-allocation (and still zero-virtual-call), we could use a tuple ourselves. Something like:

type HashState is (U32, U32) // or whatever other kind of state we need...
primitive Hashing
  [...]
class MyClass
  let foo: Foo
  let bar: Bar
  let baz: Baz

  fun hash(): U32 =>
    Hashing.finish(hash_with(Hashing.begin()))

  fun hash_with(state': HashState): HashState =>
    var state = state'
    state = this.foo.hash_with(state)
    state = this.bar.hash_with(state)
    state = this.baz.hash_with(state)
    state

This would probably be the most performant solution (if I had to guess), while also being generally nice in terms of developer experience.

Copy link
Contributor Author

@mfelsche mfelsche Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like where this is going, @jemc !

Using SipHash-2-4 needs a state of (U64, U64, U64. U64, USize) in 64 bit systems, and (U32, U32, U32, U32, USize) on 32 bit systems. So we need some dispatching, or we work with USize all the way down. Anyway, it seems very much doable.

@mfelsche
Copy link
Contributor Author

mfelsche commented Jan 7, 2020

During sync it was mentioned, that keeping hashing free from allocation is very important as hashing often happens in hot-paths.

It was discussed that moving Hashable etc to builtin has some use cases outside of the collections package. A suggestion was made to move hashing related logic (Hashable etc. except HashMap) into a separate hashing package and use it from collections.
But as the RFC discussion in the comments already went along, given we use the hash_with approach, we would need the Hashing and HashState primitives/aliases available in builtin, if we want to apply this new approach to the builtin classes like numerics and String.

Will update the RFC accordingly.

@mfelsche
Copy link
Contributor Author

mfelsche commented Sep 7, 2020

I did a POC of this RFC in on the hashing_builtin branch of the pony repo and created a small benchmark to compare the different approaches to hashing (main difference is in hashing numbers and using a pure pony siphash implementation everywhere): https://gist.github.com/mfelsche/da13a70d6e4048bd70800a46f23b7ef4

TL;DR: hashing numbers is around 2 times slower, but that actually doesnt affect e.g. HashMap.update. Hashing Strings seems to be a bit faster.

If those benchmarks seem odd to you, I can happily redo them.
If those timings are not a blocker for anyone, I will go on and finalize this RFC and add the finishing changes.

@jemc
Copy link
Member

jemc commented Sep 8, 2020

Do we have an understanding of what causes the perf difference for hashing numbers?

Is it more about the overhead of keeping the state, or is it more about per-operation overhead (e.g. safe arithmetic vs unsafe arithmetic)?

@mfelsche
Copy link
Contributor Author

@jemc the simple reason is that it now uses siphash as does string and array hashing (it also did before but was using a C implementation in libponyrt). Previously numbers where hashed by this small function:

    var x = u64()
    x = (not x) + (x << 21)
    x = x xor (x >> 24)
    x = (x + (x << 3)) + (x << 8)
    x = x xor (x >> 14)
    x = (x + (x << 2)) + (x << 4)
    x = x xor (x >> 28)
    x = x + (x << 31)
    x

I am unable to comment on the hashing quality of this.
Whereas now it does full siphash 2-4 which consists of 6 rounds of this: https://github.com/ponylang/ponyc/compare/hashing-builtin#diff-3fcbdccf4a6378f88bdb211e3838307bR272-R286 and some more xoring and stuff.

As all hashing should operate on the same HashState, i didnt dare to mess with the given tuples and basically do the same as above on, say the first element of the tuple. I am not experienced enough to judge on what i am doing to the hashing quality of the used siphash in that case.

@mfelsche
Copy link
Contributor Author

Let me argue in favor of this RFC in case it comes to voting and i am absent:

So, while we hit a 2x performance penalty on hashing numbers, we gain a little on strings (i think because of llvm optimizations which might not be there in the compiled runtime), but most importantly we gain a means to do composite hashing of classes that has good quality, is not ad-hoc and error-prone, that is easy to use and teach, and, most importantly to me, it enables us to implement something like value classes or case classes (i can elaborate on that if necessary), with some simple syntax sugar.

@jemc
Copy link
Member

jemc commented Sep 11, 2020

Given that HashMap and other collection types allow specifying a custom hash function, I'm tentatively okay with the idea of moving ahead with this RFC despite the potential performance hit for number-hashing cases - those applications for which this turns out to be a problem could use a more simplified / weaker number hashing function if they found that to be appropriate.

However, if anyone on hand feels qualified to comment, I'd love to figure out if we could somehow get the best of both worlds, and do something like this thing that you said you didn't dare to do, due to lack of knowing how hashing quality would be affected:

As all hashing should operate on the same HashState, i didnt dare to mess with the given tuples and basically do the same as above on, say the first element of the tuple. I am not experienced enough to judge on what i am doing to the hashing quality of the used siphash in that case.

I personally would not be qualified to make a decision like this either, without more time spent researching and understanding the topic.

@SeanTAllen
Copy link
Member

@Theodus @sylvanc can either of you weigh in on this?

@Theodus I'm not sure if you know much about hasing strategies, but I think you might.

@SeanTAllen
Copy link
Member

SeanTAllen commented Sep 15, 2020

Based on discussion during sync, I think that the "how to teach it" needs to be updated to include documenting how to change the hash function for a given object so that if the number hashing is an issue (or hashing in general is an issue).

I'm not sure what that should be. It might be part of stdlib documentation, it might be part of the tutorial. But I think it is important as part of the change given possible performance or quality impacts.

I'd like to see this change made to the RFC and fleshed out before moving this to final comment period,

@jemc
Copy link
Member

jemc commented Sep 15, 2020

Something else to help mitigate the performance issue that we discussed in the sync, which can also be used to help with teaching efforts:

We could add a new standard library package that includes something like SimpleHash and SimpleHashIs (or whatever other name we choose), which will use the old logic for number hashing (and by extension, object identity hashing), which anyone who observes a performance regression could easily switch to using in their code. And our teaching example could show how to use this package if you want to use the "simple hash" logic.

Note that the SimpleHash function should be able to be implemented as a primitive with a type parameter, using iftype on the type argument to switch between blocks of code based on which integer type was used.

@mfelsche
Copy link
Contributor Author

mfelsche commented Nov 3, 2020

I had a look at how rust does it. There the hash method takes a trait Hasher as argument. Rust's default implementation uses SipHash 2-4 https://doc.rust-lang.org/std/hash/struct.SipHasher.html as we do. The key advantage of Siphash is that it can be initiated with a random key and thus safeguard against hash flooding attacks by hashing to different values depending on the initial random key used for the hasher. Rust addresses the problem of slow hashing on numbers and in general short sequences of bytes by making it possible to switch the Hasher to use when creating a HashMap which then passes it into the hash method of the thing to create a hash from. Rust namely advertises FNV to be used for hashing numbers. See here for some performance comparisons: https://cglab.ca/~abeinges/blah/hash-rs/

I would honestly just do roughly the same, and have the Hasher as a type parameter that defaults to SipHash. As discussed above, we should add a SipHasher and a FNVHasher and clearly document both in stdlib and in the tutorial when to use which. I am not sure, if we can have those hashers (that define their own state to drag along through calls of hash or hash_with) be primitives, but I would try to design it that way if possible.

I am sorry for not being able to participate in a tighter and more regular frequency.

Base automatically changed from master to main February 8, 2021 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants