-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up statistics #7390
Speed up statistics #7390
Conversation
7e13c45
to
1dc0632
Compare
distribution/lib/Standard/Base/0.0.0-dev/src/Data/Statistics.enso
Outdated
Show resolved
Hide resolved
distribution/lib/Standard/Base/0.0.0-dev/src/Data/Statistics.enso
Outdated
Show resolved
Hide resolved
distribution/lib/Standard/Base/0.0.0-dev/src/Data/Statistics.enso
Outdated
Show resolved
Hide resolved
distribution/lib/Standard/Base/0.0.0-dev/src/Data/Statistics.enso
Outdated
Show resolved
Hide resolved
distribution/lib/Standard/Base/0.0.0-dev/src/Data/Statistics.enso
Outdated
Show resolved
Hide resolved
distribution/lib/Standard/Base/0.0.0-dev/src/Data/Statistics.enso
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to hear about the performance improvements and in general it is much nicer to have this kind of code in Enso and to hear that it's faster!
I have some reservations to the Accumulator
- the logic is pretty dense so it is not easy to understand it at first glance. After a thorough read it is making sense, since it's not a huge piece of logic - so I guess it can stay that way, although maybe slight refinement of names or perhaps a few comments explaining what the less obvious methods (comparator
, comparator_error
) do.
I think the most confusing part to me was the handling of the edge cases: []
and [Nothing]
- since we have a check early on counter.count == 0
that is actually decoupled from the Accumulator
- but that check is part of the Accumulators logic really. Could we maybe move it into it? I think it would make it easier to understand if this logic was in a single place. But I'm fine with keeping it as is. It would be nice to have it slightly clearer, but that is not the highest priority.
The only thing that I think needs to be done (either before merging or after if we can have a follow up PR), is ensuring there are no regression on a few edge cases that unfortunately were not tested before, as I noted in the comments.
- Handling of non-empty vectors of only missing values -
[Nothing, Nothing]
or[Number.nan]
etc. - Handling of the possibility of a comparator throwing a dataflow error other than
Incomparable_Values
(I think this might have been impossible in the past but that has changed).
Especially the (1) is a priority, because if I read the code correctly the old code was throwing Empty_Error
which makes sense, but now I think it will throw Incomparable_Values
- let's make sure this works correctly. We should check both compute
and running
. This seems to be the old behaviour that we should not lose:
[Nothing, Nothing] . compute Statistic.Minimum . should_fail_with Empty_Error
[Nothing, Nothing] . running Statistic.Minimum [Nothing, Nothing]
As for (2), it would be great to test it:
type My_Type
Value x
type My_Comparator
compare x y =
_ = [x, y]
Error.throw (Illegal_State.Error "TEST")
hash x = x
Comparable.from (_:My_Type) = My_Comparator
...
x = My_Type.Value 1
y = My_Type.Value 2
v = [x, y]
v.compute Statistic.Count . should_equal 2
# This is IMO the preferred behaviour:
v.compute Statistic.Maximum . should_fail_with Illegal_State
# This is the OLD behaviour:
v.compute Statistic.Maximum . should_fail_with Incomparable_Values
I guess it's ok if we are able to only reproduce the old behaviour, but while we are at it, IMO it would be ideal to ensure that examples like this one propagate the error that was actually thrown in the comparator and not convert it to Incomparable_Values
, essentially hiding the underlying issue and making this harder to debug.
I appreciate that custom comparators are rare and them throwing errors is even rarer occurrence, so if we need to proceed fast, I imagine we could do this as a separate ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving in case you need to proceed, just please add these few tests for (1) if possible.
The rest would be nice, but I think we can live with it for now (although I think it would be good to at least get a ticket for (2)).
The empty values check has been moved into the Accumulator and the error handling simplified as the accumulator will now fail on incomparable values. Hopefully this makes it clearer. |
…computed if needed.
14f8601
to
18195e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good
I finally got a bit of time to dig into this further. I believe we want Max (stats) be as fast as Max (by reduce) right?
If so, then we are more than ten times off. I've reported #7525 as a small result of my findings. Main FindingsOverall it seems that the graph for Max (stats) is a way bigger than the Max (by reduce). My eyes got attracted by compilation of
At the end I decided to use Visual VM and its polyglot profiler. Following snapshot shows the problem is caused more by various abstractions in the Clearly this amount of calls can only hardly be compiled into something faster than |
The |
Could you show me the code of the two approaches you were comparing? Were you using I think ideally we'd want If you want to look into this further, maybe it could be worth to try comparing:
I imagine |
Both tests are in Operations.enso and can be executed as:
or just
|
Thanks, so it looks like its the Enso implementation. Then I'm really curious what is the perf difference between the two vector zips I posted above - |
Pull Request Description
parse_to_columns
to take aRegex
object.pattern
to theRegex
object.column_names
to theRow
object.Also tried using a Ref approach for stats but as slower (7e13c45).
Checklist
Please ensure that the following checklist has been satisfied before submitting the PR:
Scala,
Java,
and
Rust
style guides. In case you are using a language not listed above, follow the Rust style guide.
./run ide build
.