Optimize #add_referenced #238
The #add_referenced method kept track of existing objects with a hash where the keys were the objects themselves. This seemed to have been ok with Ruby 2.7, but became significantly slower in Ruby 3.2. Profiling showed that many of those objects are instances of Hash and Ruby uses the #eql? method to compare Hash keys. It is not clear what actually changed in Ruby, but we can work around the issue by using #hash so that the key is an integer. We just need to recompute that integer if the object changes.
Co-authored-by: Jeremy Kirchhoff <Jeremy.Kirchhoff@appfolio.com>
@pkmiec our use of combine_pdf still experiences the slowdown on Ruby 3.1. I'm wondering if you could explain the use
Using This risk could silently corrupt PDF data, which could be a significant error and make CombinePDF unsuitable for some applications. Is there another viable approach? Or perhaps it would be better to drop duplication detection instead? How would that affect performance (memory usage will be higher, but other than that...)?
Hi! @BenMorganMY Have you already profiled your code and identified that it is still slow in this
@boazsegev That's a good question. I do not know the PDF spec well enough to say. When I was looking at it, I wanted to avoid changing the behavior of the method in order to avoid introducing some incompatibility with subsequent code or PDF readers.
I found a related perf improvement in this method, see #241
I assume this was fixed in #241 and should be closed?
@boazsegev Not quite. I happened to touch the same line, but a different statement. I changed the definition and usage of the As an outsider, I did find this several;statements;per;line style a bit hard to grok, and I think this suggests I'm not the only one 😅
I love how specific and targeted PR #241 had been. However, I assumed it was meant to deal with the However, to restate my previous comment – the SipHash, which is the underlying algorithm for Not that I believe that hash collisions are likely, but this could still happen and this means that
@pkmiec, I think we need to rebase this PR if we are going to give it a chance. I would also like to see how we protect against possible hash collisions before I merge. Thanks.
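One possible way to address the collision concern raised above is to keep the integer from #hash as the lookup key, but store a small bucket of candidates per key and confirm a match with #eql? before treating two objects as duplicates. The following is only a sketch under that assumption: `HashKeyedIndex`, `find_or_store`, and `Colliding` are hypothetical names for illustration and are not part of CombinePDF. Whether this keeps the speedup reported in this PR would need profiling, since #eql? still runs whenever two hash values match.

```ruby
# Hypothetical sketch, not CombinePDF code: dedupe by integer hash,
# but confirm every hit with eql? so a collision cannot merge
# two different objects.
class HashKeyedIndex
  def initialize
    # Integer hash value => list of stored objects sharing that value
    @buckets = Hash.new { |h, k| h[k] = [] }
  end

  # Returns the previously stored object that is eql? to obj,
  # or stores obj and returns nil when no equal object is known.
  # NOTE: if obj is mutated after being stored, its bucket key is stale
  # and the index would need to be rebuilt for that object.
  def find_or_store(obj)
    bucket = @buckets[obj.hash]
    existing = bucket.find { |candidate| candidate.eql?(obj) }
    return existing if existing

    bucket << obj
    nil
  end
end

# A struct whose hash always collides, to exercise the collision path.
Colliding = Struct.new(:value) do
  def hash
    42
  end
end

index = HashKeyedIndex.new
first = Colliding.new(1)
second = Colliding.new(2)               # same hash as first, different content

p index.find_or_store(first)            # nil: first is new
p index.find_or_store(second)           # nil: collision, but not eql?, so kept
p index.find_or_store(Colliding.new(1)).equal?(first)  # true: real duplicate
```

The trade-off is that true duplicates still pay for one #eql? comparison each, while distinct objects with distinct hash values never do.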
Summary
The #add_referenced method kept track of existing objects with a hash where the keys were the objects themselves. This seemed to have been ok with Ruby 2.7, but became significantly slower in Ruby 3.2 (and possibly earlier).
Profiling showed that many of those objects are instances of Hash and Ruby uses the #eql? method to compare Hash keys. It is not clear what actually changed in Ruby, but we can work around the issue by using #hash so that the key is an integer. We just need to recompute that integer if the object changes.
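The idea described above can be illustrated with a minimal sketch. It is not the code from this PR: `dedupe_by_hash` and the sample objects are hypothetical, and the sketch assumes that two objects with the same #hash value are the same object, which ignores possible collisions.

```ruby
# Hypothetical sketch, not CombinePDF's implementation.
# Track already-seen objects by obj.hash (an Integer) instead of
# using the object itself as the Hash key.
def dedupe_by_hash(objects)
  seen = {}    # Integer hash value => first object seen with that value
  unique = []
  objects.each do |obj|
    key = obj.hash            # content-based for Hash and Array
    next if seen.key?(key)    # ASSUMPTION: equal hash value means equal object

    seen[key] = obj
    unique << obj
  end
  unique
end

a = { Type: :Font, Name: "F1" }
b = { Type: :Font, Name: "F1" }   # equal content, different instance
c = { Type: :Page }

puts dedupe_by_hash([a, b, c]).length   # 2

# Hash#hash depends on content, so a stored integer goes stale
# once the object is mutated and has to be recomputed.
stale_key = c.hash
c[:Rotate] = 90
current_key = c.hash   # recomputed after the change
```

Because the lookup table is keyed by integers, key comparison no longer needs to call #eql? on nested Hash objects, which is the cost the profiling pointed at.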
Performance
The benchmark was done with the following script,
FYI ... we end up with 32053 PDF objects in the @objects array.
Before
Ruby: 2.7.7
CombinePDF: 1.0.26
2.598427 0.011980 2.610407 ( 2.617881)
Ruby: 3.2.2
CombinePDF: 1.0.26
15.067833 0.026986 15.094819 ( 15.139298)
After
Ruby: 2.7.7
CombinePDF: 1.0.26 (with this PR)
2.768545 0.006937 2.775482 ( 2.786386)
Ruby: 3.2.2
CombinePDF: 1.0.26 (with this PR)
1.997242 0.016295 2.013537 ( 2.021782)
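The result lines above appear to follow the default output format of Ruby's Benchmark module: user CPU time, system CPU time, their total, and wall-clock time in parentheses, all in seconds. The sketch below shows how such a line is produced. The workload is synthetic and is an assumption for illustration only; it is not the benchmark script used for this PR.

```ruby
require "benchmark"

# Synthetic workload (assumption for illustration; not the PR's script):
# 1,000 small hashes, of which only 100 are distinct by content.
objects = Array.new(1_000) { |i| { Type: :Font, Index: i % 100 } }

seen = {}
tms = Benchmark.measure do
  objects.each { |obj| seen[obj.hash] ||= obj }
end

# Prints: user  system  total  (real), in seconds.
puts tms
```

Read this way, the wall-clock time on Ruby 3.2.2 in the results above drops from about 15.1 seconds to about 2.0 seconds with this PR, while Ruby 2.7.7 stays at roughly 2.6 to 2.8 seconds.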