Optimize #add_referenced #238

pkmiec · 2024-02-02T17:08:20Z

Summary

The #add_referenced kept track of existing object with a hash when the keys were the objects. This seemed to have been ok with Ruby 2.7, but became significantly slower in Ruby 3.2 (and possibly earlier).

Profiling showed that many of those objects are instances of Hash and Ruby uses #eql? method to compare Hash keys. It is not clear what actually changed in Ruby, but we can work around the issue by using #hash so that the key is an integer. We just need to recompute that integer if the object changes.

Performance

The benchmark was done with the following script,

require 'benchmark'
require 'combine_pdf'

puts "Ruby: #{RUBY_VERSION}"
puts "CombinePDF: #{CombinePDF::VERSION}"

files = []
68.times { |i| files << "/tmp/pdfs/#{i}.pdf" }
files = files * 10 # to exacerbate the effect

result_pdf = CombinePDF.new
files.each { |file| result_pdf << CombinePDF.parse(IO.read(file)) }
puts(Benchmark.measure do
    result_pdf.save('/tmp/combined.pdf')
end)

FYI ... we end up with 32053 pdf objects in @objects array.

Before

Ruby: 2.7.7
CombinePDF: 1.0.26
2.598427 0.011980 2.610407 ( 2.617881)

Ruby: 3.2.2
CombinePDF: 1.0.26
15.067833 0.026986 15.094819 ( 15.139298)

After

Ruby: 2.7.7
CombinePDF: 1.0.26 (with this PR)
2.768545 0.006937 2.775482 ( 2.786386)

Ruby: 3.2.2
CombinePDF: 1.0.26 (with this PR)
1.997242 0.016295 2.013537 ( 2.021782)

The #add_referenced kept track of existing object with a hash when the keys were the objects. This seemed to have been ok with Ruby 2.7, but became significantly slower in Ruby 3.2. Profiling showed that many of those objects are instances of Hash and Ruby uses #eql? method to compare Hash keys. It is not clear what acutally changed in Ruby, but we can work around the issue by using #hash so that the key is an integer. We just need to recompute that integer if the object changes. Co-authored-by: Jeremy Kirchhoff <Jeremy.Kirchhoff@appfolio.com>

BenMorganMY · 2024-07-11T18:16:09Z

@pkmiec our use of combine_pdf still experiences the slowdown on ruby 3.1. I'm wondering if you could explain the use .hash so that I can find other areas for improvement.

boazsegev · 2024-07-13T22:44:54Z

Using .hash instead of .eql? adds a new risk of Hash collisions – rare but definitely possible when squashing variable multi-byte objects into 64 bits (or, I believe that in Ruby, it would actually be limited to 62 bits).

This risk could silently corrupt PDF data, which could be a significant error and make CombinePDF unsuitable for some applications.

Is there another viable approach? Or perhaps it would be better to drop duplication detection instead? How would that affect performance (memory usage will be higher, but other than that...)?

pkmiec · 2024-07-15T20:56:43Z

Hi!

@BenMorganMY Have you already profiled your code and identified that it is still slow in this #add_reference method? I can imagine that given a different set of PDFs you may be hitting a slowness in a different part of the code. But to answer your question, the point of the method is to try to de-dup references to objects. I imagine that #eql needs to "walk" the object to compute whether it is the same as another object. Since the objects are Hashes then this walking becomes a recursive operation and thus more expensive. Perhaps the result of this "walking" was cached in earlier versions of Ruby and now it is not. Note, I could not find anything in Ruby's changelog, so I do not know for sure. Using #hash was a way to force that computation to produce a number and then compare numbers instead of Hashes.

@boazsegev That's a good question. I do not know the PDF spec well enough to say. When I was looking at it, I wanted to avoid changing the behavior of the method in order to avoid introducing some incompatibility with subsequent code or PDF readers.

amomchilov · 2024-11-02T19:56:44Z

I found a related perf improvement in this method, see #241

boazsegev · 2024-11-10T03:03:53Z

I assume this was fixed in #241 and should be closed?

amomchilov · 2024-11-10T14:32:10Z

@boazsegev Not quite. I happened to touch the same line, but a different statement.

I changed the definition and usage of the resolved hash, as in resolved[obj.object_id] = obj, but this PR is changing the existing Hash as in existing[obj.hash] = obj.

As an outsider, I did find this several;statements;per;line style a bit hard to grok, and I think this suggests I'm not the only one 😅

boazsegev · 2024-11-10T16:20:33Z

@amomchilov ,

I love how specific and targeted PR #241 had been. However, I assumed it was meant to deal with the hash issue as well. My apologies for that.

However, to restate my previous comment – the existing[obj.hash] approach appears incomplete (unless I'm missing something).

SipHash, which is the underlying algorithm for obj.hash, isn't cryptographic and is more likely to produce hash collision than a cryptographic hash. Moreover, Ruby uses 62 bits from the 64 bit SipHash result, slightly increasing the risk.

Not that I believe that hash collisions are likely, but this could still happen and this means that obj.hash requires additional validation. We need to verify that a hash collision doesn't lead to an error (i.e., validating a correct result by manually checking for equality up to a certain depth).

@pkmiec ,

I think we need to rebase this PR if we are going to give it a chance. I would also like to see how we protect against possible hash collisions before I merge.

Thanks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize #add_referenced #238

Optimize #add_referenced #238

pkmiec commented Feb 2, 2024

BenMorganMY commented Jul 11, 2024

boazsegev commented Jul 13, 2024 •

edited

Loading

pkmiec commented Jul 15, 2024

amomchilov commented Nov 2, 2024

boazsegev commented Nov 10, 2024

amomchilov commented Nov 10, 2024

boazsegev commented Nov 10, 2024

Optimize #add_referenced #238

Are you sure you want to change the base?

Optimize #add_referenced #238

Conversation

pkmiec commented Feb 2, 2024

Summary

Performance

Before

After

BenMorganMY commented Jul 11, 2024

boazsegev commented Jul 13, 2024 • edited Loading

pkmiec commented Jul 15, 2024

amomchilov commented Nov 2, 2024

boazsegev commented Nov 10, 2024

amomchilov commented Nov 10, 2024

boazsegev commented Nov 10, 2024

boazsegev commented Jul 13, 2024 •

edited

Loading