Refactor to a regex-based algorithm #4

Merged 1 commit on Nov 25, 2023

skryukov (Owner)

The primary goal of this PR is to speed up URI::IDNA::UTS46 so that it is on par with the pure IDNA algorithm from Addressable. Achieving this could allow URI::IDNA::UTS46 to be used as the default IDNA algorithm in Addressable without significant performance trade-offs.

To accomplish this, this PR changes the internal implementation: it adds checks for ascii_only?, as suggested in ruby/ruby#8966, and replaces loops over characters/code points with regular expressions, among other minor enhancements. Additionally, this PR includes more unit tests for the validation logic, yay 🎉
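
As a rough illustration of the two ideas above (an ASCII fast path plus a regex instead of a per-code-point loop), here is a minimal sketch. The `DISALLOWED` pattern and both method names are hypothetical and are not taken from this PR; the real UTS46 tables are much larger.

```ruby
# frozen_string_literal: true

# Hypothetical, tiny set of disallowed code points, for illustration only.
DISALLOWED = /[\u0080-\u009F\u200E\u200F]/

# Loop-based check: inspects every code point in Ruby code.
def disallowed_loop?(label)
  label.each_codepoint.any? do |cp|
    (0x80..0x9F).cover?(cp) || cp == 0x200E || cp == 0x200F
  end
end

# Regex-based check with an ASCII fast path.
# This toy set contains no ASCII characters, so an ASCII-only label
# can be accepted without scanning it against the pattern at all.
def disallowed_regex?(label)
  return false if label.ascii_only?

  label.match?(DISALLOWED)
end

["google.com", "fiᆵリ宠퐱卄", "a\u200Eb"].each do |label|
  puts "#{label.inspect}: loop=#{disallowed_loop?(label)} regex=#{disallowed_regex?(label)}"
end
```

Both methods should agree on every input; the regex version simply moves the scan into the regex engine and skips it entirely for ASCII-only labels.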

To assess the performance improvements, I used a modified version of the addressable benchmark:

#!/usr/bin/env ruby
# frozen_string_literal: true

require "benchmark"
require "uri/idna"
require "addressable"
require "addressable/idna/pure"

simple = "google.com"
unicode = "fiᆵリ宠퐱卄.com"
punycode = "xn--fi-w1k207vk59a3qk9w9r.com"
N = 100_000

Benchmark.bmbm do |x|
  x.report("pure") { N.times { Addressable::IDNA::Pure.to_unicode(Addressable::IDNA::Pure.to_ascii(simple)) } }
  x.report("pure unicode") { N.times { Addressable::IDNA::Pure.to_unicode(Addressable::IDNA::Pure.to_ascii(unicode)) } }
  x.report("pure punycode") { N.times { Addressable::IDNA::Pure.to_unicode(Addressable::IDNA::Pure.to_ascii(punycode)) } }

  x.report("uri/idna") { N.times { URI::IDNA.to_unicode(URI::IDNA.to_ascii(simple)) } }
  x.report("uri/idna unicode") { N.times { URI::IDNA.to_unicode(URI::IDNA.to_ascii(unicode)) } }
  x.report("uri/idna punycode") { N.times { URI::IDNA.to_unicode(URI::IDNA.to_ascii(punycode)) } }
end

Results (MacBook Pro 13", M1, 2020, 16GB, ruby v3.2.2):

Rehearsal -----------------------------------------------------
pure                0.447617   0.003119   0.450736 (  0.453697)
pure unicode        4.123272   0.006671   4.129943 (  4.133575)
pure punycode       1.982393   0.002661   1.985054 (  1.987714)
uri/idna            0.773894   0.000712   0.774606 (  0.775704)
uri/idna unicode    3.867693   0.001920   3.869613 (  3.874109)
uri/idna punycode   5.087081   0.002383   5.089464 (  5.095687)
------------------------------------------- total: 16.299416sec

                        user     system      total        real
pure                0.443630   0.000954   0.444584 (  0.445314)
pure unicode        4.185360   0.007045   4.192405 (  4.198606)
pure punycode       2.000301   0.003139   2.003440 (  2.006636)
uri/idna            0.779944   0.000704   0.780648 (  0.781443)
uri/idna unicode    3.888171   0.002455   3.890626 (  3.902315)
uri/idna punycode   5.334571   0.026223   5.360794 (  5.613100)
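
To make the table easier to compare, the reported wall-clock ("real") times can be turned into ratios. This is only a small sketch over the numbers above; the constant names are illustrative.

```ruby
# frozen_string_literal: true

# "real" (wall-clock) seconds copied from the second, non-rehearsal table.
REAL_SECONDS = {
  "simple"   => { pure: 0.445314, uri_idna: 0.781443 },
  "unicode"  => { pure: 4.198606, uri_idna: 3.902315 },
  "punycode" => { pure: 2.006636, uri_idna: 5.613100 }
}.freeze

# A ratio above 1 means uri/idna took longer than Addressable's pure implementation.
RATIOS = REAL_SECONDS.transform_values do |t|
  (t[:uri_idna] / t[:pure]).round(2)
end

RATIOS.each do |input, ratio|
  puts format("%-9s uri/idna / pure = %.2fx", input, ratio)
end
```

On these figures, uri/idna takes roughly 1.75x as long for the simple ASCII domain, about 0.93x for the Unicode domain, and about 2.8x for the Punycode domain.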

skryukov merged commit 3de8abe into main on Nov 25, 2023
skryukov deleted the regex-lookups branch on November 25, 2023 17:01