-
Notifications
You must be signed in to change notification settings - Fork 15
get_tld(), get_components(), and other #17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…irements. Input domain string is a FQDN without trailing dot. Empty labels are not allowed. 1. get_tld() is split into get_tld() and get_tld_unsafe(). The latter implements the entire logic and the former performs lower() conversion and None checks. 2. get_tld_unsafe() rewritten to create minimal number of temporary objects 3. get_sld() split info get_sld() and get_sld_unsafe(). The latter is based on get_tld_unsafe(), same principles. 4. get_components() and get_components_unsafe() method are introduced. These methods allow split the domain string into 3-tuple: (prefix, SLL, TLD) where SLL is second-level label and TLD is TLD or ETLD
…nts. Input domain string is a FQDN without trailing dot. Empty labels are not allowed. Explatatory comments prepend modification when necessary.
…nts. Input domain string is a FQDN without trailing dot. Empty labels are not allowed. Explatatory comments prepend modification when necessary.
…l (hopefully all) possible configuration of public-suffix list. Full consistent validation of results of get_tld, get_sld and get_components methods provided for each configuration. As get_sld and get_components method are based on get_tld, get_tld test cases are more generic (include more complex variants). The former ones check only the difference in the configuration from the previous variant.
…es. Eventually it should cover the full set of test cases and merged into README.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vadym-t Thank you ++
That's a big PR and I reckon this would change the API and behavior.
I would rather avoid doing so as it would break existing code. Would there be a way to achieve something similar without changing behavior? (e.g. adding new functions or optional args?)
Also you have introduced Python 3-only idioms that make the test fails. For now, I would like to keep things working on Python 2.7 too. Do you think this could be reworked?
Hi. I've worked with Vadym extensively over the last week+ on this logic. He is on our team and identified logical bugs and performance issues. I haven't yet code reviewed the latest version, but tested earlier version extensively w/o issue. To your point, though, didn't test python2.7 at all. The primary changes are two-fold:
Vadym asked what to do -- and I suggested fix the tests to what they really should be and do a PR. In addition, he added a new suite which starts from the published algorithm and addresses all the combinations. So overall, I think it should be merged and will be an upgrade for everyone. the functionality breaks will occur at the edge for very large-scale DNS type things where you'd see bad behavior and root queries. We plan to adopt in Infoblox Data Science. will re-review the code and documentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my main comments are on:
- documentation suggestions to add a few things,
- in the tests, where i'm expecting a strip to occur in the function get_sld(), it seems the tests do not. also, i don't see any tests for the unsafe versions --that's where i'd expect to see the none behavior.
am i missing something?
where the backward compatibility breaks are
- edge cases that are rare, e.g., having root return None instead of empty string -- don't have strong opinion on this.
- cases where the tests didn't match the published algorithm
This file describes exact behavior of methods for different edge cases and | ||
explains general logic. This description covers the behavior of get_tld, | ||
get_tld_unsafe, get_sld, get_sld_unsafe, split_domain, split_domain_unsafe | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vadym-t suggest to add here a why sentence:
Unsafe versions of the methods will significantly save resources on large-scale applications of the library where the data has already been converted to lowercase and missing data has a None value. This can be done in Spark/Dask, for example, and result in a significant reduction in computational resources. For adhoc usage, the original functions are sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
as leading dots are not processed. | ||
split_domain method is based on get_sld method - it returns everything in | ||
front of get_sld() as a prefix. | ||
Specifically to example above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vadym-t you might also add here a non-edge case example. the split_domain() method offers a new capability to the library-- one that folks might get from other libraries -- but your only example is the edge. suggest something like:
split_domain allows you to recover the host, or prefix, of an SLD, for use in aggregation or analysis based on the labels. e.g., split_domain('www.googl.com')
get_*_unsafe() does not perform if the input string is None and does not | ||
transforms it to the lower case. | ||
|
||
2. The listed above methods works only with non-canonical FQDN strings - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
listed above means all or just the unsafe methods?
Edge cases and expected behavior | ||
The behavior of the library can be illustrated best on the small examples: | ||
(boolean arguments are omitted if does not affect behavior ) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested additional:
The library allows you to create a public suffix list from any file-like object, including a list. In the examples below, we construct the PSL with different lists to demonstrate the functionality of the library under different conditions. In particular, the examples show an empty list, list of two elements, and the use of negation in a list. Combinations of the method parameters are given with notes about the result. These results follow the logic of the Mozilla library as published on the psl.org website.
@@ -78,7 +81,8 @@ def test_get_sld_from_list_with_exception_rule(self): | |||
|
|||
def test_get_sld_from_list_with_fqdn(self): | |||
psl = publicsuffix.PublicSuffixList(['com']) | |||
assert 'example.com' == psl.get_sld('example.com.') | |||
# 'example.com.' -> <example>.<com>.<empty> -> None, empty labels are not allowed | |||
assert None == psl.get_sld('example.com.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't follow this comment. is this failing because of the trailing dot? I read the code as the get_sld() version will remove the trailing dot ... so why doesn't this go to ..
example.com -> example, com -> example.com
??
@@ -284,13 +297,15 @@ def test_get_sld_backward_compatibility_sld_for_empty_string(self): | |||
|
|||
def test_get_sld_backward_compatibility_sld_for_fqdn(self): | |||
psl = publicsuffix.PublicSuffixList() | |||
assert 'foo.com' == psl.get_sld('www.foo.com.') | |||
# 'www.foo.com.' -> <www>.<foo>.<com>.<empty> -> None, empty labels are not allowed | |||
assert None == psl.get_sld('www.foo.com.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question abut strip
i don't see any tests for the unsafe methods. shouldn't there be a few? this is what the big change is.
assert None == psl.get_sld('.', strict=True) | ||
assert '' == psl.get_sld('.', wildcard=False) | ||
assert None == psl.get_sld('.', wildcard=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not backward compatible but logically more accurate
@@ -583,40 +583,49 @@ def test_get_sld_Mixed_case3(self): | |||
assert 'example.com' == publicsuffix.get_sld('WwW.example.COM') | |||
|
|||
def test_get_sld_Leading_dot1(self): | |||
assert 'com' == publicsuffix.get_sld('.com') | |||
# '.com' -> <empty>.<com> -> None, empty labels are not allowed | |||
assert None == publicsuffix.get_sld('.com') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm going to continue having all the strip questions - so i won't repeat them.
|
||
def test_get_sld_Leading_dot2(self): | ||
assert 'example' == publicsuffix.get_sld('.example') | ||
# '.example' -> <empty>.<example> -> None, empty labels are not allowed | ||
assert None == publicsuffix.get_sld('.example') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the difference here is that example is not in the list. so in strict mode should return None and in non-strict mode would return example --- with the strip
correct?
|
||
def test_get_sld_Unlisted_sld1(self): | ||
assert 'example' == publicsuffix.get_sld('example') | ||
|
||
def test_get_sld_Unlisted_sld2(self): | ||
assert 'example' == publicsuffix.get_sld('example.example') | ||
# non-strict: TLD:example, SLD:example.example | ||
assert 'example.example' == publicsuffix.get_sld('example.example') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a change where the original logic was flawed -- didn't follow the algorithm complete-- so the test itself was incorrect
I'm all for improving the execution speed. However, we are cautious about changing existing behavior. The reason the specs of this library change so often is that the specs are so vague. It's hard to determine if this specification change is reasonable (although I'm not actively opposed to it). |
agree, and my brain hadn't noticed that. @pombredanne I would think the priorities would be in rough order:
agreed? given that, wondering if the merge can be split to get these. if the performance improvements suggested by @vadym-t only work with py3 idioms and don't use the trie, what about using an extra flag parameter to trigger usage of the new functions? i can take a closer look on the weekend and rerun the experiments i did with an earlier version. |
@vadym-t ping? |
Reworked the code of get_tld() to reduce the number of temporary objects (reduce the load on GC in the case of bulk operations). The requirement for the input domain name is updated - it allows to have consistent behavior of all methods. The updated requirement: input domain string is a FQDN without trailing dot. Empty labels are not allowed.