get_tld(), get_components(), and other #17

vadym-t · 2020-04-27T14:31:48Z

Reworked the code of get_tld() to reduce the number of temporary objects (reduce the load on GC in the case of bulk operations). The requirement for the input domain name is updated - it allows to have consistent behavior of all methods. The updated requirement: input domain string is a FQDN without trailing dot. Empty labels are not allowed.

get_tld() is split into get_tld() and get_tld_unsafe(). The latter implements the entire logic and the former performs lower() conversion and None checks.
get_tld_unsafe() rewritten to create minimal number of temporary objects
get_sld() split info get_sld() and get_sld_unsafe(). The latter is based on get_tld_unsafe(), same principles.
get_components() and get_components_unsafe() method introduced. These methods allow split the domain string into 3-tuple: (prefix, SLL, TLD) where SLL is second-level label and TLD is TLD or ETLD

…irements. Input domain string is a FQDN without trailing dot. Empty labels are not allowed. 1. get_tld() is split into get_tld() and get_tld_unsafe(). The latter implements the entire logic and the former performs lower() conversion and None checks. 2. get_tld_unsafe() rewritten to create minimal number of temporary objects 3. get_sld() split info get_sld() and get_sld_unsafe(). The latter is based on get_tld_unsafe(), same principles. 4. get_components() and get_components_unsafe() method are introduced. These methods allow split the domain string into 3-tuple: (prefix, SLL, TLD) where SLL is second-level label and TLD is TLD or ETLD

…nts. Input domain string is a FQDN without trailing dot. Empty labels are not allowed. Explatatory comments prepend modification when necessary.

…l (hopefully all) possible configuration of public-suffix list. Full consistent validation of results of get_tld, get_sld and get_components methods provided for each configuration. As get_sld and get_components method are based on get_tld, get_tld test cases are more generic (include more complex variants). The former ones check only the difference in the configuration from the previous variant.

…es. Eventually it should cover the full set of test cases and merged into README.

pombredanne

@vadym-t Thank you ++
That's a big PR and I reckon this would change the API and behavior.
I would rather avoid doing so as it would break existing code. Would there be a way to achieve something similar without changing behavior? (e.g. adding new functions or optional args?)

Also you have introduced Python 3-only idioms that make the test fails. For now, I would like to keep things working on Python 2.7 too. Do you think this could be reworked?

pombredanne · 2020-04-28T08:47:27Z

@hiratara @KnitCode ping... your review is much welcomed.

KnitCode · 2020-04-28T11:01:10Z

@vadym-t Thank you ++
That's a big PR and I reckon this would change the API and behavior.
I would rather avoid doing so as it would break existing code. Would there be a way to achieve something similar without changing behavior? (e.g. adding new functions or optional args?)

Also you have introduced Python 3-only idioms that make the test fails. For now, I would like to keep things working on Python 2.7 too. Do you think this could be reworked?

Hi. I've worked with Vadym extensively over the last week+ on this logic. He is on our team and identified logical bugs and performance issues. I haven't yet code reviewed the latest version, but tested earlier version extensively w/o issue. To your point, though, didn't test python2.7 at all.

The primary changes are two-fold:

changes to the computation that reduces the creation of temporary objects. this has the same result as the previous code. In my testing, it was slightly faster than the old version on datasets that would fit in memory for get_tld() and get_sld(). for large distributed sets, the timing was the same but less objects created. this helps for Dask/Spark usage of the library.
test changes. vadym dug into the algorithm on the mozilla site and compared to the tests (theirs and ours). there were tests in their suite that are inconsistent and appear to be based on an old PSL list, but passed for us because of default logic in our existing code, which itself was faulty on edge cases. a number of the issues occur at the edge that most users, e.g, web page usage, wouldn't see but we see in very large DNS systems.

Vadym asked what to do -- and I suggested fix the tests to what they really should be and do a PR. In addition, he added a new suite which starts from the published algorithm and addresses all the combinations.

So overall, I think it should be merged and will be an upgrade for everyone. the functionality breaks will occur at the edge for very large-scale DNS type things where you'd see bad behavior and root queries. We plan to adopt in Infoblox Data Science.

will re-review the code and documentation

KnitCode

my main comments are on:

documentation suggestions to add a few things,
in the tests, where i'm expecting a strip to occur in the function get_sld(), it seems the tests do not. also, i don't see any tests for the unsafe versions --that's where i'd expect to see the none behavior.

am i missing something?

where the backward compatibility breaks are

edge cases that are rare, e.g., having root return None instead of empty string -- don't have strong opinion on this.
cases where the tests didn't match the published algorithm

KnitCode · 2020-04-28T11:05:11Z

UseCases.md

+This file describes exact behavior of methods for different edge cases and 
+explains general logic. This description covers the behavior of get_tld, 
+get_tld_unsafe, get_sld, get_sld_unsafe, split_domain, split_domain_unsafe
+


@vadym-t suggest to add here a why sentence:
Unsafe versions of the methods will significantly save resources on large-scale applications of the library where the data has already been converted to lowercase and missing data has a None value. This can be done in Spark/Dask, for example, and result in a significant reduction in computational resources. For adhoc usage, the original functions are sufficient.

KnitCode · 2020-04-28T11:10:07Z

UseCases.md

+as leading dots are not processed.
+split_domain method is based on get_sld method - it returns everything in 
+front of get_sld() as a prefix.
+Specifically to example above 


@vadym-t you might also add here a non-edge case example. the split_domain() method offers a new capability to the library-- one that folks might get from other libraries -- but your only example is the edge. suggest something like:
split_domain allows you to recover the host, or prefix, of an SLD, for use in aggregation or analysis based on the labels. e.g., split_domain('www.googl.com')

KnitCode · 2020-04-28T11:10:26Z

UseCases.md

+get_*_unsafe() does not perform if the input string is None and does not 
+transforms it to the lower case.
+
+2. The listed above methods works only with non-canonical FQDN strings - 


listed above means all or just the unsafe methods?

KnitCode · 2020-04-28T11:14:38Z

UseCases.md

+Edge cases and expected behavior
+The behavior of the library can be illustrated best on the small examples:
+(boolean arguments are omitted if does not affect behavior )
+


Suggested additional:
The library allows you to create a public suffix list from any file-like object, including a list. In the examples below, we construct the PSL with different lists to demonstrate the functionality of the library under different conditions. In particular, the examples show an empty list, list of two elements, and the use of negation in a list. Combinations of the method parameters are given with notes about the result. These results follow the logic of the Mozilla library as published on the psl.org website.

KnitCode · 2020-04-28T11:26:25Z

tests.py

@@ -78,7 +81,8 @@ def test_get_sld_from_list_with_exception_rule(self):

    def test_get_sld_from_list_with_fqdn(self):
        psl = publicsuffix.PublicSuffixList(['com'])
-        assert 'example.com' == psl.get_sld('example.com.')
+        # 'example.com.' -> <example>.<com>.<empty> -> None, empty labels are not allowed
+        assert None == psl.get_sld('example.com.')


didn't follow this comment. is this failing because of the trailing dot? I read the code as the get_sld() version will remove the trailing dot ... so why doesn't this go to ..
example.com -> example, com -> example.com
??

KnitCode · 2020-04-28T11:45:41Z

tests.py

@@ -284,13 +297,15 @@ def test_get_sld_backward_compatibility_sld_for_empty_string(self):

    def test_get_sld_backward_compatibility_sld_for_fqdn(self):
        psl = publicsuffix.PublicSuffixList()
-        assert 'foo.com' == psl.get_sld('www.foo.com.')
+        # 'www.foo.com.' -> <www>.<foo>.<com>.<empty> -> None, empty labels are not allowed
+        assert None == psl.get_sld('www.foo.com.')


same question abut strip

i don't see any tests for the unsafe methods. shouldn't there be a few? this is what the big change is.

KnitCode · 2020-04-28T11:46:14Z

tests.py

        assert None == psl.get_sld('.', strict=True)
-        assert '' == psl.get_sld('.', wildcard=False)
+        assert None == psl.get_sld('.', wildcard=False)


not backward compatible but logically more accurate

KnitCode · 2020-04-28T11:46:49Z

tests_mozilla.py

@@ -583,40 +583,49 @@ def test_get_sld_Mixed_case3(self):
        assert 'example.com' == publicsuffix.get_sld('WwW.example.COM')

    def test_get_sld_Leading_dot1(self):
-        assert 'com' == publicsuffix.get_sld('.com')
+        # '.com' -> <empty>.<com> -> None, empty labels are not allowed
+        assert None == publicsuffix.get_sld('.com')


i'm going to continue having all the strip questions - so i won't repeat them.

KnitCode · 2020-04-28T11:48:09Z

tests_mozilla.py


    def test_get_sld_Leading_dot2(self):
-        assert 'example' == publicsuffix.get_sld('.example')
+        # '.example' -> <empty>.<example> -> None, empty labels are not allowed
+        assert None == publicsuffix.get_sld('.example')


the difference here is that example is not in the list. so in strict mode should return None and in non-strict mode would return example --- with the strip

correct?

KnitCode · 2020-04-28T11:49:16Z

tests_mozilla.py


    def test_get_sld_Unlisted_sld1(self):
        assert 'example' == publicsuffix.get_sld('example')

    def test_get_sld_Unlisted_sld2(self):
-        assert 'example' == publicsuffix.get_sld('example.example')
+        # non-strict: TLD:example, SLD:example.example
+        assert 'example.example' == publicsuffix.get_sld('example.example')


this is a change where the original logic was flawed -- didn't follow the algorithm complete-- so the test itself was incorrect

hiratara · 2020-04-28T13:56:27Z

_lookup_node doesn't seem to be used anymore.

https://github.com/nexB/python-publicsuffix2/blob/e95f43d26d57692fc1f153c8c9bbde62dabeaae7/src/publicsuffix2/__init__.py#L201-L240

I'm all for improving the execution speed. However, we are cautious about changing existing behavior. The reason the specs of this library change so often is that the specs are so vague. It's hard to determine if this specification change is reasonable (although I'm not actively opposed to it).

KnitCode · 2020-05-01T17:08:05Z

_lookup_node doesn't seem to be used anymore.

https://github.com/nexB/python-publicsuffix2/blob/e95f43d26d57692fc1f153c8c9bbde62dabeaae7/src/publicsuffix2/__init__.py#L201-L240

I'm all for improving the execution speed. However, we are cautious about changing existing behavior. The reason the specs of this library change so often is that the specs are so vague. It's hard to determine if this specification change is reasonable (although I'm not actively opposed to it).

agree, and my brain hadn't noticed that. @pombredanne I would think the priorities would be in rough order:

correcting faulty logic and unit tests based on the published algorithm; these corrections do a good job of documenting the algorithm and reveal faults in the original Mozilla tests that rely on the existence of certain things (e.g. example) being the psl which aren't. i like the demonstrated combinations using a set list in the use_caeses.md
ensuring backward compatibility, outside of being wrong :) ... so that the functions people are using today work as expected, and added functions give more controls, as needed,
py2.7 support for older use cases,
performance improvements.. that's always a fine thing.

agreed? given that, wondering if the merge can be split to get these. if the performance improvements suggested by @vadym-t only work with py3 idioms and don't use the trie, what about using an extra flag parameter to trigger usage of the new functions?

i can take a closer look on the weekend and rerun the experiments i did with an earlier version.

pombredanne · 2020-07-16T12:34:21Z

@vadym-t ping?

vadym-t added 5 commits April 27, 2020 05:40

Some test cases were updated to correspond to updated input requireme…

db483ee

…nts. Input domain string is a FQDN without trailing dot. Empty labels are not allowed. Explatatory comments prepend modification when necessary.

Some test cases were updated to correspond to updated input requireme…

6ecba33

…nts. Input domain string is a FQDN without trailing dot. Empty labels are not allowed. Explatatory comments prepend modification when necessary.

this file describes the behavior of methods and some specific use cas…

e95f43d

…es. Eventually it should cover the full set of test cases and merged into README.

pombredanne requested changes Apr 28, 2020

View reviewed changes

KnitCode suggested changes Apr 28, 2020

View reviewed changes

Added the explanation of unsafe methods as suggested by @KnitCode

006e431

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get_tld(), get_components(), and other #17

get_tld(), get_components(), and other #17

vadym-t commented Apr 27, 2020

pombredanne left a comment

pombredanne commented Apr 28, 2020

KnitCode commented Apr 28, 2020

KnitCode left a comment

KnitCode Apr 28, 2020

vadym-t Apr 30, 2020

KnitCode Apr 28, 2020

KnitCode Apr 28, 2020

KnitCode Apr 28, 2020

KnitCode Apr 28, 2020

KnitCode Apr 28, 2020

KnitCode Apr 28, 2020

KnitCode Apr 28, 2020

KnitCode Apr 28, 2020

KnitCode Apr 28, 2020

hiratara commented Apr 28, 2020

KnitCode commented May 1, 2020

pombredanne commented Jul 16, 2020

get_tld(), get_components(), and other #17

Are you sure you want to change the base?

get_tld(), get_components(), and other #17

Conversation

vadym-t commented Apr 27, 2020

pombredanne left a comment

Choose a reason for hiding this comment

pombredanne commented Apr 28, 2020

KnitCode commented Apr 28, 2020

KnitCode left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hiratara commented Apr 28, 2020

KnitCode commented May 1, 2020

pombredanne commented Jul 16, 2020