GitHub - emo-crab/tldextract-rs: tldextract-rs

Summary

tldextract-rs is a high-performance effective top-level domain (eTLD) extraction module that splits a domain name into its subcomponents: subdomain, domain, suffix and registered domain.

Using

Usage: tldextract-cli [-s <source-uri>] [-j] [-l <list>] [--disable-private-domains] [-f <filter>] [-o <output>]

Reach new heights.

Options:
  -s, --source-uri  specific sources(local file path or remote url) to prefix
                    list,(eg. snapshot,remote)
  -j, --json        write output in json(lines) format
  -l, --list        list of sub(domains) to extract (file or stdin)
  --disable-private-domains
                    disable private domains
  -f, --filter      display filter result by field only (eg. -f
                    suffix,domain,subdomain,registered_domain)
  -o, --output      file to write output
  --help            display usage information

Example

$ tldextract-cli -j -l mirrors.tuna.tsinghua.edu.cn
 {"subdomain":"mirrors.tuna","domain":"tsinghua","suffix":"edu.cn","registered_domain":"tsinghua.edu.cn"}

Implementation details

Why not split on "." and take the last element instead?

Splitting on "." and taking the last element works only for single-label eTLDs such as com. It fails for multi-label eTLDs such as oseto.nagasaki.jp, where it would return just jp.
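To make the failure concrete, here is a minimal sketch of the naive approach. The function name `naive_suffix` is hypothetical and is not part of tldextract-rs; it only illustrates why a plain split is not enough.

```rust
/// Naive approach (hypothetical helper, not part of tldextract-rs):
/// treat everything after the last '.' as the suffix.
fn naive_suffix(host: &str) -> &str {
    // `rsplit` yields labels from right to left, so the first item is the last label.
    host.rsplit('.').next().unwrap_or(host)
}

fn main() {
    // Correct for a single-label eTLD such as `com`.
    assert_eq!(naive_suffix("www.example.com"), "com");

    // Wrong for a multi-label eTLD: the suffix should be `oseto.nagasaki.jp`,
    // but the naive split only sees the last label.
    assert_eq!(naive_suffix("example.oseto.nagasaki.jp"), "jp");

    println!("naive split checks passed");
}
```

Fixing this requires knowing which multi-label suffixes exist, which is why a suffix list has to be consulted at all.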

eTLD tries

tldextract-rs stores eTLDs in compressed tries.

Valid eTLDs from the Mozilla Public Suffix List are inserted into the compressed trie in reverse order, rightmost label first.

Given the following eTLDs:
au
nsw.edu.au
com.ac
edu.ac
gov.ac

and the example URL host `example.nsw.edu.au`, the compressed trie will be structured as follows:

START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩

=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`

The URL host subcomponents are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the nodes gives the extracted eTLD nsw.edu.au.
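The lookup described above can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not the crate's actual implementation: the `SuffixTrie` and `Node` types and their methods are hypothetical, the trie is a plain per-label trie without compression, and wildcard or exception rules from the Public Suffix List are not handled.

```rust
use std::collections::HashMap;

/// One trie node per domain label (hypothetical type, not the crate's own).
#[derive(Default)]
struct Node {
    children: HashMap<String, Node>,
    /// Corresponds to the 🚩 marker: the path to this node is a valid eTLD.
    is_suffix: bool,
}

#[derive(Default)]
struct SuffixTrie {
    root: Node,
}

impl SuffixTrie {
    /// Insert an eTLD label by label, rightmost label first.
    fn insert(&mut self, suffix: &str) {
        let mut node = &mut self.root;
        for label in suffix.rsplit('.') {
            node = node.children.entry(label.to_string()).or_default();
        }
        node.is_suffix = true;
    }

    /// Walk the host from right to left and return the longest flagged path.
    fn longest_suffix(&self, host: &str) -> Option<String> {
        let mut node = &self.root;
        let mut matched: Vec<&str> = Vec::new();
        let mut best_len = 0; // number of labels in the longest eTLD seen so far
        for label in host.rsplit('.') {
            match node.children.get(label) {
                Some(child) => {
                    matched.push(label);
                    if child.is_suffix {
                        best_len = matched.len();
                    }
                    node = child;
                }
                None => break, // no more matching nodes
            }
        }
        if best_len == 0 {
            return None;
        }
        // Labels were collected right to left, so reverse them before joining.
        let mut labels = matched[..best_len].to_vec();
        labels.reverse();
        Some(labels.join("."))
    }
}

fn main() {
    let mut trie = SuffixTrie::default();
    for suffix in ["au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"] {
        trie.insert(suffix);
    }
    // Matching path au -> edu -> nsw, reversed to nsw.edu.au.
    assert_eq!(
        trie.longest_suffix("example.nsw.edu.au").as_deref(),
        Some("nsw.edu.au")
    );
    // `edu` matches but is not flagged, so the longest valid eTLD is `au`.
    assert_eq!(trie.longest_suffix("example.edu.au").as_deref(), Some("au"));
    println!("trie lookup checks passed");
}
```

Tracking the longest flagged path separately from the matched path matters because intermediate nodes such as `edu` under `au` can match without being eTLDs themselves.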

