Add subwords capability to ffuf_shortnames #2237
base: dev
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2237   +/-   ##
=====================================
- Coverage     93%     93%    -0%
=====================================
  Files        378     378
  Lines      29363   29443    +80
=====================================
+ Hits       27155   27227    +72
- Misses      2208    2216     +8
bbot/modules/ffuf_shortnames.py
Outdated
  from bbot.modules.deadly.ffuf import ffuf


  class ffuf_shortnames(ffuf):
      watched_events = ["URL_HINT"]
      produced_events = ["URL_UNVERIFIED"]
-     deps_pip = ["numpy"]
+     deps_pip = ["numpy", "nltk"]
Is there an advantage to using nltk over the builtin subword helper?
Mainly the size of the wordlist, which is massive and should be well maintained since it is part of nltk.
Also, that helper's functionality doesn't quite match this use: it finds all the subwords (as a list), whereas this only checks for prefixes.
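To make the distinction in this reply concrete, here is a minimal sketch contrasting the two behaviours. It is not the bbot helper or the PR's code: the function names, the tiny WORDS set, and the min_len cutoff are all assumptions made for illustration, and the PR itself draws its dictionary from the nltk "words" corpus.

```python
# Hypothetical stand-in for a dictionary; the PR uses the nltk "words" corpus.
WORDS = {"get", "set", "use", "user", "data"}


def all_subwords(text, words=WORDS, min_len=3):
    """Return every dictionary word found anywhere inside text, as a list."""
    text = text.lower()
    found = []
    for start in range(len(text)):
        for end in range(start + min_len, len(text) + 1):
            if text[start:end] in words:
                found.append(text[start:end])
    return found


def word_prefixes(text, words=WORDS, min_len=3):
    """Return only the dictionary words that text starts with."""
    text = text.lower()
    return [text[:end] for end in range(min_len, len(text) + 1) if text[:end] in words]


print(all_subwords("GETUSE"))   # words anywhere in the string
print(word_prefixes("GETUSE"))  # words at the start of the string only
```

The first function scans every substring, while the second only ever looks at slices anchored at position 0, which is all a prefix check needs.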
bbot/modules/ffuf_shortnames.py
Outdated
          self.debug("NLTK words data already present")
      except LookupError:
          self.debug("NLTK words data not found, downloading")
          nltk.download("words", download_dir=self.nltk_dir, quiet=True)
self.helpers.wordlist("https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/words.zip")
Also adds an ignore_case option to ffuf (useful for IIS, where case doesn't matter).
Subwords uses Python's nltk (Natural Language Toolkit) to try to find smaller words at the beginning of shortnames. If it finds one, it sends the remainder off to the predictor. This works well because web developers have a habit of making lots of "VerbAction"-type two-word file names.
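As a rough illustration of the flow described above, the sketch below splits a shortname on a leading dictionary word and hands the remainder to a predictor. Everything here is hypothetical: expand_with_subword, toy_predict, the toy vocabulary, and the step of joining the prefix back onto each prediction are assumptions for the example, not the module's actual implementation or its real predictor.

```python
def expand_with_subword(shortname, words, predict, min_len=3):
    """For each dictionary word the shortname starts with, predict
    completions of the remainder and rejoin them with that prefix."""
    stem = shortname.lower()
    candidates = []
    # Stop before len(stem) so the remainder is never empty.
    for end in range(min_len, len(stem)):
        prefix, remainder = stem[:end], stem[end:]
        if prefix in words:
            for completion in predict(remainder):
                candidates.append(prefix + completion)
    return candidates


def toy_predict(fragment):
    """Hypothetical predictor: complete a fragment from a fixed vocabulary."""
    vocabulary = ["users", "username", "update"]
    return [word for word in vocabulary if word.startswith(fragment)]


# "GETUSE" starts with the word "get"; the remainder "use" goes to the predictor.
print(expand_with_subword("GETUSE", {"get", "set"}, toy_predict))
```

Splitting first means the predictor only has to complete the short second word of a "VerbAction" name rather than the whole truncated string.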