Feature: Deep Blacklisting #145

Open
nautbot opened this issue Mar 31, 2018 · 2 comments
Labels
BIG: Bigger features/tasks that will take some time to implement.
enhancement: New feature or request

Comments

nautbot (Contributor) commented Mar 31, 2018

Develop a deep blacklisting job/script that consumes and processes the XML/JSON feed created in #144. The technique is still undetermined.

See issue #20 for the original /u/nautbot functionality.

nautbot added the enhancement and BIG labels Mar 31, 2018
nautbot added this to the v.2 - Independent Release milestone Mar 31, 2018

nautbot (Contributor, Author) commented Mar 31, 2018

Suggested solution per @jpleger:

it might actually be something that can be done with scrapy and splash,
neither of which takes much effort to set up
https://scrapy.org/
https://github.com/scrapy-plugins/scrapy-splash
I haven't used either project in the last couple of years, but scrapy was pretty easy to work with, and splash adds JS support
https://doc.scrapy.org/en/latest/topics/link-extractors.html#link-extractors
can use the link extractors to find all links on a website, and then write a simple middleware that logs all redirects after the frontier crawl
(the first page, that is)
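
A rough sketch of the idea in the suggestion above: collect every link on the first page, then record the redirect chain behind each one. This is a standard-library stand-in, not Scrapy code; in a Scrapy project the link extractor and a downloader middleware would play these roles. The names `LinkCollector`, `extract_links`, `redirect_chain`, `crawl_first_page`, and the `fetch(url) -> (status, location)` callable are all hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class LinkCollector(HTMLParser):
    """Collect absolute href targets from <a> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    """Return all link targets found in the given HTML, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def redirect_chain(url, fetch, max_hops=10):
    """Follow redirects starting at url and return every URL visited.

    fetch is an assumed callable returning (status_code, location_header);
    a real job would wrap an HTTP client with redirects disabled.
    """
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break
        next_url = urljoin(chain[-1], location)
        if next_url in chain:  # stop on redirect loops
            break
        chain.append(next_url)
    return chain


def crawl_first_page(page_url, html, fetch):
    """Map each link on the first page to its logged redirect chain."""
    return {
        link: redirect_chain(link, fetch)
        for link in extract_links(page_url, html)
    }
```

Passing `fetch` in as a parameter keeps the sketch testable without network access. How the resulting chains would be checked against a blacklist is left open, since the issue does not settle on a technique.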

psineur (Contributor) commented Apr 5, 2018

This needs re-triage to v1, v1.1 MVP+, or v3.
v2 was tech-only and stack/platform agnostic.
