Open Crawler 1.0.0 - Documentation

Table Of Contents

Getting Started

Installation

Linux
git clone https://github.com/merwin-asm/OpenCrawler.git
cd OpenCrawler
chmod +x install.sh && ./install.sh
Windows

You need git, python3, and pip installed.

git clone https://github.com/merwin-asm/OpenCrawler.git
cd OpenCrawler
pip install -r requirements.txt

Features

  • Cross-platform
  • Installer for Linux
  • Related CLI tools (CLI access to the crawler, a basic search tool, etc.)
  • Memory efficient (probably)
  • Pool crawling: run multiple crawlers at the same time
  • Supports robots.txt
  • MongoDB as the database
  • Language detection
  • 18+ / offensive content checks
  • Proxies
  • Multithreading
  • URL scanning
  • Logging of keywords, descriptions, and recurring words

Uses

Making a (not that good) search engine:

This can be done easily, with very few modifications if any are needed.

  • We also provide a built-in search function. It may not be good enough, but it does the job (search is discussed below).

OSINT tool:

You can use the tool to crawl sites related to someone and do OSINT with the search utility, or write custom code for it.

Pentesting tool:

Find all websites related to one site. This can be achieved with the connection-tree command (discussed below).

And, as the name says, a crawler.

Commands

Find Commands

To find the commands, you can use either of these two methods.

Warning: the man page only works on Linux.

man opencrawler

For Linux:

opencrawler help

For Windows:

python opencrawler help

About Commands

help

Shows the commands available

v

Shows the current version of opencrawler

crawl

Starts the normal crawler.

forced_crawl <website>

Forcefully crawls the site given as <website>.

crawled_status

Warning: the data shown isn't exact.

Gives info on the MongoDB database: the number of sites crawled and the average amount of storage used.

Shows the info for both collections (more on the collections in the Working section):

  • crawledsites
  • waitlist

search <search>

Uses basic filtering methods to search. This command isn't meant to be a real search engine (how search works is discussed in the Working section).

configure

Configures OpenCrawler. The same command is used to reconfigure. It asks for all the info required to start the crawler and saves it in a JSON file (config.json). See the Config File section for more.

It's OK to run the crawl command without a config, because it will ask you to configure first.

connection-tree <website> <no of layers>

Shows a tree of the websites connected to <website>.

<no of layers> is how deep you want to crawl from the site. The default depth is 2.

check_html <website>

Checks whether a website returns HTML.

crawlable <website>

Checks whether a website is allowed to be crawled. It reads robots.txt to find out whether crawling is disallowed.

dissallowed <website>

Shows the disallowed URLs of a website. The results are based on robots.txt.
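
The two robots.txt commands above can be illustrated with a minimal sketch. It is not OpenCrawler's actual code: the function names (`parse_robots`, `is_crawlable`, `disallowed_paths`) and `SAMPLE_ROBOTS` are hypothetical, and the rules are parsed from a string instead of being fetched from a site. Only `urllib.robotparser`, which the notes below mention, is taken from the project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for illustration.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: otherbot
Disallow: /
"""


def parse_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_crawlable(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Roughly what a 'crawlable' check needs: may user_agent fetch url?"""
    return parser.can_fetch(user_agent, url)


def disallowed_paths(robots_txt: str, user_agent: str = "*") -> list:
    """Collect Disallow rules from groups that name user_agent exactly.

    Assumption: a simple hand-written scan, since RobotFileParser has no
    public method that lists the disallowed paths.
    """
    paths = []
    agents = []            # user agents of the group being read
    reading_rules = False  # True once the group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:  # a User-agent line after rules starts a new group
                agents, reading_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            reading_rules = True
            if field == "disallow" and value and user_agent.lower() in agents:
                paths.append(value)
    return paths
```

With a live site, `RobotFileParser.set_url()` followed by `read()` would load the rules instead of `parse()`.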

fix_db

Starts the fix_db program. It can be used to clean up database contamination caused by bugs in the code.

re-install

Reinstalls OpenCrawler.

update

Installs the newest version of OpenCrawler (by reinstalling).

install-requirements

Installs the requirements listed in requirements.txt.

Config File

The file is generated by the configure command, which runs config.py.

The file is JSON, named config.json.

The config file stores settings for the crawling activity. These include:

  • MONGODB_PWD - password of the MongoDB user
  • MONGODB_URI - URI for connecting to MongoDB
  • TIMEOUT - timeout for GET requests
  • MAX_THREADS - number of threads; set it to 1 if you don't want multithreading
  • bad_words - the file containing the list of bad words, bad_words.txt by default (bad_words.txt is provided)
  • USE_PROXIES - bool - whether the crawler should use proxies (proxies are not used for robots.txt scanning even if this is True)
  • Scan_Bad_Words - bool - whether to save the bad / offensive text score
  • Scan_Top_Keywords - bool - whether to save the top keywords found in the HTML text
  • urlscan_key - the UrlScan API key; leave it empty if you are not using the feature
  • URL_SCAN - bool - whether to scan URLs using the UrlScan API
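
As a rough illustration of how these keys fit together, here is a sketch that loads config.json and fills in missing keys. The key names come from the list above; the default values, the `merge_config` and `load_config` helpers, and the validation rule are assumptions for illustration and are not taken from config.py.

```python
import json
from pathlib import Path

# Key names are from the list above; every default value is an assumption.
DEFAULT_CONFIG = {
    "MONGODB_PWD": "",
    "MONGODB_URI": "",
    "TIMEOUT": 5,            # assumed to be seconds
    "MAX_THREADS": 1,        # 1 means no multithreading
    "bad_words": "bad_words.txt",
    "USE_PROXIES": False,
    "Scan_Bad_Words": False,
    "Scan_Top_Keywords": False,
    "urlscan_key": "",
    "URL_SCAN": False,
}


def merge_config(raw: dict) -> dict:
    """Overlay the stored values on the defaults and sanity-check them."""
    config = dict(DEFAULT_CONFIG)
    config.update(raw)
    # Hypothetical check: URL scanning needs an API key.
    if config["URL_SCAN"] and not config["urlscan_key"]:
        raise ValueError("URL_SCAN is enabled but urlscan_key is empty")
    return config


def load_config(path: str = "config.json") -> dict:
    """Read the JSON config file; fall back to defaults if it is missing."""
    file = Path(path)
    raw = json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}
    return merge_config(raw)
```

Keeping the merge step separate from the file read makes the defaults easy to check without touching the disk.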

Working

Files :

| Filename | Type | Use |
|---|---|---|
| opencrawler | python | The main file, called when you run the opencrawler command |
| crawler.py | python | Does the crawling |
| requirements.txt | text | Names of the Python modules to be installed |
| search.py | python | Does the search |
| opencrawler.1 | roff | The user manual |
| mongo_db.py | python | Handles MongoDB |
| installer.py | python | Installer for Linux, run by install.sh |
| install.sh | shell | Installs basic requirements like python3; for Linux only |
| fix_db.py | python | Fixes the DB |
| connection_tree.py | python | Makes the connection tree |
| config.py | python | Configures OpenCrawler |
| bad_words.txt | text | Bad words used for predicting the bad/offensive text score |

MongoDB Collections

There are two collections used :

  • waitlist - Used for storing sites which is to be crawled
  • crawledsites - Used to store crawled sites and collected info about them

How is data stored in mongoDB

The structure in which data is stored in the collections:

crawledsites :
######### Crawled info is stored in MongoDB as #####
crawledsites = [
                {
                    "website" : "<website>",

                    "time" : "<last_crawled_in_epoch_time>",
                    "mal" : Val/None, # malicious or not
                    "offn" : Val/None, # 18+ / offensive language
                    "ln" : "<language>",

                    "keys" : [<meta-keywords>],
                    "desc" : "<meta-desc>",

                    "recc" : [<recurring words>]/None
                }
]

waitlist :
waitlist = [
           {
               "website" : "<website>"
           }
]
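
To make the layout concrete, here is a sketch that builds one document of each kind as a plain Python dict. The field names follow the structure above; the helper names (`make_crawled_record`, `make_waitlist_entry`) are hypothetical, and storing the time as an integer epoch value is an assumption.

```python
import time
from typing import List, Optional


def make_crawled_record(
    website: str,
    language: str,
    keywords: List[str],
    description: str,
    mal=None,                            # malicious-or-not value, None if not scanned
    offn=None,                           # offensive text score, None if not scanned
    recurring: Optional[List[str]] = None,
    crawled_at: Optional[int] = None,
) -> dict:
    """Return a dict shaped like one crawledsites document."""
    return {
        "website": website,
        # Assumption: epoch seconds stored as an integer.
        "time": int(time.time()) if crawled_at is None else crawled_at,
        "mal": mal,
        "offn": offn,
        "ln": language,
        "keys": keywords,
        "desc": description,
        "recc": recurring,
    }


def make_waitlist_entry(website: str) -> dict:
    """Return a dict shaped like one waitlist document."""
    return {"website": website}
```

A dict like this could be passed to pymongo's `insert_one` on the matching collection.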

Connection Tree

By default the depth is 2.

The connection-tree command works by collecting all URLs found on a site, then doing the same for each URL found. The number of times this repeats depends on the depth.
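
The idea can be sketched as a depth-limited recursive expansion. This is not the code in connection_tree.py: `build_connection_tree` and the `get_links` callback are hypothetical, and skipping URLs that were already seen is an added assumption to keep the tree finite.

```python
from typing import Callable, Dict, Iterable


def build_connection_tree(
    root: str,
    get_links: Callable[[str], Iterable[str]],
    depth: int = 2,
) -> Dict[str, dict]:
    """Return a nested dict: each URL maps to the URLs found on it.

    get_links is a stand-in for "fetch the page and extract its URLs".
    """
    seen = {root}  # assumption: each URL appears only once in the tree

    def expand(url: str, remaining: int) -> dict:
        if remaining == 0:
            return {}
        children = {}
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            children[link] = expand(link, remaining - 1)
        return children

    return {root: expand(root, depth)}
```

Passing the link extractor in as a function keeps the traversal testable without any network access. For example, `graph.get` style lookups over an in-memory dict can stand in for real pages.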

Search

The search command uses the data stored in the crawledsites collection.

For each word of the query, it looks for sites containing that word in:

  • website URL
  • desc
  • keywords
  • top recurring words

The results are sorted so that sites matching the most words from the query come first.

    url = list(_DB().Crawledsites.find({"$or": [
        {"recc": {"$regex": re.compile(word, re.IGNORECASE)}},
        {"keys": {"$regex": re.compile(word, re.IGNORECASE)}},
        {"desc": {"$regex": re.compile(word, re.IGNORECASE)}},
        {"website": {"$regex": re.compile(word, re.IGNORECASE)}}
    ]}))
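
The ranking step described above can be sketched in memory, without MongoDB. The `word_matches` and `rank_sites` helpers are hypothetical and this is not the code in search.py. One deliberate difference from the query above: the sketch escapes each word with `re.escape`, so query words are matched literally and not as regular expressions.

```python
import re

# The four fields the search looks at, using the crawledsites field names.
SEARCH_FIELDS = ("website", "desc", "keys", "recc")


def word_matches(word: str, record: dict) -> bool:
    """True if word occurs (case-insensitively) in any searched field."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    for field in SEARCH_FIELDS:
        value = record.get(field)
        if value is None:
            continue
        texts = value if isinstance(value, list) else [value]
        if any(pattern.search(str(text)) for text in texts):
            return True
    return False


def rank_sites(query: str, records: list) -> list:
    """Return websites ordered by how many query words each one matches."""
    words = query.split()
    scored = []
    for record in records:
        score = sum(1 for word in words if word_matches(word, record))
        if score:
            scored.append((score, record["website"]))
    # sort() is stable, so sites with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [website for _, website in scored]
```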

Note

  • Proxies don't work for robots.txt scans while crawling, because urllib.robotparser doesn't allow the use of a proxy
  • If pymongo isn't working, try installing the version recommended for your Python version
  • If you get pymongo errors, also make sure the MongoDB user has read and write permissions
  • You can use a local MongoDB
  • The search function doesn't use all possible filters to find a site
  • installer.py and install.sh aren't the same: install.sh also installs python and pip, then runs installer.py
  • installer.py and install.sh are for Linux only
  • We use the proxyscrape API to get free proxies
  • We use VirusTotal's API for scanning websites, if required