bluedotiya/web_crawler

Python web crawler

A recursive web crawler: feed it one URL and it returns all the websites related to that address, down to a chosen depth.
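To make "a chosen depth" concrete, here is a minimal sketch of a depth-limited crawl. It is not the project's actual implementation: the `crawl` function, the `get_links` callback, and the toy link graph are hypothetical names introduced for illustration, and the sketch uses an iterative breadth-first traversal rather than literal recursion. A real crawler would fetch each page and extract its links inside `get_links`.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl(start: str, depth: int,
          get_links: Callable[[str], Iterable[str]]) -> Dict[str, int]:
    """Return every URL reachable from `start` within `depth` hops.

    The result maps each URL to the number of hops needed to reach it.
    `get_links` is a hypothetical callback that returns the links on a page.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        level = seen[url]
        if level >= depth:
            # Pages at the depth limit are recorded but not expanded.
            continue
        for link in get_links(url):
            if link not in seen:
                seen[link] = level + 1
                queue.append(link)
    return seen


# Toy in-memory link graph standing in for real pages (illustration only).
TOY_LINKS = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}

result = crawl("a", 2, lambda url: TOY_LINKS.get(url, []))
print(result)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```

With a depth of 2, page `e` is left out because it sits three hops from the start.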

Kubernetes Crawler

The K8s flavor of the web crawler is custom-built to run as a native Kubernetes app. It provides scalability, stability, and high performance.

Installation prerequisites

  1. Apt packages: setfacl, git
  2. kubectl CLI
  3. Helm CLI
  4. Single/multi node K8s cluster (tested on Kubeadm)
  5. Root access (Duh :))
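As a convenience, the command-line prerequisites above can be checked before running the installer. The following is a small sketch, not part of the repository: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the check only looks for executables on `PATH`. It does not verify the cluster or root access.

```python
import shutil

# Executables named in the prerequisites list above.
REQUIRED_TOOLS = ["setfacl", "git", "kubectl", "helm"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All command-line prerequisites were found on PATH.")
```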

How to install

  1. Clone this repo:
     git clone https://github.com/bluedotiya/web_crawler.git
  2. Change into the new git directory:
     cd web_crawler
  3. Run the installer and wait for the installation to complete:
     bash installer.sh -o install
  4. Once the installation is complete, you should be able to access your Neo4j DB.
     Example: Deployment Done you can connect Neo4j Browser on: http://<YOUR_K8S_NODE_IP_HERE>:30074
     Example: Database Port is: 30087
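If the Neo4j Browser does not load, a quick TCP check can tell whether the node ports are reachable at all. This is a minimal sketch under assumptions: `port_open` is a hypothetical helper, the ports are the ones shown in the example output above, and the node IP is passed as a command-line argument. A successful connection only shows that something is listening, not that Neo4j is healthy.

```python
import socket
import sys


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__" and len(sys.argv) > 1:
    node_ip = sys.argv[1]  # your K8s node IP
    # Ports taken from the example installer output above.
    for name, port in [("Neo4j Browser", 30074), ("Database", 30087)]:
        state = "reachable" if port_open(node_ip, port) else "not reachable"
        print(f"{name} port {port}: {state}")
```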

How to use

  1. To start a search, send the following request (replace the url and depth values with your own):
     curl -X POST http://<YOUR_K8S_NODE_IP_HERE>:30080 -H 'Content-Type: application/json' -d '{"url":"https://www.google.com","depth":2}'
  2. You can now view your data in the native Neo4j Browser or your favorite Neo4j DB viewer app.
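The same request can be sent from Python using only the standard library. This sketch mirrors the curl command above (port 30080, JSON body with `url` and `depth`). The function names `build_crawl_request` and `start_crawl` are hypothetical, and the response format is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request


def build_crawl_request(node_ip, url, depth, port=30080):
    """Build the POST request that the curl example above sends."""
    payload = json.dumps({"url": url, "depth": depth}).encode("utf-8")
    return urllib.request.Request(
        f"http://{node_ip}:{port}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_crawl(node_ip, url, depth):
    """Send the request and return (HTTP status, raw response text)."""
    request = build_crawl_request(node_ip, url, depth)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

For example, `start_crawl("<YOUR_K8S_NODE_IP_HERE>", "https://www.google.com", 2)` would submit the same search as the curl command, with the placeholder replaced by your node IP.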

Recommendation

Use the Neo4j Desktop app alongside GraphXR for the best graph viewing and search experience.

GraphXR Visualization: image