Web Crawler

This application is a web crawler: given a root URL, it crawls all URLs within the root URL's domain and returns an HTTP response containing a textual link map of the crawled pages and the relations between them.

Tools

  • Java 8+
  • Spring Boot
  • Maven 3.3.9
  • Jsoup (for HTML extraction)
  • Mockito (for tests)

Startup Instructions

  • From the command line:
    • Make sure Maven and Java are installed and added to the system path
    • Run: mvn clean install
    • Run: java -jar target/crawler-0.0.1-SNAPSHOT.jar from the application root folder
  • From an IDE (like IntelliJ):
    • Import the project using pom.xml
    • Run the main class CrawlerApplication as a Spring Boot application (right-click -> Run As)

API

Two APIs are defined in the controller: one for health check and one for crawling.
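As a rough illustration only, the sketch below shows what the controller shape could look like. The /healthcheck and /crawl paths, the url parameter, the class name, and the return type are assumptions made for this sketch, not the repository's actual definitions.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Illustrative controller sketch; paths and names are hypothetical.
@RestController
public class CrawlerControllerSketch {

    // Health-check endpoint (hypothetical path).
    @GetMapping("/healthcheck")
    public String healthCheck() {
        return "OK";
    }

    // Crawl endpoint (hypothetical path): takes a root URL and returns the
    // textual sitemap, i.e. a map from each crawled URL to its adjacent links.
    @GetMapping("/crawl")
    public Map<String, Set<String>> crawl(@RequestParam String url) {
        // In the real application this would delegate to the crawler service;
        // the placeholder below only shows the response shape.
        return Collections.singletonMap(url, Collections.<String>emptySet());
    }
}
```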

Configuration Properties

A few configuration properties/settings for the crawler are defined in application.properties; a sample is shown after the list below.

  • threadCount: the number of threads used to initialize the executor service that runs crawl tasks in parallel.
  • includeExternal: flag controlling whether URLs outside the root URL's domain are included during the crawling process.
  • maxTimeout: the maximum time a call waits to poll a link from the queue before the process ends.
  • maxResults: the maximum number of results/keys in the textual sitemap response.
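For illustration only, a sample application.properties might look like the following; the values are placeholders rather than the repository's defaults, and the millisecond unit for maxTimeout is an assumption.

```
# Illustrative values only -- not the repository's defaults.
threadCount=10
includeExternal=false
# maxTimeout is assumed here to be in milliseconds.
maxTimeout=5000
maxResults=100
```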

Implementation Details

  • A queue holds all discovered links in order so they can be crawled one after the other (like breadth-first search in a graph); see the sketch after this list.
  • A visited set marks each link/URL as visited so a completed crawl is not repeated.
  • A hash map stores each link URL and its adjacent links to return as output (similar to an adjacency-list representation of a graph).
  • Crawling is performed by executor-service tasks running in parallel, based on the configuration parameters.
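The sketch below puts those pieces together under stated assumptions: a coordinator loop that polls a blocking queue with maxTimeout, a concurrent visited set, a concurrent map as the adjacency list, and Jsoup for link extraction. Class, method, and variable names are illustrative and not the repository's actual ones; the domain check and error handling are deliberately simplified.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.*;

// Minimal sketch of the BFS-style crawl described above.
public class CrawlSketch {

    public static Map<String, Set<String>> crawl(String rootUrl, int threadCount,
                                                  long maxTimeoutMs, int maxResults)
            throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();     // links to crawl, in discovery order
        Set<String> visited = ConcurrentHashMap.newKeySet();           // URLs already submitted for crawling
        Map<String, Set<String>> siteMap = new ConcurrentHashMap<>();  // URL -> adjacent links (adjacency list)

        queue.offer(rootUrl);
        visited.add(rootUrl);

        while (siteMap.size() < maxResults) {
            // End the process when no new link arrives within the configured timeout.
            String url = queue.poll(maxTimeoutMs, TimeUnit.MILLISECONDS);
            if (url == null) {
                break;
            }
            executor.submit(() -> {
                Set<String> adjacent = new LinkedHashSet<>();
                try {
                    Document doc = Jsoup.connect(url).get();
                    for (Element link : doc.select("a[href]")) {
                        String target = link.absUrl("href");
                        if (target.isEmpty()) {
                            continue;
                        }
                        adjacent.add(target);
                        // Only enqueue unvisited links within the root domain
                        // (the includeExternal=false case).
                        if (target.startsWith(rootUrl) && visited.add(target)) {
                            queue.offer(target);
                        }
                    }
                } catch (Exception e) {
                    // Pages that fail to load still appear with an empty adjacency set.
                }
                siteMap.put(url, adjacent);
            });
        }
        executor.shutdown();
        executor.awaitTermination(maxTimeoutMs, TimeUnit.MILLISECONDS);
        return siteMap;
    }
}
```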
