This repository is meant to be used as a starting point for building a more complex web crawler.
This repo contains the following features:
- Basic HTTP web server and API implementation using the Tide library. It contains examples for post and get HTTP methods (a minimal sketch is shown below).
- Simple Spider module which has all the necessary tools to do the crawling.
- Mechanism to archive (download) crawled websites.
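As a rough illustration of the first feature, here is a minimal sketch of a Tide server with one get and one post route. It assumes tide and async-std (with the "attributes" feature) as dependencies; the route paths and handlers are made up for illustration and are not the project's actual API.

use tide::Request;

#[async_std::main]
async fn main() -> tide::Result<()> {
    tide::log::start();
    let mut app = tide::new();

    // get: respond with a static string
    app.at("/hello").get(|_| async { Ok("hello") });

    // post: echo the request body back to the caller
    app.at("/echo").post(|mut req: Request<()>| async move {
        let body = req.body_string().await?;
        Ok(body)
    });

    app.listen("127.0.0.1:8080").await?;
    Ok(())
}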
Nightly Rust is required in order to build and run this project. All the necessary dependencies will be installed once cargo check or cargo run is called.
For general information about how to install Rust, look here.
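If nightly is not already your default toolchain, one way to install it and pin it for this project folder is with rustup, for example:
$ rustup toolchain install nightly
$ rustup override set nightly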
The server is started by executing the following command inside the root project folder:
$ cargo run
This should start the HTTP web server, and a message like the following should be visible inside the terminal:
tide::log Logger started
level Info
tide::server Server listening on http://127.0.0.1:8080
The server exposes three API endpoints that are available to be called.
The post /spider API takes as input a JSON object that contains a domain address and several optional parameters. On success, it outputs a JSON object which contains the ID of the crawled domain. That ID can later be used to get the list or count of links that were crawled.
{
    "address": "https://www.google.com",   # This parameter is required.
    "max_depth": 2,                        # This parameter is optional. It refers to how far
                                           # down into a website's page hierarchy the spider
                                           # will crawl. If left unset, no limit will be applied.
    "max_pages": 2,                        # This parameter is optional. It refers to how many
                                           # pages the spider will crawl before it stops. If
                                           # left unset, no limit will be applied.
    "robots_txt": false,                   # This parameter is optional. If enabled, the spider
                                           # will slow the speed of crawling and/or ignore
                                           # certain subdomains. This parameter is currently unused.
    "archive_pages": false                 # This parameter is optional. If enabled, the spider
                                           # will archive (download) the crawled web pages. If
                                           # left unset, the default value false is used.
}
{
    "id": "ABCDEFGHT"   # MD5 hash that is used as an ID.
}
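On the server side, the request body shown above might map onto a serde struct along these lines. This is only a sketch: the struct name, field types, and use of serde defaults are assumptions for illustration, not the project's actual code.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct CrawlRequest {
    address: String,          // required
    max_depth: Option<u32>,   // optional; no limit when absent
    max_pages: Option<u32>,   // optional; no limit when absent
    #[serde(default)]
    robots_txt: bool,         // optional; currently unused
    #[serde(default)]
    archive_pages: bool,      // optional; defaults to false
}

Inside a Tide handler, such a struct could then be read from the request with req.body_json().await.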
If the server is running, run the following command inside a new terminal:
$ curl localhost:8080/spider -d '{ "address": "http://www.zadruga-podolski.hr" }'
{ "id":"e0436759bf33e12eb53ae0b97f790991" }
This API, get /spider/:id/list, returns the list of crawled web pages for a specific domain. The ID is retrieved by calling post /spider.
If the server is running, run the following command inside a new terminal:
$ curl localhost:8080/spider -d '{ "address": "http://www.zadruga-podolski.hr" }'
{ "id":"e0436759bf33e12eb53ae0b97f790991" }
$ curl localhost:8080/spider/e0436759bf33e12eb53ae0b97f790991/list
["http://www.zadruga-podolski.hr/kontakt.html","http://www.zadruga-podolski.hr/muškat-žuti.html","http://www.zadruga-podolski.hr/diplome-i-priznanja.html","http://www.zadruga-podolski.hr/chardonnay.html","http://www.zadruga-podolski.hr/o-nama.html","http://www.zadruga-podolski.hr/index.html","http://www.zadruga-podolski.hr/graševina-ledeno-vino.html","http://www.zadruga-podolski.hr/tradicija-i-običaji.html","http://www.zadruga-podolski.hr/križevci.html","http://www.zadruga-podolski.hr/pinot-sivi.html","http://www.zadruga-podolski.hr/pinot-bijeli.html","http://www.zadruga-podolski.hr","http://www.zadruga-podolski.hr/graševina.html"]
This API, get /spider/:id/count, returns the count of crawled web pages for a specific domain. The ID is retrieved by calling post /spider.
If the server is running, run the following command inside a new terminal:
$ curl localhost:8080/spider -d '{ "address": "http://www.zadruga-podolski.hr" }'
{ "id":"e0436759bf33e12eb53ae0b97f790991" }
$ curl localhost:8080/spider/e0436759bf33e12eb53ae0b97f790991/count
{ "count": 13 }
In order to keep this project simple and small, certain features were intentionally left unimplemented, like checking for the "robots.txt" file or checking the header for the <base> tag. Feel free to implement those features yourself as a kind of exercise. Also, this project heavily relies on async code, so anyone who is an async first-timer should definitely check out these two videos: video1 video2
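As a starting point for the <base> tag exercise, here is a small sketch of resolving a relative link against a base URL. It assumes the url crate, which the project may or may not already depend on.

use url::Url;

// Resolve a (possibly relative) link against a base URL, e.g. a <base href="...">.
fn resolve(base_href: &str, link: &str) -> Option<Url> {
    let base = Url::parse(base_href).ok()?;
    base.join(link).ok()
}

For example, resolve("http://example.com/dir/", "../page.html") yields http://example.com/page.html.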
- Add more tests (and rename current ones)
- Check if there is a chance for a deadlock inside Spider
- Benchmark solution