A simple crawler that fetches all pages in a given website and prints the links between them.
📣 Note that this project was purpose-built for a coding challenge (see problem statement) and is not meant for production use (unless you aren't web scale yet).
Before you run this app, make sure you have Node.js installed. `yarn` is recommended, but it can be used interchangeably with `npm`. If you'd prefer running everything inside a Docker container, see the Docker setup section.
```shell
git clone https://github.com/paambaati/websight
cd websight
yarn install && yarn build
yarn start <website>
```
To run the tests and generate a coverage report:

```shell
yarn run coverage
```
To build the Docker image and run the crawler inside a container:

```shell
docker build -t websight .
docker run -ti websight <website>
```
```shell
yarn bundle && yarn binary
```

This produces standalone executable binaries for both Linux and macOS.
```
                        +---------------------+
                        |   Link Extractor    |
                        | +-----------------+ |
                        | |                 | |
                        | |  URL Resolver   | |
                        | |                 | |
                        | +-----------------+ |
+-----------------+     | +-----------------+ |     +-----------------+
|                 |     | |                 | |     |                 |
|     Crawler     +---->+ |     Fetcher     | +---->+     Sitemap     |
|                 |     | |                 | |     |                 |
+-----------------+     | +-----------------+ |     +-----------------+
                        | +-----------------+ |
                        | |                 | |
                        | |     Parser      | |
                        | |                 | |
                        | +-----------------+ |
                        +---------------------+
```
The `Crawler` class runs a fast, non-deterministic crawl: it fetches all pages (via `LinkExtractor`), recursively follows the URLs found in them, and saves the results in `Sitemap`. When crawling is complete[1], the sitemap is printed as an ASCII tree.
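A minimal sketch of that flow, assuming hypothetical `extract()` and `add()` signatures (the real classes differ):

```typescript
// Illustrative sketch only; these interfaces stand in for the real classes.
interface LinkExtractor {
  extract(url: string): Promise<string[]>;
}
interface Sitemap {
  add(url: string, links: string[]): void;
}

class Crawler {
  private readonly visited = new Set<string>();

  constructor(
    private readonly extractor: LinkExtractor,
    private readonly sitemap: Sitemap,
  ) {}

  async crawl(url: string): Promise<void> {
    if (this.visited.has(url)) return;
    this.visited.add(url);

    // Extract the links on this page and record them in the sitemap...
    const links = await this.extractor.extract(url);
    this.sitemap.add(url, links);

    // ...then fan out to every discovered link without awaiting the results,
    // which is why the crawl as a whole never "resolves" (see footnote [1]).
    for (const link of links) {
      void this.crawl(link);
    }
  }
}
```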
The `LinkExtractor` class is a thin orchestrating wrapper around 3 core components:

- `URLResolver` includes the logic for resolving relative URLs and normalizing them. It also includes utility methods for filtering out external URLs.
- `Fetcher` takes a URL, fetches it and returns the response as a `Stream`. Streams can be read in small buffered chunks, which avoids holding very large HTML documents in memory.
- `Parser` parses the HTML stream (returned by `Fetcher`) in chunks and emits the `link` event for each page URL and the `asset` event for each static asset found in the HTML (sketched below).
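As an illustration of the `link`/`asset` event flow, here is a rough sketch built on `htmlparser2`; the tag-to-event mapping and the `parse()` signature are assumptions, not the project's exact logic:

```typescript
import { EventEmitter } from "node:events";
import type { Readable } from "node:stream";
import { Parser as HTMLParser } from "htmlparser2";

// Sketch of a streaming parser that emits `link` for page URLs and
// `asset` for static resources. The tag/attribute mapping is illustrative.
class Parser extends EventEmitter {
  parse(html: Readable): void {
    const parser = new HTMLParser({
      onopentag: (name, attribs) => {
        if (name === "a" && attribs.href) {
          this.emit("link", attribs.href);
        } else if ((name === "img" || name === "script") && attribs.src) {
          this.emit("asset", attribs.src);
        } else if (name === "link" && attribs.href) {
          this.emit("asset", attribs.href);
        }
      },
    });

    // Feed the response to htmlparser2 chunk by chunk instead of buffering
    // the whole document in memory.
    html.on("data", (chunk: Buffer) => parser.write(chunk.toString()));
    html.on("end", () => parser.end());
  }
}
```

A caller can then subscribe with `parser.on("link", url => ...)` to queue further crawls.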
[1] `Crawler.crawl()` is an `async` function that never resolves, because it is technically impossible to detect when we've finished crawling. In most runtimes, we'd have to implement some kind of idle polling to detect completion; in Node.js, however, the main process runs to completion as soon as the event loop has no more tasks to execute. This is why the sitemap is finally printed in the `beforeExit` event of `process`. ↩
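In code, that final step might look something like the following sketch; `render()` is a hypothetical stand-in for whatever produces the ASCII tree:

```typescript
// Minimal stand-in for the real Sitemap; `render()` is a made-up name here.
const sitemap = { render: (): string => "(ASCII tree goes here)" };

// `beforeExit` fires once the event loop has drained, i.e. when there are no
// pending fetches or parses left, so this is the last safe place to print.
process.on("beforeExit", () => {
  console.log(sitemap.render());
});
```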
- **Streams all the way down.** The key workloads in this system are HTTP fetches (I/O-bound) and HTML parses (CPU-bound), and either can be time-consuming and/or high on memory usage. To better parallelize the crawls and use as little memory as possible, the `got` library's streaming API and the very fast `htmlparser2` have been used.
- **Keep-Alive connections.** The `Fetcher` class uses a global `keepAlive` agent to reuse sockets, as we're only crawling a single domain. This helps avoid re-establishing a TCP connection for each request. A sketch combining both ideas follows this list.
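The fetch-and-parse path could be sketched as below; this assumes `got` ≥ v10 (for the shape of the `agent` option) and prints links instead of emitting events, so it is an illustration rather than the project's actual code:

```typescript
import { Agent } from "node:https";
import got from "got";
import { Parser } from "htmlparser2";

// A single keep-alive agent shared by every request, so sockets to the one
// domain being crawled are reused instead of re-opened per request.
const keepAliveAgent = new Agent({ keepAlive: true });

// Stream the response body straight into htmlparser2 so large pages are
// processed chunk by chunk rather than buffered whole.
function fetchAndParse(url: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const parser = new Parser({
      onopentag: (name, attribs) => {
        if (name === "a" && attribs.href) console.log("link:", attribs.href);
      },
    });

    got
      .stream(url, { agent: { https: keepAliveAgent } })
      .on("data", (chunk: Buffer) => parser.write(chunk.toString()))
      .on("end", () => {
        parser.end();
        resolve();
      })
      .on("error", reject);
  });
}
```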
When ramping up for scale, this design exposes a few of its limitations:

- **No rate-limiting.** Most modern and large websites have some sort of throttling set up to block bots. A production-grade crawler should implement a politeness policy to make sure it doesn't inadvertently bring down a website, and so it doesn't run into permanent bans and `429` error responses (a generic throttling sketch follows this list).
- **In-memory state management.** `Sitemap().sitemap` is an unbounded `Map`, and can quickly grow and possibly cause the runtime to run out of memory and crash when crawling very large websites. In a production-grade crawler, there should be an external scheduler that holds the URLs to crawl next.
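As a generic illustration of the first point (not something websight implements), even a tiny politeness throttle that serializes requests and spaces them out would help:

```typescript
// A minimal politeness throttle: grants permission to fetch at most once
// every `minDelayMs`, serializing concurrent callers. Purely illustrative.
class PolitenessThrottle {
  private queue: Promise<void> = Promise.resolve();

  constructor(private readonly minDelayMs: number) {}

  // Resolves when the caller may issue its next request; grants are spaced
  // at least `minDelayMs` apart, even when called concurrently.
  wait(): Promise<void> {
    const turn = this.queue.then(
      () => new Promise<void>((resolve) => setTimeout(resolve, this.minDelayMs)),
    );
    this.queue = turn;
    return turn;
  }
}

// Hypothetical usage: space fetches roughly 500 ms apart.
// const throttle = new PolitenessThrottle(500);
// await throttle.wait();
// await fetchAndParse(url);
```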