The goal of this project is to create a multi-threaded web crawler. A web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Search engines use these indexes, web graphs, and an appropriate algorithm (such as PageRank) to rank pages. The main focus of the project is to implement a multi-threaded downloader that can download multiple websites at the same time. The plan is to implement this in C or C++.
- Part 1: Multi-threaded web-crawler
- Part 2: (Extended scope) Web Ranking
- Single Threaded Web Crawler
- Multithreaded Web Crawler
- Website domain ranking algorithms
- Graphs
- Libraries used
- Future extension of project
- Repo structure
- How to reuse this repo
- Credits
- Contributors
- HTTP website downloader
  - using the socket library
- HTTPS website downloader
  - using the OpenSSL library
- HTML file parser
  - using regex
- Domain extractor
  - using regex (a sketch follows this list)
- Crawler loop
- Website ranker
  - using a simple counter
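A minimal sketch of what the regex-based parsing pieces can look like is given below. The helper names (`extractLinks`, `extractDomain`) and the patterns are assumptions made for illustration only, not the repo's actual implementation.

```cpp
// Illustrative regex-based link and domain extraction (hypothetical helpers).
#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Pull absolute http(s) links out of raw HTML via the href attribute.
std::vector<std::string> extractLinks(const std::string& html) {
    static const std::regex hrefRe(R"re(href\s*=\s*"(https?://[^"]+)")re", std::regex::icase);
    std::vector<std::string> links;
    for (auto it = std::sregex_iterator(html.begin(), html.end(), hrefRe);
         it != std::sregex_iterator(); ++it) {
        links.push_back((*it)[1].str());
    }
    return links;
}

// Reduce a URL to its domain, e.g. "https://mygov.in/page" -> "mygov.in".
std::string extractDomain(const std::string& url) {
    static const std::regex domainRe(R"(^https?://([^/]+))", std::regex::icase);
    std::smatch m;
    return std::regex_search(url, m, domainRe) ? m[1].str() : std::string{};
}

int main() {
    const std::string html = R"(<a href="https://mygov.in/page">MyGov</a>)";
    for (const auto& link : extractLinks(html))
        std::cout << link << " -> " << extractDomain(link) << '\n';
}
```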
...
while(!mainQueue.empty() && totalVisitedPages < pagesLimit)
{
    currWebsite = mainQueue.pop()
    html = websiteDownloader(currWebsite)
    linkWebsite = htmlParser(html)
    update(discoveredWebsites, mainQueue, totalVisitedPages)
}
...
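For readers who want something compilable, the loop above can be written as the following C++ sketch. `websiteDownloader` and `htmlParser` are only declared as assumed helpers; the sketch shows the data flow, not the repo's actual code.

```cpp
// Sketch of the single-threaded crawl loop; the two helpers are assumed to
// exist elsewhere in the project and are only declared here.
#include <queue>
#include <set>
#include <string>
#include <vector>

std::string websiteDownloader(const std::string& url);          // assumed: fetch HTML
std::vector<std::string> htmlParser(const std::string& html);   // assumed: extract links

void crawl(const std::string& seed, int pagesLimit) {
    std::queue<std::string> mainQueue;
    std::set<std::string> discoveredWebsites;
    int totalVisitedPages = 0;

    mainQueue.push(seed);
    discoveredWebsites.insert(seed);

    while (!mainQueue.empty() && totalVisitedPages < pagesLimit) {
        std::string currWebsite = mainQueue.front();
        mainQueue.pop();
        std::string html = websiteDownloader(currWebsite);
        ++totalVisitedPages;

        // Enqueue newly discovered links, skipping anything already seen.
        for (const std::string& link : htmlParser(html)) {
            if (discoveredWebsites.insert(link).second)
                mainQueue.push(link);
        }
    }
}
```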
- Use `make` to compile the program. `maxlinks` and `pagelimit` can be given as arguments with the `make` command, for e.g. `make maxlinks=100 pagelimit=100`
- Here the arguments are:
  - `maxlinks`: Maximum number of links to be extracted from a website
  - `pagelimit`: Maximum number of websites to be downloaded while crawling
- Crawler as a thread controller
- Child thread
  - HTML downloader
  - Link extractor
  - Domain extractor
- Ranking using different algorithms
...
// Parent (controller) loop: spawn workers while the queue has work,
// sleep while workers are busy, and exit once the page limit is hit
// and all workers have finished.
while(1){
    if(pagesLimitReached || visitedPages >= pagesLimit){
        pagesLimitReached = true;
        if(w_threads){
            gotosleep();          // workers still running: wait to be woken
        }
        else {
            break;                // limit reached and no workers left: done
        }
    }
    else{
        if (w_threads < maxThreads && queue_size > 0){
            createThread();       // spare capacity and pending URLs: spawn a worker
        }
        else if(w_threads == 0){
            break;                // nothing queued and no workers: crawl finished
        }
        else{
            gotosleep();          // wait for a worker to signal progress
        }
    }
}
...
...
// Worker (child thread): fetch and parse one URL, update the shared state,
// then wake the controller when appropriate.
download(url);
parse(url);
update(queue, visitedLinks, ranking);
if(pagesLimitReached){
    if(workingThreads == 0){
        wake_parent();   // last worker after the limit: let the parent exit
    }
}
else{
    wake_parent();       // more work may be available: let the parent spawn threads
}
...
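The `gotosleep()` / `wake_parent()` calls in the pseudocode map naturally onto a condition variable. Below is a minimal, hypothetical sketch of that handshake using `std::condition_variable`; the names mirror the pseudocode, but this is illustrative and not the repo's actual implementation.

```cpp
// Hypothetical controller/worker wake-up handshake built on a condition variable.
#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
bool workerSignalled = false;

// Called by the controller in place of gotosleep().
void gotosleep() {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [] { return workerSignalled; });
    workerSignalled = false;   // consume the signal before the next controller iteration
}

// Called by a worker in place of wake_parent().
void wake_parent() {
    {
        std::lock_guard<std::mutex> lock(mtx);
        workerSignalled = true;
    }
    cv.notify_one();
}
```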
- Using a single lock for all shared variables
  - Pros: easy to implement
  - Cons: the lock has to be held for longer stretches, which can degrade into essentially serial processing
- Using thread-safe data structures
  - Each data structure gets its own lock (or a reader-writer lock where useful)
  - Waiting time is distributed over different locks
  - Different thread-safe data structures (a queue sketch follows this list):
    - Thread-safe integer
    - Thread-safe queue
    - Thread-safe map
  - Pros: no single lock is held for long, so concurrency is achieved efficiently on multi-processor CPUs
  - Cons: overhead due to multiple locks
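As an example of the second approach, here is a minimal sketch of a thread-safe queue that wraps `std::queue` behind a single mutex. It is illustrative only, not the repo's actual class; a real implementation might also offer blocking waits via a condition variable.

```cpp
// Hypothetical thread-safe queue: every operation takes the internal lock,
// so callers never manage synchronization themselves.
#include <mutex>
#include <optional>
#include <queue>
#include <string>

class ThreadSafeQueue {
public:
    void push(std::string url) {
        std::lock_guard<std::mutex> lock(mtx_);
        q_.push(std::move(url));
    }

    // Returns std::nullopt when the queue is empty instead of blocking.
    std::optional<std::string> pop() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (q_.empty()) return std::nullopt;
        std::string url = std::move(q_.front());
        q_.pop();
        return url;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mtx_);
        return q_.size();
    }

private:
    mutable std::mutex mtx_;
    std::queue<std::string> q_;
};
```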
- Use `make` to compile the program. `maxlinks`, `pagelimit`, and `threads` can be given as arguments with the `make` command, for e.g. `make maxlinks=100 pagelimit=100 threads=20`
- Here the arguments are:
  - `maxlinks`: Maximum number of links to be extracted from a website
  - `pagelimit`: Maximum number of websites to be downloaded while crawling
  - `threads`: Maximum number of threads to be created
  - `rankerFlag`: Flag to choose which ranking algorithm to execute (`n` = simple counter based web ranker, `sp` = PageRank with sampling, `ip` = PageRank with iteration)
- Simple counter based: The intuition behind this approach is to increase the rank of a domain name every time we visit it.
- PageRank algorithm: The intuition behind our PageRank algorithm is as follows. Suppose a random surfer randomly chooses a starting website. After that, it picks the next website from the websites linked to the current one with probability 0.85, or from all available websites with probability 0.15.
  In this way, the importance of a website is measured by how often the random surfer visits it. Hence, a website is important if it is linked to by many important websites.
  There are two ways to implement this algorithm.
...
corpus = read(csv_file)
for website in corpus.keys():
    for x in corpus[website]:
        rank[website] += 1
...
------------------------------------------------
Domain Name rankings using counter
------------------------------------------------
................................................
Domain Name Rank
................................................
1 . mygov.in 43
2 . main.ayush.gov.in 36
3 . tourism.gov.in 24
4 . digitalindia.gov.in 19
5 . asi.nic.in 16
------------------------------------------------------------
In this approach, we ask the random surfer to choose a website a fixed number of times (the number of samples); at each step the candidate websites are weighted as per the PageRank intuition described above.
...
DAMPING = 0.85
SAMPLES = 10000
corpus = read(csv_file)
page = random.choice(list(corpus.keys()))   # surfer starts at a random website
for i in range(SAMPLES):
    # Build the transition model for the current page.
    for x in corpus.keys():
        if (x in corpus[page]):
            model[x] = (DAMPING * (1/number_of_linked_websites)) + ((1-DAMPING) * (1/total_websites))
        else:
            model[x] = (1-DAMPING) * (1/total_websites)
    # Sample the next page according to the model and credit its rank.
    page = random.choices(list(corpus.keys()), weights=model.values())[0]
    pagerank[page] += (1/SAMPLES)
return pagerank
...
-------------------------------------------------------------
Domain Name ranking using PageRank: Sampling (n = 10000)
-------------------------------------------------------------
................................................
Domain Name Rank
................................................
1 . haryana.mygov.in 0.1290000000000021
2 . chhattisgarh.mygov.in 0.11000000000000212
3 . blog.mygov.in 0.07730000000000119
4 . mygov.in 0.07260000000000105
5 . aatmanirbharbharat.mygov.in 0.04840000000000036
------------------------------------------------------------
The intuition behind ranking using the iterative PageRank algorithm is as follows. We initialize the probability of the surfer visiting a given website to 1/total_websites, and then repeatedly update the probability of every website according to the formula below.
We stop iterating when the difference between the old and updated probabilities is less than a certain threshold.
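The update rule is the standard PageRank formula (consistent with the pseudocode below), where d is the damping factor (0.85), N is the total number of websites, and L(v) is the number of links going out of website v:

PR(x) = (1 - d)/N + d * Σ_{v links to x} PR(v)/L(v)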
...
DAMPING = 0.85
THRESHOLD = 0.001
corpus = read(csv_file)
while(1):
    before = [pagerank[v] for v in pagerank.keys()]
    for x in corpus.keys():
        # Base probability of jumping to x at random, plus the share
        # contributed by every website v that links to x.
        pagerank[x] = (1-DAMPING)/number_of_websites
        for v in linkedWebsites:
            pagerank[x] += DAMPING * (pagerank[v]/total_linkedWebsites)
    # Stop once the ranks have converged.
    if (change(before, [pagerank[v] for v in pagerank.keys()]) <= THRESHOLD):
        break
return pagerank
...
----------------------------------------------
Domain Name ranking using PageRank: Iteration
----------------------------------------------
................................................
Domain Name Rank
................................................
1 . india.gov.in 0.01762736346840192
2 . digitalindia-gov.zoom.us 0.017587054793058533
3 . tourism.gov.in 0.016866581191734113
4 . digitalindiaawards.gov.in 0.014974653859083446
5 . mygov.in 0.0122561916194452
------------------------------------------------------------
- We have made a graph of number of threads vs time. From this graph, we can infer the following:
  - When the number of threads is very low, the time required for crawling is large
  - Time also increases when the number of threads becomes very large, because of locking overhead
  - With an optimal number of threads, concurrent crawling pays off
  - Note: the graph varies from run to run; these inferences reflect the general trend observed in most runs
- We have also made a graph varying the pagelimit parameter vs time. From this graph, we can infer the following:
  - As pagelimit increases, crawling time increases
  - Multithreading with a single lock works better than the single-threaded crawler because of concurrency
  - Multithreading with thread-safe data structures performs worst of the three approaches because of locking overhead: each data structure has its own lock, and while crawling we have to acquire and release these locks back to back to update each structure, which adds enough overhead to increase the total time significantly
- Sockets
- OpenSSL
- Pthread library
  - For concurrency and synchronization techniques
    - Locks
      - Single locks
      - Reader-Writer locks
    - Condition Variables
- Matplotlib
  - Plotting the graphs
- Efficient downloader to download all websites
- Efficient parser to parse large websites
- Creating a search engine by using web indexing and web scraping
- Three main branches:
  - `main`: contains the multithreaded web crawler using thread-safe data structures
  - `MT_singlelock`: contains the multithreaded web crawler using a single lock
  - `single_threaded`: contains the single-threaded web crawler
- Fork the repo from the top right corner of this page
- Run the following in your terminal:
$ git clone https://github.com/[USERNAME]/Web-Crawler
$ cd Web-Crawler
$ make
- Pull requests are highly appreciated
- All the code is well commented; in case of queries, feel free to contact us
- How to write a multi-threaded webcrawler
- DOWNLOADING A WEB PAGE IN C USING A SOCKET
- std::condition_variable
- Scrapy
- PageRank
- Anupam Kumar
- Preeti Chiluveru
- Shruti Katpara
- Vivek Modi