Skip to content

Larry Page's original search algorithm for Google.

Notifications You must be signed in to change notification settings

arielfayol37/PageRank

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PageRank Algorithm

Welcome to the PageRank algorithm project! In this project, we'll explore the PageRank algorithm used by search engines like Google to rank web pages based on their importance. The algorithm determines the importance of a page based on the number and quality of incoming links it receives from other pages.

Background

The primary goal of the PageRank algorithm is to identify "important" pages, which are more likely to be linked by other important pages. This approach provides a more robust measure of importance compared to simply counting the number of incoming links.

To understand PageRank, we'll use two models:

  1. Random Surfer Model: In this model, we consider a hypothetical surfer who navigates the internet by randomly clicking on links. The probability that this surfer is on a particular page at any given time corresponds to its PageRank.

  2. Iterative Algorithm: The PageRank can also be defined mathematically using a recursive expression. The rank of a page is calculated iteratively, where each page's rank is updated based on the ranks of pages linking to it.

Random Surfer Model

Imagine a surfer on a web page who randomly chooses links to follow. This model defines the PageRank as the probability that the surfer is on a particular page at any given time. The more incoming links a page has, the higher the likelihood of the surfer ending up on that page.

Iterative Algorithm

PageRank can also be defined using a recursive mathematical expression. Let PR(p) be the PageRank of a page p, i.e., the probability that a random surfer ends up on that page. We define PR(p) as follows:

PR(p) = (1 - d) / N + d * Σ(PR(i) / NumLinks(i))

Here, d is the damping factor, N is the total number of pages in the corpus, i ranges over all pages that link to page p, and NumLinks(i) is the number of links present on page i.

The iterative algorithm starts by assuming that the PageRank of each page is initially equal (1 / N), and then it repeatedly calculates new PageRank values for each page based on the previous values. This process continues until the PageRank values converge, i.e., their changes become negligible.

Getting Started

To use the PageRank algorithm:

  1. Clone this repository to your local machine using git clone.

  2. Open pagerank.py to explore the provided functions and the implementation of the PageRank algorithm.

  3. Run the program and provide the directory containing the corpus of web pages as a command-line argument. e.g. python pagerank.py corpus0

  4. The program will calculate PageRank using both the random surfer model and the iterative algorithm.

Functions

In pagerank.py, the following functions have been completed:

transition_model(corpus, page, damping_factor)

This function returns a dictionary representing the probability distribution over which page a random surfer would visit next.

  • corpus: A Python dictionary mapping a page name to a set of all pages linked to by that page.
  • page: A string representing which page the random surfer is currently on.
  • damping_factor: A floating-point number representing the damping factor to be used when generating the probabilities.

The return value is a Python dictionary with one key for each page in the corpus. Each key is mapped to a value representing the probability that a random surfer would choose that page next. The values in this returned probability distribution sum to 1.

sample_pagerank(corpus, damping_factor, n)

This function estimates the PageRank of each page by sampling.

  • corpus: A Python dictionary mapping a page name to a set of all pages linked to by that page.
  • damping_factor: A floating-point number representing the damping factor to be used by the transition model.
  • n: An integer representing the number of samples that should be generated to estimate PageRank values.

The return value is a Python dictionary with one key for each page in the corpus. Each key is mapped to a value representing that page's estimated PageRank (i.e., the proportion of all the samples that corresponded to that page). The values in this dictionary sum to 1.

The first sample is generated by choosing from a page at random. For each of the remaining samples, the next sample is generated from the previous sample based on the previous sample's transition model.

iterate_pagerank(corpus, damping_factor)

This function calculates PageRanks based on the iterative formula.

  • corpus: A Python dictionary mapping a page name to a set of all pages linked to by that page.
  • damping_factor: A floating-point number representing the damping factor to be used in the PageRank formula.

The return value is a Python dictionary with one key for each page in the corpus. Each key is mapped to a value representing that page's PageRank. The values in this dictionary sum to 1.

The function begins by assigning each page a rank of 1 / N, where N is the total number of pages in the corpus. It then repeatedly calculates new rank values based on all of the current rank values, according to the PageRank formula. This process repeats until no PageRank value changes by more than 0.001 between the current rank values and the new rank values.

License

This project is open-source and released under the MIT License.

Discover the power of the PageRank algorithm and explore how it determines the importance of web pages. Have fun! 🌐🔍

About

Larry Page's original search algorithm for Google.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published