Skip to content

PageRank Calculations Performed With MATLAB Engine for C.

Notifications You must be signed in to change notification settings

Dipto9999/Google_PageRank

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Google PageRank : MATLAB Engine for C

Contents

Overview

I implemented a simplified version of the Google PageRank algorithm. This program was developed using the MATLAB Engine API for C.

A matrix multiplication project was built to become familiar with the MATLAB Engine.

Installations

Visual Studio 2019

The source and header files were written to be compiled and run in the Visual Studio 2019 IDE. The (CPSC259_Lab5.sln) solution file can be opened in the Visual Studio software.

Configuration Properties in the VS Solution Explorer :

  • Configuration Properties->Debugging->Environment :
    PATH=...\MATLAB\R2021a\bin\win64\$(LocalDebuggerEnvironment)
  • C/C++->General->Additional Include Directories :
    ...\MATLAB\R2021a\extern\include;%(AdditionalIncludeDirectories)
  • Linker->General->Additional Library Directories :
    ...\MATLAB\R2021a\extern\lib\win64\microsoft;%(AdditionalLibraryDirectories)

MATLAB 2021a

MATLAB 2021a was installed in order to perform matrix calculations in the MATLAB Engine. I made use of the MATLAB commands in the (pagerank.c) source file in order to calculate the PageRank Algorithm. To ensure the intended performance of our C program, I also tested the PageRank Algorithm in the MATLAB Command Window.

Demonstration

A video in the Demonstrations directory shows the PageRank calculations when running the program on Visual Studio. I have embedded a low resolution compressed version below.

PageRank.mp4

PageRank Algorithm

I have designed a simplified version of Google's PageRank Algorithm. This was developed by Larry Page as a means of identifying the relevant webpages on the Internet.

Markov Chain

The PageRank theory is based on the assumption that a hypothetical web surfer randomly clicks on hyperlinks to jump between webpages. This theoretical random walk is referred to as a Markov Chain. I have mathematically accounted for the probability that the random walk will follow the hyperlinks through a variable p. This is pivotal to calculate the PageRank through the Power Method in the MATLAB Engine.

Connectivity Matrix

The connectivity matrix for the PageRank algorithm is parsed from the (web.txt) file. The input contains a number of lines of binary values, where outgoing links from webpages are arranged in column-wise fashion.

In Essence :
1 values indicate the presence of an outgoing link from page j to page i.
0 values indicate the absence of an outgoing link from page j to page i.

I performed the parsing functionality in the (websolver.c) source file.

Initial Approximation

I calculated the PageRank approximation by running the MATLAB command x = mldivide((I - p * M * D), e);.

Considering the MATLAB Variables :
I represents the identity matrix.
p represents the probability of following a hyperlink.
M represents the connectivity matrix.
D represents the sparse matrix with the probabilities of following an individual link from each webpage.
e represents a 6x1 ones array, which mldivide is called on to left divide the matrix result of (I - p * M * D).

In mathematical representation, we are solving for the variable x where (I - p * M * D) * x = e.

After normalizing x with the MATLAB command x = x/ sum(x);, this yields the PageRank approximation for the probability of a random web surfer accessing each of the webpages.

The complete MATLAB output can be view in the (Approximation.pdf) file.

Power Method

In the Power Method I also account for the probability 1-p that the web surfer comes across the webpage without accessing a hyperlink. Even though the model is composed of a finite number of webpages, n (i.e. n = 6), this is the practical approach to calculating the PageRank for a very large number of pages (i.e. n >> 6).

Transition Matrix

This method involves multiplying the PageRank by a Transition Matrix, which is generated through the MATLAB command :

A = p * M * D + e * (((1 - p) * (colSums ~= 0) + (colSums == 0) )/ dim);

Considering the MATLAB Variables :
p represents the probability of following a hyperlink.
M represents the connectivity matrix.
D represents the sparse matrix with the probabilities of following an individual link.
e represents a 6X1 ones array which I use to calculate the probability of randomly choosing a webpage.
colSums represents the total number of outgoing links from each webpage.
dim represents the row-size and column-size of the square connectivity matrix.

The Power Method accounts for the Markov Chain in the calculation by including the probability of choosing a random webpage. This multiplication is performed in an iterative fashion, until the PageRank stops changing. This yields the limiting probability that an infinitely dedicated web surfer accesses a given webpage.

In order to do this, I executed the MATLAB command :

while norm(ldivide(dim, (xCurr - xPrev))) > 0.01 xPrev = xCurr; xCurr = A * xCurr; end;

The current and previous PageRank values are compared in order to perform the necessary number of PageRank iterations.

The complete MATLAB output can be view in the (Power_Method.pdf) file.

Principal Eigenvector

Repeatedly multiplying the Transition Matrix by the PageRank can be mathematically represented as :

A * x = A * A * x = A * A * A * x = lambda * x = x'

This is the mathematical formula for an eigenvector. From this, I deduced that the PageRank Algorithm, as calculated by the Power Method, represents a system that evolves after every step. The Transition Matrix pulls the PageRank towards the principal eigenvector. This is best viewed in the following visual. This is sourced from the Eigenvectors and Eigenvalues web post.

To evaluate this, I calculated the principal eigenvector in the MATLAB Engine, as shown below.

The complete MATLAB output can be view in the (Principal_Eigenvector.pdf) file.

Credit

This was originally completed as a final project for CPSC 259 - Data Structures and Algorithms for Electrical Engineers, which is a course at the University of British Columbia. Unlike the other labs for this course, this program was created entirely from scratch.

However, this project was heavily modified to deviate from the original submission and to respect course policies. In its current form, it is largely a personal exploration of the C language features as well as the mathematics of the PageRank Algorithm.

The MATLAB commands executed to perform the PageRank calculations were heavily inspired by the CPSC 259 course instructions, as well as from the following sources:

The mathematical rationality behind eigenvectors powering the PageRank algorithm was further explored from the following sources:

Policy

I have read and understood the plagiarism policies at http://www.cs.ubc.ca/~tmm/courses/cheat.html and I understand that no excuse for plagiarism will be accepted.

About

PageRank Calculations Performed With MATLAB Engine for C.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages