I implemented a simplified version of the Google PageRank algorithm. This program was developed using the MATLAB Engine API for C.
A matrix multiplication project was built to become familiar with the MATLAB Engine.
The source and header files were written to be compiled and run in the Visual Studio 2019 IDE. The (CPSC259_Lab5.sln
) solution file can be opened in the Visual Studio software.
Configuration Properties in the VS Solution Explorer :
- Configuration Properties->Debugging->Environment :
PATH=...\MATLAB\R2021a\bin\win64\$(LocalDebuggerEnvironment) - C/C++->General->Additional Include Directories :
...\MATLAB\R2021a\extern\include;%(AdditionalIncludeDirectories) - Linker->General->Additional Library Directories :
...\MATLAB\R2021a\extern\lib\win64\microsoft;%(AdditionalLibraryDirectories)
MATLAB 2021a was installed in order to perform matrix calculations in the MATLAB Engine. I made use of the MATLAB commands in the (pagerank.c
) source file in order to calculate the PageRank Algorithm. To ensure the intended performance of our C program, I also tested the PageRank Algorithm in the MATLAB Command Window.
A video in the Demonstrations
directory shows the PageRank calculations when running the program on Visual Studio. I have embedded a low resolution compressed version below.
PageRank.mp4
I have designed a simplified version of Google's PageRank Algorithm. This was developed by Larry Page as a means of identifying the relevant webpages on the Internet.
The PageRank theory is based on the assumption that a hypothetical web surfer randomly clicks on hyperlinks to jump between webpages. This theoretical random walk is referred to as a Markov Chain. I have mathematically accounted for the probability that the random walk will follow the hyperlinks through a variable p. This is pivotal to calculate the PageRank through the Power Method in the MATLAB Engine.
The connectivity matrix for the PageRank algorithm is parsed from the (web.txt
) file. The input contains a number of lines of binary values, where outgoing links from webpages are arranged in column-wise fashion.
In Essence :
1 values indicate the presence of an outgoing link from page j to page i.
0 values indicate the absence of an outgoing link from page j to page i.
I performed the parsing functionality in the (websolver.c
) source file.
I calculated the PageRank approximation by running the MATLAB command x = mldivide((I - p * M * D), e);
.
Considering the MATLAB Variables :
I
represents the identity matrix.
p
represents the probability of following a hyperlink.
M
represents the connectivity matrix.
D
represents the sparse matrix with the probabilities of following an individual link from each webpage.
e
represents a 6x1 ones array, which mldivide
is called on to left divide the matrix result of (I - p * M * D)
.
In mathematical representation, we are solving for the variable x where (I - p * M * D) * x = e.
After normalizing x with the MATLAB command x = x/ sum(x);
, this yields the PageRank approximation for the probability of a random web surfer accessing each of the webpages.
The complete MATLAB output can be view in the (Approximation.pdf
) file.
In the Power Method I also account for the probability 1-p that the web surfer comes across the webpage without accessing a hyperlink. Even though the model is composed of a finite number of webpages, n (i.e. n = 6), this is the practical approach to calculating the PageRank for a very large number of pages (i.e. n >> 6).
This method involves multiplying the PageRank by a Transition Matrix, which is generated through the MATLAB command :
A = p * M * D + e * (((1 - p) * (colSums ~= 0) + (colSums == 0) )/ dim);
Considering the MATLAB Variables :
p
represents the probability of following a hyperlink.
M
represents the connectivity matrix.
D
represents the sparse matrix with the probabilities of following an individual link.
e
represents a 6X1 ones array which I use to calculate the probability of randomly choosing a webpage.
colSums
represents the total number of outgoing links from each webpage.
dim
represents the row-size and column-size of the square connectivity matrix.
The Power Method accounts for the Markov Chain in the calculation by including the probability of choosing a random webpage. This multiplication is performed in an iterative fashion, until the PageRank stops changing. This yields the limiting probability that an infinitely dedicated web surfer accesses a given webpage.
In order to do this, I executed the MATLAB command :
while norm(ldivide(dim, (xCurr - xPrev))) > 0.01 xPrev = xCurr; xCurr = A * xCurr; end;
The current and previous PageRank values are compared in order to perform the necessary number of PageRank iterations.
The complete MATLAB output can be view in the (Power_Method.pdf
) file.
Repeatedly multiplying the Transition Matrix by the PageRank can be mathematically represented as :
A * x = A * A * x = A * A * A * x = lambda * x = x'
This is the mathematical formula for an eigenvector. From this, I deduced that the PageRank Algorithm, as calculated by the Power Method, represents a system that evolves after every step. The Transition Matrix pulls the PageRank towards the principal eigenvector. This is best viewed in the following visual. This is sourced from the Eigenvectors and Eigenvalues web post.
To evaluate this, I calculated the principal eigenvector in the MATLAB Engine, as shown below.
The complete MATLAB output can be view in the (Principal_Eigenvector.pdf
) file.
This was originally completed as a final project for CPSC 259 - Data Structures and Algorithms for Electrical Engineers, which is a course at the University of British Columbia. Unlike the other labs for this course, this program was created entirely from scratch.
However, this project was heavily modified to deviate from the original submission and to respect course policies. In its current form, it is largely a personal exploration of the C language features as well as the mathematics of the PageRank Algorithm.
The MATLAB commands executed to perform the PageRank calculations were heavily inspired by the CPSC 259 course instructions, as well as from the following sources:
The mathematical rationality behind eigenvectors powering the PageRank algorithm was further explored from the following sources:
I have read and understood the plagiarism policies at http://www.cs.ubc.ca/~tmm/courses/cheat.html and I understand that no excuse for plagiarism will be accepted.