-
Notifications
You must be signed in to change notification settings - Fork 0
/
PageRank.py
370 lines (265 loc) · 14.5 KB
/
PageRank.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
# coding: utf-8
# # PageRank
# In this notebook, you'll build on your knowledge of eigenvectors and eigenvalues by exploring the PageRank algorithm.
# The notebook is in two parts, the first is a worksheet to get you up to speed with how the algorithm works - here we will look at a micro-internet with fewer than 10 websites and see what it does and what can go wrong.
# The second is an assessment which will test your application of eigentheory to this problem by writing code and calculating the page rank of a large network representing a sub-section of the internet.
# ## Part 1 - Worksheet
# ### Introduction
#
# PageRank (developed by Larry Page and Sergey Brin) revolutionized web search by generating a
# ranked list of web pages based on the underlying connectivity of the web. The PageRank algorithm is
# based on an ideal random web surfer who, when reaching a page, goes to the next page by clicking on a
# link. The surfer has equal probability of clicking any link on the page and, when reaching a page with no
# links, has equal probability of moving to any other page by typing in its URL. In addition, the surfer may
# occasionally choose to type in a random URL instead of following the links on a page. The PageRank is
# the ranked order of the pages from the most to the least probable page the surfer will be viewing.
#
# In[46]:
# Before we begin, let's load the libraries.
get_ipython().magic('pylab notebook')
import numpy as np
import numpy.linalg as la
from readonly.PageRankFunctions import *
np.set_printoptions(suppress=True)
# ### PageRank as a linear algebra problem
# Let's imagine a micro-internet, with just 6 websites (**A**vocado, **B**ullseye, **C**atBabel, **D**romeda, **e**Tings, and **F**aceSpace).
# Each website links to some of the others, and this forms a network as shown,
#
# ![A Micro-Internet](readonly/internet.png "A Micro-Internet")
#
# The design principle of PageRank is that important websites will be linked to by important websites.
# This somewhat recursive principle will form the basis of our thinking.
#
# Imagine we have 100 *Procrastinating Pat*s on our micro-internet, each viewing a single website at a time.
# Each minute the Pats follow a link on their website to another site on the micro-internet.
# After a while, the websites that are most linked to will have more Pats visiting them, and in the long run, each minute for every Pat that leaves a website, another will enter keeping the total numbers of Pats on each website constant.
# The PageRank is simply the ranking of websites by how many Pats they have on them at the end of this process.
#
# We represent the number of Pats on each website with the vector,
# $$\mathbf{r} = \begin{bmatrix} r_A \\ r_B \\ r_C \\ r_D \\ r_E \\ r_F \end{bmatrix}$$
# And say that the number of Pats on each website in minute $i+1$ is related to those at minute $i$ by the matrix transformation
#
# $$ \mathbf{r}^{(i+1)} = L \,\mathbf{r}^{(i)}$$
# with the matrix $L$ taking the form,
# $$ L = \begin{bmatrix}
# L_{A→A} & L_{B→A} & L_{C→A} & L_{D→A} & L_{E→A} & L_{F→A} \\
# L_{A→B} & L_{B→B} & L_{C→B} & L_{D→B} & L_{E→B} & L_{F→B} \\
# L_{A→C} & L_{B→C} & L_{C→C} & L_{D→C} & L_{E→C} & L_{F→C} \\
# L_{A→D} & L_{B→D} & L_{C→D} & L_{D→D} & L_{E→D} & L_{F→D} \\
# L_{A→E} & L_{B→E} & L_{C→E} & L_{D→E} & L_{E→E} & L_{F→E} \\
# L_{A→F} & L_{B→F} & L_{C→F} & L_{D→F} & L_{E→F} & L_{F→F} \\
# \end{bmatrix}
# $$
# where the columns represent the probability of leaving a website for any other website, and sum to one.
# The rows determine how likely you are to enter a website from any other, though these need not add to one.
# The long time behaviour of this system is when $ \mathbf{r}^{(i+1)} = \mathbf{r}^{(i)}$, so we'll drop the superscripts here, and that allows us to write,
# $$ L \,\mathbf{r} = \mathbf{r}$$
#
# which is an eigenvalue equation for the matrix $L$, with eigenvalue 1 (this is guaranteed by the probabalistic structure of the matrix $L$).
#
# Complete the matrix $L$ below, we've left out the column for which websites the *FaceSpace* website (F) links to.
# Remember, this is the probability to click on another website from this one, so each column should add to one (by scaling by the number of links).
# In[47]:
# Replace the ??? here with the probability of clicking a link to each website when leaving Website F (FaceSpace).
L = np.array([[0, 1/2, 1/3, 0, 0, 0 ],
[1/3, 0, 0, 0, 1/2, 0 ],
[1/3, 1/2, 0, 1, 0, 1/2 ],
[1/3, 0, 1/3, 0, 1/2, 1/2 ],
[0, 0, 0, 0, 0, 0 ],
[0, 0, 1/3, 0, 0, 0 ]])
# In principle, we could use a linear algebra library, as below, to calculate the eigenvalues and vectors.
# And this would work for a small system. But this gets unmanagable for large systems.
# And since we only care about the principal eigenvector (the one with the largest eigenvalue, which will be 1 in this case), we can use the *power iteration method* which will scale better, and is faster for large systems.
#
# Use the code below to peek at the PageRank for this micro-internet.
# In[48]:
eVals, eVecs = la.eig(L) # Gets the eigenvalues and vectors
order = np.absolute(eVals).argsort()[::-1] # Orders them by their eigenvalues
eVals = eVals[order]
eVecs = eVecs[:,order]
r = eVecs[:, 0] # Sets r to be the principal eigenvector
100 * np.real(r / np.sum(r)) # Make this eigenvector sum to one, then multiply by 100 Procrastinating Pats
# We can see from this list, the number of Procrastinating Pats that we expect to find on each website after long times.
# Putting them in order of *popularity* (based on this metric), the PageRank of this micro-internet is:
#
# **C**atBabel, **D**romeda, **A**vocado, **F**aceSpace, **B**ullseye, **e**Tings
#
# Referring back to the micro-internet diagram, is this what you would have expected?
# Convince yourself that based on which pages seem important given which others link to them, that this is a sensible ranking.
#
# Let's now try to get the same result using the Power-Iteration method that was covered in the video.
# This method will be much better at dealing with large systems.
#
# First let's set up our initial vector, $\mathbf{r}^{(0)}$, so that we have our 100 Procrastinating Pats equally distributed on each of our 6 websites.
# In[49]:
r = 100 * np.ones(6) / 6 # Sets up this vector (6 entries of 1/6 × 100 each)
r # Shows it's value
# Next, let's update the vector to the next minute, with the matrix $L$.
# Run the following cell multiple times, until the answer stabilises.
# In[50]:
r = L @ r # Apply matrix L to r
r # Show it's value
# Re-run this cell multiple times to converge to the correct answer.
# We can automate applying this matrix multiple times as follows,
# In[51]:
r = 100 * np.ones(6) / 6 # Sets up this vector (6 entries of 1/6 × 100 each)
for i in np.arange(100) : # Repeat 100 times
r = L @ r
r
# Or even better, we can keep running until we get to the required tolerance.
# In[52]:
r = 100 * np.ones(6) / 6 # Sets up this vector (6 entries of 1/6 × 100 each)
lastR = r
r = L @ r
i = 0
while la.norm(lastR - r) > 0.01 :
lastR = r
r = L @ r
i += 1
print(str(i) + " iterations to convergence.")
r
# See how the PageRank order is established fairly quickly, and the vector converges on the value we calculated earlier after a few tens of repeats.
#
# Congratulations! You've just calculated your first PageRank!
# ### Damping Parameter
# The system we just studied converged fairly quickly to the correct answer.
# Let's consider an extension to our micro-internet where things start to go wrong.
#
# Say a new website is added to the micro-internet: *Geoff's* Website.
# This website is linked to by *FaceSpace* and only links to itself.
# ![An Expanded Micro-Internet](readonly/internet2.png "An Expanded Micro-Internet")
#
# Intuitively, only *FaceSpace*, which is in the bottom half of the page rank, links to this website amongst the two others it links to,
# so we might expect *Geoff's* site to have a correspondingly low PageRank score.
#
# Build the new $L$ matrix for the expanded micro-internet, and use Power-Iteration on the Procrastinating Pat vector.
# See what happens…
# In[53]:
# We'll call this one L2, to distinguish it from the previous L.
L2 = np.array([[0, 1/2, 1/3, 0, 0, 0, 0 ],
[1/3, 0, 0, 0, 1/2, 0, 0 ],
[1/3, 1/2, 0, 1, 0, 1/3, 0 ],
[1/3, 0, 1/3, 0, 1/2, 1/3, 0 ],
[0, 0, 0, 0, 0, 0, 0 ],
[0, 0, 1/3, 0, 0, 0, 0 ],
[0, 0, 0, 0, 0, 1/3, 1 ]])
# In[54]:
r = 100 * np.ones(7) / 7 # Sets up this vector (6 entries of 1/6 × 100 each)
lastR = r
r = L2 @ r
i = 0
while la.norm(lastR - r) > 0.01 :
lastR = r
r = L2 @ r
i += 1
print(str(i) + " iterations to convergence.")
r
# That's no good! *Geoff* seems to be taking all the traffic on the micro-internet, and somehow coming at the top of the PageRank.
# This behaviour can be understood, because once a Pat get's to *Geoff's* Website, they can't leave, as all links head back to Geoff.
#
# To combat this, we can add a small probability that the Procrastinating Pats don't follow any link on a webpage, but instead visit a website on the micro-internet at random.
# We'll say the probability of them following a link is $d$ and the probability of choosing a random website is therefore $1-d$.
# We can use a new matrix to work out where the Pat's visit each minute.
# $$ M = d \, L + \frac{1-d}{n} \, J $$
# where $J$ is an $n\times n$ matrix where every element is one.
#
# If $d$ is one, we have the case we had previously, whereas if $d$ is zero, we will always visit a random webpage and therefore all webpages will be equally likely and equally ranked.
# For this extension to work best, $1-d$ should be somewhat small - though we won't go into a discussion about exactly how small.
#
# Let's retry this PageRank with this extension.
# In[55]:
d = 0.5 # Feel free to play with this parameter after running the code once.
M = d * L2 + (1-d)/7 * np.ones([7, 7]) # np.ones() is the J matrix, with ones for each entry.
# In[56]:
r = 100 * np.ones(7) / 7 # Sets up this vector (6 entries of 1/6 × 100 each)
lastR = r
r = M @ r
i = 0
while la.norm(lastR - r) > 0.01 :
lastR = r
r = M @ r
i += 1
print(str(i) + " iterations to convergence.")
r
# This is certainly better, the PageRank gives sensible numbers for the Procrastinating Pats that end up on each webpage.
# This method still predicts Geoff has a high ranking webpage however.
# This could be seen as a consequence of using a small network. We could also get around the problem by not counting self-links when producing the L matrix (an if a website has no outgoing links, make it link to all websites equally).
# We won't look further down this route, as this is in the realm of improvements to PageRank, rather than eigenproblems.
#
# You are now in a good position, having gained an understanding of PageRank, to produce your own code to calculate the PageRank of a website with thousands of entries.
#
# Good Luck!
# ## Part 2 - Assessment
# In this assessment, you will be asked to produce a function that can calculate the PageRank for an arbitrarily large probability matrix.
# This, the final assignment of the course, will give less guidance than previous assessments.
# You will be expected to utilise code from earlier in the worksheet and re-purpose it to your needs.
#
# ### How to submit
# Edit the code in the cell below to complete the assignment.
# Once you are finished and happy with it, press the *Submit Assignment* button at the top of this notebook.
#
# Please don't change any of the function names, as these will be checked by the grading script.
#
# If you have further questions about submissions or programming assignments, here is a [list](https://www.coursera.org/learn/linear-algebra-machine-learning/discussions/weeks/1/threads/jB4klkn5EeibtBIQyzFmQg) of Q&A. You can also raise an issue on the discussion forum. Good luck!
# In[63]:
# PACKAGE
# Here are the imports again, just in case you need them.
# There is no need to edit or submit this cell.
import numpy as np
import numpy.linalg as la
from readonly.PageRankFunctions import *
np.set_printoptions(suppress=True)
# In[64]:
# GRADED FUNCTION
# Complete this function to provide the PageRank for an arbitrarily sized internet.
# I.e. the principal eigenvector of the damped system, using the power iteration method.
# (Normalisation doesn't matter here)
# The functions inputs are the linkMatrix, and d the damping parameter - as defined in this worksheet.
# (The damping parameter, d, will be set by the function - no need to set this yourself.)
def pageRank(linkMatrix, d) :
n = linkMatrix.shape[0]
M = d * linkMatrix + (1-d)/n * np.ones([n, n])
r = 100 * np.ones(n) / n
lastR = r
r = M @ r
while la.norm(lastR - r) > 0.01:
lastR = r
r = M @ r
return r
# ## Test your code before submission
# To test the code you've written above, run the cell (select the cell above, then press the play button [ ▶| ] or press shift-enter).
# You can then use the code below to test out your function.
# You don't need to submit this cell; you can edit and run it as much as you like.
# In[65]:
# Use the following function to generate internets of different sizes.
generate_internet(5)
# In[66]:
# Test your PageRank method against the built in "eig" method.
# You should see yours is a lot faster for large internets
L = generate_internet(10)
# In[68]:
pageRank(L, 1)
# In[69]:
# Do note, this is calculating the eigenvalues of the link matrix, L,
# without any damping. It may give different results that your pageRank function.
# If you wish, you could modify this cell to include damping.
# (There is no credit for this though)
eVals, eVecs = la.eig(L) # Gets the eigenvalues and vectors
order = np.absolute(eVals).argsort()[::-1] # Orders them by their eigenvalues
eVals = eVals[order]
eVecs = eVecs[:,order]
r = eVecs[:, 0]
100 * np.real(r / np.sum(r))
# In[70]:
# You may wish to view the PageRank graphically.
# This code will draw a bar chart, for each (numbered) website on the generated internet,
# The height of each bar will be the score in the PageRank.
# Run this code to see the PageRank for each internet you generate.
# Hopefully you should see what you might expect
# - there are a few clusters of important websites, but most on the internet are rubbish!
get_ipython().magic('pylab notebook')
r = pageRank(generate_internet(100), 0.9)
plt.bar(arange(r.shape[0]), r);
# In[ ]:
# In[ ]:
# In[ ]: