Project III 2026-27 (AW)

Project III 2026–27

PageRank and random walks on the web

Supervisor: Andrew Wade

Introduction

The structure of the world-wide web can be represented as a directed graph, in which nodes represent web pages and directed edges represent hyperlinks that allow you to navigate from one page to another. At the heart of Google's search engine is (or at least used to be) the PageRank algorithm, which is based on a particular type of random walk on the world-wide-web graph. At each step in discrete time, a random walk on a directed graph chooses its next node uniformly at random from the endpoints of the out-pointing directed edges leading from its current node. The ranking in the PageRank algorithm is based on the probability distribution on nodes obtained by running such a random walk with occasional re-starting. Several variations of the basic algorithm exist. Mathematically, propeties of the PageRank algorithm boils down to some linear algebra, exploiting the theory of Markov chains.

This project will investigate the basic PageRank algorithm, its connection to random walks on directed graphs, and its Markov chain formulation. For the Individual Project several avenues of investigation are then available, including variations on the basic algorithm (including personalized PageRank), theoretical properties of the random surfer and related processes, computational efficiency, and so on, as well as applications of PageRank, or its relatives, to other ranking problems. There will scope for simulation and implementation of PageRank-style algorithms on suitable data, and there is scope for comparison of PageRank and other ranking procedures.

For some background on what may be involved, you should:

revise material on random walks and Markov chains from previous courses;
look at some of the recommended literature or other literature you find to see which look most helpful;
search for "pagerank random walk", "pagerank Markov chain", or "random surfer" on the web.

Here are a couple of good places to start:

How Google Finds Your Needle in the Web's Haystack;
K. Bryan and T. Leise. The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google. SIAM Review 48 (2006) 569–581.

Prerequisites

1H Linear Algebra, 1H Probability, and 1H Programming are essential.

2H Probability is essential. The material on Markov Chains will be particularly relevant.

Students taking the 3H Stochastic Processes course may find it helpful, but it is not essential for the project.

Group Project

Goals and expectations

Have a thorough understanding of the relevant concepts from Markov chain theory, realised in the context of the PageRank Markov chain, including state space, transition matrix, Chapman–Kolmogorov relation, multi-step transition probabilities via diagonalization, stationary distribution, communication of states, and convergence.
Find the basic description of the PageRank Markov chain and explain it, using appropriate concepts from the theory of directed graphs. In particular, understand the mathematical definitions and their interpretation for the hyperlink matrix, teleportation and the "Google matrix".
Appreciate computational issues with the PageRank scores and be able to implement R or python code to perform the power method. Explain the Markov chain interpretation in terms of mixing times and rates of convergence.
Generate PageRank rankings for suitable data sets. The goal is to do this for the Mathematics Genealogy Project, where nodes are mathematicians and a directed edge indicates that one mathematician advised another for their PhD. Does PageRank produce a sensible measure of "influence" of mathematicians?

Mode of operation and evidence of learning

The first part of the project will be directed reading on the definitions and background of the PageRank algorithm. The group will be encouraged to frame their own description of the alghorithm using the language of Markov chains from Probability II.
The group will code up small "toy" implementations of PageRank scoring in R or python as proof of concept for small graphs.
The last part of the project will be implementation at scale on the mathematical genealogy data. To ensure that this is completed before the presentations, we will start with coding early on.

Resources

Reading list. Many books on probability will have something on Markov chains and several books treat random walks on graphs. Here are some more specific references about PageRank and search engines:

A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, 2006. Hardcopy in the library.
A. Bonato, A Course on the Web Graph, 2008. Hardcopy in the library.
C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, 2008. eBook via the library.
M. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, 2nd Edition, 2005. eBook via the library.
I. McCulloch, H. Armstrong, and A. Johnson, Social Network Analysis with Applications, 2013. eBook via the library.

For general background on Markov chains and random walks on graphs, in addition to texts recommended for 2H Probability, we suggest

G.R. Grimmett and D. Stirzaker. Probability and Random Processes. 4th ed., Oxford University Press, 2020. Hardcopy in the library.
D.A. Levin, Y. Peres, and E.L. Wilmer. Markov Chains and Mixing Times. 2nd ed., American Mathematical Society, 2017. Hardcopy in the library.
G.R. Grimmett. Probability on Graphs. 2nd ed., Cambridge University Press, 2018. eBook via the library.

For general background on relevant linear algebra, we suggest

C.D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2000. eBook via the library.

Data. The package igraph for R is good at handling graphical data. The data from the Mathematics Genealogy Project can be found here.

Individual Project

Potential advanced topics

The project will build on the basic understanding developed during the group project, in one of the following directions:

Going deeper into PageRank by, for example, examining parameter sensitivity: how are the rankings affected by the teleportation probability, for example?
There are many variations of PageRank in the literature, often motivated by particular classes of application, for example personalized PageRank. A project might describe and implement one or more of these variations, and undertake comparisons.
The project might be focused on a particular application away from the web, provided suitable data is available.
There are of course many other ranking procedures that might be suitable for the application in hand, and appealing to some statistics of ranked data might be helpful for drawing comparisons.
Purely theoretical projects are possible, but even here some implementation on representative examples will be very helpful for the exposition.

Goals and expectations

The individual project should take what was learned in the group project as the starting point, but then have its own well-defined focus within the scope outlined above, or another mutually agreed specialization.
The report should include relevant background material where this is important for the narrative.

Mode of operation and evidence of learning

The first two weeks will be spent discussing avenues for exploration and the supervisor will offer guidance on what is feasible.
Students will follow their chosen specialization over the remaining time of the project, with the supervisor available for guidance and advice.
For projects focused on an application, it is important to get to grips with coding early to leave time for analysis and writing up.

Resources

Reading list. As a sample of how PageRank ideas can be applied beyond the setting of the Web, see e.g.

D.F. Gleich. PageRank beyond the web. SIAM Review, 57 (2015) 321–363.
S. Allesina and M. Pascual. Googling food webs: Can an eigenvector measure species' importance for coextinctions? PLoS Comput. Biol. 5 (2009) e1000494.
C. Winter et al. Google goes cancer: Improving outcome prediction for cancer patients by network-based ranking of marker genes. PLoS Comput. Biol. 8 (2012) e1002511.
A.Y. Govan and C.D. Meyer, Ranking national football league teams using Google's PageRank. In AA Markov Anniversary Meeting. 2006.
L. Zack, R. Lamb, and S. Ball. An application of Google's PageRank to NFL rankings. Involve 5 (2012) 463–471.

Data. The package igraph for R is good at handling graphical data. Web graph data can be found, e.g. here.

Get in touch if you would be interested in doing some simulations and/or have any questions!

email: Andrew Wade

Back