IntroductionThe structure of the world-wide web can be represented as a directed graph, in which nodes represent web pages and directed edges represent hyperlinks that allow you to navigate from one page to another. At the heart of Google's search engine is (or at least used to be) the PageRank algorithm, which is based on a particular type of random walk on the world-wide-web graph. At each step in discrete time, a random walk on a directed graph chooses its next node uniformly at random from the endpoints of the out-pointing directed edges leading from its current node. The ranking in the PageRank algorithm is based on the probability distribution on nodes obtained by running such a random walk with occasional re-starting. Several variations of the basic algorithm exist. Mathematically, propeties of the PageRank algorithm boils down to some linear algebra, exploiting the theory of Markov chains. This project will investigate the basic PageRank algorithm, its connection to random walks on directed graphs, and its Markov chain formulation. For the Individual Project several avenues of investigation are then available, including variations on the basic algorithm (including personalized PageRank), theoretical properties of the random surfer and related processes, computational efficiency, and so on, as well as applications of PageRank, or its relatives, to other ranking problems. There will scope for simulation and implementation of PageRank-style algorithms on suitable data, and there is scope for comparison of PageRank and other ranking procedures. For some background on what may be involved, you should:
Prerequisites1H Linear Algebra, 1H Probability, and 1H Programming are essential. 2H Probability is essential. The material on Markov Chains will be particularly relevant. Students taking the 3H Stochastic Processes course may find it helpful, but it is not essential for the project. |
Group ProjectGoals and expectations
Mode of operation and evidence of learning
ResourcesReading list. Many books on probability will have something on Markov chains and several books treat random walks on graphs. Here are some more specific references about PageRank and search engines:
For general background on Markov chains and random walks on graphs, in addition to texts recommended for 2H Probability, we suggest
For general background on relevant linear algebra, we suggest
Data. The package igraph for R is good at handling graphical data. The data from the Mathematics Genealogy Project can be found here. |
Individual ProjectPotential advanced topicsThe project will build on the basic understanding developed during the group project, in one of the following directions:
Goals and expectations
Mode of operation and evidence of learning
ResourcesReading list. As a sample of how PageRank ideas can be applied beyond the setting of the Web, see e.g.
Data. The package igraph for R is good at handling graphical data. Web graph data can be found, e.g. here. Get in touch if you would be interested in doing some simulations and/or have any questions! |
email: Andrew Wade