Statistical Analysis of Network Data

Jonathan Cumming

Statistics
Network science
Reading
Programming (R)
Simulation
Data analysis
Statistical modelling

Many modern datasets describe relationships rather than individual measurements. Examples include social networks of friendships or interactions, collaboration networks between researchers, citation networks linking scientific papers, and biological networks representing interactions between genes or proteins.

In such situations the fundamental data are not attributes of individual observations but the connections between them. These systems are naturally represented as networks (or graphs), where entities are represented by nodes and relationships between them by edges.

Statistical network analysis seeks to explain patterns of connections between entities through models of how networks are formed.
[Figure: Star Wars interaction network (left); community structure in the network (right)]

Example network based on interactions between characters in the Star Wars films (left), with nodes coloured by detected communities (right). Networks often exhibit clustered structure and highly connected “hub” nodes. Statistical models aim to explain how such patterns arise in real-world networks.

The project begins with a common exploration of how relational data can be represented and analysed statistically. Networks are typically stored as adjacency matrices or edge lists describing which nodes are connected. From this representation we can compute a range of descriptive summaries, such as the number of connections a node has (its degree), measures of centrality or importance, and indicators of clustering or community structure.
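As a minimal sketch of these ideas, the base R code below converts an edge list into an adjacency matrix and computes each node's degree and local clustering coefficient. The toy edge list and all object names (edges, adj, degree, clustering) are hypothetical and chosen only for illustration. In practice a dedicated package would probably be used, but base R keeps the calculation explicit.

```r
# Hypothetical toy edge list: each row is one undirected edge between two named nodes.
edges <- data.frame(
  from = c("A", "A", "A", "B", "C", "D"),
  to   = c("B", "C", "D", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Build a symmetric 0/1 adjacency matrix from the edge list.
nodes <- sort(unique(c(edges$from, edges$to)))
adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1  # a two-column character matrix indexes by row/column name
adj[cbind(edges$to, edges$from)] <- 1  # mirror each edge, since the network is undirected

# Degree: the number of edges attached to each node (row sums of the adjacency matrix).
degree <- rowSums(adj)

# Local clustering coefficient: the proportion of pairs of a node's neighbours
# that are themselves connected. The diagonal of adj^3 counts closed walks of
# length 3, which is twice the number of triangles through each node.
triangles  <- diag(adj %*% adj %*% adj) / 2
possible   <- degree * (degree - 1) / 2
clustering <- ifelse(possible > 0, triangles / possible, NA)  # undefined for degree < 2

print(degree)
print(round(clustering, 2))
```

In this toy network, node E has a single neighbour, so its clustering coefficient is left undefined (NA) rather than set to zero. That is one of several conventions in use.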

These exploratory tools provide an initial statistical view of the structure of a network. Many real networks contain groups of nodes that interact more frequently with each other than with the rest of the system. Such patterns suggest the presence of hidden structure, which can be formalised through statistical models.

Stochastic block models are a particularly important class of models. They assume that nodes belong to latent groups and that the probability of a connection between two nodes depends only on the groups to which the two nodes belong. These models provide a statistical framework for understanding community structure and hidden organisation in network data.
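To make the model concrete, the sketch below simulates a small undirected network from a two-group stochastic block model in base R. Node i has a group label z[i], and an edge between nodes i and j is drawn independently with probability B[z[i], z[j]]. The number of nodes, the group sizes, and the probabilities in B are illustrative assumptions, not values taken from any real dataset.

```r
set.seed(1)  # fix the random seed so the sketch is reproducible

n <- 60                       # number of nodes (illustrative choice)
z <- rep(1:2, each = n / 2)   # latent group label of each node
B <- matrix(c(0.30, 0.05,
              0.05, 0.30),
            nrow = 2, byrow = TRUE)  # B[k, l]: edge probability between groups k and l

# Matrix of edge probabilities: entry (i, j) equals B[z[i], z[j]].
P <- B[z, z]

# Draw each undirected edge independently: sample the upper triangle, then mirror it.
adj <- matrix(0, n, n)
upper <- upper.tri(adj)
adj[upper] <- rbinom(sum(upper), size = 1, prob = P[upper])
adj <- adj + t(adj)

# Empirical edge densities within and between groups, using the known labels.
same_group      <- outer(z, z, "==")
within_density  <- mean(adj[upper & same_group])
between_density <- mean(adj[upper & !same_group])

cat("within-group density: ", round(within_density, 3), "\n")
cat("between-group density:", round(between_density, 3), "\n")
```

Here the group labels are known because the network was simulated. With real data the labels are latent and must be estimated along with B, which is the harder inferential problem.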

Possible directions for investigation

Expected outcomes

Mode of Operation and Evidence of Learning

This project revolves around developing statistical understanding of network data through reading, discussion, implementation in R, and analysis of simulated and real networks. The emphasis is on connecting structural features of relational data with statistical models that attempt to explain how such networks arise.

Understanding will be demonstrated through the ability to move fluently between graph-based representations of data, statistical summaries, computational modelling, and substantive interpretation of network structure. At Level 4, the emphasis is on independent critical judgement: selecting appropriate methods, recognising what patterns can and cannot be explained by a given model, and articulating the strengths and weaknesses of different statistical perspectives on network data.

Pre-requisites

Resources
