4H Project: Introduction to Reasoning in Large Language Models

Reinforcement learning (RL) provides a framework for learning through interaction, feedback, and delayed reward. In recent years it has also become central to the development of large language models, where RL-style methods are used to improve reasoning, align model behaviour with human preferences, and train agents that can plan, use tools, and solve multi-step tasks. This project will introduce the foundations of reinforcement learning and then examine how these ideas appear in modern reasoning systems and language models.

Project Description

In this project we will first cover the main concepts of reinforcement learning: Markov decision processes, policies, value functions, Bellman equations, Q-learning, policy gradients, and actor-critic methods. We will use selected material from the Hugging Face Deep Reinforcement Learning Course, together with more advanced material such as Sergey Levine’s reinforcement learning lectures and standard textbook references.
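As a flavour of the first part, the following is a minimal sketch of tabular Q-learning, not taken from the course material. The five-state chain environment, the function names (step, greedy, q_learning), and the hyperparameter values are all assumptions chosen for illustration. The update line applies the sampled Bellman optimality target for a single transition.

```python
import random

# Hypothetical toy problem: states 0..4 on a line, state 4 is terminal.
N_STATES = 5
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.3  # illustrative values, not tuned


def step(state, action_index):
    """Toy environment: reward 1 on reaching the right end, otherwise 0."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy(q_row, rng):
    """Index of the largest Q-value, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def q_learning(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Epsilon-greedy behaviour policy.
            if rng.random() < EPSILON:
                action = rng.randrange(2)
            else:
                action = greedy(q[state], rng)
            next_state, reward, done = step(state, action)
            # Sampled Bellman optimality target: r + gamma * max_a' Q(s', a').
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = q_learning()
# Greedy action in each non-terminal state (1 means "move right").
policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

In this toy problem the learned greedy policy should move right in every non-terminal state, since that is the only route to the reward.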

The second part of the project will focus on reinforcement learning for reasoning and language models. We will discuss topics such as reinforcement learning from human feedback, reward models, preference optimisation, process supervision, reasoning traces, and the difficulty of designing rewards for long multi-step reasoning tasks. The central question will be: how can a system learn not only to produce an answer, but to improve the strategy by which it reaches that answer?
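To make the idea of a reward model concrete, here is a small sketch of fitting scalar rewards from pairwise preferences using the Bradley-Terry model, a common choice in preference-based learning. The comparison data, the function name fit_rewards, and the learning-rate settings are hypothetical and chosen only for illustration; a real reward model would score text with a neural network rather than store one number per response.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rewards(n_items, preferences, lr=0.1, epochs=200):
    """Fit one scalar reward per item from (winner, loser) pairs.

    Uses stochastic gradient ascent on the Bradley-Terry log-likelihood,
    where P(winner preferred) = sigmoid(r_winner - r_loser).
    """
    rewards = [0.0] * n_items
    for _ in range(epochs):
        for winner, loser in preferences:
            p = sigmoid(rewards[winner] - rewards[loser])
            # d/dr_winner of log p is (1 - p); d/dr_loser is -(1 - p).
            rewards[winner] += lr * (1.0 - p)
            rewards[loser] -= lr * (1.0 - p)
    return rewards


# Hypothetical comparisons between three candidate responses (0, 1, 2):
# response 2 always wins, and response 1 beats response 0 two times out of three.
preferences = [(2, 1), (2, 0), (1, 0), (2, 1), (1, 0), (0, 1)]
rewards = fit_rewards(3, preferences)
print(rewards)
```

The fitted rewards should rank response 2 above response 1 above response 0. Because response 2 never loses, its reward keeps growing with more epochs, which is a small instance of the over-optimisation issues discussed later in the project.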

Mode of Operation and Evidence of Learning

The group will work through assigned readings, online lectures, and computational exercises. Students will meet regularly to discuss the material, divide preparation tasks, explain concepts to each other, and, where appropriate, prepare small Python demonstrations or experiments. These may include tabular reinforcement learning examples, simple policy-gradient methods, or toy models illustrating reward design and preference-based learning.
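A demonstration of the kind mentioned above might look like the following sketch of a simple policy-gradient method (REINFORCE with a running-mean baseline) on a two-armed bandit. The arm success probabilities, the function name reinforce_bandit, and the step size are assumptions made for illustration only.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """REINFORCE on a Bernoulli bandit with a softmax policy over arms."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    baseline = 0.0  # running mean of past rewards, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_means[action] else 0.0
        advantage = reward - baseline
        for a in range(len(logits)):
            # Gradient of log pi(action) with respect to logit a.
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad_log_pi
        baseline += (reward - baseline) / t
    return softmax(logits)


final_probs = reinforce_bandit()
print(final_probs)
```

Changing arm_means, or replacing the reward with a mis-specified proxy, gives a quick way to see how reward design shapes the policy that is learned.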

Evidence of learning will include participation in group discussions, weekly diary entries recording progress and contributions, short student-led explanations of the material, Python demonstrations or computational experiments, the final group presentation, and the oral examination. Students will be expected to explain not only whether an algorithm works, but why it behaves as it does, what assumptions it makes, and how reward design influences the resulting behaviour.

Your path

In the second term you will be able to pursue your own interests related to RL. Possible topics include reinforcement learning from human feedback, direct preference optimisation, reward modelling, reinforcement learning for mathematical reasoning, agentic tool use, process versus outcome rewards, self-play and curriculum generation, or failure modes such as reward hacking and over-optimisation. You may choose a theoretical, computational, or literature-based direction.

Mode of Operation and Evidence of Learning

Students will choose an individual direction, in consultation with the supervisor, and investigate it through a mixture of reading, analysis, and, where appropriate, computational experiments. Evidence of learning will be provided through the final written project, which should demonstrate understanding of the selected topic, explain the relevant reinforcement learning ideas, critically discuss their role in reasoning systems or language models, and include code or experimental analysis where this is appropriate.

Pre-/co-requisites

Students should have a good background in probability, linear algebra, and Python programming. Previous exposure to machine learning or neural networks is strongly recommended.

Resources

Sutton and Barto, Reinforcement Learning: An Introduction; Hugging Face Deep Reinforcement Learning Course; Sergey Levine’s Reinforcement Learning lectures; David Silver’s Reinforcement Learning lectures; selected papers on reinforcement learning from human feedback, preference optimisation, reward modelling, and reasoning in large language models.